explore using automaton for fuzzyquery
--------------------------------------

Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor


Mark brought this up on LUCENE-1606 (I will assign this to him; I know he is itching to write that nasty algorithm).

We can optimize FuzzyQuery by using AutomatonTermEnum. Here is my idea:
* Up front, calculate the maximum number of edits N needed to match the user's supplied float threshold.
* For at least the common N (1, 2, 3, etc.) we should use AutomatonTermEnum; if it is outside that range, maybe fall back to the existing slow logic. At high N, it will seek too much to be helpful anyway.
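The first bullet can be sketched as a tiny helper. This sketch assumes the classic FuzzyQuery-style scoring, similarity = 1 - editDistance / termLength (an assumption for illustration; the exact formula lives in FuzzyTermEnum):

```python
def max_edits(min_similarity: float, term_length: int) -> int:
    """Largest edit distance that can still satisfy the similarity
    threshold, under the assumed formula:
        similarity = 1 - editDistance / termLength
    """
    return int((1.0 - min_similarity) * term_length)
```

For example, a 10-character term with the default 0.5 threshold tolerates up to 5 edits. Note that float thresholds that are not exactly representable (e.g. 0.8) can round the result down by one, which is worth handling explicitly in a real implementation.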

I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNNNNNN, etc.
* AvgMS_DFA: the time spent creating the automaton (constructor)
Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
7N|10|64.0|4155.9|38.6|20.3|
14N|10|0.0|2511.6|46.0|37.9|
28N|10|0.0|2506.3|93.0|86.6|
56N|10|0.0|2524.5|304.4|298.5|
As you can see, this prototype is no good yet, because it creates the DFA in a slow way: right now it builds an NFA, and all the wasted time is in the NFA->DFA conversion.
So for a very long string, it just gets worse and worse. This has nothing to do with Lucene; as you can see, the TermEnum itself is fast (AvgMS - AvgMS_DFA), so there is no problem there.
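For illustration (a hypothetical Python sketch, not Lucene code), the Levenshtein NFA in question can be simulated directly: states are (position, edits) pairs, with deletions handled as epsilon-moves. Determinizing this NFA via subset construction is exactly the cost showing up in the AvgMS_DFA column above.

```python
def within_k(query: str, term: str, k: int) -> bool:
    """Accept `term` iff it is within k edits of `query`, by simulating
    the Levenshtein NFA whose states are (position, edits) pairs."""
    def close(states):
        # epsilon closure: deleting a char of `query` costs one edit
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < len(query) and e < k and (i + 1, e + 1) not in states:
                states.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return states

    states = close({(0, 0)})
    for c in term:
        nxt = set()
        for i, e in states:
            if i < len(query) and query[i] == c:
                nxt.add((i + 1, e))          # exact match
            if e < k:
                nxt.add((i, e + 1))          # insertion into query
                if i < len(query):
                    nxt.add((i + 1, e + 1))  # substitution
        states = close(nxt)
        if not states:
            return False
    return any(i == len(query) for i, _ in states)
```

This per-term simulation is cheap; the expense the benchmark exposes is converting the whole NFA into a DFA that can drive term seeking.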

Instead we should build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
We can precompute the tables with that algorithm up to some reasonable N, and then I think we are OK.

The paper references http://portal.acm.org/citation.cfm?id=135907 for linear minimization; anyone implementing this should not worry about minimization.
In fact, at some point we need to determine whether AutomatonQuery should minimize FSMs at all, or whether it is enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the "Dumb" vs "Smart" heuristic, and that can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e., write a regex benchmark) to figure out whether minimization is even helping right now.
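A minimal sketch of the weaker condition ("deterministic with no transitions to dead states"): prune every state that cannot reach an accept state. This is much cheaper than full minimization. The dict-based representation here is a hypothetical stand-in, not Lucene's Automaton API.

```python
def remove_dead_states(transitions, accepting):
    """transitions: {state: {symbol: next_state}}.
    Keep only states that can reach an accepting state, and drop
    any edges that lead into removed (dead) states."""
    # Build reverse edges, then walk backwards from the accepting states.
    reverse = {}
    for s, edges in transitions.items():
        for _, t in edges.items():
            reverse.setdefault(t, set()).add(s)
    live = set(accepting)
    stack = list(accepting)
    while stack:
        t = stack.pop()
        for s in reverse.get(t, ()):
            if s not in live:
                live.add(s)
                stack.append(s)
    return {s: {sym: t for sym, t in edges.items() if t in live}
            for s, edges in transitions.items() if s in live}
```

This reachability pass is linear in the number of transitions, whereas minimization merges equivalent live states on top of it.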



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Mark Miller (JIRA) at Nov 22, 2009 at 3:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781136#action_12781136 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. (i will assign this to him, I know he is itching to write that nasty algorithm
    ha - too much wine last night to laugh that hard this morning. Painful.
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

Modified the description to be readable: K is the number of edits, N refers to the length of the word.
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    Updated the description again: the remaining 'At high N' changed to 'At high K'.
  • Uwe Schindler (JIRA) at Nov 22, 2009 at 3:15 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781138#action_12781138 ]

    Uwe Schindler commented on LUCENE-2089:
    ---------------------------------------

    bq. ha - too much wine last night to laugh that hard this morning. Painful.

    I need more beer, after that 3.0.0 problem with the AttributeSource API :-( (LUCENE-2088).
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781139#action_12781139 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    By the way, the only open implementation of this algorithm I could find is ZSpell (moman), in Python: http://rrette.com/moman.html
    The download link for 0.2 is not available, and from the website it appears that is the version with the updated algorithm.

  • Robert Muir (JIRA) at Nov 22, 2009 at 3:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781140#action_12781140 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    I hope it's obvious from the benchmark why we shouldn't use the crappy prototype; we should optimize K=1 (which is probably very common).
    If someone has a small index with very long terms (bio sequences, or something), fuzzy would actually get slower.
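For the common K=1 case there is also a brute-force baseline worth keeping in mind: the 1-edit neighborhood of a term is small enough to enumerate directly. This is a hedged sketch assuming a known, small alphabet; it is not what the issue proposes, just the simple alternative an automaton approach competes with.

```python
import string

def neighbors_within_one(term: str, alphabet: str = string.ascii_lowercase):
    """All strings at Levenshtein distance <= 1 from `term`."""
    out = {term}
    for i in range(len(term) + 1):
        for c in alphabet:
            out.add(term[:i] + c + term[i:])          # insertion
    for i in range(len(term)):
        out.add(term[:i] + term[i + 1:])              # deletion
        for c in alphabet:
            out.add(term[:i] + c + term[i + 1:])      # substitution
    return out
```

The neighborhood has O(len * |alphabet|) members, which is fine for short terms over a small alphabet but blows up for Unicode term text, which is one reason the automaton-driven seek is attractive.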

  • Mark Miller (JIRA) at Nov 22, 2009 at 3:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781142#action_12781142 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    I'll take a look anyway - too bad I can't find my stupid automata textbook; I've already bought the stupid thing twice in my life. Python translation sounds easier anyway :)
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781144#action_12781144 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. I'll take a look anyway - too bad I can't find my stupid automata text book - already bought the stupid thing twice in my life. Python translation sounds easier anyway

    You will need more wine. If it starts to catch, I'll be happy to help with the automaton API, etc.

  • Mark Miller (JIRA) at Nov 22, 2009 at 4:22 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781149#action_12781149 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

    I wonder what the best way to handle this would be - these transition tables appear to get very large very fast. It looks like the Python implementation is now calculating them on the fly. You wouldn't want to load them all into RAM for users who aren't using them. Perhaps some sort of LRU cache with load-on-demand - though you can't really fully warm up that way.
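    The load-on-demand idea above could be sketched with `java.util.LinkedHashMap` in access order, which gives an LRU eviction policy almost for free. This is only an illustration: the `TableCache` class and its `computeTable` placeholder are hypothetical, standing in for however the parametric transition tables for edit distance k would actually be built.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: build the (expensive) transition table for edit
    // distance k on demand, keeping only the most recently used ones in RAM.
    class TableCache {
        private final int maxEntries;
        private final Map<Integer, int[][]> cache;

        TableCache(int maxEntries) {
            this.maxEntries = maxEntries;
            // accessOrder=true makes iteration order least-recently-used first,
            // so removeEldestEntry evicts the LRU table once we exceed the cap
            this.cache = new LinkedHashMap<Integer, int[][]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, int[][]> eldest) {
                    return size() > TableCache.this.maxEntries;
                }
            };
        }

        synchronized int[][] get(int k) {
            int[][] table = cache.get(k);
            if (table == null) {
                table = computeTable(k);  // stand-in for the real table computation
                cache.put(k, table);
            }
            return table;
        }

        // Placeholder only: in reality this would compute the parametric
        // transition table for edit distance k, the expensive step being cached.
        private int[][] computeTable(int k) {
            return new int[k + 1][k + 1];
        }
    }
    ```

    As Mark notes, a cache like this never fully warms up: the first query at a given k always pays the computation cost.
    
    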
  • Robert Muir (JIRA) at Nov 22, 2009 at 4:24 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781151#action_12781151 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark, they would get large fast, but I think we only need, say, K = 1, 2, 3.
    Once you go to high K, it will require many disk seeks anyway, and 'linear' mode as it works now will probably be faster.
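    The dispatch Robert describes could look something like the sketch below. It derives the maximum edit count K from FuzzyQuery's float similarity threshold (classic FuzzyQuery scoring treats similarity as roughly 1 - edits/length) and only picks the automaton enumerator inside the precomputed range. The cutoff of 3 is an assumption from this discussion, not a committed design.

    ```java
    // Hedged sketch: choose automaton enumeration only for small, precomputed K;
    // fall back to the existing linear scan otherwise.
    class FuzzyDispatch {
        static final int MAX_PRECOMPUTED_K = 3;  // assumption: tables exist for K = 1..3

        // largest K still meeting the user's similarity threshold,
        // given similarity ~= 1 - edits / termLength
        static int maxEdits(float minSimilarity, int termLength) {
            return (int) ((1.0f - minSimilarity) * termLength);
        }

        static boolean useAutomatonEnum(float minSimilarity, int termLength) {
            int k = maxEdits(minSimilarity, termLength);
            // beyond the precomputed tables, enumeration would seek too often
            // to beat a straight linear scan of the term dictionary
            return k <= MAX_PRECOMPUTED_K;
        }
    }
    ```

    For example, a 0.5 threshold on a 7-char term allows 3 edits (automaton case), while the same threshold on a 14-char term allows 7 (linear fallback).
    
    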

  • Robert Muir (JIRA) at Nov 22, 2009 at 4:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781154#action_12781154 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Another twist is that we have to support the 'constant prefix' as well.
    The easiest way would be that, if the user asks for this, we intersect the automaton with a 'prefix automaton' such as abc.*

    But BasicOperations.intersection is quadratic in the number of states, because it supports both NFAs and DFAs.
    We might need to implement a better DFA-only intersection algorithm for this constant prefix, so that supplying a constant prefix doesn't actually make things slower :)
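    A DFA-only intersection can avoid the quadratic cost by doing a product construction over *reachable* state pairs only, via BFS. The sketch below is a hedged illustration of that idea; the `Dfa` representation here (dense transition array, -1 for dead) is hypothetical and not the automaton package's actual API.

    ```java
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: test DFA-intersection non-emptiness by exploring only reachable
    // (stateA, stateB) pairs, so cost is proportional to the reachable product
    // rather than the full quadratic cross-product an NFA-capable version pays.
    class Dfa {
        int initial;
        boolean[] accept;  // accept[state]
        int[][] next;      // next[state][symbol], -1 = dead (no transition)

        static boolean intersectionNonEmpty(Dfa a, Dfa b, int alphabetSize) {
            Set<Long> seen = new HashSet<>();
            ArrayDeque<long[]> queue = new ArrayDeque<>();
            queue.add(new long[] { a.initial, b.initial });
            seen.add(pack(a.initial, b.initial));
            while (!queue.isEmpty()) {
                long[] pair = queue.poll();
                int s = (int) pair[0], t = (int) pair[1];
                if (a.accept[s] && b.accept[t]) return true;  // common string found
                for (int c = 0; c < alphabetSize; c++) {
                    int s2 = a.next[s][c], t2 = b.next[t][c];
                    // follow only pairs where both DFAs survive, each pair once
                    if (s2 >= 0 && t2 >= 0 && seen.add(pack(s2, t2))) {
                        queue.add(new long[] { s2, t2 });
                    }
                }
            }
            return false;
        }

        private static long pack(int s, int t) {
            return ((long) s << 32) | (t & 0xffffffffL);
        }
    }
    ```

    Intersecting with a prefix automaton like `abc.*` is a particularly easy case of this: the prefix side is a straight line of states, so the reachable product stays small.
    
    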

  • Mark Miller (JIRA) at Nov 22, 2009 at 4:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781157#action_12781157 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. Mark, they would get large fast, but i think we only need say 1,2,3.

    Ah, right - you mention that above in the summary.

    bq. Another twist, is that we have to support the 'constant prefix' as well.

    Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either, right? Though I guess it depends on your terms.
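    Mark's point can be made concrete with a back-of-the-envelope estimate. Assuming uniformly distributed terms, a constant prefix of length p over an alphabet of size sigma cuts a linear scan down to roughly N / sigma^p terms; the helper below is purely illustrative arithmetic, not anything in Lucene.

    ```java
    // Rough estimate only: each prefix character divides the candidate set by
    // the alphabet size, under a uniform-terms assumption.
    class PrefixEstimate {
        static long expectedScan(long totalTerms, int alphabetSize, int prefixLen) {
            long scan = totalTerms;
            for (int i = 0; i < prefixLen; i++) {
                scan /= alphabetSize;
            }
            return scan;
        }
    }
    ```

    For the 10M-term, [0-9] benchmark above: a 1-char prefix still leaves roughly 1M terms to scan linearly, while a 3-char prefix leaves roughly 10K - which is why it depends so much on your terms.
    
    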
  • Robert Muir (JIRA) at Nov 22, 2009 at 4:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781161#action_12781161 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either though right? Guess it depends on your terms though.

    Not so bad, but not so good either. See my wildcard results, for example (on the 10M equally distributed terms):
    Pattern||Iter||AvgHits||AvgMS (old)||AvgMS (new)||
    N?N?N?N||10||1000.0||288.6||38.5||
  • Mark Miller (JIRA) at Nov 22, 2009 at 5:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781167#action_12781167 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    Right, I wouldn't expect it to be great with a single-char prefix - especially with the random terms you're making. But I think that's a much worse case than a slightly longer prefix with a normal term distribution.
  • Robert Muir (JIRA) at Nov 22, 2009 at 5:36 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark, maybe, though it also depends largely on the size of the "alphabet" in the term dictionary.

    For my test, the alphabet is pretty small: [0-9]. With smaller alphabets, say [GACD], the automaton enumeration becomes even more effective than any "prefix" mechanism.

    So the performance is language-dependent.

  • Robert Muir (JIRA) at Nov 22, 2009 at 5:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170 ]

    Robert Muir edited comment on LUCENE-2089 at 11/22/09 5:44 PM:
    ---------------------------------------------------------------

    Mark, maybe, though it also depends largely on the size of the "alphabet" in the term dictionary.

    For my test, the alphabet is pretty small: [0-9]. With smaller alphabets, say [GACD], the automaton enumeration becomes even more effective than any "prefix" mechanism.

    So the performance is language-dependent.

    Also, Mark, you said you want to "scale" fuzzy query, right? Taking a linear algorithm and dividing it by a constant (the alphabet size), which is what a constant prefix does, does not improve the scalability. For English, it just makes your fuzzy query 26 times faster. So IMO it would be better to improve the real scalability... the constant prefix is just an optimization/hack and does not really solve the problem.


  • Mark Miller (JIRA) at Nov 22, 2009 at 5:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781173#action_12781173 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. the constant prefix is just an optimization/hack

Right - which is part of why I am not so worried about handling it with a new impl - it's really only there now because the thing is so slow without it. If we really needed to support it, it wouldn't be so bad to fall back to the existing code. If we can support it with the new impl, fine - but I wouldn't go out of my way for it myself.

    bq. for english, it just makes your fuzzy query 26 times faster

With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

  • Robert Muir (JIRA) at Nov 22, 2009 at 5:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

You write that crazy algorithm from the paper, and I will do the easy part (fast DFA intersection) so we can use the constant prefix.
We must "use" the prefix so the results are the same as the old FuzzyQuery for back compat - might as well do it right.
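A rough illustration of what that intersection means (a hypothetical sketch, not an implementation plan: `Predicate<String>` stands in for a real DFA such as the dk.brics-style automata AutomatonQuery builds on; a real impl would intersect the automaton objects themselves and hand the product machine to the term enum):

```java
import java.util.function.Predicate;

// Sketch: the constant prefix is itself a trivial automaton ("prefix
// followed by anything"), so supporting it just means enumerating with
// the intersection of the prefix machine and the Levenshtein machine.
// Predicate<String> is a stand-in for a real DFA here.
class PrefixFuzzy {
    // language intersection: a term matches iff both machines accept it
    static Predicate<String> intersect(Predicate<String> a, Predicate<String> b) {
        return t -> a.test(t) && b.test(t);
    }

    // the "prefix.*" machine
    static Predicate<String> prefixAutomaton(String prefix) {
        return t -> t.startsWith(prefix);
    }
}
```

With a real automaton library the product DFA is built once up front, so per-term matching stays linear.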

  • Robert Muir (JIRA) at Nov 22, 2009 at 6:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174 ]

    Robert Muir edited comment on LUCENE-2089 at 11/22/09 6:01 PM:
    ---------------------------------------------------------------

bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

You write that crazy algorithm from the paper, and I will do the easy part (fast DFA intersection) so we can use the constant prefix.
We must "use" the prefix so the results are the same as the old FuzzyQuery for back compat - might as well do it right.

btw, I am implying here that we should never do a real Levenshtein dynamic-programming calculation if we do not have to.
For example, in my k=1 prototype, it is never used:
{code}
public float difference() {
  if (currentTerm == null || currentTerm.equals(term))
    return 1.0F;
  else
    return 1.0F - (1.0F / (Math.max(currentTerm.text().length(),
                                    term.text().length())));
}
{code}

    instead, if up to k=3 is required, we should do these steps:
    * equals(), then 1.0F
    * matches k=1 automaton?
    * matches k=2 automaton?
    * matches k=3 automaton?

Even for k=3 this is likely faster on average than computing Levenshtein distance, especially if the query matches many terms.
Automaton matching is linear time, and just array access.
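The cascade above might look roughly like this (a sketch, not Lucene code; `Matcher` is a hypothetical stand-in for a precompiled run-automaton whose `run()` is linear-time table lookup):

```java
import java.util.List;

// Hypothetical stand-in for a precompiled Levenshtein DFA (e.g. a
// dk.brics-style run automaton): run() is linear in the term length
// and just array access.
interface Matcher {
    boolean run(String term);
}

class CascadeScorer {
    private final String queryTerm;
    private final List<Matcher> automata; // automata.get(i) accepts distance <= i+1

    CascadeScorer(String queryTerm, List<Matcher> automata) {
        this.queryTerm = queryTerm;
        this.automata = automata;
    }

    /** Smallest edit distance k proven by the cascade, or -1 if no
     *  automaton accepts the candidate. */
    int editDistance(String candidate) {
        if (queryTerm.equals(candidate))
            return 0;                     // equals() first: score 1.0F
        for (int k = 0; k < automata.size(); k++)
            if (automata.get(k).run(candidate))
                return k + 1;             // first accepting automaton wins
        return -1;                        // beyond the supported k
    }
}
```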


  • Mark Miller (JIRA) at Nov 22, 2009 at 6:04 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

    I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?
  • Mark Miller (JIRA) at Nov 22, 2009 at 6:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]

    Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:06 PM:
    ---------------------------------------------------------------

    bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

    I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?

    *edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non-scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old and saying we are starting over with a "real" FuzzyQuery - the old one goes away on the next major release - anyone dying to stick with it can just pull the query into their own code.

    I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version.

  • Robert Muir (JIRA) at Nov 22, 2009 at 6:12 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781179#action_12781179 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it.

I agree with garbage, but we cannot completely replace it anyway. For example, what if someone supplies a term of length 54 and asks for a distance of 0.5?
We should not use this algorithm for nonsense like that; in that case I think they should just use the garbage algorithm.

    Here is a quote from that moman page:
    It means that in theory, you could ask for a Levenshtein distance of 27! Well, if you have a week ahead of you...

We shouldn't burn cycles creating useless tables that will be huge arrays, either in FuzzyQuery or wherever. We can't compute all the way up to infinity; this is why I think something like 1,2,3 is "reasonable". If it requires more edits than that, go with the old brute-force algorithm?
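The fallback proposed above could be dispatched roughly like this (a sketch with hypothetical names; it assumes FuzzyQuery's classic formula, similarity = 1 - editDistance / termLength, under which a length-54 term at threshold 0.5 implies 27 edits - matching the moman quote):

```java
// Hypothetical dispatch sketch: automaton enumeration for small edit
// counts covered by precomputed tables, old brute force beyond that.
class FuzzyEnumChooser {
    // the "reasonable" cutoff from the discussion: tables up to 3 edits
    static final int MAX_SUPPORTED_EDITS = 3;

    /** Maximum edit count implied by the user's float threshold,
     *  assuming similarity = 1 - editDistance / termLength. */
    static int requiredEdits(float minSimilarity, int termLength) {
        return (int) Math.floor(termLength * (1.0 - minSimilarity));
    }

    /** Pick the enumeration strategy: automaton seeking for small k,
     *  the old exhaustive FuzzyTermEnum beyond the precomputed tables. */
    static String chooseEnum(float minSimilarity, int termLength) {
        return requiredEdits(minSimilarity, termLength) <= MAX_SUPPORTED_EDITS
            ? "AutomatonTermEnum"   // seek-driven, linear matching
            : "FuzzyTermEnum";      // brute-force scan of all terms
    }
}
```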
  • Mark Miller (JIRA) at Nov 22, 2009 at 6:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]

    Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:12 PM:
    ---------------------------------------------------------------

    bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

    I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?

    *edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non-scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old and saying we are starting over with a "real" FuzzyQuery - the old one goes away on the next major release - anyone dying to stick with it can just pull the query into their own code.

    I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version.

    *edit*

Not that I can't see people wanting just the tail end to be fuzzy - in that case, fine - do your dirty work :) I just don't know how much of a real requirement that is compared to a prefix query. Perhaps it's more useful than I'd guess myself. I think we are quite a ways from getting this alg implemented anyway :) Though I am trying to track down the latest python code.

  • Mark Miller (JIRA) at Nov 22, 2009 at 6:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781180#action_12781180 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. if it requires more edits than that, go with the old brute force algorithm?

Perhaps - but it should be an option. We don't want perf to just drop off a cliff (and that's an understatement on a large index) just because the input crossed some line. I'd almost prefer the input just doesn't work - in my mind FuzzyQuery just doesn't work on large indices. But far be it from me to take away the option ...

    I wouldn't really like it if it was automatic by default - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops.
  • Robert Muir (JIRA) at Nov 22, 2009 at 6:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781182#action_12781182 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. I wouldn't really like it if it was automatic by default - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops.

    You are completely right, it should not drop off a cliff. Though I think as n increases it will get significantly worse, because of so much seeking.
    We find the nice n where this is almost as bad as brute force, and cut it off there?

    This is basically what I did for the enum already: if you have a wildcard with a leading * or a regex with .*, it's faster to linear scan than to seek around.
    Uwe and I are referring to this as "dumb" mode versus "smart" mode. I'm going to add a constructor to the enum so this can be specified rather than "auto" as well.
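    The "dumb" vs "smart" decision might look roughly like this (a hypothetical sketch; the real heuristic lives in the AutomatonTermEnum code and is not shown here):

```python
def pick_mode(pattern: str) -> str:
    # If the pattern starts with a match-anything prefix (leading *
    # for wildcards, leading .* for regexes), seeking in the term
    # dictionary buys nothing: linear-scan instead ("dumb" mode).
    if pattern.startswith('*') or pattern.startswith('.*'):
        return 'dumb'
    # Otherwise the automaton has a usable constant prefix, so seek
    # around the term dictionary ("smart" mode).
    return 'smart'
```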

  • Mark Miller (JIRA) at Nov 22, 2009 at 7:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781204#action_12781204 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. we find the nice n where this is almost as bad as brute force, and cut it off there?

    Yeah, that sounds good if there is a nice transition at a relatively low n.

    bq. this is basically what I did for the enum already: if you have a wildcard with a leading * or a regex with .*, it's faster to linear scan than to seek around.

    Yeah, and I think this makes sense - but you will notice that both the Lucene QP and Solr don't allow a leading wildcard by default - the perf is generally bad enough on a large index that we assume you don't want to allow it, and we force users to turn the option on.
  • Robert Muir (JIRA) at Nov 22, 2009 at 8:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781205#action_12781205 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. but you will notice that both the Lucene QP and Solr don't allow a leading wildcard by default

    If we add this automaton stuff, I think we should reconsider this rule:
    instead of disallowing both leading * and leading ? by default, just disallow leading * by default.

    I won't really argue for it though, because a wildcard query of "?????????????abc" is probably just as bad as "*abc".
  • Mark Miller (JIRA) at Nov 22, 2009 at 8:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781206#action_12781206 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    I think it makes sense to allow a leading ? - ?????????abc is probably rare enough that it's worth allowing by default. Or perhaps a separate knob to turn it on while leaving leading * off.
  • Robert Muir (JIRA) at Nov 22, 2009 at 8:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781207#action_12781207 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark, you are right.

    Plus, the QP does not throw exceptions if you have a fuzzy query with no constant prefix, which is actually going to be even worse than a leading *!
  • Mark Miller (JIRA) at Nov 22, 2009 at 8:17 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781208#action_12781208 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    Solr doesn't even allow a constant prefix with fuzzy - it's broken, broken, broken :) DisMax to the rescue.
  • Mark Miller (JIRA) at Nov 22, 2009 at 9:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Mark Miller updated LUCENE-2089:
    --------------------------------

    Attachment: Moman-0.1.tar.gz
    From the Moman author:
    Absolutely. Sorry for the missing links. I had some problems with my
    provider, which restored an old version of my website.
    The library is MIT licensed, so feel free to take anything you want. I
    didn't make the Subversion repository available since I was working
    on Daciuk's "Incremental construction of minimal acyclic finite-state
    automata", improving the memory footprint of the algorithm.
    Since I want to be sure the new algorithm is right before publishing
    it, the repository isn't available. Anyway, here's the source
    code (attached). I must warn you that there are no comments at all,
    which is unfortunate, since I was contacted by many people
    lately who had the same kind of request as yours.

    If you need anything, just don't hesitate to ask.

  • Robert Muir (JIRA) at Nov 22, 2009 at 9:50 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781230#action_12781230 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark, from his page it seemed like 0.2 was the version with the generalized edit distance?

    {noformat}
    Moman 0.2 is out!
    (2005-07-29) This version add the possibility to use a Levenshtein distance greater than 1.
    Before, the transitions tables were static, now we build them.
    It means that in theory, you could ask for a Levenshtein distance of 27!
    Well, if you have a week ahead of you...
    {noformat}
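    Whatever version ends up being used, a plain dynamic-programming Levenshtein distance makes a handy reference oracle for sanity-checking generated transition tables at distance n (a generic sketch, not Moman's code):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) DP; every term the automaton accepts
    # at distance n should satisfy levenshtein(query, term) <= n.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```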
  • Mark Miller (JIRA) at Nov 22, 2009 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Mark Miller updated LUCENE-2089:
    --------------------------------

    Attachment: (was: Moman-0.1.tar.gz)
  • Mark Miller (JIRA) at Nov 22, 2009 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Mark Miller updated LUCENE-2089:
    --------------------------------

    Attachment: Moman-0.2.1.tar.gz

    Sorry - I attached the wrong file.
  • Mark Miller (JIRA) at Nov 25, 2009 at 3:04 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782458#action_12782458 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    I might be on the trail of a java impl - get out the hounds!

    Perhaps I won't be looking over that paper for thanksgiving after all.
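The Schulz/Mihov paper referenced in the description builds a DFA that accepts exactly the strings within K edits of the query term. As a behavioral reference for any Java impl, here is a hedged sketch of that same membership test computed directly with dynamic programming rather than an automaton; a real implementation would precompute the parametric DFA tables instead of running this per term, and all names here are illustrative:

```java
// Sketch: the membership test a Levenshtein automaton answers,
// computed with a row-by-row edit-distance DP plus an early exit
// when every cell in a row already exceeds K.
public class LevenshteinSketch {
    static boolean withinEditDistance(String s, String t, int k) {
        int n = s.length(), m = t.length();
        if (Math.abs(n - m) > k) return false; // length gap alone exceeds K
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1,      // deletion
                                            curr[j - 1] + 1), // insertion
                                   prev[j - 1] + cost);       // (mis)match
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > k) return false; // whole row beyond K: cannot recover
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= k;
    }

    public static void main(String[] args) {
        System.out.println(withinEditDistance("dogs", "dog", 1));  // true
        System.out.println(withinEditDistance("dogs", "cats", 1)); // false
    }
}
```

The automaton version wins because the term enum can seek past whole regions of the term dictionary that this per-term check would have to visit one by one.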

Discussion Overview
groupjava-dev @
categorieslucene
postedNov 22, '09 at 3:01p
activeNov 25, '09 at 3:04p
posts36
users1
websitelucene.apache.org

1 user in discussion

Mark Miller (JIRA): 36 posts
