FastVectorHighlighter: IDF-weighted terms for ordered fragments
----------------------------------------------------------------

Key: LUCENE-3440
URL: https://issues.apache.org/jira/browse/LUCENE-3440
Project: Lucene - Java
Issue Type: Improvement
Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
Fix For: 3.5


The FastVectorHighlighter gives every term found in a fragment an equal weight. As a result, fragments with a high number of words or, in the worst case, a high number of very common words are ranked higher than fragments that contain *all* of the terms used in the original query.

This patch provides ordered fragments with IDF-weighted terms:

total weight = total weight + IDF for unique term per fragment * boost of query;

The ranking-formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer.

The patch is simple, but it works for us.
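
The formula above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patch: the class name IdfFragmentScoring, the method names, and the plain map of document frequencies are assumptions made for illustration, and the IDF is computed as 1 + ln(numDocs / (docFreq + 1)), which is assumed here to be close enough to the classic Lucene similarity for the purpose of the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene or of the attached patch.
class IdfFragmentScoring {

    // Assumed IDF variant: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // total weight = total weight + IDF(term) * boost, counted once per
    // distinct term in the fragment. Terms missing from docFreqs are
    // treated as having a document frequency of 0.
    static double score(List<String> fragmentTerms, Map<String, Integer> docFreqs,
                        int numDocs, double queryBoost) {
        Set<String> seen = new HashSet<String>();
        double totalWeight = 0.0;
        for (String term : fragmentTerms) {
            if (seen.add(term)) { // true only the first time a term is seen
                int docFreq = docFreqs.getOrDefault(term, 0);
                totalWeight += idf(docFreq, numDocs) * queryBoost;
            }
        }
        return totalWeight;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("the", 999);        // very common term
        docFreqs.put("lucene", 9);       // rare term
        docFreqs.put("highlighter", 99);

        List<String> commonWords = Arrays.asList("the", "the", "the", "the");
        List<String> allQueryTerms = Arrays.asList("lucene", "highlighter");

        // With equal weights the first fragment would win (4 hits vs. 2 hits);
        // with IDF weights per distinct term the second one scores higher.
        System.out.println("common words:    " + score(commonWords, docFreqs, 1000, 1.0));
        System.out.println("all query terms: " + score(allQueryTerms, docFreqs, 1000, 1.0));
    }
}
```

In this sketch a fragment that repeats a very common word no longer outranks a fragment that contains each of the rarer query terms once.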

Some ideas:
- A better approach would be to move the whole fragment scoring into a separate class.
- Switch scoring via a parameter.
- Exact phrases should be given an even better score, regardless of whether a phrase query was executed.
- The edismax/dismax parameters pf, ps and pf^boost should be observed, and the corresponding fragments should be ranked higher.
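
The first two ideas could look roughly like the following sketch: scoring sits behind a small interface and the implementation is chosen by a parameter. All names here (FragmentScoring, Scorer, forName, the values "equal" and "distinct") are hypothetical and do not exist in Lucene or in the patch; the two scorers are deliberately simplistic stand-ins.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch of pluggable fragment scoring; not Lucene API.
class FragmentScoring {

    // A scoring strategy for the query terms matched inside one fragment.
    interface Scorer {
        double score(List<String> matchedTerms, double queryBoost);
    }

    // Every matched term occurrence counts the same.
    static class EqualWeightScorer implements Scorer {
        public double score(List<String> matchedTerms, double queryBoost) {
            return matchedTerms.size() * queryBoost;
        }
    }

    // Each distinct term counts once, so repeated hits on one word do not add up.
    static class DistinctTermScorer implements Scorer {
        public double score(List<String> matchedTerms, double queryBoost) {
            return new HashSet<String>(matchedTerms).size() * queryBoost;
        }
    }

    // Selects the scorer from a parameter value; unknown values fall back
    // to equal weighting.
    static Scorer forName(String name) {
        if ("distinct".equals(name)) {
            return new DistinctTermScorer();
        }
        return new EqualWeightScorer();
    }

    public static void main(String[] args) {
        List<String> matched = Arrays.asList("a", "a", "b");
        System.out.println(forName("equal").score(matched, 2.0));
        System.out.println(forName("distinct").score(matched, 2.0));
    }
}
```

An IDF-based scorer would be one more implementation of the same interface, which keeps the fragment-building code independent of the ranking formula.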







--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Search Discussions

  • S.L. (JIRA) at Sep 20, 2011 at 9:35 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: LUCENE-3440.patch

    Works for lucene_solr_branch_3x.
  • S.L. (JIRA) at Sep 20, 2011 at 7:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: LUCENE-3440-1.patch

    Oops, wrong patch... here's the right one.
  • S.L. (JIRA) at Sep 21, 2011 at 6:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Description: corrected the spelling of "ranking-formula" (previously "ranking-formular"); the text is otherwise unchanged.

  • S.L. (JIRA) at Sep 21, 2011 at 6:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: (was: LUCENE-3440.patch)
  • Koji Sekiguchi (JIRA) at Sep 22, 2011 at 12:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112261#comment-13112261 ]

    Koji Sekiguchi commented on LUCENE-3440:
    ----------------------------------------

    I think this is an interesting point of view, thanks! But I couldn't apply the patch to the latest trunk:

    {code}
    [koji@MacBook LUCENE-3440]$ patch -p0 --dry-run < LUCENE-3440.patch
    patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldFragList.java
    patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
    patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldTermStack.java
    Hunk #1 FAILED at 31.
    Hunk #2 FAILED at 96.
    Hunk #3 FAILED at 108.
    Hunk #4 succeeded at 148 (offset -9 lines).
    3 out of 4 hunks FAILED -- saving rejects to file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldTermStack.java.rej
    {code}

    Can you verify that?
  • S.L. (JIRA) at Sep 22, 2011 at 11:23 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112481#comment-13112481 ]

    S.L. commented on LUCENE-3440:
    ------------------------------

    No, I can't reproduce that. It's my first patch, so maybe I did something wrong. The patch was built from branch_3x with the Subversion plug-in for Eclipse. I checked out today's branch_3x (Import -> SVN -> Checkout projects ...) a few minutes ago and patched it (Team -> Apply patch). No problem with my setup.

    Another approach:

    Assuming a user searches for a single word, they would rather see fragments with an accumulation of that word:

    {code:title=Bar.java|borderStyle=solid}
    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
      subInfos.add( subInfo );

      Iterator it = phraseInfo.termInfos.iterator();
      TermInfo ti;

      // Each term occurrence adds weight^weight, scaled by the phrase boost.
      while ( it.hasNext() ) {
        ti = ( TermInfo ) it.next();
        distinctTerms.add( ti.text );
        totalBoost += Math.pow(ti.weight, ti.weight) * phraseInfo.boost;
      }
    }

    // Scale the sum by the number of distinct terms in the fragment.
    totalBoost *= distinctTerms.size();
    {code}



  • S.L. (JIRA) at Sep 22, 2011 at 11:25 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112481#comment-13112481 ]

    S.L. edited comment on LUCENE-3440 at 9/22/11 11:24 AM:
    --------------------------------------------------------

    No, I can't reproduce that. It's my first patch, so maybe I did something wrong. The patch was built from branch_3x with the Subversion plug-in for Eclipse. I checked out today's branch_3x (Import -> SVN -> Checkout projects ...) a few minutes ago and patched it (Team -> Apply patch). No problem with my setup.

    Another approach:

    Assuming a user searches for a single word, they would rather see fragments with an accumulation of that word:

    {code:title=Bar.java|borderStyle=solid}
    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
      subInfos.add( subInfo );

      Iterator it = phraseInfo.termInfos.iterator();
      TermInfo ti;

      // Each term occurrence adds weight^weight, scaled by the phrase boost.
      while ( it.hasNext() ) {
        ti = ( TermInfo ) it.next();
        distinctTerms.add( ti.text );
        totalBoost += Math.pow(ti.weight, ti.weight) * phraseInfo.boost;
      }
    }

    // Scale the sum by the number of distinct terms in the fragment.
    totalBoost *= distinctTerms.size();
    {code}




  • S.L. (JIRA) at Sep 22, 2011 at 11:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112481#comment-13112481 ]

    S.L. edited comment on LUCENE-3440 at 9/22/11 11:31 AM:
    --------------------------------------------------------

    No, I can't reproduce that. It's my first patch, so maybe I did something wrong. The patch was built from branch_3x with the Subversion plug-in for Eclipse. I checked out today's branch_3x (Import -> SVN -> Checkout projects ...) a few minutes ago and patched it (Team -> Apply patch). No problem with my setup.

    Another approach:

    Assuming a user searches for a single word, they would rather see fragments with an accumulation of that word:

    {code:title=Bar.java|borderStyle=solid}
    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
      subInfos.add( subInfo );

      Iterator it = phraseInfo.termInfos.iterator();
      TermInfo ti;

      // Each term occurrence adds its weight, scaled by the phrase boost.
      while ( it.hasNext() ) {
        ti = ( TermInfo ) it.next();
        distinctTerms.add( ti.text );
        totalBoost += ti.weight * phraseInfo.boost;
      }
    }

    // Scale the sum by the number of distinct terms in the fragment.
    totalBoost *= distinctTerms.size();
    {code}




  • S.L. (JIRA) at Sep 22, 2011 at 11:35 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112481#comment-13112481 ]

    S.L. edited comment on LUCENE-3440 at 9/22/11 11:33 AM:
    --------------------------------------------------------

    No, I can't reproduce that. It's my first patch, so maybe I did something wrong. The patch was built from branch_3x with the Subversion plug-in for Eclipse. I checked out today's branch_3x (Import -> SVN -> Checkout projects ...) a few minutes ago and patched it (Team -> Apply patch). No problem with my setup.

    Another approach:

    Assuming a user searches for a single word, they would rather see fragments with an accumulation of that word:

    {code:title=Boost with number of distinct terms per fragment|borderStyle=dotted}
    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
      subInfos.add( subInfo );

      Iterator it = phraseInfo.termInfos.iterator();
      TermInfo ti;

      // Each term occurrence adds its weight, scaled by the phrase boost.
      while ( it.hasNext() ) {
        ti = ( TermInfo ) it.next();
        distinctTerms.add( ti.text );
        totalBoost += ti.weight * phraseInfo.boost;
      }
    }

    // Scale the sum by the number of distinct terms in the fragment.
    totalBoost *= distinctTerms.size();
    {code}
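
The loop above depends on fields of the surrounding highlighter class. A self-contained version of the same calculation might look like the sketch below. TermInfo and PhraseInfo here are simplified stand-ins written for this example, not the Lucene classes of similar name, and DistinctTermBoost is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-ins; not the Lucene vectorhighlight classes.
class DistinctTermBoost {

    static class TermInfo {
        final String text;
        final float weight;
        TermInfo(String text, float weight) {
            this.text = text;
            this.weight = weight;
        }
    }

    static class PhraseInfo {
        final List<TermInfo> termInfos;
        final float boost;
        PhraseInfo(float boost, TermInfo... termInfos) {
            this.boost = boost;
            this.termInfos = new ArrayList<TermInfo>(Arrays.asList(termInfos));
        }
    }

    // Sums weight * boost over every term occurrence, then multiplies the
    // sum by the number of distinct terms seen in the fragment.
    static float totalBoost(List<PhraseInfo> phraseInfoList) {
        Set<String> distinctTerms = new HashSet<String>();
        float totalBoost = 0f;
        for (PhraseInfo phraseInfo : phraseInfoList) {
            for (TermInfo ti : phraseInfo.termInfos) {
                distinctTerms.add(ti.text);
                totalBoost += ti.weight * phraseInfo.boost;
            }
        }
        return totalBoost * distinctTerms.size();
    }

    public static void main(String[] args) {
        List<PhraseInfo> fragment = Arrays.asList(
            new PhraseInfo(1f, new TermInfo("lucene", 2f)),
            new PhraseInfo(1f, new TermInfo("lucene", 2f), new TermInfo("search", 3f)));
        // (2 + 2 + 3) * 2 distinct terms
        System.out.println(totalBoost(fragment)); // prints 14.0
    }
}
```

For a single-word query the distinct-term factor is 1, so the score grows only with the number of occurrences of that word in the fragment.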




  • Koji Sekiguchi (JIRA) at Sep 22, 2011 at 11:49 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112490#comment-13112490 ]

    Koji Sekiguchi commented on LUCENE-3440:
    ----------------------------------------

    Ah, I see. I was looking at trunk, but you made the patch for 3x. I'll take a look.
  • S.L. (JIRA) at Sep 22, 2011 at 8:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: LUCENE-3440-2.patch
  • S.L. (JIRA) at Sep 22, 2011 at 8:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112887#comment-13112887 ]

    S.L. commented on LUCENE-3440:
    ------------------------------

    Here is another patch.

    - The calculation of WeightedFragInfo.totalBoost remains unmodified.
    - A new field, WeightedFragInfo.totalWeight, has been introduced.
    - A class, WeightOrderFragmentsBuilder, now sorts by WeightedFragInfo.totalWeight.
  • Koji Sekiguchi (JIRA) at Sep 23, 2011 at 1:07 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113086#comment-13113086 ]

    Koji Sekiguchi commented on LUCENE-3440:
    ----------------------------------------

    Hi,

    # Which patch do you want me to try?
    # Can you make that for trunk branch?

  • S.L. (JIRA) at Sep 23, 2011 at 12:32 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: LUCENE-3.5-SNAPSHOT-3440-3.patch
  • S.L. (JIRA) at Sep 23, 2011 at 12:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: (was: LUCENE-3440-2.patch)
  • S.L. (JIRA) at Sep 23, 2011 at 12:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: LUCENE-3.5-SNAPSHOT-3440-3.patch

    Patch for branch_3x.
  • S.L. (JIRA) at Sep 23, 2011 at 12:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: (was: LUCENE-3.5-SNAPSHOT-3440-3.patch)
  • S.L. (JIRA) at Sep 23, 2011 at 12:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: (was: LUCENE-3440-1.patch)
  • S.L. (JIRA) at Sep 23, 2011 at 12:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    S.L. updated LUCENE-3440:
    -------------------------

    Attachment: LUCENE-4.0-SNAPSHOT-3440-3.patch

    Patch for trunk.
  • S.L. (JIRA) at Sep 23, 2011 at 12:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113389#comment-13113389 ]

    S.L. commented on LUCENE-3440:
    ------------------------------

    Hi Koji,

    bq. 1. Which patch do you want me to try?

    Doesn't matter. This is the first time in a long while that I have checked out trunk. I'm looking forward to the new admin interface in Solr/Lucene 4.0!

    bq. 2. Can you make that for trunk branch?

    Here we go. This version is slightly different: the weight is now boosted by the normalized number of terms per fragment:

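    {code:borderStyle=dotted}
    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
      subInfos.add( subInfo );
      Iterator it = phraseInfo.termInfos.iterator();
      TermInfo ti;
      totalBoost += phraseInfo.boost;
      while ( it.hasNext() ) {
        ti = ( TermInfo ) it.next();
        // each distinct term contributes its squared weight only once per fragment
        if ( uniqueTerms.add( ti.text ) )
          totalWeight += Math.pow(ti.weight, 2) * phraseInfo.boost;
        // every occurrence counts toward the fragment's term count
        termsPerFrag++;
      }
    }
    // termsPerFrag * ( 1 / sqrt( termsPerFrag ) ) is equal to sqrt( termsPerFrag )
    totalWeight *= termsPerFrag * ( 1 / Math.sqrt( termsPerFrag ) );
    } // end of the enclosing block (its opening is not part of this excerpt)
    {code}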

    Due to my significant lack of mathematical knowledge, this is a *very* _intuitive_ solution.
    But it seems to work very well, at least for our data (highly multilingual, mostly historical, dirty OCR'ed books, journals, and papers).
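
    To make the calculation above easier to follow, here is a minimal, self-contained sketch of it. It is not the patch itself: the classes Term and Phrase are hypothetical stand-ins for TermInfo and WeightedPhraseInfo, the term weight is assumed to hold the term's IDF, and the guard against an empty fragment is an addition made for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for the highlighter's TermInfo and WeightedPhraseInfo.
// These are NOT the Lucene classes; they only carry the fields the snippet reads.
class FragmentWeightSketch {

    static final class Term {
        final String text;
        final double weight; // assumed to hold the term's IDF

        Term(String text, double weight) {
            this.text = text;
            this.weight = weight;
        }
    }

    static final class Phrase {
        final double boost;
        final List<Term> terms;

        Phrase(double boost, Term... terms) {
            this.boost = boost;
            this.terms = Arrays.asList(terms);
        }
    }

    // Sums squared weight * boost once per distinct term, then scales the sum
    // by sqrt(termsPerFrag), which equals termsPerFrag * (1 / sqrt(termsPerFrag)).
    static double totalWeight(List<Phrase> phrases) {
        Set<String> uniqueTerms = new HashSet<>();
        double totalWeight = 0.0;
        int termsPerFrag = 0;
        for (Phrase phrase : phrases) {
            for (Term term : phrase.terms) {
                if (uniqueTerms.add(term.text)) {
                    totalWeight += term.weight * term.weight * phrase.boost;
                }
                termsPerFrag++; // counts every occurrence, not only distinct terms
            }
        }
        if (termsPerFrag == 0) {
            return 0.0; // guard added for this sketch; avoids 0 * Infinity = NaN
        }
        return totalWeight * Math.sqrt(termsPerFrag);
    }
}
```

    For example, one phrase with boost 1.0 containing "lucene" (weight 2.0) once and "the" (weight 1.0) three times gives (4.0 + 1.0) * sqrt(4) = 10.0: the repeated common term adds to the weight only once, but still raises the term count.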
  • S.L. (JIRA) at Sep 23, 2011 at 12:56 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113389#comment-13113389 ]

    S.L. edited comment on LUCENE-3440 at 9/23/11 12:56 PM:
    --------------------------------------------------------

    Hi Koji,

    bq. 1. Which patch do you want me to try?

    Doesn't matter. This is the first time in a long while that I have checked out trunk. I'm looking forward to the new admin interface in Solr/Lucene 4.0!

    bq. 2. Can you make that for trunk branch?

    Here we go. This version is slightly different: the weight is now boosted by the normalized number of terms per fragment:

    {code:borderStyle=dotted}
    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
      subInfos.add( subInfo );
      Iterator it = phraseInfo.termInfos.iterator();
      TermInfo ti;
      totalBoost += phraseInfo.boost;
      while ( it.hasNext() ) {
        ti = ( TermInfo ) it.next();
        if ( uniqueTerms.add( ti.text ) )
          totalWeight += Math.pow(ti.weight, 2) * phraseInfo.boost;
        termsPerFrag++;
      }
    }
    }
    totalWeight *= termsPerFrag * ( 1 / Math.sqrt( termsPerFrag ) );
    {code}

    Due to my significant lack of mathematical knowledge, this is a *very* _intuitive_ solution.
    But it seems to work very well, at least for our data (highly multilingual, mostly historical, dirty OCR'ed books, journals, and papers).

  • Koji Sekiguchi (JIRA) at Sep 24, 2011 at 2:19 am
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113873#comment-13113873 ]

    Koji Sekiguchi commented on LUCENE-3440:
    ----------------------------------------

    Patch looks great! A few comments:

    # For the new totalWeight, add a getter method and modify toString() in WeightedFragInfo.
    # The patch uses a hard-coded DefaultSimilarity to calculate idf. I don't think a custom similarity can be used here either. If so, how about just copying the idf method rather than creating a similarity object?
    # Please do not hesitate to update ScoreComparator (do not add WeightOrderFragmentsBuilder).
    # Could you update the package javadoc ( https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description ) and insert totalWeight into the description and figures?
    # Use the docFreq(String field, BytesRef term) version for trunk to avoid creating a Term object.

    bq. Due to a significant lack of mathematical knowledge, a very intuitive solution.

    I agree. I think a table that lets us compare totalBoost (current) and totalWeight (patch) with real values would help a lot.
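
    The suggestion to copy the idf method instead of creating a similarity object could look roughly like the sketch below. The class and method names are hypothetical. The formula, 1 + ln(numDocs / (docFreq + 1)), is the classic idf that DefaultSimilarity uses in the 3.x line as far as I know; please check it against the source of the branch being patched.

```java
// Sketch of a standalone idf helper; the class name is hypothetical and this
// is not part of the attached patch.
class IdfSketch {

    // Classic idf: 1 + ln(numDocs / (docFreq + 1)).
    // Assumed to match DefaultSimilarity.idf(int docFreq, int numDocs) in 3.x.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

    With this helper a rare term gets a larger weight than a common one, for example idf(0, 10) is greater than idf(5, 10), while a term that occurs in 9 of 10 documents gets exactly 1.0. No Similarity instance is needed.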
  • sebastian L. (JIRA) at Sep 24, 2011 at 2:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Fix Version/s: 4.0
    Affects Version/s: 4.0
  • sebastian L. (JIRA) at Sep 25, 2011 at 4:52 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114291#comment-13114291 ]

    sebastian L. commented on LUCENE-3440:
    --------------------------------------

    bq. Patch looks great!

    Thanks.

    bq. 1. For the new totalWeight, add getter method and modify toString() in WeightedFragInfo().

    Okay.

    bq. 2. The patch uses hard-coded DefaultSimilarity to calculate idf. I don't think that a custom similarity can be used here, too. If so, how about just copying idf method rather than creating a similarity object?

    I played a little with log((numDocs - docFreq + 0.5) / (docFreq + 0.5)), but it seems to make no difference. If I'm not mistaken, there is no method IndexReader.getSimilarity() or IndexReader.getDefaultSimilarity().

    Therefore: Okay.
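
    The two idf variants mentioned above can be compared side by side. This is only a minimal sketch: it assumes the first variant is the classic formula used by DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), and that the second is meant with the parentheses placed as (numDocs - docFreq + 0.5) / (docFreq + 0.5). The class and method names are made up for illustration and are not part of the patch.

```java
public class IdfComparison {

    // Classic idf as in Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    static float classicIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // BM25-style idf (assumed parenthesization): ln((numDocs - docFreq + 0.5) / (docFreq + 0.5)).
    static float bm25StyleIdf(int docFreq, int numDocs) {
        return (float) Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical index size
        for (int docFreq : new int[] {1, 10, 100, 400, 600}) {
            System.out.println("docFreq=" + docFreq
                + " classic=" + classicIdf(docFreq, numDocs)
                + " bm25Style=" + bm25StyleIdf(docFreq, numDocs));
        }
    }
}
```

    Both functions decrease as docFreq grows, so they rank rare terms above common ones in the same order, which may be why the switch seemed to make no difference. One difference to keep in mind: the BM25-style value becomes negative once a term occurs in more than half of the documents, while the classic value does not.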

    bq. 3. Please do not hesitate to update ScoreComparator (do not add WeightOrderFragmentsBuilder)

    Hm, I thought about something like that:

    {code:xml}
    <highlighting>
    <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
    <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
    </highlighting>
    {code}

    For Solr-users (like me). If somebody would like to use the boost-based ordering, he could. Maybe, for some use-cases the boost-based approach is better than the weighted one.

    bq. 4 Could you update package javadoc ( https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description ) and insert totalWeight into description and figures.

    Okay.

    bq. 5. use docFreq(String field, BytesRef term) version for trunk to avoid creating Term object.

    Okay.

    bq. I agree. I think if there is a table so that we can compare totalBoost (current) and totalWeight (patch) with real values, it helps a lot.

    I'll write a proof-of-concept test class, but this may take some time.


    I discovered a little problem with overlapping terms, depending on the analyzing-process:

    WeightedPhraseInfo.addIfNoOverlap() drops the second part of hyphenated words (for example: social-economics). As a result, all the information in its TermInfo is lost and is not available for computing the fragment's weight. I simply modified WeightedPhraseInfo.addIfNoOverlap() a little to change this behavior:

    {code:java}
    void addIfNoOverlap( WeightedPhraseInfo wpi ){
      for( WeightedPhraseInfo existWpi : phraseList ){
        if( existWpi.isOffsetOverlap( wpi ) ) {
          // keep the TermInfos of the overlapping phrase instead of discarding them
          existWpi.termInfos.addAll( wpi.termInfos );
          return;
        }
      }
      phraseList.add( wpi );
    }
    {code}

    But I am not sure whether there could be some unforeseen side effects.
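
    To make the difference concrete, here is a small stand-alone sketch of the two behaviors. It does not use the real Lucene classes: Phrase is a simplified, hypothetical stand-in for WeightedPhraseInfo, and the offsets for "social-economics" and "economics" are assumed for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapMergeSketch {

    // Hypothetical, simplified stand-in for WeightedPhraseInfo.
    static final class Phrase {
        final int start;
        final int end; // exclusive
        final List<String> terms = new ArrayList<String>();

        Phrase(int start, int end, String term) {
            this.start = start;
            this.end = end;
            this.terms.add(term);
        }

        boolean overlaps(Phrase other) {
            return start < other.end && other.start < end;
        }
    }

    // Behavior described as the problem: an overlapping phrase is silently dropped.
    static void addDroppingOverlaps(List<Phrase> phraseList, Phrase p) {
        for (Phrase existing : phraseList) {
            if (existing.overlaps(p)) {
                return;
            }
        }
        phraseList.add(p);
    }

    // Behavior of the modification: the overlapping phrase's terms are kept.
    static void addMergingOverlaps(List<Phrase> phraseList, Phrase p) {
        for (Phrase existing : phraseList) {
            if (existing.overlaps(p)) {
                existing.terms.addAll(p.terms);
                return;
            }
        }
        phraseList.add(p);
    }

    public static void main(String[] args) {
        List<Phrase> dropped = new ArrayList<Phrase>();
        addDroppingOverlaps(dropped, new Phrase(0, 16, "social-economics"));
        addDroppingOverlaps(dropped, new Phrase(7, 16, "economics"));

        List<Phrase> merged = new ArrayList<Phrase>();
        addMergingOverlaps(merged, new Phrase(0, 16, "social-economics"));
        addMergingOverlaps(merged, new Phrase(7, 16, "economics"));

        System.out.println("dropping: " + dropped.get(0).terms);
        System.out.println("merging:  " + merged.get(0).terms);
    }
}
```

    In this sketch the merged phrase keeps the offsets of the first phrase and only gains the extra terms. Whether that is also harmless in the real highlighter, for example when the second phrase extends past the end of the first, is exactly the kind of side effect that would need checking.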





  • Koji Sekiguchi (JIRA) at Sep 25, 2011 at 11:55 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114402#comment-13114402 ]

    Koji Sekiguchi commented on LUCENE-3440:
    ----------------------------------------

    {quote}
    Hm, I thought about something like that:

    {code:xml}
    <highlighting>
    <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
    <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
    </highlighting>
    {code}

    For Solr-users (like me). If somebody would like to use the boost-based ordering, he could. Maybe, for some use-cases the boost-based approach is better than the weighted one.
    {quote}

    I thought that, too. But I saw the following in the patch:

    {code}
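    public List<WeightedFragInfo> getWeightedFragInfoList( List<WeightedFragInfo> src ) {
      // current behavior: order fragments by totalBoost
      Collections.sort( src, new ScoreComparator() );
      // alternative left commented out in the patch: order by totalWeight
      // Collections.sort( src, new WeightComparator() );
      return src;
    }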
    {code}

    And I thought you wanted to use WeightComparator from ScoreOrderFragmentsBuilder. :)

    Well now, let's introduce WeightOrderFragmentsBuilder.
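
    The practical difference between the two orderings can be sketched without Lucene. In the example below, Frag is a hypothetical stand-in for WeightedFragInfo, the two comparators only approximate what ScoreComparator and a WeightComparator would do (tie-breaking is ignored), and the numbers are taken from the comparison table posted later in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FragmentOrderingSketch {

    // Hypothetical, simplified stand-in for WeightedFragInfo.
    static final class Frag {
        final String text;
        final float totalBoost;
        final float totalWeight;

        Frag(String text, float totalBoost, float totalWeight) {
            this.text = text;
            this.totalBoost = totalBoost;
            this.totalWeight = totalWeight;
        }
    }

    // Boost-based ordering: highest totalBoost first.
    static final Comparator<Frag> BY_BOOST = new Comparator<Frag>() {
        public int compare(Frag a, Frag b) {
            return Float.compare(b.totalBoost, a.totalBoost);
        }
    };

    // Weight-based ordering: highest totalWeight first.
    static final Comparator<Frag> BY_WEIGHT = new Comparator<Frag>() {
        public int compare(Frag a, Frag b) {
            return Float.compare(b.totalWeight, a.totalWeight);
        }
    };

    static List<Frag> sample() {
        return new ArrayList<Frag>(Arrays.asList(
            new Frag("das alte testament", 3.0f, 5.799069f),
            new Frag("das das das das", 4.0f, 1.5566137f),
            new Frag("das testament", 2.0f, 2.9178061f)));
    }

    static String topBy(Comparator<Frag> order) {
        List<Frag> frags = sample();
        frags.sort(order);
        return frags.get(0).text;
    }

    public static void main(String[] args) {
        System.out.println("top by boost:  " + topBy(BY_BOOST));
        System.out.println("top by weight: " + topBy(BY_WEIGHT));
    }
}
```

    Keeping the two comparators in separate builders, as proposed, lets a Solr user choose between them in the configuration instead of changing the default behavior for everyone.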

  • sebastian L. (Updated) (JIRA) at Sep 30, 2011 at 12:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: LUCENE-3.5-SNAPSHOT-3440-6.patch

    Patch for 3.5. Docs are still missing.
  • sebastian L. (Updated) (JIRA) at Sep 30, 2011 at 12:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: (was: LUCENE-3.5-SNAPSHOT-3440-3.patch)
  • sebastian L. (Updated) (JIRA) at Sep 30, 2011 at 12:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: WeightOrderFragmentsBuilder_table01.html

    WeightOrderFragmentsBuilder_table01.html:
    A one-word-query for 'testament'. Obviously, the sum-of-distinct-weights-approach makes no difference to the existing one.
  • sebastian L. (Updated) (JIRA) at Sep 30, 2011 at 12:12 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: WeightOrderFragmentsBuilder_table02.html

    WeightOrderFragmentsBuilder_table02.html:
    A more-word-queries for 'das alte testament'. Obviously, the sum-of-boosts-approach scores "das das das das" higher than "das alte testament".
  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118009#comment-13118009 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:14 PM:
    ----------------------------------------------------------------

    WeightOrderFragmentsBuilder_table02.html:
    A more-word-query for 'das alte testament'. Obviously, the sum-of-boosts-approach scores "das das das das" higher than "das alte testament".

    was (Author: mdz-munich):
    WeightOrderFragmentsBuilder_table02.html:
    A more-word-queries for 'das alte testament'. Obviously, the sum-of-boosts-approach scores "das das das das" higher than "das alte testament".
  • sebastian L. (Updated) (JIRA) at Sep 30, 2011 at 12:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java

    The two tables were created by this simple class. As a representative sample, I took some single pages from our book stock as documents to build a "bag of words".
  • sebastian L. (Commented) (JIRA) at Sep 30, 2011 at 12:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118012#comment-13118012 ]

    sebastian L. commented on LUCENE-3440:
    --------------------------------------

    Hm, I tried to do that all with trunk but:

    {code:borderStyle=dotted}
    29.09.2011 15:43:09 org.apache.solr.common.SolrException log
    SEVERE: java.lang.VerifyError: class org.apache.lucene.analysis.ReusableAnalyzerBase overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
    at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
    at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:403)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:407)
    at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:456)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1653)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1647)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1680)
    at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:875)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:507)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
    at org.apache.catalina.core.ApplicationFilterConfig.(StandardContext.java:4001)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4651)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
    at org.apache.catalina.core.StandardHost.start(StandardHost.java:785)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
    at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:445)
    at org.apache.catalina.core.StandardService.start(StandardService.java:519)
    at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:581)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
    {code}
  • sebastian L. (Commented) (JIRA) at Sep 30, 2011 at 12:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118023#comment-13118023 ]

    sebastian L. commented on LUCENE-3440:
    --------------------------------------

    *testament*
    ||Terms in fragment||totalWeight||totalBoost||
    |testament testament|1.8171139|2.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    ----

    *das alte testament*
    ||Terms in fragment||totalWeight||totalBoost||
    |das alte testament|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament alte|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament|2.9178061|2.0|
    |das alte|2.9178061|2.0|
    |testament testament|1.8171139|2.0|
    |das das das das|1.5566137|4.0|
    |das das das|1.348067|3.0|
    |alte|1.2848935|1.0|
    |alte|1.2848935|1.0|
    |das das|1.100692|2.0|
    |das das|1.100692|2.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    ----
    Awesome table-formatting!
    FastVectorHighlighter: IDF-weighted terms for ordered fragments
    ----------------------------------------------------------------

    Key: LUCENE-3440
    URL: https://issues.apache.org/jira/browse/LUCENE-3440
    Project: Lucene - Java
    Issue Type: Improvement
    Components: modules/highlighter
    Affects Versions: 3.5, 4.0
    Reporter: sebastian L.
    Priority: Minor
    Labels: FastVectorHighlighter
    Fix For: 3.5, 4.0

    Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-3.patch, WeightOrderFragmentsBuilder_table01.html, WeightOrderFragmentsBuilder_table02.html


  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:40 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118010#comment-13118010 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:38 PM:
    ----------------------------------------------------------------

    LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java:
    The two tables were generated by this simple class. As a representative sample, I took some single pages from our book stock as documents to build a "bag-of-words".

  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:40 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118010#comment-13118010 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:38 PM:
    ----------------------------------------------------------------

    LUCENE\-3.5-SNAPSHOT-3440-6-ProofOfConcept.java:
    The two tables were generated by this simple class. As a representative sample, I took some single pages from our book stock as documents to build a "bag-of-words".

  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118023#comment-13118023 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:41 PM:
    ----------------------------------------------------------------

    WeightOrderFragmentsBuilder_table01.html:
    A single-term query for *testament*. With only one query term, the sum-of-distinct-weights approach produces the same ranking as the existing sum-of-boosts approach.
    ||Terms in fragment||totalWeight||totalBoost||
    |testament testament|1.8171139|2.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    ----

    WeightOrderFragmentsBuilder_table02.html:
    A multi-term query for *das alte testament*. The sum-of-boosts approach scores "das das das das" (4.0) higher than "das alte testament" (3.0).
    ||Terms in fragment||totalWeight||totalBoost||
    |das alte testament|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament alte|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament|2.9178061|2.0|
    |das alte|2.9178061|2.0|
    |testament testament|1.8171139|2.0|
    {color:red} |das das das das|1.5566137|4.0|{color}
    |das das das|1.348067|3.0|
    |alte|1.2848935|1.0|
    |alte|1.2848935|1.0|
    |das das|1.100692|2.0|
    |das das|1.100692|2.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    ----
    Awesome table-formatting!

  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118023#comment-13118023 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:43 PM:
    ----------------------------------------------------------------

    WeightOrderFragmentsBuilder_table01.html:
    A single-term query for *testament*. With only one query term, the sum-of-distinct-weights approach produces the same ranking as the existing sum-of-boosts approach.
    ||Terms in fragment||totalWeight||totalBoost||
    |testament testament|{color:blue}1.8171139{color}|{color:blue}2.0{color}|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    ----

    WeightOrderFragmentsBuilder_table02.html:
    A multi-term query for *das alte testament*. The sum-of-boosts approach scores "das das das das" (4.0) higher than "das alte testament" (3.0).
    ||Terms in fragment||totalWeight||totalBoost||
    |das alte testament|{color:blue}5.799069{color}|3.0|
    |das alte testament|5.799069|3.0|
    |das testament alte|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament|2.9178061|2.0|
    |das alte|2.9178061|2.0|
    |testament testament|1.8171139|2.0|
    |das das das das|1.5566137|{color:red}4.0{color}|
    |das das das|1.348067|3.0|
    |alte|1.2848935|1.0|
    |alte|1.2848935|1.0|
    |das das|1.100692|2.0|
    |das das|1.100692|2.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    ----
    Awesome table-formatting!

  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118023#comment-13118023 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:45 PM:
    ----------------------------------------------------------------

    WeightOrderFragmentsBuilder_table01.html:
    A single-term query for *testament*. With only one query term, the _sum-of-distinct-weights_ approach produces the same ranking as the existing sum-of-boosts approach.
    ||Terms in fragment||totalWeight||totalBoost||
    |testament testament|{color:blue}1.8171139{color}|{color:blue}2.0{color}|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    ----

    WeightOrderFragmentsBuilder_table02.html:
    A multi-term query for *das alte testament*. The _sum-of-boosts_ approach scores "das das das das" (4.0) higher than "das alte testament" (3.0).
    ||Terms in fragment||totalWeight||totalBoost||
    |das alte testament|{color:blue}5.799069{color}|3.0|
    |das alte testament|5.799069|3.0|
    |das testament alte|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament|2.9178061|2.0|
    |das alte|2.9178061|2.0|
    |testament testament|1.8171139|2.0|
    |das das das das|1.5566137|{color:red}4.0{color}|
    |das das das|1.348067|3.0|
    |alte|1.2848935|1.0|
    |alte|1.2848935|1.0|
    |das das|1.100692|2.0|
    |das das|1.100692|2.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    ----
    Awesome table-formatting!

  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 12:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118023#comment-13118023 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 12:46 PM:
    ----------------------------------------------------------------

    WeightOrderFragmentsBuilder_table01.html:
    A single-term query for *testament*. With only one query term, the _sum-of-distinct-weights_ approach produces the same ranking as the existing sum-of-boosts approach.
    ||Terms in fragment||totalWeight||totalBoost||
    |testament testament|{color:blue}1.8171139{color}|{color:blue}2.0{color}|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    |testament|1.2848935|1.0|
    ----

    WeightOrderFragmentsBuilder_table02.html:
    A multi-term query for *das alte testament*. The _sum-of-boosts_ approach scores *das das das das* (4.0) higher than *das alte testament* (3.0).
    ||Terms in fragment||totalWeight||totalBoost||
    |das alte testament|{color:blue}5.799069{color}|3.0|
    |das alte testament|5.799069|3.0|
    |das testament alte|5.799069|3.0|
    |das alte testament|5.799069|3.0|
    |das testament|2.9178061|2.0|
    |das alte|2.9178061|2.0|
    |testament testament|1.8171139|2.0|
    |das das das das|1.5566137|{color:red}4.0{color}|
    |das das das|1.348067|3.0|
    |alte|1.2848935|1.0|
    |alte|1.2848935|1.0|
    |das das|1.100692|2.0|
    |das das|1.100692|2.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    |das|0.77830684|1.0|
    ----
    Awesome table-formatting!
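
    The two columns can be reproduced with a small sketch. The per-term weights are read off the single-term rows of the tables; the class and method names are hypothetical and not taken from the patch. The tabled totalWeight values are consistent with "sum of the distinct term weights, multiplied by the square root of the number of term occurrences in the fragment". That normalization factor is an inference from the numbers, not something stated in this thread, so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical demo class; none of these names come from the patch. */
class FragmentWeightDemo {

    // Per-term weights taken from the single-term rows of the tables.
    static final Map<String, Double> TERM_WEIGHT = Map.of(
            "das", 0.77830684,
            "alte", 1.2848935,
            "testament", 1.2848935);

    /** Sum-of-boosts: every occurrence counts, here with a boost of 1.0 each. */
    static double totalBoost(String fragmentTerms) {
        return fragmentTerms.split(" ").length * 1.0;
    }

    /**
     * Sum-of-distinct-weights: each distinct term contributes its weight once.
     * The sqrt(occurrences) factor is an assumption inferred from the tables.
     */
    static double totalWeight(String fragmentTerms) {
        String[] terms = fragmentTerms.split(" ");
        Set<String> distinct = new LinkedHashSet<>(Arrays.asList(terms));
        double sum = 0.0;
        for (String term : distinct) {
            sum += TERM_WEIGHT.get(term);
        }
        return sum * Math.sqrt(terms.length);
    }

    public static void main(String[] args) {
        for (String f : new String[] {"das alte testament", "das das das das"}) {
            System.out.printf("%-20s totalWeight=%.6f totalBoost=%.1f%n",
                    f, totalWeight(f), totalBoost(f));
        }
    }
}
```

    Under these assumptions "das das das das" gets a boost sum of 4.0 but a weight of only about 1.5566, while "das alte testament" gets 3.0 and about 5.7991, which is the reordering the second table shows.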

  • sebastian L. (Issue Comment Edited) (JIRA) at Sep 30, 2011 at 1:33 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118010#comment-13118010 ]

    sebastian L. edited comment on LUCENE-3440 at 9/30/11 1:31 PM:
    ---------------------------------------------------------------

    LUCENE\-3.5-SNAPSHOT-3440-6-ProofOfConcept.java:
    The two tables were generated by this simple class. As a representative sample, I took some single pages from our book stock as documents to build a "bag-of-words".

  • sebastian L. (Commented) (JIRA) at Oct 1, 2011 at 12:47 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118784#comment-13118784 ]

    sebastian L. commented on LUCENE-3440:
    --------------------------------------

    Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 4.0-SNAPSHOT.

    Another patch, another idea! :)

    Some thoughts:
    - With the last patch, sum-of-distinct-weights will be calculated anyhow, even if ScoreOrderFragmentsBuilder is used.
    - Also regardless of further calculations, FieldTermsStack retrieves document frequency for each term from IndexReader in any case.
    - Solr-Developers have no chance to implement a FragmentsBuilder-plugin with their custom-scoring for fragments, because the weighting-formula is "hard-coded" in WeightedFragInfo. BTW, that's the reason I started to work on this patch anyway.

    Possible Solution:

    1. Collect and pass all needed information to the BaseFragmentsBuilder-implementation
    - Introduction of TermInfo.fieldName
    - Introduction of WeightedFragInfo.phraseInfos
    - Passing an instance of IndexReader as argument to BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed statistical data from the index

    2. Move the calculation of sum-of-boosts to ScoreOrderFragmentsBuilder.calculateScore()

    {code}
    /**
     * Compute WeightedFragInfo.score based on query-boosts
     * @throws IOException
     */
    public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{
      for( WeightedFragInfo wfi : weightedFragInfos ){
        // The fragment score is the plain sum of the boosts of its phrases
        for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
          wfi.score += wpi.boost;
        }
      }
      return weightedFragInfos;
    }
    {code}

    3. Calculation of sum-of-distinct-weights with WeightOrderFragmentsBuilder.calculateScore()

    - In this patch WeightOrderFragmentsBuilder is a subclass of ScoreOrderFragmentsBuilder.
    - But I think the introduction of an abstract class OrderedFragmentsBuilder as superclass of BoostOrderFragmentsBuilder and WeightOrderFragmentsBuilder would be a better strategy.
    - Moving calculateScore() into BaseFragmentsBuilder and making it abstract would be another idea.
    - The _sum-of-distinct-weight_-approach is the same as presented in the last patch.

    {code}
    /**
     * Compute WeightedFragInfo.score based on IDF-weighted terms
     * @throws IOException
     */
    @Override
    public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{

      // Cache of term text -> IDF weight, shared across all fragments
      Map<String, Float> lookup = new HashMap<String, Float>();
      // Terms already weighted within the current fragment
      HashSet<String> distinctTerms = new HashSet<String>();

      int numDocs = reader.numDocs() - reader.numDeletedDocs();

      int docFreq;
      int length;
      float boost;
      float weight;

      for( WeightedFragInfo wfi : weightedFragInfos ){
        distinctTerms.clear();
        length = 0;
        boost = 0;
        for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
          for( TermInfo ti : wpi.termInfos ) {
            length++;
            // Each distinct term contributes only once per fragment
            if( !distinctTerms.add( ti.text ) )
              continue;
            if ( lookup.containsKey( ti.text ) )
              weight = lookup.get( ti.text ).floatValue();
            else {
              docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
              weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
              lookup.put( ti.text, new Float( weight ) );
            }
            boost += Math.pow( weight, 2 ) * wpi.boost;
          }
        }
        wfi.score = ( float ) ( boost * length * ( 1 / Math.sqrt( length ) ) );
      }

      return weightedFragInfos;
    }
    {code}
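
The weighting above can be tried outside Lucene with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: MatchedTerm and IdfFragmentScoringSketch are hypothetical stand-ins (not Lucene classes), document frequencies come from a plain map instead of an IndexReader, and an empty fragment is given a score of 0.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a matched term plus the boost of its query clause. */
final class MatchedTerm {
    final String text;
    final float queryBoost;

    MatchedTerm(String text, float queryBoost) {
        this.text = text;
        this.queryBoost = queryBoost;
    }
}

/** Standalone sketch of the sum-of-distinct-weights idea; not a Lucene class. */
final class IdfFragmentScoringSketch {

    /** IDF-style weight as in the snippet above: ln(numDocs / (docFreq + 1)) + 1. */
    static float idfWeight(int numDocs, int docFreq) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * Sums weight^2 * boost over the distinct terms of one fragment, then
     * scales by length / sqrt(length), where length counts every match.
     */
    static float scoreFragment(List<MatchedTerm> matches, Map<String, Integer> docFreqs, int numDocs) {
        Set<String> distinctTerms = new HashSet<>();
        float boost = 0f;
        int length = 0;
        for (MatchedTerm m : matches) {
            length++;
            if (!distinctTerms.add(m.text)) {
                continue; // repeated term: counted in length, not weighted again
            }
            float weight = idfWeight(numDocs, docFreqs.getOrDefault(m.text, 0));
            boost += weight * weight * m.queryBoost;
        }
        if (length == 0) {
            return 0f; // assumption: an empty fragment scores 0
        }
        return (float) (boost * length / Math.sqrt(length));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 900);        // very common term
        docFreqs.put("highlighter", 9);  // rare term
        List<MatchedTerm> common = List.of(
            new MatchedTerm("the", 1f), new MatchedTerm("the", 1f), new MatchedTerm("the", 1f));
        List<MatchedTerm> mixed = List.of(
            new MatchedTerm("the", 1f), new MatchedTerm("highlighter", 1f));
        System.out.println("three common matches: " + scoreFragment(common, docFreqs, 1000));
        System.out.println("common + rare match:  " + scoreFragment(mixed, docFreqs, 1000));
    }
}
```

With equal weights the fragment holding three matches of a common term would win on match count alone; in this sketch the fragment that also contains the rare term scores higher.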

    With this approach programmers can implement their own fragment weighting with ease, simply by overriding calculateScore().
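
The override hook can be illustrated with a small template-method sketch. The types below (FragInfoSketch, OrderedBuilderSketch and its two subclasses) are hypothetical stand-ins invented for this example and do not mirror the real FastVectorHighlighter class hierarchy; the sketch only shows that the ordering logic stays fixed while the scoring is swapped by subclassing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for WeightedFragInfo: per-phrase boosts plus a score. */
final class FragInfoSketch {
    final String label;
    final float[] phraseBoosts;
    float score;

    FragInfoSketch(String label, float... phraseBoosts) {
        this.label = label;
        this.phraseBoosts = phraseBoosts;
    }
}

/** Hypothetical base class: ordering is fixed, scoring is the overridable hook. */
abstract class OrderedBuilderSketch {

    /** Subclasses assign FragInfoSketch.score for every fragment. */
    abstract void calculateScore(List<FragInfoSketch> fragInfos);

    /** Scores the fragments, then returns a copy sorted by descending score. */
    final List<FragInfoSketch> order(List<FragInfoSketch> fragInfos) {
        calculateScore(fragInfos);
        List<FragInfoSketch> sorted = new ArrayList<>(fragInfos);
        sorted.sort(Comparator.comparingDouble((FragInfoSketch f) -> f.score).reversed());
        return sorted;
    }
}

/** Sum-of-boosts scoring, analogous to the first snippet in this comment. */
final class BoostSumSketch extends OrderedBuilderSketch {
    @Override
    void calculateScore(List<FragInfoSketch> fragInfos) {
        for (FragInfoSketch f : fragInfos) {
            f.score = 0f;
            for (float b : f.phraseBoosts) {
                f.score += b;
            }
        }
    }
}

/** A custom scoring chosen only for illustration: the strongest phrase boost wins. */
final class MaxBoostSketch extends OrderedBuilderSketch {
    @Override
    void calculateScore(List<FragInfoSketch> fragInfos) {
        for (FragInfoSketch f : fragInfos) {
            f.score = 0f;
            for (float b : f.phraseBoosts) {
                f.score = Math.max(f.score, b);
            }
        }
    }
}
```

Given one fragment with three phrases of boost 1 and another with a single phrase of boost 2.5, BoostSumSketch ranks the first fragment on top and MaxBoostSketch ranks the second on top, without any change to order().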

    I think the major drawback of this idea is that the FragmentsBuilder must traverse the whole stack of WeightedFragInfo once again. Since we have tomes with more than 3000 pages of OCR, this _could_ be a problem, but I can't confirm that for sure. One way to avoid this would be to make FieldFragList "pluggable" via an interface "FragList", so that the FragmentsBuilder-plugin could be parametrized with the intended implementation of FragList:

    {code:xml}
    <highlighter>
      <fragmentsBuilder name="weight-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
        <fragList class="org.apache.lucene.search.vectorhighlight.WeightedFragList" />
      </fragmentsBuilder>
      <fragmentsBuilder name="boost-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
        <fragList class="org.apache.lucene.search.vectorhighlight.BoostedFragList" />
      </fragmentsBuilder>
    </highlighter>
    {code}

    Further notes:
    - As shown in this patch "WeightedFragInfo.totalBoost" should be renamed to "WeightedFragInfo.score".
    - As shown in this patch "ScoreOrderFragmentsBuilder" should be renamed to "BoostOrderFragmentsBuilder".
  • sebastian L. (Updated) (JIRA) at Oct 1, 2011 at 12:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: LUCENE-4.0-SNAPSHOT-3440-6.patch

    Patch for 4.0 trunk.
  • sebastian L. (Updated) (JIRA) at Oct 1, 2011 at 12:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: (was: LUCENE-4.0-SNAPSHOT-3440-3.patch)
  • sebastian L. (Issue Comment Edited) (JIRA) at Oct 1, 2011 at 12:57 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118784#comment-13118784 ]

    sebastian L. edited comment on LUCENE-3440 at 10/1/11 12:56 PM:
    ----------------------------------------------------------------

    Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 4.0-SNAPSHOT.

    Another patch, another idea! :)

    Some thoughts:
    - With the last patch, sum-of-distinct-weights will be calculated anyhow, even if ScoreOrderFragmentsBuilder is used.
    - Also regardless of further calculations, FieldTermsStack retrieves document frequency for each term from IndexReader in any case.
    - Solr-Developers have no chance to implement a FragmentsBuilder-plugin with their custom-scoring for fragments, because the weighting-formula is "hard-coded" in WeightedFragInfo. BTW, that's the reason I started to work on this patch anyway.

    Possible Solution:

    1. Collect and pass all needed information to the BaseFragmentsBuilder-implementation
    - Introduction of TermInfo.fieldName
    - Introduction of WeightedFragInfo.phraseInfos
    - Passing an instance of IndexReader as argument to BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed statistical data from the index

    2. Move the calculation of sum-of-boosts to ScoreOrderFragmentsBuilder.calculateScore()

    {code}
    /**
     * Compute WeightedFragInfo.score based on query-boosts
     * @throws IOException
     */
    public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{
      for( WeightedFragInfo wfi : weightedFragInfos ){
        // The fragment score is the plain sum of the boosts of its phrases
        for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
          wfi.score += wpi.boost;
        }
      }
      return weightedFragInfos;
    }
    {code}

    3. Calculation of sum-of-distinct-weights with WeightOrderFragmentsBuilder.calculateScore()

    - In this patch WeightOrderFragmentsBuilder is a subclass of ScoreOrderFragmentsBuilder.
    - But I think the introduction of an abstract class OrderedFragmentsBuilder as superclass of ScoreOrderFragmentsBuilder and WeightOrderFragmentsBuilder would be a better strategy.
    - Moving calculateScore() into BaseFragmentsBuilder and making it abstract would be another idea.
    - The _sum-of-distinct-weight_-approach is the same as presented in the last patch.

    {code}
    /**
     * Compute WeightedFragInfo.score based on IDF-weighted terms
     * @throws IOException
     */
    @Override
    public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{

      // Cache of term text -> IDF weight, shared across all fragments
      Map<String, Float> lookup = new HashMap<String, Float>();
      // Terms already weighted within the current fragment
      HashSet<String> distinctTerms = new HashSet<String>();

      int numDocs = reader.numDocs() - reader.numDeletedDocs();

      int docFreq;
      int length;
      float boost;
      float weight;

      for( WeightedFragInfo wfi : weightedFragInfos ){
        distinctTerms.clear();
        length = 0;
        boost = 0;
        for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
          for( TermInfo ti : wpi.termInfos ) {
            length++;
            // Each distinct term contributes only once per fragment
            if( !distinctTerms.add( ti.text ) )
              continue;
            if ( lookup.containsKey( ti.text ) )
              weight = lookup.get( ti.text ).floatValue();
            else {
              docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
              weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
              lookup.put( ti.text, new Float( weight ) );
            }
            boost += Math.pow( weight, 2 ) * wpi.boost;
          }
        }
        wfi.score = ( float ) ( boost * length * ( 1 / Math.sqrt( length ) ) );
      }

      return weightedFragInfos;
    }
    {code}

    With this approach programmers can implement their own fragment weighting with ease, simply by overriding calculateScore().

    I think the major drawback of this idea is that the FragmentsBuilder must traverse the whole stack of WeightedFragInfo once again. Since we have tomes with more than 3000 pages of OCR, this _could_ be a problem, but I can't confirm that for sure. One way to avoid this would be to make FieldFragList "pluggable" via an interface "FragList", so that the FragmentsBuilder-plugin could be parametrized with the intended implementation of FragList:

    {code:xml}
    <highlighter>
      <fragmentsBuilder name="weight-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
        <fragList class="org.apache.lucene.search.vectorhighlight.WeightedFragList" />
      </fragmentsBuilder>
      <fragmentsBuilder name="boost-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
        <fragList class="org.apache.lucene.search.vectorhighlight.BoostedFragList" />
      </fragmentsBuilder>
    </highlighter>
    {code}

    Further notes:
    - As shown in this patch "WeightedFragInfo.totalBoost" should be renamed to "WeightedFragInfo.score".
    - "ScoreOrderFragmentsBuilder" should be renamed to "BoostOrderFragmentsBuilder".

  • sebastian L. (Commented) (JIRA) at Oct 1, 2011 at 1:17 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118799#comment-13118799 ]

    sebastian L. commented on LUCENE-3440:
    --------------------------------------

    Hm, since FieldFragList is created in SimpleFragListBuilder.createFieldFragList(), it should look more like this:

    {code:xml}
    <highlighter>
      <fragListBuilder name="simple-boosted" class="org.apache.solr.highlight.SimpleFragListBuilder">
        <fragList name="boosted" class="org.apache.lucene.search.vectorhighlight.BoostedFragList"/>
      </fragListBuilder>
      <fragListBuilder name="simple-weighted" class="org.apache.solr.highlight.SimpleFragListBuilder" default="true">
        <fragList name="weighted" class="org.apache.lucene.search.vectorhighlight.WeightedFragList"/>
      </fragListBuilder>
      <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
    </highlighter>
    {code}

  • sebastian L. (Commented) (JIRA) at Oct 4, 2011 at 1:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120049#comment-13120049 ]

    sebastian L. commented on LUCENE-3440:
    --------------------------------------

    Another patch for 4.0. This one makes FieldFragList "pluggable".

    This patch contains:
    - Introduction of interface FieldFragList
    - Introduction of abstract class BaseFieldFragList which contains SubInfo and FieldFragInfo (I renamed WeightedFragInfo)
    - Introduction of class SimpleFieldFragList (default)
    - Introduction of class WeightedFieldFragList
    - Introduction of abstract class BaseFragListBuilder
    - Introduction of class SimpleFragListBuilder (default)
    - Introduction of class WeightedFragListBuilder

    The weighting-formula now depends on the implementation of
    FieldFragList.add(int startOffset, int endOffset, List<FieldPhraseInfo> phraseInfoList):

    {code:java}
    /* (non-Javadoc)
     * @see org.apache.lucene.search.vectorhighlight.FieldFragList#getFragInfos()
     */
    @Override
    public void add( int startOffset, int endOffset, List<FieldPhraseInfo> phraseInfoList ) {
      float score = 0;
      List<SubInfo> subInfos = new ArrayList<SubInfo>();
      for( FieldPhraseInfo phraseInfo : phraseInfoList ){
        subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffset(), phraseInfo.getSeqnum() ) );
        score += phraseInfo.getBoost();
      }
      getFragInfos().add( new FieldFragInfo( startOffset, endOffset, subInfos, score ) );
    }
    {code}

    The chosen FieldFragList depends on FragListBuilder.createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ):

    {code:java}
    /* (non-Javadoc)
     * @see org.apache.lucene.search.vectorhighlight.FragListBuilder#createFieldFragList(FieldPhraseList fieldPhraseList, int fragCharSize)
     */
    @Override
    public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
      return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
    }
    {code}
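
How the two snippets fit together can be sketched in a self-contained form. All names ending in "Sketch" are hypothetical stand-ins invented for this example, each phrase carries a precomputed term weight instead of reading it from the index, and the weighted variant simply sums weight * boost over distinct phrase texts; none of this is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for FieldPhraseInfo: text, query boost, precomputed weight. */
final class PhraseSketch {
    final String text;
    final float boost;
    final float weight;

    PhraseSketch(String text, float boost, float weight) {
        this.text = text;
        this.boost = boost;
        this.weight = weight;
    }
}

/** Hypothetical stand-in for FieldFragInfo. */
final class FragSketch {
    final int startOffset;
    final int endOffset;
    final float score;

    FragSketch(int startOffset, int endOffset, float score) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.score = score;
    }
}

/** The pluggable part: each implementation decides how add() computes the score. */
interface FragListSketch {
    void add(int startOffset, int endOffset, List<PhraseSketch> phrases);
    List<FragSketch> getFragInfos();
}

/** Sum-of-boosts scoring, analogous to the add() method shown above. */
final class SimpleFragListSketch implements FragListSketch {
    private final List<FragSketch> fragInfos = new ArrayList<>();

    @Override
    public void add(int startOffset, int endOffset, List<PhraseSketch> phrases) {
        float score = 0f;
        for (PhraseSketch p : phrases) {
            score += p.boost;
        }
        fragInfos.add(new FragSketch(startOffset, endOffset, score));
    }

    @Override
    public List<FragSketch> getFragInfos() {
        return fragInfos;
    }
}

/** Assumed weighted variant: each distinct phrase text counts once, scaled by its weight. */
final class WeightedFragListSketch implements FragListSketch {
    private final List<FragSketch> fragInfos = new ArrayList<>();

    @Override
    public void add(int startOffset, int endOffset, List<PhraseSketch> phrases) {
        Set<String> seen = new HashSet<>();
        float score = 0f;
        for (PhraseSketch p : phrases) {
            if (seen.add(p.text)) {
                score += p.weight * p.boost;
            }
        }
        fragInfos.add(new FragSketch(startOffset, endOffset, score));
    }

    @Override
    public List<FragSketch> getFragInfos() {
        return fragInfos;
    }
}

/** Builders share the fragment logic and differ only in the list they create. */
abstract class FragListBuilderSketch {
    abstract FragListSketch newFragList();

    /** Simplification: all phrases go into one fragment. */
    final FragListSketch createFragList(int startOffset, int endOffset, List<PhraseSketch> phrases) {
        FragListSketch list = newFragList();
        list.add(startOffset, endOffset, phrases);
        return list;
    }
}

final class SimpleBuilderSketch extends FragListBuilderSketch {
    @Override
    FragListSketch newFragList() {
        return new SimpleFragListSketch();
    }
}

final class WeightedBuilderSketch extends FragListBuilderSketch {
    @Override
    FragListSketch newFragList() {
        return new WeightedFragListSketch();
    }
}
```

Swapping SimpleBuilderSketch for WeightedBuilderSketch changes the score stored in each fragment without touching the code that later orders the fragments, which is the property the refactoring aims for.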

    Of course, Solr-config could look like this:

    {code:xml}
    <highlighter>
      <fragListBuilder name="simple" class="org.apache.solr.highlight.SimpleFragListBuilder"/>
      <fragListBuilder name="weighted" class="org.apache.solr.highlight.WeightedFragListBuilder" default="true"/>
      <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
    </highlighter>
    {code}

    I think this is the best possible approach, because it maintains backwards compatibility but also does some refactoring, which should make it easier to plug in different approaches in the future.

    But, after a few weeks of banging my head against the wall I have to admit: I have no idea. ;)

  • sebastian L. (Updated) (JIRA) at Oct 4, 2011 at 1:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: LUCENE-4.0-SNAPSHOT-3440-7.patch

    Patch for trunk (1177996)

    Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-7.patch, WeightOrderFragmentsBuilder_table01.html, WeightOrderFragmentsBuilder_table02.html


  • sebastian L. (Updated) (JIRA) at Oct 4, 2011 at 1:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: (was: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java)

    Attachments: LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-7.patch, WeightOrderFragmentsBuilder_table01.html, WeightOrderFragmentsBuilder_table02.html


  • sebastian L. (Updated) (JIRA) at Oct 4, 2011 at 1:15 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: (was: WeightOrderFragmentsBuilder_table01.html)

    Attachments: LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-7.patch, weight-vs-boost_table01.html


  • sebastian L. (Updated) (JIRA) at Oct 4, 2011 at 1:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    sebastian L. updated LUCENE-3440:
    ---------------------------------

    Attachment: weight-vs-boost_table01.html

    Attachments: LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-7.patch, weight-vs-boost_table01.html



Discussion Overview
group: dev @
categories: lucene
posted: Sep 20, '11 at 9:33a
active: Jun 13, '12 at 1:59p
posts: 101
users: 1
website: lucene.apache.org

1 user in discussion

Sebastian Lutze (JIRA): 101 posts
