Our apps use highlighting, and I expect that highlighting is an
expensive operation since it requires processing the text of the
documents, but I ran a test and was surprised just how expensive it is.
I made a test index with three fields: path, modified, and contents. I
made the index using org.apache.lucene.demo.IndexFiles modified so that
the contents field is stored and analyzed:

doc.add(new Field("contents", false, buf.toString(),
Store.YES, Index.ANALYZED,
TermVector.WITH_POSITIONS_OFFSETS));

There are about 8000 documents in the index, and the contents field
averages around 7500 bytes. The total index directory size is about 242M.

I ran a modified version of the demo.SearchFiles class that doesn't
print anything out (printing results takes most of the time for faster
queries), and runs random queries drawn from the text of the documents:
these are a mix of (mostly) term queries, and about 20% phrase queries
(that are phrases from the text).
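
A minimal sketch of how such a query mix could be drawn, assuming the document text has already been split into words. The class name `RandomQueries`, the two-word phrase length, and the quoted query-string form are illustrative assumptions, not the code used in the test:

```java
import java.util.Random;

// Hypothetical helper: draws query strings from the words of a document.
class RandomQueries {
    // Returns a single word (a term query) or, about 20% of the time,
    // two adjacent words in quotes (a phrase query). Needs at least two words.
    static String next(String[] words, Random rnd) {
        if (rnd.nextInt(100) < 20) {
            int i = rnd.nextInt(words.length - 1);
            return "\"" + words[i] + " " + words[i + 1] + "\"";
        }
        return words[rnd.nextInt(words.length)];
    }

    public static void main(String[] args) {
        String[] words = "highlighting is an expensive operation".split("\\s+");
        Random rnd = new Random();
        for (int n = 0; n < 5; n++) {
            System.out.println(next(words, rnd));
        }
    }
}
```

Because the phrases are taken from adjacent words in the text, every generated query is guaranteed to match at least one document.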

I compared a few cases: no field access, un-highlighted retrieval, and
highlighting with Highlighter and with FastVectorHighlighter, always
asking for the 10 top-scoring docs per query and running at least 1000
queries for each case.
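
The measurement loop behind the qps figures can be sketched roughly as below. This is an assumption about the shape of the harness, not the modified SearchFiles code itself; `QpsTimer` and `measureQps` are hypothetical names, and the lambda stands in for the search, retrieval, or highlighting work done per query:

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: times a per-query action and reports queries per second.
class QpsTimer {
    // Runs the action once per query and returns the observed throughput.
    static <Q> double measureQps(List<Q> queries, Consumer<Q> action) {
        long start = System.nanoTime();
        for (Q q : queries) {
            action.accept(q);
        }
        // Guard against a zero elapsed time on very fast runs.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return queries.size() * 1_000_000_000.0 / elapsedNanos;
    }

    public static void main(String[] args) {
        List<String> queries = Collections.nCopies(1000, "distinctive");
        // Stand-in for the real per-query work, e.g. searcher.search(query, 10).
        double qps = measureQps(queries, q -> q.length());
        System.out.println("qps = " + qps);
    }
}
```

Running each case through the same loop keeps the comparison fair; a warm-up pass before timing would reduce JIT and cache effects.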

No field access at all gets about 7000 qps; basically we just call
searcher.search(query, 10)

Then there is a big cost for retrieving the stored documents from the index:

Retrieving each document (calling searcher.doc(docID)) and the path field
only (a small field) gets about 250 qps

As a comparison, if I don't store the contents field in the index (and
don't retrieve it at all), I get similar performance to the no retrieval
case (around 7000 qps). OK - so there is a fair amount of I/O required
to retrieve the stored doc; this may be unavoidable, although do
consider that for highlighting only a small portion of the doc may
ultimately be required.

Then another big penalty is paid for highlighting:

Highlighter gets about 60 qps

And finally I am really mystified about this one:

FastVectorHighlighter gets about 20 qps. There is a lot of variance here
(say 9-44 qps), although always worse than Highlighter.

If these results hold up I'll be astonished, since they imply:

(1) FVH is not fast
(2) Highlighting consumes most processing time (around 80%) in the best
case, as compared to just retrieving un-highlighted documents.

and the follow on is that at least for users that need highlighting,
there is hardly any point in optimizing anything else!
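
The "around 80%" figure follows from the throughput numbers above: at 250 qps a query takes 4 ms, at 60 qps about 16.7 ms, so roughly 76% of the time is added by highlighting; with FastVectorHighlighter at 20 qps the share is about 92%. A small sketch of that arithmetic (the class and method names are illustrative):

```java
// Hypothetical helper: converts throughput figures into time shares.
class HighlightCost {
    // Milliseconds spent per query at a given throughput.
    static double millisPerQuery(double qps) {
        return 1000.0 / qps;
    }

    // Fraction of per-query time added on top of a faster baseline.
    static double addedShare(double baselineQps, double measuredQps) {
        return 1.0 - measuredQps / baselineQps;
    }

    public static void main(String[] args) {
        double retrievalQps = 250, highlighterQps = 60, fvhQps = 20;
        System.out.printf("retrieval only: %.1f ms/query%n", millisPerQuery(retrievalQps));
        System.out.printf("Highlighter share: %.0f%%%n", 100 * addedShare(retrievalQps, highlighterQps));
        System.out.printf("FVH share: %.0f%%%n", 100 * addedShare(retrievalQps, fvhQps));
    }
}
```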

I thought maybe FVH required a lot of memory, so I set -Xmx512m (up
from the default: 64m I think), but this had no effect.

I also tried optimizing the index, and although this improved query
performance somewhat across the board, it actually accentuated the cost
of highlighting since the most marked improvement was in the basic
unhighlighted query.

Here is what the highlighting looks like:

For FVH we allocate a single SimpleFragsListBuilder,
SimpleFragmentBuilder, preTags[1], postTags[1] and DefaultEncoder so
these don't have to be created for each query. We also cache the
FastVectorHighlighter itself, and we call:

highlighter.getBestFragment(highlighter.getFieldQuery(query),
searcher.getIndexReader(), hits[i].doc, "contents", 40, flb, fb,
preTags, postTags, encoder);

once for each result.

In the Highlighter case, we also cache the Highlighter and call:

highlighter.getBestFragment(analyzer, "contents", doc.get("contents"));

Does this performance profile match up with your expectations? Did I do
something stupid? Please let me know if I can provide more info. I'm
considering what can be done to speed up highlighting, but don't want to
go off half-cocked.

--
Michael Sokolov
Engineering Director
www.ifactory.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Koji Sekiguchi at Jun 21, 2011 at 12:22 am
    Mike,

    FVH used to be faster for large docs. I wrote the FVH section for Lucene in Action, which said:

    In contrib/benchmark (covered in appendix C), there’s an algorithm
    file called highlight-vs-vector-highlight.alg that lets you see the difference
    between two highlighters in processing time. As of version 2.9, with modern hardware,
    that algorithm shows that FastVectorHighlighter is about two and a half times faster
    than Highlighter.

    That number was for the Lucene 2.9 era and Wikipedia data; it may be different today.

    Anyway, thank you for sharing an interesting result!

    koji
    --
    http://www.rondhuit.com/en/

  • Michael Sokolov at Jun 21, 2011 at 2:55 am
    Koji- I'm not familiar with the benchmarking system, but maybe I'll see
    if I can run that benchmark on my test data as a point of comparison -
    thanks for the pointer!

    -Mike
  • Michael Sokolov at Jun 22, 2011 at 12:48 am
    I did that, and the benchmark indicates FVH is 10x faster than
    Highlighter now. I ran with a subset of the Wikipedia data since I
    didn't want to deal with the whole thing. I'm trying to reconcile these
    weirdly varying results. One difference is that the benchmark doesn't
    use PhraseQueries - I added those and it did make FVH slightly slower,
    but not all that much. I'll keep digging.

    -Mike
  • Michael Sokolov at Jun 22, 2011 at 2:29 am
    OK - it seems as if there is a blow-up in FieldPhraseList if a document
    has a large number of occurrences of a term that is in the query. In
    one example, I searched for "1", and this occurs just under 2000 times
    in one of my test documents (as the value of HTML attributes).
    Admittedly a weird case, but when this happens, the highlighting can
    take 300x longer than when searching for a more distinctive term (like
    "distinctive").

    I think there may be a problem here in that every term occurrence is
    compared against every other term occurrence (or every "phrase" within
    which the term may occur - I think?) so there is an n^2 growth factor in
    the number of occurrences of a term in a document. Does that seem possible?
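
To give a feel for the suspected growth, here is a small illustration of how an all-pairs comparison scales with the number of occurrences. This is not the FieldPhraseList code; it only assumes, as hypothesized above, that each occurrence is compared against every later one, and the class name is made up:

```java
// Illustration only: counts comparisons in a hypothetical all-pairs scan.
class PairwiseGrowth {
    // Number of comparisons when each occurrence is checked against every later one.
    static long pairwiseComparisons(int occurrences) {
        long count = 0;
        for (int i = 0; i < occurrences; i++) {
            for (int j = i + 1; j < occurrences; j++) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A distinctive term with a handful of hits versus a term with about 2000 hits.
        System.out.println(pairwiseComparisons(20));
        System.out.println(pairwiseComparisons(2000));
    }
}
```

With 20 occurrences this is 190 comparisons; with 2000 it is 1,999,000, so a 100x increase in occurrences costs roughly 10,000x more work. If the hypothesis holds, a slowdown in the hundreds for a single heavy document would be unsurprising.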

    -Mike
  • Itamar Syn-Hershko at Jun 22, 2011 at 10:26 pm
    I'm not intimately familiar with FVH myself, but that sounds reasonable.
    Tests usually don't lie. I'd definitely like to see a patched version
    that avoids that!

    Itamar.

Discussion Overview
group: java-user @
categories: lucene
posted: Jun 20, '11 at 8:20p
active: Jun 22, '11 at 10:26p
posts: 6
users: 3
website: lucene.apache.org