Hi all,

In our Lucene 3.0.3-based web application, when a user clicks on a hit
link, the targeted PDF should open in the browser with the hits highlighted.

For this purpose, using an Acrobat highlight file (the xml parameter, see
http://www.pdfbox.org/userguide/highlighting.html and
http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf)
seems the most reasonable approach to me.

Since the positions to highlight are given by (page and) character
offsets, and Lucene uses offsets as well, I think it should be easy (for
people more skilled with Lucene than me) to create a Highlighter which
produces this highlight file.

Does such a Highlighter already exist in the Lucene world?

If not, could someone please point me in the right direction (e.g. where
to hook into the existing (fast vector?) highlighter just to extract the
offsets)?

BTW: Luke gave me the impression that term vectors are only stored when
the field content is stored as well. Is that true?

Wulf


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Wulf Berschin at May 12, 2011 at 2:47 pm
    Well, as far as I can see, the Lucene highlighters do not offer this
    functionality via their API, but they easily could.

    I think support for highlighting documents would be a very welcome
    feature. Highlighting HTML documents is already possible with the
    org.apache.solr.analysis.HTMLStripCharFilter and a NullFragmenter, but
    there seems to be nothing for highlighting PDF files...

    As a starting point I extracted from
    org.apache.lucene.search.highlight.Highlighter the class below, which
    just returns the tokens contributing to the hit.

    Using the returned tokens, a PDF highlight file could easily be generated
    and voilà..

    -- Wulf

    package org.apache.lucene.search.highlight;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Declared in the highlighter package, presumably so that it can reach
    // the non-public members of TokenGroup used below.
    public class HighlightTokensExtractor
    {
        private Scorer fragmentScorer = null;

        public HighlightTokensExtractor(Scorer fragmentScorer)
        {
            this.fragmentScorer = fragmentScorer;
        }

        // mergeContiguousFragments and maxNumFragments are carried over from
        // the Highlighter signature and are not used here.
        public final List<Token> getTokens(TokenStream tokenStream, String text,
                boolean mergeContiguousFragments, int maxNumFragments)
                throws IOException, InvalidTokenOffsetsException
        {
            List<Token> result = new ArrayList<Token>();
            TermAttribute termAtt = tokenStream.addAttribute(TermAttribute.class);
            OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
            tokenStream.addAttribute(PositionIncrementAttribute.class);
            tokenStream.reset();

            // dummy text fragment, only needed to initialise the scorer
            TextFragment currentFrag = new TextFragment("", 0, 0);
            TokenStream newStream = fragmentScorer.init(tokenStream);
            if (newStream != null) {
                tokenStream = newStream;
            }
            fragmentScorer.startFragment(currentFrag);

            try {
                TokenGroup tokenGroup = new TokenGroup(tokenStream);

                for (boolean next = tokenStream.incrementToken(); next;
                        next = tokenStream.incrementToken()) {
                    // reject tokens whose offsets lie beyond the supplied text
                    if ((offsetAtt.endOffset() > text.length())
                            || (offsetAtt.startOffset() > text.length())) {
                        throw new InvalidTokenOffsetsException("Token " + termAtt.term()
                                + " exceeds length of provided text sized " + text.length());
                    }
                    // the current token starts a new group: flush the previous one
                    if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct())) {
                        if (tokenGroup.getTotalScore() > 0) {
                            // debug output: offsets of the matching group
                            System.out.println(tokenGroup.matchStartOffset + " "
                                    + tokenGroup.matchEndOffset);
                            // keep the last token of the scoring group
                            result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
                        }
                        tokenGroup.clear();
                    }
                    tokenGroup.addToken(fragmentScorer.getTokenScore());
                }

                // flush the group that is still open at the end of the stream
                if (tokenGroup.numTokens > 0) {
                    if (tokenGroup.getTotalScore() > 0) {
                        System.out.println(tokenGroup.matchStartOffset + " "
                                + tokenGroup.matchEndOffset);
                        result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
                    }
                }

                return result;
            }
            finally {
                if (tokenStream != null) {
                    try {
                        tokenStream.close();
                    }
                    catch (Exception e) {
                        // ignore failures while closing the stream
                    }
                }
            }
        }
    }
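
To illustrate the last step mentioned in the post (turning hit offsets into a highlight file), here is a minimal sketch in plain Java. The names `HighlightFileWriter`, `Loc` and `pageStarts` are hypothetical and not part of Lucene or PDFBox. It assumes that the PDF text was extracted page by page and concatenated, that `pageStarts[i]` holds the offset of the first character of page i in that text, and that Acrobat counts characters on a page the same way the extractor did, which may well not hold in practice. The XML layout follows the format described in the two documents linked in the original post and should be checked against them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HighlightFileWriter {

    /** One highlighted span: zero-based page, character offset on that page, length. */
    public static final class Loc {
        public final int page;
        public final int pos;
        public final int len;

        public Loc(int page, int pos, int len) {
            this.page = page;
            this.pos = pos;
            this.len = len;
        }
    }

    /**
     * Maps a span given as offsets into the whole extracted text to a
     * page-relative span. Assumes pageStarts is sorted, pageStarts[0] == 0
     * and start >= 0. Spans crossing a page break are not split.
     */
    public static Loc fromGlobalOffsets(int start, int end, int[] pageStarts) {
        int idx = Arrays.binarySearch(pageStarts, start);
        // not found: binarySearch returns -(insertionPoint) - 1,
        // and the page containing start is insertionPoint - 1
        int page = idx >= 0 ? idx : -idx - 2;
        return new Loc(page, start - pageStarts[page], end - start);
    }

    /** Renders the spans in the character-based highlight file layout. */
    public static String toHighlightXml(List<Loc> locs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<XML>\n<Body units=characters version=2>\n<Highlight>\n");
        for (Loc l : locs) {
            sb.append("<loc pg=").append(l.page)
              .append(" pos=").append(l.pos)
              .append(" len=").append(l.len).append(">\n");
        }
        sb.append("</Highlight>\n</Body>\n</XML>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] pageStarts = {0, 100, 250};   // made-up page boundaries
        List<Loc> locs = new ArrayList<Loc>();
        locs.add(fromGlobalOffsets(120, 125, pageStarts));
        System.out.print(toHighlightXml(locs));
    }
}
```

In the setting of this thread, `start` and `end` would come from `startOffset()` and `endOffset()` of the tokens returned by the extractor class, and the resulting file would be served at the URL given to the viewer in the xml parameter.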



  • Dawn Zoë Raison at May 12, 2011 at 3:05 pm

    On 12/05/2011 15:47, Wulf Berschin wrote:
    > I think support for highlighting documents would be a very welcome
    > feature. Highlighting HTML documents is already possible with the
    > org.apache.solr.analysis.HTMLStripCharFilter and a NullFragmenter, but
    > there seems to be nothing for highlighting PDF files...

    It would be very useful. That said, being able to highlight words in any
    pictorial representation of a document would be a huge bonus.

    > As a starting point I extracted from
    > org.apache.lucene.search.highlight.Highlighter the class below, which
    > just returns the tokens contributing to the hit.

    I use a similar (home-brew) solution to extract the hit terms, and then
    pass them to the Adobe PDF viewer plugin as a search term via the PDF URL.
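
The URL-based approach described in this reply might look roughly like the sketch below. It relies on the `search` open parameter that Adobe documents for PDF URLs; the class and method names (`PdfSearchUrl`, `withSearchTerms`) are hypothetical, and percent-encoding the quotes and spaces is an assumption about what the viewer plugin accepts, so it should be tried against the viewers actually in use.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PdfSearchUrl {

    /** Appends a #search="term1 term2" fragment built from the distinct hit terms. */
    public static String withSearchTerms(String pdfUrl, List<String> hitTerms) {
        // drop duplicate terms but keep their first-seen order
        Set<String> distinct = new LinkedHashSet<String>(hitTerms);
        StringBuilder words = new StringBuilder();
        for (String term : distinct) {
            if (words.length() > 0) {
                words.append(' ');
            }
            words.append(term);
        }
        if (words.length() == 0) {
            return pdfUrl;   // nothing to search for
        }
        try {
            // URLEncoder writes spaces as '+'; use %20 instead (assumption, see above)
            String encoded = URLEncoder.encode("\"" + words + "\"", "UTF-8")
                    .replace("+", "%20");
            return pdfUrl + "#search=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withSearchTerms("http://example.org/docs/manual.pdf",
                Arrays.asList("lucene", "highlighter", "lucene")));
    }
}
```

Compared with a highlight file, this hands the viewer words rather than positions, so it presumably marks every occurrence of the terms in the PDF and not only the ones that contributed to the hit.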


    --

    Rgds.
    *Dawn Raison*
    Technical Director, Digitorial Ltd.

    E:dawn@digitorial.co.uk W:http://www.digitorial.co.uk
    M: 07956 609 618 T: 01428 729 431
    Reg: 04644583, England & Wales
    Church Villas Ecchinswell, Newbury, RG20 4TT


Discussion Overview
group: java-user
category: lucene
posted: May 10, '11 at 10:33a
active: May 12, '11 at 3:05p
posts: 3
users: 2
website: lucene.apache.org

2 users in discussion

Wulf Berschin: 2 posts Dawn Zoë Raison: 1 post
