Hi all,

I'm new to Lucene and have a question about indexing/highlighting of HTML
files with Lucene.

What I need to do is highlight the hits (terms) in the original HTML file
(or get the positions of the terms/tokens in the original file).
This problem was already described by Fred Toth in this thread from 2005
("Preserving original HTML file offsets for highlighting, need
HTMLTokenizer?"):

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3C6.2.1.2.2.20050530134630.063ae978@fast.synernet.com%3E

I've searched the mailing list archives hoping for an answer, but I had no
luck.

Does anyone know whether there is a solution for this problem? If you
know that it is not possible with Lucene to highlight the hits in the
original HTML file, that would also be helpful to know (I could stop
looking for it...).

Many thanks in advance!
Karo

P.S. I actually wanted to reply to the original thread/question from 2005. Is
there a way to do this? How can I post an answer to an old thread/mail from
the mailing list?


  • Uwe Schindler at Jan 24, 2011 at 1:47 pm
    You can use HTMLStripCharFilter, which is plugged into the chain before the
    Tokenizer. It strips all HTML but preserves the tokens' offsets into the
    original text, so you can later highlight using those offsets.

    This filter is currently only released through Apache Solr, but in Lucene
    4.0 it's part of the analysis module.

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
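
To make the idea of "strip the HTML but keep the original offsets" concrete, here is a rough, dependency-free sketch. It is a toy illustration only and not how Solr's HTMLStripCharFilter is implemented: the class and method names are made up, and it ignores entities, comments and scripts. It simply records, for every character that survives stripping, where that character sat in the original HTML.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT Solr's HTMLStripCharFilter. Class and method
// names here are hypothetical.
class OffsetPreservingStrip {

    /** A token together with its [start, end) offsets in the ORIGINAL HTML. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Token> tokenize(String html) {
        StringBuilder stripped = new StringBuilder();
        // origOffset.get(i) is the index in html of the i-th stripped character.
        List<Integer> origOffset = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!inTag && c == '<') {
                inTag = true;
            } else if (inTag && c == '>') {
                inTag = false;
                // Replace the whole tag with one space so adjacent words stay apart.
                stripped.append(' ');
                origOffset.add(i);
            } else if (!inTag) {
                stripped.append(c);
                origOffset.add(i);
            }
        }

        // Split the stripped text into runs of letters/digits and map each
        // run back to its position in the original HTML.
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < stripped.length()) {
            if (!Character.isLetterOrDigit(stripped.charAt(i))) {
                i++;
                continue;
            }
            int s = i;
            while (i < stripped.length() && Character.isLetterOrDigit(stripped.charAt(i))) {
                i++;
            }
            int origStart = origOffset.get(s);
            int origEnd = origOffset.get(i - 1) + 1;
            tokens.add(new Token(stripped.substring(s, i), origStart, origEnd));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        for (Token t : tokenize(html)) {
            // The offsets index into the original string, tags included.
            System.out.println(t.text + " -> " + html.substring(t.start, t.end));
        }
    }
}
```

Because the offsets refer to the original string, html.substring(start, end) gives back exactly the token text even though the tags were never removed from the file.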
  • Karolina Bernat at Jan 25, 2011 at 12:46 pm
    Hi Uwe,

    thanks for this hint. I'm not sure how much of the Solr functionality I
    need to implement in order to use the HTMLStripCharFilter. I'm using
    Apache Tika for HTML parsing. Furthermore, I use the StandardAnalyzer to
    initialize my IndexWriter. I don't use a Tokenizer. Would that be the Solr
    approach?

    At this point, I'm not sure how to use Solr within my application,
    where I already use Lucene. Can I use, e.g., just this one class or a few
    classes from the Solr Core while indexing with the Lucene IndexWriter? Or
    do I need to switch my indexing and searching to the Solr way just to get
    what I need (highlighting of the hits within HTML files)?

    Thank you so much for your help:-)
    Karo


  • Uwe Schindler at Jan 25, 2011 at 1:16 pm
    Hi Karolina,

    For this, no Solr is needed at all. The CharFilter simply lives outside
    Lucene, but you can use it without anything else from Solr. You can copy
    the Java file from Solr's source, choose another package name, and you are
    finished.

    About Tokenizer and Analyzer: StandardAnalyzer combines Tokenizers and
    TokenFilters (and possibly CharFilters). It is just an easy-to-use class
    that serves as a factory for TokenStreams (TokenStream is the superclass
    of Tokenizer). If you want your own analysis, you have to implement an
    Analyzer class (possibly using the StandardAnalyzer source code as a
    basis) and add the needed filters (here, HTMLStripCharFilter) to the
    factory method.

    You may read the analysis package's javadocs for information on how to do
    this. Note: HTMLStripCharFilter does not need Tika at all.

    Uwe
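
The "Analyzer as a factory for a chain" idea can be modelled in a few lines of plain Java. This is only a toy model under stated assumptions: ToyAnalyzer and its three stages are hypothetical names, not Lucene classes, and unlike the real CharFilter this sketch makes no attempt to preserve offsets. It only shows the order of the stages: character filter first, then tokenizer, then token filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Toy model of an analysis chain. These are NOT Lucene classes; all names
// are hypothetical and exist only to illustrate the ordering of the stages.
class ToyAnalyzer {
    private final UnaryOperator<String> charFilter;         // runs on raw text
    private final Function<String, List<String>> tokenizer; // text -> tokens
    private final UnaryOperator<List<String>> tokenFilter;  // tokens -> tokens

    ToyAnalyzer(UnaryOperator<String> charFilter,
                Function<String, List<String>> tokenizer,
                UnaryOperator<List<String>> tokenFilter) {
        this.charFilter = charFilter;
        this.tokenizer = tokenizer;
        this.tokenFilter = tokenFilter;
    }

    /** Runs the stages in order: char filter, tokenizer, token filter. */
    List<String> analyze(String text) {
        return tokenFilter.apply(tokenizer.apply(charFilter.apply(text)));
    }

    /** Factory method wiring an HTML-stripping stage in front of the tokenizer. */
    static ToyAnalyzer htmlAware() {
        // Crude tag removal; each tag becomes a space.
        UnaryOperator<String> stripTags = s -> s.replaceAll("<[^>]*>", " ");
        // Split on anything that is not a letter or digit, dropping empty pieces.
        Function<String, List<String>> splitOnNonWord = s -> {
            List<String> out = new ArrayList<>();
            for (String t : s.split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
            return out;
        };
        // Lowercase every token.
        UnaryOperator<List<String>> lowerCase = tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
        return new ToyAnalyzer(stripTags, splitOnNonWord, lowerCase);
    }

    public static void main(String[] args) {
        System.out.println(htmlAware().analyze("<p>Hello <b>World</b></p>"));
    }
}
```

The point of the factory method is that callers never assemble the stages themselves; swapping in a different character filter only touches htmlAware().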

  • Karolina Bernat at Jan 26, 2011 at 9:53 am
    Hi Uwe,

    thank you so much for your help, it worked like a dream! :-)

    I made a custom analyzer class that extends StandardAnalyzer.
    Then I needed to override the tokenStream method like this:

        // Needs java.io.Reader, Lucene's TokenStream, CharStream and CharReader,
        // plus the HTMLStripCharFilter class copied from Solr's source.
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Wrap the Reader in a CharStream, which HTMLStripCharFilter takes as input.
            CharStream chStream = CharReader.get(reader);
            HTMLStripCharFilter filter = new HTMLStripCharFilter(chStream);
            return super.tokenStream(fieldName, filter);
        }

    In the constructor I called the super constructor.
    That worked really well, and it was the only place I needed to make
    changes.

    Thanks once again!

    Viele Grüße aus Hamburg,
    Karo
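
Once the hits come back with offsets into the original file, marking them up is mostly string insertion. The following is a minimal sketch of that last step, assuming the offsets are already known, refer to the original HTML, and do not overlap. OriginalHtmlHighlighter and its highlight method are hypothetical names, not part of Lucene's highlighter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene: wraps known [start, end) spans of
// the original HTML in the given open/close markup.
class OriginalHtmlHighlighter {

    static String highlight(String html, List<int[]> spans, String open, String close) {
        // Work from the last span to the first so that earlier offsets stay
        // valid while text is being inserted. Spans are assumed not to overlap.
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((int[] s) -> s[0]).reversed());

        StringBuilder sb = new StringBuilder(html);
        for (int[] s : sorted) {
            sb.insert(s[1], close); // close first, so the start index is unaffected
            sb.insert(s[0], open);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        List<int[]> hits = new ArrayList<>();
        hits.add(new int[] {3, 8});   // "Hello"
        hits.add(new int[] {12, 17}); // "world"
        System.out.println(highlight(html, hits, "<mark>", "</mark>"));
    }
}
```

Inserting from the back avoids having to shift the remaining offsets after every insertion.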


Discussion Overview
group: java-user
categories: lucene
posted: Jan 24, '11 at 1:34p
active: Jan 26, '11 at 9:53a
posts: 5
users: 2
website: lucene.apache.org
