Hi,

If I use Tika to parse HTML and feed the extracted String to a Lucene
analyzer, what happens to the offset information needed for KWIC and for
getting back to the original text (like the Google cache view)? How can I
keep track of the offsets between the Tika parser and the Lucene analyzer?

What are the solutions or ideas for building a kind of Google cache view
with Tika and the Lucene analyzer API?

With the provided API I can't keep the original content as the cache; I
have to cache the Tika output instead, which results in a degraded cache
view. I haven't looked too closely at Tika, but maybe there is a way with
SAX Locators? For example, build an associative array mapping offsets in
the Tika-parsed string to the actual offsets, and use a kind of token
filter to correct the OffsetAttribute?
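
The associative-array idea can be sketched with plain JDK classes. This is only an illustration under stated assumptions: `OffsetMapSketch` and its methods are hypothetical names, the tag stripper is deliberately naive (it is not Tika), and characters inside a chunk are assumed to map one-to-one to the original (no entity decoding or whitespace normalization).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: maps offsets in extracted text back to offsets in the original markup. */
class OffsetMapSketch {
    // key: start of a chunk in the extracted text; value: start of that chunk in the original
    private final TreeMap<Integer, Integer> chunkStarts = new TreeMap<>();
    private final StringBuilder extracted = new StringBuilder();

    /** Appends original[from, to) to the extracted text and records where it came from. */
    void appendChunk(String original, int from, int to) {
        if (from >= to) {
            return; // nothing to copy
        }
        chunkStarts.put(extracted.length(), from);
        extracted.append(original, from, to);
    }

    String extractedText() {
        return extracted.toString();
    }

    /** Translates an offset in the extracted text into an offset in the original. */
    int toOriginal(int extractedOffset) {
        Map.Entry<Integer, Integer> chunk = chunkStarts.floorEntry(extractedOffset);
        if (chunk == null) {
            throw new IllegalArgumentException("no chunk recorded at or before " + extractedOffset);
        }
        return chunk.getValue() + (extractedOffset - chunk.getKey());
    }

    /** Naive tag stripper, used only to feed the map; real HTML needs a real parser. */
    static OffsetMapSketch stripTags(String html) {
        OffsetMapSketch map = new OffsetMapSketch();
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf('<', pos);
            if (open < 0) {
                map.appendChunk(html, pos, html.length());
                break;
            }
            map.appendChunk(html, pos, open);
            int close = html.indexOf('>', open);
            if (close < 0) {
                break; // unterminated tag: stop here
            }
            pos = close + 1;
        }
        return map;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        OffsetMapSketch map = stripTags(html);
        int start = map.extractedText().indexOf("world");
        int originalStart = map.toOriginal(start);
        System.out.println(html.substring(originalStart, originalStart + "world".length()));
    }
}
```

A token filter could apply `toOriginal` to each token's start offset. For an exclusive end offset, mapping `end - 1` and adding 1 avoids jumping to the next chunk when the token ends exactly at a chunk boundary.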

--
David Causse
Spotter
http://www.spotter.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Jukka Zitting at Sep 3, 2009 at 1:08 pm
    Hi,

    On Wed, Sep 2, 2009 at 2:40 PM, David Causse wrote:
    > If I use tika for parsing HTML code and inject parsed String to a lucene
    > analyzer. What about the offset information for KWIC and return to text
    > (like the google cache view)? how can I keep track of the offsets
    > between tika parser and lucene analyzer?

    Currently Tika doesn't expose that information, but the Tika Parser API
    was designed with such use in mind, so it will be possible to add the
    offset information. Please file a Tika feature request [1] for this.

    [1] https://issues.apache.org/jira/browse/TIKA

    BR,

    Jukka Zitting

  • Uwe Schindler at Sep 3, 2009 at 1:27 pm
    Another good solution for Lucene (from 2.9 on) would be to create a
    special Tika analyzer that adds Tika-parseable content and metadata
    directly to the TokenStream as Attributes (using the new API), or only
    text and offset data (with the old Lucene TokenStream API).

    I wrote something similar for XML files that added the current XML element
    path as an additional token Attribute. It also set the SAX parser's current
    position as the offset. This attribute could then be used later to derive
    additional indexing settings (in our case, the field name to index into).
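
The XML approach described above might look roughly like the following sketch. It is not the code referred to in the message: it uses only the JDK's SAX classes, and `PathTrackingHandler` and `PathToken` are hypothetical names standing in for a Lucene TokenStream with a custom Attribute. The position comes from the SAX `Locator`, which reports a line and column near the end of the current event rather than an exact character offset for each word.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: a SAX handler that tags each word with its element path and parser position. */
class PathTrackingHandler extends DefaultHandler {

    /** Hypothetical stand-in for a token with extra attributes (not a Lucene class). */
    static final class PathToken {
        final String text;
        final String path;
        final int line;
        final int column;

        PathToken(String text, String path, int line, int column) {
            this.text = text;
            this.path = path;
            this.line = line;
            this.column = column;
        }
    }

    private final Deque<String> openElements = new ArrayDeque<>();
    private final List<PathToken> tokens = new ArrayList<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.addLast(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text run in several calls, so a word could be
        // split across chunks; this sketch ignores that case.
        String chunk = new String(ch, start, length).trim();
        if (chunk.isEmpty()) {
            return;
        }
        String path = "/" + String.join("/", openElements);
        int line = locator == null ? -1 : locator.getLineNumber();
        int column = locator == null ? -1 : locator.getColumnNumber();
        for (String word : chunk.split("\\s+")) {
            tokens.add(new PathToken(word, path, line, column));
        }
    }

    /** Parses the XML and returns one token per whitespace-separated word. */
    static List<PathToken> parse(String xml) {
        PathTrackingHandler handler = new PathTrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.tokens;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Offsets in Tika</title><body>Lucene analyzers</body></doc>";
        for (PathToken t : parse(xml)) {
            System.out.println(t.path + " " + t.text + " @" + t.line + ":" + t.column);
        }
    }
}
```

The `path` field plays the role of the element-path attribute, which an indexing step could use to pick the field name for each token.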

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
  • David Causse at Sep 4, 2009 at 8:30 am

    On Thu, Sep 03, 2009 at 03:07:18PM +0200, Jukka Zitting wrote:
    > Currently Tika doesn't expose that information but the Tika Parser API
    > was designed with such use in mind, so it will be possible to add the
    > offset information. Please file a Tika feature request [1] for this.

    I created TIKA-272. The idea behind it is to be able to use unmodified
    Lucene analyzers with Tika while keeping the offsets correct.

    Thank you.

    --
    David Causse
    Spotter
    http://www.spotter.com/

  • Grant Ingersoll at Sep 3, 2009 at 1:52 pm

    On Sep 2, 2009, at 5:40 AM, David Causse wrote:

    > With the provided API I can't keep the original content as a cache, I
    > need to cache the tika output and result in degraded cache view. I
    > didn't look too closely at tika but there is maybe a way with SAX
    > Locators? Build an associative array of tika parsed string offsets vs
    > actual offsets and use a sort of token filter to rectify
    > OffsetAttribute?

    Hmm, maybe you could implement a ContentHandler for Tika that, instead
    of creating a String for the Document, creates a TokenStream. Then you
    can have it add the offsets as payloads, so that you have those offsets
    later when rendering your view.
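
The payload half of this suggestion can be sketched without any Lucene classes, since a payload is essentially a small byte array stored with a token. In this sketch `OffsetPayloadCodec` and its methods are hypothetical names, the offsets are assumed to be character positions in the original document, and the ContentHandler that would produce them is not shown.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec: packs a token's original start/end offsets into a byte[] payload. */
class OffsetPayloadCodec {

    /** Encodes the two offsets as two big-endian ints (8 bytes in total). */
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(2 * Integer.BYTES)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    /** Decodes a payload produced by encode() into {startOffset, endOffset}. */
    static int[] decode(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        return new int[] { buffer.getInt(), buffer.getInt() };
    }

    /** Wraps original[start, end) in markers, as a cache view might highlight a hit. */
    static String highlight(String original, byte[] payload) {
        int[] span = decode(payload);
        return original.substring(0, span[0])
                + "[[" + original.substring(span[0], span[1]) + "]]"
                + original.substring(span[1]);
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        byte[] payload = encode(12, 17); // "world" occupies [12, 17) in the markup
        System.out.println(highlight(html, payload));
    }
}
```

At rendering time the stored payload is decoded and applied to the cached original markup, so the view does not depend on the offsets the analyzer saw in the extracted text.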

    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com/

    Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
    using Solr/Lucene:
    http://www.lucidimagination.com/search



Discussion Overview
group: java-user
category: lucene
posted: Sep 2, '09 at 12:40p
active: Sep 4, '09 at 8:30a
posts: 5
users: 4
website: lucene.apache.org
