Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually? I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler. It
occurs to me that it would be much easier if the two were separate, so I
could have direct control over Tika and just submit the text to Solr after
extraction. Am I going to regret this approach? I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.

Also, I was reading this Stack Overflow entry
<http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
and someone offhandedly mentioned that
ExtractingRequestHandler might be separated in the future anyway. Is there
a public roadmap for the project, or does one have to follow the
developers' mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin


  • Charlie Hull at Jun 10, 2016 at 8:22 am

    We tend to prefer running Tika externally as it's entirely possible that
    Tika will crash or hang with certain files - and that will bring down
    Solr if you're running Tika within it. Here's a Dropwizard wrapper
    around Tika that might be of use:
    https://github.com/mattflax/dropwizard-tika-server
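
A minimal sketch of the isolation idea above, assuming extraction runs as a separate command-line process. The helper name `extract_text` and the example command in the comment are illustrative assumptions, not something from this thread. A hang is cut off by the timeout and a crash only takes down the child process, so the caller (and Solr) keeps running.

```python
import subprocess


def extract_text(cmd, timeout_s=60.0):
    """Run an extraction command in a child process.

    Returns the child's stdout decoded as text, or None if the child
    timed out or exited with a non-zero status.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed if this expires
            check=True,         # non-zero exit raises CalledProcessError
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical usage (jar path and file name are assumptions):
# text = extract_text(["java", "-jar", "tika-app.jar", "--text", "report.pdf"])
```

A `None` result can be logged and the file skipped or retried, instead of blocking the whole indexing run.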

    Cheers

    Charlie

    --
    Charlie Hull
    Flax - Open Source Enterprise Search

    tel/fax: +44 (0)8700 118334
    mobile: +44 (0)7767 825828
    web: www.flax.co.uk
  • Erick Erickson at Jun 12, 2016 at 8:34 pm
    Two things. First, here's a sample bit of SolrJ code; pulling out
    the DB stuff should be straightforward:
    http://searchhub.org/2012/02/14/indexing-with-solrj/

    It's a little out of date, but not very much so. CloudSolrServer,
    mentioned in one of the comments, has been deprecated in
    favor of CloudSolrClient; similarly, StreamingUpdateSolrServer
    is now ConcurrentUpdateSolrClient.


    Second, since Solr 5.4 it has been possible to add parser-specific
    parameters through config; see SOLR-8166. I just added this to the
    6.x Ref Guide today, as it missed getting into the earlier Ref Guide
    releases.
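
For the first point, here is a language-neutral sketch of "extract externally, then submit the text to Solr". It uses Solr's JSON update endpoint over plain HTTP rather than SolrJ. The base URL, the collection name, and the field names `id` and `text` are assumptions that depend on your installation and schema, and `build_update_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_update_request(solr_base_url, collection, docs, commit=True):
    """Build a POST request that adds `docs` (a list of dicts) to a collection."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage, assuming a Solr instance is listening locally:
# req = build_update_request("http://localhost:8983/solr", "docs",
#                            [{"id": "report.pdf", "text": extracted_text}])
# urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the index.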

    Best,
    Erick
  • Allison, Timothy B. at Jun 13, 2016 at 2:22 pm
    > Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff
    > should be straightforward:
    > http://searchhub.org/2012/02/14/indexing-with-solrj/

    +1

    > We tend to prefer running Tika externally as it's entirely possible
    > that Tika will crash or hang with certain files - and that will bring
    > down Solr if you're running Tika within it.

    +1

    > I want to make a small modification
    > to Tika to get and save additional data from my PDFs

    What info do you need? If it is common enough, could you ask over on
    Tika's JIRA, and we'll try to add it directly?
  • Justin Lee at Jun 13, 2016 at 9:05 pm
    Thanks everyone for the help and advice. The SolrJ example makes sense to
    me. The import of SOLR-8166 was kind of mind-boggling to me, but maybe
    I'll revisit it after some time.

    Tim: for context, I'm ultimately trying to create an external highlighter.
    See https://issues.apache.org/jira/browse/SOLR-1397. I want to store the
    bounding box (in PDF units) for each token in the extracted text stream.
    Then when I get results from Solr using the above patch, I'll convert the
    UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate
    in the UI. I like this approach because I get highlighting that accurately
    reflects the search, even when the search is complex (e.g. wildcards or
    proximity searches).

    I think it would take quite a bit of thinking to get something general
    enough to add into Tika. For example, what units? Take a look at the
    discussion of what units to report offsets in here:
    https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert
    Muir, although it seems to me that whatever issues there are apply equally
    to the offsets reported by the Term Vector Component). As
    another example, I'm just not sure what format is general enough to make
    sense for everybody. I think I'll just create a mapping from UTF-16
    offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store
    that in a NoSQL store. Then, when I get Solr results, I'll look at the
    matching offsets, the JSON blob, and the original document and be on my
    merry way. I'm happy to open a JIRA entry in Tika if you think this is a
    coherent request.
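
A minimal sketch of the offset-to-box mapping described above. It assumes tokens arrive in reading order with their boxes and are joined by single spaces in the extracted text. The function names, the JSON layout, and the sample coordinates are all illustrative assumptions. Python strings count code points, so `utf16_len` converts to the UTF-16 code units that Java-side offsets use.

```python
import json


def utf16_len(text):
    """Length of text in UTF-16 code units (what Java's String.length() reports)."""
    return len(text.encode("utf-16-le")) // 2


def build_offset_map(tokens):
    """Serialize (text, (x1, y1, x2, y2)) tokens to a JSON blob of offset ranges.

    Offsets are half-open [start, end) ranges in UTF-16 code units, assuming
    one separating space between consecutive tokens.
    """
    entries = []
    pos = 0
    for text, box in tokens:
        end = pos + utf16_len(text)
        entries.append({"start": pos, "end": end, "box": list(box)})
        pos = end + 1  # skip the single separating space
    return json.dumps(entries)


def boxes_for_match(blob, start, end):
    """Return the boxes of all tokens overlapping the range [start, end)."""
    return [e["box"] for e in json.loads(blob)
            if e["start"] < end and start < e["end"]]


# Example: the middle token is outside the BMP, so it takes two UTF-16 units.
tokens = [("Solr", (10, 20, 40, 32)),
          ("\U0001D4B3", (45, 20, 52, 32)),
          ("Tika", (57, 20, 85, 32))]
blob = build_offset_map(tokens)
```

With the blob stored next to the document, a match reported at offsets 8 to 12 resolves to the box of the third token via `boxes_for_match(blob, 8, 12)`.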

    The other approach, I suppose, is to try to pass the information along
    during indexing and store it as a token payload. But it seems like the
    indexing interface is really text-oriented. I have also thought about
    using DelimitedPayloadTokenFilter, which I imagine will increase the index
    size (how much, though?) and require more customization of Solr
    internals. I don't know which is the better approach.
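
A rough sketch of what the payload route could look like on the input side. It assumes a whitespace tokenizer followed by DelimitedPayloadTokenFilter with its default `|` delimiter and an identity-style encoder that stores the payload text as bytes. The helper names and the comma-separated box format are assumptions. `payload_bytes` only gives the raw, uncompressed payload size, so the real index growth may differ.

```python
def box_payload(box):
    """Encode a bounding box as a compact comma-separated string."""
    return ",".join(str(v) for v in box)


def to_delimited_payloads(tokens, delimiter="|"):
    """Render (text, box) tokens as 'token|payload' pairs separated by spaces."""
    parts = []
    for text, box in tokens:
        # The token itself must not contain the delimiter or whitespace,
        # otherwise the tokenizer/filter would split it in the wrong place.
        if delimiter in text or any(ch.isspace() for ch in text):
            raise ValueError(f"token not safe for delimited payloads: {text!r}")
        parts.append(f"{text}{delimiter}{box_payload(box)}")
    return " ".join(parts)


def payload_bytes(tokens):
    """Total raw payload size in bytes, before any index compression."""
    return sum(len(box_payload(box).encode("utf-8")) for _, box in tokens)
```

For two tokens with boxes `(10, 20, 40, 32)` and `(57, 20, 85, 32)`, the field value would be `Solr|10,20,40,32 Tika|57,20,85,32`, which suggests the payload text can easily outweigh the tokens themselves.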

Discussion Overview
group: solr-user @
categories: lucene
posted: Jun 10, '16 at 1:20a
active: Jun 13, '16 at 9:05p
posts: 5
users: 4
website: lucene.apache.org...
