Thanks everyone for the help and advice. The SolrJ example makes sense to
me. The import of SOLR-8166 was kind of mind boggling to me, but maybe
I'll revisit after some time.
Tim: for context, I'm ultimately trying to create an external highlighter.
I want to store the
bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate
in the UI. I like this approach because I get highlighting that accurately
reflects the search, even when the search is complex (e.g. wildcards or
I think it would take quite a bit of thinking to get something general
enough to add into Tika. For example, what units? Take a look at the
discussion of what units to report offsets in here: https://issues.apache.org/jira/browse/SOLR-1954
(see the comments by Robert
Muir, although it seems to me that whatever issues exist here are the same
ones that affect the offsets reported by the Term Vector Component). As
another example, I'm just not sure what format is general enough to make
sense for everybody. I think I'll just create a mapping from UTF-16
offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store
that in a NoSQL store. Then, when I get Solr results, I'll look at the
matching offsets, the JSON blob, and the original document and be on my
merry way. I'm happy to open a JIRA entry in Tika if you think this is a
The other approach, I suppose, is to try to pass the information along
during indexing and store as a token payload. But it seems like the
indexing interface is really text oriented. I have also thought about
using DelimitedPayloadTokenFilter, which I imagine will increase the index
size (how much, though?) and require more customization of Solr
internals. I don't know which is the better approach.
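
A minimal Python sketch of the first approach (UTF-16 offsets mapped to
bounding boxes and dumped into a JSON blob) might look like the following.
Everything in it is an assumption made for illustration: the token list, the
single-space joiner between tokens, the field names "start", "end" and "box",
and the helper names are all hypothetical, and a real extraction from Tika
would produce a different text stream. The one point it tries to show is that
offsets are counted in UTF-16 code units, so a character outside the BMP
takes two units.

```python
import json


def utf16_len(text):
    """Length of text in UTF-16 code units (the unit Java strings use)."""
    return len(text.encode("utf-16-le")) // 2


def build_offset_map(tokens):
    """Build a list of {"start", "end", "box"} entries.

    tokens is a list of (text, (x1, y1, x2, y2)) pairs in reading order.
    Assumption: tokens are joined by exactly one space in the indexed text.
    """
    entries = []
    offset = 0
    for text, box in tokens:
        end = offset + utf16_len(text)
        entries.append({"start": offset, "end": end, "box": list(box)})
        offset = end + 1  # skip the assumed single-space separator
    return entries


def boxes_for_match(entries, start, end):
    """Return the boxes of all tokens overlapping the range [start, end)."""
    return [e["box"] for e in entries if e["start"] < end and e["end"] > start]


# Hypothetical tokens with made-up PDF coordinates.
tokens = [
    ("na\u00efve", (72.0, 700.0, 101.5, 712.0)),
    ("\U0001F600", (104.0, 700.0, 116.0, 712.0)),  # two UTF-16 code units
    ("search", (119.0, 700.0, 152.3, 712.0)),
]
entries = build_offset_map(tokens)
blob = json.dumps(entries)  # this string is what would go into the NoSQL store
```

At query time the UI would load the blob with json.loads and call
boxes_for_match with each start/end offset pair that comes back from Solr.
A linear scan is fine for a sketch; a sorted list with binary search would
be the obvious next step for long documents.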
On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. wrote:
Two things: Here's a sample bit of SolrJ code; pulling out the DB stuff
should be straightforward: http://searchhub.org/2012/02/14/indexing-with-solrj/
We tend to prefer running Tika externally as it's entirely possible
that Tika will crash or hang with certain files - and that will bring
down Solr if you're running Tika within it.
I want to make a small modification
to Tika to get and save additional data from my PDFs
What info do you need, and if it is common enough, could you ask over on
Tika's JIRA and we'll try to add it directly?