Hello,

I think there's a bug in the ExtractingRequestHandler (Tika parser).
Some Tika exceptions are not caught, and the handler returns a status of 0,
indicating no problem with that content.

I had a look at the code (Solr 5.1, ExtractingDocumentLoader:221): only
TikaException is caught and rethrown as a SolrException.
The problem is still present in Solr 5.5.
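
To illustrate the pattern described above, here is a minimal, self-contained sketch. It is not the Solr or Tika code: ParseException, Parser, load, WRAPPING and SWALLOWING are all hypothetical stand-ins. It assumes the PDF library logs the low-level exception itself instead of rethrowing it, which the ERROR lines below seem to suggest since they are emitted by PDFBox/FontBox classes. Under that assumption, a handler that only maps one exception type to an error still reports status 0.

```java
import java.io.EOFException;
import java.io.IOException;

class ExtractStatusSketch {

    /** Hypothetical stand-in for TikaException (not the real class). */
    static class ParseException extends Exception {
        ParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical parser interface (not the Tika Parser API). */
    interface Parser {
        void parse(String content) throws ParseException;
    }

    /** Mimics the handler: only ParseException is turned into an error status. */
    static int load(Parser parser, String content) {
        try {
            parser.parse(content);
        } catch (ParseException e) {
            return 500; // reported to the client as a failure
        }
        return 0; // reported to the client as a success
    }

    /** Parser that wraps the low-level failure, so the handler sees it. */
    static final Parser WRAPPING = content -> {
        throw new ParseException("corrupt document", new EOFException("table hmtx"));
    };

    /** Parser that logs and swallows the failure (assumed library behaviour). */
    static final Parser SWALLOWING = content -> {
        try {
            throw new EOFException("table hmtx");
        } catch (IOException e) {
            System.err.println("ERROR - " + e); // logged only, never rethrown
        }
    };

    public static void main(String[] args) {
        System.out.println(load(WRAPPING, "doc"));   // prints 500
        System.out.println(load(SWALLOWING, "doc")); // prints 0
    }
}
```

In the second case the caller has no way to know the content was only partially extracted, which matches the status 0 seen in the logs below.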

Here are the two stack traces:

java.io.IOException:
ERROR - 2016-06-10 14:12:03.932; [ centreinffo]
org.apache.pdfbox.filter.FlateFilter; FlateFilter: stop reading
corrupt stream due to a DataFormatException
INFO - 2016-06-10 14:12:03.940; [ centreinffo]
org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo]
webapp=/solr path=/update/extract
params={fmap.content=contenuDocument&uprefix=tika_&literal.pk=document_Régionsetformation_280&wt=javabin&stream.file=/var/local/ci-services/documents/document_Régionsetformation_280&version=2}
{add=[document_Régionsetformation_280 (1536759351407017984)]} 0 74

and java.io.EOFException:
ERROR - 2016-06-10 14:10:49.246; [ centreinffo]
org.apache.fontbox.ttf.TrueTypeFont; An error occured when reading
table hmtx
java.io.EOFException
at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
at org.apache.fontbox.ttf.TrueTypeFont.getHorizontalMetrics(TrueTypeFont.java:204)
at org.apache.fontbox.ttf.TrueTypeFont.getAdvanceWidth(TrueTypeFont.java:346)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:677)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:221)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)

...
INFO - 2016-06-10 14:10:50.207; [ centreinffo]
org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo]
webapp=/solr path=/update/extract
params={fmap.content=contenuDocument&uprefix=tika_&literal.pk=document_Régionsetformation_600&wt=javabin&stream.file=/var/local/ci-services/documents/document_Régionsetformation_600&version=2}
{add=[document_Régionsetformation_600 (1536759274020012032)]} 0 2061
Regards,
Gilbert Boyreau

Discussion Overview
group: solr-user @
categories: lucene
posted: Jun 10, '16 at 4:20p
active: Jun 10, '16 at 4:20p
posts: 1
users: 1
website: lucene.apache.org...
