Hi,
I'm seeing the problem described in SOLR-42, "Highlighting problems with
HTMLStripWhitespaceTokenizerFactory":
https://issues.apache.org/jira/browse/SOLR-42
I'm indexing HTML documents, and am getting reams of "Mark invalid"
IOExceptions:
SEVERE: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(Unknown Source)
    at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
    at java.io.Reader.read(Unknown Source)
    at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
    at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
    at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
    at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
    at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)
This is using a ~1 week old version of Solr 1.3 from SVN.
One workaround mentioned in that Jira issue was to move HTML stripping
outside of Solr; can anyone suggest a better approach than that?
Thanks
James
[Solr-user] IOException: Mark invalid while analyzing HTML
Follow ups
- Dean Thompson: Was this one ever addressed? I'm seeing it in some small percentage of the documents that I index in 1.4-dev 708596M. I don't see a corresponding JIRA issue.
- Grant Ingersoll: About the only thing you can do here is increase the readAheadLimit on the BufferedReader. By the looks of it, that also means we need to modify the TokenStream factories that create the HTMLStripReader so that they accept some optional attributes. If you can open a JIRA issue for this, that would be great. -Grant
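
For anyone wondering what readAheadLimit has to do with the "Mark invalid" message, here is a minimal standalone sketch using only java.io. It does not touch Solr or HTMLStripReader; the class name MarkInvalidDemo, the method readPastMark, the sample text, and the tiny buffer size of 8 are all made up for illustration. It marks a BufferedReader, reads past the mark, and then calls reset(), once with a read-ahead limit that is too small and once with a limit that covers the read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidDemo {

    /**
     * Marks the start of a stream, reads up to charsToRead characters, then resets.
     * Returns "reset ok" on success, or the IOException message on failure.
     * All parameters are illustrative; none of this is Solr code.
     */
    public static String readPastMark(int bufferSize, int readAheadLimit, int charsToRead) {
        // Hypothetical sample input, long enough to outrun the small buffer.
        String text = "<html><body><p>some text &amp; more text here, padded out</p></body></html>";
        try (BufferedReader reader = new BufferedReader(new StringReader(text), bufferSize)) {
            // The mark is only guaranteed to survive readAheadLimit characters of reading.
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                if (reader.read() == -1) {
                    break; // end of input
                }
            }
            // Fails with "Mark invalid" if the mark was dropped during a buffer refill.
            reader.reset();
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Limit of 4 chars, but 20 chars are read before reset().
        System.out.println(readPastMark(8, 4, 20));   // prints "Mark invalid"
        // Limit of 32 chars comfortably covers the 20 chars read.
        System.out.println(readPastMark(8, 32, 20));  // prints "reset ok"
    }
}
```

Note that BufferedReader only drops the mark when it has to refill its internal buffer, so exceeding the limit does not always fail. That may be why only a fraction of documents trigger the exception, though that is a guess rather than something confirmed in this thread.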
Discussion Overview
| group | solr-user |
| categories | lucene |
| posted | May 4, '08 at 10:36p |
| active | Dec 6, '08 at 12:22p |
| posts | 3 |
| users | 3 |
| website | lucene.apache.org... |
