Hello,
I'm writing a highlighter by using term offsets info (yes, I borrowed
the idea
of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
when getting multi-valued field.
For example, if I indexed [" aaaa"," bbb "] (multi-valued), I got term info
bbb(7,10). This is expected result. But if I indexed [" aaa "," bbb "]
(note that using " aaa " instead of " aaaa"), I got term info bbb(6,9) which
is unexpected. I would like to get same offset info for bbb because they
are same length of field values.
Please use the following program to see the problem I'm seeing. I'm
using trunk:
public static void main(String[] args) throws Exception {
// create an index
Directory dir = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter( dir, analyzer, true,
MaxFieldLength.LIMITED );
Document doc = new Document();
doc.add( new Field( "f", " aaa ", Store.YES, Index.ANALYZED,
TermVector.WITH_OFFSETS ) );
//doc.add( new Field( "f", " aaaa", Store.YES, Index.ANALYZED,
TermVector.WITH_OFFSETS ) );
doc.add( new Field( "f", " bbb ", Store.YES, Index.ANALYZED,
TermVector.WITH_OFFSETS ) );
writer.addDocument( doc );
writer.close();
// print the offsets
IndexReader reader = IndexReader.open( dir );
TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(
0, "f" );
for( int i = 0; i < tpv.getTerms().length; i++ ){
System.out.print( "term = \"" + tpv.getTerms()[i] + "\"" );
TermVectorOffsetInfo[] tvois = tpv.getOffsets( i );
for( TermVectorOffsetInfo tvoi : tvois ){
System.out.println( "(" + tvoi.getStartOffset() + "," +
tvoi.getEndOffset() + ")" );
}
}
reader.close();
}
regards,
Koji
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org