FAQ
Hello,

I'm writing a highlighter by using term offsets info (yes, I borrowed
the idea
of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
when getting multi-valued field.

For example, if I indexed [" aaaa"," bbb "] (multi-valued), I got term info
bbb(7,10). This is expected result. But if I indexed [" aaa "," bbb "]
(note that using " aaa " instead of " aaaa"), I got term info bbb(6,9) which
is unexpected. I would like to get same offset info for bbb because they
are same length of field values.

Please use the following program to see the problem I'm seeing. I'm
using trunk:

public static void main(String[] args) throws Exception {
// create an index
Directory dir = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter( dir, analyzer, true,
MaxFieldLength.LIMITED );
Document doc = new Document();
doc.add( new Field( "f", " aaa ", Store.YES, Index.ANALYZED,
TermVector.WITH_OFFSETS ) );
//doc.add( new Field( "f", " aaaa", Store.YES, Index.ANALYZED,
TermVector.WITH_OFFSETS ) );
doc.add( new Field( "f", " bbb ", Store.YES, Index.ANALYZED,
TermVector.WITH_OFFSETS ) );
writer.addDocument( doc );
writer.close();

// print the offsets
IndexReader reader = IndexReader.open( dir );
TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(
0, "f" );
for( int i = 0; i < tpv.getTerms().length; i++ ){
System.out.print( "term = \"" + tpv.getTerms()[i] + "\"" );
TermVectorOffsetInfo[] tvois = tpv.getOffsets( i );
for( TermVectorOffsetInfo tvoi : tvois ){
System.out.println( "(" + tvoi.getStartOffset() + "," +
tvoi.getEndOffset() + ")" );
}
}
reader.close();
}

regards,

Koji


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Mark Miller at Jan 17, 2009 at 12:39 am
    Okay, Koji, hopefully I'll be more luckily suggesting this this time.

    Have you tried http://issues.apache.org/jira/browse/LUCENE-1448 yet? I am
    not sure if its in an applyable state, but I hope that covers your issue.
    On Fri, Jan 16, 2009 at 7:15 PM, Koji Sekiguchi wrote:

    Hello,

    I'm writing a highlighter by using term offsets info (yes, I borrowed
    the idea
    of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
    when getting multi-valued field.

    For example, if I indexed [" aaaa"," bbb "] (multi-valued), I got term info
    bbb(7,10). This is expected result. But if I indexed [" aaa "," bbb "]
    (note that using " aaa " instead of " aaaa"), I got term info bbb(6,9)
    which
    is unexpected. I would like to get same offset info for bbb because they
    are same length of field values.

    Please use the following program to see the problem I'm seeing. I'm
    using trunk:

    public static void main(String[] args) throws Exception {
    // create an index
    Directory dir = new RAMDirectory();
    Analyzer analyzer = new WhitespaceAnalyzer();
    IndexWriter writer = new IndexWriter( dir, analyzer, true,
    MaxFieldLength.LIMITED );
    Document doc = new Document();
    doc.add( new Field( "f", " aaa ", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
    //doc.add( new Field( "f", " aaaa", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
    doc.add( new Field( "f", " bbb ", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
    writer.addDocument( doc );
    writer.close();

    // print the offsets
    IndexReader reader = IndexReader.open( dir );
    TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(
    0, "f" );
    for( int i = 0; i < tpv.getTerms().length; i++ ){
    System.out.print( "term = \"" + tpv.getTerms()[i] + "\"" );
    TermVectorOffsetInfo[] tvois = tpv.getOffsets( i );
    for( TermVectorOffsetInfo tvoi : tvois ){
    System.out.println( "(" + tvoi.getStartOffset() + "," +
    tvoi.getEndOffset() + ")" );
    }
    }
    reader.close();
    }

    regards,

    Koji


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Koji Sekiguchi at Jan 17, 2009 at 1:14 am
    Mark,

    This is exactly what I want and It worked perfectly. Thanks!
    I'll post my highlighter to JIRA in a few days (hopegully).
    It uses term offsets with positions (WITH_POSITIONS_OFFSETS)
    to support PhraseQuery.

    Thanks again,

    Koji


    Mark Miller wrote:
    Okay, Koji, hopefully I'll be more luckily suggesting this this time.

    Have you tried http://issues.apache.org/jira/browse/LUCENE-1448 yet? I am
    not sure if its in an applyable state, but I hope that covers your issue.

    On Fri, Jan 16, 2009 at 7:15 PM, Koji Sekiguchi wrote:

    Hello,

    I'm writing a highlighter by using term offsets info (yes, I borrowed
    the idea
    of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
    when getting multi-valued field.

    For example, if I indexed [" aaaa"," bbb "] (multi-valued), I got term info
    bbb(7,10). This is expected result. But if I indexed [" aaa "," bbb "]
    (note that using " aaa " instead of " aaaa"), I got term info bbb(6,9)
    which
    is unexpected. I would like to get same offset info for bbb because they
    are same length of field values.

    Please use the following program to see the problem I'm seeing. I'm
    using trunk:

    public static void main(String[] args) throws Exception {
    // create an index
    Directory dir = new RAMDirectory();
    Analyzer analyzer = new WhitespaceAnalyzer();
    IndexWriter writer = new IndexWriter( dir, analyzer, true,
    MaxFieldLength.LIMITED );
    Document doc = new Document();
    doc.add( new Field( "f", " aaa ", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
    //doc.add( new Field( "f", " aaaa", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
    doc.add( new Field( "f", " bbb ", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
    writer.addDocument( doc );
    writer.close();

    // print the offsets
    IndexReader reader = IndexReader.open( dir );
    TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(
    0, "f" );
    for( int i = 0; i < tpv.getTerms().length; i++ ){
    System.out.print( "term = \"" + tpv.getTerms()[i] + "\"" );
    TermVectorOffsetInfo[] tvois = tpv.getOffsets( i );
    for( TermVectorOffsetInfo tvoi : tvois ){
    System.out.println( "(" + tvoi.getStartOffset() + "," +
    tvoi.getEndOffset() + ")" );
    }
    }
    reader.close();
    }

    regards,

    Koji


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJan 17, '09 at 12:15a
activeJan 17, '09 at 1:14a
posts3
users2
websitelucene.apache.org

2 users in discussion

Koji Sekiguchi: 2 posts Mark Miller: 1 post

People

Translate

site design / logo © 2022 Grokbase