Hello,

I have an index with an 'actor' field; for each actor there exists a single field value entry, e.g.

stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition <movie_actors>

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
movie_actors:Miguel Bosé
movie_actors:Anna Lizaran (as Ana Lizaran)
movie_actors:Raquel Sanchís
movie_actors:Angelina Llongueras

I try to get the term offsets, e.g. for 'angelina', with

termPositionVector = (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors");
int iTermIndex = termPositionVector.indexOf("angelina");
TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);


I get one TermVectorOffsetInfo for the term - with offset numbers that are bigger than any
single field entry.
My first guess was that Lucene gives the offset numbers as if all values were concatenated,
which would be the single (virtual) string:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras

This fits in almost no situation, so my second guess was that Lucene adds a virtual delimiter
between the single field entries for the offset calculation. I added such a delimiter, so the result would be:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
(note the ' ' between the actor names)

..this also doesn't fit in every situation - there are too many delimiters now. So my next guess
was that Lucene doesn't add a delimiter in every situation, and I added one only when the last
character of an entry was an alphanumeric one:

StringBuilder strbAttContent = new StringBuilder();
for (String strAttValue : m_luceneDocument.getValues(strFieldName))
{
    strbAttContent.append(strAttValue);
    if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
        strbAttContent.append(' ');
}

which gives me the resulting (virtual) entry:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras

This fits for ~96% of all my queries... but it is still not 100% the way Lucene calculates the
offset values for fields with multiple value entries.


..maybe the problem is that there are special characters inside my database (e.g. the 'é' in
'Bosé') that my '\w' doesn't match. I have looked at this specific situation, but handling this
one character doesn't solve the problem.


How does Lucene calculate these offsets? I also searched the source code, but couldn't find the right place.


Thanks in advance!

Christian Reuschling





--
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-125
mailto:reuschling@dfki.de http://www.dfki.uni-kl.de/~reuschling/

------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
______________________________________________________________________________

  • Grant Ingersoll at Aug 17, 2007 at 12:54 am
    Hi Christian,

    Is there any way you can post a complete, self-contained example,
    preferably as a JUnit test? I think it would be useful to know more
    about how you are indexing (i.e. what Analyzer, etc.).
    The offsets should be taken from whatever is set on the Token during
    analysis. I, too, am trying to remember where in the code this takes
    place.
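
    For example, something like this (Lucene 2.x API) prints the offsets an
    Analyzer sets on its Tokens:

        TokenStream stream = new StandardAnalyzer().tokenStream("movie_actors",
                new StringReader("Miguel Bosé"));
        for (Token token = stream.next(); token != null; token = stream.next())
            System.out.println(token.termText() + " " + token.startOffset()
                    + "-" + token.endOffset());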

    Also, what version of Lucene are you using?

    -Grant
    On Aug 16, 2007, at 5:50 AM, duiduder@web.de wrote:

    [...]

    --------------------------
    Grant Ingersoll
    http://lucene.grantingersoll.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ



  • Duiduder at Aug 17, 2007 at 4:45 pm
    Hello community, dear Grant

    I have built a JUnit test case that illustrates the problem - there, I try to cut
    out the right substring with the offset values given by Lucene - and fail :(

    A few remarks:

    In this example, the 'é' from 'Bosé' means that the '\w' pattern doesn't match -
    it is recognized as a delimiter sign, unlike in StandardAnalyzer.

    Analysis: It seems that Lucene calculates the offset values by adding a virtual
    delimiter between the field values, but forgets the last characters of a field
    value when these are analyzer-specific delimiter characters. (I assume this
    because of DocumentWriter, line 245: 'if (lastToken != null) offset +=
    lastToken.endOffset() + 1;') With this line of code, only the end offset of the
    last token is considered - potential trimmed delimiter chars are forgotten.
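
    In sketch form, that loop seems to do something like this for a multi-valued
    field (names are illustrative; storeOffsets() stands for the term vector
    bookkeeping):

        int offset = 0;  // running base offset for the field
        for (String value : fieldValues)
        {
            TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(value));
            Token token, lastToken = null;
            while ((token = stream.next()) != null)
            {
                // stored offsets are relative to this value, shifted by the base
                storeOffsets(token, offset + token.startOffset(), offset + token.endOffset());
                lastToken = token;
            }
            // line 245: only the last *token* end counts, so delimiter chars that
            // the analyzer trimmed behind it are lost from the base offset
            if (lastToken != null)
                offset += lastToken.endOffset() + 1;
        }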

    Thus, a solution would be:
    1. Add a single delimiter char between the field values
    2. Subtract from the Lucene offset the count of analyzer-specific delimiters
    that are at the end of all field values before the match

    For this, someone needs to know what a delimiter for a specific analyzer is.

    The other possibility, of course, is to change the behaviour inside Lucene, because
    the current offset values are more or less useless / hard to use (I currently have
    no idea how to get the analyzer-specific delimiter chars).

    For me, this looks like a bug - am I wrong?

    Any ideas/hints/remarks? I would be very happy about them :)

    Greetings

    Christian



    Grant Ingersoll schrieb:
    [...]
  • Grant Ingersoll at Aug 17, 2007 at 9:19 pm
    What version of Lucene are you using?

    On Aug 17, 2007, at 12:44 PM, duiduder@web.de wrote:

    [...]

    package org.dynaq.index;


    import static org.junit.Assert.assertTrue;

    import java.io.IOException;
    import java.util.LinkedList;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.store.RAMDirectory;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;


    public class TokenizerTest
    {

        IndexReader m_indexReader;

        Analyzer m_analyzer = new StandardAnalyzer();


        @Before
        public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
        {
            Directory ramDirectory = new RAMDirectory();

            // we create a first little set of actor names
            LinkedList<String> llActorNames = new LinkedList<String>();
            llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
            llActorNames.add("Miguel Bosé");
            llActorNames.add("Anna Lizaran (as Ana Lizaran)");
            llActorNames.add("Raquel Sanchís");
            llActorNames.add("Angelina Llongueras");

            // store them into a single document with multiple values for one field
            Document testDoc = new Document();
            for (String strActorsName : llActorNames)
            {
                Field testEntry = new Field("movie_actors", strActorsName, Field.Store.YES,
                        Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);

                testDoc.add(testEntry);
            }

            // now we write it into the index
            IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);

            indexWriter.addDocument(testDoc);
            indexWriter.close();

            m_indexReader = IndexReader.open(ramDirectory);
        }


        @Test
        public void checkOffsetValues() throws ParseException, IOException
        {
            // first, we search for 'angelina'
            String strSearchTerm = "Angelina";

            Searcher searcher = new IndexSearcher(m_indexReader);

            QueryParser parser = new QueryParser("movie_actors", m_analyzer);
            Query query = parser.parse(strSearchTerm);

            Hits hits = searcher.search(query);
            Document resultDoc = hits.doc(0);

            String[] straValues = resultDoc.getValues("movie_actors");

            // now we get the field values and build a single string value out of them
            StringBuilder strbSimplyConcatenated = new StringBuilder();
            StringBuilder strbWithDelimiters = new StringBuilder();
            StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
            for (String strActorName : straValues)
            {
                // first situation: we simply concatenate all field value entries
                strbSimplyConcatenated.append(strActorName);
                // second: we add a single delimiter char between the field values
                strbWithDelimiters.append(strActorName).append('$');
                // third try: we add a single delimiter, but only if the last char of
                // the actor before was an alphanumeric char
                strbWithDelimitersAfterAlphaNumChar.append(strActorName);
                String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
                if (strLastChar.matches("\\w"))
                    strbWithDelimitersAfterAlphaNumChar.append('$');
            }

            // this is the offset value from Lucene. This should be the place of
            // 'angelina' in one of the concatenated value strings above
            TermPositionVector termPositionVector =
                    (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
            int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
            TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
            int iStartOffset = termOffsets[0].getStartOffset();
            int iEndOffset = termOffsets[0].getEndOffset();

            // we create the substrings according to the offset value given from Lucene
            String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
            String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
            String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);

            System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
            System.out.println("simply concatenated:");
            System.out.println(strbSimplyConcatenated);
            System.out.println("SubString for offset: '" + strSubString1 + "'");
            System.out.println();
            System.out.println("with delimiters:");
            System.out.println(strbWithDelimiters);
            System.out.println("SubString for offset: '" + strSubString2 + "'");
            System.out.println();
            System.out.println("with delimiter after alphanum character:");
            System.out.println(strbWithDelimitersAfterAlphaNumChar);
            System.out.println("SubString for offset: '" + strSubString3 + "'");


            // is the offset value correct for one of the concatenated strings?

            // this fails for all situations
            assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm)
                    || strSubString3.equals(strSearchTerm));

            /*
             * Comments: In this example, the 'é' from 'Bosé' means that the '\w'
             * pattern doesn't match - it is recognized as a delimiter sign, unlike
             * in StandardAnalyzer.
             *
             * Analysis: It seems that Lucene calculates the offset values by adding
             * a virtual delimiter between the field values, but forgets the last
             * characters of a field value when these are analyzer-specific delimiter
             * characters. (I assume this because of DocumentWriter, line 245:
             * 'if (lastToken != null) offset += lastToken.endOffset() + 1;')
             * With this line of code, only the end offset of the last token is
             * considered - potential trimmed delimiter chars are forgotten.
             *
             * Thus, a solution would be:
             * 1. Add a single delimiter char between the field values
             * 2. Subtract from the Lucene offset the count of analyzer-specific
             *    delimiters at the end of all field values before the match
             *
             * For this, someone needs to know what a delimiter for a specific
             * analyzer is.
             */
        }


        @After
        public void closeIndex() throws CorruptIndexException, IOException
        {
            m_indexReader.close();
        }
    }
    ------------------------------------------------------
    Grant Ingersoll
    http://www.grantingersoll.com/
    http://lucene.grantingersoll.com
    http://www.paperoftheweek.com/
  • Duiduder at Aug 20, 2007 at 7:01 am
    It is the 2.2.0 release.




    Grant Ingersoll schrieb:
    What version of Lucene are you using?


    [...]
  • Duiduder at Aug 20, 2007 at 2:18 pm
    Hello Grant, dear community

    I have written some lines of code to adapt the offset values from Lucene to the
    positions where the terms really appear in the concatenated field value entries.

    My tests are successful :)

    There are two additional methods inside the test:
    1. calculateLuceneOffsetDiffs(String strFieldName, Document document2highlight, Analyzer analyzer)
    2. adaptLuceneOffset(int luceneOffset, LinkedHashMap<Integer, Integer> hsEndOffset2EndDelimiterCount)

    For calculating, I use the analyzer for the specific field, and concatenate the
    values with a single (whitespace) delimiter in between.

    It really looks like Lucene forgets possibly trimmed delimiter chars at the end of
    field values when it calculates the offsets. The delimiter chars are
    analyzer-specific - thus I need the analyzer to calculate the corrections.

    The Lucene offsets are too low because these characters are forgotten - so I
    calculate the count of trimmed delimiter chars before a given Lucene offset, and
    add it.
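
    A minimal sketch of the two methods (not the exact code from my test, and the
    trimming logic is simplified):

        // for each field value, find the end offset of its last token (using the
        // field's analyzer) and count the chars Lucene trimmed behind it.
        // Key: the running Lucene offset at the value boundary; value: the
        // cumulative count of trimmed delimiter chars up to there.
        LinkedHashMap<Integer, Integer> calculateLuceneOffsetDiffs(String strFieldName,
                Document document2highlight, Analyzer analyzer) throws IOException
        {
            LinkedHashMap<Integer, Integer> hsEndOffset2EndDelimiterCount =
                    new LinkedHashMap<Integer, Integer>();
            int iLuceneOffset = 0;
            int iTrimmedDelimiterCount = 0;
            for (String strValue : document2highlight.getValues(strFieldName))
            {
                TokenStream stream = analyzer.tokenStream(strFieldName, new StringReader(strValue));
                Token token, lastToken = null;
                while ((token = stream.next()) != null)
                    lastToken = token;
                if (lastToken == null) continue;
                // chars behind the last token end were trimmed by the analyzer
                iTrimmedDelimiterCount += strValue.length() - lastToken.endOffset();
                // this is how far Lucene advances its base offset (see line 245)
                iLuceneOffset += lastToken.endOffset() + 1;
                hsEndOffset2EndDelimiterCount.put(iLuceneOffset, iTrimmedDelimiterCount);
            }
            return hsEndOffset2EndDelimiterCount;
        }

        // shift a Lucene offset by the delimiter count of the last value boundary
        // before it, giving the position in the values concatenated with a single
        // whitespace delimiter
        int adaptLuceneOffset(int luceneOffset,
                LinkedHashMap<Integer, Integer> hsEndOffset2EndDelimiterCount)
        {
            int iCorrection = 0;
            for (Map.Entry<Integer, Integer> entry : hsEndOffset2EndDelimiterCount.entrySet())
            {
                if (entry.getKey() > luceneOffset) break;
                iCorrection = entry.getValue();
            }
            return luceneOffset + iCorrection;
        }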

    Hope this helps - maybe this behaviour is already known? I haven't found comments
    on it, but I know that the original highlighter from Lucene-contrib doesn't deal
    with multiple field entries (at least a version from last year) - maybe this
    behaviour was the reason.

    greetings

    Christian





    Grant Ingersoll schrieb:
    [...]
