FAQ
I have browsed many suggestions on how to implement 'search within a
sentence', but all seem to have drawbacks. For example, from
http://lucene.472066.n3.nabble.com/Issue-with-sentence-specific-search-td1644352.html#a1645072

Steve Rowe writes:

----------
One common technique, instead of using a larger-than-normal position
increment gap between sentences, is using a sentence boundary token like '$'
or something else that won't ever itself be the target of search. Quoting
from a post Mark Miller made to the lucene-user list last year
<http://www.lucidimagination.com/search/document/c9641cbb1a3bf928/multiline_regex_with_lucene>:
First you inject special marker tokens as your paragraph/
sentence markers, then you use a SpanNotQuery that looks
for a SpanNearQuery that doesn't intersect with a
SpanTermQuery containing the special marker term.

Mark's suggestion would work for your within-sentence case, and for the case
where you don't care about sentence boundaries, you can use SpanNearQuery
without the SpanNotQuery.
----------
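
A minimal sketch of the construction Steve describes (assuming Lucene 3.x, a
hypothetical field name "body", and an analyzer that injects the '$' marker
token at sentence boundaries):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "apache" near "lucene" (slop of 10 is arbitrary), but only when no '$'
    // marker falls inside the matched span, i.e. both terms in one sentence.
    SpanQuery t1 = new SpanTermQuery(new Term("body", "apache"));
    SpanQuery t2 = new SpanTermQuery(new Term("body", "lucene"));
    SpanQuery near = new SpanNearQuery(new SpanQuery[] {t1, t2}, 10, false);
    SpanQuery marker = new SpanTermQuery(new Term("body", "$"));
    SpanQuery withinSentence = new SpanNotQuery(near, marker);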

The problem with the last part of Steve's suggestion is that the
SpanNearQuery would have to have a slop of 1 in order to accommodate the
marker token between sentences. This could result in incorrect matches if a
slop of 0 is intended. Another suggestion was to overlap the marker token
with the first or last token of the sentence, but the SpanNotQuery would
always exclude any terms in the query that are at the intersection. Mark
Miller's 'SpanWithinQuery' patch seems to have the same issue.

Has anyone implemented a solution that works for both in-sentence and across
sentence boundaries?

Thanks,
Peter


  • Darren at Jul 20, 2011 at 3:33 pm
    I just parse the text into sentences and put those in a multi-valued field
    and then search that.
    On Wed, 20 Jul 2011 11:27:38 -0400, Peter Keegan wrote:
    [snip]
  • Peter Keegan at Jul 20, 2011 at 4:28 pm
    It seems to me that to constrain the search to a sentence this way, you'd
    have to override 'getPositionIncrementGap', which would then break phrase
    searches across the field values (sentences).

    Peter
    On Wed, Jul 20, 2011 at 11:33 AM, wrote:


    [snip]
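
    A minimal sketch of the gap mechanics Peter is referring to (Lucene 3.x;
    the delegate analyzer and the gap value of 100 are arbitrary choices, not
    from this thread). getPositionIncrementGap controls the position gap
    between the values of a multi-valued field, so any gap big enough to keep
    spans inside one sentence also keeps phrases from matching across two:

        import java.io.Reader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.WhitespaceAnalyzer;
        import org.apache.lucene.util.Version;

        final Analyzer delegate = new WhitespaceAnalyzer(Version.LUCENE_32);
        Analyzer sentenceAnalyzer = new Analyzer() {
          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return delegate.tokenStream(fieldName, reader);
          }
          @Override
          public int getPositionIncrementGap(String fieldName) {
            return 100; // arbitrary; must exceed any slop used at query time
          }
        };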
  • Chris Bamford at Jul 20, 2011 at 4:43 pm
    Hi there,

    I have my own Collector implementation which I use for searching, something like this skeleton:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class LightweightHitCollector extends Collector {

      private int maxHits;
      private int numHits;
      private int docBase;
      private boolean collecting;
      private Scorer scorer;
      private int[] hits;

      public LightweightHitCollector(int maxHits) {
        this.numHits = 0;
        this.maxHits = maxHits;
        this.collecting = true;
        hits = new int[maxHits];
        for (int i = 0; i < maxHits; i++) { hits[i] = -1; }
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return true;
      }

      @Override
      public void setScorer(Scorer scorer) {
        this.scorer = scorer;
      }

      @Override
      public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
      }

      @Override
      public void collect(int docID) {
        if (!collecting) {
          return;
        }
        hits[numHits] = docBase + docID;
        if (++numHits == maxHits) {
          collecting = false;
        }
      }

      public int[] getHits() {
        return hits;
      }
    }

    Question: is there a way to prevent collect() being called after it has collected its quota (i.e. when collecting becomes false)? On large datasets this would save a lot of time.
    In this scenario I have no need for sort / ordering etc.
    Thanks.

    - Chris
  • Devon H. O'Dell at Jul 20, 2011 at 4:54 pm

    2011/7/20 Chris Bamford <chris.bamford@talktalk.net>:
    Hi there,

    I have my own Collector implementation which I use for searching, something like this skeleton: [snip]
    Question: is there a way to prevent collect() being called after it has collected its quota  (i.e. when collecting becomes false)?  On large datasets this would save a lot of time.
    In this scenario I have no need for sort / ordering etc.
    I'd be interested in knowing this as well. My application has the
    ability to limit the number of results returned, and I've "solved"
    this problem by checking whether the number of collected results
    exceeds that threshold and returning early if it has. This seems
    fast enough, but it would be nice to have some way of notifying the
    caller that I don't want any more documents.

    I was personally unable to find any other workaround for this, and
    perhaps my hackish solution will work for you (if you're not already
    doing this). But on searches returning several million records, it's
    kind of silly to keep spinning.

    Kind regards,

    Devon H. O'Dell
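
    A minimal sketch of the workaround Devon describes, written against the
    fields of Chris's skeleton collector above (it is the same idea as the
    skeleton's 'collecting' flag, as a direct threshold check):

        @Override
        public void collect(int docID) {
          if (numHits >= maxHits) {
            return; // quota reached; Lucene still calls collect(), but we do no work
          }
          hits[numHits++] = docBase + docID;
        }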
  • Simon Willnauer at Jul 20, 2011 at 9:07 pm
    You can advance the scorer to NO_MORE_DOCS once you have collected
    enough documents; this will stop the loop.

    scorer.advance(Scorer.NO_MORE_DOCS);

    simon
    On Wed, Jul 20, 2011 at 6:53 PM, Devon H. O'Dell wrote:
    [snip]
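
    For concreteness, a sketch of where Simon's call would go in Chris's
    skeleton collector (note the follow-ups below: this does not work for
    every scorer type):

        @Override
        public void collect(int docID) throws IOException {
          hits[numHits] = docBase + docID;
          if (++numHits == maxHits) {
            // exhaust the scorer so the search loop terminates
            scorer.advance(Scorer.NO_MORE_DOCS);
          }
        }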
  • Chris Bamford at Jul 21, 2011 at 9:44 am
    Hi Simon,



    scorer.advance(Scorer.NO_MORE_DOCS);

    Hmm... doesn't seem to work :-( I tried to call it in collect() and setNextReader() - still loops to the end of the matched doc set.
    What have I missed?

    Thanks

    - Chris

    -----Original Message-----
    From: Simon Willnauer <simon.willnauer@googlemail.com>
    [snip]
  • Uwe Schindler at Jul 21, 2011 at 9:50 am
    Hi,

    The reason is that some scorers passed into setScorer are "fake" scorers
    that only implement the score() method (this depends on the type of
    BooleanQuery scorer used for result collection, and on whether the scorer
    actively collects the results itself, which happens for top-level queries).
    In general, the easiest way to exit collection early is to throw a
    RuntimeException subclass (this is what TimeLimitingCollector does; see the
    Java file in the Lucene source code).

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Chris Bamford
    Sent: Thursday, July 21, 2011 11:44 AM
    To: java-user@lucene.apache.org
    Subject: Re: Short circuiting Collector


    [snip]
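
    A minimal sketch of the pattern Uwe describes, modeled on
    TimeLimitingCollector; the collector and exception names here are
    hypothetical:

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.Collector;
        import org.apache.lucene.search.Scorer;

        class EarlyTerminationException extends RuntimeException {}

        class QuotaCollector extends Collector {
          private final int maxHits;
          private int numHits;
          private int docBase;

          QuotaCollector(int maxHits) { this.maxHits = maxHits; }

          @Override public void setScorer(Scorer scorer) {}
          @Override public void setNextReader(IndexReader reader, int docBase) { this.docBase = docBase; }
          @Override public boolean acceptsDocsOutOfOrder() { return true; }

          @Override
          public void collect(int docID) {
            // record docBase + docID somewhere, then abort once the quota is hit
            if (++numHits >= maxHits) {
              throw new EarlyTerminationException();
            }
          }
        }

        // At the call site, treat the exception as the normal "enough results" signal:
        try {
          searcher.search(query, new QuotaCollector(100));
        } catch (EarlyTerminationException e) {
          // expected: the results collected so far are still usable
        }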
  • Mark Miller at Jul 20, 2011 at 11:45 pm

    On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:

    Mark Miller's 'SpanWithinQuery' patch
    seems to have the same issue.
    If I remember right (it's been more than a couple of years), I did index the sentence markers at the same position as the last word in the sentence. And I think the limitation that I ate was that the word could belong to both its true sentence and the one after it.

    - Mark Miller
    lucidimagination.com
  • Mark Miller at Jul 21, 2011 at 1:22 am

    On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:

    [snip]
    Perhaps you could index the sentence marker at both the last word of the sentence and the first word of the next sentence, if there is one. This would seem to solve the above limitation as well? (A token-layout sketch follows below.)

    - Mark Miller
    lucidimagination.com
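
    In token-stream terms (a sketch in the style of the TOKENS/INCREMENTS
    arrays in Peter's test below, not taken from the actual patch), the dual
    marker would look like this:

        // Two sentences "1 2 3" and "4 5 6"; END overlaps both the last token
        // of the first sentence and the first token of the second one.
        private final String[] TOKENS     = {"1", "2", "3", END, END, "4", "5", "6"};
        private final int[]    INCREMENTS = { 1,   1,   1,   0,   1,   0,   1,   1 };
        // "3" and the first END share position 3; the second END and "4" share position 4.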
  • Peter Keegan at Jul 21, 2011 at 1:28 pm
    Hi Mark,

    Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
    ('getTerms' removed). The last test fails (search for "1" and "3").

    package org.apache.lucene.search.spans;

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.RandomIndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.util.LuceneTestCase;

    public class TestSentence extends LuceneTestCase {
      public static final String field = "field";
      public static final String START = "^";
      public static final String END = "$";

      public void testSetPosition() throws Exception {
        Analyzer analyzer = new Analyzer() {
          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return new TokenStream() {
              private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
              private final int[] INCREMENTS = {1, 1, 1, 0, 1, 1, 1, 0, 1};
              private int i = 0;

              PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
              CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
              OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

              @Override
              public boolean incrementToken() {
                assertEquals(TOKENS.length, INCREMENTS.length);
                if (i == TOKENS.length)
                  return false;
                clearAttributes();
                termAtt.append(TOKENS[i]);
                offsetAtt.setOffset(i, i);
                posIncrAtt.setPositionIncrement(INCREMENTS[i]);
                i++;
                return true;
              }
            };
          }
        };
        Directory store = newDirectory();
        RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
        Document d = new Document();
        d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(d);
        IndexReader reader = writer.getReader();
        writer.close();
        IndexSearcher searcher = newSearcher(reader);

        SpanTermQuery startSentence = makeSpanTermQuery(START);
        SpanTermQuery endSentence = makeSpanTermQuery(END);
        SpanQuery[] clauses = new SpanQuery[2];
        clauses[0] = makeSpanTermQuery("1");
        clauses[1] = makeSpanTermQuery("2");
        SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
        SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
        System.out.println("query: " + query);
        ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
        assertEquals(hits.length, 1);

        clauses[1] = makeSpanTermQuery("4");
        allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
        query = new SpanWithinQuery(allKeywords, endSentence, 0);
        System.out.println("query: " + query);
        hits = searcher.search(query, null, 1000).scoreDocs;
        assertEquals(hits.length, 0);

        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term(field, "3"));
        pq.add(new Term(field, "4"));
        hits = searcher.search(pq, null, 1000).scoreDocs;
        assertEquals(hits.length, 1);

        clauses[1] = makeSpanTermQuery("3");
        allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
        query = new SpanWithinQuery(allKeywords, endSentence, 0);
        System.out.println("query: " + query);
        hits = searcher.search(query, null, 1000).scoreDocs;
        assertEquals(hits.length, 1);
      }

      public SpanTermQuery makeSpanTermQuery(String text) {
        return new SpanTermQuery(new Term(field, text));
      }

      public TermQuery makeTermQuery(String text) {
        return new TermQuery(new Term(field, text));
      }
    }

    Peter
    On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller wrote:

    [snip]
  • Mark Miller at Jul 21, 2011 at 7:08 pm
    Hey Peter,

    Getting sucked back into Spans...

    That test should pass now - I uploaded a new patch to https://issues.apache.org/jira/browse/LUCENE-777

    Further tests may be needed though.

    - Mark

    On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:

    [snip]
  • Peter Keegan at Jul 21, 2011 at 8:02 pm
    Does this patch require the trunk version? I'm using 3.2 and
    'AtomicReaderContext' isn't there.

    Peter
    On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller wrote:

    [snip]
  • Mark Miller at Jul 21, 2011 at 8:25 pm
    Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to change that to an IndexReader I believe.

    - Mark
    On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:

    [snip]
  • Mark Miller at Jul 21, 2011 at 9:23 pm
    I just uploaded a patch for 3X that will work for 3.2.
    On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:

    [snip]
  • Peter Keegan at Jul 21, 2011 at 10:03 pm
    The 3X patch works great, Mark! (how do you get your head around spans so
    quickly after 2.5 years? :) )

    Thanks,
    Peter
    On Thu, Jul 21, 2011 at 5:23 PM, Mark Miller wrote:


    [snip]
  • Peter Keegan at Jul 25, 2011 at 2:15 pm
    Hi Mark,

Sorry to bug you again, but there's another case where the unit test fails (a search within the second sentence), shown here as the last test:

package org.apache.lucene.search.spans;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.RandomIndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.LuceneTestCase;

public class TestSentence extends LuceneTestCase {
  public static final String field = "field";
  public static final String START = "^";
  public static final String END = "$";

  public void testSetPosition() throws Exception {
    // Two "sentences" (1 2 3 and 4 5 6) plus a trailing 9. Each END marker
    // is indexed with a position increment of 0, i.e. at the same position
    // as the last word of its sentence ("3" and "6").
    Analyzer analyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenStream() {
          private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
          private final int[] INCREMENTS = {1, 1, 1, 0, 1, 1, 1, 0, 1};
          private int i = 0;
          PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
          CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
          OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

          @Override
          public boolean incrementToken() {
            assertEquals(TOKENS.length, INCREMENTS.length);
            if (i == TOKENS.length)
              return false;
            clearAttributes();
            termAtt.append(TOKENS[i]);
            offsetAtt.setOffset(i, i);
            posIncrAtt.setPositionIncrement(INCREMENTS[i]);
            i++;
            return true;
          }
        };
      }
    };
    Directory store = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
    Document d = new Document();
    d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(d);
    IndexReader reader = writer.getReader();
    writer.close();
    IndexSearcher searcher = newSearcher(reader);

    SpanTermQuery startSentence = makeSpanTermQuery(START);
    SpanTermQuery endSentence = makeSpanTermQuery(END);
    SpanQuery[] clauses = new SpanQuery[2];

    // "1" and "2" lie within the first sentence: expect a match.
    clauses[0] = makeSpanTermQuery("1");
    clauses[1] = makeSpanTermQuery("2");
    SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
    SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
    System.out.println("query: " + query);
    ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);

    // "1" and "4" straddle the first sentence boundary: expect no match.
    clauses[1] = makeSpanTermQuery("4");
    allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
    query = new SpanWithinQuery(allKeywords, endSentence, 0);
    System.out.println("query: " + query);
    hits = searcher.search(query, null, 1000).scoreDocs;
    assertEquals(0, hits.length);

    // Sanity check: "3" and "4" sit at adjacent positions, so the phrase matches.
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term(field, "3"));
    pq.add(new Term(field, "4"));
    System.out.println("query: " + pq);
    hits = searcher.search(pq, null, 1000).scoreDocs;
    assertEquals(1, hits.length);

    // "4" and "6" both lie within the second sentence: expect a match.
    // This is the case that currently fails.
    clauses[0] = makeSpanTermQuery("4");
    clauses[1] = makeSpanTermQuery("6");
    allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
    query = new SpanWithinQuery(allKeywords, endSentence, 0);
    System.out.println("query: " + query);
    hits = searcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
  }

  public SpanTermQuery makeSpanTermQuery(String text) {
    return new SpanTermQuery(new Term(field, text));
  }

  public TermQuery makeTermQuery(String text) {
    return new TermQuery(new Term(field, text));
  }
}

    Peter
  • Mark Miller at Jul 25, 2011 at 2:30 pm
    Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.

    I can likely look at this later today.

    - Mark Miller
    lucidimagination.com
  • Mark Miller at Jul 25, 2011 at 9:57 pm
Sorry Peter - I introduced this problem with a typo-type issue - I somehow changed an includeSpans variable to excludeSpans. I certainly didn't mean to - it makes no sense - so I'm not sure how it happened, and I'm surprised the tests that passed still passed!

    We could probably use even more tests before feeling too confident here…

    I've attached a patch for 3X with the new test and fix (changed that include back to exclude).
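For what it's worth, here is a minimal, self-contained sketch of why that swap is so quiet - the names are hypothetical and this is not the LUCENE-777 code. The within-sentence check only has to reject spans with a sentence-end marker strictly inside them, so negating it flips every result:

// Hypothetical sketch, not the LUCENE-777 implementation: an inclusive
// keyword span [firstPos, lastPos] matches only if no sentence-end marker
// falls strictly inside it. Markers overlap the last word of each sentence,
// so a marker at lastPos does not disqualify the span. Negating this test
// (the include/exclude mix-up) inverts every result.
public class SpanWithinSketch {
    static boolean withinSentence(int firstPos, int lastPos, int[] endMarkers) {
        for (int p : endMarkers) {
            if (p > firstPos && p < lastPos) {
                return false; // the keywords straddle a sentence boundary
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Positions from the test stream: "1"=1 "2"=2 "3"/END=3 "4"=4 "5"=5 "6"/END=6 "9"=7
        int[] ends = {3, 6};
        System.out.println(withinSentence(1, 2, ends)); // true:  "1","2" in sentence one
        System.out.println(withinSentence(1, 4, ends)); // false: "1","4" cross the boundary
        System.out.println(withinSentence(4, 6, ends)); // true:  "4","6" in sentence two
    }
}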

    - Mark Miller
    lucidimagination.com
  • Peter Keegan at Jul 26, 2011 at 12:56 pm
    Thanks Mark! The new patch is working fine with the tests and a few more. If
    you have particular test cases in mind, I'd be happy to add them.

    Thanks,
    Peter
  • Mark Miller at Jul 26, 2011 at 1:12 pm
As long as you are happy with the results, I'm good. Always nice to have an excuse to dip back into Lucene. I just don't want you to feel overconfident in the code without proper testing of it - I coded to fix the broken tests rather than taking the time to write a bunch more corner-case tests, as I likely should if I were going to commit this thing.
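For example, a couple of hypothetical additions to testSetPosition() above - the expected counts are inferred by analogy with the cases already in this thread, not verified against the patch:

// Both terms inside the second sentence; END overlaps "6", so this should
// behave like the "1","3" and "4","6" cases and match.
clauses[0] = makeSpanTermQuery("5");
clauses[1] = makeSpanTermQuery("6");
allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
query = new SpanWithinQuery(allKeywords, endSentence, 0);
hits = searcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);

// "5" and "9" straddle the second END marker (at the position of "6"),
// analogous to the "1","4" case: expect no match.
clauses[0] = makeSpanTermQuery("5");
clauses[1] = makeSpanTermQuery("9");
allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
query = new SpanWithinQuery(allKeywords, endSentence, 0);
hits = searcher.search(query, null, 1000).scoreDocs;
assertEquals(0, hits.length);

// Still worth deciding explicitly: adjacent terms that straddle a boundary,
// e.g. "3" (last word of sentence one, overlapping END) and "4" (first word
// of sentence two) - the boundary-word ambiguity discussed earlier in the
// thread - so the expected count there is a design decision.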

    - Mark Miller
    lucidimagination.com
    On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:

    Hi Mark,

Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2 ('getTerms' removed). The last test fails (search for "1" and "3").
    package org.apache.lucene.search.spans;

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.RandomIndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.util.LuceneTestCase;

    public class TestSentence extends LuceneTestCase {
    public static final String field = "field";
    public static final String START = "^";
    public static final String END = "$";
    public void testSetPosition() throws Exception {
    Analyzer analyzer = new Analyzer() {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
    return new TokenStream() {
    private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6",
    END,
    "9"};
    private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
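              // An increment of 0 indexes the END marker at the same position
              // as the preceding token, so "$" overlaps "3" and "6", the last
              // word of each sentence.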
              private int i = 0;

              private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
              private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
              private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

              @Override
              public boolean incrementToken() {
                assertEquals(TOKENS.length, INCREMENTS.length);
                if (i == TOKENS.length)
                  return false;
                clearAttributes();
                termAtt.append(TOKENS[i]);
                offsetAtt.setOffset(i, i);
                posIncrAtt.setPositionIncrement(INCREMENTS[i]);
                i++;
                return true;
              }
            };
          }
        };
        Directory store = newDirectory();
        RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
        Document d = new Document();
        d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(d);
        IndexReader reader = writer.getReader();
        writer.close();
        IndexSearcher searcher = newSearcher(reader);

        SpanTermQuery startSentence = makeSpanTermQuery(START);
        SpanTermQuery endSentence = makeSpanTermQuery(END);
        SpanQuery[] clauses = new SpanQuery[2];
        clauses[0] = makeSpanTermQuery("1");
        clauses[1] = makeSpanTermQuery("2");
        SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
        SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
        System.out.println("query: " + query);
        ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
        assertEquals(1, hits.length);

        clauses[1] = makeSpanTermQuery("4");
        allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
        query = new SpanWithinQuery(allKeywords, endSentence, 0);
        System.out.println("query: " + query);
        hits = searcher.search(query, null, 1000).scoreDocs;
        assertEquals(0, hits.length);

        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term(field, "3"));
        pq.add(new Term(field, "4"));
        hits = searcher.search(pq, null, 1000).scoreDocs;
        assertEquals(1, hits.length);
    clauses[1] = makeSpanTermQuery("3");
    allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
    //
    SpanAndQuery equivalent
    query = new SpanWithinQuery(allKeywords, endSentence, 0);
    System.out.println("query: "+query);
    hits = searcher.search(query, null, 1000).scoreDocs;
    assertEquals(hits.length, 1);
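        // This is the assertion Peter reports failing: the match span for
        // "1" ... "3" ends at the position of the END marker (which overlaps
        // "3"), so the within-query seems to exclude it.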


      }

      public SpanTermQuery makeSpanTermQuery(String text) {
        return new SpanTermQuery(new Term(field, text));
      }

      public TermQuery makeTermQuery(String text) {
        return new TermQuery(new Term(field, text));
      }
    }

    Peter

    On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller <markrmiller@gmail.com> wrote:
    On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:

    On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:

    Mark Miller's 'SpanWithinQuery' patch seems to have the same issue.
    If I remember right (it's been more than a couple of years), I did index the sentence markers at the same position as the last word in the sentence. And I think the limitation I accepted was that the word could belong both to its true sentence and to the one after it.
    - Mark Miller
    lucidimagination.com
    Perhaps you could index the sentence marker at both the last word of the sentence and the first word of the next sentence, if there is one. This would seem to solve the above limitation as well?

    - Mark Miller
    lucidimagination.com
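
    To make the overlap concrete, here is a minimal sketch of a marker-injecting filter, assuming the Lucene 3.x analysis API. SentenceMarkerFilter and its naive end-of-sentence test (a trailing '.', '!' or '?') are illustrative assumptions, not code from the patch or this thread; it emits END at the same position as each sentence's last word (position increment 0), matching the hand-built TokenStream in Peter's test.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Hypothetical sketch: injects an end-of-sentence marker token at the
    // same position as the last word of each sentence.
    public final class SentenceMarkerFilter extends TokenFilter {
      public static final String END = "$";

      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
      private boolean pendingMarker = false;

      public SentenceMarkerFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pendingMarker) {
          // Emit the marker on top of the previous token: same position
          // (increment 0), term text "$". Offsets are left untouched, so
          // the marker also shares the last word's offsets.
          termAtt.setEmpty().append(END);
          posIncrAtt.setPositionIncrement(0);
          pendingMarker = false;
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        // Naive sentence detection: a token ending in '.', '!' or '?'
        // closes a sentence. A real implementation would use a proper
        // sentence segmenter upstream.
        int len = termAtt.length();
        if (len > 1) {
          char last = termAtt.charAt(len - 1);
          if (last == '.' || last == '!' || last == '?') {
            termAtt.setLength(len - 1); // strip the trailing punctuation
            pendingMarker = true;
          }
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingMarker = false;
      }
    }

    Covering the both-positions variant suggested above would additionally emit the marker, again with a position increment of 0, right after the first token of the following sentence.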








---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org