Hi,

Why doesn't DuplicateFilter work together with other filters? For
example, after a small modification of the DuplicateFilterTest test, I
get the impression that the filter is not applied on top of the other
filters but trims the results first:

public void testKeepsLastFilter() throws Throwable {
    DuplicateFilter df = new DuplicateFilter(KEY_FIELD);
    df.setKeepMode(DuplicateFilter.KM_USE_LAST_OCCURRENCE);

    Query q = new ConstantScoreQuery(new ChainedFilter(new Filter[]{
        new QueryWrapperFilter(tq),
        // Works right, it is the last document:
        // new QueryWrapperFilter(new TermQuery(new Term("text", "out"))),
        // Why doesn't this work? It is the third document:
        new QueryWrapperFilter(new TermQuery(new Term("text", "now")))
    }, ChainedFilter.AND));

    ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;

    assertTrue("Filtered searching should have found some matches",
        hits.length > 0);
    for (int i = 0; i < hits.length; i++) {
        Document d = searcher.doc(hits[i].doc);
        String url = d.get(KEY_FIELD);
        // Walk all documents with this URL to find the last one.
        TermDocs td = reader.termDocs(new Term(KEY_FIELD, url));
        int lastDoc = 0;
        while (td.next()) {
            lastDoc = td.doc();
        }
        assertEquals("Duplicate urls should return last doc", lastDoc, hits[i].doc);
    }
}

--
Best regards,
Минченков Павел

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Uwe Schindler at May 31, 2010 at 6:40 am
    Where is df (the DuplicateFilter) used in your code?

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
    -----Original Message-----
    From: Паша Минченков
    Sent: Monday, May 31, 2010 8:27 AM
    To: java-user@lucene.apache.org
    Subject: DuplicateFilter question

  • Паша Минченков at May 31, 2010 at 7:16 am
    df (the DuplicateFilter) is the second parameter of the searcher.search method:
    ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;
    This variant doesn't work either:
    ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, df),
        new QueryWrapperFilter(new TermQuery(new Term("text", "now"))),
        1000).scoreDocs;
    Nor does this one:
    ScoreDoc[] hits = searcher.search(new FilteredQuery(tq,
        new QueryWrapperFilter(new TermQuery(new Term("text", "now")))),
        df, 1000).scoreDocs;

  • Mark Harwood at May 31, 2010 at 7:31 am
    The DuplicateFilter passed to the searcher has no visibility of the text query and is therefore evaluated independently of all other criteria.
    It sounds like the behaviour you want is to get the last duplicate that also matches your criteria. That seems like a fairly common need, but unfortunately it is not something DuplicateFilter will help with. For this requirement you would need a new de-duping query that wraps a child query and takes the latest match for a given field. Unfortunately, if the documents are not sequenced in URL order, evaluating this efficiently will involve either a lot of expensive disk seeks or a lot of RAM.
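
    To make the difference concrete, here is a minimal plain-Java model of the two evaluation orders. It is only a sketch and does not use the Lucene API: the Doc class, the sample URL and texts, and all method names are made up for illustration, and a list index stands in for the Lucene doc id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupOrderDemo {
    /** Toy document (hypothetical); its index in the list plays the role of the doc id. */
    public static final class Doc {
        final String url;
        final String text;
        public Doc(String url, String text) { this.url = url; this.text = text; }
    }

    /** Ids (ascending) of the docs whose text contains the given word. */
    public static List<Integer> matching(List<Doc> docs, String word) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (Arrays.asList(docs.get(i).text.split(" ")).contains(word)) ids.add(i);
        }
        return ids;
    }

    /** Of the candidate ids (given in ascending order), keep only the last one per url. */
    public static List<Integer> keepLast(List<Doc> docs, List<Integer> candidates) {
        Map<String, Integer> lastByUrl = new LinkedHashMap<>();
        for (int id : candidates) lastByUrl.put(docs.get(id).url, id); // later ids overwrite earlier ones
        List<Integer> out = new ArrayList<>(lastByUrl.values());
        Collections.sort(out);
        return out;
    }

    /** Models a filter evaluated on its own: de-dupe over ALL docs, then intersect with the hits. */
    public static List<Integer> dedupIndependently(List<Doc> docs, String word) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) all.add(i);
        List<Integer> hits = matching(docs, word);
        hits.retainAll(keepLast(docs, all));
        return hits;
    }

    /** Models the wanted behaviour: apply the criteria first, then de-dupe what is left. */
    public static List<Integer> dedupAfterFilter(List<Doc> docs, String word) {
        return keepLast(docs, matching(docs, word));
    }

    public static void main(String[] args) {
        // Four docs sharing one url; "now" is in the third doc, "out" in the last.
        List<Doc> docs = Arrays.asList(
            new Doc("http://example.org/a", "one"),
            new Doc("http://example.org/a", "two"),
            new Doc("http://example.org/a", "now"),
            new Doc("http://example.org/a", "out"));
        System.out.println(dedupIndependently(docs, "now")); // prints []
        System.out.println(dedupIndependently(docs, "out")); // prints [3]
        System.out.println(dedupAfterFilter(docs, "now"));   // prints [2]
    }
}
```

    In this model the independent evaluation keeps only doc 3 for the URL, so a criterion that matches doc 2 is left with nothing, which mirrors the "out" versus "now" behaviour in the test above.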

    If your documents are stored in URL order (i.e. the URL is just the host part and all docs from a site are held together), you could look at the PerParentLimitingQuery I created as part of the NestedDocumentQuery package in LUCENE-2454. It is designed to return the top N docs for a given parent (in this case, a site). With some small modification it could return the last child for a parent. Take a look at the JUnit example that gets the best N chapters for each book.
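
    As a rough illustration of why URL order makes this cheap, the sketch below picks the last matching doc per URL in a single forward pass, holding only one pending doc id in memory. It is plain Java, not the PerParentLimitingQuery code; the class name, method name and array-based inputs are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class LastPerGroupDemo {
    /**
     * urls[i] is the key of doc i, and docs with equal keys are assumed to be adjacent.
     * matches[i] says whether doc i satisfies the search criteria.
     * Returns, for each run of equal keys, the id of its last matching doc (if any).
     */
    public static List<Integer> lastMatchPerGroup(String[] urls, boolean[] matches) {
        List<Integer> out = new ArrayList<>();
        int pending = -1; // last matching doc seen in the current run, or -1 if none yet
        for (int i = 0; i < urls.length; i++) {
            if (i > 0 && !urls[i].equals(urls[i - 1])) {
                // A new run starts: emit the result of the previous run.
                if (pending >= 0) out.add(pending);
                pending = -1;
            }
            if (matches[i]) pending = i;
        }
        if (pending >= 0) out.add(pending); // emit the final run
        return out;
    }

    public static void main(String[] args) {
        String[] urls = {"a", "a", "a", "b", "b"};
        boolean[] matches = {true, false, true, false, true};
        System.out.println(lastMatchPerGroup(urls, matches)); // prints [2, 4]
    }
}
```

    Without the adjacency assumption, the same result needs a map from every URL to its latest match, which is where the extra RAM or disk seeks come from.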
    Cheers,
    Mark

  • Паша Минченков at May 31, 2010 at 8:01 am
    Thanks. I don't mind whether it is the first or the last document.
    What matters most is that the filtered documents contain no
    duplicates for a given field (in fact, I need to group the filtered
    results by the specified field). I'll try PerParentLimitingQuery and
    NestedDocumentQuery.
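
    For a small result set, the grouping described here can also be sketched on the client side after the search has run. The snippet below is plain Java under stated assumptions: keys[doc] stands in for the stored value of the grouping field and hits for the already filtered doc ids; neither name comes from Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupHitsDemo {
    /**
     * Groups already-filtered hits by their key field value.
     * keys[doc] is the (hypothetical) field value of doc; hits are doc ids in result order.
     */
    public static Map<String, List<Integer>> groupByKey(String[] keys, int[] hits) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>(); // keeps first-seen key order
        for (int doc : hits) {
            groups.computeIfAbsent(keys[doc], k -> new ArrayList<>()).add(doc);
        }
        return groups;
    }

    /** One representative per group (here the first hit), so no key appears twice. */
    public static List<Integer> onePerKey(String[] keys, int[] hits) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> group : groupByKey(keys, hits).values()) {
            out.add(group.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "c"};
        int[] hits = {0, 2, 3}; // doc 1 was filtered out
        System.out.println(groupByKey(keys, hits)); // prints {a=[0, 2], c=[3]}
        System.out.println(onePerKey(keys, hits));  // prints [0, 3]
    }
}
```

    This only works when all filtered hits fit in memory; it does not replace a query-side solution for large result sets.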


Discussion Overview
group: java-user
categories: lucene
posted: May 31, '10 at 6:27a
active: May 31, '10 at 8:01a
posts: 5
users: 3
website: lucene.apache.org
