Is there a Filter that returns a limited number of random documents from the
index which do NOT contain a specific term?

e.g. term="pizza"

I want to run the query against 10 random documents of the collection that
do not contain the term "pizza".

thanks

  • Patrick Diviacco at Mar 29, 2011 at 7:00 pm
    Ok I've solved the first part of the problem. I'm now selecting all
    documents that do not contain a given term with a BooleanFilter
    and FilterClause, MUST NOT.

    I still have to understand how to retrieve random documents and limit the
    number of retrieved docs to N.

    thanks
  • Ian Lea at Mar 29, 2011 at 7:49 pm
    Here are a couple of ideas.

    Plan A.

    Pick an oversampling factor, say 10, retrieve n * 10 docids in your
    search (n being the number of docs you want) and then loop round
    java.util.Random.nextInt(n * 10), picking hits at random, until you've
    got enough.

    Plan B.

    Reverse your MUST NOT search to get a list of docids that you don't
    want, then loop round Random.nextInt(indexreader.maxDoc()), selecting
    those that are not deleted (!indexreader.isDeleted(docid)) and are not
    in your exclusion list.


    I'm sure there are other ways, probably better.


    --
    Ian.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
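Plan A can be sketched in plain Java as below. This is only an illustration under stated assumptions: `candidateDocIds` is a hypothetical array standing in for the doc ids of the n * 10 hits returned by the filtered search, the ids are assumed to be distinct, and no Lucene calls are made. Note that this samples only from the hits that were retrieved, not from every matching document in the index.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

/** Plan A sketch: pick n distinct random entries from an oversampled hit list. */
class PlanASampler {

    /**
     * candidateDocIds is a hypothetical stand-in for the doc ids of the
     * n * 10 hits returned by the filtered search. Ids are assumed distinct.
     */
    static Set<Integer> pickRandom(int[] candidateDocIds, int n, Random random) {
        // Never ask for more documents than the search actually returned.
        int wanted = Math.min(n, candidateDocIds.length);
        Set<Integer> picked = new LinkedHashSet<>();
        while (picked.size() < wanted) {
            // nextInt(bound) returns an index in [0, bound).
            int index = random.nextInt(candidateDocIds.length);
            // The set ignores a doc id that was already picked.
            picked.add(candidateDocIds[index]);
        }
        return picked;
    }

    public static void main(String[] args) {
        int[] hits = new int[100];              // pretend the search returned 100 hits
        for (int i = 0; i < hits.length; i++) {
            hits[i] = i * 3;                    // arbitrary distinct doc ids
        }
        System.out.println(pickRandom(hits, 10, new Random()));
    }
}
```

Capping the target at the number of hits keeps the loop from spinning forever when the search returns fewer than n documents.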
  • Patrick Diviacco at Mar 29, 2011 at 7:56 pm
    Plan A sounds better because I don't want to consider the entire collection
    and then remove results from it.

    However, the same code has to work with 2 different collections. The first
    one has 30,000 docs, the other one 90,000.

    How can I get the total number of docs from a collection?
    thanks
  • Ian Lea at Mar 29, 2011 at 8:12 pm

    > Plan A sounds better because I don't want to consider the entire collection
    > and then remove results from it.

    Fine, your choice.

    > However, the same code has to work with 2 different collections. The first
    > one has 30,000 docs, the other one 90,000.

    No problem. The number of docs is irrelevant.

    > How can I get the total number of docs from a collection?

    IndexReader.numDocs(). See also maxDoc() and numDeletedDocs().


    --
    Ian.
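For comparison, Plan B from earlier in the thread can be sketched with the counts mentioned here. The sketch is plain Java with no Lucene calls: `maxDoc`, `deleted` and `excluded` are hypothetical inputs standing in for `IndexReader.maxDoc()`, the deleted-document check and the exclusion list. It draws candidate ids from [0, maxDoc) rather than [0, numDocs), on the understanding that in Lucene 3.x doc ids run up to maxDoc() - 1 and numDocs() is smaller whenever the index holds deletions.

```java
import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

/** Plan B sketch: draw random doc ids, rejecting deleted and excluded ones. */
class PlanBSampler {

    /**
     * maxDoc, deleted and excluded are hypothetical stand-ins for
     * IndexReader.maxDoc(), the deleted docs and the "contains the term" list.
     */
    static Set<Integer> pickRandom(int maxDoc, BitSet deleted, BitSet excluded,
                                   int n, Random random) {
        // Work out which ids are eligible so the loop is guaranteed to end.
        BitSet eligible = new BitSet(maxDoc);
        eligible.set(0, maxDoc);
        eligible.andNot(deleted);
        eligible.andNot(excluded);
        int wanted = Math.min(n, eligible.cardinality());

        Set<Integer> picked = new LinkedHashSet<>();
        while (picked.size() < wanted) {
            int docId = random.nextInt(maxDoc);   // candidate in [0, maxDoc)
            if (eligible.get(docId)) {
                picked.add(docId);                // duplicates are ignored by the set
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(3);
        BitSet excluded = new BitSet();
        excluded.set(0, 10);                      // pretend docs 0..9 contain the term
        System.out.println(pickRandom(50, deleted, excluded, 10, new Random()));
    }
}
```

Rejection sampling like this is cheap when most documents are eligible. If the excluded term is very common, many draws are wasted and sampling from the list of matches is likely the better fit.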
  • Patrick Diviacco at Mar 29, 2011 at 10:02 pm
    One last thing: how do I check that a random document does not contain the
    term?

    In other words, I cannot just pass the TermsFilter; I need to check whether
    each retrieved random document is valid so I know when I have enough.

    Any code example is appreciated. So far I have this, which retrieves docs
    without that specific term:

    // termsFilter is the TermsFilter holding the term to exclude
    BooleanFilter termsNOTFilter = new BooleanFilter();
    FilterClause notTermClause = new FilterClause(termsFilter,
        org.apache.lucene.search.BooleanClause.Occur.MUST_NOT);
    termsNOTFilter.add(notTermClause);

    thanks
  • Ian Lea at Mar 30, 2011 at 9:45 am
    If your query explicitly excludes certain terms then surely you can be
    confident that matched docs will not contain those terms, and if your
    random docs are a subset of those matched docs they won't contain them
    either.


    --
    Ian.
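The subset argument above can be illustrated with plain bit sets. This is a conceptual sketch, not Lucene code: `containsTerm` is a hypothetical bit set marking the documents that contain the excluded term, and `maxDoc` is a hypothetical index size. Every sampled id comes from the set of matches, so no separate per-document check for the term is needed.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: a random sample drawn from the filtered matches cannot contain the term. */
class ExcludedTermSample {

    /** containsTerm and maxDoc are hypothetical stand-ins for index data. */
    static List<Integer> sampleWithoutTerm(int maxDoc, BitSet containsTerm,
                                           int n, Random random) {
        // Matches = every doc id, minus the ones that contain the term.
        BitSet matches = new BitSet(maxDoc);
        matches.set(0, maxDoc);
        matches.andNot(containsTerm);

        // Collect the matching ids, shuffle them, and keep the first n.
        List<Integer> ids = new ArrayList<>();
        for (int id = matches.nextSetBit(0); id >= 0; id = matches.nextSetBit(id + 1)) {
            ids.add(id);
        }
        Collections.shuffle(ids, random);
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    public static void main(String[] args) {
        BitSet containsPizza = new BitSet();
        for (int id = 0; id < 40; id += 2) {
            containsPizza.set(id);               // pretend even ids mention "pizza"
        }
        System.out.println(sampleWithoutTerm(40, containsPizza, 10, new Random()));
    }
}
```

Shuffling the full list of matches gives a uniform sample over all matching documents, at the cost of materialising every matching id first.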



Discussion Overview
group: java-user @
category: lucene
posted: Mar 29, '11 at 6:40p
active: Mar 30, '11 at 9:45a
posts: 7
users: 2
website: lucene.apache.org

2 users in discussion

Patrick Diviacco: 4 posts Ian Lea: 3 posts
