We've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to limit the results to a subset of documents defined by a list of FLRID values. The list of FLRID values can change with every search, and two searches will share the same set of FLRIDs so rarely that we can call it "never".

What we're doing right now is, roughly:

q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))

Each of those parenthetical groups can hold 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limit on the number of terms that can be ORed together.
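
For reference, the subgrouping can be sketched roughly as below, in Python. This is a minimal sketch and not our production code: the helper names `chunked` and `build_query` and the `chunk_size` parameter are made up for illustration, and only the shape of the resulting query string matches what we send.

```python
from typing import Iterable, Iterator, List


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_query(term: str, flrids: Iterable[int], chunk_size: int = 1000) -> str:
    """Build '<term> AND (flrid:(...) OR flrid:(...) ...)'."""
    ids = sorted(set(flrids))  # de-duplicate so no clause is wasted
    if not ids:
        raise ValueError("need at least one FLRID to filter on")
    groups = [
        "flrid:(" + " ".join(str(i) for i in chunk) + ")"
        for chunk in chunked(ids, chunk_size)
    ]
    return f"{term} AND (" + " OR ".join(groups) + ")"


# Tiny chunk size so the grouping is visible.
print(build_query("title:dogs", [123, 125, 139, 34823], chunk_size=2))
# prints: title:dogs AND (flrid:(123 125) OR flrid:(139 34823))
```

With 100,000 FLRIDs and groups of 1,000, this produces 100 ORed groups in one very long query string.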

The problem with this approach (besides being clunky) is that it seems to scale at O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. With 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps to about 75,000ms. We want it to be on the order of 1,000-2,000ms at most in all cases up to 100,000 FLRIDs.
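
As a rough check on the scaling, the exponent implied by each tenfold step can be worked out from the three timings. This is back-of-the-envelope arithmetic only: it assumes 450ms as the midpoint of the 400-500ms range, and `growth_exponent` is just an illustrative helper name.

```python
import math

# (number of FLRIDs, observed response time in ms) from the timings above;
# 450 is the assumed midpoint of the 400-500ms range.
timings = [(1_000, 50.0), (10_000, 450.0), (100_000, 75_000.0)]


def growth_exponent(n1, t1, n2, t2):
    """Return k such that t2 / t1 == (n2 / n1) ** k."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# One exponent per consecutive pair of measurements.
exponents = [
    growth_exponent(n1, t1, n2, t2)
    for (n1, t1), (n2, t2) in zip(timings, timings[1:])
]

for (n1, _), (n2, _), k in zip(timings, timings[1:], exponents):
    print(f"{n1:>7,} -> {n2:>7,} FLRIDs: time grows like N^{k:.2f}")
```

By this estimate the step from 1,000 to 10,000 is close to linear (exponent about 0.95), while the step from 10,000 to 100,000 is somewhat worse than quadratic (about 2.2), so the blow-up is concentrated at the high end.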

How can we do this better?

Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
* Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
* Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
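
The fq variant from the "Tried" list looks roughly like the sketch below, which only builds the form-encoded request body and does not send it. The endpoint URL, core name, and the helper `build_select_body` are placeholders for illustration; `q`, `fq`, and `wt` are the standard Solr request parameters. We assume the body would be sent as a POST, since an ID list this long will not fit in a GET URL.

```python
from urllib.parse import parse_qs, urlencode

# Assumed endpoint; host, port, and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_body(term, flrids, chunk_size=1000):
    """Form-encoded body with the text query in q and the ID list in fq."""
    ids = list(flrids)
    groups = [
        "flrid:(" + " ".join(map(str, ids[i:i + chunk_size])) + ")"
        for i in range(0, len(ids), chunk_size)
    ]
    params = {"q": term, "fq": " OR ".join(groups), "wt": "json"}
    return urlencode(params)


body = build_select_body("title:dogs", [123, 125, 139], chunk_size=2)
print(len(body), "bytes to POST to", SOLR_SELECT_URL)
# Decode again to show what Solr would receive as the filter query.
print(parse_qs(body)["fq"][0])
```

The ID list is the same size either way, which fits with our seeing no improvement from moving it out of q.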

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
* Have Solr do big ORs as a set operation, not as (what we assume is) a naive one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.
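
To make the "match vector" idea concrete, here is a toy sketch in Python of the kind of set operation we have in mind. It is an illustration only and makes no claim about how Solr works internally: the document numbers, the 16-document index, and the helpers `match_vector` and `doc_ids` are all invented for the example.

```python
def match_vector(allowed_doc_ids, num_docs):
    """Pack the allowed document numbers into one integer used as a bitset."""
    bits = 0
    for doc_id in allowed_doc_ids:
        if not 0 <= doc_id < num_docs:
            raise ValueError(f"doc id {doc_id} out of range")
        bits |= 1 << doc_id
    return bits


def doc_ids(bits):
    """Unpack a bitset back into a sorted list of document numbers."""
    out = []
    pos = 0
    while bits:
        if bits & 1:
            out.append(pos)
        bits >>= 1
        pos += 1
    return out


NUM_DOCS = 16  # toy index size
title_dogs = match_vector([1, 4, 7, 9, 12], NUM_DOCS)  # hits for title:dogs
allowed = match_vector([4, 5, 9, 15], NUM_DOCS)        # docs in the FLRID list

# A single AND over the two vectors replaces one-at-a-time matching.
print(doc_ids(title_dogs & allowed))
# prints: [4, 9]
```

The point is that once the FLRID list is a vector, restricting the search is one intersection whose cost depends on the index size and not on how the IDs were spelled out in the query.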

I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.

* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html


Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance

Posted to solr-user on Mar 8, '13 at 5:08p


