What we're doing right now is, roughly:
```
(flrid:(123 125 139 .... 34823) OR
 flrid:(34837 ... 59091) OR
 flrid:(101294813 ... 103049934))
```
Each of those parenthesized groups can hold up to 1,000 FLRIDs strung together. We have to subgroup this way to stay under Solr's limit on the number of terms that can be ORed together (maxBooleanClauses, which defaults to 1024).
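For concreteness, here's a minimal sketch of how we build that subgrouped query string (the function name `build_flrid_query` is made up; the `flrid` field and the 1,000-per-group cap are as described above):

```python
def build_flrid_query(flrids, group_size=1000):
    """Build an ORed filter query of flrid subgroups, e.g.
    (flrid:(1 2 3) OR flrid:(4 5 6)), keeping each subgroup
    under Solr's maxBooleanClauses limit."""
    groups = [
        flrids[i:i + group_size]
        for i in range(0, len(flrids), group_size)
    ]
    clauses = [
        "flrid:(%s)" % " ".join(str(f) for f in group)
        for group in groups
    ]
    return "(%s)" % " OR ".join(clauses)

print(build_flrid_query([123, 125, 139, 34823], group_size=2))
# -> (flrid:(123 125) OR flrid:(139 34823))
```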
The problem with this approach (besides being clunky) is that it appears to scale roughly O(N^2). With 1,000 FLRIDs the search comes back in about 50ms; with 10,000 FLRIDs it takes 400-500ms; with 100,000 FLRIDs it jumps to about 75,000ms. We want it to be on the order of 1,000-2,000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
* Tried: Using dismax with minimum-match (mm=0) to simulate an OR query. No improvement.
* Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
* Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
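For the record, the cross-core join idea above would look something like this with Solr's join query parser (the scratch core name `flrid_scratch` is hypothetical; `flrid` would be the join field in both cores):

```
fq={!join fromIndex=flrid_scratch from=flrid to=flrid}*:*
```

The commit-rate problem is what kills it: each search would require indexing and committing its own throwaway set of FLRID documents first.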
What we're hoping for:
* An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
* Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because long strings of fq clauses seem like a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
Andy Lester => email@example.com => www.petdance.com => AIM:petdance