On Dec 17, 2008, at 9:26 AM, Rajiv2 wrote:
Because the search term is provided by a user, and that user would
explicitly have to put quotes around "marietta ga", when I believe the
search text as it is : fleming roofing inc., marietta ga -- should score
higher for "marietta ga"
Just because the user doesn't do it doesn't mean you can't. You're
stating that there is an implied ordering in their query, yet you
don't want to take advantage of that. You can often achieve better
results by generating phrase queries implicitly based on 2- or
3-grams. You might also try generating the whole thing as a phrase
query with a really large slop value (like 100 or more). That way,
scoring will reward terms that are closer together, but you
still get the flexibility of an AND-like query. The downside is,
possibly, a small performance hit, but you could test it first. Or,
you could add in the phrase query as an optional OR query to the
original query, something like: fleming OR roofing OR marietta OR ga
OR ("fleming roofing" OR "roofing marietta" OR "marietta ga").
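As a rough sketch of the bigram and slop ideas above: the snippet below
only builds a query string in the classic Lucene query parser syntax
(quoted phrases, "..."~N for slop). It is Python purely for brevity; the
function names tokenize, shingles and build_query are made up for
illustration and are not Lucene API, and the regex tokenizer is an
assumed stand-in for whatever Analyzer you actually use.

```python
import re


def tokenize(text):
    # Assumed stand-in for a real analyzer: lowercase, keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, size):
    # Adjacent word n-grams, e.g. size=2 gives "fleming roofing", "roofing inc", ...
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def build_query(user_text, slop=100):
    # Original terms OR-ed together, plus optional bigram phrases,
    # plus the whole input as one sloppy phrase.
    tokens = tokenize(user_text)
    terms = " OR ".join(tokens)
    if len(tokens) < 2:
        return terms
    phrases = " OR ".join('"%s"' % s for s in shingles(tokens, 2))
    sloppy = '"%s"~%d' % (" ".join(tokens), slop)
    return "%s OR (%s) OR %s" % (terms, phrases, sloppy)


if __name__ == "__main__":
    print(build_query("fleming roofing inc., marietta ga"))
```

Because every phrase clause is optional, documents that match only the
single terms still come back; the phrases just add score when the words
appear next to (or near) each other.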
You could also try using a more intelligent Query Parser that is tuned
to your domain. You could also try to factor click-through stats
into your results. Probably not the answer you want to hear, but it
is doable and useful.
Do you have any a priori knowledge about Marietta GA over Fleming, GA
to begin with? Have you done any broader scale relevance assessment?
It is often the case that "fixing" one query results in breaking a
whole bunch of others. What I typically recommend is that you take
the top 50 queries plus 10-30 random queries from your logs and do an
assessment of the top 5/10 results for: relevant, somewhat relevant,
not relevant and embarrassing. The goal is to maximize relevant while
minimizing embarrassing and not relevant.
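A minimal sketch of that assessment loop, assuming your query log is
just a list of query strings and that a human fills in the grades by
hand. pick_queries and summarize are hypothetical helper names, and the
default of 20 random queries is simply a value inside the 10-30 range
mentioned above.

```python
import random
from collections import Counter

GRADES = ("relevant", "somewhat relevant", "not relevant", "embarrassing")


def pick_queries(logged_queries, top_n=50, random_n=20, seed=0):
    # Most frequent queries first, then a random sample of the remaining ones.
    counts = Counter(logged_queries)
    top = [q for q, _ in counts.most_common(top_n)]
    rest = sorted(set(counts) - set(top))
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return top + rng.sample(rest, min(random_n, len(rest)))


def summarize(judgments):
    # judgments maps each query to the grades given to its top 5/10 results.
    totals = Counter()
    for grades in judgments.values():
        totals.update(grades)
    n = sum(totals.values())
    return {g: (totals[g] / n if n else 0.0) for g in GRADES}
```

Rerun summarize on the same query set after each ranking change; if the
"embarrassing" or "not relevant" share goes up, the change fixed one
query at the expense of others.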
Is this particular example an isolated case or do you feel this is
systemic to your application? I've said it before, but it bears
repeating: Just because someone typed search terms into your search
box does not mean you have to actually do a search in order to present
them results. If you KNOW the Marietta result is a better result for
this query, then make it the top result. Solr has this feature via
the "QueryElevationComponent" (horrible name, I know), but I call it
Editorial Placement. It's not that hard to implement.
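To show how little is involved, here is a sketch of the editorial
placement idea done by hand, outside of Solr. It is not how
QueryElevationComponent is implemented; normalize,
apply_editorial_placement, the pinned table and the document ids are
all hypothetical.

```python
def normalize(query):
    # Lowercase, drop commas and periods, and collapse whitespace.
    return " ".join(query.lower().replace(",", " ").replace(".", " ").split())


def apply_editorial_placement(query, ranked_ids, pinned):
    # Pinned ids go first (even if the search did not return them),
    # followed by the engine's own ranking with the pinned ids removed.
    picks = pinned.get(normalize(query), [])
    rest = [doc_id for doc_id in ranked_ids if doc_id not in picks]
    return list(picks) + rest


if __name__ == "__main__":
    pinned = {"fleming roofing inc marietta ga": ["doc-marietta"]}
    ranked = ["doc-fleming", "doc-marietta", "doc-other"]
    print(apply_editorial_placement("Fleming Roofing Inc., Marietta GA", ranked, pinned))
```

Queries with no entry in the pinned table pass through untouched, so
this only affects the cases you have explicitly decided on.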
Finally, I'd say I wouldn't split hairs over position too much, if the
Marietta result is #2 and the Fleming result is #1. Now, if you're
telling me the Marietta result is something like #100 and Fleming is
#1, that's a different story. The fact is, b/c your user didn't put
quotes, you don't actually know for a fact that the Fleming result is
what they wanted (but I agree, it is highly likely). The point is, I
wouldn't quibble over anything that is in the top ten. Lucene is
doing what you told it to do, that is, rank the results according to
TF/IDF, etc. If you have other pertinent information about Marietta or
the query then you should tell Lucene that via phrases, boosts or
payloads or altering the Similarity. But, like I said, be careful
that you aren't breaking other queries.
HTH,
Grant