
[OpenRelevance-dev] some general questions

Robert Muir
Nov 20, 2009 at 3:47 am
Hey, I have a couple questions/ideas and I wonder what your opinions are on
them.

First of all, I am interested in what interesting tasks we can do now that
there is some rough system for evaluating relevancy, i.e. how we can put
this to work.

1. Doing some kind of "basic relevancy test" for Lucene release candidates?
Marvin mentioned at ApacheCon how relevancy tests might have caught
bugs similar to the 2.9.0 scorer bug (OK, maybe not that specific one), and
I agree.

2. Evaluating scoring algorithms? For example, there is an ASL BM25 impl I
have been using now for some time, with good results.
Maybe we could run some tests on different impls to try to help move JIRA
issues or improvements along.

Other ideas?

Next, I am interested in ways we could improve the very rough system to make
things easier.

1. Move away from this current format / Lucene benchmark package?
The Lucene benchmark package is, in my opinion, maybe not the best framework
for experimentation.
I think it is geared more toward benchmarking performance.
For example, I do not even know if we can change Similarity easily using a
.alg file for indexing...
But it is something for now.

2. Set up an easy framework inside OpenRelevance itself to do the relevance
tests?
Would this be better? Or does it not belong here, but instead inside other
projects (such as lucene-java, Solr, Lucy, etc.)?
For example, such a thing could possibly integrate the stuff Nicola spoke of
to produce output in a more digestible format.
Otherwise, each project will have to implement a framework itself (like
lucene-java's).
We could still produce raw files in a consistent (maybe better) format for
testing other search engines, etc.

Other ideas?

Finally, I still think the whole crowdsourcing/building our own relevance
stuff is a fantastic idea and should still be pursued.
It's just that personally I am better able to help with the above, and I
think it could complement that separate effort.

--
Robert Muir
rcmuir@gmail.com


7 responses

  • Marvin Humphrey at Nov 20, 2009 at 5:34 am

    On Thu, Nov 19, 2009 at 10:47:14PM -0500, Robert Muir wrote:
    1. Doing some kind of "basic relevancy test" for lucene release candidates?
    Marvin mentioned at apachecon how maybe relevancy tests could have caught
    bugs similar to the 2.9.0 scorer bug, (ok maybe not that specific one), but
    I agree.
    For Lucy, ideally I want us to be benchmarking relevancy just like we
    benchmark performance. Two major features of Lucy are pluggable index
    components and the ability to do rapid prototyping in the host language. We
    should have some criteria by which we judge the crazy ideas people come up
    with and we should be able to single out and reward the good ones.

    One thing we don't do by default in Lucene is take proximity of term matches
    into account. I'd like to measure how much relevance improves when we augment
    basic boolean queries with span queries. That would make it easier to make a
    sensible engineering tradeoff between responsiveness and relevance. And maybe
    if the improvements are significant enough, somebody will build an optimized
    scoring class tree which exploits that information.
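
    Roughly the kind of augmentation I mean, sketched against Lucene 2.9-era
    APIs (the field name, terms, and slop below are just placeholders):

      import org.apache.lucene.index.Term;
      import org.apache.lucene.search.BooleanClause;
      import org.apache.lucene.search.BooleanQuery;
      import org.apache.lucene.search.TermQuery;
      import org.apache.lucene.search.spans.SpanNearQuery;
      import org.apache.lucene.search.spans.SpanQuery;
      import org.apache.lucene.search.spans.SpanTermQuery;

      public class ProximityBoost {
          /** Required terms score as usual; the optional span clause rewards
           *  documents where the same terms occur within 'slop' positions. */
          public static BooleanQuery build(String field, String t1, String t2,
                                           int slop) {
              BooleanQuery q = new BooleanQuery();
              q.add(new TermQuery(new Term(field, t1)), BooleanClause.Occur.MUST);
              q.add(new TermQuery(new Term(field, t2)), BooleanClause.Occur.MUST);
              SpanQuery[] near = {
                  new SpanTermQuery(new Term(field, t1)),
                  new SpanTermQuery(new Term(field, t2))
              };
              // Unordered proximity clause; it only boosts, never excludes.
              q.add(new SpanNearQuery(near, slop, false), BooleanClause.Occur.SHOULD);
              return q;
          }
      }
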
    2. Evaluating scoring algorithms? For example, there is a ASL BM25 impl I
    have been using now for some time, with good results.
    Code-wise, changing up Similarity implementations and settings is cake.

    I still don't understand exactly how we judge ranking with these collections,
    though. My understanding is that the collections mark each document as either
    1 or 0, meaning "relevant or not relevant". That's fine for precision and
    recall...

    http://en.wikipedia.org/wiki/Precision_and_recall

    ... but how do we use these materials to judge relative ranking of successful
    matches when the scores in the answer book are all exactly 1.0?
    1. Move away from this current format / lucene benchmark package?
    ORP should definitely do that, if for no other reason than that there seems to
    be consensus that other search engines need to be supported.

    If it makes sense for Lucene or Lucy to extend the ORP benchmark code, we can
    choose to do so. If we can share at least some of the benchmarking framework
    code across projects, that would be great. I'm not sure whether that will work
    out well in practice, but it's worth a try for Lucy at least -- in a worst
    case scenario, we fork and port.
    I think it is more geared at benchmarking performance.
    For example I do not even know if we can change Similarity easily using a
    .alg file for indexing...
    But it is something for now.

    2. Setup easy framework inside openrelevance itself to do the relevance
    tests?
    I would kind of like to see Open Relevance become the project that sets the
    standard for scientifically rigorous benchmarking, including but of course not
    limited to competitive performance benchmarking.

    There's some overlap between the tasks of benchmarking performance and
    benchmarking relevance.
    For example, such a thing could possibly integrate the stuff Nicola spoke of
    to produce output in a more digestible format.
    Otherwise, each project will have to implement a framework itself (like
    lucene-java's)
    Each project will still need to implement its own client to interface with
    the Open Relevance benchmarking API, but I think a lot can be shared.
    Finally, I still think the whole crowdsourcing/building our own relevance
    stuff is a fantastic idea and should still be pursued.
    It's just that personally I am more capable to help with the above, and I
    think it could complement that separate effort.
    Agreed. But we should move forward with the corpuses we've got today.

    Marvin Humphrey
  • Robert Muir at Nov 20, 2009 at 6:03 am
    Marvin, thanks for your ideas... I gave some quick comments below.
    On Fri, Nov 20, 2009 at 12:34 AM, Marvin Humphrey wrote:
    On Thu, Nov 19, 2009 at 10:47:14PM -0500, Robert Muir wrote:
    1. Doing some kind of "basic relevancy test" for lucene release candidates?
    Marvin mentioned at apachecon how maybe relevancy tests could have caught
    bugs similar to the 2.9.0 scorer bug, (ok maybe not that specific one), but
    I agree.
    For Lucy, ideally I want us to be benchmarking relevancy just like we
    benchmark performance. Two major features of Lucy are pluggable index
    components and the ability to do rapid prototyping in the host language. We
    should have some criteria by which we judge the crazy ideas people come up
    with and we should be able to single out and reward the good ones.
    I think one difference here, though, is that with performance you can
    microbenchmark, although it's tricky, as you know.
    With relevance it's not really safe to do that, in my opinion...
    But for "big" ideas, we definitely can.

    I think in general the more test collections we have, the more accurate the
    picture becomes...
    We don't need a perfect collection immediately, maybe lots of imperfect ones
    will do the trick for the time being?

    One thing we don't do by default in Lucene is take proximity of term matches
    into account. I'd like to measure how much relevance improves when we augment
    basic boolean queries with span queries. That would make it easier to make a
    sensible engineering tradeoff between responsiveness and relevance. And maybe
    if the improvements are significant enough, somebody will build an optimized
    scoring class tree which exploits that information.
    Yes, this is a really good idea to evaluate, although my gut instinct tells
    me the gains will be language-dependent to some extent.
    It's the same challenge faced in MT when related words can be very far apart
    in a sentence in some languages, and the output comes out as gobbledygook
    because some phrase-based MT doesn't use "enough" context for that language.

    If span queries help, I think it would be even better to produce an idea of
    what kind of 'span context' you need for different languages/document
    types to get good results.

    2. Evaluating scoring algorithms? For example, there is a ASL BM25 impl I
    have been using now for some time, with good results.
    Code-wise, changing up Similarity implementations and settings is cake.

    I still don't understand exactly how we judge ranking with these collections,
    though. My understanding is that the collections mark each document as either
    1 or 0, meaning "relevant or not relevant". That's fine for precision and
    recall...

    http://en.wikipedia.org/wiki/Precision_and_recall

    ... but how do we use these materials to judge relative ranking of successful
    matches when the scores in the answer book are all exactly 1.0?
    Oh, you mean more fine-grained measurement where the top 100 docs are all
    relevant, but some are more relevant than others?
    I guess you are right, but I think in general using precision/recall is a
    good indicator that things are being ranked correctly across the board?
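
    To make that concrete: rank-based measures like precision@k and average
    precision work fine with binary judgments, because they only look at where
    the relevant docs land in the result list, not at the raw scores. A rough
    sketch, nothing ORP-specific (doc ids are just strings here):

      import java.util.List;
      import java.util.Set;

      public class BinaryJudgmentMetrics {

          /** Fraction of the top k results that are judged relevant. */
          public static double precisionAtK(List<String> ranked,
                                            Set<String> relevant, int k) {
              int hits = 0;
              for (int i = 0; i < k && i < ranked.size(); i++) {
                  if (relevant.contains(ranked.get(i))) hits++;
              }
              return hits / (double) k;
          }

          /** Average precision: rewards putting relevant docs near the top,
           *  even though each judgment is only 1 or 0. */
          public static double averagePrecision(List<String> ranked,
                                                Set<String> relevant) {
              int hits = 0;
              double sum = 0.0;
              for (int i = 0; i < ranked.size(); i++) {
                  if (relevant.contains(ranked.get(i))) {
                      hits++;
                      sum += hits / (double) (i + 1);  // precision at this rank
                  }
              }
              return relevant.isEmpty() ? 0.0 : sum / relevant.size();
          }
      }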

    I think when I say 'scoring' I am usually referring to algorithms that are
    very different from what Lucene provides via Similarity.
    For example, Lucene does not provide 'average document length' in Similarity
    to support a lot of the more modern formulas like BM25.

    (Side note: I have measured nice gains on Hindi and Persian by using BM25 with
    the avg doc length calculated from norms, which you read in anyway.)
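
    For the curious, the norms trick is roughly the following, assuming Lucene
    2.9-era APIs, the default 1/sqrt(numTerms) length norm, and no index-time
    boosts ("body" is a made-up field name):

      import java.io.File;
      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.search.Similarity;
      import org.apache.lucene.store.FSDirectory;

      public class AvgDocLength {
          public static void main(String[] args) throws Exception {
              IndexReader reader =
                  IndexReader.open(FSDirectory.open(new File(args[0])), true);
              byte[] norms = reader.norms("body");
              if (norms == null) {
                  throw new IllegalStateException("field has no norms");
              }
              double totalLength = 0;
              for (byte b : norms) {
                  float norm = Similarity.decodeNorm(b);  // ~ 1/sqrt(length)
                  if (norm > 0) {
                      totalLength += 1.0 / (norm * norm); // recover ~length
                  }
              }
              // The 1-byte norm encoding makes this a rough estimate, which
              // is fine for BM25's average-document-length term.
              System.out.println("approx. average doc length: "
                                 + (totalLength / reader.maxDoc()));
              reader.close();
          }
      }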

    1. Move away from this current format / lucene benchmark package?
    ORP should definitely do that, if for no other reason than that there seems to
    be consensus that other search engines need to be supported.

    If it makes sense for Lucene or Lucy to extend the ORP benchmark code, we can
    choose to do so. If we can share at least some of the benchmarking framework
    code across projects, that would be great. I'm not sure whether that will work
    out well in practice, but it's worth a try for Lucy at least -- in a worst
    case scenario, we fork and port.
    Yeah, I hate the current format. I only wanted to get things off the ground
    quickly, since Lucene's benchmark package can already deal with it.
    And even Lucene's "quality" package inside benchmark is quite limited:
    for example, it only runs topic queries, but most of the time you also want
    to be able to run topic+description or topic+description+narrative.
    You also have to change code (at query time) to swap in a new analyzer, or
    whatever... you can change the analyzer in the .alg file at index time, though.
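
    If we rolled our own, it could look something like this: a custom
    QualityQueryParser (from the contrib/benchmark quality package) that takes
    the analyzer as a constructor argument and builds its query from the topic
    title plus description. Just a sketch; the "title"/"description" keys and
    the exact constructors depend on the Lucene version:

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.benchmark.quality.QualityQuery;
      import org.apache.lucene.benchmark.quality.QualityQueryParser;
      import org.apache.lucene.queryParser.ParseException;
      import org.apache.lucene.queryParser.QueryParser;
      import org.apache.lucene.search.BooleanClause;
      import org.apache.lucene.search.BooleanQuery;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.util.Version;

      public class TitleDescQQParser implements QualityQueryParser {
          private final Analyzer analyzer;
          private final String indexField;

          public TitleDescQQParser(Analyzer analyzer, String indexField) {
              this.analyzer = analyzer;
              this.indexField = indexField;
          }

          public Query parse(QualityQuery qq) throws ParseException {
              QueryParser qp =
                  new QueryParser(Version.LUCENE_29, indexField, analyzer);
              BooleanQuery bq = new BooleanQuery();
              // Both parts are optional SHOULD clauses; escape() keeps stray
              // TREC punctuation from being parsed as query syntax.
              bq.add(qp.parse(QueryParser.escape(qq.getValue("title"))),
                     BooleanClause.Occur.SHOULD);
              bq.add(qp.parse(QueryParser.escape(qq.getValue("description"))),
                     BooleanClause.Occur.SHOULD);
              return bq;
          }
      }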

    I think it is more geared at benchmarking performance.
    For example I do not even know if we can change Similarity easily using a
    .alg file for indexing...
    But it is something for now.

    2. Setup easy framework inside openrelevance itself to do the relevance
    tests?
    I would kind of like to see Open Relevance become the project that sets the
    standard for scientifically rigorous benchmarking, including but of course not
    limited to competitive performance benchmarking.

    There's some overlap between the tasks of benchmarking performance and
    benchmarking relevance.
    I hadn't really thought about performance as well, but you are right, we
    should keep this in mind. That's a good idea; it's part of the whole picture.

    For example, such a thing could possibly integrate the stuff Nicola spoke of
    to produce output in a more digestible format.
    Otherwise, each project will have to implement a framework itself (like
    lucene-java's)
    Each project will still need to implement its own client to interface with
    the Open Relevance benchmarking API, but I think a lot can be shared.
    OK, let's start thinking of some ways to improve this. Currently you have to
    apply ORP-2 to fix the README instructions,
    but then anyone should be able to run the current stuff if they want, and
    maybe have ideas for improvements.

    Finally, I still think the whole crowdsourcing/building our own relevance
    stuff is a fantastic idea and should still be pursued.
    It's just that personally I am more capable to help with the above, and I
    think it could complement that separate effort.
    Agreed. But we should move forward with the corpuses we've got today.
    Move forward doing more tests/producing worthwhile relevance experiments for
    projects like Lucene?
    Do you think this might create more developer interest for ORP,
    or is the problem that everyone is already familiar with relevance testing,
    but just has no time?


    --
    Robert Muir
    rcmuir@gmail.com
  • Marvin Humphrey at Nov 20, 2009 at 5:44 pm

    On Fri, Nov 20, 2009 at 01:02:48AM -0500, Robert Muir wrote:

    I think one difference here though is that with performance, you can
    microbenchmark, although it's tricky as you know.
    With relevance it's not really safe to do that in my opinion...
    But for "big" ideas, we definitely can.
    Another difference is that it's important to perform multiple iterations for
    performance microbenchmarks, while the results of a relevance benchmark
    should be the same every time.

    Nevertheless, the rough outline is the same for search-time benchmarking.

    * Acquire a collection of documents, either by parsing and optionally
    manipulating an existing collection, or by creating documents artificially.
    * Issue instructions to the client code to build the index according to some
    plan.
    * Launch a client process which runs the queries, storing the results to an
    intermediate format such as JSON or XML.
    * Retrieve and parse the client output.
    * Perform statistical processing as required (e.g. find the truncated mean
    for performance benchmarks).
    * Generate a formatted report.
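
    For the client step, a bare-bones Lucene example might look like the
    sketch below; the field names, the topics map, and the tab-separated
    output are placeholders for whatever format ORP settles on:

      import java.io.File;
      import java.io.PrintWriter;
      import java.util.Map;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.queryParser.QueryParser;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.TopDocs;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      public class RunClient {
          /** topics: topic id -> query text, loaded elsewhere. */
          public static void run(File indexDir, Map<String, String> topics,
                                 PrintWriter out) throws Exception {
              IndexSearcher searcher =
                  new IndexSearcher(FSDirectory.open(indexDir), true);
              QueryParser qp = new QueryParser(Version.LUCENE_29, "body",
                                               new StandardAnalyzer(Version.LUCENE_29));
              for (Map.Entry<String, String> topic : topics.entrySet()) {
                  Query q = qp.parse(QueryParser.escape(topic.getValue()));
                  TopDocs hits = searcher.search(q, 1000);
                  for (int rank = 0; rank < hits.scoreDocs.length; rank++) {
                      String docName =
                          searcher.doc(hits.scoreDocs[rank].doc).get("docid");
                      // One result per line: topic, rank, doc name, score.
                      out.println(topic.getKey() + "\t" + (rank + 1) + "\t"
                                  + docName + "\t" + hits.scoreDocs[rank].score);
                  }
              }
              out.flush();
              searcher.close();
          }
      }
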
    We don't need a perfect collection immediately, maybe lots of imperfect ones
    will do the trick for the time being?
    Agreed -- there is a lot we can do with imperfect collections.
    oh you mean more fine-grained measurement where the top 100 docs are all
    relevant, but some more relevant than others?
    I guess you are right, but I think in general using precision/recall is a
    good indicator that things are being ranked correctly across the board?
    OK, I trust that existing academic approaches are sufficient for measuring and
    refining relevance. If that's truly what's done, then all we need to do is
    apply it.
    I think when i say 'scoring' i am usually referring to algorithms that are
    very different from what lucene provides via Similarity.
    Right, and this is where pluggable scoring components come in. "Flexible
    Indexing" a la LUCENE-1458, too, provided that the eventual interface is
    simple enough to be practical.
    yeah i hate the current format. I only wanted to get things off the ground
    quickly since lucene's benchmark pkg can already deal with it.
    And even lucene's "quality" pkg in the benchmark pkg is so limited:
    for example it only runs topic queries but most of the time you want to also
    be able to run topic+description or topic+description+narrative.
    You also have to change code (at query time) to swap in a new analyzer, or
    whatever... you can change analyzer in the .alg file at index time though.
    It's going to be a little tricky dividing up responsibilities between client
    code and the core ORP benching tools. The line between data description and
    instruction specification is fuzzy when describing algorithms.

    I think we will want to leave more to the client rather than less so that
    optimization is fully in the hands of the client. We'll have to trust that
    the client implementation is correct.
    Agreed. But we should move forward with the corpuses we've got today.
    Move forward doing more tests/producing worthwhile relevance experiments for
    projects like Lucene? Sure.
    Do you think this might create more developer interest for ORP,
    or is the problem that everyone is already familiar with relevance testing,
    but just has no time?
    Perhaps we could proceed like so:

    1. Build a benchmarking framework which leverages existing collections.
    2. Use some slice of the Apache email archives and some sort of
    crowdsourcing to create a deeply flawed but still useful corpus.
    3. Build a better corpus using lessons learned.

    The problem is that creating a spiffy corpus is labor-intensive and thus
    expensive. Everybody wants that, but it's hard to get over the initial hump.
    So let's try to achieve some intermediate objectives first.

    Marvin Humphrey
  • Grant Ingersoll at Nov 20, 2009 at 1:28 pm

    On Nov 19, 2009, at 10:47 PM, Robert Muir wrote:

    Hey, I have a couple questions/ideas and I wonder what your opinions are on
    them.

    First of all, I am interested in what interesting tasks we can do now that
    there is some rough system for evaluating relevancy, i.e. how can we put
    this to work.

    1. Doing some kind of "basic relevancy test" for lucene release candidates?
    Marvin mentioned at apachecon how maybe relevancy tests could have caught
    bugs similar to the 2.9.0 scorer bug, (ok maybe not that specific one), but
    I agree.
    +1
    2. Evaluating scoring algorithms? For example, there is a ASL BM25 impl I
    have been using now for some time, with good results.
    Maybe we could run some tests on different impls to try to help move jira
    issues or improvements along.
    This is definitely a major component, and in a lot of ways it is the reason
    this project was started. We need a way for us developers to be able to talk
    about relevance in a standard way, using all open materials, so that we can
    compare relevance. For instance, length norm is most likely favoring shorter
    documents in Lucene right now; what if we were to change it?
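
    To make the length-norm example concrete, here is the kind of one-off
    experiment a shared corpus would let us actually evaluate: a Similarity
    with a flatter length normalization (the pivot of 1000 below is
    arbitrary), run side by side with the default. A sketch against the
    2.9-era Similarity API:

      import org.apache.lucene.search.DefaultSimilarity;

      public class FlatterLengthNormSimilarity extends DefaultSimilarity {
          // Dampened alternative to the default 1/sqrt(numTerms), so long
          // documents are penalized less. Note that length norms are baked
          // into the index, so you have to reindex to compare runs fairly.
          @Override
          public float lengthNorm(String fieldName, int numTerms) {
              return (float) (1.0 / Math.sqrt(1.0 + numTerms / 1000.0));
          }
      }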

    Other ideas?

    Next, I am interested in ways we could improve the very rough system to make
    things easier.

    1. Move away from this current format / lucene benchmark package?
    The lucene benchmark is in my opinion, maybe not the best framework for
    experimentations.
    I think it is more geared at benchmarking performance.
    For example I do not even know if we can change Similarity easily using a
    .alg file for indexing...
    But it is something for now.
    The thing about the benchmark package is that it is pluggable, so making it
    possible to change Similarity should not be too hard.

    2. Setup easy framework inside openrelevance itself to do the relevance
    tests?
    Would this be better? Or does it not belong here, but instead inside other
    projects (such as in lucene-java, solr, lucy etc)
    For example, such a thing could possibly integrate the stuff Nicola spoke of
    to produce output in a more digestible format.
    Otherwise, each project will have to implement a framework itself (like
    lucene-java's)
    We could still produce raw files in a consistent (maybe better) format, for
    testing other search engines, etc.

    Other ideas?
    Yeah, I think we will have to write some code. There was some talk originally about crowd-sourcing tools, too, for collecting judgments, etc.

    Finally, I still think the whole crowdsourcing/building our own relevance
    stuff is a fantastic idea and should still be pursued.
    It's just that personally I am more capable to help with the above, and I
    think it could complement that separate effort.
    Agreed.
  • Omar Alonso at Nov 20, 2009 at 7:34 pm
    It would be nice to use crowdsourcing for this. It does need to be on Mechanical Turk. Crowdsourcing is a paradigm for relevance evaluation that could be useful here.

    o.
    Yeah, I think we will have to write some code. There was some talk
    originally about crowd-sourcing tools, too, for collecting judgments, etc.

    Finally, I still think the whole crowdsourcing/building our own relevance
    stuff is a fantastic idea and should still be pursued.
    It's just that personally I am more capable to help with the above, and I
    think it could complement that separate effort.
    Agreed.
  • Grant Ingersoll at Nov 20, 2009 at 9:49 pm

    On Nov 20, 2009, at 2:34 PM, Omar Alonso wrote:

    It would be nice to use crowdsourcing for this. It does need to be on Mechanical Turk.
    Do you mean "does not need to be"?
    Crowdsourcing is a paradigm for relevance evaluation that could be useful here.
    +1
  • Omar Alonso at Nov 20, 2009 at 9:59 pm
    Ooops. Thanks for the correction :).

    o.

