On Sat, Aug 1, 2009 at 1:23 AM, Andrzej Bialecki wrote:
Grant Ingersoll wrote:
OK, so how do we get this started? Seems like there are a lot of
collections out there we could use. Also, we can crawl. Seems the tricky
part is getting judgments.
I think we should first establish what kind of relevance judgments we want.
This looks like two different things.
One thing is deciding what we use to get "a" collection of documents -
a corpus. It seems like a very good idea to me to create a
heterogeneous collection of documents such as Wikipedia to kick off
ORP. I guess we do not need a huge collection of documents to get
started.
@Grant: I might have missed something, but do we have a list of available
collections on some wiki page? It would be great to have something like
that.
Once we get this project going we can start building various
collections from all kinds of areas. I find it interesting that all the
collections I have seen are built from large documents, but with the
advent of mobile devices collections could also be built from
"data-records" like SMS, address records, image metadata, and
audio metadata, where each text/document is relatively small. I have
found that searching over such "small" documents places different
requirements on scoring parameters than web search...
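To make that point concrete, here is a minimal BM25 sketch (plain Python, not Lucene's implementation; all numbers are made up for illustration) showing how the length-normalization parameter b treats a short "record" very differently from a long web page:

```python
import math

def bm25_score(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
    """Score one term of a query. b controls how strongly the score
    is normalized by document length (b=0 disables normalization)."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + length_norm)

# A short "record" (e.g. an SMS, 8 tokens) vs. a web page (800 tokens),
# each containing the query term once, in a hypothetical 10k-doc corpus.
short = bm25_score(tf=1, df=100, N=10_000, doc_len=8, avg_len=400)
long_ = bm25_score(tf=1, df=100, N=10_000, doc_len=800, avg_len=400)
print(short > long_)  # -> True: the short record is scored much higher
```

With b at its usual default the tiny record dominates, which is exactly why a collection of small records would exercise different parameter settings than a web corpus.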
Another thing is what we do with these collections. I kind of like the
idea of having something like a webapp that is able to perform corpus
selection, distance measurement, etc.
I want to extend Andrzej's list and throw out some random thoughts...
- It would be nice to have something like an intermediate representation
of a corpus that can be plugged into a relevance measurement app.
- such a relevance measurement should be able to work on top of custom
search applications. There could be an API which gives applications
access to the corpus for indexing, so they can search this corpus
through the API. I can imagine lots of use cases where users want to
judge their custom search engine against a corpus and compare the
results.
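One possible shape for such an API, sketched in Python purely as an illustration (none of these class or method names exist in ORP; they are assumptions about what the contract might look like):

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

class CorpusAPI(ABC):
    """Hypothetical ORP-side interface that hands documents to a
    custom engine for indexing."""

    @abstractmethod
    def documents(self, corpus_id: str) -> Iterable[dict]:
        """Stream raw documents (id + text) of the named corpus."""

class SearchEngineUnderTest(ABC):
    """What a user's custom search application would implement."""

    @abstractmethod
    def index(self, docs: Iterable[dict]) -> None: ...

    @abstractmethod
    def search(self, query: str, top_n: int = 10) -> List[str]:
        """Return document ids, best match first."""

def evaluate(corpus: CorpusAPI, engine: SearchEngineUnderTest,
             corpus_id: str, queries: Iterable[str]) -> Dict[str, List[str]]:
    """Index the shared corpus, run each query, collect result lists
    that can later be compared against reviewed judgments."""
    engine.index(corpus.documents(corpus_id))
    return {q: engine.search(q) for q in queries}
```

The point of the indirection is that any engine, Lucene-based or not, could be judged against the same corpus by implementing the two abstract methods.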
simon (in the middle of moving his apartment)
1. given a corpus and a query, define an ordered list of top-N documents
that are relevant to the query. This is our baseline. Getting this sort of
information is very time-consuming and subjective.
2. given a corpus, a query and a list of top-N results obtained from a real
search, define which results are relevant and how they should be ordered. The
reviewed list of top-N results then becomes the initial approximation of our
baseline. Calculate a distance metric between the real and reviewed results,
and adjust the ranking to maximize this metric.
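A placeholder for such a metric, just to fix ideas (ORP would need to settle on a real one, e.g. Kendall's tau or NDCG against the reviewed ordering; whether you maximize or minimize depends on the metric's polarity):

```python
def rank_distance(real, reviewed):
    """Average absolute rank displacement between the engine's list and
    the reviewer's list. Documents the reviewer dropped count as
    maximally displaced. 0.0 means the two lists agree exactly."""
    n = len(real)
    reviewed_pos = {doc: i for i, doc in enumerate(reviewed)}
    total = sum(abs(i - reviewed_pos.get(doc, n))
                for i, doc in enumerate(real))
    return total / n if n else 0.0

real = ["d1", "d2", "d3", "d4"]
reviewed = ["d2", "d1", "d3"]        # reviewer swapped top two, dropped d4
print(rank_distance(real, reviewed))  # -> 0.75
```

Driving this number toward zero by tuning ranking parameters is the feedback loop the second scenario describes.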
The second scenario could be handled by a webapp, which could present the
following areas of functionality:
* corpus selection and browsing
* searching using selected search impl and its ranking parameters, and
storing tuples of <corpus, impl, query, results>
* review of the results (marking relevant / non-relevant, reordering), and
saving of tuples <corpus, impl, query, reviewed results>
* calculation of distance metrics.
* adjustment of ranking parameters for a given search implementation.
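The tuples the webapp would store could be as simple as a record like the following (field names are illustrative, not a settled ORP schema):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvaluationRun:
    """One <corpus, impl, query, results> tuple from the webapp, plus
    the reviewer's corrected list once the review step has happened."""
    corpus: str                           # corpus identifier
    impl: str                             # search impl + ranking parameters
    query: str
    results: List[str]                    # doc ids as returned by the engine
    reviewed: Optional[List[str]] = None  # reviewer-reordered/filtered list

run = EvaluationRun("wikipedia-2009", "lucene-default", "open relevance",
                    ["d7", "d2", "d9"])
run.reviewed = ["d2", "d7"]  # reviewer promoted d2, dropped d9
print(run.reviewed)          # -> ['d2', 'd7']
```

Keeping the raw and reviewed lists side by side is what makes the distance calculation and the later parameter adjustment reproducible.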
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com