FYI, I added a page to the wiki with some links to existing test collections
that can be downloaded along with queries and relevance judgements. Some of
these are smaller, maybe not perfect, but something to start with for
playing around.
If you know of others, please add!

http://cwiki.apache.org/confluence/display/ORP/ExistingCollections

also, if you have some ideas on how to perhaps create some ant tasks to make
downloading/running these thru the lucene benchmark package easier, that
would be great too. this is a bit frustrating because many collections claim
to be "trec" format but they are all formatted slightly differently...

--
Robert Muir
rcmuir@gmail.com

  • Marvin Humphrey at Nov 10, 2009 at 6:07 am

    Robert Muir:

    fyi, I added a page to the wiki with some links to existing test collections
    that can be downloaded along with queries and relevance judgements.
    So, what comparisons can we set up using these collections?

    I seem to recall you suggesting at ApacheCon that they would be handy when
    judging Analyzer mods.

    These collections are all binary assertions -- relevant/not-relevant for a
    given query -- right? Am I correct in presuming that such corpora can't help
    us to judge scoring and ranking algorithms, or Similarity implementations?
    also, if you have some ideas on how to perhaps create some ant tasks to make
    downloading/running these thru the lucene benchmark package easier, that
    would be great too.
    Hmm, that approach is specific to Lucene Java. It's not handy for either of
    the projects I work on (Lucy, KinoSearch).

    At some point, I'd planned to write a loose port of the Lucene benchmarking
    suite so that Lucy (at least) could exploit it... The benchmarking code has
    gotten so elaborate and complex now, though -- I wonder how easy it will be to
    generalize...
    this is a bit frustrating because many collections claim to be "trec" format
    but they are all formatted slightly differently...
    Sounds like we need one module per corpus to explode it into a common format.

    Is ant the best approach here? Maybe we start off with a scripting language
    like Python?

    Marvin Humphrey
  • Robert Muir at Nov 10, 2009 at 6:34 am
    So, what comparisons can we set up using these collections?
    I think we can be creative. For example, I used one of these tonight to test
    LUCENE-1812, Andrzej's index pruning tool. Results showed that it works as
    he advertised at ApacheCon...

    Also, we should be careful about the English ones I linked to (or
    preferably, find bigger ones), because they are smallish collections.

    I seem to recall you suggesting at ApacheCon that they would be handy when
    judging Analyzer mods.
    Yeah, I definitely don't think any results should be gospel for analyzers or
    scoring or anything else, but then again I think we could detect if some
    change is completely broken or silly (bugs, etc.).

    These collections are all binary assertions -- relevant/not-relevant for a
    given query -- right? Am I correct in presuming that such corpora can't help
    us to judge scoring and ranking algorithms, or Similarity implementations?
    I think most of them are binary... but I think I disagree with your second
    statement: these kinds of collections are used to compare scoring/ranking
    algorithms all the time!

    also, if you have some ideas on how to perhaps create some ant tasks to make
    downloading/running these thru the lucene benchmark package easier, that
    would be great too.
    Hmm, that approach is specific to Lucene Java. It's not handy for either of
    the projects I work on (Lucy, KinoSearch).
    You raise a good point here. Really, at the end of the day, you just want to
    produce a .txt file that you throw at the trec_eval command-line program or
    something similar. Doing it in a lucene-java specific way doesn't allow us
    to easily evaluate things even in solr, which for example has analysis
    components that affect relevance!
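
    (For reference, the kind of .txt files I mean are dead simple. trec_eval
    basically wants a qrels file of judgements, one per line:

        401 0 FT911-3 1              (topic, iteration, docno, relevance)

    and a results/run file from your engine:

        401 Q0 FT911-3 1 12.38 myrun (topic, "Q0", docno, rank, score, run tag)

    Values here are made up, but that's the whole interface: anything that can
    write lines like that can be evaluated.)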

    I guess one approach could be to create scripts and stuff here that download
    and munge these collections into a consistent format, and then lucy,
    lucene-java, solr, whatever would have an easier time running the
    evaluations?

    At some point, I'd planned to write a loose port of the Lucene benchmarking
    suite so that Lucy (at least) could exploit it... The benchmarking code has
    gotten so elaborate and complex now, though -- I wonder how easy it will be to
    generalize...
    this is a bit frustrating because many collections claim to be "trec" format
    but they are all formatted slightly differently...
    Sounds like we need one module per corpus to explode it into a common format.

    Is ant the best approach here? Maybe we start off with a scripting language
    like Python?
    In reference to both your comments above, I don't modify the lucene
    benchmarking code really too much to run my tests; sometimes I change the
    analyzer or scoring but that's it.

    Instead, I use sed and perl and whatnot to reformat things into the format
    the benchmark package wants... so I guess this is already what I am doing
    (scripting language).

    Marvin Humphrey


    --
    Robert Muir
    rcmuir@gmail.com
  • Simon Willnauer at Nov 10, 2009 at 8:42 pm

    On Tue, Nov 10, 2009 at 7:33 AM, Robert Muir wrote:
    So, what comparisons can we set up using these collections?
    I think we can be creative. for example I used one of these tonight to test
    LUCENE-1812, Andrzej's index pruning tool. Results showed that it works as
    he advertised at apachecon...

    also, we should be careful about the english ones i linked to (or
    preferably, find bigger ones), because they are smallish collections.

    I seem to recall you suggesting at ApacheCon that they would be handy when
    judging Analyzer mods.
    Yeah, definitely don't think any results should be gospel for analyzers or
    scoring or anything else, but then again I think we could detect if some
    change is completely broken or silly (bugs, etc).
    This would bring huge value to lucene and its derivatives. This
    sounds like a very good point to start from, especially until we have sorted
    out all the licensing issues, how to distribute collections, or what we
    want to crawl. There is a huge +1 from my side to get started with the
    small collections - 100% more than we have today.
    These collections are all binary assertions -- relevant/not-relevant for a
    given query -- right? Am I correct in presuming that such corpora can't help
    us to judge scoring and ranking algorithms, or Similarity implementations?
    I think most of them are binary... but I think I disagree with your second
    statement: these kinds of collections are used to compare scoring/ranking
    algorithms all the time!
    AFAIK, those collections yield pretty good results for all kinds of
    relevance judgements though.
    also, if you have some ideas on how to perhaps create some ant tasks to make
    downloading/running these thru the lucene benchmark package easier, that
    would be great too.
    Hmm, that approach is specific to Lucene Java.  It's not handy for either of
    the projects I work on (Lucy, KinoSearch).
    You raise a good point here. Really at the end of the day, you just want to
    produce a .txt file that you throw at the trec_eval commandline program or
    something similar. Doing it in a lucene-java specific way doesn't allow us
    to easily evaluate things even in solr, for example it has analysis
    components that affect relevance!
    This is maybe the most important issue for the first step. I would
    really like to see a standard format which can be parsed easily by
    whatever language you use. I personally prefer JSON for almost
    everything as it is so easy to parse, read (with human eyes), and write.
    Ant still sounds like a good plan as there are many functions
    already implemented and it is easy to extend.
    +1 for creating an issue for format and transformation.
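    Just to illustrate (field names invented on the spot), a judgement record
    could then be as trivial as:

        {"topic": "401", "docno": "FT911-3", "relevant": 1}

    and every language we care about already has a parser for that.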
    I guess one approach could be to create scripts and stuff here that download
    and munge these collections into a consistent format, and then lucy,
    lucene-java, solr, whatever would have an easier time running the
    evaluations?
    see above
    At some point, I'd planned to write a loose port of the Lucene benchmarking
    suite so that Lucy (at least) could exploit it...  The benchmarking code has
    gotten so elaborate and complex now, though -- I wonder how easy it will be to
    generalize...
    this is a bit frustrating because many collections claim to be "trec" format
    but they are all formatted slightly differently...
    Sounds like we need one module per corpus to explode it into a common format.

    Is ant the best approach here?  Maybe we start off with a scripting language
    like Python?
    you wanna use your object model, right? :)
  • Robert Muir at Nov 10, 2009 at 8:49 pm
    Hi Simon, thanks for your comments.

    I guess in my opinion, the fastest way to having something would be to
    create scripts that munge these various collections into a standard format,
    as mentioned earlier.
    And I think the easiest format would actually be to format queries,
    judgements, and text into what the Lucene-java benchmark expects already.
    This format is pretty simple and I don't think it would be a headache to use
    for other projects such as lucy or solr or maybe even comparisons against
    other software.
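
    For the queries, that basically means the usual TREC topic layout, something
    like this (abbreviated, from memory):

        <top>
        <num> Number: 401
        <title> foreign minorities, Germany
        <desc> Description:
        ...
        <narr> Narrative:
        ...
        </top>

    plus a qrels file for the judgements, and trec-style docs for the text.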

    This is of course biased by the fact that I am lazy and I don't want to mess
    with the lucene benchmark package :)

    I would like to create a JIRA issue to start working on this task, as I am
    maintaining this various junk internally at the moment.

    Does anyone have specific preference to what programming language/build
    system/etc is desired? I don't have a preference, I just care about
    relevance.

    --
    Robert Muir
    rcmuir@gmail.com
  • Grant Ingersoll at Nov 10, 2009 at 10:26 pm

    On Nov 10, 2009, at 3:48 PM, Robert Muir wrote:

    Hi Simon, thanks for your comments.

    I guess in my opinion, the fastest way to having something would be to
    create scripts that munge these various collections into a standard format,
    as mentioned earlier.
    And I think the easiest format would actually be to format queries,
    judgements, and text into what the Lucene-java benchmark expects already.
    This format is pretty simple and I don't think it would be a headache to use
    for other projects such as lucy or solr or maybe even comparisons against
    other software.

    This is of course biased by the fact that I am lazy and I don't want to mess
    with the lucene benchmark package :)

    I would like to create a JIRA issue to start working this task, as I am
    maintaining this various junk internally at the moment.

    Does anyone have specific preference to what programming language/build
    system/etc is desired? I don't have a preference, I just care about
    relevance.
    Since most of our projects are in Java, I would probably lean that
    way, but if it is just meant to be lightweight, then we could just use
    a scripting lang.

  • Robert Muir at Nov 10, 2009 at 10:30 pm
    Grant, I am fine with java, really. Marvin brought up python, I am willing
    to learn the language if that's what it takes (I only have minor experience
    with it so far).

    Really, I think that any code that munges these collections isn't something
    we should worry about being nice from a software devel standpoint.

    To correct the formats of this stuff, I always use sed, grep, or even
    vi/notepad. It's a throwaway type of thing in my opinion.

    If people feel strongly towards any particular language/build system, let me
    know. Otherwise I want to start working on a patch sooner rather than later.
    Someone smarter than me could always help improve it.

    --
    Robert Muir
    rcmuir@gmail.com
  • Simon Willnauer at Nov 10, 2009 at 10:43 pm
    IMO we should not waste too much time on a decision about a programming
    language. Let's just go for Java / ANT as we all know what we are
    doing.

    Thoughts?

    +1 for java / ANT
  • Robert Muir at Nov 10, 2009 at 11:10 pm
    +1 (for agreeing on just something, let's get going on this!)
    On Tue, Nov 10, 2009 at 5:42 PM, Simon Willnauer wrote:

    IMO we should not waste too much time on a decision about a programming
    language. Let's just go for Java / ANT as we all know what we are
    doing.

    Thoughts?

    +1 for java / ANT

    --
    Robert Muir
    rcmuir@gmail.com
  • Andrzej Bialecki at Nov 10, 2009 at 11:47 pm

    Robert Muir wrote:
    +1 (for agreeing on just something, lets get going on this!)

    On Tue, Nov 10, 2009 at 5:42 PM, Simon Willnauer <
    simon.willnauer@googlemail.com> wrote:
    IMO we should not waste too much time on a decision about a programming
    language. Let's just go for Java / ANT as we all know what we are
    doing.

    Thoughts?

    +1 for java / ANT
    In the spirit of a long-time Unix hacker, I love to use one-liners as much as
    the next guy... but they are hard to maintain. So +1 to Java/ant, with a
    comment that we should not get too religious about _not_ using *nix
    utilities where it makes sense - ant can drive a shell script that
    munges the format with grep/sed/awk if that's easier/faster to do than
    writing a Java class.


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com
  • Andrzej Bialecki at Nov 23, 2009 at 9:30 am

    Simon Willnauer wrote:
    IMO we should not waste too much time on a decision about a programming
    language. Let's just go for Java / ANT as we all know what we are
    doing.

    Thoughts?
    As we start adding collections, IMHO it's important that we add a
    per-collection LICENSE.txt and README.txt - what good is a collection
    from some random URL without a record of its provenience and its
    suitability to be used in this project?


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com
  • Robert Muir at Nov 23, 2009 at 12:04 pm
    You are right, let's open a JIRA issue.
    On Mon, Nov 23, 2009 at 4:29 AM, Andrzej Bialecki wrote:

    Simon Willnauer wrote:
    IMO we should not waste too much time on a decision about a programming
    language. Let's just go for Java / ANT as we all know what we are
    doing.

    Thoughts?
    As we start adding collections, IMHO it's important that we add a
    per-collection LICENSE.txt and README.txt - what good is a collection from
    some random URL without a record of its provenience and its suitability to
    be used in this project?



    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com

    --
    Robert Muir
    rcmuir@gmail.com
  • Andrzej Bialecki at Nov 23, 2009 at 2:02 pm

    Robert Muir wrote:
    you are right, lets open a JIRA issue
    Done, ORP-3.


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com
  • Robert Muir at Nov 23, 2009 at 3:25 pm
    Thanks Andrzej, I added a few thoughts of my own.

    I might be completely off-base, but I think we should exercise a lot of
    caution not to give the impression these are Apache works.
    On Mon, Nov 23, 2009 at 9:01 AM, Andrzej Bialecki wrote:

    Robert Muir wrote:
    you are right, lets open a JIRA issue
    Done, ORP-3.



    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com

    --
    Robert Muir
    rcmuir@gmail.com
  • Andrzej Bialecki at Nov 23, 2009 at 3:48 pm

    Robert Muir wrote:
    thanks Andrzej, I added a few thoughts of my own.

    I might be completely off-base, but I think we should exercise a lot of
    caution to not give the impression these are apache works.
    Good thinking, I agree with your comments - this is a tricky issue, and
    it's better to err on the side of caution.


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com
  • Marvin Humphrey at Nov 10, 2009 at 11:46 pm

    Simon Willnauer replied to me:

    Sounds like we need one module per corpus to explode it into a common
    format.

    Is ant the best approach here?  Maybe we start off with a scripting
    language like Python?
    you wanna use your object model, right? :)
    Haha, no. :) The Lucy object model is completely unrelated and wouldn't have
    been touched under what I was suggesting.

    In order to launch benchmarking apps for indexing libraries written in
    different languages -- Java Lucene, Perl Lucy, Python Lucy, etc -- our central
    library will need to launch external processes. That's the very definition of a
    scripting task.

    From <http://en.wikipedia.org/wiki/Scripting_language>:

    A scripting language, script language or extension language is a programming
    language that allows control of one or more software applications. "Scripts"
    are distinct from the core code of the application, which is usually written
    in a different language, and are often created or at least modified by the
    end-user.

    I suggested Python in particular because from a distance it looks like Python
    3.x has pretty decent Unicode support.

    Robert Muir wrote:
    instead, i use sed and perl and what not to reformat things into the format
    the benchmark package wants... so I guess this is already what I am doing
    (scripting language)
    I don't recommend Perl for this application. It's great for hacking up fast
    file wrangling stuff, but its Unicode support is hard to use and very hard to
    debug unless you understand the underlying implementation, which for
    backwards-compatibility reasons is very very complicated. I know it backwards
    and forwards so I could get good results, but I don't think other people
    should have to make that investment.

    The Java/ant combo wouldn't be my preference for the opposite reason: the
    Unicode support is there, but it's a lot more verbose and unwieldy for
    scripting tasks and quick file hackups.

    This would all matter more if we end up generalizing some portion of the
    Lucene benchmarking suite under Open Relevance so that other projects could
    use it. (The fact that Mike McCandless, primary author of the benchmarking
    suite, is fluent in Python, also drove the suggestion.) That seems natural
    because we need some framework to run the relevance tests under; exporting to
    a common intermediate format is nice, but running actual benchmarks is nicer.
    And since search-time benchmarking capabilities are sorely needed for Lucy,
    I'd get involved.

    However, I'd pretty much resigned myself to porting a separate implementation
    of the benchmarking suite eventually. And given the way this thread has
    progressed since I started writing this reply, looks like that's what I'll be
    falling back to after all. Oh well... no gain, no loss.

    Marvin Humphrey
  • Robert Muir at Nov 11, 2009 at 12:14 am
    Marvin, I am a little concerned about your comments.

    I think that there might be a little confusion:
    1. the trec portion of the lucene benchmark suite is essentially standalone;
    it doesn't really interact with the other components.
    2. creating some scripts/code to download and reformat collections into a
    standardized format; I think this is the way to go?
    Why not agree on the conventional TREC format that the lucene benchmark
    package expects?
    3. we are talking about downloading and reformatting text files into text
    files; seriously, I don't think you need to (or should have to) understand
    really anything about the lucene benchmark impl to make use of this.

    By the way, if there's anything I can do to make this concept more amenable
    to you, please reply!
    Actually, I 100% agree that we should not limit anything to any specific
    lucene implementation.

    In fact, I don't want to try to imply some scope creep, but I'm completely
    for the idea that this kind of thing could be
    re-used to compare even non-apache projects (other search engines, etc).
    Surely we have some stuff to learn from each other.

    This is just about boilerplate code, build scripts, downloading,
    reformatting, very boring stuff :)

    However, I'd pretty much resigned myself to porting a separate implementation
    of the benchmarking suite eventually. And given the way this thread has
    progressed since I started writing this reply, looks like that's what I'll be
    falling back to after all. Oh well... no gain, no loss.

    Marvin Humphrey


    --
    Robert Muir
    rcmuir@gmail.com
  • Marvin Humphrey at Nov 11, 2009 at 5:56 pm

    On Tue, Nov 10, 2009 at 07:13:58PM -0500, Robert Muir wrote:

    Why not agree on the conventional TREC format that the lucene benchmark
    package expects?
    +0, seems logical, but I'm not well informed about either the format itself or
    possible alternatives.
    3. we are talking about downloading and reformatting text files into text
    files, seriously I don't think you need to/should understand
    really anything about the lucene benchmark impl to make use of this.
    OK. The next logical step is to actually do something with the files, and I
    figured you were going there. I didn't realize that simply converting the
    files was more-or-less sufficient for coaxing something useful out of the
    Lucene benchmark suite.

    Please carry on.
    In fact, I don't want to try to imply some scope creep, but I'm completely
    for the idea that this kind of thing could be re-used to compare even
    non-apache projects (other search engines, etc).
    If it's not going to work with other search engines, the project should be
    called "Open Irrelevance". :P

    PS: I misremembered the authorship of the Lucene benchmarking suite earlier.
    McCandless has been modding it recently, but the original patch was a team
    effort from Grant Ingersoll and Doron Cohen with prior art contributions by
    Andrzej Bialecki and myself. Apologies.

    Marvin Humphrey
  • Robert Muir at Nov 12, 2009 at 11:36 am
    Marvin, I'm not really sure it's the format that we want to stick with
    either?

    For example, converting everything to a least common denominator will work
    for now, but some collections might have special properties (i.e. fields with
    categorization values, other interesting things).

    I just want to get something started and working; worst case: nobody likes
    the patch and we are back to where we are now!

    --
    Robert Muir
    rcmuir@gmail.com
  • Nicola Ferro at Nov 12, 2009 at 12:19 pm
    Our experience in organizing and running CLEF for 10 years has been to
    not go for a least common denominator but to leave collections as they are.

    The rationale is that:
    1) you lose the link/alignment with the original collection
    2) you lose or discard information (tags) that might be useful in the
    future for unforeseen evaluation tasks / reuses of the collection
    3) you might introduce errors, if you miss something in the semantics
    of the original collection or you have bugs in the software
    4) it is almost impossible to develop a format that fits all the
    domains (e.g. news, library collections, patent collections, juridical
    documents, ...) or mixed media collections (images+text, speech+text, ...)
    5) errors / alternative transliterations (e.g. with accents, without
    accents) / documents with empty content/tags in the collection
    represent a real-world situation which search engines should be able to
    cope with.

    The only thing we ask of the new collections (not the legacy ones) is that
    they be in XML, UTF-8, with unique document identifiers (possibly
    according to some meaningful/agreed format).

    All the best,
    Nicola



    ---------------------------------------------------------------------------------
    Nicola Ferro - Ph.D. in Computer Science
    Assistant Professor

    Department of Information Engineering (DEI)
    University of Padua
    Via Gradenigo, 6/A - 35131 Padova - Italy
    Tel +39 049 827 7939 Fax: +39 049 827 7799

    skype: nicola.ferro
    e-mail: ferro@dei.unipd.it
    home page: http://ims.dei.unipd.it/members/ferro/
    ----------------------------------------------------------------------------------

  • Robert Muir at Nov 12, 2009 at 12:39 pm
    Nicola,

    I agree with your assessment; however, if someone wants the collection 'as
    it is', they can already do this without any openrelevance project (just
    download the collection, and you have it).

    What I am proposing is some scripts to create a consistent format to make
    consumption easier; otherwise every project that wants to run the tests must
    implement parsers/etc. for each collection, due to these inconsistencies.

    Most of the formatting differences I speak of are things such as using
    various different tags to refer to the document id (Docname, DOCNAME, DOCID,
    ...), and different formatting of the queries and judgements files.

    I am not talking about changing any of the content (accents or errors), and
    I don't see how this really loses anything from the original collection...

    I'll look at including all tags; for lucene-java we can change
    TrecContentSource to ignore tags that don't matter for the time being.
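
    To make the idea concrete, here is a rough sketch of the kind of
    reformatting script I mean (the tag names and file layout are invented
    examples, not any particular collection, and this is not committed code):

    # normalize_docs.py - sketch only: rewrite whatever document-id tag a
    # collection uses (DOCID, DOCNAME, Docname, ...) into a single <DOCNO> tag,
    # leaving the rest of the document text untouched.
    import re
    import sys

    ID_TAG = re.compile(
        r"<(?:DOCID|DOCNAME|Docname)>\s*(.*?)\s*</(?:DOCID|DOCNAME|Docname)>",
        re.IGNORECASE | re.DOTALL)

    def normalize(text):
        # keep the id value, standardize the surrounding tag
        return ID_TAG.sub(r"<DOCNO>\1</DOCNO>", text)

    if __name__ == "__main__":
        sys.stdout.write(normalize(sys.stdin.read()))

    Something like "python normalize_docs.py < raw.txt > trec.txt" per file; the
    real scripts would also handle the queries and judgements, but this is the
    flavor of it.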

  • Nicola Ferro at Nov 12, 2009 at 1:51 pm
    Dear Robert,

    That's fine. Maybe I've been a little bit over-concerned.

    But what do you plan to do with non-XML collections? Do you plan to
    XMLify them? And what about those that are in SGML with a DTD? Do you
    plan to translate them to XML and also provide translations of their
    document type, e.g. to XML Schema? In general, do you plan to add a
    document type for all the collections? We usually do that in CLEF
    because it obviously makes it possible to validate documents, and it
    is good documentation for the users of the collections, who know what
    to expect.

    In general, I'm more than in favour of having standardised XML-based
    formats for topics, qrels, and runs instead of / in conjunction with
    the legacy TREC format - which is sometimes redundant and sometimes
    esoteric with respect to unused fields.
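
    (For reference, a legacy TREC topic looks roughly like the following -
    the text below is only an illustration, not an actual topic - with the
    redundant "Number:" / "Description:" prefixes repeated inside the tags:

    <top>
    <num> Number: 001
    <title> sample topic title

    <desc> Description:
    A sentence or two describing what the user is looking for.

    <narr> Narrative:
    A longer statement of what makes a document relevant or not relevant.

    </top>
    )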

    We have developed straightforward XML formats in CLEF for runs, qrels,
    and topics, but we have publicly used only the one for topics, because
    participants are used to trec_eval, which does not work with runs and
    qrels in XML. You can have a look at the XML topics at:

    http://direct.dei.unipd.it/10.2452/100-AH

    If you are interested, we would be happy to share those formats.

    By the way, we have also developed a Java wrapper for trec_eval 8.0
    (via JNI) which allows us to use trec_eval as a plain Java object
    while still using its original code for the computations, to ensure
    compliance with the actual implementation used at TREC (same results,
    same bugs if any -> comparable performance figures). Maybe this could
    be of interest to you as well.

    All the best,
    Nicola



  • Robert Muir at Nov 12, 2009 at 3:07 pm
    Nicola, actually for now I am thinking of just using the legacy TREC format.
    The only reason is that this way, I don't have to change the lucene-java
    benchmark package to make use of it.

    I think an XML format for everything might be better in the future (and it
    looks like you have some experience with this kind of thing already);
    perhaps someone would be interested in doing this as a later improvement,
    and also in contributing code to make use of it in the lucene-java benchmark.

    As far as the actual collection content goes, right now this package only
    expects "body" and "docname" index fields to be present, so this is a very
    simple format for the time being. I don't really care about the format of
    the existing collection for the time being, and I'm not going to do anything
    complicated involving XML Schema or DTDs. I'm just going to parse out the
    "body" and "docname" and output them in a consistent way, so that we can
    have some repeatable relevance tests in the near future.
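
    To make that concrete, the normalized documents would just look something
    like this (the exact tag names being whatever TrecContentSource already
    parses; this is only a sketch):

    <DOC>
    <DOCNO> collection-000001 </DOCNO>
    ... the document body text, unchanged from the original ...
    </DOC>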

    In the future, I think people can contribute improvements to all of this.
    I'm trying to start with the bare-bones basics, expecting that others will
    improve on it, or even completely replace all of it!

  • Nicola Ferro at Nov 13, 2009 at 8:11 am
    Ok, I see your point.

    Nicola

  • Robert Muir at Nov 12, 2009 at 3:44 pm
    Sorry, I wanted to respond to this also (the trec_eval wrapper you
    mentioned): it sounds interesting for the future, I think.

    For now, the lucene-java code simply outputs a submission.txt file, which
    can then be used from the command line with trec_eval (invoked manually).

    I think we want to continue to do this for starters. In the future it might
    be neat to have something that uses a wrapper like this to create, say,
    JIRA-formatted tables for us to use in benchmarks, but there might be other
    ways to do the same thing...

    I agree, though, that for comparisons it would be best if everyone used the
    official trec_eval, and not the summary output from the lucene-java
    benchmark package.
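
    (For anyone who hasn't used it: submission.txt is just the standard TREC
    run format, one line per retrieved document, and trec_eval is then run
    against it together with the judgements. The file names and values below
    are only examples.)

    topic-id  Q0  doc-id     rank  score  run-tag
    101       Q0  doc-0042   1     13.74  lucene-default
    101       Q0  doc-0099   2     12.01  lucene-default

    trec_eval qrels.txt submission.txt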

  • Nicola Ferro at Nov 13, 2009 at 8:14 am
    This sounds reasonable.

    If, in the future, you are interested in this kind of approach (a wrapper
    to trec_eval via JNI), let us know: it would be a pity to duplicate
    existing work, and our package is quite well tested, since we have used it
    in CLEF since 2005.

    All the best,
    Nicola

  • Robert Muir at Nov 13, 2009 at 11:55 am
    Personally, I try to avoid writing JNI at all costs, so if I ever want such
    a thing I will certainly shoot you an email!

    I guess this is what you use to produce the charts/graphs for CLEF results?
  • Nicola Ferro at Nov 13, 2009 at 12:32 pm
    I don't like JNI-based solutions either, but results identical to
    trec_eval's are mandatory for us; e.g. participants use trec_eval on
    their own, and this ensures exactly the same computations. This is why I
    didn't write a pure Java version with the computations coded from
    scratch, and it was in any case better than calling an external OS
    process.

    And yes, this is what we use in CLEF for the computations inside the
    Java-based system we use for managing the campaign.

    Nicola

  • Robert Muir at Nov 12, 2009 at 1:10 pm
    Here are some more examples:

    Some judgement files are tab delimited, some space delimited.
    Some collections concatenate all documents into one big file, some have
    thousands of smaller files.
    Some have these under subdirectories, some do not.

    So, yeah, it sounds like I'm not proposing much value-add, but all of these
    little inconsistencies add up to annoyances and extra processing to make
    use of a collection.
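
    As a trivial example of what the normalization amounts to for the
    judgements, something like this sketch (not committed code) rewrites a
    tab- or space-delimited qrels file into the plain space-separated form
    trec_eval expects:

    # normalize_qrels.py - sketch only
    # qrels lines are: topic-id  iteration  doc-id  relevance
    import sys

    for line in sys.stdin:
        fields = line.split()          # tolerates tabs, spaces, or runs of either
        if len(fields) >= 4:
            print(" ".join(fields[:4]))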

    I'm definitely not proposing changing any of the actual content.

    For the openly available collections I have worked with and listed on the
    wiki, once you fix these structural differences, it's easy to make use of
    them. I am already doing it.

    I'm not concerned about any theoretical properties of collections that are
    not openly available (things with speech or whatever).

  • Grant Ingersoll at Nov 12, 2009 at 12:00 pm

    On Nov 10, 2009, at 7:13 PM, Robert Muir wrote:

    Marvin, I am a little concerned about your comments.

    I think that there might be a little confusion:
    1. the trec portion of the lucene benchmark suite is essentially standalone,
    it doesn't really interact with the other components.

    +1

    2. creating some scripts/code to download and reformat collections into a
    standardized format, I think this is the way to go?
    Why not agree on the conventional TREC format that the lucene benchmark
    package expects?

    +1. This makes the most sense. No sense in re-inventing the wheel.

    3. we are talking about downloading and reformatting text files into text
    files, seriously I don't think you need to/should understand
    really anything about the lucene benchmark impl to make use of this.

    +1

    by the way, If there's anything I can do to make this concept more amenable
    to you, please reply!
    actually I 100% agree that we should not limit anything to any specific
    lucene implementation.

    In fact, I don't want to try to imply some scope creep, but I'm completely
    for the idea that this kind of thing could be re-used to compare even
    non-apache projects (other search engines, etc).
    Surely we have some stuff to learn from each other.

    +1

    This is just about boilerplate code, build scripts, downloading,
    reformatting, very boring stuff :)

    Yawn.
    However, I'd pretty much resigned myself to porting a separate
    implementation
    of the benchmarking suite eventually. And given the way this
    thread has
    progressed since I started writing this reply, looks like that's
    what I'll
    be
    falling back to after all. Oh well... no gain, no loss.

    Marvin Humphrey


    --
    Robert Muir
    rcmuir@gmail.com
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com/

    Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
    using Solr/Lucene:
    http://www.lucidimagination.com/search
