Hello,

I'm helping out a student interested in using query and click logs to build
custom relevance models for Lucene. Step #1 is finding a good dataset that
contains the needed data. I've looked around, found a few things, but nothing
that looks very good.

I was wondering if anyone has any dataset suggestions?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


  • Tommaso Teofili at Mar 2, 2011 at 10:04 am
    Hi Otis,
    you may find some resources (mainly code, not datasets) for your student's
    use case at [1].
    Also, I know LWE from LucidImagination [2] has a click & scoring framework,
    but that component is not open source at the moment, and I don't know whether
    they also used publicly available datasets to build that feature.
    My 2 cents,
    Tommaso

    [1]: http://code.google.com/p/oluolu
    [2]: http://www.lucidimagination.com/enterprise-search-solutions/lucidworks/1.6

  • Andrzej Bialecki at Mar 2, 2011 at 10:08 am

    On 3/2/11 3:39 AM, Otis Gospodnetic wrote:
    I was wondering if anyone has any dataset suggestions?
    The (in)famous AOL dataset comes to my mind, and it's very good, maybe
    even too good :) AOL officially pulled it back, but it's still available
    and IMHO legitimate to use - it was a blunder all right but it carried a
    suitable license and things can't be un-published ...

    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com
  • Otis Gospodnetic at Mar 2, 2011 at 3:23 pm
    Hi Andrzej,

    Thanks for bringing up the AOL dataset (I've got a copy of it stashed away).
    The person I'm helping looked at it, and we thought it didn't have all
    the data one needs to build custom relevance models.
    Here is a small sample:

    AnonID  Query                  QueryTime            ItemRank  ClickURL
    217     lottery                2006-03-01 11:58:51  1         http://www.calottery.com
    217     lottery                2006-03-01 11:58:51  1         http://www.calottery.com
    217     ameriprise.com         2006-03-01 14:06:23  1         http://www.ameriprise.com
    217     susheme                2006-03-02 12:31:08
    217     united.com             2006-03-03 14:54:13
    217     mizuno.com             2006-03-07 22:41:17  1         http://www.mizuno.com
    217     p; .; p;' p; ' ;' ;';  2006-03-09 12:09:27
    217     p; .; p;' p; ' ;' ;';  2006-03-09 12:09:35
    217     buddylis               2006-03-16 15:23:33
    217     bestasiancompany.com   2006-03-20 15:15:43  1         http://www.bestasiancompany.com
    217     lottery                2006-03-27 14:10:38  1         http://www.calottery.com
    217     lottery                2006-03-27 16:34:59  1         http://www.calottery.com
    217     ask.com                2006-03-31 14:31:10  1         http://www.ask.com
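
For anyone who wants to load rows like the sample above, here is a minimal Python sketch. It assumes the raw log files are tab-separated with the five columns shown, and that rows without a click simply leave ItemRank and ClickURL empty or absent. The names `LogEntry` and `parse_line` are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogEntry:
    anon_id: int
    query: str
    query_time: datetime
    item_rank: Optional[int]   # None when no click was recorded for the query
    click_url: Optional[str]   # None when no click was recorded for the query


def parse_line(line: str) -> LogEntry:
    """Parse one log row. Assumption: fields are separated by tabs."""
    fields = line.rstrip("\n").split("\t")
    # Pad rows that have no click so they always unpack into five fields.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return LogEntry(
        anon_id=int(anon_id),
        query=query,
        query_time=datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S"),
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )


# Two rows taken from the sample: one with a click, one without.
sample = [
    "217\tlottery\t2006-03-01 11:58:51\t1\thttp://www.calottery.com",
    "217\tsusheme\t2006-03-02 12:31:08",
]
entries = [parse_line(line) for line in sample]
clicked = [e for e in entries if e.click_url is not None]
```

Filtering on `click_url` separates queries that led to a click from those that did not, which is the only relevance signal the log itself carries.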

    For instance, in order to build custom relevance models, wouldn't we need to
    have the actual corpus/index associated with this data in order to get the base
    relevance scores first?

    Or could one just look at clicks on low-ranked results (a large ItemRank value,
    meaning they were not close to the top of the search results) and apply some
    algorithm that essentially produces a boost score that stands on its own and is
    applied on top of the relevance score at search time?
    Would it make sense to have a global boost score for each document, or would
    that need to be query-specific and thus applied at query-time and not at
    index-time?
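
To make the two options concrete, here is a rough Python sketch that derives both a query-specific boost and a global per-document boost from click rows alone. The weighting `log2(1 + ItemRank)` is purely an assumption for illustration (a click far from the top is treated as a stronger signal than a click on the first result); it is not an established method from this thread, and the third input row is hypothetical.

```python
import math
from collections import defaultdict

# (query, item_rank, click_url) tuples; rows without a click are left out.
clicks = [
    ("lottery", 1, "http://www.calottery.com"),
    ("lottery", 1, "http://www.calottery.com"),
    ("lottery", 4, "http://www.example.org/lotto"),  # hypothetical row
]


def click_weight(item_rank: int) -> float:
    # Assumed heuristic: clicks deeper in the result list count for more.
    return math.log2(1 + item_rank)


def build_boosts(rows):
    """Return (query-specific boosts, global boosts) as plain dicts."""
    per_query = defaultdict(float)    # keyed by (query, url)
    global_boost = defaultdict(float)  # keyed by url only
    for query, item_rank, url in rows:
        weight = click_weight(item_rank)
        per_query[(query, url)] += weight
        global_boost[url] += weight
    return dict(per_query), dict(global_boost)


per_query, global_boost = build_boosts(clicks)
```

The global table could in principle be stored with each document at index time, while the query-specific table has to be looked up at query time. With only click logs and no corpus, both are stand-alone signals layered on top of whatever base score the engine produces.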

    If you have an idea how one could/should go about using just the above to build
    custom relevance models for Lucene, I'm all eyeballs.

    Thanks,
    Otis
    ----
    Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
    Lucene ecosystem search :: http://search-lucene.com/



Discussion Overview
group: openrelevance-dev
categories: lucene
posted: Mar 2, '11 at 2:39a
active: Mar 2, '11 at 3:23p
posts: 4
users: 3
website: lucene.apache.org...
