Hi.

I have a situation where I'm searching across some 100K feeds and only want
one result per site in return. I have developed a really simple grouping
method which just scrolls through the result set (hit set) until maxNum
docs of feeds from a set of unique sites is populated. Since I don't want to
reinvent the wheel, I want to know if Lucene has something like this built in.
I will also be using Solr soon, where my home-cooked recipe will not work,
so I really need a standard way of doing this.
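For reference, the scroll-and-collect approach described above can be sketched roughly like this in plain Java, independent of Lucene; the hit list, the site lookup function and maxNum are stand-ins for whatever the real search returns:

```java
import java.util.*;
import java.util.function.Function;

public class SiteDedup {
    // Scroll through the hits in rank order and keep the first hit per site,
    // stopping once maxNum hits from distinct sites have been collected.
    public static <H> List<H> onePerSite(List<H> hits, Function<H, String> siteOf, int maxNum) {
        Set<String> seenSites = new HashSet<>();
        List<H> result = new ArrayList<>();
        for (H hit : hits) {
            if (result.size() >= maxNum) break;
            if (seenSites.add(siteOf.apply(hit))) { // add() returns false for an already-seen site
                result.add(hit);
            }
        }
        return result;
    }
}
```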

I know Nutch has something like it, called dedupField, whose default is set to
2.

Anyone?


Kindly

//Marcus

--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com


  • Grant Ingersoll at Nov 5, 2007 at 12:02 pm
    Solr has an issue outstanding right now that implements something that
    may be close to what you want. They are calling it Field Collapsing.
    See https://issues.apache.org/jira/browse/SOLR-236

    -Grant
    --------------------------
    Grant Ingersoll
    http://lucene.grantingersoll.com

    Lucene Boot Camp Training:
    ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Marcus Herou at Nov 5, 2007 at 12:49 pm
Thanks. They seem to have gotten quite far in the dev cycle on this. It looks
like it will ship in Solr 1.3.

However, I would really like this feature to be developed for core Lucene;
how do I start that process?
"Develop it yourself", you would say :) Seriously, isn't it a really cool and
useful feature?

    Kindly

    //Marcus
  • Grant Ingersoll at Nov 5, 2007 at 9:03 pm


    We're always open to well-thought out and tested patches. See the
    Wiki for info on contributing.

    -Grant


  • Marcus Herou at Nov 6, 2007 at 8:20 am
    Cool.

I'll do that, since this is an area I can spend time on.

    Kindly

    //Marcus
  • ninaS at Dec 5, 2007 at 12:18 pm
    Hey Marcus,

have you already implemented this feature?
I'm looking for a group-by function for Lucene, too.

More precisely, I need it in Compass, which is built on top of Lucene.

I was thinking about using a HitCollector to get only one result per group.

    How did you do it?

    Cheers,
    Nina



    --
    View this message in context: http://www.nabble.com/Group-by-in-Lucene---tf4749806.html#a14170395
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


  • Marcus Herou at Jan 28, 2009 at 7:12 am
    Hi.

I did partly solve this with Solr faceting, but it does not cover this quite
common db feature:
num_en_entries = select count(distinct id) from BlogEntry where
language='en'
num_sv_entries = select count(distinct id) from BlogEntry where
language='sv'

It does, however, cover:
select count(id), date from BlogEntry group by date

I now need this feature elsewhere when parsing access logs etc., so I am
looking into MonetDB, LucidDB and FastBit. Sphinx seems to have
something like this:
http://www.sphinxsearch.com/docs/current.html#clustering

    Did you ever try a HitCollector ?

    //Marcus

    --
    Marcus Herou CTO and co-founder Tailsweep AB
    +46702561312
    marcus.herou@tailsweep.com
    http://www.tailsweep.com/
    http://blogg.tailsweep.com/
  • ninaS at Jan 28, 2009 at 9:50 am
    Hello,

yes, I tried HitCollector, but I am not satisfied with it because you cannot
use sorting with HitCollector unless you implement a way to use
TopFieldDocCollector. I did not manage to do that in a performant way.

It is easier to first do a normal search and "group by" afterwards:

Iterate through the result documents and take one of each group. Each
document has a groupingKey. I remember which groupingKey is already used and
don't take another document of this group into the result list.
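A rough sketch of that post-search approach, assuming each result carries a groupingKey (plain Java; the types and the key function are placeholders, not Compass or Lucene API):

```java
import java.util.*;
import java.util.function.Function;

public class GroupCollapse {
    // Map each groupingKey to the first (i.e. best-ranked) document seen for it,
    // preserving the original result order via a LinkedHashMap.
    public static <D> Map<String, D> collapse(List<D> results, Function<D, String> keyOf) {
        Map<String, D> byGroup = new LinkedHashMap<>();
        for (D doc : results) {
            byGroup.putIfAbsent(keyOf.apply(doc), doc); // later docs of the same group are skipped
        }
        return byGroup;
    }
}
```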

    Regards,
    Nina
  • ninaS at Jan 28, 2009 at 9:52 am
By the way: if you only need to count documents (count groups), HitCollector
is a good choice. If you only count, you don't need to sort anything.


  • Marcus Herou at Jan 28, 2009 at 11:44 am
    Hi.

This is way too slow, I think, since what you are explaining is something I
already tested. However, I might be using the HitCollector badly.

Please prove me wrong. Below is some code I tested this with.
It stores a hash of the term value in a TIntHashSet and just
calculates the size of that set.
This one takes approx 3 sec on about 0.5M rows = way too slow.


main test class:

public class GroupingTest
{
    protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());
    static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");

    public static void main(String[] args)
    {
        Utils.initLogger();
        String[] fields = {"uid", "ip", "date", "siteId", "visits", "countryCode"};
        try
        {
            IndexFactory fact = new IndexFactory();
            String d = "/tmp/csvtest";
            fact.initDir(d);
            IndexReader reader = fact.getReader(d);
            IndexSearcher searcher = fact.getSearcher(d, reader);
            QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
            Query q = parser.parse("date:20090125");

            GroupingHitCollector coll = new GroupingHitCollector();
            coll.setDistinct(true);
            coll.setGroupField("uid");
            coll.setIndexReader(reader);
            long start = System.currentTimeMillis();
            searcher.search(q, coll);
            long stop = System.currentTimeMillis();
            System.out.println("Time: " + (stop - start)
                    + ", distinct count(uid): " + coll.getDistinctCount()
                    + ", count(uid): " + coll.getCount());
        }
        catch (Exception e)
        {
            log.error(e.toString(), e);
        }
    }
}


public class GroupingHitCollector extends HitCollector
{
    protected IndexReader indexReader;
    protected String groupField;
    protected boolean distinct;
    //protected TLongHashSet set;
    protected TIntHashSet set;
    protected int distinctSize;

    int count = 0;
    int sum = 0;

    public GroupingHitCollector()
    {
        set = new TIntHashSet();
    }

    public String getGroupField()
    {
        return groupField;
    }

    public void setGroupField(String groupField)
    {
        this.groupField = groupField;
    }

    public IndexReader getIndexReader()
    {
        return indexReader;
    }

    public void setIndexReader(IndexReader indexReader)
    {
        this.indexReader = indexReader;
    }

    public boolean isDistinct()
    {
        return distinct;
    }

    public void setDistinct(boolean distinct)
    {
        this.distinct = distinct;
    }

    public void collect(int doc, float score)
    {
        if (distinct)
        {
            try
            {
                // Loads the stored document for every single hit.
                Document document = this.indexReader.document(doc);
                if (document != null)
                {
                    String s = document.get(groupField);
                    if (s != null)
                    {
                        set.add(s.hashCode());
                        //set.add(Crc64.generate(s));
                    }
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
        count++;
        sum += doc; // use it to avoid any possibility of being optimized away
    }

    public int getCount() { return count; }
    public int getSum() { return sum; }

    public int getDistinctCount()
    {
        distinctSize = set.size();
        return distinctSize;
    }
}

  • Marcus Herou at Jan 28, 2009 at 11:45 am
Oh, btw: faceting is easy; it's the distinct part that I think is hard.

    Example Lucene Facet:
    http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html

  • Erick Erickson at Jan 28, 2009 at 2:03 pm
At a quick glance, this line is really suspicious:

Document document = this.indexReader.document(doc)

From the Javadoc for HitCollector.collect:

Note: This is called in an inner search loop. For good search performance,
implementations of this method should not call Searcher.doc(int) or
IndexReader.document(int) on every document number encountered. Doing so
can slow searches by an order of magnitude or more.

You're loading the document each time through the loop. I think you'd get
much better performance by making sure that your groupField is indexed and
then using TermDocs (TermEnum?) to get the value of the field.
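The gain Erick describes comes from replacing a stored-document load per hit with a single in-memory lookup. The FieldCache-style layout can be mimicked without Lucene: one values array indexed by document number, built once up front, then consulted on every collect() call (the class and its names below are illustrative, not Lucene API):

```java
import java.util.*;

public class FieldCacheSketch {
    // One field value per document number, built once up front
    // (Lucene's FieldCache builds this kind of array from the term dictionary).
    private final String[] valueByDoc;
    private final Set<Integer> hashes = new HashSet<>();

    public FieldCacheSketch(String[] valueByDoc) {
        this.valueByDoc = valueByDoc;
    }

    // The hot path: a plain array read instead of IndexReader.document(doc).
    public void collect(int doc) {
        String v = valueByDoc[doc];
        if (v != null) hashes.add(v.hashCode());
    }

    public int distinctCount() {
        return hashes.size();
    }
}
```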

    Best
    Erick


  • Mark Miller at Jan 28, 2009 at 4:03 pm
Group-by in Lucene/Solr has not been solved in a great general way yet,
to my knowledge.

Ideally, we would want a solution that does not need to fit into memory.
However, you need the value of the field for each document to do the
grouping. As you are finding, this is not cheap to get. Currently, the
efficient way to get it is to use a FieldCache. This, however, requires
that every distinct value can fit into memory.

Once you have efficient access to the values, you need to be able to
efficiently group the results, again without being bounded by memory (which
we already are with the FieldCache).

There are quite a few ways to do this. The simplest is to group until
you have used all the memory you want; then, for everything left,
anything that doesn't match an in-memory group is written to a file, and
anything that does match increments its group count. Use the overflow file
as the input for the next run, and repeat until there is no overflow. You
can improve on that by partitioning the overflow file.
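One pass of the overflow scheme just described might look like this, with an in-memory list standing in for the overflow file and a fixed cap standing in for the memory budget (names are illustrative):

```java
import java.util.*;

public class OverflowGrouping {
    // Count items per group, keeping at most maxGroups distinct groups in memory.
    // Keys whose group is not yet tracked once the cap is reached are spilled;
    // the caller re-runs this pass on the spill until it comes back empty.
    public static Map<String, Integer> groupPass(List<String> keys, int maxGroups, List<String> spill) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            if (counts.containsKey(key)) {
                counts.merge(key, 1, Integer::sum); // group already tracked: just count
            } else if (counts.size() < maxGroups) {
                counts.put(key, 1);                 // room for a new group
            } else {
                spill.add(key);                     // overflow: handled in a later pass
            }
        }
        return counts;
    }
}
```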

    And then there are a dozen other methods.

Solr has a patch in JIRA that uses a sorting method. First the results
are sorted on the group-by field, then scanned through for grouping -
all field values that are the same will be next to each other. Finally,
if you really wanted to sort on a different field, another sort is
applied. That's not ideal IMO, but it's a start.
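The sort-based method reads like this in miniature: sort on the group-by value, then make a single scan in which any change of value starts a new group (plain Java for illustration, not the actual Solr patch):

```java
import java.util.*;

public class SortScanGrouping {
    // Sort on the group-by value, then scan once: equal values are adjacent
    // after sorting, so the first occurrence of each value heads its group.
    public static List<String> groupHeads(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<String> heads = new ArrayList<>();
        String prev = null;
        for (String v : sorted) {
            if (!v.equals(prev)) heads.add(v); // value changed: a new group starts here
            prev = v;
        }
        return heads;
    }
}
```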

    - Mark

  • Marcus Herou at Feb 1, 2009 at 3:01 pm
Yep. Probably an external sort should be used when flushing to disk. I have
written such code, so that is probably a no-brainer; the problem is to get it
speedy :)
http://dev.tailsweep.com/projects/utils/apidocs/org/tailsweep/utils/sort/TupleSorter.html

Another way could be to use HDFS and MapFiles/SequenceFiles. Not speedy at
all, but scalable.

I'm thinking of writing my own inverted index, specialized for this kind of
operation. Any pointers on where to start looking for material on that?

    /Marcus

    On Wed, Jan 28, 2009 at 5:02 PM, Mark Miller wrote:

    Group-by in Lucene/Solr has not been solved in a great general way yet to
    my knowledge.

    Ideally, we would want a solution that does not need to fit into memory.
    However, you need the value of the field for each document to do the
    grouping. As you are finding, this is not cheap to get. Currently, the
    efficient way to get it is to use a FieldCache. This, however, requires that
    every distinct value can fit into memory.

    Once you have efficient access to the values, you need to be able to
    efficiently group the results, again not bounded by memory (which we already
    are with the FieldCache).

    There are quite a few ways to do this. The simplest is to group until you
    have used all the memory you want, then for everything left, anything that
    doesn't match a group, write it to a file; if it does, increment the group
    count. Use the overflow file as the input in the next run, repeat until
    there is no overflow. You can improve on that by partitioning the overflow
    file.

    And then there are a dozen other methods.

    Solr has a patch in JIRA that uses a sorting method. First the results are
    sorted on the group-by field, then scanned through for grouping - all field
    values that are the same will be next to each other. Finally, if you really
    wanted to sort on a different field, another sort is applied. That's not
    ideal IMO, but it's a start.

    - Mark


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    --
    Marcus Herou CTO and co-founder Tailsweep AB
    +46702561312
    marcus.herou@tailsweep.com
    http://www.tailsweep.com/
    http://blogg.tailsweep.com/
  • Mschipperheyn at Aug 1, 2009 at 9:43 am
    http://code.google.com/p/bobo-browse

    looks like it may be the ticket.

    Marc

    --
    View this message in context: http://www.nabble.com/Group-by-in-Lucene---tp13581760p24767693.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erik Hatcher at Aug 2, 2009 at 8:56 am
    Don't overlook Solr: http://lucene.apache.org/solr

    Erik
    On Aug 1, 2009, at 5:43 AM, mschipperheyn wrote:


    http://code.google.com/p/bobo-browse

    looks like it may be the ticket.

    Marc

    --
    View this message in context: http://www.nabble.com/Group-by-in-Lucene---tp13581760p24767693.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Marcus Herou at Feb 1, 2009 at 10:43 am
    Yep, you are correct, this is a lousy implementation which I knew when I
    wrote it.

    I'm not interested in the entire document just the grouping term and the
    docId which it is connected to.

    So how do I get hold of the TermDocs for the grouping field ?

    I mean I probably first need to perform the query: searcher.search(...)
    which would give me set of doc ids. Then I need to group them all by for
    instance: "ip-address", save each ip-address in another set and in the end
    calculate the size of that set.

    i.e. the equivalent of: select count(distinct(ipAddress)) from AccessLog where
    date='2009-01-25' (optionally group by ipAddress ?)
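
    That SQL amounts to the following in plain Java over an in-memory log
    (Row and countDistinctIps are invented names, just to pin down the
    semantics):

    ```java
    import java.util.*;

    public class DistinctCount {
        public static class Row {
            public final String date, ip;
            public Row(String date, String ip) { this.date = date; this.ip = ip; }
        }

        // Filter rows on the date, collect the ip addresses into a set
        // (set membership = DISTINCT), return the set's size.
        public static int countDistinctIps(List<Row> log, String date) {
            Set<String> ips = new HashSet<>();
            for (Row r : log) {
                if (r.date.equals(date)) ips.add(r.ip);
            }
            return ips.size();
        }
    }
    ```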


    //Marcus




    On Wed, Jan 28, 2009 at 3:02 PM, Erick Erickson wrote:

    At a quick glance, this line is really suspicious:

    Document document = this.indexReader.document(doc)

    From the Javadoc for HitCollector.collect:

    Note: This is called in an inner search loop. For good search performance,
    implementations of this method should not call

    Searcher.doc(int) or IndexReader.document(int) on every document number
    encountered. Doing so can slow searches by an order of magnitude or more.

    You're loading the document each time through the loop. I think you'd get
    much better
    performance by making sure that your groupField is indexed, then use
    TermDocs (TermEnum?)
    to get the value of the field.

    Best
    Erick



    On Wed, Jan 28, 2009 at 6:43 AM, Marcus Herou <marcus.herou@tailsweep.com>
    wrote:
    Hi.

    This is way too slow I think since what you are explaining is something I
    already tested. However I might be using the HitCollector badly.

    Please prove me wrong. Supplying some code which I tested this with.
    It stores a hash of the value of the term in a TIntHashSet and just
    calculates the size of that set.
    This one takes approx 3 sec on about 0.5M rows = way too slow.


    main test class:

    public class GroupingTest
    {
        protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());
        static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");

        public static void main(String[] args)
        {
            Utils.initLogger();
            String[] fields = {"uid", "ip", "date", "siteId", "visits", "countryCode"};
            try
            {
                IndexFactory fact = new IndexFactory();
                String d = "/tmp/csvtest";
                fact.initDir(d);
                IndexReader reader = fact.getReader(d);
                IndexSearcher searcher = fact.getSearcher(d, reader);
                QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
                Query q = parser.parse("date:20090125");

                GroupingHitCollector coll = new GroupingHitCollector();
                coll.setDistinct(true);
                coll.setGroupField("uid");
                coll.setIndexReader(reader);
                long start = System.currentTimeMillis();
                searcher.search(q, coll);
                long stop = System.currentTimeMillis();
                System.out.println("Time: " + (stop - start)
                        + ", distinct count(uid): " + coll.getDistinctCount()
                        + ", count(uid): " + coll.getCount());
            }
            catch (Exception e)
            {
                log.error(e.toString(), e);
            }
        }
    }


    public class GroupingHitCollector extends HitCollector
    {
        protected IndexReader indexReader;
        protected String groupField;
        protected boolean distinct;
        //protected TLongHashSet set;
        protected TIntHashSet set;
        protected int distinctSize;

        int count = 0;
        int sum = 0;

        public GroupingHitCollector()
        {
            set = new TIntHashSet();
        }

        public String getGroupField()
        {
            return groupField;
        }

        public void setGroupField(String groupField)
        {
            this.groupField = groupField;
        }

        public IndexReader getIndexReader()
        {
            return indexReader;
        }

        public void setIndexReader(IndexReader indexReader)
        {
            this.indexReader = indexReader;
        }

        public boolean isDistinct()
        {
            return distinct;
        }

        public void setDistinct(boolean distinct)
        {
            this.distinct = distinct;
        }

        public void collect(int doc, float score)
        {
            if (distinct)
            {
                try
                {
                    // Loading the stored document for every hit is the
                    // expensive part of this collector.
                    Document document = this.indexReader.document(doc);
                    if (document != null)
                    {
                        String s = document.get(groupField);
                        if (s != null)
                        {
                            // Note: storing only hashCode() means hash
                            // collisions can undercount distinct values.
                            set.add(s.hashCode());
                            //set.add(Crc64.generate(s));
                        }
                    }
                }
                catch (IOException e)
                {
                    e.printStackTrace();
                }
            }
            count++;
            sum += doc; // use it to avoid any possibility of being optimized away
        }

        public int getCount() { return count; }
        public int getSum() { return sum; }

        public int getDistinctCount()
        {
            distinctSize = set.size();
            return distinctSize;
        }
    }

    On Wed, Jan 28, 2009 at 10:51 AM, ninaS wrote:


    By the way: if you only need to count documents (count groups)
    HitCollector
    is a good choice. If you only count you don't need to sort anything.


    ninaS wrote:
    Hello,

    yes I tried HitCollector but I am not satisfied with it because you
    cannot use sorting with HitCollector unless you implement a way to use
    TopFieldDocCollector. I did not manage to do that in a performant way.
    It is easier to first do a normal search and "group by" afterwards:

    Iterate through the result documents and take one of each group. Each
    document has a groupingKey. I remember which groupingKey is already
    used
    and don't take another document of this group into the result list.

    Regards,
    Nina
    --
    View this message in context:
    http://www.nabble.com/Group-by-in-Lucene---tp13581760p21702742.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    --
    Marcus Herou CTO and co-founder Tailsweep AB
    +46702561312
    marcus.herou@tailsweep.com
    http://www.tailsweep.com/
    http://blogg.tailsweep.com/


    --
    Marcus Herou CTO and co-founder Tailsweep AB
    +46702561312
    marcus.herou@tailsweep.com
    http://www.tailsweep.com/
    http://blogg.tailsweep.com/

Discussion Overview
group: java-user
categories: lucene
posted: Nov 5, '07 at 5:58a
active: Aug 2, '09 at 8:56a
posts: 17
users: 7
website: lucene.apache.org
