Hello

I am developing an API for a search application that uses Lucene 2.4.1.
The search application is maintained by RAA (a Swedish government organization that keeps track of historical and cultural data).

I have received a request for an API method that returns an XML response listing all the indexes in this application and the number of unique values each index can have, filtered by a query received in the method request.

The application contains a large number of indexes, and some indexes contain a very large number of unique values. Is there an efficient way to achieve this?

With regards, Henrik


  • Ian Lea at Nov 3, 2009 at 9:44 am
    Well, Lucene is blazingly quick and sometimes things take less time
    than one might expect, but your combination of large and very large is
    not encouraging. It doesn't sound like the new API method would
    necessarily need an exact reply - could you run something in the
    background out of peak hours that built the XML response or at least
    saved numbers for it to be built quickly when requested? Depending on
    the volatility of your indexes, the background job could be somewhat
    intelligent and only update the figures for indexes that have had
    significant activity. Defining significant activity is left as an
    exercise for the reader ...

    Good luck.


    --
    Ian.
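
A minimal sketch of the background-job idea above, assuming a current JDK rather than the Java 5 era of Lucene 2.4.1. `UniqueCountCache` and the `counter` function are hypothetical names; the function stands in for whatever slow, Lucene-backed count the application would run out of peak hours.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Caches per-index unique-value counts so requests are answered from memory.
class UniqueCountCache {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    // Hypothetical: the expensive count, e.g. a walk over the index terms.
    private final Function<String, Long> counter;

    UniqueCountCache(Function<String, Long> counter) {
        this.counter = counter;
    }

    // Recompute the count for one index; called by the background job.
    void refresh(String indexName) {
        counts.put(indexName, counter.apply(indexName));
    }

    // Cheap lookup for the request path; -1 means "not computed yet".
    long get(String indexName) {
        return counts.getOrDefault(indexName, -1L);
    }

    // Refresh all given indexes now and then every periodHours hours.
    ScheduledExecutorService schedule(Iterable<String> indexNames, long periodHours) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(() -> {
            for (String name : indexNames) {
                refresh(name);
            }
        }, 0, periodHours, TimeUnit.HOURS);
        return exec;
    }
}
```

The request handler would call `get` and build the XML from the cached numbers. Since the counts are precomputed, this only covers unfiltered counts or a fixed set of known queries, and the figures can be stale between refreshes.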


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Toke Eskildsen at Nov 3, 2009 at 1:43 pm

    On Tue, 2009-11-03 at 10:23 +0100, Henrik Hjalmarsson wrote:
    > I have gotten a demand for an API method that returns an XML response,
    > listing all the indexes in this application and the number of unique
    > values these indexes can have, filtered by a query that is received in
    > the method request.

    We've had the same request a number of times, but when we discuss the
    scenarios in detail, they can always be scaled down to "The first X
    values", instead of "all values", where X is < 1000.


    While you can build efficient handling of the faceting on fields with
    many terms, simply returning the Strings for the terms (ignoring all the
    grunt work of extraction) poses problems.

    5 million values of 20 characters each takes up about
    5M * (20 * 2 + ~40) bytes ~ 400MByte
    of RAM. If you wrap that in nice XML and send it using SOAP, memory
    usage goes through the roof. Streaming, as Ian suggests, seems to be the
    answer here.
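
The estimate above can be written as a small helper. The 2 bytes per character and roughly 40 bytes of per-String overhead are the figures used in the post; the real overhead depends on the JVM. `TermMemoryEstimate` is a hypothetical name.

```java
// Rough heap estimate for holding term strings in memory.
class TermMemoryEstimate {
    // Assumed costs taken from the post, not measured values.
    static final long BYTES_PER_CHAR = 2;
    static final long OVERHEAD_PER_STRING = 40;

    // termCount terms of avgChars characters each, in bytes.
    static long estimateBytes(long termCount, long avgChars) {
        return termCount * (avgChars * BYTES_PER_CHAR + OVERHEAD_PER_STRING);
    }
}
```

For 5 million terms of 20 characters, `estimateBytes(5_000_000, 20)` gives 400,000,000 bytes, which matches the ~400 MByte figure.
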
    > The application contains a large amount of indexes and some indexes
    > contain a very large amount of unique values. Is there some way to
    > achieve this in an effective way?

    It is definitely possible in the case where you limit the number of
    returned values. Well, at least we've tested it for 1000M unique values
    in 100M documents. But before we go there, it would help to know what
    you mean by "large".

    Regards,
    Toke Eskildsen
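
One way to limit the response to X values, sketched with a bounded min-heap. This reads "the first X values" as the X values with the highest counts, which is an assumption; `TopValues` and its input map of value to count are hypothetical and stand in for whatever the faceting step produces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopValues {
    // Returns the x entries with the highest counts, highest first.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int x) {
        // Min-heap on count: the smallest of the current top x sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
            Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > x) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }
}
```

The heap never holds more than x + 1 entries, so the memory needed for the result stays bounded no matter how many unique values the field has.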


  • Jake Mannix at Nov 3, 2009 at 7:08 pm
    If you need faceting on top of Lucene and you're not using Solr, Bobo-browse
    ( http://bobo-browse.googlecode.com ) is a high-performance open source
    faceting library which may suit your needs. You're asking for "all facet
    values", which in bobo isn't terribly hard to get: because of the way bobo
    keeps the facet counts, it already has all of the counts in memory for all
    the unique values once you've run the query with faceting turned on, and
    it's just a question of returning them.

    How big is your index, and how many unique values for this field?

    -jake
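
A rough illustration of why all counts are available once the query has run. This is not Bobo-browse's API, only a plain-Java sketch of ordinal-based facet counting under the assumption of a single-valued field; `OrdinalFacetCounter`, `docToOrdinal` and `matchingDocs` are hypothetical names.

```java
import java.util.BitSet;

class OrdinalFacetCounter {
    // docToOrdinal[doc] is the position of that document's value in a
    // sorted array of the field's unique values (one value per document).
    static int[] count(int[] docToOrdinal, int valueCount, BitSet matchingDocs) {
        int[] counts = new int[valueCount];
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0;
                 doc = matchingDocs.nextSetBit(doc + 1)) {
            counts[docToOrdinal[doc]]++;
        }
        return counts;
    }

    // Number of distinct values that occur in at least one matching document.
    static int uniqueValues(int[] counts) {
        int n = 0;
        for (int c : counts) {
            if (c > 0) {
                n++;
            }
        }
        return n;
    }
}
```

The counts array has one slot per unique value, so memory grows with the number of unique values rather than with the number of matches.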
  • Chris Lu at Nov 4, 2009 at 1:50 am
    If the query is a very selective one, you can go through the XML
    document and do the counting.

    If the query is not so selective, which is usually the case, and the
    number of matches is large, basically all the values need to be loaded
    into memory, or onto a solid-state disk, to count quickly.
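
For the selective case, the counting can be sketched as a single pass over the matching documents. Each document is modelled here as a map from field name to stored value, which is an assumption for illustration; `SelectiveQueryCounter` is a hypothetical name and no Lucene calls are shown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SelectiveQueryCounter {
    // Returns, per field, how many distinct values occur in the matches.
    static Map<String, Integer> uniqueValuesPerField(List<Map<String, String>> matchingDocs) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (Map<String, String> doc : matchingDocs) {
            for (Map.Entry<String, String> field : doc.entrySet()) {
                seen.computeIfAbsent(field.getKey(), k -> new HashSet<>())
                    .add(field.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
            result.put(e.getKey(), e.getValue().size());
        }
        return result;
    }
}
```

The cost is proportional to the number of matches times the number of fields per document, which is why this only suits queries that match few documents.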

    --
    Chris Lu



Discussion Overview
group: java-user
categories: lucene
posted: Nov 3, '09 at 9:24a
active: Nov 4, '09 at 1:50a
posts: 5
users: 5
website: lucene.apache.org
