Our project requires near-real-time searches and constant updating. The
data is currently stored in a MySQL database, and the Lucene index is
updated as the database is modified.

We have the search capability currently where we want it. However, we
are attempting to add the ability to "tag" documents in the
index/database. Since the data pots can be millions of records, we
don't want to update the Lucene index for tagging, but instead have a
table of document IDs that we would like to use to determine the
tag sets.

The best option I've found so far is to retrieve both lists of IDs as
integer arrays, sort them (so I only need one pass), then loop
through and look for matches between the two. Attempting to use the
list of Lucene IDs in an "IN" query fails because the number of
documents can be in the millions and MySQL chokes on it.
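
For reference, the sort-and-merge comparison described above could look roughly like the following. This is only a minimal sketch in plain Java, not the project's actual code: the class and method names are hypothetical, and it assumes the IDs fit in an int and that each array contains no duplicates.

```java
import java.util.Arrays;

final class SortedIdIntersection {

    /**
     * Returns the IDs present in both arrays, in ascending order.
     * Both inputs are sorted in place. Assumes no duplicate IDs
     * within either array.
     */
    static int[] intersect(int[] luceneIds, int[] taggedIds) {
        Arrays.sort(luceneIds);
        Arrays.sort(taggedIds);

        // The result can never be longer than the shorter input.
        int[] matches = new int[Math.min(luceneIds.length, taggedIds.length)];
        int i = 0, j = 0, n = 0;

        // Single pass: advance whichever side currently holds the smaller ID.
        while (i < luceneIds.length && j < taggedIds.length) {
            if (luceneIds[i] < taggedIds[j]) {
                i++;
            } else if (luceneIds[i] > taggedIds[j]) {
                j++;
            } else {
                matches[n++] = luceneIds[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(matches, n);
    }

    public static void main(String[] args) {
        int[] luceneIds = {42, 7, 19, 3, 88};   // made-up sample IDs
        int[] taggedIds = {19, 100, 3, 55};     // made-up sample IDs
        System.out.println(Arrays.toString(intersect(luceneIds, taggedIds)));
    }
}
```

After the two sorts, the loop touches each element at most once, so the cost is dominated by sorting. If either list can be fetched already ordered (for example with an ORDER BY on the tag table), that sort can be skipped.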

Any insight into how we could optimize this, or do it another way?

Thanks in advance

Trevor Watson

  • Andrew C. Smith at Feb 9, 2010 at 11:59 pm
    Since you don't want to update your main index, I think the way I would
    tackle this is to still leave MySQL out of it and do the following:

    1) Create a separate Lucene index that contains only the tag information
    and an ID
    2) Take advantage of the MultiSearcher or ParallelMultiSearcher to search
    it alongside the main index

    This should still allow you to get some pretty good performance without
    having to do expensive database queries.

    Thanks,
    Andrew Smith
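
To illustrate the idea of keeping tag data in its own small structure keyed by document ID, here is a toy model in plain Java. It does not use the Lucene API and does not show how MultiSearcher or ParallelMultiSearcher would be wired up; it is only a stand-in for a tag-only index. The class name, the method names, and the assumption that IDs are small non-negative integers are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class TagIndexSketch {

    // tag -> set of document IDs carrying that tag (one bit per ID)
    private final Map<String, BitSet> docsByTag = new HashMap<>();

    /** Records that docId carries the given tag. Only the tag structure changes. */
    void tag(String tag, int docId) {
        docsByTag.computeIfAbsent(tag, t -> new BitSet()).set(docId);
    }

    /** Removes the tag from docId, if it was present. */
    void untag(String tag, int docId) {
        BitSet docs = docsByTag.get(tag);
        if (docs != null) {
            docs.clear(docId);
        }
    }

    /** Returns the subset of hits that carry the tag; hits itself is not modified. */
    BitSet filter(BitSet hits, String tag) {
        BitSet result = (BitSet) hits.clone();
        BitSet tagged = docsByTag.get(tag);
        if (tagged == null) {
            result.clear();      // unknown tag: nothing matches
        } else {
            result.and(tagged);  // bitwise intersection of the two ID sets
        }
        return result;
    }
}
```

The point of the sketch is that tagging or untagging touches only the small tag structure, so the main index with its millions of records is never rewritten for a tag change.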

Discussion Overview
group: lucene-net-user
categories: lucene
posted: Feb 9, '10 at 2:32p
active: Feb 9, '10 at 11:59p
posts: 2
users: 2
website: lucene.apache.org
