I’m using Lucene to index database records and text documents.

I want to provide efficient fuzzy queries over the data so I’m using a secondary
Lucene index for all of the distinct terms encountered in the primary index.

Each ‘document’ in the secondary index is a term from the primary index with
fields for its q-grams, phonetic key(s) and synonyms.
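
For illustration only, the q-gram field of such a term document could be derived along these lines. This is a minimal sketch in plain Java, not code from the thread: the class name `QGramSketch`, the method `qGrams`, and the choice to keep a term shorter than q as a single gram are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class QGramSketch {

    /**
     * Returns the overlapping substrings of length q of the given term,
     * in order of occurrence. A non-empty term shorter than q is returned
     * as a single gram (an assumption, so short terms stay searchable).
     */
    static List<String> qGrams(String term, int q) {
        if (q < 1) {
            throw new IllegalArgumentException("q must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        if (term.isEmpty()) {
            return grams;
        }
        if (term.length() < q) {
            grams.add(term);
            return grams;
        }
        // Slide a window of width q across the term.
        for (int i = 0; i + q <= term.length(); i++) {
            grams.add(term.substring(i, i + q));
        }
        return grams;
    }
}
```

With q = 3, the term "lucene" yields the grams luc, uce, cen and ene, each of which would become a value of the q-gram field.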

It’s easy to populate the secondary index after indexing all of the records and
text documents using an IndexReader. However, to keep the secondary index up to
date I need to recognise when new terms are encountered for the first time, but
even looking deep into Lucene code and stepping through the indexing process
hasn’t revealed where this occurs – I presume because it doesn’t happen in a
single place but rather once in the in-memory term cache, once when the cache is
flushed into a segment, and again when segments are optimised.

Is this correct? Can anyone suggest how to maintain a secondary index of terms?
Perhaps only when the main index is optimised?

Thanks, Mike




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Li Li at Dec 16, 2010 at 3:42 am
    I don't fully understand your problem, but knowing when a new term
    occurs is hard, because each newly added document goes into a new
    segment. I think you can only do this during the final merge of the
    optimization stage. You can read the code in
    SegmentMerger.mergeTermInfos(). It merges all the terms of the
    segments being merged. Because terms are ordered by field name and
    then by term, the merge needs very little memory.
    Or, if you need to know the new terms in the current segment while
    building the index, FreqProxTermsWriterPerField.newTerm will be
    called the first time a term occurs in that segment.
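
The low-memory merge relies on the terms being sorted, and the same idea can be sketched for the secondary index: walk the sorted terms of the primary index and the sorted terms already stored in the secondary index in step, and collect the ones that appear only in the primary. The sketch below deliberately uses plain Java iterators instead of Lucene's API; the class `NewTermDiff` and method `newTerms` are hypothetical, and it assumes both inputs are distinct and sorted in the same order (here `String.compareTo`, which may not match the order a given Lucene version uses).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class NewTermDiff {

    /**
     * Returns the terms present in 'primary' but missing from 'secondary'.
     * Both iterators must yield distinct terms in ascending
     * String.compareTo order (an assumption of this sketch).
     * Only one term from each side is held in memory at a time.
     */
    static List<String> newTerms(Iterator<String> primary,
                                 Iterator<String> secondary) {
        List<String> added = new ArrayList<>();
        String known = secondary.hasNext() ? secondary.next() : null;
        while (primary.hasNext()) {
            String term = primary.next();
            // Skip secondary terms that sort before the current primary term.
            while (known != null && known.compareTo(term) < 0) {
                known = secondary.hasNext() ? secondary.next() : null;
            }
            // No equal term on the secondary side: this term is new.
            if (known == null || !known.equals(term)) {
                added.add(term);
            }
        }
        return added;
    }
}
```

In practice the two iterators would be fed from term enumerations over the primary and secondary indexes, for example after the main index has been optimised, and each returned term would be added to the secondary index.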


Discussion Overview
group: java-user
categories: lucene
posted: Dec 16, '10 at 1:50a
active: Dec 16, '10 at 3:42a
posts: 2
users: 2
website: lucene.apache.org

2 users in discussion

Mike Cawson: 1 post
Li Li: 1 post
