FAQ
I have added a new field to each document in my index containing
substrings of another field to speed up initial-wildcard searches.

Each document has a field "text" which might contain "the quick brown
fox jumped over the lazy dogs"
The new field - "text_substrings" would then contain "the quick uick ick
brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"

This allows me to convert initial wildcard queries "*own" into a term
query "own".

However adding this field has exactly doubled the size of the index.
Given that the term list is a small fraction of the index (?), I find
this strange. I think it might be storing the documents twice.

Is there any way to stop this from happening?

Thanks

Greg Tarr




This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies within the Detica Limited group of companies.

Detica Limited is registered in England under No: 1337451.

Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.

Search Discussions

  • Uwe Schindler at Jul 15, 2009 at 11:00 am
    Field.Store.NO used for the text_substrings field? :-)

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
    -----Original Message-----
    From: Gregory Tarr
    Sent: Wednesday, July 15, 2009 12:49 PM
    To: java-user@lucene.apache.org
    Subject: Index doubling in size when adding extra terms

    I have added a new field to each document in my index containing
    substrings of another field to speed up initial-wildcard searches.

    Each document has a field "text" which might contain "the quick brown
    fox jumped over the lazy dogs"
    The new field - "text_substrings" would then contain "the quick uick ick
    brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"

    This allows me to convert initial wildcard queries "*own" into a term
    query "own".

    However adding this field has exactly doubled the size of the index.
    Given that the term list is a small fraction of the index (?), I find
    this strange. I think it might be storing the documents twice.

    Is there any way to stop this from happening?

    Thanks

    Greg Tarr




    This message should be regarded as confidential. If you have received this
    email in error please notify the sender and destroy it immediately.
    Statements of intent shall only become binding when confirmed in hard copy
    by an authorised signatory. The contents of this email may relate to
    dealings with other companies within the Detica Limited group of
    companies.

    Detica Limited is registered in England under No: 1337451.

    Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
    England.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Jul 15, 2009 at 11:40 am
    It looks like your "text_substrings" field will have many more unique
    terms than the original text, right? And, since it's indexed (I
    assume), the docIDs will in fact be stored twice (once in the postings
    for your orig text and once in the postings for text_substrings). So
    I think it's expected that the postings (*.frq/.prx/.tis/.tii) would
    at least double in size.

    Mike

    On Wed, Jul 15, 2009 at 6:48 AM, Gregory Tarrwrote:
    I have added a new field to each document in my index containing
    substrings of another field to speed up initial-wildcard searches.

    Each document has a field "text" which might contain "the quick brown
    fox jumped over the lazy dogs"
    The new field - "text_substrings" would then contain "the quick uick ick
    brown rown own fox jumped umped mped ped over ver the lazy azy dogs ogs"

    This allows me to convert initial wildcard queries "*own" into a term
    query "own".

    However adding this field has exactly doubled the size of the index.
    Given that the term list is a small fraction of the index (?), I find
    this strange. I think it might be storing the documents twice.

    Is there any way to stop this from happening?

    Thanks

    Greg Tarr




    This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately.
    Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory.  The contents of this email may relate to dealings with other companies within the Detica Limited group of companies.

    Detica Limited is registered in England under No: 1337451.

    Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJul 15, '09 at 10:49a
activeJul 15, '09 at 11:40a
posts3
users3
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase