7GB index taking forever to return hits
  • Yueyu lin at Aug 14, 2006 at 6:30 pm
The 2GB limitation only applies when you try to load the whole index into
memory on a 32-bit box.
Our index is larger than 13 gigabytes, and it works fine.
I think there must be some error in your design. You can use Luke to see
what is actually in your index.
    On 8/14/06, Van Nguyen wrote:

    Hi,



I have a 7GB index (about 45 fields per document X roughly 5.5 million
docs) running on a Windows 2003 32-bit machine (dual proc, 2GB memory). The
index is optimized. Performing a search on this index (a wildcard query
with a sort) just "hangs". At first the CPU usage is 100%, then it drops to
50% after a minute or so, and then there is no CPU utilization… but the
thread is still trying to perform the search. I've tried this in my J2EE
app and in a standalone main program. Is this due to the 2GB limitation of
the 32-bit OS? (I didn't realize the index would get this big… I just let
it run over the weekend.)



    If this is due to the 2GB limitation of the 32bit OS and since I have this
    7GB index built already (and optimized), is there a way to split this into
    2GB indices w/o having to re-index? Or is this due to another factor?



    Van

    United Rentals
    Consider it done.™
    800-UR-RENTS
    unitedrentals.com



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

--
Yueyu Lin
  • Van Nguyen at Aug 15, 2006 at 12:30 am
  • Yueyu lin at Aug 15, 2006 at 12:40 am
To avoid "TooManyClauses", you can try a Filter instead of a Query, but
that will be slower.
From what I see, so many terms match your query that it will be tough for
Lucene.
    On 8/14/06, Van Nguyen wrote:

    It was how I was implementing the search.

I am using a boolean query. Prior to the 7GB index, I was searching
over a 150MB index that consists of a very small part of the bigger
index. I was able to call
BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) and that worked fine,
but I think that's the cause of my problem with this bigger index.
Commenting that out, I get a TooManyClauses exception. A typical query
looks something like this:

    +CONTENTS:*white* +CONTENTS:*hard* +CONTENTS:*hat* +COMPANY_CODE:u1
    +LANGUAGE:enu -SKU_DESC_ID:0 +IS_DC:d +LOCATION:b72

BooleanQuery q = new BooleanQuery();

// WildcardQuery and TermQuery take a Term, not two bare strings
WildcardQuery wc1 = new WildcardQuery(new Term("CONTENTS", "*white*"));
WildcardQuery wc2 = new WildcardQuery(new Term("CONTENTS", "*hard*"));
WildcardQuery wc3 = new WildcardQuery(new Term("CONTENTS", "*hat*"));
q.add(wc1, BooleanClause.Occur.MUST);
q.add(wc2, BooleanClause.Occur.MUST);
q.add(wc3, BooleanClause.Occur.MUST);

TermQuery t1 = new TermQuery(new Term("COMPANY_CODE", "u1"));
q.add(t1, BooleanClause.Occur.MUST);

TermQuery t2 = new TermQuery(new Term("LANGUAGE", "enu"));
q.add(t2, BooleanClause.Occur.MUST);
...

I take it this is not the optimal way to go about this.

So that leads me to my next question: what is the optimal way?

    Van

  • Erick Erickson at Aug 15, 2006 at 1:20 am
    I actually suspect that your process isn't hung, it's just taking forever
    because it's swapping a lot. Like a really, really, really lot. Like more
    than you ever want to deal with <G>.

    I think you're pretty much forced, as Lin said, to use a filter. I was
    pleasantly surprised at how quickly filters can be built, and if you have a
    small set of wildcards you really expect, you can use a CachingWrapperFilter
    to keep them around. NOTE: we're only recommending the filters for the
    wildcard portions of the query.

For a generous explanation of wildcards in general from "the guys", search
the archive for a thread "I just don't get wildcards at all". There's quite
an explanation of wildcards there, as well as some alternative indexing
schemes that let you do wildcard-style matching without wildcards -- which
is faster, but bloats the index.

    If you're unfamiliar with filters, you really need to become familiar with
    them. Basically, they're a bitset with the bit corresponding to each
    document turned on for terms that could match.

    You'll probably want to use a WildcardTermEnum, or perhaps a RegexTermEnum,
    and for each term, enumerate all the docs containing that term (see
    TermDocs) and set your bit in the filter. When the filter is created, use it
    to construct a ConstantScoreQuery that you add to your BooleanQuery. It
    sounds more complicated than it is <G>.
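The recipe above can be sketched in plain Java. This is a minimal, library-free illustration of the mechanism only: the `postings` map is a hypothetical stand-in for walking Lucene's WildcardTermEnum and TermDocs against a real IndexReader, and the wildcard-to-regex translation is just for the demo.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class WildcardFilterSketch {

    // postings: term -> ids of docs containing that term (hypothetical stand-in
    // for enumerating WildcardTermEnum + TermDocs over the index).
    static BitSet buildFilter(Map<String, int[]> postings, String wildcard, int numDocs) {
        // Translate the wildcard into a regex for this demo: '*' -> '.*', '?' -> '.'
        Pattern p = Pattern.compile(wildcard.replace("*", ".*").replace("?", "."));
        BitSet bits = new BitSet(numDocs);
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            if (p.matcher(e.getKey()).matches()) {
                for (int doc : e.getValue()) {
                    bits.set(doc); // turn on the bit for every doc holding a matching term
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<String, int[]>();
        postings.put("hardhat", new int[] {0, 2});
        postings.put("hat",     new int[] {1});
        postings.put("white",   new int[] {0});
        System.out.println(buildFilter(postings, "*hat*", 3)); // {0, 1, 2}
    }
}
```

In the Lucene of the time you would put this logic in a Filter subclass (override bits(IndexReader)) and wrap the result in a ConstantScoreQuery added to the BooleanQuery, as described above.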

    Just to clarify Lin's comment. The filter won't be slower than what you are
    doing now. But it will be slower than a straight-up query. Each filter will
    be under a megabyte (one bit for each of 5.5M docs), so you could consider
    caching a bunch of them if you expect a small number of terms to be
    repeated. Or let CachingWrapperFilter do it for you (I think).

    Note that ConstantScoreQuery loses relevance scoring (but you'll still get
    relevance for the non-wildcard terms in your BooleanQuery). I don't consider
    this a flaw since it's tricky defining how much *use* relevance really is
    for wildcards.

Finally, your test queries (and I commend you for running limit tests
before putting this in production) are about as bad as they get. From the
javadoc for WildcardQuery:

    " In order to prevent extremely slow WildcardQueries, a Wildcard term should
    not start with one of the wildcards * or ?"

    Best
    Erick


  • Rob Staveley (Tom) at Aug 15, 2006 at 10:26 am
    Sounds like you want to tokenise CONTENTS, if you are not already doing so.

    Then you could simply have:

    +CONTENTS:white +CONTENTS:hard +CONTENTS:hat
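If CONTENTS is currently indexed as a single untokenized string, only a wildcard can hit a word in the middle of it; tokenizing at index time makes each word its own term, so the plain term queries above work. A rough stdlib-only sketch of what an analyzer does at index time (real Lucene analyzers such as StandardAnalyzer do more than this):

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {

    // Roughly what a simple analyzer does: lowercase the text and split on
    // non-alphanumerics, so each word is indexed as its own term.
    static List<String> tokenize(String contents) {
        return Arrays.asList(contents.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        List<String> terms = tokenize("White Hard Hat, size 7");
        System.out.println(terms);                 // [white, hard, hat, size, 7]
        System.out.println(terms.contains("hat")); // true: exact term, no wildcard needed
    }
}
```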

  • Van Nguyen at Aug 15, 2006 at 8:00 pm
  • Van Nguyen at Aug 15, 2006 at 8:03 pm
  • Chris Hostetter at Aug 15, 2006 at 8:15 pm
    : I'm now using a filter on the wildcard portion of the query. Takes 10s
    : to return results. I took out the wildcard query and just search on
    : terms and that takes 187ms to return the same results (well... Not
    : same... But a lot faster to return results). Unfortunately, the
    : wildcard query is part of the requirement. I'll play around with the
    : filter to see if I can reduce the search time.

If I'm reading between the lines correctly, you have a requirement that
every search be a "substring" search, so that the input "foo" matches any
doc containing any word with "foo" as a substring.

If that's the case, you should definitely read up on the thread someone
else mentioned, which covers a lot of tricks to make prefix wildcards
work faster -- you have to understand that, as it is, to query on *foo*
Lucene has to scan every term in the index for you.

Even if index size is precious to you and you don't want to index
ngrams or rotated tokens or any of those other ideas, use a tokenizer that
splits the input on every letter and puts a big position increment gap
between words, so that all of your "wildcard" searches are really just
phrase searches ... that should still be faster than what you're doing
now.
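The letter-splitting trick can be sketched without Lucene. Every letter becomes its own token at consecutive positions, a large gap separates words, and a "substring" search is then just a check for the substring's letters at strictly consecutive positions (which is what a PhraseQuery over the letter tokens computes). All names here are illustrative, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class CharTokenSketch {
    static final int WORD_GAP = 100; // big position increment between words

    // One {char, position} token per letter; the gap keeps a phrase match
    // from spanning two different words.
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<int[]>();
        int pos = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            for (char c : word.toCharArray()) {
                tokens.add(new int[] {c, pos++});
            }
            pos += WORD_GAP;
        }
        return tokens;
    }

    // What a phrase query over letter tokens computes: the substring's
    // letters found at strictly consecutive positions.
    static boolean containsSubstring(List<int[]> tokens, String sub) {
        String s = sub.toLowerCase();
        for (int[] start : tokens) {
            if (start[0] != s.charAt(0)) continue;
            int matched = 1;
            for (int[] t : tokens) {
                if (matched < s.length()
                        && t[1] == start[1] + matched
                        && t[0] == s.charAt(matched)) {
                    matched++;
                }
            }
            if (matched == s.length()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> doc = tokenize("white hard hat");
        System.out.println(containsSubstring(doc, "hat")); // true: inside "hat"
        System.out.println(containsSubstring(doc, "ard")); // true: inside "hard"
        System.out.println(containsSubstring(doc, "dh"));  // false: the gap blocks cross-word matches
    }
}
```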






    -Hoss


  • Erick Erickson at Aug 15, 2006 at 9:20 pm
    Three things come to mind:

    1> You can think about caching your filters. This only makes sense if you
    have a fairly small number of wildcards that you *expect*. Probably not
    really viable, but I thought I'd mention it.

2> You say that the search is "part of the requirements". It might be worth
asking the people who wrote the requirements why. I've sometimes found that
requirements aren't cast in stone: things get in there that would be "nice"
to have, but once people are informed of the trade-offs, they turn out to
be flexible. Speaking of trade-offs, I'm not sure anyone can reasonably
complain about 10s wildcard searches on a 7G index of the type you
outlined <G>.

    3> Not that I expect it to matter much for search speed, but is the 7G
    necessary? In other words, do you store things that you don't need to store
    in your index? This could be important if you decide to try some of the
    alternative indexing schemes.
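Point 1 above is essentially what CachingWrapperFilter automates. A hand-rolled version can be sketched with stdlib types, with BitSet standing in for the filter's bits and the expensive build step left abstract (the names are illustrative, not Lucene API):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class FilterCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();
    public int builds = 0; // counts how often we pay for the expensive build

    // Stand-in for walking the term enum and setting bits; expensive in real life.
    private BitSet buildFilter(String wildcard) {
        builds++;
        return new BitSet();
    }

    public BitSet filterFor(String wildcard) {
        BitSet bits = cache.get(wildcard);
        if (bits == null) {                 // build once per distinct pattern
            bits = buildFilter(wildcard);
            cache.put(wildcard, bits);
        }
        return bits;
    }

    public static void main(String[] args) {
        FilterCacheSketch c = new FilterCacheSketch();
        c.filterFor("*hat*");
        c.filterFor("*hat*");  // served from the cache, no rebuild
        c.filterFor("*hard*");
        System.out.println(c.builds); // 2
    }
}
```

As noted above, this only pays off if a small set of wildcard patterns repeats; note too that a real cache must be invalidated when the index (reader) changes.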

    Best
    Erick
    On 8/15/06, Van Nguyen wrote:

    I'm now using a filter on the wildcard portion of the query. Takes 10s
    to return results. I took out the wildcard query and just search on
    terms and that takes 187ms to return the same results (well... Not
    same... But a lot faster to return results). Unfortunately, the
    wildcard query is part of the requirement. I'll play around with the
    filter to see if I can reduce the search time.

    Thanks,

    Van


Discussion Overview
group: java-user
categories: lucene
posted: Aug 14, '06 at 5:36p
active: Aug 15, '06 at 9:20p
posts: 10
users: 5
website: lucene.apache.org
