Three things come to mind:

1> You can think about caching your filters. This only makes sense if you
have a fairly small number of wildcards that you *expect*. Probably not
really viable, but I thought I'd mention it.

2> You say that the search is "part of the requirements". It might be worth
asking the people who wrote the requirements why. I've sometimes found that
requirements aren't cast in stone: things get in there that would be
"nice" to have, and once the authors are informed of the trade-offs, the
requirements often turn out to be flexible. Speaking of trade-offs, I'm
not sure a person can reasonably complain about 10s wildcard searches on
a 7G index of the type you outlined <G>.

3> Not that I expect it to matter much for search speed, but is the 7G
necessary? In other words, do you store things that you don't need to store
in your index? This could be important if you decide to try some of the
alternative indexing schemes.

Best
Erick

On 8/15/06, Van Nguyen wrote:

I'm now using a filter on the wildcard portion of the query. Takes 10s
to return results. I took out the wildcard query and just search on
terms and that takes 187ms to return the same results (well... Not
same... But a lot faster to return results). Unfortunately, the
wildcard query is part of the requirement. I'll play around with the
filter to see if I can reduce the search time.

Thanks,

Van

-----Original Message-----
From: Erick Erickson
Sent: Monday, August 14, 2006 6:20 PM
To: java-user@lucene.apache.org
Subject: Re: 7GB index taking forever to return hits

I actually suspect that your process isn't hung, it's just taking
forever because it's swapping a lot. Like a really, really, really lot.
Like more than you ever want to deal with <G>.

I think you're pretty much forced, as Lin said, to use a filter. I was
pleasantly surprised at how quickly filters can be built, and if you
have a small set of wildcards you really expect, you can use a
CachingWrapperFilter to keep them around. NOTE: we're only recommending
the filters for the wildcard portions of the query.

For a generous explanation of wildcards in general from "the guys",
search the archive for the thread "I just don't get wildcards at all".
It has quite an explanation of wildcards, as well as some alternative
indexing schemes that give wildcard-style matching without wildcard
queries. Those are faster at search time but bloat the index.

If you're unfamiliar with filters, you really need to become familiar
with them. Basically, a filter is a bitset with one bit per document,
turned on for each document that contains a term that could match.

You'll probably want to use a WildcardTermEnum, or perhaps a
RegexTermEnum, and for each term, enumerate all the docs containing that
term (see TermDocs) and set your bit in the filter. When the filter is
created, use it to construct a ConstantScoreQuery that you add to your
BooleanQuery. It sounds more complicated than it is <G>.
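
The enumerate-and-set-bits step can be sketched without Lucene at all.
What follows is a minimal sketch, assuming a toy in-memory inverted
index (a sorted map from term to doc ids) in place of WildcardTermEnum
and TermDocs; the class and method names are made up for illustration
and are not Lucene API:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Toy model only: a sorted term -> doc-id map stands in for the term
// dictionary and postings. None of these names are Lucene classes.
class WildcardFilterSketch {

    // Translate a wildcard pattern (* = any run of chars, ? = one char) to a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Walk every term; for each matching term, set the bit of every doc containing it.
    static BitSet buildFilter(TreeMap<String, int[]> index, String wildcard, int maxDoc) {
        Pattern p = toRegex(wildcard);
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> e : index.entrySet()) {
            if (p.matcher(e.getKey()).matches()) {
                for (int doc : e.getValue()) bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> index = new TreeMap<>();
        index.put("chat", new int[] {4});
        index.put("hat", new int[] {0, 3});
        index.put("hatch", new int[] {1});
        index.put("white", new int[] {0, 2});
        System.out.println(buildFilter(index, "*hat*", 5)); // prints {0, 1, 3, 4}
    }
}
```

If several wildcard clauses are all required, their bitsets can be
AND-ed together (BitSet.and) before the result is used to restrict the
rest of the query.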

Just to clarify Lin's comment. The filter won't be slower than what you
are doing now. But it will be slower than a straight-up query. Each
filter will be under a megabyte (one bit for each of 5.5M docs), so you
could consider caching a bunch of them if you expect a small number of
terms to be repeated. Or let CachingWrapperFilter do it for you (I
think).
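
To put numbers on that: 5.5M bits is 687,500 bytes, roughly 0.66MB per
filter. A hand-rolled cache could look something like the sketch below.
It assumes a plain LRU map keyed by the wildcard string;
FilterCacheSketch and LruFilterCache are hypothetical names, and this
is not a description of how CachingWrapperFilter works internally:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class FilterCacheSketch {

    // One bit per document, rounded up to whole bytes.
    static long filterBytes(long numDocs) {
        return (numDocs + 7) / 8;
    }

    // Small LRU cache of filters keyed by wildcard pattern (hypothetical helper).
    static class LruFilterCache extends LinkedHashMap<String, BitSet> {
        private final int capacity;

        LruFilterCache(int capacity) {
            super(16, 0.75f, true); // true = iterate in access order, oldest first
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
            return size() > capacity; // evict the least recently used filter
        }
    }

    public static void main(String[] args) {
        System.out.println(filterBytes(5_500_000L)); // prints 687500

        LruFilterCache cache = new LruFilterCache(2);
        cache.put("*white*", new BitSet());
        cache.put("*hard*", new BitSet());
        cache.get("*white*");               // touch: "*hard*" is now the eldest
        cache.put("*hat*", new BitSet());   // evicts "*hard*"
        System.out.println(cache.keySet()); // prints [*white*, *hat*]
    }
}
```

With a capacity of, say, 100 patterns, the cache would top out at
around 66MB of bitsets for a 5.5M-document index.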

Note that ConstantScoreQuery loses relevance scoring (but you'll still
get relevance for the non-wildcard terms in your BooleanQuery). I don't
consider this a flaw since it's tricky defining how much *use* relevance
really is for wildcards.

Finally, your test queries (and I commend you for making limit-testing
tests before putting it in production) are about as bad as they get.
From the javadoc for WildcardQuery:

"In order to prevent extremely slow WildcardQueries, a Wildcard term
should not start with one of the wildcards * or ?"
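
The reason a leading wildcard hurts: terms are kept in sorted order, so
a pattern with a literal prefix only has to look at the terms sharing
that prefix, while a pattern that starts with * or ? has to look at
every term in the field. A rough sketch, assuming a TreeSet as a
stand-in for the sorted term dictionary (the names here are
hypothetical, not Lucene classes):

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class LeadingWildcardSketch {

    // Return the terms that must be examined for a wildcard pattern.
    // The literal prefix before the first * or ? narrows the scan to one
    // contiguous range of the sorted terms; an empty prefix means a full scan.
    static SortedSet<String> candidates(TreeSet<String> terms, String wildcard) {
        int i = 0;
        while (i < wildcard.length()
                && wildcard.charAt(i) != '*' && wildcard.charAt(i) != '?') {
            i++;
        }
        String prefix = wildcard.substring(0, i);
        if (prefix.isEmpty()) {
            return terms; // leading wildcard: every term is a candidate
        }
        // Upper bound is a simplification that is good enough for a toy example.
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "chat", "hard", "hat", "hatch", "white", "whitewash"));
        System.out.println(candidates(terms, "hat*").size());  // prints 2
        System.out.println(candidates(terms, "*hat*").size()); // prints 6
    }
}
```

On a real index the second case means walking the whole term list for
the field, once per wildcard clause.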

Best
Erick


On 8/14/06, yueyu lin wrote:

To avoid "TooManyClauses", you can try a Filter instead of a Query. But
that will be slower.
From what I can see, so many keys match your query that it will be
tough for Lucene.
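
For context on why the exception shows up: a wildcard query is expanded
into one clause per matching term, and BooleanQuery refuses to hold
more than its maximum clause count (1024 by default). A toy
illustration, assuming a synthetic vocabulary and made-up class names
rather than real index data:

```java
import java.util.ArrayList;
import java.util.List;

class ClauseExpansionSketch {

    // Default maximum clause count of a BooleanQuery.
    static final int MAX_CLAUSES = 1024;

    // Count the distinct terms containing the substring, i.e. how many
    // clauses a "*substring*" wildcard would expand into.
    static int expandedClauses(List<String> terms, String substring) {
        int n = 0;
        for (String t : terms) {
            if (t.contains(substring)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Synthetic vocabulary: 2000 terms containing "hat", 2000 that do not.
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 2000; i++) terms.add("hat" + i);
        for (int i = 0; i < 2000; i++) terms.add("glove" + i);

        int n = expandedClauses(terms, "hat");
        System.out.println(n + " clauses, over the limit: " + (n > MAX_CLAUSES));
    }
}
```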
On 8/14/06, Van Nguyen wrote:

It was how I was implementing the search.

I am using a boolean query. Prior to the 7GB index, I was searching
over a 150MB index that consists of a very small part of the bigger
index. I was able to call
BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) and that worked
fine.
But I think that's the cause of my problem with this bigger index.
With that commented out, I get a TooManyClauses exception. A typical
query would look something like this:

+CONTENTS:*white* +CONTENTS:*hard* +CONTENTS:*hat* +COMPANY_CODE:u1
+LANGUAGE:enu -SKU_DESC_ID:0 +IS_DC:d +LOCATION:b72

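import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

BooleanQuery q = new BooleanQuery();

// Wildcard clauses on CONTENTS (the constructors take a Term).
WildcardQuery wc1 = new WildcardQuery(new Term("CONTENTS", "*white*"));
WildcardQuery wc2 = new WildcardQuery(new Term("CONTENTS", "*hard*"));
WildcardQuery wc3 = new WildcardQuery(new Term("CONTENTS", "*hat*"));
q.add(wc1, BooleanClause.Occur.MUST);
q.add(wc2, BooleanClause.Occur.MUST);
q.add(wc3, BooleanClause.Occur.MUST);

// Exact-match clauses.
TermQuery t1 = new TermQuery(new Term("COMPANY_CODE", "u1"));
q.add(t1, BooleanClause.Occur.MUST);

TermQuery t2 = new TermQuery(new Term("LANGUAGE", "enu"));
q.add(t2, BooleanClause.Occur.MUST);
.
.
.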

I take it this is not the best way to go about this.

So that leads me to my next question... What is the best way to go
about this?

Van

-----Original Message-----
From: yueyu lin
Sent: Monday, August 14, 2006 11:30 AM
To: java-user@lucene.apache.org
Subject: Re: 7GB index taking forever to return hits

The 2GB limitation only applies when you want to load the index into
memory on a 32-bit box.
Our index is larger than 13 gigabytes, and it works fine.
I think there must be an error in your design. You can use Luke
to see what is going on in your index.

On 8/14/06, Van Nguyen wrote:

Hi,



I have a 7GB index (about 45 fields per document X roughly 5.5 million
docs) running on a Windows 2003 32-bit machine (dual proc, 2GB memory).
The index is optimized. A search on this index (wildcard query with a
sort) will just "hang". At first the CPU usage is 100%, then it drops
down to 50% after a minute or so, and then there is no CPU
utilization... but the thread is still trying to perform the search.
I've tried this in my J2EE app and in a main program. Is this due to
the 2GB limitation of the 32-bit OS? (I didn't realize the index would
be this big... I just let it run over the weekend.)


If this is due to the 2GB limitation of the 32bit OS and since I
have this 7GB index built already (and optimized), is there a way
to split this into 2GB indices w/o having to re-index? Or is this
due to another factor?


Van

United Rentals
Consider it done.(tm)
800-UR-RENTS
unitedrentals.com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

--
Yueyu Lin


Discussion Overview
group: java-user
categories: lucene
posted: Aug 14, '06 at 5:36p
active: Aug 15, '06 at 9:20p
posts: 10
users: 5
website: lucene.apache.org
