It depends completely on what analyzer you use. Conceptually, an Analyzer
is composed of a Tokenizer followed by any number of Filters. So the
input stream is broken up by the Tokenizer, then each token has one or
more Filters applied (e.g. LowerCaseFilter, StopWordFilter)..
The reason I'm not answering your question directly is that I can't. If you
choose, say, a WhitespaceAnalyzer, which is built from a
WhitespaceTokenizer,
then your hyphens and apostrophes will pass through as-is, and your tokens
(the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
capitals
and all.
If you choose StandardAnalyzer, built on StandardTokenizer and several
filters
your tokens would be "jane" "doe" "smith" "sa" "eed" (note lower-casing as
well).
You can build your own Analyzers to process text however you please. Lucene
In Action has quite a thorough explanation of this process, you'll save
yourself
a bunch of time by reading those sections. You can get the second edition
of that book in electronic form from Manning through their early access
program.
Until you understand this process well, I'd recommend that you be very, very
sure that you use the *same* analyzer for both indexing and searching or
your
results will be...surprising.
Think about getting a copy of Luke to examine your indexes, that tool makes
it
easy to see the effects of various Analyzers. Google Lucene Luke....
Finally, you can easily use *different* analyzers for different fields
within a
document, see PerFieldAnalyzerWrapper.
HTH
Erick
On Sun, Dec 27, 2009 at 5:48 PM, syedfa wrote:Dear fellow Java developers:
I have a very basic question about indexing text using Lucene. I am
indexing a large amount of text, that includes names that contain certain
punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.) Will the punctuation
throw off the indexer in any way, such that it breaks up the tokens when
they shouldn't be, or will the indexer simply treat the punctuation inside
the names as any other character, and the presence of the punctuation will
not in any way hinder a user's ability to search for that name? Are there
any precautions that I should take to avoid any problems?
I hope this question is clear and makes sense.
Thanks in advance to all who reply.
--
View this message in context:
http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.htmlSent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org