FAQ
Dear fellow Java developers:

I have a very basic question about indexing text using Lucene. I am
indexing a large amount of text, that includes names that contain certain
punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.) Will the punctuation
throw off the indexer in any way, such that it breaks up the tokens when
they shouldn't be, or will the indexer simply treat the punctuation inside
the names as any other character, and the presence of the punctuation will
not in any way hinder a user's ability to search for that name? Are there
any precautions that I should take to avoid any problems?

I hope this question is clear and makes sense.

Thanks in advance to all who reply.

--
View this message in context: http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Erick Erickson at Dec 27, 2009 at 11:12 pm
    It depends completely on what analyzer you use. Conceptually, an Analyzer
    is composed of a Tokenizer followed by any number of Filters. So the
    input stream is broken up by the Tokenizer, then each token has one or
    more Filters applied (e.g. LowerCaseFilter, StopWordFilter)..

    The reason I'm not answering your question directly is that I can't. If you
    choose, say, a WhitespaceAnalyzer, which is built from a
    WhitespaceTokenizer,
    then your hyphens and apostrophes will pass through as-is, and your tokens
    (the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
    capitals
    and all.

    If you choose StandardAnalyzer, built on StandardTokenizer and several
    filters
    your tokens would be "jane" "doe" "smith" "sa" "eed" (note lower-casing as
    well).

    You can build your own Analyzers to process text however you please. Lucene
    In Action has quite a thorough explanation of this process, you'll save
    yourself
    a bunch of time by reading those sections. You can get the second edition
    of that book in electronic form from Manning through their early access
    program.

    Until you understand this process well, I'd recommend that you be very, very
    sure that you use the *same* analyzer for both indexing and searching or
    your
    results will be...surprising.

    Think about getting a copy of Luke to examine your indexes, that tool makes
    it
    easy to see the effects of various Analyzers. Google Lucene Luke....

    Finally, you can easily use *different* analyzers for different fields
    within a
    document, see PerFieldAnalyzerWrapper.

    HTH
    Erick
    On Sun, Dec 27, 2009 at 5:48 PM, syedfa wrote:


    Dear fellow Java developers:

    I have a very basic question about indexing text using Lucene. I am
    indexing a large amount of text, that includes names that contain certain
    punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.) Will the punctuation
    throw off the indexer in any way, such that it breaks up the tokens when
    they shouldn't be, or will the indexer simply treat the punctuation inside
    the names as any other character, and the presence of the punctuation will
    not in any way hinder a user's ability to search for that name? Are there
    any precautions that I should take to avoid any problems?

    I hope this question is clear and makes sense.

    Thanks in advance to all who reply.

    --
    View this message in context:
    http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Syedfa at Dec 27, 2009 at 11:20 pm
    Thanks very much for such a detailed reply, I didn't realize that there was
    so much to this subject. I understand the issue a bit better now!

    Take care.


    Erick Erickson wrote:
    It depends completely on what analyzer you use. Conceptually, an Analyzer
    is composed of a Tokenizer followed by any number of Filters. So the
    input stream is broken up by the Tokenizer, then each token has one or
    more Filters applied (e.g. LowerCaseFilter, StopWordFilter)..

    The reason I'm not answering your question directly is that I can't. If
    you
    choose, say, a WhitespaceAnalyzer, which is built from a
    WhitespaceTokenizer,
    then your hyphens and apostrophes will pass through as-is, and your tokens
    (the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
    capitals
    and all.

    If you choose StandardAnalyzer, built on StandardTokenizer and several
    filters
    your tokens would be "jane" "doe" "smith" "sa" "eed" (note lower-casing as
    well).

    You can build your own Analyzers to process text however you please.
    Lucene
    In Action has quite a thorough explanation of this process, you'll save
    yourself
    a bunch of time by reading those sections. You can get the second edition
    of that book in electronic form from Manning through their early access
    program.

    Until you understand this process well, I'd recommend that you be very,
    very
    sure that you use the *same* analyzer for both indexing and searching or
    your
    results will be...surprising.

    Think about getting a copy of Luke to examine your indexes, that tool
    makes
    it
    easy to see the effects of various Analyzers. Google Lucene Luke....

    Finally, you can easily use *different* analyzers for different fields
    within a
    document, see PerFieldAnalyzerWrapper.

    HTH
    Erick
    On Sun, Dec 27, 2009 at 5:48 PM, syedfa wrote:


    Dear fellow Java developers:

    I have a very basic question about indexing text using Lucene. I am
    indexing a large amount of text, that includes names that contain certain
    punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.) Will the punctuation
    throw off the indexer in any way, such that it breaks up the tokens when
    they shouldn't be, or will the indexer simply treat the punctuation
    inside
    the names as any other character, and the presence of the punctuation
    will
    not in any way hinder a user's ability to search for that name? Are
    there
    any precautions that I should take to avoid any problems?

    I hope this question is clear and makes sense.

    Thanks in advance to all who reply.

    --
    View this message in context:
    http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context: http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26938117.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedDec 27, '09 at 10:48p
activeDec 27, '09 at 11:20p
posts3
users2
websitelucene.apache.org

2 users in discussion

Syedfa: 2 posts Erick Erickson: 1 post

People

Translate

site design / logo © 2022 Grokbase