I've been trying to brainstorm on this but could not figure out a way to
go about it.

Let's say I'm searching for "batman". I want results that include:

batman
bat man
bat-man
etc.

Or, if I search for "screwdriver", I would want results to include:

screwdriver
screw drivers
etc.

I've tried using the SnowballAnalyzer. I've also thought about creating a
"SynonymAnalyzer" as described in the Lucene in Action book, but that
would mean I would have to know all the synonyms for each word I need to
index, and at this point I do not. Any suggestions on how to go about
this?



Van


  • Dennis Watson at Dec 1, 2006 at 1:35 am
    NGramAnalyzer should do this. I think there is one in the contribs area or in
    LUA.


    Dennis
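
To make the n-gram suggestion concrete, here is a minimal sketch in plain Java. It is not Lucene's analyzer API: the class `NGramSketch`, its methods, and the choice to drop spaces and punctuation before building the n-grams are all assumptions made for illustration. It shows why "batman", "bat man" and "bat-man" end up sharing the same character trigrams.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class NGramSketch {

    /** Lowercases the text and drops everything except letters and digits. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns all character n-grams of the normalized text, in order. */
    static List<String> ngrams(String text, int n) {
        String s = normalize(text);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Fraction of the query's distinct n-grams that also occur in the candidate. */
    static double overlap(String query, String candidate, int n) {
        Set<String> q = new HashSet<>(ngrams(query, n));
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> c = new HashSet<>(ngrams(candidate, n));
        int shared = 0;
        for (String g : q) {
            if (c.contains(g)) {
                shared++;
            }
        }
        return (double) shared / q.size();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("bat-man", 3));
        System.out.println(overlap("batman", "bat man", 3));
        System.out.println(overlap("screwdriver", "screw drivers", 3));
        System.out.println(overlap("batman", "screwdriver", 3));
    }
}
```

In a real index the n-grams would be produced by the analyzer at both index and query time. The overlap score here only stands in for the ranking the search engine would do.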

    On Thursday 30 November 2006 17:25, Van Nguyen wrote:
    [...]
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Soeren Pekrul at Dec 1, 2006 at 10:10 am
    Hello Van,

    This looks like a compound-word splitting problem. The topic was discussed
    in the thread "Analysis/tokenization of compound words"
    (http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded).

    The main idea is as follows:
    You have a corpus (lexicon/dictionary) and a word you want to split. Cut
    the word into two parts at each letter position and look both parts up in
    the corpus. If you find them, you can split the word at that position.
    Doing this recursively finds all sub-parts. This method finds the smallest
    possible words. [1]
    If your corpus contains no compound words, you can get better quality by
    looking for the largest possible words: look up the whole word first and
    split it only if you cannot find it. Do this recursively as well. [2]

    Warning: The results are usually not linguistically correct. However,
    the quality should be good enough for searching.

    Sören

    [1] C. Monz and M. De Rijke: Shallow Morphological Analysis in
    Monolingual Information Retrieval for Dutch, German and Italian,
    Language & Inference Technology, University of Amsterdam.

    [2] Pasqualino Imbemba: A Splitter for German Compound Words, Free
    University Of Bolzano, Bolzano, 2006
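
The largest-words-first idea described above can be sketched roughly as follows. This is a simplified illustration and not the algorithm from [1] or [2]: the class `CompoundSplitSketch`, the `minLen` parameter (a minimum part length that avoids one-letter fragments), and the tiny lexicon are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class CompoundSplitSketch {

    /**
     * Splits word into lexicon entries, preferring the longest parts.
     * Returns an empty list when no complete split exists.
     */
    static List<String> split(String word, Set<String> lexicon, int minLen) {
        List<String> parts = new ArrayList<>();
        // Look up the whole word first; a known word is never split.
        if (lexicon.contains(word)) {
            parts.add(word);
            return parts;
        }
        // Otherwise try two-part splits, longest left part first.
        for (int i = word.length() - minLen; i >= minLen; i--) {
            String left = word.substring(0, i);
            if (!lexicon.contains(left)) {
                continue;
            }
            // Recurse on the remainder; back off to a shorter left part on failure.
            List<String> rest = split(word.substring(i), lexicon, minLen);
            if (!rest.isEmpty()) {
                parts.add(left);
                parts.addAll(rest);
                return parts;
            }
        }
        return parts; // empty: no split found
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(
                Arrays.asList("bat", "man", "screw", "driver"));
        System.out.println(split("batman", lexicon, 3));
        System.out.println(split("screwdriver", lexicon, 3));
        System.out.println(split("batmobile", lexicon, 3));
    }
}
```

One way to use this for search would be to index the parts alongside the original token, so that a query for "batman" can also match documents containing "bat man". Whether "bat-man" is already split at the hyphen depends on the analyzer in use.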


Discussion Overview
group: java-user
categories: lucene
posted: Dec 1, '06 at 1:30a
active: Dec 1, '06 at 10:10a
posts: 3
users: 3
website: lucene.apache.org
