FAQ
I am trying to have multi-word synonyms work in lucene using Solr's *
SynonymFilter*.

I need to match synonyms at index time, since many of the synonym lists are
huge. Actually they are really not synonyms, but are words that belong to a
concept. For example, I would like to map {"New York", "Los Angeles", "New
Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
"city". While searching, the user query for the concept "city" will be
translated to a keyword like, say "CONCEPTcity", which is the synonym for
any city name.

Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
all I could match for "CONCEPTcity" is single word city names like
"Chicago", "Seattle", "Boston", etc., It would not match multi-word city
names like "New York", "Los Angeles", etc.,

I tried using Solr's SynonymFilter in tokenStream method in a custom
Analyzer (that extends org.apache.lucene.analysis.
Analyzer - lucene ver. 2.9.3) using:

* public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new SynonymFilter(
new WhitespaceTokenizer(reader),
synonymMap);
return result;
}
*
where *synonymMap* is loaded with synonyms using

*synonymMap.add(conceptTerms, listOfTokens, true, true);*

where *conceptTerms* is of type *ArrayList<String>* of all the terms in a
concept and *listofTokens* is of type *List<Token> *and contains only the
generic synonym identifier like *CONCEPTcity*.

When I print synonymMap using synonymMap.toString(), I get the output like

<{New York=<{Chicago=<{Seattle=<{New
Orleans=....<[(CATEGORYcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}>

so it looks like all the synonyms are loaded. But if I search for
"CATEGORYcity" then it says no matches found. I am not sure whether I have
loaded the synonyms correctly in the synonymMap.

Any help will be deeply appreciated. Thanks!

Search Discussions

  • Arun Rangarajan at Aug 19, 2010 at 4:56 am
    I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter
    is the one that prevents multi-word synonyms like "New York" from getting
    mapped to the generic synonym name like CONCEPTYcity. It appears to me that
    an analyzer which recognizes that a white-space is inside a synonym like
    "New York" will be required. Do I need to implement one like this or is
    there already an analyzer I can use? Looks like I am missing something here,
    since Solr's SynonymFilter is supposed to handle this. Can someone tell me
    what is the correct way to integrate Solr's SynonymFilter within a custom
    lucene analyzer? Thanks.


    On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
    wrote:
    I am trying to have multi-word synonyms work in lucene using Solr's *
    SynonymFilter*.

    I need to match synonyms at index time, since many of the synonym lists are
    huge. Actually they are really not synonyms, but are words that belong to a
    concept. For example, I would like to map {"New York", "Los Angeles", "New
    Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
    "city". While searching, the user query for the concept "city" will be
    translated to a keyword like, say "CONCEPTcity", which is the synonym for
    any city name.

    Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
    all I could match for "CONCEPTcity" is single word city names like
    "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
    names like "New York", "Los Angeles", etc.,

    I tried using Solr's SynonymFilter in tokenStream method in a custom
    Analyzer (that extends org.apache.lucene.analysis.
    Analyzer - lucene ver. 2.9.3) using:

    * public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SynonymFilter(
    new WhitespaceTokenizer(reader),
    synonymMap);
    return result;
    }
    *
    where *synonymMap* is loaded with synonyms using

    *synonymMap.add(conceptTerms, listOfTokens, true, true);*

    where *conceptTerms* is of type *ArrayList<String>* of all the terms in a
    concept and *listofTokens* is of type *List<Token> *and contains only the
    generic synonym identifier like *CONCEPTcity*.

    When I print synonymMap using synonymMap.toString(), I get the output like

    <{New York=<{Chicago=<{Seattle=<{New
    Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}>

    so it looks like all the synonyms are loaded. But if I search for
    "CONCEPTcity" then it says no matches found. I am not sure whether I have
    loaded the synonyms correctly in the synonymMap.

    Any help will be deeply appreciated. Thanks!
  • Lance Norskog at Aug 19, 2010 at 5:28 am
    Yes, you need an analyzer that leaves successive words together as one
    long term. This might be easier to do with the new CharFilter tool,
    which processes text before it goes to the tokenizer.

    What you are doing here is similar to Parts-Of-Speech analysis, where
    text analysis software parses a sentence and labels words 'Noun',
    'Verb', etc. One suite stores these labels as payloads on the terms.
    This might be a better way to store your categories, rather than using
    the synonym filter.

    On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan
    wrote:
    I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter
    is the one that prevents multi-word synonyms like "New York" from getting
    mapped to the generic synonym name like CONCEPTYcity. It appears to me that
    an analyzer which recognizes that a white-space is inside a synonym like
    "New York" will be required. Do I need to implement one like this or is
    there already an analyzer I can use? Looks like I am missing something here,
    since Solr's SynonymFilter is supposed to handle this. Can someone tell me
    what is the correct way to integrate Solr's SynonymFilter within a custom
    lucene analyzer? Thanks.


    On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
    wrote:
    I am trying to have multi-word synonyms work in lucene using Solr's *
    SynonymFilter*.

    I need to match synonyms at index time, since many of the synonym lists are
    huge. Actually they are really not synonyms, but are words that belong to a
    concept. For example, I would like to map {"New York", "Los Angeles", "New
    Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
    "city". While searching, the user query for the concept "city" will be
    translated to a keyword like, say "CONCEPTcity", which is the synonym for
    any city name.

    Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
    all I could match for "CONCEPTcity" is single word city names like
    "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
    names like "New York", "Los Angeles", etc.,

    I tried using Solr's SynonymFilter in tokenStream method in a custom
    Analyzer (that extends org.apache.lucene.analysis.
    Analyzer - lucene ver. 2.9.3) using:

    *    public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SynonymFilter(
    new WhitespaceTokenizer(reader),
    synonymMap);
    return result;
    }
    *
    where *synonymMap* is loaded with synonyms using

    *synonymMap.add(conceptTerms, listOfTokens, true, true);*

    where *conceptTerms* is of type *ArrayList<String>* of all the terms in a
    concept and *listofTokens* is of type *List<Token>  *and contains only the
    generic synonym identifier like *CONCEPTcity*.

    When I print synonymMap using synonymMap.toString(), I get the output like

    <{New York=<{Chicago=<{Seattle=<{New
    Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}>

    so it looks like all the synonyms are loaded. But if I search for
    "CONCEPTcity" then it says no matches found. I am not sure whether I have
    loaded the synonyms correctly in the synonymMap.

    Any help will be deeply appreciated. Thanks!


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Arun Rangarajan at Aug 27, 2010 at 1:01 am
    Thanks, Lance. After exploring for a while, I used lucene's ShingleFilter
    followed by the SynonymFilter in Lucene in Action book. Then using the type
    attribute, I removed all the shingles which did not belong to any category.
    On Wed, Aug 18, 2010 at 10:28 PM, Lance Norskog wrote:

    Yes, you need an analyzer that leaves successive words together as one
    long term. This might be easier to do with the new CharFilter tool,
    which processes text before it goes to the tokenizer.

    What you are doing here is similar to Parts-Of-Speech analysis, where
    text analysis software parses a sentence and labels words 'Noun',
    'Verb', etc. One suite stores these labels as payloads on the terms.
    This might be a better way to store your categories, rather than using
    the synonym filter.

    On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan
    wrote:
    I think the lucene WhitespaceAnalyzer I am using inside Solr's
    SynonymFilter
    is the one that prevents multi-word synonyms like "New York" from getting
    mapped to the generic synonym name like CONCEPTYcity. It appears to me that
    an analyzer which recognizes that a white-space is inside a synonym like
    "New York" will be required. Do I need to implement one like this or is
    there already an analyzer I can use? Looks like I am missing something here,
    since Solr's SynonymFilter is supposed to handle this. Can someone tell me
    what is the correct way to integrate Solr's SynonymFilter within a custom
    lucene analyzer? Thanks.


    On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
    wrote:
    I am trying to have multi-word synonyms work in lucene using Solr's *
    SynonymFilter*.

    I need to match synonyms at index time, since many of the synonym lists
    are
    huge. Actually they are really not synonyms, but are words that belong
    to a
    concept. For example, I would like to map {"New York", "Los Angeles",
    "New
    Orleans", "Salt Lake City"...}, a bunch of city names, to the concept
    called
    "city". While searching, the user query for the concept "city" will be
    translated to a keyword like, say "CONCEPTcity", which is the synonym
    for
    any city name.

    Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p.
    131),
    all I could match for "CONCEPTcity" is single word city names like
    "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
    names like "New York", "Los Angeles", etc.,

    I tried using Solr's SynonymFilter in tokenStream method in a custom
    Analyzer (that extends org.apache.lucene.analysis.
    Analyzer - lucene ver. 2.9.3) using:

    * public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SynonymFilter(
    new WhitespaceTokenizer(reader),
    synonymMap);
    return result;
    }
    *
    where *synonymMap* is loaded with synonyms using

    *synonymMap.add(conceptTerms, listOfTokens, true, true);*

    where *conceptTerms* is of type *ArrayList<String>* of all the terms in
    a
    concept and *listofTokens* is of type *List<Token> *and contains only
    the
    generic synonym identifier like *CONCEPTcity*.

    When I print synonymMap using synonymMap.toString(), I get the output
    like
    <{New York=<{Chicago=<{Seattle=<{New
    Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}>

    so it looks like all the synonyms are loaded. But if I search for
    "CONCEPTcity" then it says no matches found. I am not sure whether I
    have
    loaded the synonyms correctly in the synonymMap.

    Any help will be deeply appreciated. Thanks!


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedAug 17, '10 at 11:45p
activeAug 27, '10 at 1:01a
posts4
users2
websitelucene.apache.org

2 users in discussion

Arun Rangarajan: 3 posts Lance Norskog: 1 post

People

Translate

site design / logo © 2022 Grokbase