FAQ
Our business has a need to allow for multiple values for a single field. For example, we have an index of employers where an employer often has multiple ways people refer to it. For example, the company "Wal-mart" is referred to as:

1) Wal-mart

2) Wal-mart Stores

3) Walmart
I would like a search for any of these 3 terms to match the Wal-mart employer.

I've tried two different approaches for this.

Approach 1: Create multiple values for the same field. So the document has these three fields:

1) name=Wal-mart

2) name=Wal-mart Stores

3) name=Walmart
The problem with this is Lucene seems to treat the 3 different fields as one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic b/c term frequencies is 2 when a user searches for "Wal-mart".

Approach 2: Create different named fields for each value so the document has these 3 fields:

1) name1=Wal-mart

2) name2=Wal-mart Stores

3) name3=Walmart
This fixes the issue above but introduces a different problem. The idf calculation is incorrect b/c idf is calculated per field. Most employers only have one name or maybe 2 names. So the name3 fields idf ends up being much higher b/c there are fewer docs with a given term in the name3 field.

For now, I'm going with approach 2 but overriding the IndexReader. IndexReader.docFreq(Term t) method always returns the doc frequency from the name1 field even if the Term t is actually for name2 or name3, etc. But this doesn't feel like a clean solution.

Any suggestions on how to deal with this? Any ideas would be appreciated.
Ryan Aylward

Search Discussions

  • Anshum at Jan 8, 2011 at 3:39 am
    Hi Ryan,
    You should try the synonym filter. That should help you with this kinda
    problem.
    You could also look at turning off norms for the name field, or turning off
    tf or idf.

    --
    Anshum Gupta
    http://ai-cafe.blogspot.com

    On Sat, Jan 8, 2011 at 6:03 AM, Ryan Aylward wrote:

    Our business has a need to allow for multiple values for a single field.
    For example, we have an index of employers where an employer often has
    multiple ways people refer to it. For example, the company "Wal-mart" is
    referred to as:

    1) Wal-mart

    2) Wal-mart Stores

    3) Walmart
    I would like a search for any of these 3 terms to match the Wal-mart
    employer.

    I've tried two different approaches for this.

    Approach 1: Create multiple values for the same field. So the document has
    these three fields:

    1) name=Wal-mart

    2) name=Wal-mart Stores

    3) name=Walmart
    The problem with this is Lucene seems to treat the 3 different fields as
    one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic
    b/c term frequencies is 2 when a user searches for "Wal-mart".

    Approach 2: Create different named fields for each value so the document
    has these 3 fields:

    1) name1=Wal-mart

    2) name2=Wal-mart Stores

    3) name3=Walmart
    This fixes the issue above but introduces a different problem. The idf
    calculation is incorrect b/c idf is calculated per field. Most employers
    only have one name or maybe 2 names. So the name3 fields idf ends up being
    much higher b/c there are fewer docs with a given term in the name3 field.

    For now, I'm going with approach 2 but overriding the IndexReader.
    IndexReader.docFreq(Term t) method always returns the doc frequency from the
    name1 field even if the Term t is actually for name2 or name3, etc. But this
    doesn't feel like a clean solution.

    Any suggestions on how to deal with this? Any ideas would be appreciated.
    Ryan Aylward
  • Ryan Aylward at Jan 10, 2011 at 5:43 pm
    Thanks for the response.

    We do leverage synonyms but they are not appropriate for this case. We use synonyms for words that are truly synonymous for the entire index such as "inc" and "incorporated". Those words are always interchangeable. However, many of the employer alternate names are only valid for a single employer not for the entire index.

    We do disable the lengthNorm but we benefit from tf and idf so disabling those would cause more harm than good.

    Any other suggestions would be appreciated.

    Thanks.

    -----Original Message-----
    From: Anshum
    Sent: Friday, January 07, 2011 7:38 PM
    To: java-user@lucene.apache.org
    Subject: Re: Creating an index with multiple values for a single field

    Hi Ryan,
    You should try the synonym filter. That should help you with this kinda
    problem.
    You could also look at turning off norms for the name field, or turning off
    tf or idf.

    --
    Anshum Gupta
    http://ai-cafe.blogspot.com

    On Sat, Jan 8, 2011 at 6:03 AM, Ryan Aylward wrote:

    Our business has a need to allow for multiple values for a single field.
    For example, we have an index of employers where an employer often has
    multiple ways people refer to it. For example, the company "Wal-mart" is
    referred to as:

    1) Wal-mart

    2) Wal-mart Stores

    3) Walmart
    I would like a search for any of these 3 terms to match the Wal-mart
    employer.

    I've tried two different approaches for this.

    Approach 1: Create multiple values for the same field. So the document has
    these three fields:

    1) name=Wal-mart

    2) name=Wal-mart Stores

    3) name=Walmart
    The problem with this is Lucene seems to treat the 3 different fields as
    one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic
    b/c term frequencies is 2 when a user searches for "Wal-mart".

    Approach 2: Create different named fields for each value so the document
    has these 3 fields:

    1) name1=Wal-mart

    2) name2=Wal-mart Stores

    3) name3=Walmart
    This fixes the issue above but introduces a different problem. The idf
    calculation is incorrect b/c idf is calculated per field. Most employers
    only have one name or maybe 2 names. So the name3 fields idf ends up being
    much higher b/c there are fewer docs with a given term in the name3 field.

    For now, I'm going with approach 2 but overriding the IndexReader.
    IndexReader.docFreq(Term t) method always returns the doc frequency from the
    name1 field even if the Term t is actually for name2 or name3, etc. But this
    doesn't feel like a clean solution.

    Any suggestions on how to deal with this? Any ideas would be appreciated.
    Ryan Aylward
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ahmet Arslan at Jan 10, 2011 at 7:30 pm

    We do leverage synonyms but they are not appropriate for
    this case. We use synonyms for words that are truly
    synonymous for the entire index such as "inc" and
    "incorporated". Those words are always interchangeable.
    However, many of the employer alternate names are only valid
    for a single employer not for the entire index.

    We do disable the lengthNorm but we benefit from tf and idf
    so disabling those would cause more harm than good.

    Any other suggestions would be appreciated.
    May be WDF can useful?

    http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Doron Cohen at Jan 10, 2011 at 10:16 pm

    On Mon, Jan 10, 2011 at 7:44 PM, Ryan Aylward wrote:

    We do leverage synonyms but they are not appropriate for this case. We use
    synonyms for words that are truly synonymous for the entire index such as
    "inc" and "incorporated". Those words are always interchangeable. However,
    many of the employer alternate names are only valid for a single employer
    not for the entire index.
    We do disable the lengthNorm but we benefit from tf and idf so disabling
    those would cause more harm than good.
    Any other suggestions would be appreciated.
    How about indexing this specific field without analysis - except perhaps for
    lower casing - i.e. in the above example the field would have exactly 3
    tokens: [wal-mart], [wal-mart stores], [walmart]. At search time this field
    would be treated the same way, that is, no analysis except for lower casing.
    Since norms are already omitted for this field its lengths differences
    between docs would not affect scores.
    HTH, Doron

    Thanks.

    -----Original Message-----
    From: Anshum
    Sent: Friday, January 07, 2011 7:38 PM
    To: java-user@lucene.apache.org
    Subject: Re: Creating an index with multiple values for a single field

    Hi Ryan,
    You should try the synonym filter. That should help you with this kinda
    problem.
    You could also look at turning off norms for the name field, or turning off
    tf or idf.

    --
    Anshum Gupta
    http://ai-cafe.blogspot.com

    On Sat, Jan 8, 2011 at 6:03 AM, Ryan Aylward wrote:

    Our business has a need to allow for multiple values for a single field.
    For example, we have an index of employers where an employer often has
    multiple ways people refer to it. For example, the company "Wal-mart" is
    referred to as:

    1) Wal-mart

    2) Wal-mart Stores

    3) Walmart
    I would like a search for any of these 3 terms to match the Wal-mart
    employer.

    I've tried two different approaches for this.

    Approach 1: Create multiple values for the same field. So the document has
    these three fields:

    1) name=Wal-mart

    2) name=Wal-mart Stores

    3) name=Walmart
    The problem with this is Lucene seems to treat the 3 different fields as
    one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic
    b/c term frequencies is 2 when a user searches for "Wal-mart".

    Approach 2: Create different named fields for each value so the document
    has these 3 fields:

    1) name1=Wal-mart

    2) name2=Wal-mart Stores

    3) name3=Walmart
    This fixes the issue above but introduces a different problem. The idf
    calculation is incorrect b/c idf is calculated per field. Most employers
    only have one name or maybe 2 names. So the name3 fields idf ends up being
    much higher b/c there are fewer docs with a given term in the name3 field.
    For now, I'm going with approach 2 but overriding the IndexReader.
    IndexReader.docFreq(Term t) method always returns the doc frequency from the
    name1 field even if the Term t is actually for name2 or name3, etc. But this
    doesn't feel like a clean solution.

    Any suggestions on how to deal with this? Any ideas would be appreciated.
    Ryan Aylward
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJan 8, '11 at 12:33a
activeJan 10, '11 at 10:16p
posts5
users4
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase