Our business needs to allow multiple values for a single field. For example, we have an index of employers, and an employer is often referred to by several names. The company "Wal-mart", for instance, is referred to as:
1) Wal-mart
2) Wal-mart Stores
3) Walmart
I would like a search for any of these 3 terms to match the Wal-mart employer.
I've tried two different approaches for this.
Approach 1: Create multiple values for the same field. So the document has these three fields:
1) name=Wal-mart
2) name=Wal-mart Stores
3) name=Walmart
The problem with this is that Lucene seems to treat the 3 values as one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic because the term frequency is 2 when a user searches for "Wal-mart".
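To make the term frequency effect concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It assumes a simple lowercasing, whitespace-splitting tokenizer, which is not necessarily what the real analyzer does, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MultiValueTf {
    // Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
    public static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts occurrences of a term when every value of the field is
    // treated as part of one token stream (the multi-valued field case).
    public static int termFreq(List<String> values, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int freq = 0;
        for (String value : values) {
            for (String token : tokenize(value)) {
                if (token.equals(needle)) {
                    freq++;
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Wal-mart", "Wal-mart Stores", "Walmart");
        System.out.println(termFreq(names, "Wal-mart")); // prints 2
        System.out.println(termFreq(names, "Walmart"));  // prints 1
    }
}
```

Under these assumptions, "Wal-mart" is counted once for each value that contains it, which is where the term frequency of 2 comes from.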
Approach 2: Create a differently named field for each value, so the document has these 3 fields:
1) name1=Wal-mart
2) name2=Wal-mart Stores
3) name3=Walmart
This fixes the issue above but introduces a different problem. The idf calculation is skewed because idf is calculated per field. Most employers have only one or maybe two names, so the name3 field's idf ends up being much higher because there are fewer docs with a given term in the name3 field.
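A rough illustration of the skew, assuming the classic default similarity formula idf = 1 + ln(numDocs / (docFreq + 1)). The index size and doc frequencies below are made-up numbers, not measurements from my index.

```java
public class PerFieldIdf {
    // Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 100_000;   // made-up index size
        int docFreqName1 = 50;   // made-up: docs with the term in name1
        int docFreqName3 = 2;    // made-up: docs with the same term in name3

        // The same term gets a noticeably larger idf in the sparsely
        // populated name3 field than in name1.
        System.out.printf("idf(name1) = %.3f%n", idf(docFreqName1, numDocs));
        System.out.printf("idf(name3) = %.3f%n", idf(docFreqName3, numDocs));
    }
}
```

Because the doc frequency sits in the denominator, a field that few documents populate inflates the idf of every term that appears in it.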
For now, I'm going with approach 2 but overriding the IndexReader. My overridden IndexReader.docFreq(Term t) method always returns the doc frequency from the name1 field, even if the Term t is actually for name2, name3, etc. But this doesn't feel like a clean solution.
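The idea behind the override can be sketched in plain Java. This is not the actual Lucene code: CanonicalDocFreq, its map-backed "index", and the rule that any field named "name" plus digits counts as an alias are all hypothetical, chosen only to show the redirect.

```java
import java.util.HashMap;
import java.util.Map;

public class CanonicalDocFreq {
    // Hypothetical stand-in for the index: "field:term" -> doc frequency.
    private final Map<String, Integer> rawDocFreq = new HashMap<>();
    private final String canonicalField;
    private final String aliasPrefix;

    public CanonicalDocFreq(String canonicalField, String aliasPrefix) {
        this.canonicalField = canonicalField;
        this.aliasPrefix = aliasPrefix;
    }

    public void put(String field, String term, int docFreq) {
        rawDocFreq.put(field + ":" + term, docFreq);
    }

    // Alias fields (name1, name2, name3, ...) are all answered from the
    // canonical field; every other field is looked up as-is.
    public int docFreq(String field, String term) {
        String lookupField = isAlias(field) ? canonicalField : field;
        return rawDocFreq.getOrDefault(lookupField + ":" + term, 0);
    }

    // Assumed naming rule: the alias prefix followed by one or more digits.
    private boolean isAlias(String field) {
        return field.startsWith(aliasPrefix)
                && field.substring(aliasPrefix.length()).matches("\\d+");
    }

    public static void main(String[] args) {
        CanonicalDocFreq df = new CanonicalDocFreq("name1", "name");
        df.put("name1", "walmart", 50); // made-up counts
        df.put("name3", "walmart", 2);
        System.out.println(df.docFreq("name3", "walmart")); // prints 50
    }
}
```

One consequence of this redirect is that a term which only ever occurs in name2 or name3 gets a doc frequency of 0, since name1 has never seen it, and so it receives the largest possible idf.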
Any suggestions on how to deal with this? Any ideas would be appreciated.
Ryan Aylward