On 18 May 2008 at 00:01, Paul Elschot wrote:
On Saturday 17 May 2008 20:28:40, Karl Wettin wrote:
As far as I know, Lucene only handles single-word synonyms at index
time. My life would be much simpler if it were possible to add
synonyms that span multiple tokens, such as "lucene in
action"="lia". I have a couple of workarounds that are OK, but it
really isn't the same thing when it comes to scoring.
The simplest solution is to index such synonyms at the first or last
or middle position of the source tokens, using a zero position
increment for the synonym. Was this one of the workarounds?
I get sloppyFreq problems with that.
The advantage of the zero position increment is that the original
token positions are not affected, so at least there is no influence
on scoring because of changes in the original token positions.
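A minimal sketch of that zero-position-increment idea, written in plain Python rather than against the Lucene API. The helper names `inject_synonym` and `absolute_positions` are made up for illustration, and the synonym is placed on the first token of the phrase, which is only one of the placements mentioned above.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position increment); hypothetical representation


def inject_synonym(terms: List[str], phrase: List[str], synonym: str) -> List[Token]:
    """Emit (term, increment) pairs, stacking `synonym` on the first token
    of every occurrence of `phrase` by giving it a position increment of 0."""
    out: List[Token] = []
    n = len(phrase)
    for i, term in enumerate(terms):
        out.append((term, 1))          # original tokens keep increment 1
        if terms[i:i + n] == phrase:   # a phrase occurrence starts here
            out.append((synonym, 0))   # synonym shares the same position
    return out


def absolute_positions(tokens: List[Token]) -> List[Tuple[str, int]]:
    """Turn position increments into absolute positions, starting at 0."""
    pos = -1
    result = []
    for term, inc in tokens:
        pos += inc
        result.append((term, pos))
    return result


tokens = inject_synonym(
    ["read", "lucene", "in", "action", "today"],
    ["lucene", "in", "action"],
    "lia",
)
print(absolute_positions(tokens))
# [('read', 0), ('lucene', 1), ('lia', 1), ('in', 2), ('action', 3), ('today', 4)]
```

The original tokens keep positions 0 to 4, but "lia" at position 1 is three positions away from "today" at position 4, so an exact phrase "lia today" would not match without slop. That gap is presumably related to the sloppyFreq problem mentioned above.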
I copy a number of fields to a single one. Each such field can be
represented in a number of languages or aliases in the same language.
[a, b, c, d, e, f], [g, h, i], [j, k, l, m]
[o, p] [u, v]
[q, r, s, t]
It would be great if the phrase query on [f, o, p, u, v] could yield a
distance of 0.
If I used the same synonyms for the same phrases in all documents at
all times, the edit distance would be static when scoring, but I
don't.
The terms of these synonyms are not really compatible with each other.
For instance [f, g, s, t, j] should not be allowed or at least be
heavily penalised compared to [f, o, p, j].
Searching a combination of languages should be allowed but preferably
only one per field copied to the big field. (Disjunction is not
applicable.)
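The constraints above can be sketched by brute force in plain Python. This is not the proposed query, just an enumeration of every document "permutation". It assumes one reading of the layout above: [o, p] and [q, r, s, t] are alternatives to [g, h, i], and [u, v] is an alternative to [j, k, l, m]. The function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def contains_run(haystack: Sequence[str], needle: Sequence[str]) -> bool:
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(
        list(haystack[i:i + n]) == list(needle)
        for i in range(len(haystack) - n + 1)
    )


def matches_with_zero_distance(slots: List[List[List[str]]], query: List[str]) -> bool:
    """Pick exactly one alternative per slot, concatenate the picks, and
    test whether the query is an exact phrase in any such combination."""
    for choice in product(*slots):
        flat = [term for alternative in choice for term in alternative]
        if contains_run(flat, query):
            return True
    return False


# Assumed grouping of the example: one slot per copied field.
slots = [
    [list("abcdef")],
    [list("ghi"), list("op"), list("qrst")],
    [list("jklm"), list("uv")],
]

print(matches_with_zero_distance(slots, list("fopuv")))  # True
print(matches_with_zero_distance(slots, list("fopj")))   # True
print(matches_with_zero_distance(slots, list("fgstj")))  # False
```

Because each combination uses only one alternative per slot, [f, g, s, t, j] can never match: g and s, t come from different alternatives of the same slot. Enumerating combinations grows multiplicatively with the number of alternatives, so this only illustrates the desired matching behaviour, not a way to implement it at scale.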
It is OK the way I have it running now, but more dimensions as
described above really increase the score quality. I confirmed that
using permutations of documents and filtering out the "duplicates".
Now I'm thinking it could be solved using token payloads and a brand
new MultiDimensionalSpanQuery. Not too different from what you
suggested way back in
http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016
There are some other issues too, but I'm not at liberty to disclose
too much. I hope it still makes sense?
karl