FAQ
As far as I know, Lucene only handles single-word synonyms at index
time. My life would be much simpler if it were possible to add synonyms
that span multiple tokens, such as "lucene in action"="lia". I
have a couple of workarounds that are OK, but it really isn't the same
thing when it comes down to the scoring.

The approach that does the best job at scoring is to index several
permutations of the same document. But it doesn't feel good: I have
cases where that means several hundred documents, and I have to do
post-processing to filter out the "duplicate" hits. It can turn out to
be rather expensive. And I'm sure it messes with the scoring in several
ways I haven't noticed yet.

I've also considered creating some multi-dimensional term position
space, but I'd say that could take a lot of time to implement.

Are there any good solutions to this?


karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

  • Paul Elschot at May 17, 2008 at 10:01 pm

    On Saturday 17 May 2008 at 20:28:40, Karl Wettin wrote:
    > As far as I know, Lucene only handles single-word synonyms at index
    > time. My life would be much simpler if it were possible to add
    > synonyms that span multiple tokens, such as "lucene in
    > action"="lia". I have a couple of workarounds that are OK, but it
    > really isn't the same thing when it comes down to the scoring.
    >
    > The approach that does the best job at scoring is to index several
    > permutations of the same document. But it doesn't feel good: I have
    > cases where that means several hundred documents, and I have to do
    > post-processing to filter out the "duplicate" hits. It can turn out
    > to be rather expensive. And I'm sure it messes with the scoring in
    > several ways I haven't noticed yet.
    >
    > I've also considered creating some multi-dimensional term position
    > space, but I'd say that could take a lot of time to implement.
    >
    > Are there any good solutions to this?

    The simplest solution is to index such synonyms at the first or last
    or middle position of the source tokens, using a zero position
    increment for the synonym. Was this one of the workarounds?

    The advantage of the zero position increment is that the original
    token positions are not affected, so at least there is no influence
    on scoring because of changes in the original token positions.
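A minimal sketch of Paul's suggestion, in plain Python rather than Lucene code: a token stream is modeled as (term, positionIncrement) pairs, and a multi-word synonym is injected with a zero increment so it shares a position with the first word of the phrase it replaces. The function names here are illustrative, not Lucene API.

```python
# Model a token stream as (term, position_increment) pairs, the way
# Lucene analyzers emit tokens. A multi-word synonym is injected with
# a zero position increment so it shares a position with an original
# token and leaves all other positions untouched.

def inject_synonym(tokens, phrase, synonym):
    """Emit the original tokens; where `phrase` starts, also emit
    `synonym` at the same position (increment 0) as its first word."""
    out = []
    terms = [t for t, _ in tokens]
    for i, (term, inc) in enumerate(tokens):
        out.append((term, inc))
        if terms[i:i + len(phrase)] == phrase:
            out.append((synonym, 0))  # zero increment: same position
    return out

def positions(tokens):
    """Resolve position increments into absolute positions per term."""
    pos, result = -1, {}
    for term, inc in tokens:
        pos += inc
        result.setdefault(term, []).append(pos)
    return result

stream = [("lucene", 1), ("in", 1), ("action", 1)]
stream = inject_synonym(stream, ["lucene", "in", "action"], "lia")
print(positions(stream))
# "lia" shares position 0 with "lucene"; "in" and "action" keep 1 and 2.
```

The point of the zero increment is visible in the output: the original tokens' positions are exactly what they would have been without the synonym.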

    Regards,
    Paul Elschot

  • Karl Wettin at May 18, 2008 at 2:31 pm

    On 18 May 2008, at 00:01, Paul Elschot wrote:
    > On Saturday 17 May 2008 at 20:28:40, Karl Wettin wrote:
    >> As far as I know, Lucene only handles single-word synonyms at
    >> index time. My life would be much simpler if it were possible to
    >> add synonyms that span multiple tokens, such as "lucene in
    >> action"="lia". I have a couple of workarounds that are OK, but it
    >> really isn't the same thing when it comes down to the scoring.
    > The simplest solution is to index such synonyms at the first or
    > last or middle position of the source tokens, using a zero position
    > increment for the synonym. Was this one of the workarounds?

    I get sloppyFreq problems with that.

    > The advantage of the zero position increment is that the original
    > token positions are not affected, so at least there is no influence
    > on scoring because of changes in the original token positions.

    I copy a number of fields to a single one. Each such field can be
    represented in a number of languages or aliases in the same language.

    [a, b, c, d, e, f], [g, h, i], [j, k, l, m]
    [o, p] [u, v]
    [q, r, s, t]

    It would be great if a phrase query on [f, o, p, u, v] could yield a
    0 distance.

    If I used the same synonyms for the same phrases in all documents at
    all times, the edit distance would be static when scoring, but I
    don't.

    The terms of these synonyms are not really compatible with each other.
    For instance [f, g, s, t, j] should not be allowed or at least be
    heavily penalised compared to [f, o, p, j].

    Searching a combination of languages should be allowed but preferably
    only one per field copied to the big field. (Disjunction is not
    applicable.)
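The compatibility constraint Karl describes can be pictured with a toy model (illustrative Python, not Lucene code and not a proposed implementation): each bracketed group is one "path" of tokens folded into the big field, and a query should only score as a clean match when it never abandons a path mid-way.

```python
# Model each field alternative as a separate "path" of tokens, all
# folded into one big field. A query is "clean" when every path it
# enters is followed to its end before jumping to another path;
# interleaving incompatible paths should be penalised.

PATHS = [
    ["a", "b", "c", "d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l", "m"],
    ["o", "p"],
    ["u", "v"],
    ["q", "r", "s", "t"],
]

def path_switches(query):
    """Count how often consecutive query terms jump to another path
    while the current path has not been exhausted. 0 = clean match."""
    path_of = {t: i for i, p in enumerate(PATHS) for t in p}
    switches = 0
    for prev, cur in zip(query, query[1:]):
        a, b = path_of[prev], path_of[cur]
        if a != b and PATHS[a][-1] != prev:
            switches += 1  # left path `a` mid-way: incompatible jump
    return switches

print(path_switches(["f", "o", "p", "u", "v"]))  # 0: each path completed
print(path_switches(["f", "g", "s", "t", "j"]))  # 1: leaves [g, h, i] mid-way
```

Under this model [f, o, p, u, v] scores 0 switches while [f, g, s, t, j] is penalised, matching the behaviour Karl asks for.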

    It is OK the way I have it running now, but more dimensions as
    described above really increase the score quality. I confirmed that
    using permutations of documents and filtering out the "duplicates".
    Now I'm thinking it could be solved using token payloads and a brand
    new MultiDimensionalSpanQuery. Not too different from what you
    suggested way back in http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016

    There are some other issues too, but I'm not at liberty to disclose
    too much. I hope it still makes sense?


    karl

  • Paul Elschot at May 18, 2008 at 5:18 pm

    On Sunday 18 May 2008 at 16:30:26, Karl Wettin wrote:
    > On 18 May 2008, at 00:01, Paul Elschot wrote:
    >> On Saturday 17 May 2008 at 20:28:40, Karl Wettin wrote:
    >>> As far as I know, Lucene only handles single-word synonyms at
    >>> index time. My life would be much simpler if it were possible to
    >>> add synonyms that span multiple tokens, such as "lucene in
    >>> action"="lia". I have a couple of workarounds that are OK, but
    >>> it really isn't the same thing when it comes down to the scoring.
    >> The simplest solution is to index such synonyms at the first or
    >> last or middle position of the source tokens, using a zero
    >> position increment for the synonym. Was this one of the
    >> workarounds?
    > I get sloppyFreq problems with that.
    >> The advantage of the zero position increment is that the original
    >> token positions are not affected, so at least there is no
    >> influence on scoring because of changes in the original token
    >> positions.
    > I copy a number of fields to a single one. Each such field can be
    > represented in a number of languages or aliases in the same
    > language.
    >
    > [a, b, c, d, e, f], [g, h, i], [j, k, l, m]
    > [o, p] [u, v]
    > [q, r, s, t]
    >
    > It would be great if a phrase query on [f, o, p, u, v] could yield
    > a 0 distance.
    >
    > If I used the same synonyms for the same phrases in all documents
    > at all times, the edit distance would be static when scoring, but
    > I don't.
    >
    > The terms of these synonyms are not really compatible with each
    > other. For instance [f, g, s, t, j] should not be allowed, or at
    > least be heavily penalised compared to [f, o, p, j].
    >
    > Searching a combination of languages should be allowed, but
    > preferably only one per field copied to the big field.
    > (Disjunction is not applicable.)
    >
    > It is OK the way I have it running now, but more dimensions as
    > described above really increase the score quality. I confirmed
    > that using permutations of documents and filtering out the
    > "duplicates". Now I'm thinking it could be solved using token
    > payloads and a brand new MultiDimensionalSpanQuery. Not too
    > different from what you suggested way back in
    > http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016

    That would mean a term-extending tag to indicate that a term is on
    an alternative path?

    > There are some other issues too, but I'm not at liberty to
    > disclose too much. I hope it still makes sense?

    Yes. I suppose the payload would indicate how much the alternative
    path length differs from the original path?

    In case you can't disclose more, no answer would of course be OK,
    too.

    Regards,
    Paul Elschot

  • Karl Wettin at May 18, 2008 at 6:00 pm

    On 18 May 2008, at 19:17, Paul Elschot wrote:
    >> Now I'm thinking it could be solved using token payloads and a
    >> brand new MultiDimensionalSpanQuery. Not too different from what
    >> you suggested way back in
    >> http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016
    > That would mean a term-extending tag to indicate that a term is on
    > an alternative path?

    No extra token, just payloads. I think. I really haven't thought
    that much about it yet.

    > I suppose the payload would indicate how much the alternative
    > path length differs from the original path?

    Something like that. It might also be nice if it allowed terms in
    different places of the index to be part of the same synonym
    dimension, with some extra boost the more the query matches tokens
    in the same dimension. But I'm not completely sure about that
    either.
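One way to picture the payload idea being discussed (a toy Python model, assuming the payload simply stores how much longer the original phrase is than the synonym that replaces it; none of these names are existing Lucene API):

```python
# Toy model: an injected synonym token carries a payload recording the
# length difference between the original phrase and the synonym. A
# sloppy phrase scorer could add that delta back when computing match
# distance, so "lia" behaves as if it were three tokens wide.

from dataclasses import dataclass

@dataclass
class Token:
    term: str
    position: int
    path_delta: int = 0  # payload: original_length - synonym_length

def adjusted_distance(start: Token, end: Token) -> int:
    """Positional distance corrected by the start token's payload, so
    an alternative-path token spans its original phrase's width."""
    return end.position - (start.position + start.path_delta)

# "lucene in action rocks" indexed at positions 0..3; "lia" injected
# at position 0 with payload 2 (it stands in for a 3-token phrase).
lia = Token("lia", 0, path_delta=2)
rocks = Token("rocks", 3)

print(adjusted_distance(lia, rocks))
# 1: adjacent, the same distance "action" -> "rocks" has in the original.
```

This is only a sketch of what "the payload indicates how much the alternative path length differs" could mean for sloppy phrase scoring, not a worked-out design.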


    This brings me to another thing: isn't there a layer of code missing
    that partitions and reads payload data? I already use payloads for
    position boosts and would have to hack a bit extra to combine that.
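The "partitioning" layer Karl asks about could look something like this hypothetical sketch (not an existing Lucene layer): a Lucene payload is just bytes per token position, so combining a position boost with a path delta means agreeing on a byte layout, here done with Python's struct module.

```python
# A payload is an opaque byte string per token position. If two
# features (a position boost and a path-length delta) must share one
# payload, some convention has to partition the bytes. Sketch: a
# fixed layout of one big-endian 32-bit float followed by one signed
# byte.

import struct

LAYOUT = ">fb"  # 4-byte float boost + 1-byte path delta

def pack_payload(boost: float, path_delta: int) -> bytes:
    return struct.pack(LAYOUT, boost, path_delta)

def unpack_payload(payload: bytes):
    boost, path_delta = struct.unpack(LAYOUT, payload)
    return boost, path_delta

payload = pack_payload(2.5, 2)
print(len(payload))             # 5 bytes per token position
print(unpack_payload(payload))  # (2.5, 2)
```

A fixed layout like this is the simplest convention; anything variable-length would need a tag or length prefix in the payload itself.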


    karl


Discussion Overview
group: java-user @ lucene.apache.org
category: lucene
posted: May 17, 2008 at 6:29 pm
active: May 18, 2008 at 6:00 pm
posts: 5
users: 2 (Karl Wettin: 3 posts, Paul Elschot: 2 posts)