Hi,

I'm fairly new to Lucene. I'd like to know how we can index synonyms for
multiple words.

This is the scenario:

Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG.

Now assume the two words combined WORD1 WORD2 can be replaced by another
word SYN.

If I place SYN after WORD1 with positionIncrement set to 0, WORD2 will
follow SYN, which is incorrect; and the other way round if I place it
after WORD2.

If any of you have solved a similar problem, I'd be thankful if you could
shed some light on the solution.

Regards,
Sumukh


  • Erick Erickson at Mar 2, 2009 at 2:51 pm
    This has been discussed on the user list, so searching there
    might get you an answer quicker.

    See: http://wiki.apache.org/lucene-java/MailingListArchives

    I don't remember the results, but...

    Best
    Erick
  • Michael McCandless at Mar 2, 2009 at 3:07 pm
    Shouldn't WORD2's position be 1 more than your SYN?

    I.e., don't you want these positions?

    WORD1  2
    WORD2  3
    SYN    2

    The position is the starting position of the token; Lucene doesn't
    store an ending position.

    Mike

  • Uwe Schindler at Mar 2, 2009 at 3:39 pm
    I think his problem is that "SYN" is a synonym for the phrase "WORD1
    WORD2". With these positions, a phrase query like "SYN WORD2" would also
    match (and other queries that depend on word order would have similar
    problems).

    Uwe
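
The false match Uwe describes can be reproduced with a small plain-Java model. This is only a sketch: `PositionIndex`, `add`, and `matchesPhrase` are hypothetical names invented for illustration and are not Lucene classes. The model keeps just a start position per token, as Mike describes, and stacks SYN on WORD1 with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a positional index; NOT the Lucene API.
class StackedSynonymDemo {
    static final class PositionIndex {
        // term -> start positions (the only positional data kept)
        private final Map<String, List<Integer>> postings = new HashMap<>();
        private int position = -1;

        /** Adds a token; an increment of 0 stacks it on the previous token. */
        void add(String term, int positionIncrement) {
            position += positionIncrement;
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(position);
        }

        /** Exact phrase match: term i must start at (start of term 0) + i. */
        boolean matchesPhrase(String... terms) {
            for (int start : postings.getOrDefault(terms[0], List.of())) {
                boolean ok = true;
                for (int i = 1; i < terms.length && ok; i++) {
                    ok = postings.getOrDefault(terms[i], List.of()).contains(start + i);
                }
                if (ok) return true;
            }
            return false;
        }
    }

    /** Indexes the example sentence with SYN stacked on WORD1. */
    static PositionIndex buildExample() {
        PositionIndex idx = new PositionIndex();
        idx.add("AAA", 1);   // position 0
        idx.add("BBB", 1);   // position 1
        idx.add("WORD1", 1); // position 2
        idx.add("SYN", 0);   // position 2, same as WORD1
        idx.add("WORD2", 1); // position 3
        idx.add("EEE", 1);   // position 4
        idx.add("FFF", 1);   // position 5
        idx.add("GGG", 1);   // position 6
        return idx;
    }

    public static void main(String[] args) {
        PositionIndex idx = buildExample();
        System.out.println(idx.matchesPhrase("WORD1", "WORD2")); // true
        System.out.println(idx.matchesPhrase("BBB", "SYN"));     // true, as intended
        System.out.println(idx.matchesPhrase("SYN", "WORD2"));   // true, the unwanted match
        System.out.println(idx.matchesPhrase("SYN", "EEE"));     // false, although SYN EEE is a valid reading
    }
}
```

With start positions alone, the model both accepts "SYN WORD2" and rejects "SYN EEE", which are the two symptoms of SYN not being able to span two positions.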

  • Michael McCandless at Mar 2, 2009 at 4:42 pm
    Since Lucene doesn't represent/store an end position for a token, I
    don't think the index can properly represent SYN spanning two positions.

    I suppose you could encode this into payloads, and create a custom
    query that would look at the payload to enforce the constraint.

    Or, if you switch to doing SYN expansion only at runtime (not adding
    it to the index), that might work.

    Mike
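
The runtime-expansion option can be sketched in plain Java as follows. This is an illustration under assumptions, not Lucene code: `expand`, `matches`, and the synonym map are hypothetical, the document is modelled as a plain list of tokens, and only the single-term-to-phrase direction (SYN to WORD1 WORD2) is handled. In a real application each variant would presumably become its own phrase query, with the variants OR'ed together.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Query-time synonym expansion over a plain token list; NOT the Lucene API.
class QueryTimeExpansionDemo {
    /** Returns every variant of the phrase with each synonym kept or replaced. */
    static List<List<String>> expand(List<String> phrase, Map<String, List<String>> synonyms) {
        List<List<String>> variants = new ArrayList<>();
        variants.add(new ArrayList<>());
        for (String term : phrase) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : variants) {
                List<String> keep = new ArrayList<>(prefix);
                keep.add(term);
                next.add(keep);
                List<String> replacement = synonyms.get(term);
                if (replacement != null) {
                    List<String> swapped = new ArrayList<>(prefix);
                    swapped.addAll(replacement);
                    next.add(swapped);
                }
            }
            variants = next;
        }
        return variants;
    }

    /** True if any expanded variant occurs as a contiguous run in the document. */
    static boolean matches(List<String> doc, List<String> phrase, Map<String, List<String>> synonyms) {
        for (List<String> variant : expand(phrase, synonyms)) {
            if (Collections.indexOfSubList(doc, variant) >= 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // The index holds only the original tokens; SYN is never added to it.
        List<String> doc = List.of("AAA", "BBB", "WORD1", "WORD2", "EEE", "FFF", "GGG");
        Map<String, List<String>> synonyms = Map.of("SYN", List.of("WORD1", "WORD2"));

        System.out.println(matches(doc, List.of("SYN", "EEE"), synonyms));   // true
        System.out.println(matches(doc, List.of("BBB", "SYN"), synonyms));   // true
        System.out.println(matches(doc, List.of("SYN", "WORD2"), synonyms)); // false
    }
}
```

Because SYN never enters the index, no token has to span two positions, so the "SYN WORD2" false match cannot arise. The cost is that every query containing SYN turns into several phrase variants.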

  • Sumukh at Mar 3, 2009 at 1:14 am
    Thanks for your suggestion, Michael, and thanks to Uwe for clarifying.

    Payload is currently used to store only the start positions.
    What I gathered from your suggestion is that we could possibly
    store the end position, or span, or some other complex
    encoding in order to store the extra information.
    Am I right?

    --Sumukh


  • Michael McCandless at Mar 3, 2009 at 3:41 pm
    Actually, the start position of each token is stored in the "normal"
    Lucene index (in the *.prx files), not using payloads.

    Payloads are entirely for per-token extensibility (i.e., core Lucene
    doesn't use them by default): you'd have to create your own analyzer
    to attach payloads to tokens, and then do something with them at
    search time.

    So I suggested you could store the end position of each token in the
    payload, but then you'd need to implement a Query class to use this
    during searching.

    Mike
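
The matching rule such a custom query would need to enforce can be modelled in plain Java. This is a sketch only: `Posting`, `SpanIndex`, and the inclusive `end` field are hypothetical stand-ins for the data a payload would carry, and nothing here uses Lucene's analyzer, payload, or Query classes. Each term in a phrase must start exactly one position after the previous term ends.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Models phrase matching with a per-token end position; NOT the Lucene API.
class EndPositionDemo {
    static final class Posting {
        final int start; // what the index stores natively
        final int end;   // hypothetical payload: last position covered, inclusive
        Posting(int start, int end) { this.start = start; this.end = end; }
    }

    static final class SpanIndex {
        private final Map<String, List<Posting>> postings = new HashMap<>();

        void add(String term, int start, int end) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(new Posting(start, end));
        }

        /** Phrase match where each term must start right after the previous one ends. */
        boolean matchesPhrase(String... terms) {
            for (Posting first : postings.getOrDefault(terms[0], List.of())) {
                if (matchesFrom(terms, 1, first.end + 1)) return true;
            }
            return false;
        }

        private boolean matchesFrom(String[] terms, int i, int expectedStart) {
            if (i == terms.length) return true;
            for (Posting p : postings.getOrDefault(terms[i], List.of())) {
                if (p.start == expectedStart && matchesFrom(terms, i + 1, p.end + 1)) return true;
            }
            return false;
        }
    }

    /** Indexes the example sentence with SYN covering positions 2 to 3. */
    static SpanIndex buildExample() {
        SpanIndex idx = new SpanIndex();
        idx.add("AAA", 0, 0);
        idx.add("BBB", 1, 1);
        idx.add("WORD1", 2, 2);
        idx.add("SYN", 2, 3); // spans WORD1 and WORD2
        idx.add("WORD2", 3, 3);
        idx.add("EEE", 4, 4);
        idx.add("FFF", 5, 5);
        idx.add("GGG", 6, 6);
        return idx;
    }

    public static void main(String[] args) {
        SpanIndex idx = buildExample();
        System.out.println(idx.matchesPhrase("WORD1", "WORD2"));    // true
        System.out.println(idx.matchesPhrase("BBB", "SYN", "EEE")); // true
        System.out.println(idx.matchesPhrase("SYN", "WORD2"));      // false
    }
}
```

With the end position available, "SYN WORD2" no longer matches and "SYN EEE" does. The price is the one Mike names: the stock phrase matching never looks at this extra field, so the check has to live in a custom query.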


Discussion Overview
group: java-user
categories: lucene
posted: Mar 2, '09 at 2:25p
active: Mar 3, '09 at 3:41p
posts: 8
users: 4
website: lucene.apache.org
