Does anyone have solutions for handling intraword delimiters (case
changes, non-alphanumeric chars, and alpha-numeric transitions)?

If the source text is Wi-Fi, we want to be able to match the following
user queries:

wi fi
wifi
wi-fi
wi+fi
WiFi

One way is to index "wi", "fi", and "wifi".
However, indexing all combinations of subwords gets a bit messy when
the number of subwords gets larger. I need to handle product names,
serial numbers, SKUs, etc.

Another example:
Source Text contains "Canon Powershot SD500 7MP Digital Elph"

And I want to be able to match the following user queries:
Power Shot SD 500
CanonPowerShotSD500
SD 500 7 MP digitalelph
Canon-Powershot-SD 500

Any ideas?

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
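
The three kinds of intraword delimiter named in the question (case changes, non-alphanumeric characters, and letter/digit transitions) can be illustrated with a small splitter. This is only a sketch in plain Python, not Lucene code; the name `split_subwords` is made up for illustration, and it assumes ASCII input.

```python
import re

# One alternative per subword shape (ASCII only, an assumption of this sketch):
#   [A-Z]+(?![a-z])  a run of capitals not followed by a lowercase letter ("SD", "MP")
#   [A-Z]?[a-z]+     an optional capital followed by lowercase letters ("Wi", "shot")
#   [0-9]+           a run of digits ("500")
# Anything else (space, "-", "+") is skipped and therefore acts as a delimiter.
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_subwords(text):
    """Split text into lowercased subwords at the three delimiter kinds."""
    return [m.lower() for m in _SUBWORD.findall(text)]


if __name__ == "__main__":
    for query in ["wi fi", "wifi", "wi-fi", "wi+fi", "WiFi", "Wi-Fi"]:
        print(query, "->", split_subwords(query))
```

Every Wi-Fi variant from the question splits into either `wi`, `fi` or the single subword `wifi`, which is why indexing "wi", "fi", and "wifi" covers all of them.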


  • Marvin Humphrey at Aug 15, 2005 at 11:54 pm

    On Aug 15, 2005, at 3:16 PM, Yonik Seeley wrote:

    > Another example:
    > Source Text contains "Canon Powershot SD500 7MP Digital Elph"
    >
    > And I want to be able to match the following user queries:
    > Power Shot SD 500
    > CanonPowerShotSD500
    > SD 500 7 MP digitalelph
    > Canon-Powershot-SD 500
    >
    > Any ideas?

    How about this?

    1) Lowercase.
    2) Convert non-alphanumeric characters to spaces.
    3) Introduce a space at every boundary between a letter and a number.
    4) concatenate all 1, 2, 3 .. n term combinations and index them.
    5) Don't stem.

    Marvin Humphrey
    Rectangular Research
    http://www.rectangular.com/
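
Steps 1 to 4 above could be sketched roughly as follows. This is plain Python rather than a Lucene analyzer, the function names are hypothetical, and "term combinations" is read here as runs of adjacent terms, which is an assumption. Step 5 (no stemming) needs no code.

```python
import re


def normalize(text):
    """Steps 1-3: lowercase, strip punctuation, split letter/digit boundaries."""
    text = text.lower()                                  # step 1
    text = re.sub(r"[^a-z0-9]+", " ", text)              # step 2 (ASCII only)
    # Step 3: insert a space wherever a letter meets a digit or vice versa.
    text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)
    return text.split()


def catenations(tokens):
    """Step 4: every run of 1..n adjacent tokens, joined without spaces."""
    n = len(tokens)
    return ["".join(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


if __name__ == "__main__":
    tokens = normalize("Canon Powershot SD500 7MP Digital Elph")
    print(tokens)
    print(len(catenations(tokens)))  # n(n+1)/2 terms for n tokens
```

For n tokens this produces n(n+1)/2 terms, so the eight tokens of the Canon example expand to 36 indexed terms.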


  • Yonik Seeley at Aug 16, 2005 at 2:47 am
    That was the plan, but step (4) really seems problematic.

    - term expansion this way can lead to a lot of false matches
    - phrase queries with many bordering words break
    - setting term positions such that phrase queries work on all combos
    of subwords is non-trivial.

    It seems like a better approach might be a new query type that can
    handle things like this.

    As an example, consider a-b-c-d (4 subwords)... one way of indexing
    the tokens would be:

    Pos0: a
    Pos1: b, ab, a
    Pos2: c, bc, abc, cd
    Pos3: d, abcd

    There are only 10 unique tokens, n(n+1)/2 for n = 4, but I needed to index 11 in
    order to satisfy all possible phrase queries of catenated subwords.
    Notice how many other things will now match, though (ac, aab,
    aababcabcd, etc.).

    In addition, any algorithm I come up with to generate those term
    positions uses even more terms than the hand-generated one above.

    By using index expansion in this manner, we have lost info about the
    original ordering. A new type of fuzzy phrase query seems like it
    might be able to do a better job in many circumstances.

    -Yonik

    On 8/15/05, Marvin Humphrey wrote:
    1) Lowercase.
    2) Convert non-alphanumeric characters to spaces.
    3) Introduce a space at every boundary between a letter and a number.
    4) concatenate all 1, 2, 3 .. n term combinations and index them.
    5) Don't stem.
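
The false-match problem Yonik describes can be reproduced with a toy positional index. The sketch below copies his hand-built positions for a-b-c-d exactly as posted and models a phrase query as "each term must appear at the next consecutive position". It is a simplification for illustration, not how Lucene's phrase query is implemented, and the names are made up.

```python
# Hand-built positions for a-b-c-d, copied from the message above.
POSITIONS = {
    0: {"a"},
    1: {"b", "ab", "a"},
    2: {"c", "bc", "abc", "cd"},
    3: {"d", "abcd"},
}


def phrase_matches(positions, terms):
    """True if the terms occur at consecutive positions, in order."""
    for start in positions:
        if all(term in positions.get(start + i, ())
               for i, term in enumerate(terms)):
            return True
    return False


def unique_catenations(n):
    """Number of distinct runs of adjacent subwords: n(n+1)/2."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    print(unique_catenations(4))                      # 10
    print(phrase_matches(POSITIONS, ["ab", "cd"]))    # intended match
    print(phrase_matches(POSITIONS, ["a", "c"]))      # unintended match
```

The phrase "a c" matches because "a" was duplicated at position 1 to make "a bc d" work, which puts it directly in front of "c". That is the loss of ordering information mentioned above.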
  • Marvin Humphrey at Aug 16, 2005 at 3:54 am

    On Aug 15, 2005, at 7:47 PM, Yonik Seeley wrote:

    > That was the plan, but step (4) really seems problematic.
    >
    > - term expansion this way can lead to a lot of false matches
    > - phrase queries with many bordering words break
    > - setting term positions such that phrase queries work on all combos
    > of subwords is non-trivial.

    Tag every term with its length in tokens. :)

    Index at these positions.

    Pos0: a ab abc abcd
    Pos1: b bc bcd
    Pos2: c cd
    Pos3: d

    Create a phrase query that when it encounters ab => { tokenlength =>
    2 } knows to look for something at position 3.

    Marvin Humphrey
    Rectangular Research
    http://www.rectangular.com/


  • Marvin Humphrey at Aug 16, 2005 at 3:56 am

    On Aug 15, 2005, at 8:53 PM, Marvin Humphrey wrote:

    > Create a phrase query that when it encounters ab => { tokenlength
    > => 2 } knows to look for something at position 3.

    Fencepost error! That should have been "position 2".

    Not that correcting the error makes the algo any more practical. ;)

    Marvin Humphrey
    Rectangular Research
    http://www.rectangular.com/
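
A rough sketch of the length-tagging idea, with the fencepost correction applied: each catenated term is stored with its start position and its length in tokens, and the matcher expects the next term to start at start + length. This is plain Python with hypothetical names, not an existing Lucene query type.

```python
def build_index(tokens):
    """Map each catenated run to the set of (start, length) pairs where it occurs."""
    index = {}
    n = len(tokens)
    for start in range(n):
        for length in range(1, n - start + 1):
            term = "".join(tokens[start:start + length])
            index.setdefault(term, set()).add((start, length))
    return index


def phrase_matches(index, terms):
    """Length-aware phrase match: each term must begin where the previous one ended."""
    expected = None  # allowed start positions for the next term; None = anywhere
    for term in terms:
        postings = index.get(term, set())
        if expected is not None:
            postings = {(s, l) for (s, l) in postings if s in expected}
        if not postings:
            return False
        expected = {s + l for (s, l) in postings}
    return True


if __name__ == "__main__":
    index = build_index(["a", "b", "c", "d"])
    print(index["ab"])                          # {(0, 2)}: the next term starts at 2
    print(phrase_matches(index, ["ab", "cd"]))
    print(phrase_matches(index, ["a", "c"]))
```

Because every term is indexed only at its true start position, "ab cd" and "a bcd" match while "a c" and "a ab" do not. The cost is still n(n+1)/2 terms per source word plus a custom query.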


  • Rajesh Munavalli at Aug 16, 2005 at 4:06 am
    How about this....

    (1) Follow the first three steps mentioned by Marvin.
    (2) For step 4, instead of generating every concatenation, create concatenations of at most 3 words.

    For example, an entity of length n, W1 W2 W3 ... Wn, will have the index positions as follows:

    Index Position   Value
    0                W1W2W3
                     W1W2
                     W1
    1                W2W3W4
                     W2W3
                     W2

    and so on...
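
The indexing scheme above might look like this in outline. It is a sketch with a hypothetical function name; `max_words=3` mirrors the 3-word limit suggested in step (2).

```python
def ngram_positions(tokens, max_words=3):
    """For each position, the concatenations of 1..max_words tokens starting there."""
    positions = []
    for start in range(len(tokens)):
        longest = min(max_words, len(tokens) - start)
        # Longest concatenation first, then progressively shorter ones.
        positions.append(
            ["".join(tokens[start:start + k]) for k in range(longest, 0, -1)]
        )
    return positions


if __name__ == "__main__":
    for pos, values in enumerate(ngram_positions(["w1", "w2", "w3", "w4"])):
        print(pos, values)
```

This caps the index at roughly 3n terms for n subwords, instead of n(n+1)/2 when every concatenation is indexed.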

    (3) At query time, follow steps (1) and (2) to create the query terms, with their respective boost levels, as follows:
    QUERY: Power Shot SD 500
    PROCESSED QUERY: power shot sd 500

    Query Terms ORed:
    powershotsd^3, powershot^2, power
    shotsd500^3, shotsd^2, shot
    sd500^2, sd
    500

    QUERY: CanonPowerShotSD500
    PROCESSED QUERY: canonpowershotsd 500
    Query Terms ORed:
    canonpowershotsd500^2, canonpowershotsd
    500

    QUERY: Canon-Powershot-SD 500
    PROCESSED QUERY: canon powershot sd 500
    Query Terms ORed:
    canonpowershotsd^3, canonpowershot^2, canon
    powershotsd500^3, powershotsd^2, powershot
    sd500^2, sd
    500
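
The query-side expansion can be sketched the same way. The boost is assumed here to equal the number of words in the concatenation, which is what the examples above suggest; `term^N` is the boost notation of Lucene's query syntax, and the function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, turn punctuation into spaces, split letter/digit boundaries."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)
    return text.split()


def boosted_terms(query, max_words=3):
    """Concatenations of up to max_words query tokens, boosted by word count."""
    tokens = normalize(query)
    terms = []
    for start in range(len(tokens)):
        longest = min(max_words, len(tokens) - start)
        for k in range(longest, 0, -1):
            term = "".join(tokens[start:start + k])
            # Single-word terms keep the default boost, so no suffix is added.
            terms.append(term if k == 1 else f"{term}^{k}")
    return terms


if __name__ == "__main__":
    print(" OR ".join(boosted_terms("Power Shot SD 500")))
```

Longer concatenations get higher boosts, so a document matching `powershotsd` ranks above one that matches only `power`.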

    Hopefully 3-grams capture enough information to retrieve useful documents. You can adjust the boost levels and assess the performance.

    Hope it helps...

    Rajesh Munavalli


    -----Original Message-----
    From: Yonik Seeley
    Sent: Mon 8/15/2005 7:47 PM
    To: java-user@lucene.apache.org
    Subject: Re: intra-word delimiters

    That was the plan, but step (4) really seems problematic.

    - term expansion this way can lead to a lot of false matches
    - phrase queries with many bordering words break
    - setting term positions such that phrase queries work on all combos
    of subwords is non-trivial.

    It seems like a better approach might be a new query type that can
    handle things like this.

    As an example, consider a-b-c-d (4 subwords)... one way of indexing
    the tokens would be:

    Pos0: a
    Pos1: b, ab, a
    Pos2: c, bc, abc, cd
    Pos3: d, abcd

    There are only 10 unique tokens, n(n+1)/2 for n = 4, but I needed to index 11 in
    order to satisfy all possible phrase queries of catenated subwords.
    Notice how many other things will now match, though (ac, aab,
    aababcabcd, etc.).

    In addition, any algorithm I come up with to generate those term
    positions uses even more terms than the hand-generated one above.

    By using index expansion in this manner, we have lost info about the
    original ordering. A new type of fuzzy phrase query seems like it
    might be able to do a better job in many circumstances.

    -Yonik

    On 8/15/05, Marvin Humphrey wrote:
    1) Lowercase.
    2) Convert non-alphanumeric characters to spaces.
    3) Introduce a space at every boundary between a letter and a number.
    4) concatenate all 1, 2, 3 .. n term combinations and index them.
    5) Don't stem.

Discussion Overview
group: java-user
category: lucene
posted: Aug 15, '05 at 10:16p
active: Aug 16, '05 at 4:06a
posts: 6
users: 3
website: lucene.apache.org
