WordBoundTokenFilter
Some time ago I needed to tune our home-grown search engine, based on Lucene, to perform well on product searches. Product search is a search where users come with part of a product name and we should find the product.

The problem here is that users don't provide the full model name. For instance, if the product model name is "Sony PRS-A9000QF", users frequently search for "PRS 9000", "9000QF", etc.

The simple and straightforward solution to this problem is to tokenize model names on character-type boundaries. So for "Sony PRS-A9000QF" we will have five terms: "sony", "prs", "a", "9000", "qf". This solution can dramatically increase search sensitivity (which is not a good thing in general search), but it works well in specialized indexes.
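
To illustrate, here is a minimal, standalone sketch of that splitting rule (plain Java; this is just for the example, not the actual filter code, and the class name is made up):

import java.util.ArrayList;
import java.util.List;

public class CharTypeSplitter {

    // Split a model name wherever the character class changes
    // (letter <-> digit) or at any non-alphanumeric character.
    static List<String> split(String input) {
        List<String> terms = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int prevType = -1; // -1 = other, 0 = letter, 1 = digit
        for (char c : input.toCharArray()) {
            int type = Character.isLetter(c) ? 0
                     : Character.isDigit(c) ? 1 : -1;
            // Flush the pending term on a separator or on a type change.
            if ((type == -1 || (prevType != -1 && type != prevType))
                    && current.length() > 0) {
                terms.add(current.toString().toLowerCase());
                current.setLength(0);
            }
            if (type != -1) {
                current.append(c);
            }
            prevType = type;
        }
        if (current.length() > 0) {
            terms.add(current.toString().toLowerCase());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints: [sony, prs, a, 9000, qf]
        System.out.println(split("Sony PRS-A9000QF"));
    }
}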

So I developed such a token filter. My question is: is there any interest in this solution from the community, and does it make sense to contribute it back?
---
Denis Bazhenov <dotsid@gmail.com>
  • Em at Jun 13, 2011 at 10:50 am
    Hi,

    sounds like the WordDelimiterTokenFilter from Solr, doesn't it?

    Regards,
    Em

  • Denis Bazhenov at Jun 13, 2011 at 10:57 am
    It seems so. Interestingly, I can't find any mention of WordDelimiterTokenFilter using Google. Is it part of the Solr codebase?
    ---
    Denis Bazhenov <dotsid@gmail.com>
  • Em at Jun 13, 2011 at 11:02 am
    Yes, it's part of Solr. And even in Solr there was no documentation in
    the API - at least when I searched for it the last time.

    Regards,
    Em

  • Uwe Schindler at Jun 13, 2011 at 11:11 am
    In Lucene trunk (will be version 4.0), all analyzers/tokenizers/token filters
    were moved to a new shared analyzer module. So WDF is now part of a shared
    Lucene/Solr module. In 3.x, you still have to add the Solr JARs to use it.

    This TokenFilter should do what you intend to do (see the Solr
    documentation, where all parameters are explained):
    http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
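
    For reference, a typical analyzer entry in schema.xml using this factory
    looks something like the following (the field type name and the flag values
    here are only illustrative; see the wiki page above for what each flag does):

    <fieldType name="text_model" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- e.g. "PRS-A9000QF" -> prs, a, 9000, qf
             (plus the original token when preserveOriginal="1") -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                splitOnCaseChange="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>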

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

  • Denis Bazhenov at Jun 13, 2011 at 11:30 am
    Okay, now I'm experiencing one of those "Simpsons already did it" moments in my life :) Nevertheless, it's nice to know that this problem is already solved and that I don't need to write any code at all. Thanks a lot!
    ---
    Denis Bazhenov <dotsid@gmail.com>

Discussion Overview
group: java-user
category: lucene
posted: Jun 13, '11 at 10:07a
active: Jun 13, '11 at 11:30a
posts: 6
users: 3
website: lucene.apache.org
