Search in non-linguistic text
Hello,
Are there any suggestions / best practices for using Lucene for searching
non-linguistic text? What I mean by non-linguistic is that it's not English
or any other language, but rather product codes. This is presenting some
interesting challenges. Among them are the need for pretty lax wildcard
searches. For example, ABC should match on ABCD, but so should BCD. Also,
it needs to be agnostic to special characters. So, ABC/D should match ABCD
as well as ABC-D or "ABC D".

As I write an analyzer to handle these cases, I seem to be pretty quickly
degrading into a "like '%blah%'" search, with rules to treat all special
characters as single-character, optional wildcards. I'm concerned that the
performance of this will be disappointing, though.

Any help would be much appreciated. Thanks!

- Jes
--
View this message in context: http://www.nabble.com/Search-in-non-linguistic-text-tp24515936p24515936.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Anshum at Jul 16, 2009 at 1:32 pm
    Hi Jes, good to see you here. You could try something like an n-gram
    analyzer. You'd have to explore it, though; I'm assuming it would be
    helpful for you.
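    To make the n-gram suggestion concrete, here is a minimal plain-Java sketch of the idea (no Lucene dependency; the `ngrams` helper is just an illustrative name): if you index every character n-gram of a code, a substring query such as BCD matches one of the terms emitted for ABCD.

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Emit every character n-gram of the given size from a term.
    static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // "ABCD" with n = 3 yields ABC and BCD, so a query for "BCD" hits.
        System.out.println(ngrams("ABCD", 3)); // [ABC, BCD]
    }
}
```

    In a real analyzer you would emit these as tokens at index time (and usually at query time too, or search with exact terms against the gram field).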

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

  • Robert Muir at Jul 16, 2009 at 1:34 pm
    Take a look at WordDelimiterFilter from Solr (you can use it in your
    Lucene app too).
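    For a rough sense of what delimiter-agnostic tokenization buys you here, a plain-Java sketch (this is not the actual Solr filter, just an illustration of the splitting-plus-concatenation behavior): split on any non-alphanumeric character and also emit the fully collapsed form, so ABC/D, ABC-D, and "ABC D" all produce the token ABCD.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSplit {
    // Split a code on non-alphanumeric characters and also emit the
    // concatenated form, roughly what a catenate-style option does.
    static List<String> tokens(String code) {
        List<String> out = new ArrayList<>();
        for (String part : code.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) out.add(part);
        }
        out.add(code.replaceAll("[^A-Za-z0-9]+", ""));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("ABC/D")); // [ABC, D, ABCD]
        System.out.println(tokens("ABC D")); // [ABC, D, ABCD]
    }
}
```

    Because every punctuation variant collapses to the same token set, the query side only needs the same normalization to match them all.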



    --
    Robert Muir
    rcmuir@gmail.com

  • Matthew Hall at Jul 16, 2009 at 1:35 pm
    Assuming your dataset isn't incredibly large, I think you could cheat
    here and optimize your data for searching.

    Am I correct in assuming that BC should also match on ABCD?

    If so, then yes, your current thoughts on the problem are correct:
    everything you do will turn into a contains search, which is not the
    best performance you have ever seen.

    However, knowing this, you can manipulate your data in such a way that
    you get around that limitation and turn everything into a prefix
    (or postfix) search if you so prefer.

    So here's what you do:

    When you are indexing the term ABCD, you are actually going to add
    several documents to the index (or to various special-purpose
    indexes, if you so prefer, but more on that later on).

    Let's say you want to turn everything into a prefix search under the covers.

    In the index you would store the following values, all of which point at
    the document "ABCD":

    'ABCD'
    'BCD'
    'CD'
    'D'

    Then, when you search for the term "BC" you will really be
    searching on "BC*", which will produce a match on the second value.

    Lucene documents can be considered giant data-holding objects: you
    can and SHOULD have fields in the document that are not used at search
    time but ARE used at display-generation time (or whatever layer feeds
    your display, if you are going in a more OO fashion).

    This technique isn't without its drawbacks, of course: you will see
    an increase in your index size, but unless you are playing around with
    some VERY large datasets, that really shouldn't matter.

    If I were the one implementing this, I would probably make at least
    two indexes, one for exact, punctuation-relevant data. The other index
    would contain the data I've described above, with one important
    difference: any and all punctuation (including whitespace) is
    removed, and all of the letters in your codes are collapsed into a
    single word. That way you can perform two searches and ensure that
    exact, punctuation-relevant matches appear higher in your results
    list than non-punctuation-relevant ones.

    Anyhow, that's pretty much it in a nutshell. I think this technique
    should work for you.
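    The suffix trick above can be sketched in a few lines of plain Java (illustrative only, no Lucene API involved): index every suffix of ABCD, and a contains-style query for BC becomes a prefix query "BC*" over those suffix terms.

```java
import java.util.ArrayList;
import java.util.List;

public class Suffixes {
    // Emit every suffix of a term: ABCD -> ABCD, BCD, CD, D.
    static List<String> suffixes(String term) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < term.length(); i++) {
            out.add(term.substring(i));
        }
        return out;
    }

    // A "contains" query reduces to a prefix check over the suffixes.
    static boolean containsViaPrefix(String term, String query) {
        for (String s : suffixes(term)) {
            if (s.startsWith(query)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("ABCD"));             // [ABCD, BCD, CD, D]
        System.out.println(containsViaPrefix("ABCD", "BC")); // true
    }
}
```

    The trade-off is exactly as described: index size grows roughly with the square of the term length, which is fine for short product codes.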


    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    mhall@informatics.jax.org
    (207) 288-6012


  • Digy at Jul 16, 2009 at 6:04 pm
    Another approach could be splitting the text into chars and returning each
    char as a token (in a custom analyzer).

    For example, for the document [some text] the tokens would be
    [s] [o] [m] [e] [t] [e] [x] [t], and searches such as
    [ome] or [ex] would get hits.

    Sample code written in C# is below:
    http://people.apache.org/~digy/SingleCharAnalyzer.cs
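    The linked sample is C#; a rough Java sketch of the same idea (illustrative, not the linked code) is below: each non-whitespace character becomes a token, and a substring query then behaves like a phrase of consecutive single-character tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class CharTokens {
    // Emit each non-whitespace character as its own token.
    static List<String> charTokens(String text) {
        List<String> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c)) out.add(String.valueOf(c));
        }
        return out;
    }

    // A substring query matches when it occurs as a consecutive run of
    // tokens, i.e. a phrase query over single-character tokens.
    static boolean phraseMatch(List<String> tokens, String query) {
        return String.join("", tokens).contains(query);
    }

    public static void main(String[] args) {
        List<String> toks = charTokens("some text");
        System.out.println(toks);                     // [s, o, m, e, t, e, x, t]
        System.out.println(phraseMatch(toks, "ome")); // true
    }
}
```

    In Lucene terms the phrase-position machinery does the consecutive-run check, so you get contains-style matching without wildcards.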


    DIGY

  • Matthew Hall at Jul 16, 2009 at 1:59 pm
    They are upgrading our mail servers here, so if you are seeing many,
    MANY duplicates of things I posted, I'm really sorry about that. T_T

    Matt

    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    mhall@informatics.jax.org




Discussion Overview
Group: java-user
Categories: lucene
Posted: Jul 16, '09 at 1:04p
Active: Jul 16, '09 at 6:04p
Posts: 6
Users: 5
Website: lucene.apache.org
