Hello Lucene users,

I'm rather new to lucene and java but have done work with other
search engines some time ago.
Right now I'm trying my hands (and luck) at a 'search as you type'
sort of high-performance search a la Google Suggest.

There are meanwhile a number of examples on the net of such script-
driven forms that suggest new possible input to the user with every
keystroke, mostly for some sort of product catalogue.
Some instances even combine the input from two or more text fields
or give fault-tolerant feedback.

According to occasional references on this list, some people have
already tried to implement such a search with lucene, but did they
succeed?

My first idea was to run every completed token of the request
(current user input) through a spellchecker and expand an incomplete
token to a PrefixQuery.

Example:
artist:'beetles'
title:'yellow submar'

Alternative terms for 'beetles' and 'yellow' would be looked up by
spell checkers for their respective fields, and 'submar', being the
last token of the active text field with no trailing whitespace,
would be turned into a PrefixQuery.
And of course performance has to be a major concern with these
searches.
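
In code, roughly (an untested sketch; the method and parameter names
are mine, and the SpellChecker index is assumed to have been built
from the respective field beforehand):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spell.SpellChecker;

// Builds the suggest query for one field. 'incomplete' is the token
// still being typed (no trailing whitespace yet); it becomes a prefix.
public static BooleanQuery buildSuggestQuery(String field,
        String[] completedTokens, String incomplete,
        SpellChecker spellChecker) throws java.io.IOException {
    BooleanQuery query = new BooleanQuery();
    for (String token : completedTokens) {
        // the token itself plus a few spelling alternatives, OR-ed
        BooleanQuery alternatives = new BooleanQuery();
        alternatives.add(new TermQuery(new Term(field, token)),
                BooleanClause.Occur.SHOULD);
        for (String suggestion : spellChecker.suggestSimilar(token, 5)) {
            alternatives.add(new TermQuery(new Term(field, suggestion)),
                    BooleanClause.Occur.SHOULD);
        }
        query.add(alternatives, BooleanClause.Occur.MUST);
    }
    // the incomplete last token expands to a PrefixQuery
    query.add(new PrefixQuery(new Term(field, incomplete)),
            BooleanClause.Occur.MUST);
    return query;
}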

I'm currently dealing with the problem that short prefixes are
resulting in BooleanQuery$TooManyClauses exceptions.
That's why I've thought of discarding them in favour of extra fields
with the first bigrams and trigrams of every indexed token.

artist:'the beatles'
artist_start:'be bea' ('the' being a stopword)
title:'yellow submarine'
title_start:'ye yel su sub'

Every PrefixQuery of length 2 or 3 could thus be turned into a simple
TermQuery on the appropriate field. (With searches for 1-letter-
prefixes altogether discarded.)
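
The indexing side could be fed by something like this sketch
(assuming the tokens have already been analyzed and stopword-
filtered; the helper name is mine):

// Sketch: derive the contents of the extra *_start field from the
// already-analyzed tokens, keeping only 2- and 3-letter prefixes.
public static String startField(String[] tokens) {
    StringBuilder sb = new StringBuilder();
    for (String token : tokens) {            // e.g. "yellow", "submarine"
        if (token.length() >= 2) {
            sb.append(token.substring(0, 2)).append(' ');
        }
        if (token.length() >= 3) {
            sb.append(token.substring(0, 3)).append(' ');
        }
    }
    return sb.toString().trim();             // -> "ye yel su sub"
}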

I've seen Otis Gospodnetic suggest the very same strategy in an
earlier thread, but I have no idea how I could possibly add these
extra fields.

Normally an IndexWriter uses only one default Analyzer for all its
tokenizing. And while it is apparently possible to supply a
different instance when adding a specific document, there seems to
be no way to use different analyzers on different fields within one
document.

Could this be done in one pass at all, or do I have to copy all
documents from one index to a new one, parsing field tokens and
adding new fields on the second pass?

I'd appreciate any hints and suggestions on which classes and
methods to write for this purpose, because I definitely lack a knack
when it comes to OOP.


Thank You In Advance,
Steffen




  • Antony Bowesman at Apr 11, 2007 at 9:13 pm

    Steffen Heinrich wrote:
    Normally an IndexWriter uses only one default Analyzer for all its
    tokenizing. And while it is apparently possible to supply a
    different instance when adding a specific document, there seems to
    be no way to use different analyzers on different fields within one
    document.
    Use the PerFieldAnalyzerWrapper.

    http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

    It allows different analyzers to be used for different fields.
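
    A minimal sketch of the wiring (the analyzer choices and the
    _start field names are just placeholders):

    import java.io.IOException;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public static IndexWriter openWriter(String path) throws IOException {
        // StandardAnalyzer for the normal fields, WhitespaceAnalyzer for
        // the pre-tokenized *_start fields holding the bigrams/trigrams
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("artist_start", new WhitespaceAnalyzer());
        analyzer.addAnalyzer("title_start", new WhitespaceAnalyzer());
        return new IndexWriter(path, analyzer, true); // true = create
    }
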
    Antony



  • Steffen Heinrich at Apr 12, 2007 at 10:20 am

    On 12 Apr 2007 at 7:13, Antony Bowesman wrote:

    Steffen Heinrich wrote:
    Normally an IndexWriter uses only one default Analyzer for all its
    tokenizing. And while it is apparently possible to supply a
    different instance when adding a specific document, there seems to
    be no way to use different analyzers on different fields within one
    document.
    Use the PerFieldAnalyzerWrapper.

    http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

    It allows different analyzers to be used for different fields.
    Antony
    Hello Antony,

    that was exactly what I was looking for but failed to see, thank you.

    Still, the answers by Karl and Erick make me re-think the approach.

    Cheers,
    Steffen



  • Erick Erickson at Apr 11, 2007 at 10:06 pm
    Rather than using a search, have you thought about using a TermEnum?
    It's much, much, much faster than a query. What it allows you to do
    is enumerate the terms in the index on a per-field basis. Essentially, this
    is what happens when you do a PrefixQuery as BooleanClauses are
    added, but you have very few options for restricting the returned list when
    you use PrefixQuery.

    In particular, WildcardTermEnum will allow you to rapidly enumerate
    all the terms that match a particular wildcard pattern and return
    whatever portion of that list you want. I'm not sure it's very valuable
    to the user to return many thousands of terms for suggestions, but
    you'll have to determine that for your situation. But you'll be surprised
    by how fast it is <G>.

    Parenthetically, there also exists a RegexTermEnum in the contrib
    area, but in my experience that's significantly slower than
    WildcardTermEnum, which only makes sense since regular expressions
    are intrinsically harder to evaluate than simple wildcards.

    What I have in mind is something like returning the first N terms
    that match a particular prefix pattern. Even if you elect not to do
    this and return all the possibilities, it will be much faster than
    executing a query, and it won't run afoul of the TooManyClauses
    exception; you'll only be restricted by available memory. Not to
    mention simplifying your index over the bigram/trigram option <G>.....

    BTW, you can raise the limit that triggers the TooManyClauses
    exception with BooleanQuery.setMaxClauseCount, but I'd really
    recommend the WildcardTermEnum approach first.
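
    Something like this minimal sketch (my method name; untested, but
    the WildcardTermEnum calls are the standard ones):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.WildcardTermEnum;

    // Collect up to maxSuggestions terms of 'field' matching 'prefix*'.
    public static List<String> suggest(IndexReader reader, String field,
            String prefix, int maxSuggestions) throws IOException {
        List<String> suggestions = new ArrayList<String>();
        WildcardTermEnum termEnum =
                new WildcardTermEnum(reader, new Term(field, prefix + "*"));
        try {
            do {
                Term term = termEnum.term();
                if (term == null) break;       // no more matching terms
                suggestions.add(term.text());
            } while (suggestions.size() < maxSuggestions && termEnum.next());
        } finally {
            termEnum.close();
        }
        return suggestions;
    }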

    Finally, your question about copying an index... it may not be easy.
    Particularly if you have terms that are indexed but not stored, you
    won't be able to reconstruct your documents exactly from the index....

    Best
    Erick
  • Steffen Heinrich at Apr 12, 2007 at 10:20 am

    On 11 Apr 2007 at 18:05, Erick Erickson wrote:
    Rather than using a search, have you thought about using a TermEnum?
    It's much, much, much faster than a query. What it allows you to do
    is enumerate the terms in the index on a per-field basis. Essentially, this
    is what happens when you do a PrefixQuery as BooleanClauses are
    added, but you have very few options for restricting the returned list when
    you use PrefixQuery.
    As I'm still fresh with lucene I did not look into TermEnum yet.
    And yes, you are right. I already wondered how to possibly cut down
    on the returns of a prefix query.

    ...
    What I have in mind is something like returning the first N terms
    that match a particular prefix pattern. Even if you elect not to do
    this and return all the possibilities, it will be much faster than
    executing a query, and it won't run afoul of the TooManyClauses
    exception; you'll only be restricted by available memory. Not to
    mention simplifying your index over the bigram/trigram option <G>.....
    If I understand correctly, you are suggesting looking up documents
    that match the prefixes via TermDocs.seek(enum) separately, possibly
    restricting them by evaluating doc boosts etc., and then merging the
    remainders with the separate search results for the other tokens.
    Is that right?
    BTW, you can raise the limit that triggers the TooManyClauses
    exception with BooleanQuery.setMaxClauseCount, but I'd really
    recommend the WildcardTermEnum approach first.
    Yes, that was the point where I thought that turning to the group
    would probably get me some better ideas ;-)
    Finally, your question about copying an index... it may not be easy.
    Particularly if you have terms that are indexed but not stored, you
    won't be able to reconstruct your documents exactly from the index....
    Antony Bowesman came up with the PerFieldAnalyzerWrapper, which
    would have avoided the need to copy.
    Best
    Erick
    Do you also have an idea for how to improve a fault-tolerant search
    for the completed terms?
    The shortcomings are somewhat similar.
    Running each through a spell checker and adding the results to a
    boolean query does not help with performance.
    Besides, with lucene's standard spell checker I think there is no
    way to influence the sorting of suggestions (because there are no
    criteria). And so the restriction to the first 4-10 suggestions is
    essentially arbitrary and might just miss the most appropriate one.

    I've tried the NGramTokenizer from the Action book (contributed by
    alias-i, now apparently in the LingPipe distribution) and it gives
    better results in that it returns suggestions based on the weight of
    the documents, but at a much bigger cost in disk space as well as
    memory and performance.

    BTW, my test data is ~ 1.5 million artist / song titles which I
    extracted from a CDDB dump.
    This data represents very well the typical applications that I have
    in mind:
    Lots of tiny documents with 2-3 indexed fields that allow for faceted
    search. (Possibly associated with some meta data each.)

    Ideally the system should scale well with heavy user loads.
    Certainly not a simple task when every keystroke translates into a
    query for suggestions, but the existing implementations show that it
    can be done. Only I'm starting to wonder whether those are done with
    lucene and written in java. :-/

    I presume that the need for scalability also forbids any sort of
    result caching with the lucene filter wrappers. Even a bitmap for
    millions of documents must add up to something substantial.
    An optimization of the retrieval is probably worth more than the
    additional overhead of a caching strategy can bring.

    More thoughts, anyone?

    Thank You.

    Cheers, Steffen





  • Erick Erickson at Apr 12, 2007 at 1:28 pm
    See below....
    On 4/12/07, Steffen Heinrich wrote:
    On 11 Apr 2007 at 18:05, Erick Erickson wrote:
    Rather than using a search, have you thought about using a TermEnum?
    It's much, much, much faster than a query. What it allows you to do
    is enumerate the terms in the index on a per-field basis. Essentially, this
    is what happens when you do a PrefixQuery as BooleanClauses are
    added, but you have very few options for restricting the returned list when
    you use PrefixQuery.
    As I'm still fresh with lucene I did not look into TermEnum yet.
    And yes, you are right. I already wondered how to possibly cut down
    on the returns of a prefix query.

    ...
    What I have in mind is something like returning the first N terms
    that match a particular prefix pattern. Even if you elect not to do
    this and return all the possibilities, it will be much faster than
    executing a query, and it won't run afoul of the TooManyClauses
    exception; you'll only be restricted by available memory. Not to
    mention simplifying your index over the bigram/trigram option <G>.....
    If I understand correctly, you are suggesting looking up documents
    that match the prefixes via TermDocs.seek(enum) separately, possibly
    restricting them by evaluating doc boosts etc., and then merging the
    remainders with the separate search results for the other tokens.
    Is that right?

    Not quite. As I understand your problem, you want all the terms that
    match (or at least a subset) for a field. For this, WildcardTermEnum
    is really all you need. Think of it this way...
    (Wildcard)TermEnum gives you a list of all the terms for a particular field.
    Each term will be mentioned exactly once regardless of how many
    times it appears in your corpus.
    TermDocs will allow you to find documents with those terms.

    Since you're trying to do a set of suggestions, you really don't need
    to know anything about documents that the terms appear in, or even
    how many documents they appear in. All you need is a list of
    the unique terms. Thus you don't need TermDocs here at all.

    Here's part of a chunk of code I have lying around. It
    prints out all the terms that appear in a particular field and you
    should easily be able to make it use a WildcardTermEnum...
    This is a hack I made for a one-off, so I don't have to be
    proud of it......

    private void enumField(String field) throws Exception {
        long start = System.currentTimeMillis();
        // position the enum at the first term of the given field
        TermEnum termEnum =
                this.reader.getIndexReader().terms(new Term(field, ""));

        this.writer.println("Values for term " + field);

        Term term = termEnum.term();
        int idx = 0;

        // walk the terms until we fall off the end of this field
        while ((term != null) && term.field().equals(field)) {
            System.out.println(term.text());
            termEnum.next();
            term = termEnum.term();
            ++idx;
        }

        long interval = System.currentTimeMillis() - start;

        System.out.println(String.format(
                "%d terms took %d milliseconds (%d seconds) to enumerate term %s",
                idx, interval, interval / CaptureTerms.MILLIS_IN_SECOND, field));
    }



    This isn't really very useful for displaying the *best*, say, 10 terms
    because it'll just start at the beginning of the list and enumerate
    the first N items.
    BTW, you can raise the limit that triggers the TooManyClauses
    exception with BooleanQuery.setMaxClauseCount, but I'd really
    recommend the WildcardTermEnum approach first.
    Yes, that was the point where I thought that turning to the group
    would probably get me some better ideas ;-)
    Finally, your question about copying an index... it may not be easy.
    Particularly if you have terms that are indexed but not stored, you
    won't be able to reconstruct your documents exactly from the index....
    Antony Bowesman came up with the PerFieldAnalyzerWrapper, which
    would have avoided the need to copy.
    Best
    Erick
    Do you also have an idea for how to improve a fault-tolerant search
    for the completed terms?
    The shortcomings are somewhat similar.
    Running each through a spell checker and adding the results to a
    boolean query does not help with performance.
    Besides, with lucene's standard spell checker I think there is no
    way to influence the sorting of suggestions (because there are no
    criteria). And so the restriction to the first 4-10 suggestions is
    essentially arbitrary and might just miss the most appropriate one.

    You'll have to elaborate on what "fault-tolerant search" means. If
    you're worried about misspellings, that's tough. You could try
    FuzzyQuery, or if that doesn't work you could think about working
    with soundex. But I can't stress strongly enough that you need to be
    absolutely sure this is a real problem *that your users will notice*
    before you invest time and energy in solving it. I'm continually
    amazed how much time and energy I spend solving non-existent
    problems <G>....

    And for your sanity's sake, don't ask the product manager anything
    remotely like "would you like fault-tolerant searches?". The answer
    will be yes, regardless of whether it makes a difference to the end
    user. And I'll only mention briefly that asking Sales if they'd like
    a feature is the road to madness.....

    And a spell checker isn't very useful with names anyway......

    I've tried the NGramTokenizer from the Action book (contributed by
    alias-i, now apparently in the LingPipe distribution) and it gives
    better results in that it returns suggestions based on the weight of
    the documents, but at a much bigger cost in disk space as well as
    memory and performance.

    BTW, my test data is ~ 1.5 million artist / song titles which I
    extracted from a CDDB dump.
    This data represents very well the typical applications that I have
    in mind:
    Lots of tiny documents with 2-3 indexed fields that allow for faceted
    search. (Possibly associated with some meta data each.)

    Ideally the system should scale well with heavy user loads.
    Certainly not a simple task when every keystroke translates into a
    query for suggestions, but the existing implementations show that it
    can be done. Only I'm starting to wonder whether those are done with
    lucene and written in java. :-/

    I presume that the need for scalability also forbids any sort of
    result caching with the lucene filter wrappers. Even a bitmap for
    millions of documents must add up to something substantial.
    An optimization of the retrieval is probably worth more than the
    additional overhead of a caching strategy can bring.


    well, a million documents is a bitmap of 125K. You can fit a LOT
    of these into, say, 256M of memory <G>..... But Filters work
    at the Document level, not the term level. So I'm not sure they do
    what you want......

    I strongly suggest you run some timings on whatever process
    you decide to try first. Take out all the printing and just report the
    time taken on your corpus when enumerating the terms. I think
    you'll be very surprised at just how fast it all is and this will
    definitely inform your calculations about how it'll scale.... and
    remember that the first time you open a reader, you'll pay some
    extra overhead, so pay attention to the 2-N runs on an already
    open reader.....

    Best
    Erick







  • Steffen Heinrich at Apr 12, 2007 at 5:41 pm

    On 12 Apr 2007 at 9:27, Erick Erickson wrote:

    See below.... ...
    Not quite. As I understand your problem, you want all the terms that
    match (or at least a subset) for a field. For this, WildcardTermEnum
    is really all you need. Think of it this way...
    (Wildcard)TermEnum gives you a list of all the terms for a particular field.
    Each term will be mentioned exactly once regardless of how many
    times it appears in your corpus.
    TermDocs will allow you to find documents with those terms.

    Since you're trying to do a set of suggestions, you really don't need
    to know anything about documents that the terms appear in, or even
    how many documents they appear in. All you need is a list of
    the unique terms. Thus you don't need TermDocs here at all.
    Oops, we have a different conception here.
    It's not individual terms that I want to suggest, but entire artist
    names (usually > 1 term) or song titles, or rather groups thereof.
    And these correspond to documents in my model.

    I'm currently dealing with 2 separate indexes.
    One song index has the fields 'artist' and 'title' indexed per
    (song-)document. It also has, for both fields, the relative
    frequency of their distinct spellings in the original corpus.
    One artist index contains only the unique artist names and has them
    indexed and stored together with their commonness.

    The first index is used when the artist is filled in and all his
    songs are to be displayed, or in cases where someone starts
    inputting part of a song title.

    The latter index gets perused for performance reasons whenever
    there is only input for the artist name.

    Continued and elaborated at the end of the mail...
    Here's part of a chunk of code I have lying around. It
    prints out all the terms that appear in a particular field and you
    should easily be able to make it use a WildcardTermEnum...
    ...
    Yep, I think that now I understand where the TermEnum takes its place
    in the lucene orchestra. Thank you.

    ...
    You'll have to elaborate on what "fault-tolerant search" means. If
    you're worried about misspellings, that's tough. You could try
    FuzzyQuery, or if that doesn't work you could think about working
    with soundex. But I can't stress strongly enough that you need to be
    absolutely sure this is a real problem *that your users will notice*
    before you invest time and energy in solving it. I'm continually
    amazed how much time and energy I spend solving non-existent
    problems <G>....
    Fuzzy and metaphone queries are absolutely out of the question, I
    agree.
    And for your sanity's sake, don't ask the product manager anything
    remotely like "would you like fault-tolerant searches?". The answer
    will be yes, regardless of whether it makes a difference to the end
    user. And I'll only mention briefly that asking Sales if they'd like
    a feature is the road to madness.....
    The manager in this case, am I :-)

    I picked this task all by myself as a private project that I hope
    will help to get me into Java and Lucene.
    I have had an interest in fulltext searches and IR systems ever
    since I had to write a (simple) engine some years ago. In Perl,
    that was.
    And a spell checker isn't very useful with names anyway......
    ....

    I very much agree that fault tolerance in most other search
    applications makes no sense, confuses the user and is inferior to a
    well-suited choice of applicable search criteria.
    Here, however, the idea is to direct the user's attention towards
    information that he didn't think of beforehand but that is present
    in the database and may be a pleasure to discover.

    I don't know whether your personal interest is rather books or
    music, but just imagine that you can enter into the form any terms
    from an artist or song name that you vaguely remember, et voila,
    there it is!
    Find yet unknown songs by your favourite group, find covers of your
    favourite songs.

    An example of what I have in mind can be found here:
    http://tinyurl.com/35g3yo

    My 1.5 million songs are a very small corpus compared to theirs,
    but the basic problems are still the same:

    A huge number of documents (individual songs) with two fields of
    little content each.

    The content's spelling is as unreliable, or rather 'diverse', as
    any user input.
    I've already folded the originally much, much bigger corpus down to
    those songs where at least 2 instances of the spellings of artist
    and song name exist, and yet any more popular song may still be
    present in dozens of different forms.

    Thus, without any fault tolerance, a 'sharp' search executed for
    any popular song or artist will probably return the expected
    result, possibly even for a misspelled input.
    But much more rarely, I think, will it return suggestions for the
    more uncommon songs. And that is basically what this whole suggest
    thingy aims at.
    Suggestions must be artists or songs, not just terms.

    Hopefully my description was not too confused; maybe I am.

    I will now first try to find the most effective way to expand the
    prefixes and leave the 'fuzzy' aspect for a second step.


    Thank you for your thoughts,
    Steffen


  • Karl wettin at Apr 11, 2007 at 10:30 pm

    On 11 Apr 2007 at 22:32, Steffen Heinrich wrote:

    According to occasional references on this list, some people have
    already tried to implement such a search with lucene, but did they
    succeed?

    My first idea was to run every completed token of the request
    (current user input) through a spellchecker and expand an incomplete
    token to a PrefixQuery.
    I've posted a solution in the Jira that uses successful user
    queries as a corpus rather than bashing a Lucene index, and is thus
    best suited for systems with a bit of user activity. The data is
    stored in a trie pattern with meta data at each node, allowing for
    extra data such as number of hits, available facet classifications
    or what not. I've had it running for a couple of days in a system
    with one user query every ten milliseconds, so it ought to be
    stable(tm).

    https://issues.apache.org/jira/browse/LUCENE-625

    There is a memory leak in the trie at optimize() that has been
    fixed locally. It might be available in LUCENE-626 too; not sure
    right now, but let me know and I'll make sure to post it.
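
    For anyone new to tries, a toy sketch of the general pattern (just
    an illustration, not the actual LUCENE-625 code):

    import java.util.HashMap;
    import java.util.Map;

    // A trie whose nodes carry metadata (here just a hit count),
    // trained on query strings and consulted per prefix.
    public class SuggestTrie {
        private static class Node {
            final Map<Character, Node> children =
                    new HashMap<Character, Node>();
            int hits; // metadata: queries that ended at this node
        }

        private final Node root = new Node();

        public void add(String query) {
            Node node = root;
            for (int i = 0; i < query.length(); i++) {
                Character c = query.charAt(i);
                Node child = node.children.get(c);
                if (child == null) {
                    child = new Node();
                    node.children.put(c, child);
                }
                node = child;
            }
            node.hits++; // a complete query ends here
        }

        // Number of stored queries that ended exactly at this prefix.
        public int hits(String prefix) {
            Node node = root;
            for (int i = 0; i < prefix.length() && node != null; i++) {
                node = node.children.get(prefix.charAt(i));
            }
            return node == null ? 0 : node.hits;
        }
    }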


    --
    karl

  • Steffen Heinrich at Apr 12, 2007 at 10:20 am

    On 12 Apr 2007 at 0:28, karl wettin wrote:

    ...
    Hello Karl,

    thank you for this information. It was the first time I ever heard of
    tries and had to read up on them. Very interesting!

    The intended system, however, cannot be trained by user input. The
    suggestions have to come from a given corpus (e.g. an occasionally
    updated product database).
    Do you think adapting your package to set up the tries from a
    corpus would be fairly easy?

    Cheers, Steffen



  • Karl wettin at Apr 12, 2007 at 2:50 pm
    On 12 Apr 2007 at 12:19, Steffen Heinrich wrote:
    The intended system, however, cannot be trained by user input. The
    suggestions have to come from a given corpus (e.g. an occasionally
    updated product database).
    Do you think adapting your package to set up the tries from a
    corpus would be fairly easy?
    You can train it with any data you want. So you would need to
    figure out what people will probably be searching for. The first
    thing I can think of is to extract the most frequent n-grams at the
    word level in your title field, or so. It is tough to say what
    might actually work; frequent phrases in the corpus might have
    nothing to do with consumer popularity.
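
    For example, something naive like this sketch could count word-
    level bigrams over the titles (my helper; tokenization is
    deliberately crude):

    import java.util.HashMap;
    import java.util.Map;

    // Count word-level bigrams over a collection of titles; the most
    // frequent ones could seed the trie.
    public static Map<String, Integer> wordBigramCounts(Iterable<String> titles) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String title : titles) {
            String[] words = title.toLowerCase().split("\\s+");
            for (int i = 0; i + 1 < words.length; i++) {
                String bigram = words[i] + " " + words[i + 1];
                Integer n = counts.get(bigram);
                counts.put(bigram, n == null ? 1 : n + 1);
            }
        }
        return counts;
    }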

    If I understand everything, your application is installed locally
    on consumer machines. Perhaps you could allow end users to share
    anonymous data based on taste and build a set per end user.
    Collaborative filtering comes to mind. Reducing a data set to
    something relevant usually equals behavioural analysis.

    Hope this helps.

    --
    karl

  • Steffen Heinrich at Apr 12, 2007 at 6:01 pm

    On 12 Apr 2007 at 16:49, karl wettin wrote:

    ...
    Wow, this kind of data mining is way over my head! I wouldn't know
    where to begin.

    This search is only meant to be used in an ajax-driven web
    application.
    And the basic idea is to give the user incentive and turn him to
    something new, something he didn't think of before.
    I just generalized on the concept in a mail to Erick under the same
    subject. There is also a link to a working implementation that served
    as my model.

    In the wikipedia article on tries I found the following sentence
    that drew my attention:
    "Tries are also well suited for implementing approximate matching
    algorithms, including those used in spell checking software."

    Do you have any information about how this can be done?

    Cheers,
    Steffen


  • Karl wettin at Apr 12, 2007 at 6:23 pm

    On 12 Apr 2007 at 20:00, Steffen Heinrich wrote:

    This search is only meant to be used in an ajax-driven web
    application.
    And the basic idea is to give the user incentive and turn him to
    something new, something he didn't think of before.
    I just generalized on the concept in a mail to Erick under the same
    subject. There is also a link to a working implementation that served
    as my model.
    As "ivan charo" finds "Ivan Goncharov", I suspect they work on a
    token n-gram level. Perhaps that is something you could try?

    Still, I don't like the idea of hammering the index like that. But in
    your case that might not be a problem.
    "Tries are also well suited for implementing approximate matching
    algorithms, including those used in spell checking software."

    Do you have any information about how this can be done?
    The author probably thought of navigating an "a priori" trie (a
    trie filled with known good words) using some path-finding
    algorithm (breadth-first, Dijkstra, A*, etc.) based on a (possibly)
    incorrectly spelled word. Personally I think there are better
    (algorithmic) ways to solve that problem.

    You are welcome to try
    <https://issues.apache.org/jira/browse/LUCENE-626> if you find
    spellchecking interesting.

    --
    karl

  • Steffen Heinrich at Apr 12, 2007 at 7:29 pm

    On 12 Apr 2007 at 20:22, karl wettin wrote:


    On 12 Apr 2007 at 20:00, Steffen Heinrich wrote:
    This search is only meant to be used in an ajax-driven web
    application.
    And the basic idea is to give the user incentive and turn him to
    something new, something he didn't think of before.
    I just generalized on the concept in a mail to Erick under the same
    subject. There is also a link to a working implementation that served
    as my model.
    As "ivan charo" finds "Ivan Goncharov", I suspect they work on a
    token n-gram level. Perhaps that is something you could try?
    Some weeks ago I tried out the NGramTokenstream by alias-i as it
    was presented in 'Lucene in Action', and it returned good results
    but seemed overly time-consuming compared with the spell checker
    distributed with lucene 2.1.0.
    Still, I don't like the idea of hammering the index like that. But in
    your case that might not be a problem.
    "Tries are also well suited for implementing approximate matching
    algorithms, including those used in spell checking software."

    Do you have any information about how this can be done?
    The author probably thought of navigating an "a priori" trie (a
    trie filled with known good words) using some path-finding
    algorithm (breadth-first, Dijkstra, A*, etc.) based on a (possibly)
    incorrectly spelled word. Personally I think there are better
    (algorithmic) ways to solve that problem.

    You are welcome to try
    <https://issues.apache.org/jira/browse/LUCENE-626> if you find
    spellchecking interesting.
    It looks promising and I will try to get my head into it, but I'm not
    sure at all if I'll be up to the task. :(

    Thank you, Karl.

    Kind Regards,
    Steffen



Discussion Overview
group: java-user
categories: lucene
posted: Apr 11, '07 at 8:33p
active: Apr 12, '07 at 7:29p
posts: 13
users: 4
website: lucene.apache.org
