Autocompletion on large index
Hi again.

I have created my own autocompleter based on the spellchecker. This
works well in the sense that it is able to create an autocompletion
index from my 'publication' index. However, integrated in my web
application, each keypress asks the autocompleter to search the index,
which is stored on disk (not in memory), just like the spellchecker
does (except that the spellchecker is not invoked on every keypress).
With Lucene 3.3.0, autocompletion modules are included, which load
their trees/FSAs/... into memory. I'd like to use these modules, but
the problem is that they use more than 2.5 GB, causing heap-space
exceptions. This happens when I try to build a Lookup index (FST,
Jaspell, or TST; it doesn't matter which) from my 'publication' index
consisting of 1.3M publications. The field I use for autocompletion
holds the titles of the publications, indexed untokenized (but
lowercased).

Code:
Lookup autoCompleter = new TSTLookup();
FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
LuceneDictionary dict = new LuceneDictionary(IndexReader.open(dir), "title_suggest");
autoCompleter.build(dict);

Is it possible to have the autocompletion module work in memory on
such a dataset without increasing Java's heap space?
For the record, the 3.3.0 autocompletion modules use more than 2.5 GB
of RAM, whereas my own autocompleter index is stored on disk using
about 300 MB.

BR,
Elmer


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Michael McCandless at Jul 6, 2011 at 4:24 pm
    You could try storing your autocomplete index in a RAMDirectory?

    But: I'm surprised you see the FST suggest impl using up so much RAM;
    very low memory usage is one of the strengths of the FST approach.
    Can you share the text (titles) you are feeding to the suggest module?
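A minimal sketch of the RAMDirectory idea, assuming Lucene 3.x and reusing the hypothetical "PATHTOINDEX" path from the original snippet: RAMDirectory can wrap an existing on-disk directory, copying the whole index onto the heap once at startup, so per-keystroke searches never touch disk.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

class InMemorySuggestIndex {
    public static void main(String[] args) throws Exception {
        // Open the existing autocomplete index on disk...
        Directory onDisk = FSDirectory.open(new File("PATHTOINDEX"));
        // ...and copy it into the heap once. This costs roughly the
        // on-disk index size in RAM (~300 MB here), not the 2.5 GB
        // observed while building the in-memory tree structures.
        Directory inMemory = new RAMDirectory(onDisk);
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(inMemory));
        // searcher now serves every keystroke entirely from RAM.
    }
}
```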

    Mike McCandless

    http://blog.mikemccandless.com
    On Wed, Jul 6, 2011 at 12:08 PM, Elmer wrote:
  • Elmer at Jul 6, 2011 at 5:50 pm
    Hi Mike,

That's what I thought when I started indexing it. To be clear, it
happens at build time. I don't know whether memory efficiency is any
better once building has finished.

The titles I index come from the DBLP computer science bibliography.
They can be up to, say, 100 characters long.
    Examples:
    -------
    - Auditory stimulus optimization with feedback from fuzzy clustering of
    neuronal responses
    - Two-objective method for crisp and fuzzy interval comparison in
    optimization
    - Bound Constrained Smooth Optimization for Solving Variational Inequalities
    and Related Problems
    - Retrieval of bibliographic records using Apache Lucene
    - Digital Library Information Appliances
    -------

The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter,
in that order.

I also tried to do the same for the author names, and that works
without problems. In fact, it builds the tree/FSA/... faster from the
dictionary than from file (the lookup data file that can be stored and
loaded through the .store and .load methods). But the larger set of
publication titles is currently a no-go with 2.5 GB of heap space,
even when running only a main class that builds the Lookup data.

    BR,
    Elmer


-----Original message-----
    From: Michael McCandless
    Sent: Wednesday, July 06, 2011 6:23 PM
    To: java-user@lucene.apache.org
    Subject: Re: Autocompletion on large index

  • Michael McCandless at Jul 6, 2011 at 6:40 pm
Hmm... so I suspect the FST suggest module must first gather up all
titles, then sort them, in RAM, and then build the actual FST. Maybe
it's this gather + sort that's taking so much RAM?

    1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
    that shouldn't be it...
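Spelled out, the product is 260,000,000 bytes; the ~248 figure comes from reading it in binary megabytes (MiB):

```java
class TitleRamEstimate {
    // 1.3 M titles * 100 chars/title * 2 bytes/char (UTF-16 in the JVM heap)
    static long estimateBytes(long titles, long charsPerTitle) {
        return titles * charsPerTitle * 2;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1_300_000L, 100L);
        System.out.println(bytes);                    // 260000000
        System.out.println(bytes / (1024.0 * 1024));  // ~247.96, i.e. ~248 MB
    }
}
```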

Is this an accessible corpus? Can I somehow get a copy to play with?

    Are you able to [temporarily, once] build the full FST and other
    suggest impls and compare how much RAM is required for building and
    then lookups?

    Mike McCandless

    http://blog.mikemccandless.com
  • Elmer at Jul 6, 2011 at 6:52 pm
I just profiled the application, and tst.TernaryTreeNode takes 99.99%
of the memory.

I'll test further tomorrow and report on memory usage for smaller
indexes that are able to run.
    I will email you privately for sharing the index to work with.

    BR,
    Elmer


-----Original message-----
    From: Michael McCandless
    Sent: Wednesday, July 06, 2011 8:39 PM
    To: java-user@lucene.apache.org
    Subject: Re: Autocompletion on large index

  • Dawid Weiss at Jul 7, 2011 at 9:10 am
Elmer: TST will have a large overhead. FST may not be that much better
if your input has very few shared prefixes or suffixes, and in your
case I think this is unfortunately true. What I would do is create a
regular Lucene index, store it on disk, and then run prefix queries on
it. That should work and scale to a large number of ops per second.
See the Lucene Revolution 2011 talks; there was a talk about using
exactly this instead of a completion module.

Like Mike said, though, it'd be interesting to investigate your data.
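A sketch of that prefix-query approach, again assuming Lucene 3.x and reusing the "PATHTOINDEX" path and "title_suggest" field from the earlier snippet; the prefix string itself is just an illustrative example.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

class PrefixSuggest {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
            IndexReader.open(FSDirectory.open(new File("PATHTOINDEX"))));
        // Lowercase the user's input to match the KeywordTokenizer +
        // LowerCaseFilter analysis of the title_suggest field.
        String typed = "Retrieval of bib".toLowerCase();
        PrefixQuery q = new PrefixQuery(new Term("title_suggest", typed));
        TopDocs hits = searcher.search(q, 10);  // top-10 completions
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            System.out.println(
                searcher.doc(hits.scoreDocs[i].doc).get("title_suggest"));
        }
    }
}
```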
  • Michael McCandless at Jul 7, 2011 at 10:24 am
OK, Elmer sent me the titles (thanks!) and I ran some quick tests...

There are 1.32M titles, averaging 67.3 chars per title (84.5 MB of
text as UTF-8 on disk, so at least 168.9 MB of RAM as UTF-16, i.e.,
when loaded into RAM).

    I built the suggest FST from these titles, and it required 1.25 GB
    heap (anything less hits OOME). BUT: my test does not load all the
    terms into RAM (just builds the FST directly from the TermsEnum), so
    the "real" FSTLookup construction will require more RAM.

    It took 22.5 seconds to build and the resulting FST is 91.6 MB.

Next I tried turning off suffix sharing, i.e. this "downgrades" the
resulting FST to a prefix trie, but it saves RAM and CPU during
building: it built in 8.2 seconds with a 450 MB heap, and the
resulting FST is 129 MB. The suggest module doesn't make this an
option today, but maybe it should? Suffix sharing requires sizable RAM
while building because it maintains a hash containing all nodes in
order to locate the duplicates.
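As a toy illustration of that cost (not the actual FST builder, which hashes frozen nodes rather than strings): a registry that deduplicates shared suffixes must still hold every distinct suffix it has ever seen, so its memory grows with the input even though the final structure shrinks.

```java
import java.util.HashSet;
import java.util.Set;

class SuffixRegistryDemo {
    /**
     * Returns {total suffix positions, distinct suffixes}. The registry
     * (a hash set here) is what lets shared suffixes be stored once in
     * the final structure, but it must keep every distinct suffix in
     * memory for the duration of the build.
     */
    static long[] count(String[] terms) {
        Set<String> registry = new HashSet<String>();
        long total = 0;
        for (String t : terms) {
            for (int i = 0; i < t.length(); i++) {
                registry.add(t.substring(i)); // dedup lookup, like node hashing
                total++;
            }
        }
        return new long[] { total, registry.size() };
    }

    public static void main(String[] args) {
        long[] c = count(new String[] {
            "optimization", "fuzzification", "clusterization" });
        // The shared "-ization"/"-ation" tails are stored only once.
        System.out.println(c[0] + " suffix positions, " + c[1] + " distinct");
    }
}
```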

    It's also possible to improve FST to have shades of gray between
    on/off... I'll open an issue.

    Mike McCandless

    http://blog.mikemccandless.com
  • Dawid Weiss at Jul 7, 2011 at 11:00 am
    Another option to trade off size and memory is to do an LRU-like cache of
    suffix nodes / registry. I'm still working on that API replacement patch
    so any changes to FST right now scare me...
    On Jul 7, 2011 12:24 PM, "Michael McCandless" wrote:

    OK, Elmer sent me the titles (thanks!) and I ran some quick tests...
    There are 1.32M titles, avg 67.3 chars per title (total 84.5 MB of text
    as UTF-8 on disk, so at least 168.9 MB of RAM as UTF-16, i.e. when loaded
    in RAM).

    I built the suggest FST from these titles, and it required 1.25 GB
    heap (anything less hits OOME). BUT: my test does not load all the
    terms into RAM (just builds the FST directly from the TermsEnum), so
    the "real" FSTLookup construction will require more RAM.

    It took 22.5 seconds to build and the resulting FST is 91.6 MB.

    Next I tried turning off suffix sharing, i.e. this "downgrades" the
    resulting FST to a prefix trie, but it saves RAM and CPU during building:
    it built in 8.2 seconds and with 450 MB heap; the resulting FST is 129
    MB. The suggest module doesn't make this an option today but maybe we
    should? Suffix sharing requires sizable RAM while building because it
    maintains a hash containing all nodes in order to locate the dups.

    It's also possible to improve FST to have shades of gray between
    on/off... I'll open an issue.
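    The node registry described above can be illustrated with a toy
    hash-consing sketch. This is NOT Lucene's FST Builder API; every class
    and method name below is invented for illustration. A naive trie is
    built first, then nodes are interned bottom-up through a registry map so
    that identical suffix subtrees collapse into one shared object; the
    registry has to keep every unique node alive until the build finishes,
    which is exactly the build-time RAM cost being discussed.

```java
import java.util.*;

// Toy hash-consing sketch (NOT Lucene's FST Builder): identical suffix
// subtrees collapse into one shared object via a node registry.
public class SuffixSharingDemo {
    static final class Node {
        boolean isFinal;
        final TreeMap<Character, Node> arcs = new TreeMap<>();

        // Children are interned before the parent, so comparing child
        // references (==) is enough to detect structurally equal nodes.
        @Override public boolean equals(Object o) {
            if (!(o instanceof Node)) return false;
            Node n = (Node) o;
            if (isFinal != n.isFinal || arcs.size() != n.arcs.size()) return false;
            Iterator<Map.Entry<Character, Node>> a = arcs.entrySet().iterator();
            Iterator<Map.Entry<Character, Node>> b = n.arcs.entrySet().iterator();
            while (a.hasNext()) {
                Map.Entry<Character, Node> x = a.next(), y = b.next();
                if (!x.getKey().equals(y.getKey()) || x.getValue() != y.getValue()) return false;
            }
            return true;
        }
        @Override public int hashCode() {
            int h = isFinal ? 1 : 0;
            for (Map.Entry<Character, Node> e : arcs.entrySet())
                h = 31 * h + e.getKey() + System.identityHashCode(e.getValue());
            return h;
        }
    }

    static Node buildTrie(String[] terms) {
        Node root = new Node();
        for (String t : terms) {
            Node cur = root;
            for (char c : t.toCharArray())
                cur = cur.arcs.computeIfAbsent(c, k -> new Node());
            cur.isFinal = true;
        }
        return root;
    }

    // Bottom-up interning: the registry must hold every unique node until
    // the build finishes -- that is the RAM cost of suffix sharing.
    static Node intern(Node n, Map<Node, Node> registry) {
        for (Map.Entry<Character, Node> e : n.arcs.entrySet())
            e.setValue(intern(e.getValue(), registry));
        Node canonical = registry.putIfAbsent(n, n);
        return canonical != null ? canonical : n;
    }

    static int countNodes(Node n, Set<Node> seen) {
        if (!seen.add(n)) return 0;
        int c = 1;
        for (Node child : n.arcs.values()) c += countNodes(child, seen);
        return c;
    }
}
```

    For terms {"bat", "cat", "rat"} the naive trie has 10 nodes; after
    interning, the three identical "at" chains share storage and only 4
    unique nodes remain. Skipping the intern step (roughly what "turning
    off suffix sharing" means) avoids the registry but leaves the larger
    prefix trie.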

    Mike McCandless

    http://blog.mikemccandless.com
    On Thu, Jul 7, 2011 at 5:09 AM, Dawid Weiss wrote:
    Elmer. TST will have a large overhead. FST may not be that much better if
    your input has very few shared prefixes or suffixes. In your case I think
    this is unfortunately true. What I would do is create a regular Lucene
    index and store it on disk. Then run prefix queries on it. Should work
    and scale to a large number of ops per sec. See the Lucene Revolution
    2011 talks - there was a talk about using just this instead of a
    completion module.

    Like Mike said though, it'd be interesting to investigate on your data.
    On Jul 6, 2011 8:52 PM, "Elmer" wrote:
    I just profiled the application and tst.TernaryTreeNode takes 99.99..%
    of
    the memory.

    I'll test further tomorrow and report on mem usage for runnable smaller
    indexes.
    I will email you privately for sharing the index to work with.

    BR,
    Elmer


    -----Oorspronkelijk bericht-----
    From: Michael McCandless
    Sent: Wednesday, July 06, 2011 8:39 PM
    To: java-user@lucene.apache.org
    Subject: Re: Autocompletion on large index

    Hmm... so I suspect the fst suggest module must first gather up all
    titles, then sort them, in RAM, and then build the actual FST. Maybe
    it's this gather + sort that's taking so much RAM?

    1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
    that shouldn't be it...

    Is this an accessible corpus? Can I somehow get a copy to play
    with...?
    Are you able to [temporarily, once] build the full FST and other
    suggest impls and compare how much RAM is required for building and
    then lookups?

    Mike McCandless

    http://blog.mikemccandless.com
    On Wed, Jul 6, 2011 at 1:50 PM, Elmer wrote:
    Hi Mike,

    That's what I thought when I started indexing it. To be clear, it
    happens at build time.
    I don't know if memory efficiency is better once building has finished.

    The titles I index are titles from the DBLP computer science
    bibliography. They can take up to... say 100 characters.
    Examples:
    -------
    - Auditory stimulus optimization with feedback from fuzzy clustering of
    neuronal responses
    - Two-objective method for crisp and fuzzy interval comparison in
    optimization
    - Bound Constrained Smooth Optimization for Solving Variational
    Inequalities
    and Related Problems
    - Retrieval of bibliographic records using Apache Lucene
    - Digital Library Information Appliances
    -------

    The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter,
    in that order.

    I also tried to do the same for the author names, and this works without
    problems. Actually it builds the tree/fsa/... faster from the dictionary
    than from file (the lookup data file that can be stored and loaded
    through the .store and .load methods). But the larger set of publication
    titles is currently a no-go with 2.5GB of heap space, even with only a
    main class that builds the Lookup data.

    BR,
    Elmer


  • Michael McCandless at Jul 7, 2011 at 5:17 pm

    On Thu, Jul 7, 2011 at 7:00 AM, Dawid Weiss wrote:
    Another option to trade off size and memory is to do an LRU-like cache of
    suffix nodes / registry. I'm still working on that API replacement patch
    so any changes to FST right now scare me...
    That sounds cool too!

    I opened LUCENE-3289 to allow controlling how hard the Builder tries
    to share suffixes... i.e. trading off CPU/RAM usage while building
    against final FST size.

    Mike McCandless

    http://blog.mikemccandless.com

  • Dawid Weiss at Jul 7, 2011 at 7:23 pm
    You can actually make an (relatively easy) change to FSTLookup to
    allow infix matches (or word-boundary matches). This should have
    little impact on memory and nearly zero on performance. This issue is
    tracking this:

    https://issues.apache.org/jira/browse/SOLR-2479

    I should have implemented it a while ago, but I've been swamped with
    other work, sorry.

    Dawid

  • Elmer at Jul 7, 2011 at 2:14 pm
    Thanks,
    Your replies ended up in my spam box and therefore I missed your
    recommendation to use FST. I'll do more testing soon with FST instead of
    TST. And I'll surely take a look at that talk!

    BR,
    Elmer
    On Thu, 2011-07-07 at 11:09 +0200, Dawid Weiss wrote:
    Elmer. TST will have a large overhead. FST may not be that much better if
    your input has very few shared prefixes or suffixes. In your case I think
    this is unfortunately true. What I would do is create a regular Lucene
    index and store it on disk. Then run prefix queries on it. Should work
    and scale to a large number of ops per sec. See the Lucene Revolution
    2011 talks - there was a talk about using just this instead of a
    completion module.

    Like Mike said though, it'd be interesting to investigate on your data.
  • Elmer at Jul 7, 2011 at 11:09 am
    I got it working by modifying TSTAutocomplete to enforce a limit on the
    prefix length. :)
    The depth of the tree will not grow beyond this prefix length.

    When set to 20 chars, total mem usage is ~520MB, of which 48.9% is for
    the TernaryTreeNode objects.
    Building took 7 seconds, reading from an external HDD.
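    The modification can be sketched like this. This is a self-contained toy,
    not the actual TSTAutocomplete patch, and all names here are illustrative:
    keys are truncated to the maximum prefix length before insertion, so the
    ternary tree's depth (and therefore the node count per key) is bounded
    regardless of title length.

```java
// Toy ternary search tree with a capped key length (illustrative only,
// not the actual TSTAutocomplete patch). Truncating keys bounds the tree
// depth, and thereby the number of TernaryTreeNode-style objects per key.
public class CappedTst {
    static final class TstNode {
        char splitChar;
        TstNode lo, eq, hi;
        boolean wordEnd; // a (possibly truncated) key ends here
    }

    private final int maxPrefixLen;
    private TstNode root;

    public CappedTst(int maxPrefixLen) { this.maxPrefixLen = maxPrefixLen; }

    public void add(String key) {
        if (key.isEmpty()) return;
        // The cap: only the first maxPrefixLen chars ever enter the tree.
        if (key.length() > maxPrefixLen) key = key.substring(0, maxPrefixLen);
        root = insert(root, key, 0);
    }

    private TstNode insert(TstNode n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) { n = new TstNode(); n.splitChar = c; }
        if (c < n.splitChar) n.lo = insert(n.lo, s, i);
        else if (c > n.splitChar) n.hi = insert(n.hi, s, i);
        else if (i < s.length() - 1) n.eq = insert(n.eq, s, i + 1);
        else n.wordEnd = true;
        return n;
    }

    public boolean containsPrefix(String prefix) {
        // Queries longer than the cap can only match up to the cap.
        if (prefix.length() > maxPrefixLen) prefix = prefix.substring(0, maxPrefixLen);
        TstNode n = root;
        for (int i = 0; n != null; ) {
            char c = prefix.charAt(i);
            if (c < n.splitChar) n = n.lo;
            else if (c > n.splitChar) n = n.hi;
            else if (++i == prefix.length()) return true;
            else n = n.eq;
        }
        return false;
    }

    public int nodeCount() { return count(root); }

    private int count(TstNode n) {
        return n == null ? 0 : 1 + count(n.lo) + count(n.eq) + count(n.hi);
    }
}
```

    A single long title inserted with a cap of 20 creates at most 20 nodes
    instead of one node per character, which is consistent with the memory
    drop reported above.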

    I created a zip, with JAR and sourcecode, available here:
    http://www.computer-tuning.nl/lucene/TSTLookupWithPrefixLimit.zip

    You still need the spellchecker for dependencies.

    BR,
    Elmer
    On Wed, 2011-07-06 at 20:52 +0200, Elmer wrote:
    I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
    the memory.

    I'll test further tomorrow and report on mem usage for runnable smaller
    indexes.
    I will email you privately for sharing the index to work with.

    BR,
    Elmer


  • Elmer at Jul 6, 2011 at 6:03 pm
    You could try storing your autocomplete index in a RAMDirectory?
    I forgot to mention: I tried this previously, but that also resulted in
    heap space problems. That's why I was interested in using the new suggest
    classes. :)

    BR,
    Elmer

  • Elmer at Jul 7, 2011 at 1:42 pm
    I just tested my autocompleter in a clean environment (instead of
    sharing a lot of resources with the Java servlet) and I was able to run
    the autocompletion in memory, using the RAMDirectory.

    This morning I posted a modified TST implementation in this thread. I
    have compared my autocompleter with the TST (with max prefix length set
    to 20 chars).

    Compared to the TST implementation, my autocompleter is able to match
    tokens from the complete titles (it tokenizes on whitespace), i.e. 'inf'
    will match the titles:
    "information retrieval"
    "best practices in information retrieval"
    It also ranks the lookups by popularity, using the frequency of the
    terms in the source index to sort the lookup results.
    If somebody is interested, I can provide my autocompletion class (based
    on spellcheck class in Lucene 3.1.0).
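    The token matching and frequency ranking described above can be sketched
    roughly like this. This is a simplified stand-in, not the actual
    spellchecker-based class, and all names are invented: titles are split on
    whitespace, each token maps to the titles containing it, and a sorted-map
    range scan plays the role of the prefix query, with more frequent tokens
    ranked first.

```java
import java.util.*;

// Simplified stand-in for a token-based suggester (not the actual
// spellchecker-derived class): 'inf' matches any title containing a
// token that starts with "inf", ranked by token frequency.
public class TokenPrefixSuggester {
    private final TreeMap<String, List<String>> tokenToTitles = new TreeMap<>();
    private final Map<String, Integer> tokenFreq = new HashMap<>();

    void addTitle(String title) {
        for (String tok : title.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (tok.isEmpty()) continue;
            tokenToTitles.computeIfAbsent(tok, k -> new ArrayList<>()).add(title);
            tokenFreq.merge(tok, 1, Integer::sum); // popularity = term frequency
        }
    }

    // All titles containing a token that starts with the prefix, most
    // frequent tokens first -- the subMap range scan is the TreeMap
    // analogue of a prefix query.
    List<String> lookup(String prefix, int max) {
        String p = prefix.toLowerCase(Locale.ROOT);
        SortedMap<String, List<String>> range = tokenToTitles.subMap(p, p + '\uffff');
        List<String> toks = new ArrayList<>(range.keySet());
        toks.sort((a, b) -> tokenFreq.get(b) - tokenFreq.get(a));
        LinkedHashSet<String> out = new LinkedHashSet<>(); // dedupe titles
        for (String tok : toks) {
            for (String t : range.get(tok)) {
                if (out.size() == max) return new ArrayList<>(out);
                out.add(t);
            }
        }
        return new ArrayList<>(out);
    }
}
```

    A real implementation would of course read the tokens and frequencies
    from the Lucene index rather than from raw strings, but the lookup shape
    is the same.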

    How I tested:
    Both autocompletion implementations use the same index, holding 1.32M
    titles. I generated 1000 random prefixes of length 1, 2 or 3 chars. Both
    implementations 'warmed up' by looking up 1000 prefixes prior to
    measuring the time it takes to perform 1000 lookups that each return 20
    results (at most). Heap space was set to 2.5GB.

    Result:
    - TST uses at least 600MB of memory with ~10 GC activities
    - My autocompleter uses at least 407MB with 1 GC activity
    - TST runs 1000 completions in 7262ms
    - My implementation runs 1000 completions in 18617ms
    - Both used ~100% CPU on 1 core during the test

    For now, I think I'm gonna stick to my own autocompleter until TST can
    be used 'efficiently' and can sort by popularity based on frequency. It
    seems that the current TSTLookup implementation doesn't use term
    frequencies from a source dictionary. Also, I'd like to match tokens from
    within each term in the source index. I don't think that's possible
    without changing the inner workings of TSTLookup?

    BR,
    Elmer
    On Wed, 2011-07-06 at 20:02 +0200, Elmer wrote:
    You could try storing your autocomplete index in a RAMDirectory?
    I forgot to mention. I tried this previously, but that also resulted in heap
    space problems. That's why I was interested in using the new suggest classes
    :)

    BR,
    Elmer


Discussion Overview
group: java-user
categories: lucene
posted: Jul 6, '11 at 4:08p
active: Jul 7, '11 at 7:23p
posts: 14
users: 3
website: lucene.apache.org
