FAQ
Hello people,

yes, there were several threads about this topic, but I sadly have to respawn
it, I'm sorry.

The first I found was a discussion from May 2005:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3cPine.LNX.4.58.0505302221330.28735@hal.rescomp.berkeley.edu%3e

There the final solution suggestion from Hoss was to try it with a binary search
on the TermEnum

The second one was also from May 2005, it seems that it was a follow-up:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3CPine.LNX.4.58.0505311145460.29003@hal.rescomp.berkeley.edu%3E
http://markmail.org/message/rp4xfdclsha7h5uq#query:termenum%20lucene%20maximum%20value+page:1+mid:q3doh6tvyf6swl6h+state:results

Here, the solution with the FieldCache was discussed, and also another direction
with a RangeQuery which results in a TooMAnyClauses that can be avoided by
setting the 'allowed clauses count' to a bigger number.




I use the solution with the FieldCache, which worked fine for a long time, but
when I use it with bigger fields with millions of entries, I have big peaks in
memory consumption, which sometimes result in an OutOfMemory Error. I assume
that with the range query, I will fall in the same Problem.


I now want to try out the solution from Hoss with the binary search over the
TermEnum, but it is not clear for me how to perform this.

The only methods in TermEnum are

public abstract boolean next()
public abstract Term term();
public abstract int docFreq();
public abstract void close()
public boolean skipTo(Term target)

Whereby skipTo "Skips terms to the first beyond the current whose value is
greater or equal to 'target'. Returns true iff there is such an entry."

How to avoid to perfom the 'big loop with next' until I am at the last entry,
like the current implementation of skipTo:

do {
if (!next())
return false;
} while (target.compareTo(term()) > 0);
return true;

Whereby target() would be the over biggest value we could think about, and I
remember the term bevore the method returns false.

Because of the tree-like architecture of the index, where the letters are some
kind of nodes, e.g.

a z
ab ar ze zu
abi ark zer zul

I would assume that there is a fast possibility to determine that 'abi' is the
minimum and 'zul' the maximum for that field - by simply walking through the
tree 'left - or rightwise' (when I only get the left node, I will walk to the
minimum, when I only get the right node through walking, I will get the maximum)

But this is a theoretical view. Enables the Lucene implementation walkthroughs
like this? At least the RangeQuery implementation I would assume walks throgh
the tree.


Thanks for all answers!

kindly regards

Chris

Search Discussions

  • Christian Reuschling at Jun 25, 2008 at 3:10 pm
    Hello people,


    I'm sorry if I have send this message twice - my gmail interface merges the
    mails in the 'send' folder with incoming mails from my adress - strange, but
    I can't say if the mail was sent - I only see it in the send-folder (with
    only one label on it, which brings me to send it again :( ) Okay.



    yes, there were several threads about this topic, but I sadly have to respawn
    it, I'm sorry.

    The first I found was a discussion from May 2005:

    http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3cPine.LNX.4.58.0505302221330.28735@hal.rescomp.berkeley.edu%3e

    There the final solution suggestion from Hoss was to try it with a binary search
    on the TermEnum

    The second one was also from May 2005, it seems that it was a follow-up:

    http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3CPine.LNX.4.58.0505311145460.29003@hal.rescomp.berkeley.edu%3E
    http://markmail.org/message/rp4xfdclsha7h5uq#query:termenum%20lucene%20maximum%20value+page:1+mid:q3doh6tvyf6swl6h+state:results

    Here, the solution with the FieldCache was discussed, and also another direction
    with a RangeQuery which results in a TooMAnyClauses that can be avoided by
    setting the 'allowed clauses count' to a bigger number.




    I use the solution with the FieldCache, which worked fine for a long time, but
    when I use it with bigger fields with millions of entries, I have big peaks in
    memory consumption, which sometimes result in an OutOfMemory Error. I assume
    that with the range query, I will fall in the same Problem.


    I now want to try out the solution from Hoss with the binary search over the
    TermEnum, but it is not clear for me how to perform this.

    The only methods in TermEnum are

    public abstract boolean next()
    public abstract Term term();
    public abstract int docFreq();
    public abstract void close()
    public boolean skipTo(Term target)

    Whereby skipTo "Skips terms to the first beyond the current whose value is
    greater or equal to 'target'. Returns true iff there is such an entry."

    How to avoid to perfom the 'big loop with next' until I am at the last entry,
    like the current implementation of skipTo:

    do {
    if (!next())
    return false;
    } while (target.compareTo(term()) > 0);
    return true;

    Whereby target() would be the over biggest value we could think about, and I
    remember the term bevore the method returns false.

    Because of the tree-like architecture of the index, where the letters are some
    kind of nodes, e.g.

    a z
    ab ar ze zu
    abi ark zer zul

    I would assume that there is a fast possibility to determine that 'abi' is the
    minimum and 'zul' the maximum for that field - by simply walking through the
    tree 'left - or rightwise' (when I only get the left node, I will walk to the
    minimum, when I only get the right node through walking, I will get the maximum)

    But this is a theoretical view. Enables the Lucene implementation walkthroughs
    like this? At least the RangeQuery implementation I would assume walks throgh
    the tree.


    Thanks for all answers!

    kindly regards

    Chris

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jason Rutherglen at Jun 25, 2008 at 3:24 pm
    I looked heavily at this. It requires a customization of TermInfosReader
    whereby the tii (term dictionary) SegmentTermEnum is traversed looking for
    the last term with a particular field. Once found, from that position in
    the tis SegmentTermEnum would need to be traversed again for the last term
    with the desired field. This would be the most efficient using the current
    system.

    In the Tag Index I am working on
    https://issues.apache.org/jira/browse/LUCENE-1292, the max values would be
    stored automatically at the end of the tis file. They would be available
    via a subclass of TermEnum that adds a method "public Term getMax(String
    field)"
    On Wed, Jun 25, 2008 at 5:43 AM, Christian Reuschling wrote:

    Hello people,

    yes, there were several threads about this topic, but I sadly have to
    respawn
    it, I'm sorry.

    The first I found was a discussion from May 2005:


    http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3cPine.LNX.4.58.0505302221330.28735@hal.rescomp.berkeley.edu%3e

    There the final solution suggestion from Hoss was to try it with a binary
    search
    on the TermEnum

    The second one was also from May 2005, it seems that it was a follow-up:


    http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3CPine.LNX.4.58.0505311145460.29003@hal.rescomp.berkeley.edu%3E

    http://markmail.org/message/rp4xfdclsha7h5uq#query:termenum%20lucene%20maximum%20value+page:1+mid:q3doh6tvyf6swl6h+state:results

    Here, the solution with the FieldCache was discussed, and also another
    direction
    with a RangeQuery which results in a TooMAnyClauses that can be avoided by
    setting the 'allowed clauses count' to a bigger number.




    I use the solution with the FieldCache, which worked fine for a long time,
    but
    when I use it with bigger fields with millions of entries, I have big peaks
    in
    memory consumption, which sometimes result in an OutOfMemory Error. I
    assume
    that with the range query, I will fall in the same Problem.


    I now want to try out the solution from Hoss with the binary search over
    the
    TermEnum, but it is not clear for me how to perform this.

    The only methods in TermEnum are

    public abstract boolean next()
    public abstract Term term();
    public abstract int docFreq();
    public abstract void close()
    public boolean skipTo(Term target)

    Whereby skipTo "Skips terms to the first beyond the current whose value is
    greater or equal to 'target'. Returns true iff there is such an entry."

    How to avoid to perfom the 'big loop with next' until I am at the last
    entry,
    like the current implementation of skipTo:

    do {
    if (!next())
    return false;
    } while (target.compareTo(term()) > 0);
    return true;

    Whereby target() would be the over biggest value we could think about, and
    I
    remember the term bevore the method returns false.

    Because of the tree-like architecture of the index, where the letters are
    some
    kind of nodes, e.g.

    a z
    ab ar ze zu
    abi ark zer zul

    I would assume that there is a fast possibility to determine that 'abi' is
    the
    minimum and 'zul' the maximum for that field - by simply walking through
    the
    tree 'left - or rightwise' (when I only get the left node, I will walk to
    the
    minimum, when I only get the right node through walking, I will get the
    maximum)

    But this is a theoretical view. Enables the Lucene implementation
    walkthroughs
    like this? At least the RangeQuery implementation I would assume walks
    throgh
    the tree.


    Thanks for all answers!

    kindly regards

    Chris
  • Chris Hostetter at Jun 25, 2008 at 10:01 pm
    : There the final solution suggestion from Hoss was to try it with a binary
    : search
    : on the TermEnum

    my suggestion at the time to do a binary search was a bit naive (i was
    not as familiar with Lucene as I am now).

    : Because of the tree-like architecture of the index, where the letters are some
    : kind of nodes, e.g.
    :
    : a z
    : ab ar ze zu
    : abi ark zer zul
    :
    : I would assume that there is a fast possibility to determine that 'abi' is the
    : minimum and 'zul' the maximum for that field - by simply walking through the

    the problem is that Terms are not actually organized in a tree structure
    -- the Term list is essentially a large singly linked list with an "index"
    (overused terminology unfortunately) of ever N terms so that it's possible
    to skip very quickly based a lot of nodes in the list.

    : like this? At least the RangeQuery implementation I would assume walks throgh
    : the tree.

    RangeQuery (and RangeFilter) skipTo the lowerbound and then next() their
    way to the upper bound -- but that's because they actually care about
    every term in between.

    To find the min you can skipTo(new Term(yourField, ""))and as long as the
    enum is pointed a Term for yourField you've got the min (if it's not,
    yourField isn't in the index)

    For the max things get harder ... the simplest appraoch is to next() your
    way along untill you find a term not in the current field ... whatever the
    last Term value was is your max ... but for fields with lots of Terms, it
    *might* be faster to do some iterative attempts with skipTo() to jump
    ahead -- if you ever pass over into a new field, then you know you skiped
    to far, you need to get a new TermEnum, skipTo() the last "good" term and
    then iterate from there (or skipTo() with smaller jumps ... maybe that's
    what i ment by a binary search?)

    writing completley generic code to do this would be easy, but probably not
    very efficient since it would have to worry about the fully gambit of
    unicode characters. if however you know that a field always contains
    integers, or english text with limited punctuation, you can probably make
    much more efficient guesses about what to try skiping to ("z" for example)

    Typically, it's much easier to just keep track at indexing time of the
    min/max ... a TokenFilter that inspects Tokens without modifying them can
    do this very easily.


    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJun 25, '08 at 9:48a
activeJun 25, '08 at 10:01p
posts4
users3
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase