Hi all. I have a question about sorting. Lucene in Action says: "For
numeric types, each field being sorted for each document in the index
requires that four bytes be cached. For String types, each unique term is
also cached for each document."

I want to make sure I'm understanding this correctly. Let's say I have a
document with some text and a date; a typical document might look like this:

DOCUMENT #1:
text = hello world
date = 20050401

Let's say I index 10,000 of these documents into a single Lucene index. I
then create two IndexSearchers on this index and do a search. The first
IndexSearcher sorts by date as an int, the other sorts by date as a string:

IndexSearcher #1 = date sort on INT
IndexSearcher #2 = date sort on STRING

If I understand the quoted sentence correctly, IndexSearcher #1 will have an
int array storing one date per document, while IndexSearcher #2 will have a
string array with only unique dates? If so, is there a particular reason
why sorting as an int doesn't cache unique dates?

The reason I ask: consider an index with 10,000 documents, where I
store year, month, and day as separate fields (for simplicity, let's assume I
store only the years 2000 - 2005). When sorting as an int, if each
field of each document needs to be cached, that's 10,000 documents * 3
fields = 30,000 cached ints. If terms are uniquely cached, that's just 6
(for each year) + 12 (for each month) + 31 (for each day) = 49 cached ints.
Am I interpreting any of this correctly?

Thanks,
Monsur



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Yonik Seeley at Nov 10, 2005 at 2:23 am
    The FieldCache (which is used for sorting) uses arrays of size
    maxDoc() to cache field values. String sorting will involve caching a
    String[] (or StringIndex) and int sorting will involve caching an
    int[]. Unique string values are shared in the array, but the String
    values plus the String[] will always take up more room than the int[].

    -Yonik
    Now hiring -- http://forms.cnet.com/slink?231706
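
To put numbers on the year/month/day example from the question, here is a minimal sketch that counts cache entries under the layout described above: one slot per document for every sorted field, plus one shared copy of each unique term when sorting as a string. The class and helper names (SortCacheSizes, column, uniqueTerms) are made up for illustration and are not part of Lucene, and the sketch counts entries only, not bytes.

```java
import java.util.Arrays;
import java.util.HashSet;

class SortCacheSizes {
    /** Hypothetical helper: one value per document, cycling through `count` values starting at `first`. */
    static String[] column(int maxDoc, int first, int count) {
        String[] values = new String[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            values[doc] = String.valueOf(first + doc % count);
        }
        return values;
    }

    /** Number of distinct terms in a column. */
    static int uniqueTerms(String[] column) {
        return new HashSet<String>(Arrays.asList(column)).size();
    }

    public static void main(String[] args) {
        int maxDoc = 10_000;
        String[][] fields = {
            column(maxDoc, 2000, 6),  // year: 2000..2005
            column(maxDoc, 1, 12),    // month: 1..12
            column(maxDoc, 1, 31)     // day: 1..31
        };
        int perDocSlots = 0;    // array entries, one per document per field
        int sharedStrings = 0;  // unique term values kept once per field
        for (String[] field : fields) {
            perDocSlots += field.length;
            sharedStrings += uniqueTerms(field);
        }
        System.out.println("int sort:    " + perDocSlots + " ints");
        System.out.println("string sort: " + perDocSlots + " slots + "
                + sharedStrings + " shared strings");
    }
}
```

Under these assumptions both sorts need 30,000 per-document slots. The 49 unique terms are an extra cost of the string sort, not a replacement for the per-document array.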

  • Monsur Hossain at Nov 10, 2005 at 5:44 pm
    Thanks Yonik, it makes sense now. So getStringIndex indexes every sorted
    string field in the retArray (one per document), and then each unique string
    term in the mterms array. What is the purpose of the mterms array?

    Thanks,
    Monsur



  • Yonik Seeley at Nov 10, 2005 at 6:33 pm
    Here is a snippet of the current StringIndex class:

    public static class StringIndex {
      /** All the term values, in natural order. */
      public final String[] lookup;

      /** For each document, an index into the lookup array. */
      public final int[] order;
    }

    The order field is used for sorting within a single IndexSearcher, but
    the lookup field is needed to populate the actual string value so it
    may be used by MultiSearchers to order hits from multiple Searchers.

    Look at FieldSortedHitQueue.comparatorString() for more info.

    I guess it would be nice to have some way of telling the searcher (and
    the fieldcache) whether the actual string values are needed or not...
    it could save a lot of memory when there are a lot of unique terms.

    -Yonik
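
The split between order and lookup can be sketched in a few lines. This is an illustrative stand-in, not Lucene's StringIndex: the class name StringIndexSketch, its constructor, and the methods compareDocs and value are assumptions, and it presumes every document has a value for the field.

```java
import java.util.Arrays;
import java.util.TreeSet;

class StringIndexSketch {
    final String[] lookup; // all unique term values, in natural order
    final int[] order;     // for each document, an index into lookup

    StringIndexSketch(String[] valuePerDoc) {
        // Unique terms, sorted; each term is stored once however many documents share it.
        lookup = new TreeSet<String>(Arrays.asList(valuePerDoc)).toArray(new String[0]);
        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            order[doc] = Arrays.binarySearch(lookup, valuePerDoc[doc]);
        }
    }

    /** Sorting within one index only compares ordinals; no strings are touched. */
    int compareDocs(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    /** Recovering the term itself (e.g. to merge hits across indexes) needs lookup. */
    String value(int doc) {
        return lookup[order[doc]];
    }

    public static void main(String[] args) {
        StringIndexSketch idx = new StringIndexSketch(
                new String[] {"20050401", "20040115", "20050401", "20031231"});
        System.out.println(Arrays.toString(idx.lookup)); // [20031231, 20040115, 20050401]
        System.out.println(Arrays.toString(idx.order));  // [2, 1, 2, 0]
        System.out.println(idx.compareDocs(1, 0) < 0);   // true
        System.out.println(idx.value(2));                // 20050401
    }
}
```

An ordinal is a position in one index's own lookup array, so the same number can stand for different terms in two different indexes. That is why merging hits from several searchers falls back on the string values.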

  • Monsur Hossain at Nov 10, 2005 at 7:13 pm
    Ah, I got it. retArray is an array of ints; in order to return the string
    value, it needs the mterms array to do the mapping. Thanks, Yonik!

    Monsur



  • Chris Hostetter at Nov 11, 2005 at 12:12 am
    : I guess it would be nice to have some way of telling the searcher (and
    : the fieldcache) whether the actual string values are needed or not...
    : it could save a lot of memory when there are a lot of unique terms.

    you're talking about something like LUCENE-457 right? ... but make it
    optional so clients who aren't using MultiSearcher can ignore the physical
    strings, and clients who are still have them.

    three thoughts have occurred to me on this...

    1) there might be a way to write a FieldCache.IntParser that could work
    ... but i can't think of any good way to do it that wouldn't be a total
    kludge (given the limited visibility IntParser has to the rest of the
    world)

    2) users could write a new subclass of FieldCacheImpl which consisted
    of...

    class FieldCacheNoMultiSearcher extends FieldCacheImpl {
      public StringIndex getStringIndex (IndexReader reader, String field)
          throws IOException {
        Object ret = lookup (reader, field, STRING_INDEX);
        if (ret == null) {
          StringIndex all = super.getStringIndex(reader,field);
          StringIndex part = new StringIndex(all.order, null);
          store(reader, field, STRING_INDEX, part);
          return part;
        }
        return (StringIndex) ret;
      }
    }

    ...but that's also kludgy ... applications might use a class like this to
    improve the memory footprint, but then they might add functionality later
    that actually needs the strings (something in the contrib section for
    example, or function queries that look at the length of the word, etc...)
    and they'll get a really ugly null pointer exception.



    -Hoss
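
The failure mode described above can be reproduced without Lucene. In this minimal sketch, OrderOnlyIndex is a hypothetical stand-in for a StringIndex built with a null lookup array: sorting by ordinal keeps working, but any caller that asks for the term itself hits a NullPointerException. The valueOrFail method is an assumed variation, not something proposed in the thread, showing how a descriptive exception would make the cause easier to spot.

```java
class OrderOnlyIndex {
    final int[] order;     // per-document ordinals, still available
    final String[] lookup; // null when the term values were dropped to save memory

    OrderOnlyIndex(int[] order, String[] lookup) {
        this.order = order;
        this.lookup = lookup;
    }

    /** Ordinal comparison never touches lookup, so it works either way. */
    int compareDocs(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    /** Throws NullPointerException when lookup was dropped. */
    String value(int doc) {
        return lookup[order[doc]];
    }

    /** Assumed alternative: fail with a message that names the cause. */
    String valueOrFail(int doc) {
        if (lookup == null) {
            throw new IllegalStateException("term values were not cached for this field");
        }
        return lookup[order[doc]];
    }

    public static void main(String[] args) {
        OrderOnlyIndex part = new OrderOnlyIndex(new int[] {2, 1, 2, 0}, null);
        System.out.println(part.compareDocs(3, 1) < 0); // sorting still works
        try {
            part.value(0);
        } catch (NullPointerException e) {
            System.out.println("value() failed: lookup is null");
        }
    }
}
```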



Discussion Overview
group: java-user
categories: lucene
posted: Nov 10, '05 at 1:36a
active: Nov 11, '05 at 12:12a
posts: 6
users: 4
website: lucene.apache.org
