One Byte is Seven bits too many? - A Design suggestion

Hi,

The norm takes up 1 byte of storage per document per field. While this may seem
very small, a simple calculation shows that the IndexSearcher can consume a lot of
memory when it caches the norms. Further, the current implementation loads the
norms into memory as soon as a segment is loaded. Here are the calculations:

For medium-sized archives:
docs=40 million, fields=2  => 80 MB memory
docs=40 million, fields=10 => 400 MB memory
docs=40 million, fields=20 => 800 MB memory

For larger archives:

docs=400 million, fields=2  => 800 MB memory
docs=400 million, fields=10 => ~4 GB memory
docs=400 million, fields=20 => ~8 GB memory
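The figures above follow directly from one byte per document per field; a quick
sketch of the arithmetic (the class and method names here are just illustrative):

```java
public class NormMemory {
    // Norms cost one byte per document per field, so the cached
    // size is simply docs * fields bytes.
    static long normBytes(long docs, int fields) {
        return docs * fields;
    }

    public static void main(String[] args) {
        // 40M docs, 2 fields  => 80 MB (decimal megabytes, as in the tables)
        System.out.println(normBytes(40_000_000L, 2) / 1_000_000 + " MB");
        // 400M docs, 20 fields => 8000 MB, i.e. ~8 GB
        System.out.println(normBytes(400_000_000L, 20) / 1_000_000 + " MB");
    }
}
```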


To further compound the issue, we have found that JVM performance drops as the
amount of memory it manages increases.

While the storage itself may not be a concern, the runtime memory requirement can
use some optimization, especially for a large number of fields.
Fields typically fall into one of three categories:

(a) Tokenized fields with huge variance in the number of tokens,
example - HTML page body, mail body, etc.
(b) Tokenized fields with very little variance in the number of tokens,
example - HTML page title, mail subject, etc.
(c) Fields with a fixed number of tokens,
example - department, city, state, etc.


The one-byte norm is well suited to (a), but not to (b) or (c). In typical
usage, growth in the number of fields can be attributed to (b) and (c).

Two solutions come to mind:

(1) If there is forethought in the field design, one can prefix the field's tokens
and thereby reduce the number of fields. Of course, this adds the overhead of an
embarrassing explanation to every query writer of why a prefix must be added to
every token. If the prefixing can be done underneath, it may work, but it is still
not elegant.

(2) The norm for (c) takes only two values: 0 when the field is not present,
and a fixed value when it is. In this scenario, the memory requirement
is only one bit per doc per field. I would argue that even for (b) one can
approximate the value with one bit and not lose much in the ranking of documents.
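A minimal sketch of the one-bit idea, assuming a hypothetical store that keeps a
presence bit per document and decodes it back to the fixed norm byte at read time
(the class and method names are illustrative, not existing Lucene APIs):

```java
import java.util.BitSet;

// Hypothetical one-bit norm store for a fixed-length field:
// one presence bit per document instead of one byte.
public class OneBitNorms {
    private final BitSet present;  // bit set => the field is present in the doc
    private final byte fixedNorm;  // the single norm value this field can take

    public OneBitNorms(BitSet present, byte fixedNorm) {
        this.present = present;
        this.fixedNorm = fixedNorm;
    }

    // Same contract as a byte[] lookup, at roughly 1/8th the memory.
    public byte norm(int doc) {
        return present.get(doc) ? fixedNorm : 0;
    }
}
```

The decode step costs one branch per lookup, but the per-field cache shrinks from
one byte to one bit per document.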


Several implementation options are possible:

(a) We can implement the approximation at the time of writing the index (backward
compatibility has to be considered).
(b) Use a bitset instead of a byte array for search purposes. I have been
wanting to do this for the last 6 months, but have not found the time yet. If I do,
I will submit an implementation.


Also, if a field is mandatory, then the 0 case never occurs, and in that situation
we can use a single constant to represent the whole array. Maybe one byte is 8 bits
too many :-))
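For that mandatory-field case even the presence bit disappears; a sketch of the
zero-bit variant (again hypothetical, not a Lucene API):

```java
// When a fixed-length field is mandatory, every document has the same
// norm, so no per-document storage is needed at all: a single constant
// stands in for the whole norms array.
public class ConstantNorms {
    private final byte norm;

    public ConstantNorms(byte norm) {
        this.norm = norm;
    }

    // Same value for every document, regardless of doc id.
    public byte norm(int doc) {
        return norm;
    }
}
```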


Arvind.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Robert Engels at May 23, 2005 at 12:39 am
I have always thought that the norms should be an interface rather than
fixed, as there are many uses of Lucene where norms are not necessary and
the memory overhead is substantial.

    -----Original Message-----
    From: Arvind Srinivasan
    Sent: Sunday, May 22, 2005 7:05 PM
    To: java-dev@lucene.apache.org
    Subject: One Byte is Seven bits too many? - A Design suggestion


  • Doug Cutting at May 23, 2005 at 5:56 pm

    Robert Engels wrote:
    I have always thought that the norms should be an interface, rather than
    fixed, as there are many uses of lucene where norms are not necessary, and
    the memory overhead is substantial.
    I agree, but that's not the whole story.

    If one seeks merely to avoid caching the norms in RAM, then all that is
    required is a Scorer that does not access the norms. Thus an
    implementation of the constant scoring query proposal (discussed in
    another thread) would suffice to save memory at search time.

    Long term, I would like Lucene to have an extensible index format. One
    should be able to have different fields use different representations,
    e.g., without norms, without positions, with a boost per position, with
    alternate compression algorithms, etc. (This is item 11 on
    http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard, which I doubt
    will be a part of Lucene 2.0.)

    Doug

  • Paul Elschot at May 23, 2005 at 6:52 am

    On Monday 23 May 2005 02:04, Arvind Srinivasan wrote:
    The one byte usage is very applicable for (a) and not for (b) or (c). In typical
    usage, field increases can be attributed to (b) and (c).
(c) would also be a nice fit for the recently discussed constant scoring
queries. For (b), the relative variance, and its influence on the score,
is still high. Perhaps a mixed form with a minimum field length in a single
bit could be considered there, but addressing that might be costly.

    Regards,
    Paul Elschot.






Discussion Overview
group: java-dev @ lucene.apache.org
category: lucene
posted: May 23, '05 at 12:07a
active: May 23, '05 at 5:56p
posts: 4
users: 4