Hello

I would like to ask the community for advice:
what is the best way to index and search java.math.BigDecimal values in
Lucene 2.4?

Any code snippets are welcome.

Sergey Kabashnyuk
eXo Platform SAS

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Ian Lea at Nov 20, 2008 at 2:20 pm
    Hi


    Lucene only indexes strings. The standard advice for numeric values is
    to pad them to the desired width with leading zeros if they are likely
    to be used in range searches. How varied are the numbers you're going
    to be working with? I only work with values that have 2 decimal places,
    and I tend to drop the decimal point, e.g.

    2.22 would be indexed as 000222
    0.99 ... 000099

    And of course go through the same conversion when searching.
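
A minimal sketch of the padding approach described above, assuming non-negative values with at most two decimal places and a field width of six characters. The class name `PaddedDecimal` and the constants `SCALE` and `WIDTH` are hypothetical names chosen for illustration; nothing here is part of Lucene.

```java
import java.math.BigDecimal;

final class PaddedDecimal {
    static final int SCALE = 2; // assumed: two decimal places, as in the examples
    static final int WIDTH = 6; // assumed: six characters, as in the examples

    /** Drops the decimal point and left-pads with zeros to WIDTH characters. */
    static String encode(BigDecimal value) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("negative values are not handled by this sketch");
        }
        // toBigIntegerExact() throws ArithmeticException if the value has
        // more than SCALE decimal places.
        String digits = value.movePointRight(SCALE).toBigIntegerExact().toString();
        if (digits.length() > WIDTH) {
            throw new IllegalArgumentException("value does not fit into " + WIDTH + " characters");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(new BigDecimal("2.22"))); // 000222
        System.out.println(encode(new BigDecimal("0.99"))); // 000099
    }
}
```

The same `encode` call would be applied to the bounds of a range query, so that indexed terms and query terms share one representation.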


    But if you've got variable numbers of decimal places it might get more
    interesting.


    --
    Ian.

  • Sergey Kabashnyuk at Nov 20, 2008 at 2:31 pm
    Thanks Ian

    Unfortunately, I have to index arbitrary java.math.BigDecimal values.
    Let me rephrase my question:

    How can I convert java.math.BigDecimal numbers into strings
    so that they can be stored in lexicographical order?

    Sergey Kabashnyuk
    eXo Platform SAS
  • Michael Ludwig at Nov 20, 2008 at 8:06 pm

    Sergey Kabashnyuk wrote:
    > Unfortunately, I have to index arbitrary java.math.BigDecimal values.

    Hi Sergey,

    quite a lot of numbers are possible for BigDecimal. Somehow the range
    must be bounded.

    Let's first draw the line at the point where, for a given BigDecimal bd,
    the result of bd.toString(), which since Java 1.5 returns a "standard
    canonical string form", can no longer be fed back into the String
    constructor of BigDecimal. When reconstruction fails, the value is out
    of range for you.

    ### 9.999E2147483647 still works
    9.999E+2147483647 - toString()
    99.99E+2147483646 - toEngineeringString()
    Reconstruction via toString(): works
    Reconstruction via toEngineeringString(): works

    ### 10.001E2147483647 is too big and does not work
    1.0001E+2147483648 - toString()
    100.01E+2147483646 - toEngineeringString()
    Reconstruction via toString(): NumberFormatException
    Reconstruction via toEngineeringString(): works
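
The round-trip test described above can be sketched as follows. The class `RoundTripCheck` and the method `roundTrips` are hypothetical names; the two boundary values are built from an unscaled value and a scale so that construction itself does not depend on string parsing. No expected output is given for the boundary values, because whether the second one parses may depend on the JDK version in use.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class RoundTripCheck {
    /** Returns true if bd.toString() can be parsed back into an equal BigDecimal. */
    static boolean roundTrips(BigDecimal bd) {
        try {
            return new BigDecimal(bd.toString()).equals(bd);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 9999 * 10^2147483644, i.e. 9.999E+2147483647
        BigDecimal inRange = new BigDecimal(BigInteger.valueOf(9999), -2147483644);
        // 10001 * 10^2147483644, i.e. 1.0001E+2147483648
        BigDecimal tooBig = new BigDecimal(BigInteger.valueOf(10001), -2147483644);
        System.out.println(inRange + " round-trips: " + roundTrips(inRange));
        System.out.println(tooBig + " round-trips: " + roundTrips(tooBig));
    }
}
```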

    Next, unlimited precision is a problem. Do you need a precision of two
    billion digits? Probably not. De facto, precision is constrained by
    available memory. So you see that you must rephrase your requirement to
    accommodate real-world conditions.

    > I can rephrase my question this way:
    > How can I convert java.math.BigDecimal numbers into strings
    > so that they can be stored in lexicographical order?

    I assume what you mean is formatting the number so that the
    lexicographical order of any possible sequence of acceptable numbers
    is the same as its numerical order.

    You must find a canonical representation like the scientific notation
    and then tweak it as follows:

    * "N" for negative and "P" for positive numbers ("N" sorts before "P")
    * fixed-width zero-padded exponent first, like "E0000000003", base 10
    * one digit with marker, like "N2"
    * fixed-width zero-padded decimals with marker, like "D008000000000"

    For example, 2008 becomes "PE0000000003N2D008000000000". YMMV, of course.

    I hope this helps.

    Michael Ludwig

  • Michael Ludwig at Nov 21, 2008 at 10:21 am

    The notation I proposed falls short of the goal of making lexicographical
    order coincide with numerical order. First, negative numbers won't sort
    in ascending order; -1 will come before -2. Second, negative exponents
    aren't accounted for at all. Third, there are probably other problems.

    Take a look at Steven Rowe's post in this thread for better
    thought-out ideas.

    Michael Ludwig

  • Yonik Seeley at Nov 22, 2008 at 9:02 pm

    On Thu, Nov 20, 2008 at 9:30 AM, Sergey Kabashnyuk wrote:
    > Unfortunately, I have to index arbitrary java.math.BigDecimal values.
    > Let me rephrase my question:
    >
    > How can I convert java.math.BigDecimal numbers into strings
    > so that they can be stored in lexicographical order?

    Some early work I did in Solr handles this for integers (it's pretty
    much unused code now though). The format was designed to support
    decimals also, but I never got around to doing the code.

    See BCDIntField, and BCDUtils, specifically
    BCDUtils.base10toBase10kSortableInt

    -Yonik

  • Steven A Rowe at Nov 21, 2008 at 12:54 am
    Hi Sergey,
    On 11/20/2008 at 9:30 AM, Sergey Kabashnyuk wrote:
    > How can I convert java.math.BigDecimal numbers into strings
    > so that they can be stored in lexicographical order?

    Here's a thoroughly untested idea, cribbing some from o.a.l.document.NumberTools[1]: convert BigDecimals into strings of the following form:

    <significand-sign> <exponent-sign> <exponent> <significand>

    As in NumberTools, the signs consist of either the '-' or the '0' character; '-' < '0'.

    The exponent must be fixed length, and serialized as in NumberTools, with left-zero-padding and using the negative inversion trick. The exponent could be expressed in any base that will fit into Java's 16-bit char - Lucene's NumberTools uses base 36; see Solr's NumberUtils[2] or LUCENE-1434[3] for base 0x8000 implementations.

    The exponent can be calculated from the number of digits in the serialized base 10 form of the significand (BigDecimal's "unscaled value") and the "scale" (the number of digits after the decimal point): exponent = (number of significand digits) - scale - 1.
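
The exponent formula above can be checked with a few lines of Java. This is only a sketch: `ExponentDemo` and `exponentOf` are hypothetical names, and `BigDecimal.precision()` is used as a shortcut for the number of decimal digits in the unscaled value.

```java
import java.math.BigDecimal;

final class ExponentDemo {
    /** exponent = (number of significand digits) - scale - 1 */
    static long exponentOf(BigDecimal bd) {
        // precision() is the number of decimal digits in the unscaled value;
        // the arithmetic is done in long to avoid int overflow for extreme scales.
        return (long) bd.precision() - bd.scale() - 1;
    }

    public static void main(String[] args) {
        System.out.println(exponentOf(new BigDecimal("8765.4"))); // 3
        System.out.println(exponentOf(new BigDecimal("730")));    // 2
        System.out.println(exponentOf(new BigDecimal("0.05")));   // -2
    }
}
```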

    The significand field can be variable length, though it can't contain any left-zero-padding, and could be expressed in any base; again, see [2] or [3].

    Some examples (base 10 and 4-char-width exponent used for purposes of exposition), in sorted order:

    +5.E-3 => 0 - 9996 5
    +1.E-2 => 0 - 9997 1
    +1.0E-2 => 0 - 9997 10
    +1.0000E-2 => 0 - 9997 10000
    +1.1E-2 => 0 - 9997 11
    +1.11E-2 => 0 - 9997 111
    +1.2E-2 => 0 - 9997 12
    +5.E-2 => 0 - 9997 5
    +7.3E+2 => 0 0 0002 73
    +7.4E+2 => 0 0 0002 74
    +7.45E+2 => 0 0 0002 745
    +8.7654E+3 => 0 0 0003 87654

    Negative numbers are a problem for the significand, though, since NumberTools' negative inversion trick assumes a fixed-precision minimum value - you'd need to use a different technique here in order to enable variable length significands.

    Another entirely untested idea, to handle variable length negative significands: substitute (base - digit - 1) for each digit of the serialized representation (e.g. in base 10, 4 => 5 & 0 => 9), then append a sentinel digit that is greater than all other digits used to represent the significand. The format for negative BigDecimals, then, would be:

    '-' <reversed-exponent-sign> <negated-exponent> <significand> <sentinel>

    where the exponent and its sign are negated before serialization, so that their sense is reversed.

    Some negative examples (base 10 and 4-char-width exponent used for purposes of exposition), in sorted order:

    -8.7654E+3 => - - 9996 12345 A
    -7.45E+2 => - - 9997 254 A
    -7.4E+2 => - - 9997 25 A
    -7.3E+2 => - - 9997 26 A
    -5.E-2 => - 0 0002 4 A
    -1.2E-2 => - 0 0002 87 A
    -1.11E-2 => - 0 0002 888 A
    -1.1E-2 => - 0 0002 88 A
    -1.0000E-2 => - 0 0002 89999 A
    -1.0E-2 => - 0 0002 89 A
    -1.E-2 => - 0 0002 8 A
    -5.E-3 => - 0 0003 4 A

    The use of the sentinel digit 'A', which is greater than all of the other digits [0-9], ensures that negative values with greater precision are ordered before those that share significand prefixes but have lesser precision.
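
The scheme for positive and negative values can be put together roughly as follows. This is a sketch under the same exposition assumptions as the examples (base 10, 4-character exponent field, sentinel 'A'), with the spaces between fields omitted. `SortableBigDecimal`, `encode` and `exponentField` are hypothetical names; negative exponents are stored as 9999 + exponent, which is what the example tables show. The post does not say how to encode zero, so the sketch rejects it.

```java
import java.math.BigDecimal;

final class SortableBigDecimal {
    static final int EXP_WIDTH = 4;   // exposition width only; real use needs a wider field
    static final long EXP_MAX = 9999; // 10^EXP_WIDTH - 1

    static String encode(BigDecimal bd) {
        int sign = bd.signum();
        if (sign == 0) {
            // Assumption: zero is outside the scheme as described.
            throw new IllegalArgumentException("zero is not covered by this sketch");
        }
        String digits = bd.unscaledValue().abs().toString();
        long exponent = (long) digits.length() - bd.scale() - 1;
        if (sign > 0) {
            return "0" + exponentField(exponent) + digits;
        }
        // Negative value: negate the exponent, complement each digit (9 - digit),
        // then append the sentinel 'A', which sorts after '0'..'9'.
        StringBuilder sb = new StringBuilder("-").append(exponentField(-exponent));
        for (int i = 0; i < digits.length(); i++) {
            sb.append((char) ('0' + ('9' - digits.charAt(i))));
        }
        return sb.append('A').toString();
    }

    /** Sign character ('-' or '0') followed by a zero-padded, fixed-width exponent. */
    static String exponentField(long e) {
        if (Math.abs(e) > EXP_MAX) {
            throw new IllegalArgumentException("exponent does not fit into " + EXP_WIDTH + " digits");
        }
        String s = Long.toString(e >= 0 ? e : EXP_MAX + e);
        StringBuilder sb = new StringBuilder().append(e >= 0 ? '0' : '-');
        for (int i = s.length(); i < EXP_WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        String[] inputs = {"-8.7654E+3", "-7.45E+2", "-5E-3", "5E-3", "1.0E-2", "8.7654E+3"};
        for (String in : inputs) {
            System.out.println(in + " => " + encode(new BigDecimal(in)));
        }
    }
}
```

With these assumptions, `encode(new BigDecimal("-7.45E+2"))` yields `--9997254A` and `encode(new BigDecimal("8.7654E+3"))` yields `00000387654`, matching the rows in the tables above once the spaces are removed.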

    Although BigDecimal claims to support arbitrary precision, its scale representation is an int (32 bits), and its significand ("unscaled value") can have at most Integer.MAX_VALUE bits (cf. java.math.BigInteger.bitLength():int[4]). If the exponent field is made wide enough to express all long values (64 bits), I *think* this scheme can handle all BigDecimal values.

    Steve

    [1] o.a.l.document.NumberTools: <http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_4_0/src/java/org/apache/lucene/document/NumberTools.java?view=markup>
    [2] o.a.s.util.NumberUtils: <http://svn.apache.org/viewvc/lucene/solr/tags/release-1.3.0/src/java/org/apache/solr/util/NumberUtils.java?view=markup>
    [3] IndexableBinaryStringTools JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1434
    [4] BigInteger.bitLength(): <http://java.sun.com/j2se/1.4.2/docs/api/java/math/BigInteger.html#bitLength()>
