  • Robert Griesemer at Oct 11, 2012 at 4:38 pm
    Dmitry;

    Can you please give this CL a LGTM if you're ok with it now? That way I can
    submit it. Thanks.

    https://codereview.appspot.com/6643053/

    - Robert
  • Dmitry Vyukov at Oct 11, 2012 at 8:01 pm
    Lgtm
    On Oct 11, 2012 8:38 PM, "Robert Griesemer" wrote:

  • Gri at Oct 11, 2012 at 11:04 pm
    *** Submitted as
    http://code.google.com/p/go/source/detail?r=57e13fc87f43 ***

    math/big: more conservative use of lock for divisor table

    Minor performance impact running sequentially:

    benchmark old ns/op new ns/op delta
    BenchmarkString10Base2 389 391 +0.51%
    BenchmarkString100Base2 1530 1534 +0.26%
    BenchmarkString1000Base2 11789 11787 -0.02%
    BenchmarkString10000Base2 111443 112030 +0.53%
    BenchmarkString100000Base2 1017483 1015347 -0.21%
    BenchmarkString10Base8 339 344 +1.47%
    BenchmarkString100Base8 753 756 +0.40%
    BenchmarkString1000Base8 4618 4641 +0.50%
    BenchmarkString10000Base8 43217 43534 +0.73%
    BenchmarkString100000Base8 397518 400602 +0.78%
    BenchmarkString10Base10 630 630 +0.00%
    BenchmarkString100Base10 1975 1960 -0.76%
    BenchmarkString1000Base10 10179 10174 -0.05%
    BenchmarkString10000Base10 44527 44416 -0.25%
    BenchmarkString100000Base10 14404694 14425308 +0.14%
    BenchmarkString10Base16 283 288 +1.77%
    BenchmarkString100Base16 597 598 +0.17%
    BenchmarkString1000Base16 3189 3186 -0.09%
    BenchmarkString10000Base16 29403 29364 -0.13%
    BenchmarkString100000Base16 265657 265587 -0.03%

    Note that due to other improvements (faster assembly routines,
    better code generation by compiler), these benchmarks now run
    up to 37% faster than they used to at the last time measured (1/9/2012).

    Minor performance impact for StringPiParallel running in parallel:

    Current CL but with Lock/Unlock commented out (removed):

    BenchmarkStringPiParallel 5000 343581 ns/op
    BenchmarkStringPiParallel-2 10000 184511 ns/op
    BenchmarkStringPiParallel-3 10000 129768 ns/op
    BenchmarkStringPiParallel-4 10000 102326 ns/op

    Current CL:

    BenchmarkStringPiParallel 5000 345169 ns/op
    BenchmarkStringPiParallel-2 10000 185827 ns/op
    BenchmarkStringPiParallel-3 10000 131168 ns/op
    BenchmarkStringPiParallel-4 10000 102353 ns/op

    Fixes issue 4218.

    R=dvyukov, michael.jones, dave
    CC=golang-dev
    http://codereview.appspot.com/6643053


    http://codereview.appspot.com/6643053/
  • Michael Jones at Oct 13, 2012 at 5:46 pm
    Just seeing this now (email was to a mostly unused personal account rather
    than my work account.)

    The blocking and concurrent-waiting on data around this structure is by
    design.

    To run fast, when there are two or more large numbers to be converted from
    internal base 2^n to an external base that is not a power of two (say base
    10), we need a table of divisors. For a large number N (say a million
    "words"), we need N**(1/2), N**(1/4), N**(1/8), ... down to a "small" size,
    say 8 words, where "**" is the exponentiation operator. This table can be
    big; however, it can be built just once and reused many times. The code
    does this now.

    When a first big (bigger than the existing table size) conversion is
    needed, the code updates the table. When more than one goroutine tries the
    same large conversion at the same time, then the first one builds the table
    and the rest of them wait for each entry to be built. Building them
    redundantly would be no faster and wastes memory and CPU. Once the table
    grows suitably, then this phase is a no-op as the table is cached.

    While we might wish that the table were already built, or assume that in a
    real program a first operation builds the table without other threads
    waiting, in the benchmark case having all goroutines convert the same
    number causes delays for the first iterations.

    I see no better way.

    Michael
    --
    Michael T. Jones
    michael.jones@gmail.com
    http://www.google.com/profiles/michael.jones
  • Michael Jones at Oct 13, 2012 at 5:46 pm
    I said that wrong (just woke up). It's not so much N**(1/2), N**(1/4), ...
    as it is the other way around: the largest power of 10 (say) that fits in
    a Word, and then that value **2, **4, **8, and so on until we get close to
    sqrt(N). The essence is the same.
  • Dmitry Vyukov at Oct 14, 2012 at 9:12 am

    On Sat, Oct 13, 2012 at 6:50 PM, Michael Jones wrote:


    There is only one problem: the code does something different, which leads
    to data races, crashes, and exceptions.
    I suggested to Robert that he make the code work the way you describe.


  • Michael Jones at Oct 14, 2012 at 12:51 pm
    Thanks! Performance-wise, it is important that other goroutines be able to
    use the read-only smaller entries while a larger one is being created.
    On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov wrote:

    I suggested to Robert to make the code work the way you describe.



    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765
  • Dmitry Vyukov at Oct 15, 2012 at 10:37 am
    It is not the case now.

    On Sun, Oct 14, 2012 at 4:45 PM, Michael Jones wrote:

    Thanks! Performance wise, it is important that other goroutines be able to
    use the read-only smaller entries while a larger one is being created.

  • Michael Jones at Oct 15, 2012 at 8:59 pm
    Ok. Will look. Thank you for bringing this to my attention!
    On Mon, Oct 15, 2012 at 3:31 AM, Dmitry Vyukov wrote:

    It is not the case now.

  • Robert Griesemer at Oct 15, 2012 at 9:35 pm
    The only safe way I see at the moment to do this is to lock/unlock the
    mutex for each new table entry. That's not hard, but I don't know if it's
    worth it. I am worried about the correctness of lock-free approaches
    (using package atomic) that operate on slice elements, but I'm happy to be
    proven otherwise. Either way, I believe what we have now is correct, if
    possibly not optimal performance-wise. But again, I don't think there's
    any immediate urgency to fix this, as I am not aware of programs that care
    much about it at the moment.
    - gri

    On Mon, Oct 15, 2012 at 1:51 PM, Michael Jones wrote:

    Ok. Will look. Thank you for bringing this to my attention!

  • Michael Jones at Oct 15, 2012 at 10:29 pm
    OK.
    On Oct 15, 2012 2:35 PM, "Robert Griesemer" wrote:


Discussion Overview
group: golang-dev
categories: go
posted: Oct 11, '12 at 4:26a
active: Oct 15, '12 at 10:29p
posts: 12
users: 4
website: golang.org
