Reviewers: dvyukov, mtj,

Message:
Hello dvyukov@google.com, michael.jones@gmail.com (cc:
golang-dev@googlegroups.com),

I'd like you to review this change to
https://code.google.com/p/go


Description:
math/big: more conservative use of lock for divisor table

Fixes issue 4218.

Please review this at http://codereview.appspot.com/6643053/

Affected files:
M src/pkg/math/big/nat.go
M src/pkg/math/big/nat_test.go


Index: src/pkg/math/big/nat.go
===================================================================
--- a/src/pkg/math/big/nat.go
+++ b/src/pkg/math/big/nat.go
@@ -920,8 +920,10 @@
ndigits int // digit length of divisor in terms of output base digits
}

-var cacheBase10 [64]divisor // cached divisors for base 10
-var cacheLock sync.Mutex // protects cacheBase10
+var cacheBase10 struct {
+ table [64]divisor // cached divisors for base 10
+ mu sync.Mutex // protects table
+}

// expWW computes x**y
func (z nat) expWW(x, y Word) nat {
@@ -937,27 +939,22 @@

// determine k where (bb**leafSize)**(2**k) >= sqrt(x)
k := 1
- for words := leafSize; words < m>>1 && k < len(cacheBase10); words <<= 1 {
+ for words := leafSize; words < m>>1 && k < len(cacheBase10.table); words <<= 1 {
k++
}

- // create new table of divisors or extend and reuse existing table as appropriate
- var table []divisor
- var cached bool
- switch b {
- case 10:
- table = cacheBase10[0:k] // reuse old table for this conversion
- cached = true
- default:
- table = make([]divisor, k) // new table for this conversion
+ // reuse and extend existing table of divisors or create new table as appropriate
+ var table []divisor // for b == 10, table overlaps with cacheBase10.table
+ if b == 10 {
+ cacheBase10.mu.Lock()
+ defer cacheBase10.mu.Unlock()
+ table = cacheBase10.table[0:k] // reuse old table for this conversion
+ } else {
+ table = make([]divisor, k) // create new table for this conversion
}

// extend table
if table[k-1].ndigits == 0 {
- if cached {
- cacheLock.Lock() // begin critical section
- }
-
// add new entries as needed
var larger nat
for i := 0; i < k; i++ {
@@ -980,10 +977,6 @@
table[i].nbits = table[i].bbb.bitLen()
}
}
-
- if cached {
- cacheLock.Unlock() // end critical section
- }
}

return table
Index: src/pkg/math/big/nat_test.go
===================================================================
--- a/src/pkg/math/big/nat_test.go
+++ b/src/pkg/math/big/nat_test.go
@@ -409,6 +409,20 @@
}
}

+func TestScanPiParallel(t *testing.T) {
+ const n = 2
+ c := make(chan bool)
+ for i := 0; i < n; i++ {
+ go func() {
+ TestScanPi(t)
+ c <- true
+ }()
+ }
+ for i := 0; i < n; i++ {
+ <-c
+ }
+}
+
func BenchmarkScanPi(b *testing.B) {
for i := 0; i < b.N; i++ {
var x nat
@@ -516,7 +530,7 @@
func LeafSizeHelper(b *testing.B, base Word, size int) {
b.StopTimer()
originalLeafSize := leafSize
- resetTable(cacheBase10[:])
+ resetTable(cacheBase10.table[:])
leafSize = size
b.StartTimer()

@@ -533,7 +547,7 @@
}

b.StopTimer()
- resetTable(cacheBase10[:])
+ resetTable(cacheBase10.table[:])
leafSize = originalLeafSize
b.StartTimer()
}
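
The core of the fix above is a common Go pattern: bundle a shared cache with the mutex that guards it in a single struct, and take the lock only on the path that touches the shared state. A minimal, self-contained sketch of that pattern, using a simplified divisor type and a stand-in computation rather than math/big's real code:

```go
package main

import (
	"fmt"
	"sync"
)

// divisor is a simplified stand-in for the real struct in nat.go.
type divisor struct {
	ndigits int // zero means "entry not yet computed"
}

// As in the CL: the cached table and the mutex protecting it live in
// one struct, so the association between lock and data is explicit.
var cacheBase10 struct {
	mu    sync.Mutex
	table [64]divisor
}

// divisors returns a k-entry table. For base 10 it reuses the shared
// cache under the lock; any other base gets a private table and no lock.
func divisors(k, base int) []divisor {
	var table []divisor
	if base == 10 {
		cacheBase10.mu.Lock()
		defer cacheBase10.mu.Unlock()
		table = cacheBase10.table[:k]
	} else {
		table = make([]divisor, k)
	}
	if table[k-1].ndigits == 0 { // extend table if incomplete
		for i := range table {
			if table[i].ndigits == 0 {
				table[i].ndigits = i + 1 // stand-in for the real work
			}
		}
	}
	return table
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = divisors(8, 10) // concurrent base-10 calls are now safe
		}()
	}
	wg.Wait()
	fmt.Println(divisors(8, 10)[7].ndigits) // prints 8
}
```

Holding the lock for the whole extend-and-read, rather than only around the extension as the old code did, is what closes the race: no goroutine can observe a partially written entry.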


  • Dave at Oct 10, 2012 at 12:34 am
  • Gri at Oct 10, 2012 at 1:18 am
    PTAL.


    https://codereview.appspot.com/6643053/diff/3/src/pkg/math/big/nat.go
    File src/pkg/math/big/nat.go (right):

    https://codereview.appspot.com/6643053/diff/3/src/pkg/math/big/nat.go#newcode925
    src/pkg/math/big/nat.go:925: mu sync.Mutex // protects table
    On 2012/10/10 00:11:01, dfc wrote:
    s/mu//
    but of course! :-)

    https://codereview.appspot.com/6643053/
  • Dave at Oct 10, 2012 at 5:31 am
    Thank you. I'm not sure about this, I think the race still exists
    through table to the memory that it shares with cacheBase10.table.

    https://codereview.appspot.com/6643053/
  • Dvyukov at Oct 10, 2012 at 2:27 pm
    I do not see the race now, looks correct. But I am concerned about
    performance implications. On the following benchmark

    func BenchmarkStringPi(b *testing.B) {
    const STR =
    "123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990"
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    _ = x.decimalString()

    procs := runtime.GOMAXPROCS(-1)
    N := int32(b.N)
    c := make(chan bool, procs)
    for p := 0; p < procs; p++ {
    go func() {
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    for atomic.AddInt32(&N, -1) >= 0 {
    _ = x.decimalString()
    }
    c <- true
    }()
    }
    for p := 0; p < procs; p++ {
    <-c
    }
    }

    w/o mutex lock/defer unlock:

    BenchmarkStringPi 200000 4495 ns/op
    BenchmarkStringPi-4 500000 1484 ns/op
    BenchmarkStringPi-8 1000000 971 ns/op
    BenchmarkStringPi-16 1000000 738 ns/op

    with mutex lock/defer unlock:

    BenchmarkStringPi 200000 4708 ns/op
    BenchmarkStringPi-4 500000 1857 ns/op
    BenchmarkStringPi-8 500000 1555 ns/op
    BenchmarkStringPi-16 500000 1633 ns/op


    I would prefer lock-free fast path (but correct).

    https://codereview.appspot.com/6643053/
  • Gri at Oct 10, 2012 at 9:04 pm
    PTAL.

    I am not concerned about the performance. Correctness trumps speed and
    the performance degradation is small (0.5 - 1.5%) for the relevant
    benchmarks (now attached).

    I removed the defer statement for a minor improvement. Note that these
    benchmarks run up to 37% faster than they did at the beginning of the
    year; I suspect due to compiler improvements and due to the manual
    assembly improvements for the core math routines. It's safer to tweak
    those (there's more to get out of them) than trying lock-free
    approaches that are hard to understand.

    On 2012/10/10 14:20:58, dvyukov wrote:
    I do not see the race now, looks correct. But I am concerned about
    performance
    implications. On the following benchmark
    func BenchmarkStringPi(b *testing.B) {
    const STR =
    "123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990"
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    _ = x.decimalString()
    procs := runtime.GOMAXPROCS(-1)
    N := int32(b.N)
    c := make(chan bool, procs)
    for p := 0; p < procs; p++ {
    go func() {
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    for atomic.AddInt32(&N, -1) >= 0 {
    _ = x.decimalString()
    }
    c <- true
    }()
    }
    for p := 0; p < procs; p++ {
    <-c
    }
    }
    w/o mutex lock/defer unlock:
    BenchmarkStringPi 200000 4495 ns/op
    BenchmarkStringPi-4 500000 1484 ns/op
    BenchmarkStringPi-8 1000000 971 ns/op
    BenchmarkStringPi-16 1000000 738 ns/op
    with mutex lock/defer unlock:
    BenchmarkStringPi 200000 4708 ns/op
    BenchmarkStringPi-4 500000 1857 ns/op
    BenchmarkStringPi-8 500000 1555 ns/op
    BenchmarkStringPi-16 500000 1633 ns/op
    I would prefer lock-free fast path (but correct).


    https://codereview.appspot.com/6643053/
  • Dmitry Vyukov at Oct 10, 2012 at 7:07 pm

    On Wed, Oct 10, 2012 at 10:18 PM, wrote:

    PTAL.

    I am not concerned about the performance. Correctness trumps speed and
    the performance degradation is small (0.5 - 1.5%) for the relevant
    benchmarks (now attached).
    They are not parallel.

    I removed the defer statement for a minor improvement. Note that these
    benchmarks run up to 37% faster

    My parallel measurements show a 121% degradation, so I guess it is now
    slower than at the beginning of the year.


    than they used to beginning of the year;
    I suspect due to compiler improvements and due to the manual assembly
    improvements for the core math routines. It's safer to tweak those
    (there's more to get out of them) than trying lock-free approaches that
    are hard to understand.


    On 2012/10/10 14:20:58, dvyukov wrote:

    I do not see the race now, looks correct. But I am concerned about
    performance
    implications. On the following benchmark
    func BenchmarkStringPi(b *testing.B) {
    const STR =
    "123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990"
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    _ = x.decimalString()
    procs := runtime.GOMAXPROCS(-1)
    N := int32(b.N)
    c := make(chan bool, procs)
    for p := 0; p < procs; p++ {
    go func() {
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    for atomic.AddInt32(&N, -1) >= 0 {
    _ = x.decimalString()
    }
    c <- true
    }()
    }
    for p := 0; p < procs; p++ {
    <-c
    }
    }
    w/o mutex lock/defer unlock:

    BenchmarkStringPi 200000 4495 ns/op
    BenchmarkStringPi-4 500000 1484 ns/op
    BenchmarkStringPi-8 1000000 971 ns/op
    BenchmarkStringPi-16 1000000 738 ns/op
    with mutex lock/defer unlock:

    BenchmarkStringPi 200000 4708 ns/op
    BenchmarkStringPi-4 500000 1857 ns/op
    BenchmarkStringPi-8 500000 1555 ns/op
    BenchmarkStringPi-16 500000 1633 ns/op

    I would prefer lock-free fast path (but correct).


    https://codereview.appspot.com/6643053/
  • Robert Griesemer at Oct 10, 2012 at 9:55 pm
    I'm still not concerned. Also, I cannot reproduce the slowdown you are
    reporting: I have added a parallel benchmark and it doesn't show the
    behavior you are seeing (albeit I can only go up to 4 on my machine).
    Perhaps there is contention in your benchmark on the benchmark side due to
    the use of atomic?

    But more importantly: This is about converting internal numbers to a
    human-readable decimal form for human consumption (for machine consumption,
    hex conversion would be more appropriate, way faster, and lock-free).
    Outputting large amounts of huge decimal numbers for human consumption is
    not going to be the bottleneck of a program (if it is, there's probably
    something else wrong or it doesn't matter).

    One approach is to fill in the entire table at initialization time, or at
    first use. I am not comfortable using atomic operations for some of the
    slice accesses - maybe I am wrong but I think it's very tricky to get
    right. Happy to be proven otherwise.

    Again, this is about correctness first and fixing the issue, performance
    2nd; and as far as I can tell, performance is ok. If it's becoming a real
    issue for a real application, we can dive into it more.
    - gri

    On Wed, Oct 10, 2012 at 11:38 AM, Dmitry Vyukov wrote:
    On Wed, Oct 10, 2012 at 10:18 PM, wrote:

    PTAL.

    I am not concerned about the performance. Correctness trumps speed and
    the performance degradation is small (0.5 - 1.5%) for the relevant
    benchmarks (now attached).
    They are not parallel.

    I removed the defer statement for a minor improvement. Note that these
    benchmarks run up to 37% faster

    My parallel measurements show 121% degradation, so I guess now it is
    slower than beginning of the year.


    than they used to beginning of the year;
    I suspect due to compiler improvements and due to the manual assembly
    improvements for the core math routines. It's safer to tweak those
    (there's more to get out of them) than trying lock-free approaches that
    are hard to understand.


    On 2012/10/10 14:20:58, dvyukov wrote:

    I do not see the race now, looks correct. But I am concerned about
    performance
    implications. On the following benchmark
    func BenchmarkStringPi(b *testing.B) {
    const STR =
    "123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990"
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    _ = x.decimalString()
    procs := runtime.GOMAXPROCS(-1)
    N := int32(b.N)
    c := make(chan bool, procs)
    for p := 0; p < procs; p++ {
    go func() {
    x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
    for atomic.AddInt32(&N, -1) >= 0 {
    _ = x.decimalString()
    }
    c <- true
    }()
    }
    for p := 0; p < procs; p++ {
    <-c
    }
    }
    w/o mutex lock/defer unlock:

    BenchmarkStringPi 200000 4495 ns/op
    BenchmarkStringPi-4 500000 1484 ns/op
    BenchmarkStringPi-8 1000000 971 ns/op
    BenchmarkStringPi-16 1000000 738 ns/op
    with mutex lock/defer unlock:

    BenchmarkStringPi 200000 4708 ns/op
    BenchmarkStringPi-4 500000 1857 ns/op
    BenchmarkStringPi-8 500000 1555 ns/op
    BenchmarkStringPi-16 500000 1633 ns/op

    I would prefer lock-free fast path (but correct).


    https://codereview.appspot.com/6643053/
  • Gri at Oct 10, 2012 at 9:53 pm
    Hello dvyukov@google.com, michael.jones@gmail.com, dave@cheney.net (cc:
    golang-dev@googlegroups.com),

    Please take another look.


    http://codereview.appspot.com/6643053/
  • Dvyukov at Oct 11, 2012 at 4:26 am
  • Robert Griesemer at Oct 11, 2012 at 4:38 pm
    Dmitry;

    Can you please give this CL a LGTM if you're ok with it now? That way I can
    submit it. Thanks.

    https://codereview.appspot.com/6643053/

    - Robert
  • Dmitry Vyukov at Oct 11, 2012 at 8:01 pm
    Lgtm
    On Oct 11, 2012 8:38 PM, "Robert Griesemer" wrote:

    Dmitry;

    Can you please give this CL a LGTM if you're ok with it now? That way I
    can submit it. Thanks.

    https://codereview.appspot.com/6643053/

    - Robert
  • Gri at Oct 11, 2012 at 11:04 pm
    *** Submitted as
    http://code.google.com/p/go/source/detail?r=57e13fc87f43 ***

    math/big: more conservative use of lock for divisor table

    Minor performance impact running sequentially:

    benchmark old ns/op new ns/op delta
    BenchmarkString10Base2 389 391 +0.51%
    BenchmarkString100Base2 1530 1534 +0.26%
    BenchmarkString1000Base2 11789 11787 -0.02%
    BenchmarkString10000Base2 111443 112030 +0.53%
    BenchmarkString100000Base2 1017483 1015347 -0.21%
    BenchmarkString10Base8 339 344 +1.47%
    BenchmarkString100Base8 753 756 +0.40%
    BenchmarkString1000Base8 4618 4641 +0.50%
    BenchmarkString10000Base8 43217 43534 +0.73%
    BenchmarkString100000Base8 397518 400602 +0.78%
    BenchmarkString10Base10 630 630 +0.00%
    BenchmarkString100Base10 1975 1960 -0.76%
    BenchmarkString1000Base10 10179 10174 -0.05%
    BenchmarkString10000Base10 44527 44416 -0.25%
    BenchmarkString100000Base10 14404694 14425308 +0.14%
    BenchmarkString10Base16 283 288 +1.77%
    BenchmarkString100Base16 597 598 +0.17%
    BenchmarkString1000Base16 3189 3186 -0.09%
    BenchmarkString10000Base16 29403 29364 -0.13%
    BenchmarkString100000Base16 265657 265587 -0.03%

    Note that due to other improvements (faster assembly routines,
    better code generation by compiler), these benchmarks now run
    up to 37% faster than they used to at the last time measured (1/9/2012).

    Minor performance impact for StringPiParallel running in parallel:

    Current CL but with Lock/Unlock commented out (removed):

    BenchmarkStringPiParallel 5000 343581 ns/op
    BenchmarkStringPiParallel-2 10000 184511 ns/op
    BenchmarkStringPiParallel-3 10000 129768 ns/op
    BenchmarkStringPiParallel-4 10000 102326 ns/op

    Current CL:

    BenchmarkStringPiParallel 5000 345169 ns/op
    BenchmarkStringPiParallel-2 10000 185827 ns/op
    BenchmarkStringPiParallel-3 10000 131168 ns/op
    BenchmarkStringPiParallel-4 10000 102353 ns/op

    Fixes issue 4218.

    R=dvyukov, michael.jones, dave
    CC=golang-dev
    http://codereview.appspot.com/6643053


    http://codereview.appspot.com/6643053/
  • Michael Jones at Oct 13, 2012 at 5:46 pm
    Just seeing this now (email was to a mostly unused personal account rather
    than my work account.)

    The blocking and concurrent-waiting on data around this structure is by
    design.

    To run fast when there are two or more large numbers to be converted
    from internal base 2^n to an external base that is not a power of two
    (say base 10), we need a table of divisors. For a (say) million-word
    number N, we need N**(1/2), N**(1/4), N**(1/8), ... down to a "small"
    size, say 8 words, where "**" is the exponentiation operator. This
    table can be big; however, it can be built just once and reused many
    times. The code does this now.

    When a first big (bigger than the existing table size) conversion is
    needed, the code updates the table. When more than one goroutine tries the
    same large conversion at the same time, then the first one builds the table
    and the rest of them wait for each entry to be built. Building them
    redundantly would be no faster and wastes memory and CPU. Once the table
    grows suitably, then this phase is a no-op as the table is cached.

    While we could wish that the table was already built, or that as in a real
    program we might assume that a first operation builds the table without
    other threads waiting, in the benchmark case of all goroutines converting
    the same number will cause delays for the first iterations.

    I see no better way.

    Michael
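
    The table described above (in its corrected form from the follow-up:
    seed with the largest power of the output base that fits in a Word,
    then square repeatedly up toward sqrt(N)) can be sketched with
    math/big. The 10^19 seed and the bit-length cutoff are illustrative
    assumptions, not the exact constants in nat.go:

```go
package main

import (
	"fmt"
	"math/big"
)

// divisorChain builds the chain for base 10: start from 10^19 (the
// largest power of ten that fits in a 64-bit Word) and square
// repeatedly until a divisor's bit length passes half the bit length
// of the number being converted.
func divisorChain(nbits int) []*big.Int {
	d := new(big.Int).Exp(big.NewInt(10), big.NewInt(19), nil)
	var chain []*big.Int
	for d.BitLen() <= nbits/2 {
		chain = append(chain, new(big.Int).Set(d))
		d.Mul(d, d) // square: 10^19 -> 10^38 -> 10^76 -> ...
	}
	return chain
}

func main() {
	// For a ~4096-bit number the chain has six entries, roughly
	// doubling in bit length each step.
	for _, d := range divisorChain(4096) {
		fmt.Println(d.BitLen())
	}
}
```

    Each divisor splits the number into halves for recursive conversion,
    which is why only logarithmically many entries are needed and why the
    table is so cheap to cache and reuse.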
    On Thu, Oct 11, 2012 at 4:04 PM, wrote:

    *** Submitted as
    http://code.google.com/p/go/source/detail?r=57e13fc87f43 ***


    math/big: more conservative use of lock for divisor table

    Minor performance impact running sequentially:

    benchmark old ns/op new ns/op delta
    BenchmarkString10Base2 389 391 +0.51%
    BenchmarkString100Base2 1530 1534 +0.26%
    BenchmarkString1000Base2 11789 11787 -0.02%
    BenchmarkString10000Base2 111443 112030 +0.53%
    BenchmarkString100000Base2 1017483 1015347 -0.21%
    BenchmarkString10Base8 339 344 +1.47%
    BenchmarkString100Base8 753 756 +0.40%
    BenchmarkString1000Base8 4618 4641 +0.50%
    BenchmarkString10000Base8 43217 43534 +0.73%
    BenchmarkString100000Base8 397518 400602 +0.78%
    BenchmarkString10Base10 630 630 +0.00%
    BenchmarkString100Base10 1975 1960 -0.76%
    BenchmarkString1000Base10 10179 10174 -0.05%
    BenchmarkString10000Base10 44527 44416 -0.25%
    BenchmarkString100000Base10 14404694 14425308 +0.14%
    BenchmarkString10Base16 283 288 +1.77%
    BenchmarkString100Base16 597 598 +0.17%
    BenchmarkString1000Base16 3189 3186 -0.09%
    BenchmarkString10000Base16 29403 29364 -0.13%
    BenchmarkString100000Base16 265657 265587 -0.03%

    Note that due to other improvements (faster assembly routines,
    better code generation by compiler), these benchmarks now run
    up to 37% faster than they used to at the last time measured (1/9/2012).

    Minor performance impact for StringPiParallel running in parallel:

    Current CL but with Lock/Unlock commented out (removed):

    BenchmarkStringPiParallel 5000 343581 ns/op
    BenchmarkStringPiParallel-2 10000 184511 ns/op
    BenchmarkStringPiParallel-3 10000 129768 ns/op
    BenchmarkStringPiParallel-4 10000 102326 ns/op

    Current CL:

    BenchmarkStringPiParallel 5000 345169 ns/op
    BenchmarkStringPiParallel-2 10000 185827 ns/op
    BenchmarkStringPiParallel-3 10000 131168 ns/op
    BenchmarkStringPiParallel-4 10000 102353 ns/op

    Fixes issue 4218.

    R=dvyukov, michael.jones, dave
    CC=golang-dev
    http://codereview.appspot.com/6643053


    http://codereview.appspot.com/6643053/


    --
    Michael T. Jones
    michael.jones@gmail.com
    http://www.google.com/profiles/michael.jones
  • Michael Jones at Oct 13, 2012 at 5:46 pm
    I said that wrong (just woke up). It's not so much N**(1/2), N**(1/4),
    ... as it is the other way around: the largest power of 10 (say) that
    fits in a Word, and then that value **2, **4, **8, and so on until we
    get close to sqrt(N). The essence is the same.
    On Sat, Oct 13, 2012 at 7:50 AM, Michael Jones wrote:

    Just seeing this now (email was to a mostly unused personal account rather
    than my work account.)

    The blocking and concurrent-waiting on data around this structure is by
    design.

    To run fast, when there are two or more large numbers to be converted from
    internal base 2^n to external base 10 (say, but for any base not a power of
    two), then we need a table of divisors. For a (say million "word" number
    N), we need N**(1/2), N**(1/4), N**(1/8), .. down to a "small" size, say
    8-words. where "**" is the exponentiation operator. This table can be big,
    however, it can be built just once and reused many times. The code does
    this now.

    When a first big (bigger than the existing table size) conversion is
    needed, the code updates the table. When more than one goroutine tries the
    same large conversion at the same time, then the first one builds the table
    and the rest of them wait for each entry to be built. Building them
    redundantly would be no faster and wastes memory and CPU. Once the table
    grows suitably, then this phase is a no-op as the table is cached.

    While we could wish that the table was already built, or that as in a real
    program we might assume that a first operation builds the table without
    other threads waiting, in the benchmark case of all goroutines converting
    the same number will cause delays for the first iterations.

    I see no better way.

    Michael


    --
    Michael T. Jones
    michael.jones@gmail.com
    http://www.google.com/profiles/michael.jones
  • Dmitry Vyukov at Oct 14, 2012 at 9:12 am

    On Sat, Oct 13, 2012 at 6:50 PM, Michael Jones wrote:

    Just seeing this now (email was to a mostly unused personal account rather
    than my work account.)

    The blocking and concurrent-waiting on data around this structure is by
    design.

    To run fast, when there are two or more large numbers to be converted from
    internal base 2^n to external base 10 (say, but for any base not a power of
    two), then we need a table of divisors. For a (say million "word" number
    N), we need N**(1/2), N**(1/4), N**(1/8), .. down to a "small" size, say
    8-words. where "**" is the exponentiation operator. This table can be big,
    however, it can be built just once and reused many times. The code does
    this now.

    When a first big (bigger than the existing table size) conversion is
    needed, the code updates the table. When more than one goroutine tries the
    same large conversion at the same time, then the first one builds the table
    and the rest of them wait for each entry to be built. Building them
    redundantly would be no faster and wastes memory and CPU. Once the table
    grows suitably, then this phase is a no-op as the table is cached.

    While we could wish that the table was already built, or that as in a real
    program we might assume that a first operation builds the table without
    other threads waiting, in the benchmark case of all goroutines converting
    the same number will cause delays for the first iterations.

    I see no better way.

    Michael

    There is only one problem: the code does a different thing, which
    leads to data races, crashes and exceptions.
    I suggested to Robert to make the code work the way you describe.


  • Michael Jones at Oct 14, 2012 at 12:51 pm
    Thanks! Performance wise, it is important that other goroutines be able to
    use the read-only smaller entries while a larger one is being created.
    On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov wrote:

    I suggested to Robert to make the code work the way you describe.



    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765
  • Dmitry Vyukov at Oct 15, 2012 at 10:37 am
    It is not the case now.

    On Sun, Oct 14, 2012 at 4:45 PM, Michael Jones wrote:

    Thanks! Performance wise, it is important that other goroutines be able to
    use the read-only smaller entries while a larger one is being created.

    On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov wrote:

    I suggested to Robert to make the code work the way you describe.



    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765
  • Michael Jones at Oct 15, 2012 at 8:59 pm
    Ok. Will look. Thank you for bringing this to my attention!
    On Mon, Oct 15, 2012 at 3:31 AM, Dmitry Vyukov wrote:

    It is not the case now.

    On Sun, Oct 14, 2012 at 4:45 PM, Michael Jones wrote:

    Thanks! Performance wise, it is important that other goroutines be able
    to use the read-only smaller entries while a larger one is being created.

    On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov wrote:

    I suggested to Robert to make the code work the way you describe.



    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765

    --
    Michael T. Jones
    michael.jones@gmail.com
    http://www.google.com/profiles/michael.jones
  • Robert Griesemer at Oct 15, 2012 at 9:35 pm
    The only safe way I see at the moment to do this is to lock/unlock the
    mutex for each new table entry. That's not hard, but I don't know if it's
    worth it. I am worried about correctness of lock-free approaches (using
    package atomic) that operate on slice elements. But I'm happy to be proven
    otherwise. Either way, I believe what we have now is correct, if possibly
    not optimal performance-wise. But again, I don't think there's an immediate
    urgency for this to be fixed as I am not aware of programs caring about
    this too much at the moment.
    - gri

    On Mon, Oct 15, 2012 at 1:51 PM, Michael Jones wrote:

    Ok. Will look. Thank you for bringing this to my attention!

    On Mon, Oct 15, 2012 at 3:31 AM, Dmitry Vyukov wrote:

    It is not the case now.

    On Sun, Oct 14, 2012 at 4:45 PM, Michael Jones wrote:

    Thanks! Performance wise, it is important that other goroutines be able
    to use the read-only smaller entries while a larger one is being created.

    On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov wrote:

    I suggested to Robert to make the code work the way you describe.



    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765

    --
    Michael T. Jones
    michael.jones@gmail.com
    http://www.google.com/profiles/michael.jones
  • Michael Jones at Oct 15, 2012 at 10:29 pm
    OK.
    On Oct 15, 2012 2:35 PM, "Robert Griesemer" wrote:

    The only safe way I see at the moment to do this is to lock/unlock the
    mutex for each new table entry. That's not hard, but I don't know if it's
    worth it. I am worried about correctness of lock-free approaches (using
    package atomic) that operate on slice elements. But I'm happy to be proven
    otherwise. Either way, I believe what we have now is correct, if possibly
    not optimal performance-wise. But again, I don't think there's an immediate
    urgency for this to be fixed as I am not aware of programs caring about
    this too much at the moment.
    - gri

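gri's "lock/unlock the mutex for each new table entry" could look roughly like this. A minimal sketch under stated assumptions: the entry function, the fixed-size table, and the i+1 placeholder computation are invented for illustration and are not the math/big implementation.

```go
package main

import (
	"fmt"
	"sync"
)

type divisor struct{ ndigits int }

var (
	mu    sync.Mutex // protects individual table entries while they are filled in
	table [8]divisor
)

// entry returns table[i], computing it under the lock if it is still
// zero-valued. The lock is taken and released per entry, so it is
// never held across a whole conversion.
func entry(i int) divisor {
	mu.Lock()
	defer mu.Unlock()
	if table[i].ndigits == 0 {
		table[i].ndigits = i + 1 // placeholder for the real divisor computation
	}
	return table[i]
}

func main() {
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range table {
				_ = entry(i)
			}
		}()
	}
	wg.Wait()
	fmt.Println(entry(0).ndigits, entry(7).ndigits)
}
```

The trade-off gri notes applies: readers also pay for the lock on every access, which is why he questions whether the finer granularity is worth it.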
  • Dvyukov at Oct 24, 2012 at 7:03 pm

    On 2012/10/15 21:35:07, gri wrote:
    The only safe way I see at the moment to do this is to lock/unlock the
    mutex for each new table entry. That's not hard, but I don't know if it's
    worth it. I am worried about correctness of lock-free approaches (using
    package atomic) that operate on slice elements. But I'm happy to be proven
    otherwise. Either way, I believe what we have now is correct, if possibly
    not optimal performance-wise. But again, I don't think there's an immediate
    urgency for this to be fixed, as I am not aware of programs that care much
    about this at the moment.
    - gri

    It's quite easy to implement w/o atomic ops on slice elements. That's
    what I had in mind:
    https://codereview.appspot.com/6766047/diff/2001/src/pkg/math/big/nat.go



    https://codereview.appspot.com/6643053/
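For comparison, the CL under review (6643053) bundles the base-10 table with the mutex that protects it and holds the lock for the whole base-10 conversion. A condensed sketch of that control flow; the lastNDigits function and the placeholder computation are invented here for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

type divisor struct{ ndigits int }

// As in the CL: the base-10 cache and the mutex protecting it live in
// one struct, making the relationship between the two explicit.
var cacheBase10 struct {
	mu    sync.Mutex  // protects table
	table [64]divisor // cached divisors for base 10
}

// lastNDigits mimics the CL's control flow: for base 10 the shared
// cache is locked for the duration of the call; any other base gets a
// private table and needs no locking.
func lastNDigits(k, b int) int {
	var table []divisor
	if b == 10 {
		cacheBase10.mu.Lock()
		defer cacheBase10.mu.Unlock()
		table = cacheBase10.table[0:k] // reuse the shared cache
	} else {
		table = make([]divisor, k) // private table, no lock needed
	}
	if table[k-1].ndigits == 0 { // table not (fully) initialized yet
		for i := range table {
			if table[i].ndigits == 0 {
				table[i].ndigits = i + 1 // placeholder for the real divisor computation
			}
		}
	}
	return table[k-1].ndigits
}

func main() {
	fmt.Println(lastNDigits(3, 10), lastNDigits(5, 16))
}
```

This is the conservative end of the spectrum discussed in the thread: simple and clearly correct, at the cost of serializing concurrent base-10 conversions, which is exactly the performance concern Michael raises.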
  • Michael Jones at Oct 24, 2012 at 8:33 pm
    In an airport on the way to Cypress. Will look ASAP.

Discussion Overview
group: golang-dev
categories: go
posted: Oct 9, '12 at 11:57p
active: Oct 24, '12 at 8:33p
posts: 23
users: 5
website: golang.org
