FAQ
I suggest you to repeat your analysis on go1.7beta, the compiler has a new
backend
for amd64 and the generated code should be better. On my machine:

go1.6.2:

Found 23000000 primes up to 262146 in 445.329735ms .
Found 23000000 primes up to 262146 in 447.491918ms .
Found 23000000 primes up to 262146 in 451.530021ms .
Found 23000000 primes up to 262146 in 445.181835ms .
Found 23000000 primes up to 262146 in 447.489657ms .
Found 23000000 primes up to 262146 in 451.568191ms .
Found 23000000 primes up to 262146 in 446.207366ms .
Found 23000000 primes up to 262146 in 449.992793ms .
Found 23000000 primes up to 262146 in 445.746567ms .
Found 23000000 primes up to 262146 in 449.102428ms .

on go1.7beta

Found 23000000 primes up to 262146 in 358.99806ms .
Found 23000000 primes up to 262146 in 358.755634ms .
Found 23000000 primes up to 262146 in 360.231645ms .
Found 23000000 primes up to 262146 in 358.93451ms .
Found 23000000 primes up to 262146 in 359.412598ms .
Found 23000000 primes up to 262146 in 363.33646ms .
Found 23000000 primes up to 262146 in 363.981767ms .
Found 23000000 primes up to 262146 in 366.880261ms .
Found 23000000 primes up to 262146 in 358.761698ms .
Found 23000000 primes up to 262146 in 358.025324ms .


Il giorno mercoledì 15 giugno 2016 14:37:59 UTC+2, gordo...@gmail.com ha
scritto:
While I am sure you are correct that some numeric code with golang can run
as fast as C/C++, it seems that bit-packed tight loops such as used for
culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin for
that matter) are not as optimized. Of course one of the problems that
golang has is compulsory array bounds checks, but it doesn't run as fast as
even DotNet or JVM code, which also have those compulsory checks.

To be specific, following is a simple Sieve of Eratosthenes program in
golang that runs about twice as slow as the same program in C/C++ and about
half again as slow as in C# or Java. The range of 262146 is chosen so that
the created bit-packed array is smaller than the size of most CPU L1 caches
(16 Kilobytes) so that memory access time isn't a bottleneck for most
modern desktop CPU's, and the program is run 1000 times to get a reasonable
length of time for more accurate timing The commented out lines were an
experiment to use unsafe pointers (as in a C style array program) to try to
save time by avoiding the automatic array bounds checks, but the resulting
time was about the same:

package main

import (
"fmt"
"math"
"time"
// "unsafe"
)

func mkCLUT() [65536]byte {
var arr [65536]byte
for i := 0; i < 65536; i++ {
var cnt byte = 0
for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 {
cnt++
}
arr[i] = cnt
}
return arr
}

var cnstCLUT [65536]byte = mkCLUT()

func primesTest(top uint) int {
lmtndx := (top - 3) >> 1
lstw := lmtndx >> 5
topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1
cmpsts := make([]uint32, lstw+1)
// start := uintptr(unsafe.Pointer(&cmpsts[0]))
// step := unsafe.Sizeof(buf[0])
// limit := start + uintptr(len(cmpsts))*step
for i := 0; i <= topsqrtndx; i++ {
if cmpsts[i>>5]&(uint32(1)<<uint(i)) == 0 {
p := (uint(i) << 1) + 3
for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
cmpsts[j>>5] |= 1 << (j & 31)
}
// p := uintptr((uint(i) << 1) + 3)
// lmt := uintptr(lmtndx)
// for j := (p*p - 3) >> 1; j <= lmt; j += p {
// *(*uint)(unsafe.Pointer(start + step*(j>>5)))
= 1 << (j & 31)
// }
}
}
msk := uint32(0xFFFFFFFE) << (lmtndx & 31)
cmpsts[lstw] |= msk
cnt := 1
for i := uint(0); i <= lstw; i++ {
v := cmpsts[i]
cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)])
}
return cnt
}

func main() {
n := uint(262146)

strt := time.Now()

sum := 0
for i := 0; i < 1000; i++ {
sum += primesTest(n)
}

end := time.Now()

fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt),
".")
}

The assembly listing revealed by "go tool compile -S primestest.go >
primestest.s" reveals why (I have added comments at the ends of the lines
to indicate what is going on):

line 74 for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
line 75 cmpsts[j>>5] |= 1 << (j & 31)
line 76 }

has the following assembly code:

0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make
a copy of 'j'
0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn
it into the uint32 word address
0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do
the array bounds check
0x0128 00296 (primestest.go:75) JCC $1, 546
;; out to panic if it fails
0x012e 00302 (primestest.go:75) LEAQ (DI)(R9*4), BX
;; get address of right element of array; the di register contains the
pointer to the array base
0x0132 00306 (primestest.go:75) MOVL (BX), R11
;; get the contents of that element
0x0135 00309 (primestest.go:75) CMPQ R9, SI ;; DO
ANOTHER ARRAY BOUNDS CHECK!!!!!
0x0138 00312 (primestest.go:75) JCC $1, 539
;; OUT TO A DIFFERENT PANIC IF IT FAILS!!!
0x013e 00318 (primestest.go:75) LEAQ (DI)(R9*4), BX
;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!!
0x0142 00322 (primestest.go:75) MOVQ CX, R8 ;;SAVE
THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
0x0145 00325 (primestest.go:75) MOVQ AX, CX
0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx
register now contains 'j' & 31
0x014c 00332 (primestest.go:75) MOVL $1, BP
0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp
register now contains 1 << ('j' & 31)
0x0153 00339 (primestest.go:75) MOVQ R8, CX
;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE
LOOP)!!!
0x0156 00342 (primestest.go:75) ORL R11, BP ;; r11
now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
0x0159 00345 (primestest.go:75) MOVL BP, (BX) ;;
move it back to the element via the address in BX
0x015b 00347 (primestest.go:75) NOP ;; nop's to make the
addresses work out to even words
0x015b 00347 (primestest.go:75) NOP
0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add
'j' to the stored 'p' in r10
0x015e 00350 (primestest.go:74) NOP
0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;;
check if 'j' is up to the stored 'lmtndx' in r12
0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and
loop back to top if not done

The main problem is the lines I have commented in capitals where the
bounds check is done twice but also
where the contents of the cx register are saved and restored inside the
loop

A secondary problem as far as speed is concerned is that the C/C++ code
also takes advantage of the direct read/modify/write instruction,
equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)" so the code in this syntax
for C/C++ code would become:

0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make
a copy of 'j'
0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn
it into the uint32 word address
0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do
the array bounds check ONCE IF ONE MUST
0x0128 00296 (primestest.go:75) JCC $1, 546
;; out to panic if it fails
0x0145 00325 (primestest.go:75) MOVQ AX, CX
0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx
register now contains 'j' & 31
0x014c 00332 (primestest.go:75) MOVL $1, BP
0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp
register now contains 1 << ('j' & 31)
0x015b 00347 (primestest.go:75) NOP ;; nop's to make the
addresses work out to even words AS NECESSARY
0x015b 00347 (primestest.go:75) NOP
0x0156 00342 (primestest.go:75) ORL BP, (DI)(R9*4);;
the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add
'j' to the stored 'p' in r10
0x015e 00350 (primestest.go:74) NOP
0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;;
check if 'j' is up to the stored 'lmtndx' in r12
0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and
loop back to top if not done

Note that NOP's do not take execution time as they are optimized away in a
modern CPU and are there to
make destinations of jump instructions work to even word boundaries, which
can make a difference in speed.

The immediately above loop will take about 3.5 clock cycles without bounds
check and about 5.5 clock cycles with bounds check for a modern desktop CPU
whereas the former version will take about three clock cycles more.

This analysis is for 64-bit code, the difference is even worse for 32-bit
code as there are less registers available and
the golang compiler is even worse at making best use of them.

For most use cases, the C/C++ code will be about twice the speed of the
golang code.

C#/Java do the bounds checks, but not twice and are better at reducing
register operations within tight loops.

In summary, the golang compiler has a way to go for maximum efficiency in
register use, even if the array bounds checks are kept.
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 4 of 27 | next ›
Discussion Overview
groupgolang-nuts @
categoriesgo
postedJun 14, '16 at 12:31p
activeJun 18, '16 at 10:54a
posts27
users6
websitegolang.org

People

Translate

site design / logo © 2021 Grokbase