FAQ
While I am sure you are correct that some numeric code with golang can run as fast as C/C++, it seems that bit-packed tight loops such as used for culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin for that matter) are not as optimized. Of course one of the problems that golang has is compulsory array bounds checks, but it doesn't run as fast as even DotNet or JVM code, which also have those compulsory checks.

To be specific, following is a simple Sieve of Eratosthenes program in golang that runs about twice as slow as the same program in C/C++ and about half again as slow as in C# or Java. The range of 262146 is chosen so that the created bit-packed array is smaller than the size of most CPU L1 caches (16 Kilobytes) so that memory access time isn't a bottleneck for most modern desktop CPU's, and the program is run 1000 times to get a reasonable length of time for more accurate timing The commented out lines were an experiment to use unsafe pointers (as in a C style array program) to try to save time by avoiding the automatic array bounds checks, but the resulting time was about the same:

package main

import (
  "fmt"
  "math"
  "time"
  // "unsafe"
)

func mkCLUT() [65536]byte {
  var arr [65536]byte
  for i := 0; i < 65536; i++ {
   var cnt byte = 0
   for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 {
    cnt++
   }
   arr[i] = cnt
  }
  return arr
}

var cnstCLUT [65536]byte = mkCLUT()

func primesTest(top uint) int {
  lmtndx := (top - 3) >> 1
  lstw := lmtndx >> 5
  topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1
  cmpsts := make([]uint32, lstw+1)
// start := uintptr(unsafe.Pointer(&cmpsts[0]))
// step := unsafe.Sizeof(buf[0])
// limit := start + uintptr(len(cmpsts))*step
  for i := 0; i <= topsqrtndx; i++ {
   if cmpsts[i>>5]&(uint32(1)<<uint(i)) == 0 {
    p := (uint(i) << 1) + 3
    for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
     cmpsts[j>>5] |= 1 << (j & 31)
    }
// p := uintptr((uint(i) << 1) + 3)
// lmt := uintptr(lmtndx)
// for j := (p*p - 3) >> 1; j <= lmt; j += p {
// *(*uint)(unsafe.Pointer(start + step*(j>>5))) |= 1 << (j & 31)
// }
   }
  }
  msk := uint32(0xFFFFFFFE) << (lmtndx & 31)
  cmpsts[lstw] |= msk
  cnt := 1
  for i := uint(0); i <= lstw; i++ {
   v := cmpsts[i]
   cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)])
  }
  return cnt
}

func main() {
  n := uint(262146)

  strt := time.Now()

  sum := 0
  for i := 0; i < 1000; i++ {
   sum += primesTest(n)
  }

  end := time.Now()

  fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), ".")
}

The assembly listing revealed by "go tool compile -S primestest.go > primestest.s" reveals why (I have added comments at the ends of the lines to indicate what is going on):

line 74 for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
line 75 cmpsts[j>>5] |= 1 << (j & 31)
line 76 }

has the following assembly code:

  0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j'
  0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address
  0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check
  0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails
  0x012e 00302 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; get address of right element of array; the di register contains the pointer to the array base
  0x0132 00306 (primestest.go:75) MOVL (BX), R11 ;; get the contents of that element
  0x0135 00309 (primestest.go:75) CMPQ R9, SI ;; DO ANOTHER ARRAY BOUNDS CHECK!!!!!
  0x0138 00312 (primestest.go:75) JCC $1, 539 ;; OUT TO A DIFFERENT PANIC IF IT FAILS!!!
  0x013e 00318 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!!
  0x0142 00322 (primestest.go:75) MOVQ CX, R8 ;;SAVE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
  0x0145 00325 (primestest.go:75) MOVQ AX, CX
  0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31
  0x014c 00332 (primestest.go:75) MOVL $1, BP
  0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31)
  0x0153 00339 (primestest.go:75) MOVQ R8, CX ;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
  0x0156 00342 (primestest.go:75) ORL R11, BP ;; r11 now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
  0x0159 00345 (primestest.go:75) MOVL BP, (BX) ;; move it back to the element via the address in BX
  0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words
  0x015b 00347 (primestest.go:75) NOP
  0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add 'j' to the stored 'p' in r10
  0x015e 00350 (primestest.go:74) NOP
  0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12
  0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done

The main problem is the lines I have commented in capitals where the bounds check is done twice but also
where the contents of the cx register are saved and restored inside the loop

A secondary problem as far as speed is concerned is that the C/C++ code also takes advantage of the direct read/modify/write instruction,
equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)" so the code in this syntax for C/C++ code would become:

  0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j'
  0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address
  0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check ONCE IF ONE MUST
  0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails
  0x0145 00325 (primestest.go:75) MOVQ AX, CX
  0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31
  0x014c 00332 (primestest.go:75) MOVL $1, BP
  0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31)
  0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words AS NECESSARY
  0x015b 00347 (primestest.go:75) NOP
  0x0156 00342 (primestest.go:75) ORL BP, (DI)(R9*4);; the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
  0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add 'j' to the stored 'p' in r10
  0x015e 00350 (primestest.go:74) NOP
  0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12
  0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done

Note that NOP's do not take execution time as they are optimized away in a modern CPU and are there to
make destinations of jump instructions work to even word boundaries, which can make a difference in speed.

The immediately above loop will take about 3.5 clock cycles without bounds check and about 5.5 clock cycles with bounds check for a modern desktop CPU
whereas the former version will take about three clock cycles more.

This analysis is for 64-bit code, the difference is even worse for 32-bit code as there are less registers available and
the golang compiler is even worse at making best use of them.

For most use cases, the C/C++ code will be about twice the speed of the golang code.

C#/Java do the bounds checks, but not twice and are better at reducing register operations within tight loops.

In summary, the golang compiler has a way to go for maximum efficiency in register use, even if the array bounds checks are kept.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 3 of 27 | next ›
Discussion Overview
groupgolang-nuts @
categoriesgo
postedJun 14, '16 at 12:31p
activeJun 18, '16 at 10:54a
posts27
users6
websitegolang.org

People

Translate

site design / logo © 2021 Grokbase