FAQ
That's interesting, but your time of 0.7 seconds on your specified machine isn't very informative: you should at least specify how long the original version of primegen (which is not multithreaded) takes, which we assume is about 0.7 seconds, and also how long your golang version takes without multithreading (or with "maxprocs := 1" set). Also, one should really convert Bernstein's "eratspeed" to golang for comparison.

Current versions of golang are not as fast as GCC-optimized C/C++ code, by at least a factor of two. You will find that if one applies maximum wheel factorization (i.e. a 2/3/5/7 wheel plus pre-culling for the primes 11/13/17/19), rather than the limited 2/3/5 wheel factorization Bernstein used in eratspeed, the Sieve of Eratosthenes will beat "primegen" for every range when comparing like for like (multithreading or not for both, written in the same language). And golang will currently always be a factor slower than the C/C++ code due to less efficient compilation.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

  • Alb Donizetti at Jun 14, 2016 at 1:11 pm
    Current versions of golang are not as fast as GCC C/C++ optimized code by
    at least a factor of two

    This is, in general, not true at all: I've written numeric code as fast as
    C code compiled by GCC -O3.

  • Gordonbgood at Jun 15, 2016 at 12:37 pm
    While I am sure you are correct that some numeric code in golang can run as fast as C/C++, it seems that bit-packed tight loops such as those used for culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin, for that matter) are not as well optimized. Of course, one of the problems golang has is compulsory array bounds checks, but it doesn't run as fast as even DotNet or JVM code, which also have those compulsory checks.

    To be specific, following is a simple Sieve of Eratosthenes program in golang that takes about twice as long as the same program in C/C++ and about half again as long as in C# or Java. The range of 262146 is chosen so that the created bit-packed array (16 Kilobytes) is smaller than most CPU L1 caches, so that memory access time isn't a bottleneck for most modern desktop CPUs, and the program is run 1000 times to get a reasonable length of time for more accurate timing. The commented-out lines were an experiment to use unsafe pointers (as in a C-style array program) to try to save time by avoiding the automatic array bounds checks, but the resulting time was about the same:

    package main

    import (
      "fmt"
      "math"
      "time"
      // "unsafe"
    )

    func mkCLUT() [65536]byte {
      var arr [65536]byte
      for i := 0; i < 65536; i++ {
       var cnt byte = 0
       for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 {
        cnt++
       }
       arr[i] = cnt
      }
      return arr
    }

    var cnstCLUT [65536]byte = mkCLUT()

    func primesTest(top uint) int {
      lmtndx := (top - 3) >> 1
      lstw := lmtndx >> 5
      topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1
      cmpsts := make([]uint32, lstw+1)
    // start := uintptr(unsafe.Pointer(&cmpsts[0]))
    // step := unsafe.Sizeof(cmpsts[0])
    // limit := start + uintptr(len(cmpsts))*step
      for i := 0; i <= topsqrtndx; i++ {
       if cmpsts[i>>5]&(uint32(1)<<uint(i&31)) == 0 {
        p := (uint(i) << 1) + 3
        for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
         cmpsts[j>>5] |= 1 << (j & 31)
        }
    // p := uintptr((uint(i) << 1) + 3)
    // lmt := uintptr(lmtndx)
    // for j := (p*p - 3) >> 1; j <= lmt; j += p {
    // *(*uint32)(unsafe.Pointer(start + step*(j>>5))) |= 1 << (j & 31)
    // }
       }
      }
      msk := uint32(0xFFFFFFFE) << (lmtndx & 31)
      cmpsts[lstw] |= msk
      cnt := 1
      for i := uint(0); i <= lstw; i++ {
       v := cmpsts[i]
       cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)])
      }
      return cnt
    }

    func main() {
      n := uint(262146)

      strt := time.Now()

      sum := 0
      for i := 0; i < 1000; i++ {
       sum += primesTest(n)
      }

      end := time.Now()

      fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), ".")
    }

    The assembly listing produced by "go tool compile -S primestest.go > primestest.s" reveals why (I have added comments at the ends of the lines to indicate what is going on):

    line 74 for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
    line 75 cmpsts[j>>5] |= 1 << (j & 31)
    line 76 }

    has the following assembly code:

      0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j'
      0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address
      0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check
      0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails
      0x012e 00302 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; get address of right element of array; the di register contains the pointer to the array base
      0x0132 00306 (primestest.go:75) MOVL (BX), R11 ;; get the contents of that element
      0x0135 00309 (primestest.go:75) CMPQ R9, SI ;; DO ANOTHER ARRAY BOUNDS CHECK!!!!!
      0x0138 00312 (primestest.go:75) JCC $1, 539 ;; OUT TO A DIFFERENT PANIC IF IT FAILS!!!
      0x013e 00318 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!!
      0x0142 00322 (primestest.go:75) MOVQ CX, R8 ;;SAVE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
      0x0145 00325 (primestest.go:75) MOVQ AX, CX
      0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31
      0x014c 00332 (primestest.go:75) MOVL $1, BP
      0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31)
      0x0153 00339 (primestest.go:75) MOVQ R8, CX ;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
      0x0156 00342 (primestest.go:75) ORL R11, BP ;; r11 now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
      0x0159 00345 (primestest.go:75) MOVL BP, (BX) ;; move it back to the element via the address in BX
      0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words
      0x015b 00347 (primestest.go:75) NOP
      0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add 'j' to the stored 'p' in r10
      0x015e 00350 (primestest.go:74) NOP
      0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12
      0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done

    The main problems are the lines I have commented in capitals, where the bounds check is done twice and
    where the contents of the cx register are saved and restored inside the loop.
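
    A workaround sometimes tried for this (a hedged sketch, not from the thread: whether the Go compiler's prove pass actually drops the per-iteration checks varies by compiler version) is to hoist a single dominating bounds test in front of the loop:

```go
package main

import "fmt"

// cull marks the composite bit positions for the odd prime p, exactly as in
// the inner loop above; the single up-front test on the highest word index
// may let the compiler prove the indexed accesses in range and elide the
// per-iteration bounds checks.
func cull(cmpsts []uint32, lmtndx, p uint) {
	if lmtndx>>5 >= uint(len(cmpsts)) {
		return // one explicit check instead of one per iteration
	}
	for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
		cmpsts[j>>5] |= 1 << (j & 31)
	}
}

func main() {
	cmpsts := make([]uint32, 8) // 256 bit positions
	cull(cmpsts, 255, 3)        // mark bits for odd multiples of 3 (9, 15, ...)
	fmt.Printf("%08x\n", cmpsts[0]) // prints "49249248"
}
```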

    A secondary problem as far as speed is concerned is that the C/C++ code also takes advantage of the direct read/modify/write instruction
    equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)", so the C/C++ code expressed in this assembly syntax would become:

      0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j'
      0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address
      0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check ONCE IF ONE MUST
      0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails
      0x0145 00325 (primestest.go:75) MOVQ AX, CX
      0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31
      0x014c 00332 (primestest.go:75) MOVL $1, BP
      0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31)
      0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words AS NECESSARY
      0x015b 00347 (primestest.go:75) NOP
      0x0156 00342 (primestest.go:75) ORL BP, (DI)(R9*4);; the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
      0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add 'j' to the stored 'p' in r10
      0x015e 00350 (primestest.go:74) NOP
      0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12
      0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done

    Note that NOPs do not take execution time, as they are optimized away in a modern CPU; they are there to
    make the destinations of jump instructions land on even word boundaries, which can make a difference in speed.

    The immediately above loop will take about 3.5 clock cycles per iteration without the bounds check and about 5.5 clock cycles with it on a modern desktop CPU,
    whereas the former version will take about three clock cycles more.

    This analysis is for 64-bit code; the difference is even worse for 32-bit code, as there are fewer registers available and
    the golang compiler is even worse at making best use of them.

    For most use cases, the C/C++ code will be about twice the speed of the golang code.

    C#/Java do the bounds checks, but not twice and are better at reducing register operations within tight loops.

    In summary, the golang compiler has a way to go for maximum efficiency in register use, even if the array bounds checks are kept.

  • Alb Donizetti at Jun 15, 2016 at 1:10 pm
    I suggest you repeat your analysis on go1.7beta; the compiler has a new
    backend for amd64 and the generated code should be better. On my machine:

    go1.6.2:

    Found 23000000 primes up to 262146 in 445.329735ms .
    Found 23000000 primes up to 262146 in 447.491918ms .
    Found 23000000 primes up to 262146 in 451.530021ms .
    Found 23000000 primes up to 262146 in 445.181835ms .
    Found 23000000 primes up to 262146 in 447.489657ms .
    Found 23000000 primes up to 262146 in 451.568191ms .
    Found 23000000 primes up to 262146 in 446.207366ms .
    Found 23000000 primes up to 262146 in 449.992793ms .
    Found 23000000 primes up to 262146 in 445.746567ms .
    Found 23000000 primes up to 262146 in 449.102428ms .

    on go1.7beta

    Found 23000000 primes up to 262146 in 358.99806ms .
    Found 23000000 primes up to 262146 in 358.755634ms .
    Found 23000000 primes up to 262146 in 360.231645ms .
    Found 23000000 primes up to 262146 in 358.93451ms .
    Found 23000000 primes up to 262146 in 359.412598ms .
    Found 23000000 primes up to 262146 in 363.33646ms .
    Found 23000000 primes up to 262146 in 363.981767ms .
    Found 23000000 primes up to 262146 in 366.880261ms .
    Found 23000000 primes up to 262146 in 358.761698ms .
    Found 23000000 primes up to 262146 in 358.025324ms .


  • Gordonbgood at Jun 16, 2016 at 1:22 am
    golang version 1.7beta1 does indeed help, and the time is now not much worse than C#/Java, but still not as good as C/C++ due to the remaining single array bounds check:

    Using the same procedure to obtain an assembly listing (go tool compile -S PrimeTest.go > PrimeTest.s):

    line 36 for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
    line 37 cmpsts[j>>5] |= 1 << (j & 31)
    line 38 }

      0x00f1 00241 (Main.go:37) MOVQ R8, CX ;; copy 'j' in r8 to cx
      0x00f4 00244 (Main.go:37) SHRQ $5, R8 ;; shift r8 right by 5 to get the word address
      0x00f8 00248 (Main.go:37) CMPQ R8, DX ;; array bounds check to array length stored in dx
      0x00fb 00251 (Main.go:37) JCC $0, 454 ;; panic if fail bounds check
      0x0101 00257 (Main.go:37) MOVL (AX)(R8*4), R10 ;; get element to r10 in one step
      0x0105 00261 (Main.go:37) MOVQ CX, R11 ;; save 'j' for later in r11
      0x0108 00264 (Main.go:37) ANDQ $31, CX ;; leave 'j' & 31 in cx
      0x010c 00268 (Main.go:37) MOVL R9, R12 ;; save r9 to r12 to preserve the 1 it contains - WHY NOT JUST MAKE R12 CONTAIN 1 AT ALL TIMES IF USING IT IS QUICKER THAN AN IMMEDIATE LOAD
      0x010f 00271 (Main.go:37) SHLL CX, R9 ;; R9 SHOULD JUST BE LOADED WITH 1 ABOVE - now r9 contains 1 << ('j' & 31)
      0x0112 00274 (Main.go:37) ORL R10, R9 ;; r9 contains cmpsts[j >> 5] | (1 << ('j' & 31)) - the bit or is done here
      0x0115 00277 (Main.go:37) MOVL R9, (AX)(R8*4) ;; element now contains the modified value
      0x0119 00281 (Main.go:36) LEAQ 3(R11)(DI*2), R8 ;; tricky way to calculate 'j' + 2 * 'i' + 3 where 2 * 'i' + 3 is p, answer to r8, saves a register
      0x011e 00286 (Main.go:37) MOVL R12, R9 ;; RESTORE R9 FROM R12 - SHOULD NOT BE NECESSARY, but doesn't really cost in time as CPU is waiting for results of LEAQ operation
      0x0121 00289 (Main.go:36) CMPQ R8, BX ;; check if 'j' in r8 is up to limit stored in bx
      0x0124 00292 (Main.go:36) JLS $0, 241 ;; loop if not complete

    This is much better than the 1.6.2 code in that it no longer does the array bounds check twice. There is still the minor use of an extra r12 register to store 1 instead of an immediate load of 1 into the r9 register; r12 could instead have been used to store 'p' and save a slight amount of time over the tricky code that recalculates 'p' every loop (the tricky LEAQ is still about half a cycle slower than just using a pre-calculated 'p' value). The C/C++ code will still be quicker, mainly because of no array bounds check, worth a couple of CPU clock cycles, but also because it is more efficient to use the single read/modify/write version of the ORL instruction instead of a MOVL from the array element to a register, an ORL with the bit modifier, and a MOVL from the register back to the array element. It seems the compiler is now almost trying too hard to save registers, at the cost of time in the tricky 'p' calculation, while spending a register for no gain in saving the 1.

    So it is good to see that golang compiler optimization is taking some steps forward, but it isn't quite there yet.

  • Nigel Tao at Jun 16, 2016 at 3:11 am

    On Thu, Jun 16, 2016 at 11:22 AM, gordo...@gmail.com wrote:
    it no longer does the array bounds check twice
    It's not safe, but you might want to try disabling bounds checking by
    "go tool compile -B" or "go build -gcflags=-B".

  • Gordonbgood at Jun 16, 2016 at 4:20 am

    On Thursday, 16 June 2016 10:11:12 UTC+7, Nigel Tao wrote:
    On Thu, Jun 16, 2016 at 11:22 AM, gordo...@gmail.com wrote:
    it no longer does the array bounds check twice
    It's not safe, but you might want to try disabling bounds checking by
    "go tool compile -B" or "go build -gcflags=-B"
    In fact, in this case it is safe, because the loop code has a built-in array
    bounds check in verifying that the 'j' variable doesn't go past the top of the
    array (and, in the enclosing loop, that the 'i' index doesn't go past the
    index of the square root of the top value), and the array is sized to contain
    that top array index.

    I am not currently at my usual test machine so will wait a few hours to
    post the results, but I would assume that just the two-instruction array
    bounds check, which takes about two CPU clock cycles, will be eliminated;
    thus it is likely that the run time will be reduced to about two thirds
    on a modern desktop CPU. This will still likely be slightly slower than
    optimized C/C++ code for the reasons given in my last post, but not bad at
    all.

    It's too bad one can't disable bounds checks only for certain ranges of
    code via "pragma"s in the source code, but I don't suppose that is possible.
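
    Short of such a pragma, the nearest workaround remains the unsafe-pointer route that the commented-out lines in the earlier program sketched; a corrected, self-contained version of that experiment (a sketch only: it discards Go's safety guarantees, and the caller must keep the index in range) might look like:

```go
package main

import (
	"fmt"
	"unsafe"
)

// cullUnsafe is the same composite-marking loop as the safe version, but it
// addresses the word array through raw pointer arithmetic so the compiler
// emits no bounds checks. The caller must guarantee lmtndx>>5 < len(cmpsts).
func cullUnsafe(cmpsts []uint32, lmtndx, p uint) {
	base := unsafe.Pointer(&cmpsts[0])
	step := unsafe.Sizeof(cmpsts[0]) // 4 bytes per uint32 word
	for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
		w := (*uint32)(unsafe.Pointer(uintptr(base) + uintptr(j>>5)*step))
		*w |= 1 << (j & 31)
	}
}

func main() {
	cmpsts := make([]uint32, 8) // 256 bit positions
	cullUnsafe(cmpsts, 255, 3)  // mark bits for odd multiples of 3 (9, 15, ...)
	fmt.Printf("%08x\n", cmpsts[0]) // prints "49249248"
}
```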


  • Gordonbgood at Jun 16, 2016 at 9:13 am
    No real surprises with the no-bounds-checks option (-B): it just eliminated the array bounds checks, with the rest of the code the same (version 1.7beta1):

      0x00dd 00221 (main.go:37) MOVQ DI, CX
      0x00e0 00224 (main.go:37) SHRQ $5, DI
      0x00e4 00228 (main.go:37) MOVL (AX)(DI*4), R9
      0x00e8 00232 (main.go:37) MOVQ CX, R10
      0x00eb 00235 (main.go:37) ANDQ $31, CX
      0x00ef 00239 (main.go:37) MOVL R8, R11
      0x00f2 00242 (main.go:37) SHLL CX, R8
      0x00f5 00245 (main.go:37) ORL R8, R9
      0x00f8 00248 (main.go:37) MOVL R9, (AX)(DI*4)
      0x00fc 00252 (main.go:36) LEAQ 3(R10)(SI*2), DI
      0x0101 00257 (main.go:37) MOVL R11, R8
      0x0104 00260 (main.go:36) CMPQ DI, DX
      0x0107 00263 (main.go:36) JLS $0, 221

    It is now almost as fast as C/C++ code, and where it isn't, it is for the same reasons as explained before: excessively using registers to store things and not using the read/modify/write instruction (which would also save the use of a register).

    The current beta works not too badly for amd64 code but still doesn't use registers efficiently enough to suit x86 code, as it uses too many registers: optimized C/C++ code uses only six or at most seven registers, which the x86 architecture has, but not the nine registers that the above requires.

    So for this tight loop, golang is still slower than optimized C/C++ code, but not by very much if array bounds checks are disabled.

  • ⚛ at Jun 16, 2016 at 3:45 pm

    On Thursday, June 16, 2016 at 11:13:12 AM UTC+2, gordo...@gmail.com wrote:
    The current beta will work not too badly with amd64 code but still doesn't
    use registers efficiently enough to support x86 code as it uses too many
    registers: optimized C/C++ code only uses six or at most seven registers,
    which the x86 architecture has, but not the nine registers that the above
    requires.

    So for this tight loop, golang is still slower than optimized C/C++ code,
    but not by very much if array bounds checks are disabled.
    Modern x86 CPUs don't work like that.

    In general, optimally scheduled assembly code that uses more registers has
    higher performance than optimally scheduled assembly code that uses a
    smaller number of registers, assuming both assembly codes correspond to the
    same source code.

    Register renaming: since Intel Pentium Pro and AMD K5.

    Suggestion for reading: http://www.agner.org/optimize/microarchitecture.pdf

    An excerpt from the above PDF document (Section 10 about Haswell and
    Broadwell pipeline): "... the register file has 168 integer registers and
    168 vector registers ..."

  • Gordonbgood at Jun 17, 2016 at 12:30 am

    Modern x86 CPUs don't work like that.
    In general, optimally scheduled assembly code which uses more registers has higher performance than optimally scheduled assembly code which uses smaller number of registers. Assuming both assembly codes correspond to the same source code.
    Register renaming: since Intel Pentium Pro and AMD K5.
    Suggestion for reading: http://www.agner.org/optimize/microarchitecture.pdf
    An excerpt from the above PDF document (Section 10 about Haswell and Broadwell pipeline): "... the register file has 168 integer registers and 168 vector registers ..."
    I am aware of all of the above and have already read Agner Fog's publications. In addition, modern CPUs do Out of Order Execution (OOE) and so rearrange the instructions to best reduce instruction latencies and increase throughput, given that there are parallel execution pipelines and ahead-of-time execution; the actual execution order is almost certainly not as per the assembly listing.

    Yes, both assembly listings are from the same tight loop code, but the "C/C++" one has been converted from another assembly format to the golang assembly format.

    Daniel Bernstein, the author of "primegen", wrote for the Pentium 3 in x86 (32-bit) code, as the Pentium Pro processor wasn't commonly available at that time and 64-bit code didn't exist. His hand-optimized C code for the Sieve of Eratosthenes (the "while k < B" loop in the "doit()" function of "eratspeed.c") uses six registers for the inner culling loop being discussed, and takes about 3.5 CPU clock cycles per loop on a modern CPU (Haswell).

    The number of internal CPU registers actually used by the CPU to effect OOE is beside the point, as they have to do with the CPU's internal optimizations and not compiler optimizations; my point is that the compiler's incorrect use of registers still costs time.

    While I don't expect golang, with its philosophy of preserving "safe" paradigms by doing array bounds checks by default, to run as fast as C/C++ code that doesn't have that philosophy, I do expect it to run at least as fast as C#/Java code, which are Just In Time (JIT) compiled and do have the "safe" philosophy. The golang compiler version 1.7beta1 is not quite there yet, for the indicated reasons: inconsistent use of registers; using one register too many in one place in order to avoid an immediate load that doesn't cost any execution time; and saving one register by use of the "trick", which does cost execution time compared to using a single register.

    However, there is hope: version 1.7 has made great advances over version 1.6, and version 1.8, which is said to improve this further, should be faster yet. At any rate, version 1.7's speed is "adequate" for many purposes, as it comes close (probably within about 15% to 20% or less) to C#/Java speed in many of the most demanding tight-loop algorithms, and thus is quite usable compared to previous versions. But even the most avid golang protagonists must admit that it isn't the language in which to re-write Kim Walisch's "primesieve", with its extreme loop unrolling that takes an average of about 1.4 CPU clock cycles per composite-number cull for small ranges of primes: even with array bounds checking turned off, golang would still take at least twice and more likely three times as long.

    That is also why I first started posting to this thread: the only reason the golang version of "primegen" is reasonably comparable in speed to C/C++ "primegen" is that it uses multi-threading on a multi-core processor, which wasn't available to Daniel Bernstein when he wrote "primegen". My point was that one should compare like with like.

  • Keith Randall at Jun 17, 2016 at 2:23 am
    Looks like something is wrong with immediate loading for the 1 << ...
    operation. Could you open a bug with repro instructions? I can look at it
    when 1.8 opens.
    On Thursday, June 16, 2016 at 5:30:12 PM UTC-7, gordo...@gmail.com wrote:

  • Gordonbgood at Jun 17, 2016 at 8:25 am
    Looks like something is wrong with immediate loading for the 1 << ... operation. Could you open a bug with repro instructions? I can look at it when 1.8 opens.
    I have opened a golang issue #16092 as follows: https://github.com/golang/go/issues/16092

    I may have over-complicated it, but as well as the failure to use the load-immediate instruction, I documented other suggestions that would make the code run faster, as well as a method to eliminate the array bounds check for cases where the check is built into the loop, as C# (sometimes) does for x86 code when one uses a specific formulation of a loop. If implemented, these changes would also apply to x86 code due to the reduction in register use, and could potentially make golang code run as fast as the C/C++ compared in the issue.
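    A minimal sketch of the kind of loop formulation that lets the compiler drop the bounds check (names are illustrative; this is the general bounds-check-elimination pattern, not the exact formulation from the issue): when the loop condition itself is "i < len(buf)", the compiler can prove buf[i] is in range.

```go
package main

import "fmt"

// orBitsChecked strides through buf; because the loop condition bounds i
// by len(buf), the access buf[i] provably cannot go out of range, so the
// compiler can elide the bounds check.
func orBitsChecked(buf []uint32, step int) {
	for i := 0; i < len(buf); i += step {
		buf[i] |= 1
	}
}

func main() {
	buf := make([]uint32, 8)
	orBitsChecked(buf, 3)
	fmt.Println(buf) // → [1 0 0 1 0 0 1 0]
}
```

    On recent Go toolchains, `go build -gcflags='-d=ssa/check_bce/debug=1'` reports which bounds checks remain, and `-gcflags=-B` disables them wholesale, which is how the with/without comparisons in this thread were made.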

  • ⚛ at Jun 17, 2016 at 1:35 pm

    On Friday, June 17, 2016 at 2:30:12 AM UTC+2, gordo...@gmail.com wrote:
    Daniel Bernstein, the author of "primegen" wrote for the Pentium 3 in x86
    (32-bit) code, as the Pentium Pro processor wasn't commonly available at
    that time and 64-bit code didn't exist. His hand optimized C code for the
    Sieve of Eratosthenes ("eratspeed.c" in the "doit()" function for the
    "while k < B loop") uses six registers for this inner culling loop being
    discussed, and takes about 3.5 CPU clock cycles per loop on a modern CPU
    (Haswell).
    The following isn't an argument against your post or in favor of your post.
    It is just some assembly code that I find interesting.

    $ clang-3.9 -S -O3 eratspeed.c -mtune=bdver3
    .LBB1_4:
             movl %edx, %edi
             shrl $5, %edi
             movl %edx, %ecx
             andl $31, %ecx
             movl two(,%rcx,4), *%ecx*
             addl %esi, %edx
             orl *%ecx*, a(%rbx,%rdi,4)
             cmpl $32032, %edx
             jb .LBB1_4

    5 registers: DX DI CX SI BX

    $ perf stat --repeat=100 -- ./eratspeed.orig >/dev/null

      Performance counter stats for './eratspeed.orig' (100 runs):

             466.592220 task-clock (msec)        #    0.999 CPUs utilized            ( +-  0.11% )
                      1 context-switches         #    0.002 K/sec                    ( +- 12.37% )
                      0 cpu-migrations           #    0.000 K/sec                    ( +-100.00% )
                     86 page-faults              #    0.185 K/sec                    ( +-  0.09% )
          1,862,076,369 cycles                   #    3.991 GHz                      ( +-  0.11% )
             74,805,707 stalled-cycles-frontend  #    4.02% frontend cycles idle     ( +-  1.20% )
            272,721,373 stalled-cycles-backend   #  *14.65%* backend cycles idle     ( +-  0.16% )
          4,116,301,949 instructions             #   *2.21* insn per cycle
                                                 #    0.07 stalled cycles per insn   ( +-  0.00% )
            473,019,237 branches                 # 1013.774 M/sec                    ( +-  0.00% )
              8,443,283 branch-misses            #    1.78% of all branches          ( +-  1.05% )

            0.467167554 seconds time elapsed                                         ( +-  0.11% )

    -----

    Hand optimization of the above:

    .LBB1_4:
             movl %edx, %edi
             shrl $5, %edi
             movl %edx, %ecx
             andl $31, %ecx
             movl two(,%rcx,4), *%r11d*
             addl %esi, %edx
             orl *%r11d*, a(%rbx,%rdi,4)
             cmpl $32032, %edx
             jb .LBB1_4

    6 registers: DX DI CX SI BX R11

    $ perf stat --repeat=100 -- ./eratspeed.opt >/dev/null
      Performance counter stats for './eratspeed.opt' (100 runs):

             444.740487 task-clock (msec)        #    0.999 CPUs utilized            ( +-  0.01% )
                      1 context-switches         #    0.002 K/sec                    ( +- 11.68% )
                      0 cpu-migrations           #    0.000 K/sec                    ( +-100.00% )
                     85 page-faults              #    0.191 K/sec                    ( +-  0.10% )
          1,775,283,795 cycles                   #    3.992 GHz                      ( +-  0.00% )
             68,397,472 stalled-cycles-frontend  #    3.85% frontend cycles idle     ( +-  0.04% )
            201,687,390 stalled-cycles-backend   #  *11.36%* backend cycles idle     ( +-  0.02% )
          4,116,277,783 instructions             #   *2.32* insn per cycle
                                                 #    0.05 stalled cycles per insn   ( +-  0.00% )
            473,014,763 branches                 # 1063.575 M/sec                    ( +-  0.00% )
              8,263,323 branch-misses            #    1.75% of all branches          ( +-  0.00% )

            0.445287360 seconds time elapsed                                         ( +-  0.01% )

  • Gordonbgood at Jun 17, 2016 at 1:52 pm
    I don't see the point of the exercise other than it proves that not putting the result of an operation back into the same register reduces the latency slightly on your processor (whatever it is?); I suspect that if you used any other register, such as the unused AX register rather than the R11 register, the results would be the same.

    What would be interesting is to compare the run time on your processor with the same eratspeed program written in golang using version 1.7beta1 with and without array bounds checks.

  • ⚛ at Jun 17, 2016 at 2:18 pm

    On Friday, June 17, 2016 at 3:52:27 PM UTC+2, gordo...@gmail.com wrote:
    I don't see the point of the exercise other than it proves that not
    putting the result of an operation back into the same register reduces the
    latency slightly for your processor (whatever it is?); I suspect that if
    you used any other register such as the unused AX register rather then the
    R11 register, the results would be the same.
    Why do you see no point of the exercise and at the same time you do see the
    point of optimizing 1.7beta1 code? What is the difference?

    What would be interesting is to compare the run time on your processor with
    the same eratspeed program written in golang using version 1.7beta1 with
    and without array bounds checks.
  • Gordonbgood at Jun 17, 2016 at 2:47 pm
    I don't see the point to the exercise as far as optimizing golang is concerned. Your experiment just shows that your compiler (GCC?) missed an optimization as far as reducing backend latency goes.

    You may also find that swapping the order of some of the instructions such as the second and the third in the loop may also reduce backend latency further.

    I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).

    My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.

  • ⚛ at Jun 17, 2016 at 3:38 pm

    On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
    I don't see the point to the exercise as far as optimizing golang is
    concerned.

    It is a general rule that using more registers results in faster code.

    Your experiment just shows that Your compiler (GCC?)

    My post containing the example mentions that the compiler is clang 3.9.

    missed an optimization as far as reducing backend latency goes.
    It is significant because it indicates that the clang compiler might be
    completely missing the general rule which I mentioned above.

    You may also find that swapping the order of some of the instructions such
    as the second and the third in the loop may also reduce backend latency
    further.
    Swapping the order of instruction in the example results in the same or
    lower performance.

    I am not on a high end Intel CPU now, but when I was I found that with a
    buffer size adjusted to the L1 cache size (8192 32-bit words or 32
    Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5
    clock cycles per loop (about 405,000,000 loops for this range).

    My current AMD Bulldozer CPU has a severe cache bottleneck and can't come
    close to this speed by a factor of about two.

    Which Bulldozer version do you have: original Bulldozer, Piledriver,
    Steamroller/Kaveri or Excavator/Carrizo?

  • Gordonbgood at Jun 17, 2016 at 10:21 pm

    On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
    I don't see the point to the exercise as far as optimizing golang is concerned.
    It is a general rule that using more registers results in faster code.
    Yes, except when they're wasted, as with the extra register golang 1.7 uses just to save a load immediate of $1.
    Your experiment just shows that Your compiler (GCC?)
    My post containing the example mentions that the compiler is clang 3.9.
    Yes, sorry, I was tired and forgot that.
    missed an optimization as far as reducing backend latency goes.
    It is significant because it indicates that the clang compiler might be completely missing the general rule which I mentioned above.
    It is significant to Clang development, but not to golang development, which I thought was what we were working on in this thread; that's why I said that your work, though important, doesn't really apply here as to golang improvements.
    You may also find that swapping the order of some of the instructions such as the second and the third in the loop may also reduce backend latency further.
    Swapping the order of instruction in the example results in the same or lower performance.
    I said "may".
    I am not on a high end Intel CPU now, but when I was I found that with a buffer size adjusted to the L1 cache size (8192 32-bit words or 32 Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 clock cycles per loop (about 405,000,000 loops for this range).
    My current AMD Bulldozer CPU has a severe cache bottleneck and can't come close to this speed by a factor of about two.
    Which Bulldozer version do you have: original Bulldozer, Piledriver, Steamroller/Kaveri or Excavator/Carrizo?
    One of the originals, an FX8120 (4 core, 8 processor) @ 3.1 GHz.

    By the clock frequency, I assume you were running these tests on a high-end Intel CPU?

    Have you tried compiling eratspeed with a new version of GCC to see how it compares to Clang?

    I am currently working on a golang version of eratspeed and we'll see how it compares; it will be a better comparison than just using my simple PrimesSpeed test...

  • Gordonbgood at Jun 18, 2016 at 4:15 am
    Here is a golang version of Daniel Bernstein's "eratspeed", which is a straight translation of his C code to golang without any changes other than as necessary to make it work: https://play.golang.org/p/Sd6qlMQcHF.

    It won't run on the golang playground because it takes too long unless one changes the number of loops it runs, but at least one gets a nicely formatted listing.

    I haven't played with optimizing this other than turning off array bounds checks when compiling it (to make it more comparable to C), but it seems that "as-is" it runs at about the same speed as John's "primespeed" code from the beginning of this thread when that code has multi-threading disabled (force numprocs = 1 in the sieve.go file) and both have array bounds checking turned off. Under those conditions, both run slightly faster than Bernstein's C "primespeed" in 32-bit mode as compiled with GCC using his "make". Bernstein's C "eratspeed" under those same conditions runs about 17% slower than his "primespeed", which makes it about 20% slower than this golang version. Those results were all compiled and run on my machine, which isn't particularly fast or efficient.

    Neither have I yet looked at the machine code generated for this version of eratspeed, but it is likely helped by there being more constants, so there is less demand for registers.

    This isn't the ultimate in Sieve of Eratosthenes algorithms yet, as it could be maximally factorized (use a 2/3/5/7 wheel instead of the 2/3/5 one, and initialize the culling array with a pre-cull pattern that has the 11/13/17/19 prime composites eliminated) to save about another 25% to 30%; multi-threading could be applied for another speed-up by a factor of four on 4-core-plus-HT CPUs. It also needs some work to make the range extendable since, as currently written, it is hard-coded to sieve up to this specific range.
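    The pre-cull idea can be sketched in Go (the indexing here is an assumption for illustration, not Bernstein's exact layout, and only primes 3 and 5 are pre-culled to keep the pattern tiny): entry i of the sieve stands for the odd number 2*i+3, so multiples of 3 and 5 repeat with period 15 in i, and one 15-entry pattern can be tiled over the whole buffer instead of re-culling those primes for every segment.

```go
package main

import "fmt"

// preCull tiles a precomputed composite pattern for the primes 3 and 5
// over the sieve; a true entry marks a composite odd number 2*i+3.
func preCull(composite []bool) {
	var pattern [15]bool // period lcm(3,5) = 15 in this indexing
	for i := range pattern {
		n := 2*i + 3
		pattern[i] = n%3 == 0 || n%5 == 0
	}
	for i := range composite {
		composite[i] = pattern[i%15]
	}
	// The primes 3 and 5 marked their own positions; unmark them.
	composite[0] = false // n = 3
	composite[1] = false // n = 5
}

func main() {
	// Count primes below 50: pre-cull 3 and 5, then cull 7 normally
	// starting from 7*7 = 49.
	composite := make([]bool, 24) // odds 3..49
	preCull(composite)
	for k := 49; k < 50; k += 2 * 7 {
		composite[(k-3)/2] = true
	}
	count := 1 // account for the prime 2
	for _, c := range composite {
		if !c {
			count++
		}
	}
	fmt.Println(count) // → 15
}
```

    A real maximally-factorized sieve would extend the same trick to a pattern covering 11/13/17/19 and copy machine words rather than bools, but the tiling principle is the same.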

    Anyway, compare apples with apples, both algorithms true to those written by Bernstein.

  • ⚛ at Jun 18, 2016 at 10:35 am

    On Saturday, June 18, 2016 at 6:15:11 AM UTC+2, gordo...@gmail.com wrote:
    Here is a golang version of Daniel Bernstein's "eratspeed", which is a
    straight translation of his C code to golang without any changes other than
    as necessary to make it work: https://play.golang.org/p/Sd6qlMQcHF.
    $ perf stat --repeat=100 -- ./eratspeed-go1.7tip
      Performance counter stats for './eratspeed-go1.7tip' (100 runs):

             612.731838 task-clock (msec)        #    1.002 CPUs utilized            ( +-  0.06% )
                    129 context-switches         #    0.210 K/sec                    ( +-  0.50% )
                      7 cpu-migrations           #    0.012 K/sec                    ( +-  3.95% )
                    185 page-faults              #    0.302 K/sec                    ( +-  0.10% )
          2,442,624,802 cycles                   #    3.986 GHz                      ( +-  0.06% )
             33,146,164 stalled-cycles-frontend  #    1.36% frontend cycles idle     ( +-  0.33% )
            611,827,173 stalled-cycles-backend   #   25.05% backend cycles idle      ( +-  0.21% )
          6,851,374,060 instructions             #    2.80 insn per cycle
                                                 #    0.09 stalled cycles per insn   ( +-  0.00% )
          1,277,112,717 branches                 # 2084.293 M/sec                    ( +-  0.00% )
              2,924,448 branch-misses            #    0.23% of all branches          ( +-  0.09% )

            0.611729451 seconds time elapsed                                         ( +-  0.06% )

  • ⚛ at Jun 18, 2016 at 10:54 am

    On Saturday, June 18, 2016 at 12:21:21 AM UTC+2, gordo...@gmail.com wrote:
    On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote:
    I am not on a high end Intel CPU now, but when I was I found that with
    a buffer size adjusted to the L1 cache size (8192 32-bit words or 32
    Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5
    clock cycles per loop (about 405,000,000 loops for this range).
    My current AMD Bulldozer CPU has a severe cache bottleneck and can't
    come close to this speed by a factor of about two.
    Which Bulldozer version do you have: original Bulldozer, Piledriver,
    Steamroller/Kaveri or Excavator/Carrizo?

    One of the original's, a FX8120 (4 core, 8 processor) @ 3.1 GHz.
    Your CPU is bdver1. My CPU is bdver3.

    by the clock frequency, I assume you were running these tests on a high end
    Intel CPU?
    bdver3 CPU has some optimizations compared to bdver1, but I don't know
    whether this affects eratspeed code. Is your IPC lower than 2.80 when you
    compile https://play.golang.org/p/Sd6qlMQcHF with Go1.7-tip and run "perf
    stat --repeat=10 -- ./eratspeed-go1.7tip"?

    Have you tried compiling eratspeed with a new version of GCC to see how it
    compares to Clang?
    gcc-6.1 is slower. clang-3.9 is able to translate "bits = buf[pos]; bits |=
    data; buf[pos] = bits;" into a single x86 instruction equivalent to
    "buf[pos] |= data".
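    The two formulations being compared can be written in Go as follows (an illustrative sketch; the function names are mine). They are semantically identical; whether a compiler fuses the three-statement form into a single memory-operand OR, as clang-3.9 does, is exactly the codegen question discussed above.

```go
package main

import "fmt"

// orThreeStatements spells out the read, modify, and write as three steps.
func orThreeStatements(buf []uint32, pos int, data uint32) {
	bits := buf[pos]
	bits |= data
	buf[pos] = bits
}

// orDirect uses the compound-assignment form of the same operation.
func orDirect(buf []uint32, pos int, data uint32) {
	buf[pos] |= data
}

func main() {
	a := []uint32{10}
	b := []uint32{10}
	orThreeStatements(a, 0, 5)
	orDirect(b, 0, 5)
	fmt.Println(a[0], b[0]) // → 15 15
}
```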

  • Gordonbgood at Jun 16, 2016 at 9:13 am
    No real surprises with the no bounds checks option (-B), it just eliminated the array bounds checks with the rest of the code the same (version 1.7beta1):

      0x00dd 00221 (main.go:37) MOVQ DI, CX
      0x00e0 00224 (main.go:37) SHRQ $5, DI
      0x00e4 00228 (main.go:37) MOVL (AX)(DI*4), R9
      0x00e8 00232 (main.go:37) MOVQ CX, R10
      0x00eb 00235 (main.go:37) ANDQ $31, CX
      0x00ef 00239 (main.go:37) MOVL R8, R11
      0x00f2 00242 (main.go:37) SHLL CX, R8
      0x00f5 00245 (main.go:37) ORL R8, R9
      0x00f8 00248 (main.go:37) MOVL R9, (AX)(DI*4)
      0x00fc 00252 (main.go:36) LEAQ 3(R10)(SI*2), DI
      0x0101 00257 (main.go:37) MOVL R11, R8
      0x0104 00260 (main.go:36) CMPQ DI, DX
      0x0107 00263 (main.go:36) JLS $0, 221

    It is now almost as fast as C/C++ code; that it isn't quite as fast is for the same reasons as explained before: excessively using registers to store things and not using the read/modify/write instruction (which would also save the use of a register).

    The current beta works not too badly for amd64 code, but still doesn't use registers efficiently enough to support x86 code, as it uses too many registers: optimized C/C++ code uses only six or at most seven registers, which the x86 architecture has, but not the nine registers that the above requires.

    So for this tight loop, golang is still slower than optimized C/C++ code, but not by very much if array bounds checks are disabled.

  • Konstantin Khomoutov at Jun 16, 2016 at 11:06 am

    On Thu, 16 Jun 2016 02:12:56 -0700 (PDT) gordonbgood@gmail.com wrote: [...]
    The current beta will work not too badly with amd64 code but still
    doesn't use registers efficiently enough to support x86 code as it
    uses too many registers. Optimized C/C++ code only uses six or at
    most 7 registers, which the x86 architecture has, but not the nine
    registers that the above requires.

    So for this tight loop, golang is still slower than optimized C/C++
    code, but not by very much if array bounds checks are disabled.
    It appears you're well versed in the x86 process instruction set and
    code generation. Could you may be offer help to the folks working on
    improving the code generation backend of the gc compiler?

  • Gordonbgood at Jun 16, 2016 at 2:57 pm

    So for this tight loop, golang is still slower than optimized C/C++
    code, but not by very much if array bounds checks are disabled.
    It appears you're well versed in the x86 process instruction set and
    code generation. Could you may be offer help to the folks working on
    improving the code generation backend of the gc compiler?
    I am a machine language and assembler programmer from over 40 years ago, but I have little knowledge of writing compiler backends, and while I know what kinds of optimizations they make, I have no experience actually writing programs to achieve those optimizations. Thus I would not likely be able to make significant contributions to the gc backend other than to comment on what I see in the generated output, as here.

  • ⚛ at Jun 16, 2016 at 3:50 pm

    On Thursday, June 16, 2016 at 1:06:46 PM UTC+2, Konstantin Khomoutov wrote:
    On Thu, 16 Jun 2016 02:12:56 -0700 (PDT)
    gordo...@gmail.com wrote:

    [...]
    The current beta will work not too badly with amd64 code but still
    doesn't use registers efficiently enough to support x86 code as it
    uses too many registers. Optimized C/C++ code only uses six or at
    most 7 registers, which the x86 architecture has, but not the nine
    registers that the above requires.

    So for this tight loop, golang is still slower than optimized C/C++
    code, but not by very much if array bounds checks are disabled.
    It appears you're well versed in the x86 process instruction set and
    code generation. Could you may be offer help to the folks working on
    improving the code generation backend of the gc compiler?
    Will golang or google pay money for such a work?

  • Konstantin Khomoutov at Jun 16, 2016 at 4:36 pm

    On Thu, 16 Jun 2016 08:50:19 -0700 (PDT) ⚛ wrote:
    The current beta will work not too badly with amd64 code but
    still doesn't use registers efficiently enough to support x86
    code as it uses too many registers. Optimized C/C++ code only
    uses six or at most 7 registers, which the x86 architecture has,
    but not the nine registers that the above requires.

    So for this tight loop, golang is still slower than optimized C/C+
    + code, but not by very much if array bounds checks are disabled.
    It appears you're well versed in the x86 process instruction set
    and code generation. Could you may be offer help to the folks
    working on improving the code generation backend of the gc
    compiler?
    Will golang or google pay money for such a work?
    IANAG [*] but I believe it will not. ;-)

    My line of reasoning that resulted in the friendly nudge you're
    discussing was something like: "the person appears to be well-versed in
    this subject", "he is also interested in making Go programs go faster",
    "therefore, it would be good to contribute directly to where the speed
    is a concern".

    [*] I Am Not At Google.


Discussion Overview
group: golang-nuts
categories: go
posted: Jun 14, '16 at 12:31p
active: Jun 18, '16 at 10:54a
posts: 27
users: 6
website: golang.org

People

Translate

site design / logo © 2021 Grokbase