On Friday, February 26, 2016 at 8:36:45 AM UTC+1, Nigel Tao wrote:
As an experiment, I have rewritten the core of the Snappy decoder (and
just the decoder, not the encoder, for now) in amd64 asm:
Thanks, very interesting.

All the other benefits of a straightforward hand-written asm gets
another 1.2x over and above that. One example of that is that the Go
code holds src (a []byte) and its index s (an int) as two separate
variables, whereas the asm code can use a single register, for what C
would call a byte*.

Yes, and even if you use the slices "ala" pointers (advancing them as you
read data), the compiler can't still remove / cse all the (useless) len/cap
updates. It's the same in the domain of graphic programming for instance,
you need to go through bitmaps and process them, and you want to have the
equivalent of a pointer, but the slices cause excess overhead that is
usually very measurable.

Furthermore, Go code like this:

offset = int(src[s-2]) | int(src[s-1])<<8

is compiled by the Go 1.6 gc compiler (see
http://play.golang.org/p/vOs4Z7Qf1X for a listing) as:

0x0462 01122 (main.go:74) MOVQ AX, BP
0x0465 01125 (main.go:74) SUBQ $2, BP
0x0469 01129 (main.go:74) LEAQ (CX)(BP*1), BX
0x046d 01133 (main.go:74) MOVBQZX (BX), BX
0x0470 01136 (main.go:74) MOVQ AX, R8
0x0473 01139 (main.go:74) DECQ R8
0x0476 01142 (main.go:74) LEAQ (CX)(R8*1), BP
0x047a 01146 (main.go:74) MOVBQZX (BP), BP
0x047e 01150 (main.go:74) SHLQ $8, BP
0x0482 01154 (main.go:74) ORQ BP, BX
0x0485 01157 (main.go:74) MOVQ BX, R9

but the asm code can just use one instruction:

I use this when I need to trigger this optimization in Go:

func Read16LE(mem []byte) uint16 {
_ = mem[1] // trigger panic if out of bounds
return *(*uint16)(unsafe.Pointer(&mem[0]))

func Read32LE(mem []byte) uint32 {
_ = mem[3] // trigger panic if out of bounds
return *(*uint32)(unsafe.Pointer(&mem[0]))

From the compiler side, the problem is to first remove the extra bound
checks (realizing that only the last one is really necessary), and then
pattern-match the fetch+shift+or sequence.

Still, 1.15x and 1.2x combine to only 1.4x, not 3x.

The bigger kicker is an algorithmic change, rather than comparing the
compiler's code-gen with manual code-gen, although being in asm
possibly makes it easier to execute the algorithmic change. One of the
key inner copy loops is

for end := d + length; d != end; d++ {
dst[d] = dst[d-offset]

The pure Go code, and the straightforward asm code, does this one byte
at a time.

Why is it not possible to use copy() here?

Giovanni Bajo

You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

Discussion Posts


Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 2 of 23 | next ›
Discussion Overview
groupgolang-dev @
postedFeb 26, '16 at 7:36a
activeApr 24, '16 at 12:39a



site design / logo © 2021 Grokbase