FAQ
Giovanni, a single call to copy() is not enough to replace the loop because
the slice segments may overlap (i.e. length > offset) and in such a case
the copy() builtin will not work as desired.

This is very interesting. I tried previously to optimize snappy decoding
without using asm (one attempt falling back to copy() in the tight loop
when feasible). I found it very hard to achieve speedups across all
decoding benchmarks. My next goal was indeed along these lines, to start
learning more asm to write an optimized decoder. So I will definitely look
at these changes, though I'm not qualified to do a real review.

- Bryan
On Friday, February 26, 2016 at 4:13:50 AM UTC-8, Giovanni Bajo wrote:
On Friday, February 26, 2016 at 8:36:45 AM UTC+1, Nigel Tao wrote:

As an experiment, I have rewritten the core of the Snappy decoder (and
just the decoder, not the encoder, for now) in amd64 asm:
Thanks, very interesting.

All the other benefits of a straightforward hand-written asm gets
another 1.2x over and above that. One example of that is that the Go
code holds src (a []byte) and its index s (an int) as two separate
variables, whereas the asm code can use a single register, for what C
would call a byte*.

Yes, and even if you use the slices "ala" pointers (advancing them as you
read data), the compiler can't still remove / cse all the (useless) len/cap
updates. It's the same in the domain of graphic programming for instance,
you need to go through bitmaps and process them, and you want to have the
equivalent of a pointer, but the slices cause excess overhead that is
usually very measurable.

Furthermore, Go code like this:

offset = int(src[s-2]) | int(src[s-1])<<8

is compiled by the Go 1.6 gc compiler (see
http://play.golang.org/p/vOs4Z7Qf1X for a listing) as:

0x0462 01122 (main.go:74) MOVQ AX, BP
0x0465 01125 (main.go:74) SUBQ $2, BP
0x0469 01129 (main.go:74) LEAQ (CX)(BP*1), BX
0x046d 01133 (main.go:74) MOVBQZX (BX), BX
0x0470 01136 (main.go:74) MOVQ AX, R8
0x0473 01139 (main.go:74) DECQ R8
0x0476 01142 (main.go:74) LEAQ (CX)(R8*1), BP
0x047a 01146 (main.go:74) MOVBQZX (BP), BP
0x047e 01150 (main.go:74) SHLQ $8, BP
0x0482 01154 (main.go:74) ORQ BP, BX
0x0485 01157 (main.go:74) MOVQ BX, R9

but the asm code can just use one instruction:

MOVWQZX -2(SI), DX
I use this when I need to trigger this optimization in Go:

func Read16LE(mem []byte) uint16 {
_ = mem[1] // trigger panic if out of bounds
return *(*uint16)(unsafe.Pointer(&mem[0]))
}

func Read32LE(mem []byte) uint32 {
_ = mem[3] // trigger panic if out of bounds
return *(*uint32)(unsafe.Pointer(&mem[0]))
}

From the compiler side, the problem is to first remove the extra bound
checks (realizing that only the last one is really necessary), and then
pattern-match the fetch+shift+or sequence.


Still, 1.15x and 1.2x combine to only 1.4x, not 3x.

The bigger kicker is an algorithmic change, rather than comparing the
compiler's code-gen with manual code-gen, although being in asm
possibly makes it easier to execute the algorithmic change. One of the
key inner copy loops is

for end := d + length; d != end; d++ {
dst[d] = dst[d-offset]
}

The pure Go code, and the straightforward asm code, does this one byte
at a time.

Why is it not possible to use copy() here?

Giovanni Bajo
--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 3 of 23 | next ›
Discussion Overview
groupgolang-dev @
categoriesgo
postedFeb 26, '16 at 7:36a
activeApr 24, '16 at 12:39a
posts23
users7
websitegolang.org

People

Translate

site design / logo © 2021 Grokbase