FAQ
As an experiment, I have rewritten the core of the Snappy decoder (and
just the decoder, not the encoder, for now) in amd64 asm:

https://github.com/golang/snappy/commit/402436317ad8035a7246ee89492064f9e6cbb4ce
https://github.com/golang/snappy/commit/8c7c9dec5965484f0a81268ce7985fe31e5d5955
https://github.com/golang/snappy/commit/4c1fc8e426266f00229956994142877543e8b514
https://github.com/golang/snappy/commit/427fb6fc07997f43afa32f35e850833760e489a7

These are github commits that have not gone through gerrit, and
assembly programming is a fraught endeavor, so I would appreciate a
code reviewer.



Snappy is a compression format, like flate (gzip, zlib) or LZW, but
much simpler. It has LZ77-style length/offset back-references, and
that's it. No Huffman encoding (unlike flate). Byte-aligned output
(unlike LZW). The format is specified at
https://github.com/google/snappy/blob/master/format_description.txt

The punchline is, say, a 3x improvement in throughput compared to the
pure Go implementation:

benchmark old MB/s new MB/s speedup
BenchmarkWordsDecode1e1-8 498.83 508.74 1.02x
BenchmarkWordsDecode1e2-8 445.12 962.52 2.16x
BenchmarkWordsDecode1e3-8 530.29 1435.51 2.71x
BenchmarkWordsDecode1e4-8 361.08 1514.02 4.19x
BenchmarkWordsDecode1e5-8 270.69 807.73 2.98x
BenchmarkWordsDecode1e6-8 290.91 892.24 3.07x
Benchmark_UFlat0-8 543.87 2200.22 4.05x
Benchmark_UFlat1-8 449.84 1446.09 3.21x
Benchmark_UFlat2-8 15511.96 14706.88 0.95x
Benchmark_UFlat3-8 873.92 1787.82 2.05x
Benchmark_UFlat4-8 2978.58 10683.24 3.59x
Benchmark_UFlat5-8 536.04 1965.33 3.67x
Benchmark_UFlat6-8 278.33 833.52 2.99x
Benchmark_UFlat7-8 271.63 792.85 2.92x
Benchmark_UFlat8-8 288.86 854.75 2.96x
Benchmark_UFlat9-8 262.13 730.21 2.79x
Benchmark_UFlat10-8 640.03 2775.98 4.34x
Benchmark_UFlat11-8 356.37 1037.94 2.91x

and these new MB/s numbers are comparable to what I get from the C++
snappy implementation's benchmark tests:

BM_UFlat/0 2.4GB/s html
BM_UFlat/1 1.4GB/s urls
BM_UFlat/2 21.1GB/s jpg
BM_UFlat/3 1.5GB/s jpg_200
BM_UFlat/4 10.2GB/s pdf
BM_UFlat/5 2.1GB/s html4
BM_UFlat/6 990.6MB/s txt1
BM_UFlat/7 930.1MB/s txt2
BM_UFlat/8 1.0GB/s txt3
BM_UFlat/9 849.7MB/s txt4
BM_UFlat/10 2.9GB/s pb
BM_UFlat/11 1.2GB/s gaviota

Digging a little deeper, compared to the pure Go codegen...

Just passing -B to the Go compiler, to disable bounds checking, gets
about a 1.15x improvement.

All the other benefits of a straightforward hand-written asm gets
another 1.2x over and above that. One example of that is that the Go
code holds src (a []byte) and its index s (an int) as two separate
variables, whereas the asm code can use a single register, for what C
would call a byte*. Furthermore, Go code like this:

offset = int(src[s-2]) | int(src[s-1])<<8

is compiled by the Go 1.6 gc compiler (see
http://play.golang.org/p/vOs4Z7Qf1X for a listing) as:

0x0462 01122 (main.go:74) MOVQ AX, BP
0x0465 01125 (main.go:74) SUBQ $2, BP
0x0469 01129 (main.go:74) LEAQ (CX)(BP*1), BX
0x046d 01133 (main.go:74) MOVBQZX (BX), BX
0x0470 01136 (main.go:74) MOVQ AX, R8
0x0473 01139 (main.go:74) DECQ R8
0x0476 01142 (main.go:74) LEAQ (CX)(R8*1), BP
0x047a 01146 (main.go:74) MOVBQZX (BP), BP
0x047e 01150 (main.go:74) SHLQ $8, BP
0x0482 01154 (main.go:74) ORQ BP, BX
0x0485 01157 (main.go:74) MOVQ BX, R9

but the asm code can just use one instruction:

MOVWQZX -2(SI), DX

Still, 1.15x and 1.2x combine to only 1.4x, not 3x.

The bigger kicker is an algorithmic change, rather than comparing the
compiler's code-gen with manual code-gen, although being in asm
possibly makes it easier to execute the algorithmic change. One of the
key inner copy loops is

for end := d + length; d != end; d++ {
   dst[d] = dst[d-offset]
}

The pure Go code, and the straightforward asm code, does this one byte
at a time. The faster asm code does this 8 bytes at a time, for which
there are at least two things to watch for. One is overrunning:
copying 16 bytes when you only wanted 10. This is especially dangerous
if you can possibly read or write past the end of your buffers. The
other is overlapping: a small offset means that a naive 8-byte load
followed by an 8-byte store will not give the same result as a
byte-by-byte copy, since in the basic loop, later iterations's reads
pick up the earlier iteration's writes. Still, both these problems are
surmountable, although it does mean that the asm code is not a
line-for-line translation of the more understandable Go code. For
example, I find
https://github.com/golang/snappy/commit/427fb6fc07997f43afa32f35e850833760e489a7
to be quite tricky. As I mentioned, I would appreciate a code review.

Applying the "8 bytes at a time" trick several times multiplies up to
a dramatic further 2.2x performance gain (3.0x / 1.4x), although it is
reliant on the fact that unaligned 8 byte load/stores are possible,
and cheap, on amd64. This technique probably wouldn't be as effective
on architectures that are fussier about alignment. So, it may be
possible to factor the amd64 asm implementation to move more of the
tricks into (theoretically) cross-platform Go code and less of it in
amd64 asm code, while still retaining most of the performance, but I'm
not sure if the end result will perform well at all on e.g. arm CPUs.

A note on bounds checking: I believe that, starting with knowing that
0 <= s && s < len(src), I've been careful with saying things like:
if length > len(src)-s { etc }
instead of
if s+length > len(src) { etc }
in case of overflow, and similarly with
s += 4; if uint(s) > uint(len(src)) { etc }
instead of
s += 4; if s > len(src) { etc }
but, as I mentioned, a code reviewer would be appreciated. There may
be other traps that I've missed.

I repeat that this is an experiment that I've been playing with on and
off for a couple of weeks. I'm sharing some interesting numbers, but I
wouldn't draw too much from it. For example, I haven't tried the SSA
branch of the compiler. I presume that it'd be easy for the SSA folks
to check out the pure Go only revision of github.com/golang/snappy at
03ee571c if they want to play along.

There may also be a further 1.1x or 1.2x gain we could eke out of the
asm here, and I'm happy to review suggestions, but given that we're
getting close to the C++ implementation, I'd be surprised if there's
another 2x improvement to be had easily. I should also get back to
some other work that I've been neglecting.

Comments welcome. Also, if you've read this far, I'd like a code reviewer. :-)

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Search Discussions

Discussion Posts

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 1 of 23 | next ›
Discussion Overview
groupgolang-dev @
categoriesgo
postedFeb 26, '16 at 7:36a
activeApr 24, '16 at 12:39a
posts23
users7
websitegolang.org

People

Translate

site design / logo © 2021 Grokbase