As an experiment, I have rewritten the core of the Snappy decoder (and
just the decoder, not the encoder, for now) in amd64 asm:

https://github.com/golang/snappy/commit/402436317ad8035a7246ee89492064f9e6cbb4ce
https://github.com/golang/snappy/commit/8c7c9dec5965484f0a81268ce7985fe31e5d5955
https://github.com/golang/snappy/commit/4c1fc8e426266f00229956994142877543e8b514
https://github.com/golang/snappy/commit/427fb6fc07997f43afa32f35e850833760e489a7

These are github commits that have not gone through gerrit, and
assembly programming is a fraught endeavor, so I would appreciate a
code reviewer.



Snappy is a compression format, like flate (gzip, zlib) or LZW, but
much simpler. It has LZ77-style length/offset back-references, and
that's it. No Huffman encoding (unlike flate). Byte-aligned output
(unlike LZW). The format is specified at
https://github.com/google/snappy/blob/master/format_description.txt
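
As a tiny worked example of the wire format (hand-assembled from the spec,
so treat it as illustrative rather than authoritative), the bytes below
decode to "xababab": a varint uncompressed length, a 3-byte literal, then a
length=4 copy whose 2-byte little-endian offset is 2:

package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	// 0x07:             uncompressed length 7 (varint)
	// 0x08 'x' 'a' 'b': literal of length 3 (tag 0x08 = (3-1)<<2 | 0b00)
	// 0x0e 0x02 0x00:   copy, length 4, offset 2 (tag 0x0e = (4-1)<<2 | 0b10)
	enc := []byte("\x07\x08xab\x0e\x02\x00")
	dec, err := snappy.Decode(nil, enc)
	fmt.Printf("%q %v\n", dec, err) // prints "xababab" <nil>
}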

The punchline is, say, a 3x improvement in throughput compared to the
pure Go implementation:

benchmark old MB/s new MB/s speedup
BenchmarkWordsDecode1e1-8 498.83 508.74 1.02x
BenchmarkWordsDecode1e2-8 445.12 962.52 2.16x
BenchmarkWordsDecode1e3-8 530.29 1435.51 2.71x
BenchmarkWordsDecode1e4-8 361.08 1514.02 4.19x
BenchmarkWordsDecode1e5-8 270.69 807.73 2.98x
BenchmarkWordsDecode1e6-8 290.91 892.24 3.07x
Benchmark_UFlat0-8 543.87 2200.22 4.05x
Benchmark_UFlat1-8 449.84 1446.09 3.21x
Benchmark_UFlat2-8 15511.96 14706.88 0.95x
Benchmark_UFlat3-8 873.92 1787.82 2.05x
Benchmark_UFlat4-8 2978.58 10683.24 3.59x
Benchmark_UFlat5-8 536.04 1965.33 3.67x
Benchmark_UFlat6-8 278.33 833.52 2.99x
Benchmark_UFlat7-8 271.63 792.85 2.92x
Benchmark_UFlat8-8 288.86 854.75 2.96x
Benchmark_UFlat9-8 262.13 730.21 2.79x
Benchmark_UFlat10-8 640.03 2775.98 4.34x
Benchmark_UFlat11-8 356.37 1037.94 2.91x

and these new MB/s numbers are comparable to what I get from the C++
snappy implementation's benchmark tests:

BM_UFlat/0 2.4GB/s html
BM_UFlat/1 1.4GB/s urls
BM_UFlat/2 21.1GB/s jpg
BM_UFlat/3 1.5GB/s jpg_200
BM_UFlat/4 10.2GB/s pdf
BM_UFlat/5 2.1GB/s html4
BM_UFlat/6 990.6MB/s txt1
BM_UFlat/7 930.1MB/s txt2
BM_UFlat/8 1.0GB/s txt3
BM_UFlat/9 849.7MB/s txt4
BM_UFlat/10 2.9GB/s pb
BM_UFlat/11 1.2GB/s gaviota

Digging a little deeper, compared to the pure Go codegen...

Just passing -B to the Go compiler, to disable bounds checking, gets
about a 1.15x improvement.
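
(Concretely, that means building the benchmarks with something like
"go test -gcflags=-B -bench=." instead of the default flags.)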

All the other benefits of straightforward hand-written asm get
another 1.2x over and above that. One example of that is that the Go
code holds src (a []byte) and its index s (an int) as two separate
variables, whereas the asm code can use a single register, for what C
would call a byte*. Furthermore, Go code like this:

offset = int(src[s-2]) | int(src[s-1])<<8

is compiled by the Go 1.6 gc compiler (see
http://play.golang.org/p/vOs4Z7Qf1X for a listing) as:

0x0462 01122 (main.go:74) MOVQ AX, BP
0x0465 01125 (main.go:74) SUBQ $2, BP
0x0469 01129 (main.go:74) LEAQ (CX)(BP*1), BX
0x046d 01133 (main.go:74) MOVBQZX (BX), BX
0x0470 01136 (main.go:74) MOVQ AX, R8
0x0473 01139 (main.go:74) DECQ R8
0x0476 01142 (main.go:74) LEAQ (CX)(R8*1), BP
0x047a 01146 (main.go:74) MOVBQZX (BP), BP
0x047e 01150 (main.go:74) SHLQ $8, BP
0x0482 01154 (main.go:74) ORQ BP, BX
0x0485 01157 (main.go:74) MOVQ BX, R9

but the asm code can just use one instruction:

MOVWQZX -2(SI), DX

Still, 1.15x and 1.2x combine to only 1.4x, not 3x.

The bigger kicker is an algorithmic change, rather than comparing the
compiler's code-gen with manual code-gen, although being in asm
possibly makes it easier to execute the algorithmic change. One of the
key inner copy loops is

for end := d + length; d != end; d++ {
   dst[d] = dst[d-offset]
}

Both the pure Go code and the straightforward asm code do this one byte
at a time. The faster asm code does this 8 bytes at a time, for which
there are at least two things to watch for. One is overrunning:
copying 16 bytes when you only wanted 10. This is especially dangerous
if you can possibly read or write past the end of your buffers. The
other is overlapping: a small offset means that a naive 8-byte load
followed by an 8-byte store will not give the same result as a
byte-by-byte copy, since in the basic loop, later iterations' reads
pick up earlier iterations' writes. Still, both these problems are
surmountable, although it does mean that the asm code is not a
line-for-line translation of the more understandable Go code. For
example, I find
https://github.com/golang/snappy/commit/427fb6fc07997f43afa32f35e850833760e489a7
to be quite tricky. As I mentioned, I would appreciate a code review.
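
To give a flavor of the trick in Go rather than asm, here is a rough sketch
(not the actual implementation; the helper name is made up, and it assumes
dst has at least 8 bytes of slack past d+length, which the stores are
allowed to clobber):

// Sketch only. Requires: import "encoding/binary".
func copyBackRef(dst []byte, d, offset, length int) int {
	s, end := d-offset, d+length
	// While the gap between the read and write positions is under 8 bytes,
	// an 8-byte copy overlaps itself. Doing it anyway still writes a correct
	// prefix of (d-s) bytes, so advance by only that much; the gap doubles
	// each time around, and the garbage tail is overwritten by a later store.
	for d-s < 8 && d < end {
		binary.LittleEndian.PutUint64(dst[d:], binary.LittleEndian.Uint64(dst[s:]))
		d += d - s
	}
	// The gap is now at least 8, so plain 8-byte chunks no longer overlap.
	for ; d < end; d, s = d+8, s+8 {
		binary.LittleEndian.PutUint64(dst[d:], binary.LittleEndian.Uint64(dst[s:]))
	}
	return end
}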

Applying the "8 bytes at a time" trick several times multiplies up to
a dramatic further 2.2x performance gain (3.0x / 1.4x), although it is
reliant on the fact that unaligned 8 byte load/stores are possible,
and cheap, on amd64. This technique probably wouldn't be as effective
on architectures that are fussier about alignment. So, it may be
possible to factor the amd64 asm implementation to move more of the
tricks into (theoretically) cross-platform Go code and less of it in
amd64 asm code, while still retaining most of the performance, but I'm
not sure if the end result will perform well at all on e.g. arm CPUs.

A note on bounds checking: I believe that, starting with knowing that
0 <= s && s < len(src), I've been careful with saying things like:
if length > len(src)-s { etc }
instead of
if s+length > len(src) { etc }
in case of overflow, and similarly with
s += 4; if uint(s) > uint(len(src)) { etc }
instead of
s += 4; if s > len(src) { etc }
but, as I mentioned, a code reviewer would be appreciated. There may
be other traps that I've missed.
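
To make the overflow concern concrete, here is a toy example (the values are
hypothetical, just to show the failure mode):

package main

import (
	"fmt"
	"math"
)

func main() {
	src := make([]byte, 100)
	// On amd64, int is 64 bits, so a hostile, absurdly large length fits.
	s, length := 1, math.MaxInt64
	fmt.Println(s+length > len(src)) // false: s+length wrapped around to a negative number
	fmt.Println(length > len(src)-s) // true: rearranged, nothing overflows
}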

I repeat that this is an experiment that I've been playing with on and
off for a couple of weeks. I'm sharing some interesting numbers, but I
wouldn't draw too much from it. For example, I haven't tried the SSA
branch of the compiler. I presume that it'd be easy for the SSA folks
to check out the pure-Go-only revision of github.com/golang/snappy at
03ee571c if they want to play along.

There may also be a further 1.1x or 1.2x gain we could eke out of the
asm here, and I'm happy to review suggestions, but given that we're
getting close to the C++ implementation, I'd be surprised if there's
another 2x improvement to be had easily. I should also get back to
some other work that I've been neglecting.

Comments welcome. Also, if you've read this far, I'd like a code reviewer. :-)


  • Giovannibajo at Feb 26, 2016 at 12:13 pm

    On Friday, February 26, 2016 at 8:36:45 AM UTC+1, Nigel Tao wrote:
    As an experiment, I have rewritten the core of the Snappy decoder (and
    just the decoder, not the encoder, for now) in amd64 asm:
    Thanks, very interesting.

    All the other benefits of straightforward hand-written asm get
    another 1.2x over and above that. One example of that is that the Go
    code holds src (a []byte) and its index s (an int) as two separate
    variables, whereas the asm code can use a single register, for what C
    would call a byte*.

    Yes, and even if you use slices à la pointers (advancing them as you
    read data), the compiler still can't remove / CSE all the (useless) len/cap
    updates. It's the same in graphics programming, for instance: you need to
    walk through bitmaps and process them, and you want the equivalent of a
    pointer, but the slices cause excess overhead that is usually very
    measurable.

    Furthermore, Go code like this:

    offset = int(src[s-2]) | int(src[s-1])<<8

    is compiled by the Go 1.6 gc compiler (see
    http://play.golang.org/p/vOs4Z7Qf1X for a listing) as:

    0x0462 01122 (main.go:74) MOVQ AX, BP
    0x0465 01125 (main.go:74) SUBQ $2, BP
    0x0469 01129 (main.go:74) LEAQ (CX)(BP*1), BX
    0x046d 01133 (main.go:74) MOVBQZX (BX), BX
    0x0470 01136 (main.go:74) MOVQ AX, R8
    0x0473 01139 (main.go:74) DECQ R8
    0x0476 01142 (main.go:74) LEAQ (CX)(R8*1), BP
    0x047a 01146 (main.go:74) MOVBQZX (BP), BP
    0x047e 01150 (main.go:74) SHLQ $8, BP
    0x0482 01154 (main.go:74) ORQ BP, BX
    0x0485 01157 (main.go:74) MOVQ BX, R9

    but the asm code can just use one instruction:

    MOVWQZX -2(SI), DX
    I use this when I need to trigger this optimization in Go:

    func Read16LE(mem []byte) uint16 {
        _ = mem[1] // trigger panic if out of bounds
        return *(*uint16)(unsafe.Pointer(&mem[0]))
    }

    func Read32LE(mem []byte) uint32 {
        _ = mem[3] // trigger panic if out of bounds
        return *(*uint32)(unsafe.Pointer(&mem[0]))
    }

    From the compiler side, the problem is to first remove the extra bounds
    checks (realizing that only the last one is really necessary), and then
    pattern-match the fetch+shift+or sequence.
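
    For comparison, the portable fetch+shift+or form that a sufficiently
    smart compiler could collapse into that same single load might look like
    this (a sketch; the name is made up):

    func read16LEPortable(mem []byte) uint16 {
        _ = mem[1] // hint: one bounds check can cover both byte loads below
        return uint16(mem[0]) | uint16(mem[1])<<8
    }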


    Still, 1.15x and 1.2x combine to only 1.4x, not 3x.

    The bigger kicker is an algorithmic change, rather than comparing the
    compiler's code-gen with manual code-gen, although being in asm
    possibly makes it easier to execute the algorithmic change. One of the
    key inner copy loops is

    for end := d + length; d != end; d++ {
        dst[d] = dst[d-offset]
    }

    Both the pure Go code and the straightforward asm code do this one byte
    at a time.

    Why is it not possible to use copy() here?

    Giovanni Bajo

  • Bryan Matsuo at Feb 26, 2016 at 8:33 pm
    Giovanni, a single call to copy() is not enough to replace the loop because
    the slice segments may overlap (i.e. length > offset) and in such a case
    the copy() builtin will not work as desired.

    This is very interesting. I tried previously to optimize snappy decoding
    without using asm (one attempt falling back to copy() in the tight loop
    when feasible). I found it very hard to achieve speedups across all
    decoding benchmarks. My next goal was indeed along these lines, to start
    learning more asm to write an optimized decoder. So I will definitely look
    at these changes, though I'm not qualified to do a real review.

    - Bryan
  • Ian Lance Taylor at Feb 26, 2016 at 8:56 pm

    On Fri, Feb 26, 2016 at 11:50 AM, Bryan Matsuo wrote:
    Giovanni, a single call to copy() is not enough to replace the loop because
    the slice segments may overlap (i.e. length > offset) and in such a case the
    copy() builtin will not work as desired.
    The copy function will handle overlapping slices correctly. It's
    equivalent to memmove, not memcpy.

    Ian

  • Bryan Matsuo at Feb 26, 2016 at 9:04 pm

    On Friday, February 26, 2016 at 12:56:35 PM UTC-8, Ian Lance Taylor wrote:
    The copy function will handle overlapping slices correctly. It's
    equivalent to memmove, not memcpy.
    The memmove semantics are not correct *for this case*. The manpage for
    memmove says "copying takes place as though the bytes in src are first
    copied into a temporary array that does not overlap src or dest, and the
    bytes are copied from the temporary array to dest." I understand that is
    typically desirable. But that is exactly what you do *not* want to happen
    during snappy decoding. Forgive my ignorance of the
    technical terms, but the following is lifted from the snappy specification
    (section 2.2).

    As in most LZ77-based compressors, the length can be larger than the offset,
    yielding a form of run-length encoding (RLE). For instance,
    "xababab" could be encoded as
    <literal: "xab"> <copy: offset=2 length=4>

    - Bryan

  • Joe Tsai at Feb 26, 2016 at 11:36 pm
    Using copy directly doesn't work, but with some adjustments, we can still
    use it.

    Nigel, can you try something similar to my simplified version of
    flate.forwardCopy in CL/16526
    <https://go-review.googlesource.com/#/c/16526/>. Since snappy doesn't have
    a rolling window, you only need the bottom half copyHist method. I'm
    curious how much (if anything) this improves the pure Go version of snappy.

    JT
  • Nigel Tao at Mar 3, 2016 at 4:29 am

    On Sat, Feb 27, 2016 at 10:36 AM, Joe Tsai wrote:
    Nigel, can you try something similar to my simplified version of
    flate.forwardCopy in CL/16526. Since snappy doesn't have a rolling window,
    you only need the bottom half copyHist method. I'm curious how much (if
    anything) this improves the pure Go version of snappy.
    I replaced

    for end := d + length; d != end; d++ {
       dst[d] = dst[d-offset]
    }

    with

    for end := d + length; d != end; {
       d += copy(dst[d:end], dst[d-offset:d])
    }

    in decode_other.go. Some numbers got better, some numbers got worse.
    It's far from a clear win.

    benchmark old MB/s new MB/s speedup
    BenchmarkWordsDecode1e1-8 438.16 437.10 1.00x
    BenchmarkWordsDecode1e2-8 445.87 370.46 0.83x
    BenchmarkWordsDecode1e3-8 544.92 461.88 0.85x
    BenchmarkWordsDecode1e4-8 364.79 344.74 0.95x
    BenchmarkWordsDecode1e5-8 281.25 250.36 0.89x
    BenchmarkWordsDecode1e6-8 304.35 276.73 0.91x
    Benchmark_UFlat0-8 583.71 755.21 1.29x
    Benchmark_UFlat1-8 468.82 480.53 1.02x
    Benchmark_UFlat2-8 14981.84 15035.37 1.00x
    Benchmark_UFlat3-8 929.63 515.97 0.56x
    Benchmark_UFlat4-8 3088.96 6012.81 1.95x
    Benchmark_UFlat5-8 569.76 726.51 1.28x
    Benchmark_UFlat6-8 288.07 250.63 0.87x
    Benchmark_UFlat7-8 279.56 244.18 0.87x
    Benchmark_UFlat8-8 299.00 262.31 0.88x
    Benchmark_UFlat9-8 269.44 231.40 0.86x
    Benchmark_UFlat10-8 687.79 1093.23 1.59x
    Benchmark_UFlat11-8 372.22 328.20 0.88x

    Perhaps the copy lengths are usually so small that the overhead of
    calling runtime.memmove isn't worth it. Perhaps it's something else.
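
    (One reason to expect small copies: each copy call in that loop moves at
    most offset bytes, because the source slice dst[d-offset:d] is only offset
    bytes long, so small offsets force many iterations and many memmove calls.)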

  • Nigel Tao at Mar 3, 2016 at 4:59 am

    On Thu, Mar 3, 2016 at 3:29 PM, Nigel Tao wrote:
    Perhaps the copy lengths are usually so small that the overhead of
    calling runtime.memmove isn't worth it. Perhaps it's something else.
    I suppose that you could rig it so that you choose:

    if length < L || offset < O {
       use basic "dst[d] = dst[d-offset]" loop
    } else {
       use fancier builtin-copy-based loop
    }

    for some magic numbers L and O, but at least on amd64, that still
    doesn't get close to the asm version, and it's not obvious that
    whatever a good L and O is for one CPU is necessarily good for
    another. I'm open to arguments to the contrary, but I'm not convinced
    that it's worth making the decode_other.go simple implementation that
    little bit more complicated. If, in the future, we also need e.g. a
    really fast arm implementation, I'd be inclined to write an arm asm
    version, again bypassing the decode_other.go version.
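
    For concreteness, such a hybrid might be sketched like this (the constant
    values are placeholders, not tuned numbers):

    const (
        minCopyLength = 8 // "L"
        minCopyOffset = 0 // "O"
    )

    func backRef(dst []byte, d, offset, length int) int {
        end := d + length
        if length < minCopyLength || offset < minCopyOffset {
            // Byte at a time: cheap for short copies, and handles the
            // overlapping (RLE) case naturally.
            for ; d != end; d++ {
                dst[d] = dst[d-offset]
            }
            return end
        }
        // copy-based: each call moves at most offset bytes, and the source
        // and destination slices are adjacent, so they never overlap.
        for d != end {
            d += copy(dst[d:end], dst[d-offset:d])
        }
        return end
    }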

  • Joe Tsai at Mar 3, 2016 at 5:03 am
    I just printed a histogram of the lengths; it seems that the files that
    perform poorly have an order of magnitude more lengths of 4 and below than
    any other (I didn't know that snappy allows lengths < 4).

    I also tried the dual version, with L=8, O=0:
    BenchmarkWordsDecode1e1-4 409.04 402.39 0.98x
    BenchmarkWordsDecode1e2-4 413.11 396.93 0.96x
    BenchmarkWordsDecode1e3-4 501.41 487.86 0.97x
    BenchmarkWordsDecode1e4-4 332.99 309.59 0.93x
    BenchmarkWordsDecode1e5-4 256.14 256.74 1.00x
    BenchmarkWordsDecode1e6-4 277.36 277.56 1.00x
    Benchmark_UFlat0-4 533.12 735.90 1.38x
    Benchmark_UFlat1-4 428.48 479.35 1.12x
    Benchmark_UFlat2-4 13939.92 13586.29 0.97x
    Benchmark_UFlat3-4 836.60 697.90 0.83x
    Benchmark_UFlat4-4 2848.84 5326.47 1.87x
    Benchmark_UFlat5-4 520.65 691.76 1.33x
    Benchmark_UFlat6-4 263.59 261.69 0.99x
    Benchmark_UFlat7-4 254.36 258.57 1.02x
    Benchmark_UFlat8-4 272.94 275.24 1.01x
    Benchmark_UFlat9-4 248.50 245.86 0.99x
    Benchmark_UFlat10-4 636.17 1051.09 1.65x
    Benchmark_UFlat11-4 340.62 361.82 1.06x

    Yeah, I agree that using the builtin copy doesn't seem to provide
    significant gains for snappy.
  • Giovanni Bajo at Feb 27, 2016 at 2:45 pm

    Thanks, I was aware of how LZ-based pattern matching works and this
    specific trick as well, but I got confused thinking that copy() (aka
    memmove) could handle that. Your quote from the manpage is very clear.

    I prepared a version that calls copy() in a loop, using the longest
    available sequence each time:
    http://play.golang.org/p/y_y9JfX6O_

    I guess it's the best you can do in pure Go.

    Giovanni

  • Nigel Tao at Feb 27, 2016 at 7:32 am

    On Fri, Feb 26, 2016 at 6:36 PM, Nigel Tao wrote:
    Still, 1.15x and 1.2x combine to only 1.4x, not 3x.
    It occurs to me that this accounting (1.15x for bounds elimination,
    1.2x for general asm-ness, 2.2x for algorithmic changes) may be
    under-estimating the impact of the early effects and over-estimating
    the later ones.

    As an analogy, suppose that a loop takes 3 seconds, and there are two
    independent optimizations, each of which take 1 second off that. They
    really have the same-sized effect, but whichever one is tried first
    will be measured as a 1.5x improvement (3/2), and whichever one is
    tried last will be measured as a 2x improvement (2/1).

    It gets more complicated if the various optimizations are not
    independent, but the point might still be generally applicable.

  • Aaron Jacobs at Feb 28, 2016 at 9:13 pm
    Thanks for the careful write-up, Nigel; that was really interesting reading.
    When you settle on a final form (e.g. after you try suggestions from this
    thread), I'm happy to review if you make sure there are good comments
    throughout.
  • Nigel Tao at Mar 3, 2016 at 4:34 am

    On Mon, Feb 29, 2016 at 8:13 AM, Aaron Jacobs wrote:
    Thanks for the careful write-up, Nigel; that was really interesting reading.
    When you settle on a final form (e.g. after you try suggestions from this
    thread), I'm happy to review if you make sure there are good comments
    throughout.
    I think it's settled on a final form, and it is commented. Obviously,
    if you find any particular section under-commented, please let me
    know.

  • Aaron Jacobs at Mar 3, 2016 at 4:36 am

    On Thu, Mar 3, 2016 at 3:34 PM, Nigel Tao wrote:
    I think it's settled on a final form, and it is commented. Obviously,
    if you find any particular section under-commented, please let me
    know.
    Great. Can you add me to a pull request or a gerrit review or something?

  • Nigel Tao at Mar 3, 2016 at 4:40 am

    On Thu, Mar 3, 2016 at 3:36 PM, Aaron Jacobs wrote:
    Great. Can you add me to a pull request or a gerrit review or something?
    It's GitHub, not Gerrit, and it's already been committed; it's not a
    pull request. The commits are:

    https://github.com/golang/snappy/commit/402436317ad8035a7246ee89492064f9e6cbb4ce
    https://github.com/golang/snappy/commit/8c7c9dec5965484f0a81268ce7985fe31e5d5955
    https://github.com/golang/snappy/commit/4c1fc8e426266f00229956994142877543e8b514
    https://github.com/golang/snappy/commit/427fb6fc07997f43afa32f35e850833760e489a7

  • Aaron Jacobs at Mar 3, 2016 at 9:30 am

    Right, I thought you were developing on a branch (so you could make a pull
    request for master) until you got a review, but I see you've already pushed to
    master. I guess I'll leave comments on the commits.

  • Nigel Tao at Mar 3, 2016 at 11:58 pm

    On Fri, Feb 26, 2016 at 6:36 PM, Nigel Tao wrote:
    For example, I haven't tried the SSA branch of the compiler.
    This doesn't have to do with the asm code per se, but for
    completeness, here are benchmark numbers for the new amd64 SSA code-gen
    compared to the old code-gen (i.e. with GOSSAHASH=x). For those trying
    to reproduce this, remember to disable the decode_amd64.* asm code and
    enable decode_other.go.

    First, with bounds checking (i.e. the default settings):

    benchmark old MB/s new MB/s speedup
    BenchmarkWordsDecode1e1-8 408.62 440.50 1.08x
    BenchmarkWordsDecode1e2-8 442.44 500.73 1.13x
    BenchmarkWordsDecode1e3-8 537.35 630.78 1.17x
    BenchmarkWordsDecode1e4-8 366.39 399.47 1.09x
    BenchmarkWordsDecode1e5-8 278.43 285.78 1.03x
    BenchmarkWordsDecode1e6-8 300.55 305.22 1.02x
    BenchmarkWordsEncode1e1-8 5.69 6.08 1.07x
    BenchmarkWordsEncode1e2-8 44.65 48.85 1.09x
    BenchmarkWordsEncode1e3-8 157.16 180.46 1.15x
    BenchmarkWordsEncode1e4-8 181.94 175.48 0.96x
    BenchmarkWordsEncode1e5-8 134.33 133.51 0.99x
    BenchmarkWordsEncode1e6-8 159.40 160.70 1.01x
    BenchmarkRandomEncode-8 3670.93 3830.06 1.04x
    Benchmark_UFlat0-8 582.36 616.82 1.06x
    Benchmark_UFlat1-8 468.66 495.64 1.06x
    Benchmark_UFlat2-8 15364.06 15183.57 0.99x
    Benchmark_UFlat3-8 914.92 1052.57 1.15x
    Benchmark_UFlat4-8 3092.91 3558.25 1.15x
    Benchmark_UFlat5-8 563.70 602.27 1.07x
    Benchmark_UFlat6-8 285.21 283.55 0.99x
    Benchmark_UFlat7-8 279.18 271.52 0.97x
    Benchmark_UFlat8-8 298.72 291.35 0.98x
    Benchmark_UFlat9-8 269.68 257.26 0.95x
    Benchmark_UFlat10-8 688.84 744.08 1.08x
    Benchmark_UFlat11-8 368.63 366.37 0.99x
    Benchmark_ZFlat0-8 335.21 339.65 1.01x
    Benchmark_ZFlat1-8 167.68 169.00 1.01x
    Benchmark_ZFlat2-8 3680.74 3937.54 1.07x
    Benchmark_ZFlat3-8 80.25 81.55 1.02x
    Benchmark_ZFlat4-8 518.54 523.30 1.01x
    Benchmark_ZFlat5-8 329.42 335.05 1.02x
    Benchmark_ZFlat6-8 143.51 144.68 1.01x
    Benchmark_ZFlat7-8 136.17 136.08 1.00x
    Benchmark_ZFlat8-8 149.39 150.43 1.01x
    Benchmark_ZFlat9-8 130.19 130.72 1.00x
    Benchmark_ZFlat10-8 374.19 380.09 1.02x
    Benchmark_ZFlat11-8 215.65 220.72 1.02x

    Second, without bounds checking (i.e. with -gcflags=-B):

    benchmark old MB/s new MB/s speedup
    BenchmarkWordsDecode1e1-8 432.54 426.46 0.99x
    BenchmarkWordsDecode1e2-8 492.12 566.52 1.15x
    BenchmarkWordsDecode1e3-8 626.97 717.32 1.14x
    BenchmarkWordsDecode1e4-8 385.07 488.93 1.27x
    BenchmarkWordsDecode1e5-8 307.36 333.04 1.08x
    BenchmarkWordsDecode1e6-8 333.77 360.06 1.08x
    BenchmarkWordsEncode1e1-8 5.74 5.94 1.03x
    BenchmarkWordsEncode1e2-8 45.80 49.88 1.09x
    BenchmarkWordsEncode1e3-8 165.60 194.41 1.17x
    BenchmarkWordsEncode1e4-8 167.06 210.20 1.26x
    BenchmarkWordsEncode1e5-8 138.79 143.32 1.03x
    BenchmarkWordsEncode1e6-8 164.75 171.29 1.04x
    BenchmarkRandomEncode-8 3805.99 4014.19 1.05x
    Benchmark_UFlat0-8 663.66 709.92 1.07x
    Benchmark_UFlat1-8 520.49 568.24 1.09x
    Benchmark_UFlat2-8 15511.13 15282.86 0.99x
    Benchmark_UFlat3-8 1023.28 1173.48 1.15x
    Benchmark_UFlat4-8 3622.37 4122.75 1.14x
    Benchmark_UFlat5-8 648.79 704.63 1.09x
    Benchmark_UFlat6-8 315.26 334.93 1.06x
    Benchmark_UFlat7-8 307.10 327.58 1.07x
    Benchmark_UFlat8-8 327.83 347.94 1.06x
    Benchmark_UFlat9-8 294.00 315.28 1.07x
    Benchmark_UFlat10-8 792.89 880.50 1.11x
    Benchmark_UFlat11-8 410.33 435.30 1.06x
    Benchmark_ZFlat0-8 367.93 386.90 1.05x
    Benchmark_ZFlat1-8 174.28 181.51 1.04x
    Benchmark_ZFlat2-8 3839.38 4210.87 1.10x
    Benchmark_ZFlat3-8 83.15 84.27 1.01x
    Benchmark_ZFlat4-8 543.33 581.04 1.07x
    Benchmark_ZFlat5-8 356.86 377.97 1.06x
    Benchmark_ZFlat6-8 148.18 154.63 1.04x
    Benchmark_ZFlat7-8 140.24 146.30 1.04x
    Benchmark_ZFlat8-8 153.88 160.06 1.04x
    Benchmark_ZFlat9-8 133.90 139.66 1.04x
    Benchmark_ZFlat10-8 417.28 440.01 1.05x
    Benchmark_ZFlat11-8 227.52 233.72 1.03x

  • Nigel Tao at Apr 23, 2016 at 6:19 am

    On Fri, Feb 26, 2016 at 6:36 PM, Nigel Tao wrote:
    As an experiment, I have rewritten the core of the Snappy decoder (and
    just the decoder, not the encoder, for now) in amd64 asm:
    I've rewritten the encoder as well: see the ten or so most recent
    commits at https://github.com/golang/snappy/commits/master

    Comparing pure-Go "old" with Go-with-asm "new":

    name old speed new speed delta
    WordsEncode1e1-8 676MB/s ± 0% 677MB/s ± 1% ~ (p=0.310 n=5+5)
    WordsEncode1e2-8 87.5MB/s ± 1% 428.3MB/s ± 0% +389.71% (p=0.008 n=5+5)
    WordsEncode1e3-8 258MB/s ± 0% 446MB/s ± 1% +72.67% (p=0.008 n=5+5)
    WordsEncode1e4-8 245MB/s ± 0% 316MB/s ± 0% +28.94% (p=0.008 n=5+5)
    WordsEncode1e5-8 186MB/s ± 0% 269MB/s ± 0% +44.86% (p=0.008 n=5+5)
    WordsEncode1e6-8 211MB/s ± 0% 314MB/s ± 1% +48.84% (p=0.008 n=5+5)
    RandomEncode-8 13.2GB/s ± 1% 14.4GB/s ± 1% +9.33% (p=0.008 n=5+5)
    _ZFlat0-8 431MB/s ± 0% 792MB/s ± 0% +83.67% (p=0.008 n=5+5)
    _ZFlat1-8 277MB/s ± 0% 436MB/s ± 1% +57.46% (p=0.008 n=5+5)
    _ZFlat2-8 13.8GB/s ± 2% 16.2GB/s ± 1% +17.16% (p=0.008 n=5+5)
    _ZFlat3-8 173MB/s ± 0% 632MB/s ± 1% +265.85% (p=0.008 n=5+5)
    _ZFlat4-8 3.10GB/s ± 0% 8.00GB/s ± 0% +157.99% (p=0.008 n=5+5)
    _ZFlat5-8 426MB/s ± 0% 768MB/s ± 0% +80.06% (p=0.008 n=5+5)
    _ZFlat6-8 190MB/s ± 0% 282MB/s ± 1% +48.48% (p=0.008 n=5+5)
    _ZFlat7-8 182MB/s ± 0% 264MB/s ± 1% +44.97% (p=0.008 n=5+5)
    _ZFlat8-8 200MB/s ± 0% 298MB/s ± 0% +49.45% (p=0.008 n=5+5)
    _ZFlat9-8 175MB/s ± 0% 247MB/s ± 0% +41.02% (p=0.008 n=5+5)
    _ZFlat10-8 509MB/s ± 0% 1027MB/s ± 0% +101.72% (p=0.008 n=5+5)
    _ZFlat11-8 275MB/s ± 0% 411MB/s ± 0% +49.57% (p=0.008 n=5+5)

    One interesting factoid is that the Go-with-asm encoder now appears
    faster than the C++ encoder (or, alternatively, I've screwed up
    somewhere). My C++ snappy_unittest.log numbers:

    BM_ZFlat/0 151725 151761 1306 643.5MB/s html (22.31 %)
    BM_ZFlat/1 1857383 1857878 107 360.4MB/s urls (47.78 %)
    BM_ZFlat/2 8587 8589 22246 13.3GB/s jpg (99.95 %)
    BM_ZFlat/3 482 482 363636 395.3MB/s jpg_200 (73.00 %)
    BM_ZFlat/4 15701 15705 12406 6.1GB/s pdf (83.30 %)
    BM_ZFlat/5 631610 631613 318 618.5MB/s html4 (22.52 %)
    BM_ZFlat/6 639135 639282 311 226.9MB/s txt1 (57.88 %)
    BM_ZFlat/7 543036 543197 360 219.8MB/s txt2 (61.91 %)
    BM_ZFlat/8 1702822 1703093 118 239.0MB/s txt3 (54.99 %)
    BM_ZFlat/9 2224850 2225480 100 206.5MB/s txt4 (66.26 %)
    BM_ZFlat/10 128935 128974 1526 876.9MB/s pb (19.68 %)
    BM_ZFlat/11 499035 499168 398 352.1MB/s gaviota (37.72 %)

    The "643.5MB/s" throughput column is the one of interest, to compare
    to e.g. Go's "792MB/s".

    Once again, comments welcome, and I'd appreciate a code reviewer.

  • Caleb Spare at Apr 23, 2016 at 6:46 am
    Hey Nigel, thanks for all your work making go-snappy fast. We've put it to
    good use at my company.

    For far-reaching and potentially dangerous changes like this, have you
    considered using GitHub pull requests to obtain code review prior to
    merging? Not only would that allow for reviewers to possibly catch some
    bugs before they hit master, it can also make the job of the reviewer
    easier if all the related commits are part of a single PR.

    No need for a fork; you can just put the changes on some branch and open a
    pull request from that branch to master. This is a common workflow for
    people doing work on a repo they own.

    -Caleb
  • Nigel Tao at Apr 23, 2016 at 6:55 am

    On Sat, Apr 23, 2016 at 4:45 PM, Caleb Spare wrote:
    No need for a fork; you can just put the changes on some branch and open a
    pull request from that branch to master. This is a common workflow for
    people doing work on a repo they own.
    Thanks for the tip, and I'll keep that in mind for next time. I'm
    still a bit of a noob with git, github and PRs. My day-to-day workflow
    uses the Gerrit code review tool, the same one used by the official Go
    repo, and I think that the longer-term solution is to move
    golang/snappy to use that.

  • Caleb Spare at Apr 23, 2016 at 8:09 am
    Yeah, that would be even better in my opinion.
  • Caleb Spare at Apr 23, 2016 at 6:17 pm
    In fact, some bugs do appear to have been introduced with the
    encoder changes:

    https://github.com/golang/snappy/issues/29
    https://github.com/golang/snappy/issues/30
    https://github.com/golang/snappy/issues/31

    -Caleb
  • Nigel Tao at Apr 24, 2016 at 12:39 am

    On Sun, Apr 24, 2016 at 4:17 AM, Caleb Spare wrote:
    In fact, some kind of bugs do appear to have been introduced with the
    encoder changes:

    https://github.com/golang/snappy/issues/29
    https://github.com/golang/snappy/issues/30
    https://github.com/golang/snappy/issues/31
    Thanks for the note. They're fixed now, although I suspect that #29
    (and its dupe #31) is a regression in the Go 1.6 assembler. I'll
    investigate further (off list)...

