Bump :)

After a short repro [0] and Dmitry's hard work [1][2], it turned out
that the allocations I was seeing were actually caused by a bug in the
Go runtime. Applying the two previously mentioned fixes to the Go
master branch almost completely eliminates the synchronization
allocations. A few are still required, but they are amortized: the
longer a process runs, the fewer allocations it does per operation.
That is why the latency benchmarks still report some allocations,
while the throughput benchmarks, which repeat the operation many
times, report none.
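
To make the amortization concrete, here's a hypothetical measurement
(not the shootout harness; work is just a stand-in for one copy
operation) showing how a one-time runtime cost is counted in full for
a single run but averages out to almost nothing over a long one:

    // Hypothetical illustration, not the shootout harness: one-time
    // runtime allocations dominate a single operation, but average
    // out to ~0 over many repetitions.
    package main

    import (
        "fmt"
        "runtime"
    )

    // ch is created once so work itself does no user-level allocation;
    // whatever gets measured comes from the runtime's synchronization
    // machinery (goroutine structs, stacks), which is pooled and reused.
    var ch = make(chan struct{})

    func work() {
        go func() { ch <- struct{}{} }()
        <-ch
    }

    // allocsPerOp reports the average mallocs per call over n calls.
    func allocsPerOp(n int) float64 {
        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)
        for i := 0; i < n; i++ {
            work()
        }
        runtime.ReadMemStats(&after)
        return float64(after.Mallocs-before.Mallocs) / float64(n)
    }

    func main() {
        fmt.Printf("single op: %.1f allocs/op\n", allocsPerOp(1))
        fmt.Printf("many ops:  %.3f allocs/op\n", allocsPerOp(100000))
    }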

Cheers,
   Peter

Refs:
   [0] https://groups.google.com/forum/#!topic/golang-nuts/a8ZoAhAeO7k
   [1] https://go-review.googlesource.com/#/c/3742
   [2] https://go-review.googlesource.com/#/c/3741

Latency benchmarks (GOMAXPROCS = 1):
       [!] bufio.Copy: 5.187µs 11 allocs 1696 B.
      rogerpeppe.Copy: 5.531µs 11 allocs 1632 B.
      mattharden.Copy: 5.653µs 9 allocs 67056 B.
       egonelbre.Copy: 5.432µs 17 allocs 2000 B.
            jnml.Copy: 5.706µs 8 allocs 1712 B.
    augustoroman.Copy: 5.866µs 8 allocs 1504 B.

Latency benchmarks (GOMAXPROCS = 8):
       [!] bufio.Copy: 4.506µs 541 allocs 35840 B.
      rogerpeppe.Copy: 5.696µs 66 allocs 5376 B.
      mattharden.Copy: 5.742µs 34 allocs 68864 B.
       egonelbre.Copy: 4.564µs 277 allocs 19088 B.
            jnml.Copy: 5.744µs 28 allocs 3216 B.
    augustoroman.Copy: 5.976µs 19 allocs 1168 B.

Throughput (GOMAXPROCS = 1) (256 MB):

+-------------------+--------+---------+---------+---------+----------+
|    THROUGHPUT     |    333 |    4155 |   65359 | 1048559 | 16777301 |
+-------------------+--------+---------+---------+---------+----------+
|    [!] bufio.Copy | 437.80 | 2793.60 | 4667.71 | 4842.16 |  2054.19 |
|   rogerpeppe.Copy | 198.44 | 1743.79 | 4397.40 | 4820.92 |  2054.45 |
|   mattharden.Copy | 183.49 | 1272.47 | 2201.86 | 2353.29 |  1185.67 |
|    egonelbre.Copy | 225.30 | 1737.59 | 4377.38 | 4722.25 |  2027.28 |
|         jnml.Copy | 222.91 | 1875.31 | 4433.52 | 4832.32 |  2052.50 |
| augustoroman.Copy | 142.77 | 1365.51 | 4185.75 | 4818.16 |  2052.86 |
+-------------------+--------+---------+---------+---------+----------+

+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|   ALLOCS/BYTES    |          333          |         4155          |         65359         |        1048559        |       16777301        |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|    [!] bufio.Copy |  (  10 /       1008)  |  (  10 /       5264)  |  (  10 /      66192)  |  (  10 /    1049232)  |  (  10 /   16786064)  |
|   rogerpeppe.Copy |  (   7 /        976)  |  (   6 /       4944)  |  (   6 /      65872)  |  (   6 /    1048912)  |  (   6 /   16785744)  |
|   mattharden.Copy |  (   9 /      41880)  |  (   9 /      46136)  |  (   9 /     107064)  |  (   9 /    1090104)  |  (   9 /   16826936)  |
|    egonelbre.Copy |  (  12 /       1056)  |  (  12 /       5312)  |  (  12 /      66240)  |  (  12 /    1049280)  |  (  12 /   16786112)  |
|         jnml.Copy |  (   5 /        896)  |  (   5 /       5152)  |  (   5 /      66080)  |  (   5 /    1049120)  |  (   5 /   16785952)  |
| augustoroman.Copy |  (   5 /        688)  |  (   5 /       4944)  |  (   5 /      65872)  |  (   5 /    1048912)  |  (   5 /   16785744)  |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+

Throughput (GOMAXPROCS = 8) (256 MB):

+-------------------+--------+---------+---------+---------+----------+
|    THROUGHPUT     |    333 |    4155 |   65359 | 1048559 | 16777301 |
+-------------------+--------+---------+---------+---------+----------+
|    [!] bufio.Copy | 409.02 | 2927.20 | 4395.27 | 4481.92 |  2043.71 |
|   rogerpeppe.Copy | 195.11 | 1706.20 | 3779.57 | 4523.37 |  2044.33 |
|   mattharden.Copy | 177.87 | 1236.07 | 2109.07 | 2009.68 |  1143.01 |
|    egonelbre.Copy | 335.69 | 2283.42 | 3854.51 | 4338.55 |  1896.28 |
|         jnml.Copy | 211.20 | 1825.58 | 3983.32 | 4549.05 |  2044.82 |
| augustoroman.Copy | 139.52 | 1336.84 | 3337.52 | 4463.26 |  2041.43 |
+-------------------+--------+---------+---------+---------+----------+

+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|   ALLOCS/BYTES    |          333          |         4155          |         65359         |        1048559        |       16777301        |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|    [!] bufio.Copy |  (  31 /       2352)  |  (  12 /       5616)  |  (  13 /      66384)  |  (  26 /    1050256)  |  (  10 /   16786064)  |
|   rogerpeppe.Copy |  (   6 /        688)  |  (   6 /       4944)  |  (   6 /      65872)  |  (   8 /    1050224)  |  (   7 /   16786032)  |
|   mattharden.Copy |  (  10 /      43160)  |  (  11 /      47480)  |  (  10 /     247560)  |  (  10 /    3188488)  |  (  10 /   50423560)  |
|    egonelbre.Copy |  (  12 /       1056)  |  (  63 /       8800)  |  (  13 /      66528)  |  (  12 /    1049280)  |  (  16 /   16786368)  |
|         jnml.Copy |  (   5 /        896)  |  (   5 /       5152)  |  (   5 /      66080)  |  (   5 /    1049120)  |  (   6 /   16786016)  |
| augustoroman.Copy |  (   9 /        944)  |  (   5 /       4944)  |  (   5 /      65872)  |  (   5 /    1048912)  |  (   5 /   16785744)  |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+

On Sat, Jan 31, 2015 at 7:25 PM, Péter Szilágyi wrote:

Hi all,

@Jan: Merged, and the new solution indeed passes the shootout. With
the current version, your implementation is in the same ballpark as
the rest. I think we're starting to converge on the achievable
performance across all implementations.

Lately my solution seems to beat the others significantly for smaller
buffer sizes. This is due to a neat optimization
<https://github.com/karalabe/bufioprop/blob/master/pipe.go#L126>:
doing a short spin before going down into a deep sleep whenever no
data/space is available in the internal buffers. The spin should be
very short, so that it doesn't take a toll on performance when no data
is coming, but long enough that if data *is* being streamed, the
goroutine doesn't need to block and resynchronize. Jan, I guess this
isn't something you could try as you're completely channel based, but
give it a thought (if it's doable, it might bring up your performance
on small buffers). The others should maybe play around with the idea
too; I think I saw it in Egon's code a while back, but I haven't
checked lately.
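
For the curious, here's a minimal sketch of the spin-before-sleep
idea, assuming a mutex/cond based pipe; the names (pipe, maxSpins,
waitForData) are illustrative, not the actual bufioprop code:

    // Illustrative sketch, not the actual bufioprop code: spin briefly
    // before taking the expensive deep-sleep path on an empty buffer.
    package pipe

    import (
        "runtime"
        "sync"
    )

    // maxSpins is a tuning knob: long enough to catch actively
    // streamed data, short enough to stay cheap when nothing comes.
    const maxSpins = 100

    type pipe struct {
        mu    sync.Mutex
        cond  *sync.Cond // signaled by the writer when data arrives
        avail int        // bytes currently buffered
    }

    func newPipe() *pipe {
        p := &pipe{}
        p.cond = sync.NewCond(&p.mu)
        return p
    }

    // waitForData spins for a short while, yielding the processor
    // between checks, hoping the writer delivers data without the
    // reader having to block; only then does it go into a deep sleep.
    func (p *pipe) waitForData() {
        for i := 0; i < maxSpins; i++ {
            p.mu.Lock()
            if p.avail > 0 {
                p.mu.Unlock()
                return
            }
            p.mu.Unlock()
            runtime.Gosched() // yield instead of sleeping
        }
        // Spinning didn't pay off: sleep until the writer signals.
        p.mu.Lock()
        for p.avail == 0 {
            p.cond.Wait()
        }
        p.mu.Unlock()
    }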

Jan's previous optimization was actually a really good observation:
the reason performance drops with large buffers is that you start
missing the CPU cache and need to go through main memory, which
essentially becomes the limiting factor. His solution was to try to
reuse hot parts of the buffer that can probably still be found in the
L1/L2 caches, but it didn't pan out correctly (see the previous,
longish description for details). Nonetheless, the observation is a
good one, so it *could* be worthwhile to try and implement this
hot-cache reuse. I am thinking of a solution that would split the
buffer up the same way Jan did previously, but keep writing to one
cache-sized chunk until it's full and only then start the next, as
sketched below. The issue is that the synchronization can get really
messy.
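
To make the idea concrete, here's a rough sketch of that chunked
layout, with the reader/writer synchronization (the messy part)
omitted entirely; chunkSize and all the names are hypothetical:

    // Rough sketch of the hot-cache-reuse idea, with all names and
    // sizes hypothetical: split the buffer into cache-sized chunks and
    // fill one completely before starting the next, so the reader can
    // chase the writer through memory that is still cache-hot.
    package pipe

    const chunkSize = 64 * 1024 // roughly L1/L2 sized, a tuning knob

    type chunkedBuffer struct {
        chunks [][]byte // the large buffer, split into chunks
        wChunk int      // chunk currently being written
        wPos   int      // write offset inside that chunk
    }

    func newChunkedBuffer(total int) *chunkedBuffer {
        b := &chunkedBuffer{}
        for total > 0 {
            n := chunkSize
            if total < n {
                n = total
            }
            b.chunks = append(b.chunks, make([]byte, n))
            total -= n
        }
        return b
    }

    // write copies p into the current chunk, moving to the next chunk
    // only once the current one is full. Coordination with the reader,
    // the really messy part, is left out of the sketch.
    func (b *chunkedBuffer) write(p []byte) int {
        written := 0
        for len(p) > 0 && b.wChunk < len(b.chunks) {
            c := b.chunks[b.wChunk]
            n := copy(c[b.wPos:], p)
            b.wPos += n
            written += n
            p = p[n:]
            if b.wPos == len(c) { // chunk full, start the next one
                b.wChunk++
                b.wPos = 0
            }
        }
        return written
    }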

I guess the last algorithmic challenge in this proposal would be to
figure out whether the buffer can be kept hot. If yes, great; if not,
we can proceed to finalizing the API around the buffered copy/pipe.

Cheers,
Peter

PS: The jury's still out on why I get hit by memory allocations
*always* in the same tests, never in others.

Latency benchmarks (GOMAXPROCS = 1):
       [!] bufio.Copy: 4.596µs 23 allocs 2288 B.
      rogerpeppe.Copy: 4.883µs 21 allocs 2096 B.
       egonelbre.Copy: 4.77µs 29 allocs 2368 B.
            jnml.Copy: 5.041µs 19 allocs 2144 B.
    augustoroman.Copy: 5.229µs 16 allocs 1840 B.

Latency benchmarks (GOMAXPROCS = 8):
       [!] bufio.Copy: 4.597µs 481 allocs 30736 B.
      rogerpeppe.Copy: 4.916µs 348 allocs 22224 B.
       egonelbre.Copy: 4.786µs 398 allocs 27408 B.
            jnml.Copy: 4.92µs 343 allocs 21904 B.
    augustoroman.Copy: 5.208µs 162 allocs 11184 B.

Throughput (GOMAXPROCS = 1) (256 MB):

+-------------------+--------+---------+---------+---------+----------+
|    THROUGHPUT     |    333 |    4155 |   65359 | 1048559 | 16777301 |
+-------------------+--------+---------+---------+---------+----------+
|    [!] bufio.Copy | 470.22 | 2950.34 | 4738.61 | 4924.87 |  2001.32 |
|   rogerpeppe.Copy | 216.22 | 1856.43 | 4462.28 | 4911.21 |  2000.13 |
|    egonelbre.Copy | 253.42 | 1852.45 | 4479.33 | 4788.88 |  1982.38 |
|         jnml.Copy | 231.71 | 1947.09 | 4503.48 | 4907.54 |  2008.88 |
| augustoroman.Copy | 158.89 | 1479.22 | 4299.65 | 4887.31 |  2008.11 |
+-------------------+--------+---------+---------+---------+----------+


+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|   ALLOCS/BYTES    |          333          |         4155          |         65359         |        1048559        |       16777301        |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|    [!] bufio.Copy |  (  13 /       1024)  |  (  13 /       5280)  |  (  13 /      66208)  |  (  13 /    1049248)  |  (  13 /   16786080)  |
|   rogerpeppe.Copy |  (  12 /        896)  |  (  12 /       5152)  |  (  12 /      66080)  |  (  12 /    1049120)  |  (  12 /   16785952)  |
|    egonelbre.Copy |  (  21 /       1248)  |  (  21 /       5504)  |  (  21 /      66432)  |  (  21 /    1049472)  |  (  21 /   16786304)  |
|         jnml.Copy |  (  12 /       1088)  |  (  12 /       5344)  |  (  12 /      66272)  |  (  12 /    1049312)  |  (  12 /   16786144)  |
| augustoroman.Copy |  (  12 /        960)  |  (  12 /       5216)  |  (  12 /      66144)  |  (  12 /    1049184)  |  (  12 /   16786016)  |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+

Throughput (GOMAXPROCS = 8) (256 MB):

+-------------------+--------+---------+---------+---------+----------+
|    THROUGHPUT     |    333 |    4155 |   65359 | 1048559 | 16777301 |
+-------------------+--------+---------+---------+---------+----------+
|    [!] bufio.Copy | 490.45 | 2934.83 | 4543.90 | 4590.13 |  1994.79 |
|   rogerpeppe.Copy | 215.50 | 1721.20 | 3904.96 | 4596.47 |  2000.28 |
|    egonelbre.Copy | 344.52 | 2467.48 | 3907.59 | 4348.90 |  1850.96 |
|         jnml.Copy | 239.27 |  897.75 | 4029.99 | 4625.88 |  1996.11 |
| augustoroman.Copy | 152.99 | 1415.63 | 3581.26 | 4587.91 |  1999.95 |
+-------------------+--------+---------+---------+---------+----------+


+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|   ALLOCS/BYTES    |          333          |         4155          |         65359         |        1048559        |       16777301        |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|    [!] bufio.Copy |  ( 143 /       9344)  |  (  38 /       6880)  |  (  91 /      71200)  |  ( 781 /    1098400)  |  (  61 /   16789152)  |
|   rogerpeppe.Copy |  (  13 /        960)  |  (  14 /       5504)  |  (  14 /      66432)  |  (  14 /    1049472)  |  (  14 /   16786304)  |
|    egonelbre.Copy |  (  97 /       6560)  |  (  48 /       7456)  |  (  49 /      68672)  |  (  30 /    1050272)  |  (  31 /   16786944)  |
|         jnml.Copy |  (  12 /       1088)  |  (  13 /       5632)  |  (  13 /      66560)  |  (  12 /    1049312)  |  (  12 /   16786144)  |
| augustoroman.Copy |  (  25 /       2016)  |  (  14 /       5568)  |  (  13 /      66432)  |  (  12 /    1049184)  |  (  12 /   16786016)  |
+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+

On Sat, Jan 31, 2015 at 6:33 PM, Jan Mercl wrote:
On Sat Jan 31 2015 at 10:05:34 Péter Szilágyi wrote:

I've just shot out your code as *not solving* the problem again :P
Please pull [0], thank you.

[0]: https://github.com/karalabe/bufioprop/pull/15

-j