FAQ
hi folks,

in my chess game i use a huge array as transposition table, which means i
store informations about previously examined position in it. the rule of
thumb for the size of there tables is "the bigger, the better".

by using pprof i found out that my programm is spending 10% just for
accessing transposition table elements, which seemed a lot to me. i
suspected the big size of the transposition table. so i wrote a small
program to do some tests, you find it here:
http://play.golang.org/p/5qC3MtLb_N. you find two benchmarks, see the
comments on them. basically i just create a hash value and access a (big)
array.

here are some examples, i use a thinkpad t61 with 2gb ram. i will use perf
to provide more information.

test 1: Benchmark2(), wich accesses the same array element all the time, is
- of course - very fast. perf gives me:

perf stat ./pagefault.test -test.bench="2"
Benchmark2 100000000 13.5 ns/op

Performance counter stats for './pagefault.test -test.bench=2':

1363.613558 task-clock # 0.995 CPUs utilized

150 context-switches # 0.110 K/sec

6 cpu-migrations # 0.004 K/sec

391 page-faults # 0.287 K/sec

3,251,998,913 cycles # 2.385 GHz
[49.88%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
6,472,712,515 instructions # 1.99 insns per cycle
[75.09%]
1,114,378,734 branches # 817.225 M/sec
[75.17%]
82,350 branch-misses # 0.01% of all branches
[74.96%]

1.369983412 seconds time elapsed


these values will not change, no matter how big the array gets. Benchmark1
(random accesses) on the other hand is getting very slow, the bigger the
array gets. this is of course something to expect, i was just astonished
about the amount of the performance penalty. some tests:


test 2: array size = 2^15 = 32768 elements, table memory size 256 kb

perf stat ./pagefault.test -test.bench="1"
Benchmark1 5000000 706 ns/op

Performance counter stats for './pagefault.test -test.bench=1':

4249.042305 task-clock # 0.998 CPUs utilized

456 context-switches # 0.107 K/sec

10 cpu-migrations # 0.002 K/sec

452 page-faults # 0.106 K/sec

10,117,463,095 cycles # 2.381 GHz
[49.89%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,690,985,489 instructions # 0.96 insns per cycle
[74.88%]
1,999,073,677 branches # 470.476 M/sec
[75.09%]
691,820 branch-misses # 0.03% of all branches
[75.03%]

4.256151689 seconds time elapsed


test 3: array size = 2^20 = 1048576 elements, table memory size 8 mb

perf stat ./pagefault.test -test.bench="1"
Benchmark1 1000000 1189 ns/op

Performance counter stats for './pagefault.test -test.bench=1':

1214.013801 task-clock # 0.996 CPUs utilized

141 context-switches # 0.116 K/sec

6 cpu-migrations # 0.005 K/sec

514 page-faults # 0.423 K/sec

2,880,758,281 cycles # 2.373 GHz
[50.08%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,636,542,194 instructions # 0.57 insns per cycle
[75.06%]
338,538,078 branches # 278.859 M/sec
[75.06%]
251,635 branch-misses # 0.07% of all branches
[74.88%]

1.218559828 seconds time elapsed


test 4: array size = 2^24 = 16777216 elements, table memory size 128 mb

perf stat ./pagefault.test -test.bench="1"
Benchmark1 1000000 1672 ns/op

Performance counter stats for './pagefault.test -test.bench=1':

2587.889368 task-clock # 0.997 CPUs utilized

290 context-switches # 0.112 K/sec

6 cpu-migrations # 0.002 K/sec

842 page-faults # 0.325 K/sec

6,167,431,406 cycles # 2.383 GHz
[49.96%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
2,438,377,742 instructions # 0.40 insns per cycle
[75.08%]
503,968,965 branches # 194.741 M/sec
[75.06%]
273,817 branch-misses # 0.05% of all branches
[74.99%]

2.595921867 seconds time elapsed


test 5: array size = 2^27 = 134217728 elements, table memory size 1 gb

perf stat ./pagefault.test -test.bench="1"
Benchmark1 1000000 1432 ns/op

Performance counter stats for './pagefault.test -test.bench=1':

2088.863390 task-clock # 0.997 CPUs utilized

239 context-switches # 0.114 K/sec

2 cpu-migrations # 0.001 K/sec

110,644 page-faults # 0.053 M/sec

4,961,774,266 cycles # 2.375 GHz
[49.87%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,881,551,046 instructions # 0.38 insns per cycle
[75.00%]
387,027,652 branches # 185.281 M/sec
[75.14%]
289,273 branch-misses # 0.07% of all branches
[75.08%]

2.094972871 seconds time elapsed



my questions:
1. Is it possible that the perfomance decrease is caused by page faults?
looking at test 5 one could doubt that.
2. If not page faults, what else?
3. why is test 5 faster than test 4?
4. is there anything that one (me/the golang dev team) can do about it?

thanks for help, hope this post is not to boring.

regards,

heiko


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Search Discussions

  • Minux at Feb 18, 2013 at 7:12 pm

    On Tue, Feb 19, 2013 at 3:03 AM, hs wrote:

    in my chess game i use a huge array as transposition table, which means i
    store informations about previously examined position in it. the rule of
    thumb for the size of there tables is "the bigger, the better".

    by using pprof i found out that my programm is spending 10% just for
    accessing transposition table elements, which seemed a lot to me. i
    suspected the big size of the transposition table. so i wrote a small
    program to do some tests, you find it here:
    http://play.golang.org/p/5qC3MtLb_N. you find two benchmarks, see the
    comments on them. basically i just create a hash value and access a (big)
    array.

    here are some examples, i use a thinkpad t61 with 2gb ram. i will use perf
    to provide more information.

    test 1: Benchmark2(), wich accesses the same array element all the time,
    is - of course - very fast. perf gives me:

    perf stat ./pagefault.test -test.bench="2"
    Benchmark2 100000000 13.5 ns/op

    Performance counter stats for './pagefault.test -test.bench=2':

    1363.613558 task-clock # 0.995 CPUs utilized

    150 context-switches # 0.110 K/sec

    6 cpu-migrations # 0.004 K/sec

    391 page-faults # 0.287 K/sec

    3,251,998,913 cycles # 2.385 GHz
    [49.88%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    6,472,712,515 instructions # 1.99 insns per cycle
    [75.09%]
    1,114,378,734 branches # 817.225 M/sec
    [75.17%]
    82,350 branch-misses # 0.01% of all branches
    [74.96%]

    1.369983412 seconds time elapsed


    these values will not change, no matter how big the array gets. Benchmark1
    (random accesses) on the other hand is getting very slow, the bigger the
    array gets. this is of course something to expect, i was just astonished
    about the amount of the performance penalty. some tests:


    test 2: array size = 2^15 = 32768 elements, table memory size 256 kb

    perf stat ./pagefault.test -test.bench="1"
    Benchmark1 5000000 706 ns/op

    Performance counter stats for './pagefault.test -test.bench=1':

    4249.042305 task-clock # 0.998 CPUs utilized

    456 context-switches # 0.107 K/sec

    10 cpu-migrations # 0.002 K/sec

    452 page-faults # 0.106 K/sec

    10,117,463,095 cycles # 2.381 GHz
    [49.89%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    9,690,985,489 instructions # 0.96 insns per cycle
    [74.88%]
    1,999,073,677 branches # 470.476 M/sec
    [75.09%]
    691,820 branch-misses # 0.03% of all branches
    [75.03%]

    4.256151689 seconds time elapsed


    test 3: array size = 2^20 = 1048576 elements, table memory size 8 mb

    perf stat ./pagefault.test -test.bench="1"
    Benchmark1 1000000 1189 ns/op

    Performance counter stats for './pagefault.test -test.bench=1':

    1214.013801 task-clock # 0.996 CPUs utilized

    141 context-switches # 0.116 K/sec

    6 cpu-migrations # 0.005 K/sec

    514 page-faults # 0.423 K/sec

    2,880,758,281 cycles # 2.373 GHz
    [50.08%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    1,636,542,194 instructions # 0.57 insns per cycle
    [75.06%]
    338,538,078 branches # 278.859 M/sec
    [75.06%]
    251,635 branch-misses # 0.07% of all branches
    [74.88%]

    1.218559828 seconds time elapsed


    test 4: array size = 2^24 = 16777216 elements, table memory size 128 mb

    perf stat ./pagefault.test -test.bench="1"
    Benchmark1 1000000 1672 ns/op

    Performance counter stats for './pagefault.test -test.bench=1':

    2587.889368 task-clock # 0.997 CPUs utilized

    290 context-switches # 0.112 K/sec

    6 cpu-migrations # 0.002 K/sec

    842 page-faults # 0.325 K/sec

    6,167,431,406 cycles # 2.383 GHz
    [49.96%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    2,438,377,742 instructions # 0.40 insns per cycle
    [75.08%]
    503,968,965 branches # 194.741 M/sec
    [75.06%]
    273,817 branch-misses # 0.05% of all branches
    [74.99%]

    2.595921867 seconds time elapsed


    test 5: array size = 2^27 = 134217728 elements, table memory size 1 gb

    perf stat ./pagefault.test -test.bench="1"
    Benchmark1 1000000 1432 ns/op

    Performance counter stats for './pagefault.test -test.bench=1':

    2088.863390 task-clock # 0.997 CPUs utilized

    239 context-switches # 0.114 K/sec

    2 cpu-migrations # 0.001 K/sec

    110,644 page-faults # 0.053 M/sec

    4,961,774,266 cycles # 2.375 GHz
    [49.87%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    1,881,551,046 instructions # 0.38 insns per cycle
    [75.00%]
    387,027,652 branches # 185.281 M/sec
    [75.14%]
    289,273 branch-misses # 0.07% of all branches
    [75.08%]

    2.094972871 seconds time elapsed



    my questions:
    1. Is it possible that the perfomance decrease is caused by page faults?
    looking at test 5 one could doubt that.
    no. at least not before the last test.
    2. If not page faults, what else?
    cache. go read something about memory hierarchy.
    3. why is test 5 faster than test 4?
    not sure about this, but test 4 executes much more instructions than test 5
    while
    the number of benchmark iteration is the same.
    4. is there anything that one (me/the golang dev team) can do about it?
    you can try the same test in C/C++ and i'm sure you will get the same
    trend.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Russ Cox at Feb 19, 2013 at 3:18 am
    http://people.redhat.com/drepper/cpumemory.pdf
    In particular, section 3.3.2 and figure 3.15.

    Russ

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Hs at Feb 19, 2013 at 5:54 pm
    thanks, once you know where to search the reasons become obvious. i can
    confirm a huge raise of cache misses while using bigger tables. it's a pity
    perf doesn't print cache misses by default.

    @Russ: This kind of paper i was looking for for a long time. as i am aware
    of the cache misses/memory latency issue (and i already did some
    optimizations to deal with it) i couldn't find a paper which describes the
    whole subject that detailed. thank you, it will help a lot. a must read,
    indeed.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupgolang-nuts @
categoriesgo
postedFeb 18, '13 at 7:03p
activeFeb 19, '13 at 5:54p
posts4
users3
websitegolang.org

3 users in discussion

Hs: 2 posts Russ Cox: 1 post Minux: 1 post

People

Translate

site design / logo © 2022 Grokbase