I have made two native ARM sha1 routines.

https://codereview.appspot.com/56900043/

This is the highest performance version on my Chromebook. It is fully
unrolled and uses 5508 bytes of code. However I'm concerned that 5k of
code won't fit into the I-cache on some ARM processors so I'd like some
advice as to whether that is sensible or not.

I've made a second routine which is partially unrolled (note this
version isn't quite as polished as the first one)

https://codereview.appspot.com/56990044/

Which uses only 1896 bytes of code but runs about 10% slower on the
Chromebook.

In comparison the amd64 version of the code is 4963 bytes and the 386
version is 3888 bytes. Both are fully unrolled.

My feeling is that the unrolled version should be preferred as it is
faster and 5k of code isn't excessive. However I don't want to unduly
hamper older ARM processors.

I'll polish and submit one or the other CLs depending on what we decide!

--
Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick

--

---
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Dave Cheney at Jan 25, 2014 at 9:56 pm
    Thanks, Nick. I have a bunch of middling ARM machines; I'll do some testing.
  • Minux at Jan 26, 2014 at 7:02 pm

    On Sat, Jan 25, 2014 at 11:17 AM, Nick Craig-Wood wrote:

    > I have made two native ARM sha1 routines.

    Great. Thank you for working on that.
    > I'll polish and submit one or the other CLs depending on what we decide!

    I can't decide which is better, but I'm slightly inclined toward the smaller
    one.
    (Could we put both in the tree, and use one for armv7a and the other for armv5,
    and possibly for armv6?)

    Anyway, I want to hear the benchmark results on ARMv5.

    PS: Given that the GOARM setting now affects the compiler's code generation,
    can we introduce armv5, armv6 and armv7 build tags?
    Having those would solve this problem perfectly: just include both with
    different build tags.

  • Minux at Jan 26, 2014 at 7:06 pm

    On Sun, Jan 26, 2014 at 2:01 PM, minux wrote:

    > PS: given that GOARM setting now affects the compiler code generation,
    > can we introduce armv5, armv6 and armv7 build tags?
    > Having that can solve this problem perfectly: just include both with
    > different build tags.

    I filed https://code.google.com/p/go/issues/detail?id=7211 for this.

  • Nick Craig-Wood at Jan 27, 2014 at 1:55 pm

    On 26/01/14 19:01, minux wrote:

    > I can't decide which is better, but I'm slightly inclined to the smaller
    > one.

    The larger one has a problem with immediate data - if you disassemble it
    you'll see that the linker shoves the immediate data in the middle of
    the routine and inserts branch instructions to jump over it which is a
    little untidy!

    > (could we put both in the tree, and use one for armv7a and other for armv5,
    > and possibly for armv6?)
    >
    > anyway, I want to hear the benchmark result on ARMv5.
    >
    > PS: given that GOARM setting now affects the compiler code generation,
    > can we introduce armv5, armv6 and armv7 build tags?
    > Having that can solve this problem perfectly: just include both with
    > different build tags.

    Hmm, nice idea!

    --
    Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick

  • Brad Fitzpatrick at Jan 30, 2014 at 9:27 am
    In lieu of a build tag, couldn't you also pick at runtime (sync.Once or
    init) which one to use based on cache size?


  • Dave Cheney at Jan 30, 2014 at 9:41 am
    Both versions deliver excellent results. The more unrolled version delivers the best performance on the latest ARM chips, but never performs worse on older hardware.

    I think whichever version Nick wants to propose will be fine, and there is no need to sniff at runtime.


  • Brad Fitzpatrick at Jan 30, 2014 at 9:45 am
    I was proposing sniffing at the CPU :)

    But just the larger one is fine too if it's not worse than today.

    Btw, I'm really excited about this for Camlistore. The Android client uses
    a Go child process for the heavy lifting and does a lot of SHA-1. Not sure
    how big all my phones' caches are.
  • Nick Craig-Wood at Feb 8, 2014 at 3:30 pm

    On 30/01/14 09:41, Dave Cheney wrote:

    > Both versions deliver excellent results. The more unrolled version
    > delivers best performance on the latest arm chips, but never performs
    > worse on older hardware.
    >
    > I think whichever version Nick wants to propose will be fine, and there
    > is no need to sniff at runtime.

    I spent a bit of time tweaking the less unrolled (smaller) version until
    it is almost the same speed as the unrolled version and I've submitted that.

    I feel the way the benchmarks work with a hot cache isn't representative
    of the real world though. The real world would punish the much larger
    code much more in my experience. It would be nice if there was an option
    to clear the cache on every iteration of the benchmark (not an easy
    thing to do in a cross platform way though).

    --
    Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick

  • Dmitry Vyukov at Feb 8, 2014 at 3:34 pm

    On Sat, Feb 8, 2014 at 7:30 PM, Nick Craig-Wood wrote:
    > I feel the way the benchmarks work with a hot cache isn't representative
    > of the real world though. The real world would punish the much larger
    > code much more in my experience. It would be nice if there was an option
    > to clear the cache on every iteration of the benchmark (not an easy
    > thing to do in a cross platform way though).

    A better way to do it is to write a more representative benchmark.
    E.g. calculate SHA1 over a set of buffers in round-robin order.

  • Dmitry Vyukov at Jan 30, 2014 at 9:51 am
    Just FYI
    Atomic operations on ARM in both runtime and sync/atomic are quite
    suboptimal. So if somebody has ARM hardware, knows ARM assembly, and
    does not mind hacking on low-level stuff, optimizing atomics for ARM
    (especially for newer devices, e.g. multicore ARMv7) would benefit all
    ARM users. It won't provide tremendous speedups, but I would expect it
    to give 2-5% to all Go programs.


Discussion Overview
group: golang-dev
categories: go
posted: Jan 25, '14 at 4:17p
active: Feb 8, '14 at 3:34p
posts: 11
users: 5
website: golang.org
