This is the highest-performance version on my Chromebook. It is fully
unrolled and uses 5508 bytes of code. However, I'm concerned that 5k of
code won't fit into the I-cache on some ARM processors, so I'd like some
advice as to whether that is sensible or not.
I've made a second routine which is partially unrolled (note this
version isn't quite as polished as the first one). It uses only 1896
bytes of code but runs about 10% slower.
In comparison, the amd64 version of the code is 4963 bytes and the 386
version is 3888 bytes. Both are fully unrolled.
My feeling is that the unrolled version should be preferred as it is
faster and 5k of code isn't excessive. However, I don't want to unduly
hamper older ARM processors.
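For anyone unfamiliar with the trade-off, here is a minimal Go sketch of
rolled versus partially unrolled loops. It is not the ARM assembly from
either CL: sumRolled and sumUnrolled4 are hypothetical names made up for
illustration, and a simple 32-bit sum stands in for the real routine. The
unrolled form pays for the loop branch once per four elements but needs a
larger loop body, which is the same size-versus-speed question in
miniature.

```go
package main

import "fmt"

// sumRolled is the compact form: one add per iteration, smallest code.
func sumRolled(xs []uint32) uint32 {
	var s uint32
	for _, x := range xs {
		s += x
	}
	return s
}

// sumUnrolled4 is a partially unrolled form: four adds per iteration,
// so the loop branch is taken once per four elements at the cost of a
// larger loop body.
func sumUnrolled4(xs []uint32) uint32 {
	var s uint32
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s += xs[i]
		s += xs[i+1]
		s += xs[i+2]
		s += xs[i+3]
	}
	// Handle the leftover 0 to 3 elements.
	for ; i < len(xs); i++ {
		s += xs[i]
	}
	return s
}

func main() {
	xs := []uint32{1, 2, 3, 4, 5, 6, 7}
	// Both forms must agree; only code size and speed differ.
	fmt.Println(sumRolled(xs) == sumUnrolled4(xs))
}
```

Whether the larger body wins in practice depends on the target's I-cache
and branch costs, so it would need measuring on the hardware in question.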
I'll polish and submit one CL or the other depending on what we decide!
Nick Craig-Wood <firstname.lastname@example.org> -- http://www.craig-wood.com/nick