FAQ

Please take another look.

http://codereview.appspot.com/6820096/

Search Discussions

•  at Nov 6, 2012 at 3:22 am ⇧

Message:

I'd like you to review this change to

Description:
crypto/sha1: Make sha-1 do block mixup in place

Please review this at http://codereview.appspot.com/6820096/

Affected files:
M src/pkg/crypto/sha1/sha1block.go

Index: src/pkg/crypto/sha1/sha1block.go
===================================================================
--- a/src/pkg/crypto/sha1/sha1block.go
+++ b/src/pkg/crypto/sha1/sha1block.go
@@ -16,7 +16,7 @@
)

func block(dig *digest, p []byte) {
- var w [80]uint32
+ var w [16]uint32

h0, h1, h2, h3, h4 := dig.h[0], dig.h[1], dig.h[2], dig.h[3], dig.h[4]
for len(p) >= chunk {
@@ -26,42 +26,54 @@
j := i * 4
w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 |
uint32(p[j+3])
}
- for i := 16; i < 80; i++ {
- tmp := w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]
- w[i] = tmp<<1 | tmp>>(32-1)
- }

a, b, c, d, e := h0, h1, h2, h3, h4

// Each of the four 20-iteration rounds
// differs only in the computation of f and
// the choice of K (_K0, _K1, etc).
- for i := 0; i < 20; i++ {
+ for i := 0; i < 16; i++ {
f := b&c | (^b)&d
a5 := a<<5 | a>>(32-5)
b30 := b<<30 | b>>(32-30)
t := a5 + f + e + w[i] + _K0
a, b, c, d, e = t, a, b30, c, d
}
+ for i := 16; i < 20; i++ {
+ tmp := w[(i-3)&0xf] ^ w[(i-8)&0xf] ^ w[(i-14)&0xf] ^ w[(i)&0xf]
+ w[i&0xf] = tmp<<1 | tmp>>(32-1)
+
+ f := b&c | (^b)&d
+ a5 := a<<5 | a>>(32-5)
+ b30 := b<<30 | b>>(32-30)
+ t := a5 + f + e + w[i&0xf] + _K0
+ a, b, c, d, e = t, a, b30, c, d
+ }
for i := 20; i < 40; i++ {
+ tmp := w[(i-3)&0xf] ^ w[(i-8)&0xf] ^ w[(i-14)&0xf] ^ w[(i)&0xf]
+ w[i&0xf] = tmp<<1 | tmp>>(32-1)
f := b ^ c ^ d
a5 := a<<5 | a>>(32-5)
b30 := b<<30 | b>>(32-30)
- t := a5 + f + e + w[i] + _K1
+ t := a5 + f + e + w[i&0xf] + _K1
a, b, c, d, e = t, a, b30, c, d
}
for i := 40; i < 60; i++ {
+ tmp := w[(i-3)&0xf] ^ w[(i-8)&0xf] ^ w[(i-14)&0xf] ^ w[(i)&0xf]
+ w[i&0xf] = tmp<<1 | tmp>>(32-1)
f := b&c | b&d | c&d
a5 := a<<5 | a>>(32-5)
b30 := b<<30 | b>>(32-30)
- t := a5 + f + e + w[i] + _K2
+ t := a5 + f + e + w[i&0xf] + _K2
a, b, c, d, e = t, a, b30, c, d
}
for i := 60; i < 80; i++ {
+ tmp := w[(i-3)&0xf] ^ w[(i-8)&0xf] ^ w[(i-14)&0xf] ^ w[(i)&0xf]
+ w[i&0xf] = tmp<<1 | tmp>>(32-1)
f := b ^ c ^ d
a5 := a<<5 | a>>(32-5)
b30 := b<<30 | b>>(32-30)
- t := a5 + f + e + w[i] + _K3
+ t := a5 + f + e + w[i&0xf] + _K3
a, b, c, d, e = t, a, b30, c, d
}
•  at Nov 6, 2012 at 4:01 am ⇧

On 2012/11/06 02:12:48, notcarl wrote:
Please take another look.
lucky(~/go/src/pkg/crypto/sha1) % ~/go/misc/benchcmp {old,new}.txt
benchmark old ns/op new ns/op delta
BenchmarkHash8Bytes 863 831 -3.71%
BenchmarkHash1K 10049 9693 -3.54%
BenchmarkHash8K 74555 72132 -3.25%

benchmark old MB/s new MB/s speedup
BenchmarkHash8Bytes 9.27 9.62 1.04x
BenchmarkHash1K 101.90 105.64 1.04x
BenchmarkHash8K 109.88 113.57 1.03x

I think this can be made a little faster.

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 4:06 am ⇧
https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go
File src/pkg/crypto/sha1/sha1block.go (right):

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go#newcode35
src/pkg/crypto/sha1/sha1block.go:35: for i := 0; i < 16; i++ {
lifting i outside the for loop, then not initalising it again lower down
gets another 2% on amd64

i := 0
for ; i < 16 ; i++

lucky(~/go/src/pkg/crypto/sha1) % ~/go/misc/benchcmp {old,new}.txt
benchmark old ns/op new ns/op delta
BenchmarkHash8Bytes 863 819 -5.10%
BenchmarkHash1K 10049 9499 -5.47%
BenchmarkHash8K 74555 70395 -5.58%

benchmark old MB/s new MB/s speedup
BenchmarkHash8Bytes 9.27 9.76 1.05x
BenchmarkHash1K 101.90 107.80 1.06x
BenchmarkHash8K 109.88 116.37 1.06x

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go#newcode42
src/pkg/crypto/sha1/sha1block.go:42: for i := 16; i < 20; i++ {
for ; i < 20; i++ { .. }
and so forth

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 6:52 pm ⇧
looks good to me. will leave for dfc to lgtm and submit

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go
File src/pkg/crypto/sha1/sha1block.go (right):

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go#newcode39
src/pkg/crypto/sha1/sha1block.go:39: t := a5 + f + e + w[i] + _K0
i&0xf will get you something here.

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go#newcode42
src/pkg/crypto/sha1/sha1block.go:42: for i := 16; i < 20; i++ {
On 2012/11/06 04:06:47, dfc wrote:
for ; i < 20; i++ { .. }
and so forth
FWIW this is unlikely to matter.

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 7:39 pm ⇧
Please put the current benchcmp numbers in the CL description.
Leaving for Dave.

Thanks.
Russ

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 7:39 pm ⇧
Thanks for the quick review!

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go
File src/pkg/crypto/sha1/sha1block.go (right):

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go#newcode39
src/pkg/crypto/sha1/sha1block.go:39: t := a5 + f + e + w[i] + _K0
On 2012/11/06 18:52:06, rsc wrote:
i&0xf will get you something here.
I am not sure I understand. Since the loop only goes from [0, 15],
anding with 0xf should be a no-op.

https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1block.go#newcode42
src/pkg/crypto/sha1/sha1block.go:42: for i := 16; i < 20; i++ {
On 2012/11/06 18:52:06, rsc wrote:
On 2012/11/06 04:06:47, dfc wrote:
for ; i < 20; i++ { .. }
and so forth
FWIW this is unlikely to matter.

I was hoping that maybe the compiler saw this. I didn't unroll the
loops because the compiler should be doing it and it decreases
readability. In addition, the comments at the top of this file hint at
it not being a priority to make this as fast as possible.

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 7:08 pm ⇧

i&0xf will get you something here.
I am not sure I understand. Since the loop only goes from [0, 15],
anding with 0xf should be a no-op.
The compiler isn't smart enough to recognize that i is guaranteed to
be in range here, but &0xf will force it to notice that and avoid the
bounds check. It is unlikely to matter much.
On 2012/11/06 04:06:47, dfc wrote:
for ; i < 20; i++ { .. }
and so forth
FWIW this is unlikely to matter.
I was hoping that maybe the compiler saw this. I didn't unroll the
loops because the compiler should be doing it and it decreases
readability. In addition, the comments at the top of this file hint at
it not being a priority to make this as fast as possible.
Dave suggested a single i instead of 5 different i variables. My
comment was only that from a performance standpoint it is unlikely to
matter. The compiler does not unroll the loops, and in general I think
it is unreasonable to expect it to. The big win you can get from the
SHA1 block function is if you unroll the loops 5x and do the variable
renamings in each iteration, so that all the rotating assignments
disappear. If you want to do this it should be done in a separate CL
and the speed gain should justify the increase in code. The last time
I looked at this I think it was not quite enough. crypto/md5/gen.go is
the program that generates the MD5 block routine. You could use
something similar for SHA1.

Russ
•  at Nov 6, 2012 at 7:39 pm ⇧

On 2012/11/06 19:08:54, rsc wrote:
i&0xf will get you something here.
I am not sure I understand. Since the loop only goes from [0, 15],
anding with 0xf should be a no-op.
The compiler isn't smart enough to recognize that i is guaranteed to
be in range here, but &0xf will force it to notice that and avoid the
bounds check. It is unlikely to matter much.

I adding this change, and got a small increase in speed.

On 2012/11/06 04:06:47, dfc wrote:
for ; i < 20; i++ { .. }
and so forth
FWIW this is unlikely to matter.
I was hoping that maybe the compiler saw this. I didn't unroll the
loops because the compiler should be doing it and it decreases
readability. In addition, the comments at the top of this file hint
at
it not being a priority to make this as fast as possible.
Dave suggested a single i instead of 5 different i variables. My
comment was only that from a performance standpoint it is unlikely to
matter. The compiler does not unroll the loops, and in general I think
it is unreasonable to expect it to. The big win you can get from the
SHA1 block function is if you unroll the loops 5x and do the variable
renamings in each iteration, so that all the rotating assignments
disappear. If you want to do this it should be done in a separate CL
and the speed gain should justify the increase in code. The last time
I looked at this I think it was not quite enough. crypto/md5/gen.go is
the program that generates the MD5 block routine. You could use
something similar for SHA1.
Russ
The fear is that unrolling the loop means potentially more i-cache
misses, so I agree it should be in a future CL. Hoisting i out of the
loop in the meantime showed a little extra gain.

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 7:39 pm ⇧
Hello dave@cheney.net, rsc@golang.org (cc: golang-dev@googlegroups.com),

Please take another look.

http://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 7:57 pm ⇧

On 2012/11/06 19:38:57, rsc wrote:
Please put the current benchcmp numbers in the CL description.
Leaving for Dave.
Thanks.
Russ
Done.

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 9:46 pm ⇧
hmmm, i can't replicate your benchmark results.

linux/amd64: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

benchmark old ns/op new ns/op delta
BenchmarkHash8Bytes 866 824 -4.85%
BenchmarkHash1K 10058 9520 -5.35%
BenchmarkHash8K 74567 70953 -4.85%

benchmark old MB/s new MB/s speedup
BenchmarkHash8Bytes 9.24 9.71 1.05x
BenchmarkHash1K 101.81 107.56 1.06x
BenchmarkHash8K 109.86 115.46 1.05x

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 9:50 pm ⇧
\$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU W3690 @ 3.47GHz
...

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 10:04 pm ⇧
Looks pretty good to me as well. I'll run some benchmarks on 32 bit
platforms once mine get through building the overnight changes.

https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1block.go
File src/pkg/crypto/sha1/sha1block.go (right):

https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1block.go#newcode26
src/pkg/crypto/sha1/sha1block.go:26: j := i * 4
j := i << 2 avoids the imul on intel. The compiler should be smarter

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 10:51 pm ⇧

On 2012/11/06 22:04:24, dfc wrote:
Looks pretty good to me as well. I'll run some benchmarks on 32 bit platforms
once mine get through building the overnight changes.

https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1block.go
File src/pkg/crypto/sha1/sha1block.go (right):

https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1block.go#newcode26
src/pkg/crypto/sha1/sha1block.go:26: j := i * 4
j := i << 2 avoids the imul on intel. The compiler should be smarter

I tried changing the variables to unsigned to see if it would change
anything. I also tried adding in the shift instead of the mult, but
neither of these two changes affected the results of my benchmarks. I
think it would be safe to leave this alone for now.

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 11:07 pm ⇧
Fair enough. The compiler should be able to do this for us eventually.

I know you're a Googler, so most of the CLA process does not apply to
you, but have you done whatever is needed on your side ?

https://codereview.appspot.com/6820096/
•  at Nov 6, 2012 at 11:18 pm ⇧

On Tue, Nov 6, 2012 at 3:07 PM, wrote:
Fair enough. The compiler should be able to do this for us eventually.

I know you're a Googler, so most of the CLA process does not apply to
you, but have you done whatever is needed on your side ?
To be clear, are you approving this patch? Let me know.

We just need to get him into CONTRIBUTORS.

Ian
•  at Nov 6, 2012 at 11:18 pm ⇧
Yes, LGTM.
On Wed, Nov 7, 2012 at 10:18 AM, Ian Lance Taylor wrote:
On Tue, Nov 6, 2012 at 3:07 PM, wrote:
Fair enough. The compiler should be able to do this for us eventually.

I know you're a Googler, so most of the CLA process does not apply to
you, but have you done whatever is needed on your side ?
To be clear, are you approving this patch? Let me know.

We just need to get him into CONTRIBUTORS.

Ian
•  at Nov 6, 2012 at 11:24 pm ⇧
Some results from

linux/arm:

benchmark old ns/op new ns/op delta
BenchmarkHash8Bytes 6487 6497 +0.15%
BenchmarkHash1K 69794 70773 +1.40%
BenchmarkHash8K 517529 525604 +1.56%

benchmark old MB/s new MB/s speedup
BenchmarkHash8Bytes 1.23 1.23 1.00x
BenchmarkHash1K 14.67 14.47 0.99x
BenchmarkHash8K 15.83 15.59 0.98x

Wasn't expecting much, and it looks like my expectations were met. This
is within the margin of error.

https://codereview.appspot.com/6820096/
•  at Nov 7, 2012 at 2:41 am ⇧
*** Submitted as

crypto/sha1: Make sha-1 do block mixup in place

Benchmarks:

benchmark old ns/op new ns/op delta
BenchmarkHash8Bytes 762 674 -11.55%
BenchmarkHash1K 8791 7375 -16.11%
BenchmarkHash8K 65094 54881 -15.69%

benchmark old MB/s new MB/s speedup
BenchmarkHash8Bytes 10.50 11.86 1.13x
BenchmarkHash1K 116.48 138.84 1.19x
BenchmarkHash8K 125.85 149.27 1.19x

R=dave, rsc, iant
CC=golang-dev
http://codereview.appspot.com/6820096

Committer: Dave Cheney <dave@cheney.net>

http://codereview.appspot.com/6820096/

Related Discussions

 view thread | post
Discussion Overview
 group golang-dev categories go posted Nov 6, '12 at 3:22a active Nov 7, '12 at 2:41a posts 20 users 4 website golang.org

4 users in discussion

Content

People

Support

Translate

site design / logo © 2022 Grokbase