Fwd: Template code is 6.4 times slower than hand-written code

Joshua N Pritikin
Sep 9, 1999 at 7:10 pm
This confirms my suspicions. (Wow. :-)

----- Forwarded message from gcc-digest-help@gcc.gnu.org -----

Subject: Template code is 6.4 times slower than hand-written code
Sender: gcc-digest-help@gcc.gnu.org
To: gcc@gcc.gnu.org
From: stodghil@CS.Cornell.EDU
Content-Type: multipart/Mixed; boundary="openmail-part-041f486c-0000001d"

I have a particular application that makes heavy use of templates[1], and for
which g++ 2.95.1 is generating less than optimal code.

In fact, it appears that the overhead introduced by using the templates
results in a factor-of-6.4 slowdown in the innermost loop of the code[2]. I
think that some of this slowdown might be attributed to my not using the right
optimization flags to the compiler, but I'm afraid that a lot of it is because
g++ is not optimizing the instantiated code as aggressively as it could.

For instance, below are two versions of a loop that performs the following
computation:

double *x_storage;
int *x_indices;
double *yy;
double result = 0.0;
for (int ii=0; ii<x_nz; ii++)
    result += x_storage[ii] * yy[x_indices[ii]];

This code is generated from "hand-written" code[3] that is almost exactly the
same as that given above.

movl -4(%ebp),%ecx
movl (%ecx,%edx,4),%eax
fldl (%edi,%edx,8)
fmull (%ebx,%eax,8)
incl %edx
faddp %st,%st(1)
cmpl %esi,%edx
jl .L3454
movl 16(%ebp),%eax
fstpl (%eax)

This code is generated from code[3] that involved several levels of templates.

movl -64(%ebp),%ecx
movl -60(%ebp),%eax
movl -56(%ebp),%edx
movl %ecx,-248(%ebp)
movl %eax,-244(%ebp)
movl %edx,-240(%ebp)
movl %ecx,-312(%ebp)
movl %eax,-308(%ebp)
movl %edx,-304(%ebp)
movl 8(%edi),%eax
leal (%eax,%edx,8),%edx
movb -456(%ebp),%al
movb %al,-377(%ebp)
movl -484(%ebp),%eax
movl %edx,4(%eax)
movl -456(%ebp),%eax
movl -452(%ebp),%edx
movl %eax,-464(%ebp)
movl %edx,-460(%ebp)
movl 4(%esi),%eax
movl 12(%ebp),%ecx
fldl (%eax)
movl 8(%ecx),%edx
movl 4(%ecx),%eax
addl -40(%ebp),%edx
leal (%eax,%edx,8),%edx
movb -472(%ebp),%al
movb %al,-441(%ebp)
movl -488(%ebp),%eax
movl %edx,4(%eax)
movl -472(%ebp),%eax
movl -468(%ebp),%edx
movl %eax,-480(%ebp)
movl %edx,-476(%ebp)
movl 4(%ebx),%eax
fmull (%eax)
movl 16(%ebp),%ecx
faddl (%ecx)
fstpl (%ecx)
movl -56(%ebp),%edx
movl -64(%ebp),%eax
incl %edx
movl %edx,-56(%ebp)
movl (%eax,%edx,4),%eax
movl %eax,-40(%ebp)
cmpl -492(%ebp),%edx
jne .L3212

My understanding of the code is a little shaky, but I think I see at least
two problems with the latter code.

First, some invariant values are repeatedly loaded from memory. In particular,
"movl -64(%ebp),%ecx" is, I think, reloading the value of x_indices. I have
been using the __restrict keyword within the class definitions, but maybe I haven't
placed it correctly.
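
One way to rule out repeated loads of invariant values is to copy them into local variables before the loop, so the compiler does not have to prove that nothing in the loop body can modify them. The following is only a sketch under assumptions: the SparseVec struct and the dot function are hypothetical names, not taken from the benchmark, and __restrict is the g++ extension mentioned above.

```cpp
// Hypothetical stand-in for the class that holds the sparse vector;
// the real benchmark classes are not shown here.
struct SparseVec {
    const double* storage;  // nonzero values
    const int*    indices;  // position of each nonzero in the dense vector
    int           nz;       // number of nonzeros
};

double dot(const SparseVec& x, const double* yy) {
    // Hoist the loop-invariant members into locals once. The __restrict
    // qualifier (a g++ extension) promises that these pointers do not
    // alias anything written inside the loop.
    const double* __restrict xs = x.storage;
    const int*    __restrict xi = x.indices;
    const double* __restrict y  = yy;
    const int nz = x.nz;

    double result = 0.0;
    for (int ii = 0; ii < nz; ii++)
        result += xs[ii] * y[xi[ii]];
    return result;
}
```

Whether this changes the generated code depends on the compiler version and flags; comparing the -S output before and after is the only reliable check.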

Second, there seem to be a lot of "dead" writes. In particular,

movl %ecx,-248(%ebp)
movl %eax,-244(%ebp)
movl %edx,-240(%ebp)
movl %ecx,-312(%ebp)
movl %eax,-308(%ebp)
movl %edx,-304(%ebp)

These values are never read anywhere, as near as I can tell.
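
Groups of three adjacent stores like these are what one might expect if a small three-word object is being copied by value on each iteration and the copy is not fully eliminated. The sketch below is a guess at that kind of layering, not the benchmark source: Cursor, dot_generic and dot_plain are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Hand-written reference loop, equivalent to the one shown earlier.
inline double dot_plain(const double* xs, const int* xi,
                        std::size_t nz, const double* yy) {
    double result = 0.0;
    for (std::size_t ii = 0; ii < nz; ii++)
        result += xs[ii] * yy[xi[ii]];
    return result;
}

// Hypothetical three-word cursor over the nonzeros of a sparse vector.
template <class T>
struct Cursor {
    const T*    base;  // nonzero values
    const int*  idx;   // dense position of each nonzero
    std::size_t pos;   // current nonzero
    T   value() const { return base[pos]; }
    int index() const { return idx[pos]; }
};

// Generic version: the cursor is passed and copied by value.
template <class Cur, class Dense>
double dot_generic(Cur first, std::size_t nz, Dense yy) {
    double result = 0.0;
    for (; first.pos != nz; ++first.pos) {
        // By-value copy of the whole cursor. Unless the optimizer breaks
        // the struct into scalars, each copy can become three stores to
        // the stack frame that nothing later reads.
        Cur tmp = first;
        result += tmp.value() * yy[tmp.index()];
    }
    return result;
}
```

Both functions compute the same sum, so compiling each with -S and comparing the inner loops shows how much of the abstraction the optimizer removes.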

Could someone who knows a bit more about this take a look at what I'm doing
and tell me if I am doing something wrong?

Thank you.

[1] The complete source code for my "benchmark" question can be found at,


[2] The overhead from KCC 3.2f is 5.2. The overhead from the SGI MIPSpro
Compilers, version 7.2.1 is more than *42*!

[3] The code was generated using gcc 2.95.1 as follows,

% g++ -DNDEBUG -O6 -funroll-loops -malign-double -mcpu=pentiumpro \
-ansi -pedantic -W -Wall -Woverloaded-virtual -S -fno-unroll-loops \

Paul Stodghill <stodghil@cs.cornell.edu>
Dept. of Computer Science, Upson Hall, Ithaca, NY 14853
Phone: 607-254-8838 FAX: 607-255-4428

----- End forwarded message -----

"Does `competition' have an abstract purpose?"
via, but not speaking for Deutsche Bank
