I have an inner loop that executes many trillions of times. For various
reasons, I've implemented it via an m4 macro and instantiate it where
needed. Because of this, it is easy for me to try slight variations on the
code in that loop and test the resulting operation rates. I just made a
change that I was sure would be redundant to what the compiler is already
doing, yet it turns out that I got a ~8% speedup, and that suggests a
missing compiler optimization.

The code is this (ignore the details for now; I will explain the change
below):

m4_define(`COVER',`{
	T_c := $1
	if T_c.supply == 0 {
		T_c.covered = true
		T_c.next.prev = T_c.prev
		T_c.prev.next = T_c.next
		$2++
		for T_i := T_c.down; T_i != &T_c.tNode; T_i = T_i.down {
			for T_j := T_i.right; T_j != T_i; T_j = T_j.right {
				T_j.down.up = T_j.up
				T_j.up.down = T_j.down
				T_j.col.size--
				$2++
			}
		}
	} else {
		T_c.supply--
	}
}')m4_dnl

The change is right at the top: "T_c := $1", which creates a block-local
variable holding the value of the "function's" first argument. At the call
sites, this argument is sometimes a pointer, sometimes an index into an
array of pointers (a[i]), and sometimes a member of a structure reached
through a pointer ("thing *structname; ... COVER(thing.p)"). Before this
change I had "$1" everywhere you now see "T_c". The call sites look like
this:

COVER(column,exact.Updates) // variable

COVER(j.col,exact.Updates) // struct member

COVER(forced[j],exact.Updates) // array element
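
For concreteness, COVER(forced[j],exact.Updates) now expands to roughly
this (trimmed); the point of the new first line is that forced[j] is
evaluated, and checked, exactly once:

{
	T_c := forced[j] // the single evaluation of $1
	if T_c.supply == 0 {
		T_c.covered = true
		// ... rest of the macro body, all through T_c ...
	} else {
		T_c.supply--
	}
}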

Here is what is missing. Within a basic block, when a "RHS expression"
that needs to be validated (nil-pointer check, range check, ...) appears
multiple times, the checking happens each time, even when the expression
is as simple as a[i] where neither a nor i ever appears on a left-hand
side, or sp.member where sp is only ever read.
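
A minimal illustration, using the names from above:

if a[i].supply == 0 { // range check on a[i] here
	a[i].covered = true // and the same range check again here
}

t := a[i] // hand-hoisted: one range check
if t.supply == 0 {
	t.covered = true
}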

I suggest that this go onto the compiler's optimization/CSE todo list. The
general idea is to CSE "whole expressions" before disassembling them and
doing CSE on their constituent parts.

8% is a lot of performance to waste.
Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
650-335-5765


  • Russ Cox at Oct 1, 2012 at 7:24 pm
    I bet gccgo already does this, FWIW.

    Personally, I don't believe common subexpression elimination is the
    compiler's job. For hand-written code it is easier even for people
    trying to read and understand the program if you do the CSE yourself,
    as you have here. There is an important exception: code generated
    implicitly by the compiler, since there's no opportunity for the
    programmer to help with that. I do want to do some bounds check
    elimination for Go 1.1, and perhaps that analysis will make some CSE
    fall out naturally, but perhaps not.

    Russ
  • Michael Jones at Oct 1, 2012 at 8:14 pm
    Hmmm... We disagree on this point. (Emphasis added below.) I believe it is
    the compiler's job to serve the program's meaning and the hardware's
    ability as best as possible.

    In this case, one can imagine at least a simplistic approach:

    a. in each block (from the outermost first, then inwards, recursively),
    find every "atomic" RHS variable-access expression (x, x.y, a[y]).

    b. in that block, look to see whether any part of it (x or y) appears on
    a LHS; if so, drop it from consideration.

    c. for every RHS SE surviving test b (even if referenced just once),
    rewrite it at the top of the block as a local temp ("{ t := SE; ...") and
    replace subsequent SE instances with t.

    d. count on the register-renaming and dead-register logic to kill and/or
    reorder the use of temporaries.

    Why would we aspire to do less? (There is more, of course, but this is
    simple and would cover many cases.)
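
    As a toy sketch of steps a and b (illustrative only; written against
    go/parser and go/ast rather than the compiler's own IR, and ignoring
    scopes and aliasing), one could do:

    package main

    import (
    	"fmt"
    	"go/ast"
    	"go/parser"
    	"go/token"
    )

    // cand records one distinct a[i] expression (step a).
    type cand struct {
    	base, index string // the a and the i
    	n           int    // number of occurrences
    }

    func main() {
    	src := `package p
    type T struct {
    	supply  int
    	covered bool
    }
    func f(forced []*T, j int) {
    	if forced[j].supply == 0 {
    		forced[j].covered = true
    	} else {
    		forced[j].supply--
    	}
    }`
    	fset := token.NewFileSet()
    	file, err := parser.ParseFile(fset, "p.go", src, 0)
    	if err != nil {
    		panic(err)
    	}

    	cands := map[string]*cand{}   // step a: every simple a[i] seen
    	assigned := map[string]bool{} // step b: identifiers written to

    	ast.Inspect(file, func(n ast.Node) bool {
    		switch n := n.(type) {
    		case *ast.IndexExpr: // a[i] where a and i are plain identifiers
    			a, ok1 := n.X.(*ast.Ident)
    			i, ok2 := n.Index.(*ast.Ident)
    			if ok1 && ok2 {
    				key := a.Name + "[" + i.Name + "]"
    				if cands[key] == nil {
    					cands[key] = &cand{base: a.Name, index: i.Name}
    				}
    				cands[key].n++
    			}
    		case *ast.AssignStmt: // a bare "x = ..." or "x := ..."
    			for _, lhs := range n.Lhs {
    				if id, ok := lhs.(*ast.Ident); ok {
    					assigned[id.Name] = true
    				}
    			}
    		case *ast.IncDecStmt: // a bare "x++" or "x--"
    			if id, ok := n.X.(*ast.Ident); ok {
    				assigned[id.Name] = true
    			}
    		}
    		return true
    	})

    	// Step c would hoist each survivor to "t := a[i]" at the top of
    	// its block and rewrite the uses; step d is left to the back end.
    	for key, c := range cands {
    		if !assigned[c.base] && !assigned[c.index] {
    			fmt.Printf("hoistable: %s (%d uses)\n", key, c.n)
    		}
    	}
    }

    Run on the example from this thread, it reports forced[j] as hoistable
    with 3 uses. A real pass would of course work per block on the
    compiler's internal form.
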
    On Mon, Oct 1, 2012 at 12:23 PM, Russ Cox wrote:

    *Personally, I don't believe common subexpression elimination is the
    compiler's job.* [...]
  • Ian Lance Taylor at Oct 1, 2012 at 8:27 pm

    On Mon, Oct 1, 2012 at 1:13 PM, Michael Jones wrote:
    Why would we aspire to do less? [...]
    There is, of course, a vast literature on compiler optimizations. As
    Russ said, I expect that gccgo already does this particular
    optimization. Since one of the major goals of the gc compiler is fast
    compilation time, I think it's worth asking how much optimization it
    should do, and how many times it should walk over the function. I
    don't know the answer.

    Ian
  • Russ Cox at Oct 1, 2012 at 8:47 pm
    Why would we aspire to do less?
    In this case, only because the programmer can do it too *and* if the
    programmer does it, the result is a program that is cleaner and easier
    to read. That is, if I saw this in real life:

    if forced[j].supply == 0 {
    	forced[j].covered = true
    	forced[j].next.prev = forced[j].prev
    	forced[j].prev.next = forced[j].next
    	for T_i := forced[j].down; T_i != &forced[j].tNode; T_i = T_i.down {
    		...
    	}
    } else {
    	forced[j].supply--
    }

    I would ask the programmer to factor out the forced[j] even if it
    wasn't a hot spot. I think it's unlikely you'd have written the code
    this way by hand: you'd have factored out forced[j] too. It only came
    up because the program is auto-generated.
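
    That is, I would expect something like this, which is the same hoist the
    new T_c line in the macro performs:

    c := forced[j]
    if c.supply == 0 {
    	c.covered = true
    	c.next.prev = c.prev
    	c.prev.next = c.next
    	for T_i := c.down; T_i != &c.tNode; T_i = T_i.down {
    		...
    	}
    } else {
    	c.supply--
    }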

    I understand that in auto-generated code the tradeoffs may be slightly
    different, but the compiler doesn't have time to do everything, and in
    general I would rather the compiler spend its time on optimizations
    that help well-written programs run faster, not on ones that make
    poorly-written programs run like well-written programs. A corollary is
    that in my world authors of mechanically generated code do have to
    work a little harder to generate good code. I understand your
    position; it's just a matter of what to spend compile time on.

    Russ
  • Michael Jones at Oct 1, 2012 at 8:56 pm
    I admit that I would not have "knowingly" written it this way.

    This inspires a question about the present inlining system -- would it
    evaluate function arguments once and "cache" them in temporaries, or does
    it do textual substitution (such that expensive arguments are reevaluated
    even when side-effect free)?
  • Russ Cox at Oct 1, 2012 at 9:06 pm

    This inspires a question about the present inlining system -- would it
    evaluate function arguments once and "cache" them in temporaries, or does
    it do textual substitution (such that expensive arguments are reevaluated
    even when side-effect free)?
    No, all arguments to inline functions are cached in temporaries.
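
    Conceptually, a call cover(forced[j]) to a hypothetical function form of
    the macro inlines as:

    tmp := forced[j] // argument evaluated once into a temporary
    if tmp.supply == 0 {
    	// ... inlined body, using tmp throughout ...
    }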

    Russ
  • Michael Jones at Oct 1, 2012 at 10:06 pm
    That explains why the inlining helped more than expected: inlining in 6g
    forces the compiler to do exactly what is discussed here. Too bad all
    code can't have this at present.
  • Ingo Oeser at Oct 1, 2012 at 11:32 pm
    One could have both: introduce a refactoring step. This would be intended
    to help the programmer with the refactoring monkey jobs.

    And the result would be along the lines of what go fmt does when it
    simplifies, but much more aggressive.

    This could be run automatically for generated code by some build step,
    and be used by IDEs for nice interactive refactoring.

    Building blocks for this are already in go fix and go fmt.

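    For scale, gofmt -s already performs small mechanical rewrites of this
    flavor, for example dropping redundant element types from composite
    literals (Point here is just an illustration):

    points := []Point{Point{1, 2}, Point{3, 4}} // before gofmt -s
    points := []Point{{1, 2}, {3, 4}}           // after
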
    Too crazy?
  • Russ Cox at Oct 5, 2012 at 8:08 pm

    On Mon, Oct 1, 2012 at 7:32 PM, Ingo Oeser wrote:
    Too crazy?
    No, not at all. Enabling automated program manipulation is exactly why
    Go is easy to parse and ships with a standard formatter.

    Russ
