Go 1.5's concurrent garbage collector poses new challenges for how to
schedule garbage collection work so that the collector finishes quickly and
on time without sacrificing concurrency. Here's my proposal for the garbage
collector "pacing" algorithm I'm planning to implement to address this in
Go 1.5: https://golang.org/s/go15gcpacing

Comments welcome (please comment in this thread).

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

  • Antoine Grondin at Mar 11, 2015 at 7:11 am
    Hi!

    Nice read; I found the explanations very clear!

    I was trying to follow the maths in appendix A and saw this (image with
    annotation):


    <https://lh6.googleusercontent.com/-yM2N0xnOTds/VP_oh-w727I/AAAAAAAAACA/jokfzS19Gpw/s1600/Cursor_and_Go_1_5_concurrent_garbage_collector_pacing_-_Google_Docs.png>

    It seems the derivation should be:

    <https://lh3.googleusercontent.com/--sPc3SLJurQ/VP_qP0PAfwI/AAAAAAAAACM/FY2MCP_80_E/s1600/CodeCogsEqn.png>
  • Austin Clements at Mar 11, 2015 at 3:30 pm
    Good catch. I've now fixed this, as well as a few other typos in the
    derivation. I believe the final derivation was correct (plus it has held up
    in simulations), I just copied things off my whiteboard wrong. :)

    Thanks!
  • Tomwilde at Mar 11, 2015 at 11:15 am
    Chrome on Linux here, can't see some symbols in your text. Example:

    https://hostr.co/file/nYNjLO3TixYS/Screenshotfrom2015-03-11120721.png

    It's not entirely illegible but it gets confusing further down.
  • Austin Clements at Mar 11, 2015 at 3:33 pm
    The screenshot you sent is showing up correctly. The box is a placeholder
    saying that this relation applies to H_a/h_a, H_g/h_g, and H_T/h_T. Are
    there other places where the symbols are not showing up correctly?
  • Dmitry Vyukov at Mar 11, 2015 at 4:07 pm


    "This includes time in the background collector and assists from the
    mutator, but not time in write barriers (simply because the accounting
    would increase write barrier overhead) or secondary effects like
    increased cache misses."

    This assumes that write barriers and cache misses do not introduce
    significant slowdown, right? Do we have any estimates? An enabled write
    barrier can introduce some overhead, and the user generally does not
    care whether the additional latency is introduced by assists or by
    write barriers. Probably we need to shoot for something like 21% of
    measured overhead and 4% of auxiliary overheads.

    "this idle utilization does not count against the GC CPU budget"

    I am not sure I understand this correctly. Do you mean that GC will
    eat 25% on top of the 50% that it eats from idle time? Idle utilization
    should count towards the total CPU budget. For example, if a program
    consumes 50% of CPU time, GC should just eat the idle 50% rather than
    eat the idle 50% and then 25% more.
    Similarly, "Goal 2. Minimize ug - ua(n)" should be "Minimize max(Ug,
    Uidle) - Ua".

    "We(n)=w(n)Hm(n-1)"

    How is it used? I only see a reference in the "ideal assist work",
    which you discard and replace with a credit/debit system.

    "This system of work credit is quite flexible."

    How exactly does it work?
    It is possible that the background scanner has created a full load of
    credit, but we still don't want mutators to use the credit; rather,
    they should stop where they are and assist, otherwise they will
    over-allocate.

    "Instead, the runtime will adjust hT based on an estimate of what the
    heap growth would have been if GC CPU utilization was ug=0.25".

    What if GC actually scans with Ug=0.9 because of idle time? If my math
    is correct, this formula will emphatically schedule GC way ahead of
    ideal time.

  • Austin Clements at Mar 11, 2015 at 4:34 pm

    On Wed, Mar 11, 2015 at 12:06 PM, Dmitry Vyukov wrote:


    "This includes time in the background collector and assists from the
    mutator, but not time in write barriers (simply because the accounting
    would increase write barrier overhead) or secondary effects like
    increased cache misses."

    This assumes that write barriers and cache misses do not introduce
    significant slowdown, right? Do we have any estimates? An enabled write
    barrier can introduce some overhead, and the user generally does not
    care whether the additional latency is introduced by assists or by
    write barriers. Probably we need to shoot for something like 21% of
    measured overhead and 4% of auxiliary overheads.
    We haven't seen significant slowdowns from write barriers yet, but that
    could simply be because other things are higher on the list. There is some
    low-hanging fruit for speeding up write barriers when the time comes. And
    if they still turn out to be non-trivial, then I think we can account for
    them by doing exactly what you suggested and lowering the goal utilization
    to give some margin.

    "this idle utilization does not count against the GC CPU budget"
    I am not sure I understand this correctly. Do you mean that GC will
    eat 25% on top of the 50% that it eats from idle time? Idle utilization
    should count towards the total CPU budget. For example, if a program
    consumes 50% of CPU time, GC should just eat the idle 50% rather than
    eat the idle 50% and then 25% more.
    Similarly, "Goal 2. Minimize ug - ua(n)" should be "Minimize max(Ug,
    Uidle) - Ua".
    GC will consume its 25% and *then* whatever's left of idle time. That is,
    if the mutator is consuming 50% of the CPU (not counting assists), and
    assists are taking x%, then background GC will be scheduled for (50-x)% and
    "billed" for (25-x)% (assuming x<25).

    "We(n)=w(n)Hm(n-1)"

    How is it used? I only see a reference in the "ideal assist work",
    which you discard and replace with a credit/debit system.
    The credit/debit system is still driven by the ideal assist work. The only
    change is that rather than performing A(x) work directly, the mutator first
    tries to debit A(x) work from the background GC's credit and if it can't
    debit all of the work, it does some itself.

    "This system of work credit is quite flexible."

    How exactly does it work?
    It is possible that the background scanner has created a full load of
    credit, but we still don't want mutators to use the credit; rather,
    they should stop where they are and assist, otherwise they will
    over-allocate.
    Since the assist is still being driven by the ideal work, then (assuming
    the scan work estimate is accurate) mutators still won't be able to
    over-allocate.

    Does that clarify?

    "Instead, the runtime will adjust hT based on an estimate of what the
    heap growth would have been if GC CPU utilization was ug=0.25".

    What if GC actually scans with Ug=0.9 because of idle time? If my math
    is correct, this formula will emphatically schedule GC way ahead of
    ideal time.
    I'm not quite sure what you're asking. u_g is a constant (0.25) and u_a
    excludes idle time slurped up by background GC. (You might be right that
    something isn't quite right with idle time handling. I haven't checked that
    part as carefully as I have the other parts and it's the one part I haven't
    modeled in my simulator.)
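
    The credit/debit mechanism Austin describes can be sketched in a few
    lines of Go. This is a toy model, not the runtime's implementation:
    the type and field names (gcController, bgScanCredit, assist) are
    hypothetical, and the real runtime would need per-P caching and atomic
    operations.

```go
package main

import "fmt"

// gcController is a toy model of the credit/debit system described above.
// The type and field names are hypothetical, not the runtime's.
type gcController struct {
	bgScanCredit int64 // scan work banked by the background collector
}

// assist models a mutator that owes `work` units of scan work for its
// allocation. It first debits the background collector's credit and only
// performs the remainder itself.
func (c *gcController) assist(work int64) (selfScanned int64) {
	if c.bgScanCredit >= work {
		c.bgScanCredit -= work // fully paid from banked credit
		return 0
	}
	work -= c.bgScanCredit // credit covers part of the debt
	c.bgScanCredit = 0
	return work // the mutator scans the rest itself
}

func main() {
	c := &gcController{bgScanCredit: 100}
	fmt.Println(c.assist(30)) // 0: paid entirely from credit
	fmt.Println(c.assist(90)) // 20: credit covers 70, mutator scans 20
	fmt.Println(c.assist(50)) // 50: no credit left, full assist
}
```

    The point of the scheme is visible here: the assist obligation is
    still computed from the ideal assist work, so total scan work per byte
    allocated is unchanged; credit only shifts who performs it.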

  • Dmitry Vyukov at Mar 11, 2015 at 4:46 pm

    On Wed, Mar 11, 2015 at 7:34 PM, Austin Clements wrote:
    GC will consume its 25% and *then* whatever's left of idle time. That is, if
    the mutator is consuming 50% of the CPU (not counting assists), and assists
    are taking x%, then background GC will be scheduled for (50-x)% and "billed"
    for (25-x)% (assuming x<25).
    Then what is Ug in the formula, and how is it measured?

  • Dmitry Vyukov at Mar 11, 2015 at 4:54 pm

    On Wed, Mar 11, 2015 at 7:46 PM, Dmitry Vyukov wrote:
    Then what is Ug in the formula, and how is it measured?

    I mean: what is Ua, and how is it measured?

  • Austin Clements at Mar 11, 2015 at 10:01 pm

    On Wed, Mar 11, 2015 at 12:53 PM, Dmitry Vyukov wrote:
    I mean: what is Ua, and how is it measured?
    I assume you mean ua (or u_a), not Ua. If there's a capital U in the doc
    it's a typo.

    u_a is the fraction of GOMAXPROCS used by background GC up to the 25%
    budget plus mutator assists. It does not include background GC run during
    idle time beyond this 25% budget.

    For example, if the mutator (excluding assists) is using 10% of GOMAXPROCS,
    and assists are using 5% of GOMAXPROCS, non-idle background GC will use 20%
    of GOMAXPROCS, and idle background GC will use the remaining 65% of
    GOMAXPROCS. u_a will be 0.25 from assists and non-idle background GC.

    In terms of the actual implementation of this, this will be based on
    wall-clock time (not raw CPU time as measured by, say, getrusage). I'm
    planning to use nanotime around assists and at scheduling quanta that
    consider running background GC. nanotime isn't quite as cheap as cputicks,
    but on most OSes it's close, it doesn't have problems with monotonicity,
    and I expect any overhead it does have to be amortized.
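
    Austin's example above can be checked with simple arithmetic (integer
    percentages of GOMAXPROCS; this is just a sanity check of the
    accounting, not runtime code):

```go
package main

import "fmt"

func main() {
	// Integer percentages of GOMAXPROCS, taken from the example above.
	mutator := 10 // mutator work, excluding assists
	assists := 5  // mutator assists
	budget := 25  // GC CPU budget (u_g = 0.25)

	// Non-idle background GC fills whatever the assists leave of the budget.
	bgNonIdle := budget - assists
	// Idle background GC soaks up all remaining CPU.
	bgIdle := 100 - mutator - assists - bgNonIdle
	// u_a counts assists plus non-idle background GC, but not idle GC.
	ua := assists + bgNonIdle

	fmt.Println(bgNonIdle, bgIdle, ua) // 20 65 25
}
```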

  • Dmitry Vyukov at Mar 12, 2015 at 7:20 am

    On Thu, Mar 12, 2015 at 1:01 AM, Austin Clements wrote:
    u_a is the fraction of GOMAXPROCS used by background GC up to the 25% budget
    plus mutator assists. It does not include background GC run during idle time
    beyond this 25% budget.
    I still don't understand. You are basically saying that you want to
    keep utilization as close to 25% as possible.
    This looks like a very synthetic and uninteresting goal. I still
    think that the real goal is: "Minimize max(u_g, u_idle) - u_a"
    For example, if the mutator (excluding assists) is using 10% of GOMAXPROCS,
    and assists are using 5% of GOMAXPROCS, non-idle background GC will use 20%
    of GOMAXPROCS, and idle background GC will use the remaining 65% of
    GOMAXPROCS. u_a will be 0.25 from assists and non-idle background GC.

    In terms of the actual implementation of this, this will be based on
    wall-clock time (not raw CPU time as measured by, say, getrusage). I'm
    planning to use nanotime around assists and at scheduling quanta that
    consider running background GC. nanotime isn't quite as cheap as cputicks,
    but on most OSes it's close, it doesn't have problems with monotonicity, and
    I expect any overhead it does have to be amortized.
    How will you capture the concurrency aspect of utilization?
    For a 4-second GC period, stopping the world for 1 second gives
    exactly 25%. But that's obviously unacceptable.
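
    Dmitry's 4-second example is easy to verify numerically: averaged over
    the window, a 1-second stop-the-world pause and smooth 25% concurrent
    utilization are indistinguishable (a toy calculation, not runtime
    code):

```go
package main

import "fmt"

func main() {
	const window = 4.0 // seconds in the GC cycle

	// Scenario A: smooth concurrent GC at 25% for the whole window.
	smooth := (0.25 * window) / window

	// Scenario B: stop the world for 1s (GC at 100%), then 3s with no GC.
	stw := (1.0*1.0 + 3.0*0.0) / window

	fmt.Println(smooth, stw) // 0.25 0.25: the average hides the pause
}
```

    Both scenarios average to 25% utilization, which is why a utilization
    metric alone cannot capture the concurrency (latency) aspect.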

  • Dmitry Vyukov at Mar 12, 2015 at 7:50 am

    On Thu, Mar 12, 2015 at 10:19 AM, Dmitry Vyukov wrote:
    I still don't understand. You are basically saying that you want to
    keep utilization as close to 25% as possible.
    This looks like a very synthetic and uninteresting goal. I still
    think that the real goal is: "Minimize max(u_g, u_idle) - u_a"
    I am probably thinking in different terms.
    What you are saying is that we want to minimize mutator assists, is
    that correct?
    We always consume the 25% with GOMAXPROCS/4 background GC threads, so
    we can just ignore them. What's left is assists; they are always above
    the cap, and so we want to minimize them.

  • Dmitry Vyukov at Mar 11, 2015 at 4:50 pm

    On Wed, Mar 11, 2015 at 7:34 PM, Austin Clements wrote:
    On Wed, Mar 11, 2015 at 12:06 PM, Dmitry Vyukov wrote:

    On Tue, Mar 10, 2015 at 6:43 PM, 'Austin Clements' via golang-dev
    wrote:
    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes quickly
    and
    on time without sacrificing concurrency. Here's my proposal for the
    garbage
    collector "pacing" algorithm I'm planning to implement to address this
    in Go
    1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

    "This includes time in the background collector and assists from the
    mutator, but not time in write barriers (simply because the accounting
    would increase write barrier overhead) or secondary effects like
    increased cache misses."

    This assumes that write barriers and cache missed do not introduce
    significant slowdown, right? Do we have any estimations? Enabled write
    barrier can introduce some overhead, and user generally does not care
    whether the additional latency is introduced by assists or by write
    barriers. Probably we need to shoot for something like 21% of measured
    overhead and 4% of aux overheads.

    We haven't seen significant slowdowns from write barriers yet, but that
    could simply be because other things are higher on the list. There is some
    low-hanging fruit for speeding up write barriers when the time comes. And if
    they still turn out to be non-trivial, then I think we can account for them
    by doing exactly what you suggested and lowering the goal utilization to
    give some margin.
    "this idle utilization does not count against the GC CPU budget"

    I am not sure I understand this correctly. Do you mean that GC will
    eat 25% on top of the 50% that it eat from idle time? Idle utilization
    should count towards total CPU budged. For example, if a program
    consumes 50% of CPU time, GC should just eat the idle 50% rather than
    eat the idle 50% and then 25% more.
    Similarly, "Goal 2. Minimize ug - ua(n)" should be "Minimize max(Ug,
    Uidle) - Ua".

    GC will consume its 25% and *then* whatever's left of idle time. That is, if
    the mutator is consuming 50% of the CPU (not counting assists), and assists
    are taking x%, then background GC will be scheduled for (50-x)% and "billed"
    for (25-x)% (assuming x<25).
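    As a worked example of this billing arithmetic (the numbers and the helper below are illustrative, not from the design doc):

```go
package main

import "fmt"

// gcShares illustrates the CPU accounting described above. With a goal
// GC utilization of 25%, a mutator using `mutator` of the CPU and
// assists taking `assists`, background GC is scheduled for whatever is
// left, but only "billed" (goal - assists) against the GC budget; the
// remainder is idle time, which does not count against the budget.
func gcShares(mutator, assists float64) (scheduled, billed float64) {
	const goal = 0.25
	scheduled = 1 - mutator - assists // background GC takes what's left
	billed = goal - assists           // assuming assists < goal
	return
}

func main() {
	// Mutator at 50%, assists at 10%: GC runs on 40% of the CPU but
	// is billed for only 15% of it.
	sched, billed := gcShares(0.50, 0.10)
	fmt.Printf("scheduled=%.0f%% billed=%.0f%%\n", sched*100, billed*100)
}
```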
    "We(n)=w(n)Hm(n-1)"

    How is it used? I only see a reference in the "ideal assist work",
    which you discard and replace with credit/debit system.

    The credit/debit system is still driven by the ideal assist work. The only
    change is that rather than performing A(x) work directly, the mutator first
    tries to debit A(x) work from the background GC's credit and if it can't
    debit all of the work, it does some itself.
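    A minimal sketch of that credit/debit flow (my own simplification of the mechanism described, not the runtime's actual implementation):

```go
package main

import "fmt"

// gcCredit models the shared bank of scan work: the background
// collector deposits work it has done, and a mutator computes its
// ideal assist work A(x), debits as much as possible from the bank,
// and performs only the remainder itself.
type gcCredit struct {
	credit int64 // scan work banked by the background collector
}

func (c *gcCredit) deposit(work int64) { c.credit += work }

// assist returns how much scan work the mutator must do itself after
// debiting as much of `ideal` as possible from the banked credit.
func (c *gcCredit) assist(ideal int64) (selfWork int64) {
	debited := minWork(ideal, c.credit)
	c.credit -= debited
	return ideal - debited
}

func minWork(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	var c gcCredit
	c.deposit(100)            // background GC banked 100 units of work
	fmt.Println(c.assist(60)) // fully covered by credit: mutator does 0
	fmt.Println(c.assist(60)) // only 40 credit left: mutator does 20
}
```

    The real runtime would need atomic updates here since many mutators debit concurrently; this sketch only shows the accounting.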
    "This system of work credit is quite flexible."

    How exactly does it work?
    It is possible that the background scanner has created a full load of
    credit, but we still don't want mutators to use the credit; rather, they
    should stop where they are and assist, otherwise they will over-allocate.

    Since the assist is still being driven by the ideal work, then (assuming the
    scan work estimate is accurate) mutators still won't be able to
    over-allocate.

    Does that clarify?
    Yes, it clarifies. But the formula for the ideal assist work should
    contain Wa, which we don't know, instead of We. Because of that we can
    easily over-allocate, which is not acceptable.

    --
    You received this message because you are subscribed to the Google Groups "golang-dev" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Austin Clements at Mar 11, 2015 at 5:29 pm

    On Wed, Mar 11, 2015 at 12:49 PM, Dmitry Vyukov wrote:
    Yes, it clarifies. But the formula for the ideal assist work should
    contain Wa, which we don't know, instead of We. Because of that we can
    easily over-allocate, which is not acceptable.
    If the assist work is based on something we don't know, then we won't be
    able to compute how much assist work to do. That's why it's based on an
    estimate.

    I'm not sure what you mean by "easily" over-allocate, but, yes, if the work
    estimate is wrong we can exceed GOGC in that cycle. That's why the system
    revises the work estimate as it runs so that future cycles won't exceed
    GOGC. As far as I can tell, the only way to ensure a hard limit is to stop
    the world if allocation exceeds that limit. Given how rough the GOGC limit
    is anyway, this seems like a bad trade-off.
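    One way to picture that between-cycle revision, using the estimate We(n) = w(n)·Hm(n-1) from the proposal (the 50/50 blend below is my own illustrative choice, not the design doc's exact rule):

```go
package main

import "fmt"

// reviseEstimate sketches how the scan work estimate We(n) = w(n)*Hm(n-1)
// (marked heap of the previous cycle times an estimated work ratio) can
// be revised each cycle from what the previous cycle actually observed,
// so a wrong estimate corrects itself in future cycles.
func reviseEstimate(wPrev, actualWork, markedPrev float64) (wNext, weNext float64) {
	observed := actualWork / markedPrev // ratio cycle n-1 actually required
	wNext = 0.5*wPrev + 0.5*observed    // fold the observation into the estimate
	weNext = wNext * markedPrev         // estimate for the coming cycle
	return
}

func main() {
	// Previous estimate w=0.40, but the last cycle did 500 units of
	// work against a 1000-unit marked heap (observed ratio 0.50).
	w, we := reviseEstimate(0.4, 500, 1000)
	fmt.Printf("w=%.2f We=%.0f\n", w, we)
}
```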

  • Dmitry Vyukov at Mar 11, 2015 at 6:02 pm

    On Wed, Mar 11, 2015 at 8:29 PM, Austin Clements wrote:
    I'm not sure what you mean by "easily" over-allocate, but, yes, if the
    work estimate is wrong we can exceed GOGC in that cycle. That's why the
    system revises the work estimate as it runs so that future cycles won't
    exceed GOGC. As far as I can tell, the only way to ensure a hard limit
    is to stop the world if allocation exceeds that limit. Given how rough
    the GOGC limit is anyway, this seems like a bad trade-off.
    GOGC is not rough today. It is quite precise. Over-allocation by 10%
    is bad. Over-allocation by 30% is simply unacceptable; it will cause
    program crashes.
    We cannot rely on a potentially wrong work estimate for heap size bounding.

  • Austin Clements at Mar 11, 2015 at 6:43 pm

    On Wed, Mar 11, 2015 at 2:01 PM, Dmitry Vyukov wrote:
    GOGC is not rough today. It is quite precise. Over-allocation by 10%
    is bad. Over-allocation by 30% is simply unacceptable; it will cause
    program crashes.
    We cannot rely on a potentially wrong work estimate for heap size bounding.
    I meant that it's rough as a tuning parameter since it's relative to live
    heap size. Unless a program is somehow managing its live heap size to
    within 10%, it's not managing its maximum heap size to within 10%.

    Can you give an example of something that's so sensitive to its maximum
    heap size that it can't exceed GOGC? Rick, Russ and I have been operating
    under the assumption that heap and CPU limits are essentially soft and we
    need to know if there's a really compelling case for making them hard.

  • Dmitry Vyukov at Mar 12, 2015 at 7:09 am

    On Wed, Mar 11, 2015 at 9:43 PM, Austin Clements wrote:
    GOGC is not rough today. It is quite precise. Over-allocation by 10%
    is bad. Over-allocation by 30% is simply unacceptable, it will cause
    program crashes.
    We cannot rely on a potentially wrong work estimate for heap size
    bounding.

    I meant that it's rough as a tuning parameter since it's relative to live
    heap size. Unless a program is somehow managing its live heap size to
    within 10%, then it's not managing its maximum heap size to within 10%.
    Live heap size is fully transparent to the user. If memory consumption
    becomes a problem, the user can measure and optimize the size of
    persistent data structures, per-request memory consumption, and the
    number of requests in flight. And the live heap == persistent data +
    per-request memory * number of concurrent requests.
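    As a concrete instance of that decomposition (all numbers invented for illustration):

```go
package main

import "fmt"

func main() {
	// live heap = persistent data + per-request memory * concurrent requests
	const (
		persistent = 2 << 30 // 2 GB of long-lived data structures
		perRequest = 4 << 20 // 4 MB allocated per in-flight request
		inFlight   = 256     // concurrent requests
	)
	live := persistent + perRequest*inFlight
	fmt.Printf("live heap = %d MB\n", live>>20) // 2048 + 1024 = 3072 MB
}
```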

    Can you give an example of something that's so sensitive to its maximum heap
    size that it can't exceed GOGC? Rick, Russ and I have been operating under
    the assumption that heap and CPU limits are essentially soft and we need to
    know if there's a really compelling case for making them hard.
    I agree that everything related to CPU is soft. There is always the OS
    scheduler and other processes; CPU is harder to measure, and in some
    sense it is harder to even define CPU consumption; also, you have an
    infinite amount of CPU time (every second you get another second of
    CPU time). So besides hard-real-time systems, hard CPU limits do not
    make much sense.

    Memory is a completely different story. It's easy to define and
    measure; environments frequently have hard bounds on memory
    consumption (hard in the sense that a process will be terminated
    rather than temporarily throttled, as with CPU). For systems without
    swap (and to some degree for systems with swap) any over-allocation
    becomes persistent. So it is super important to not allow any
    noticeable over-allocation. Consider a program with a live set of 3GB
    in a container with a hard limit of 8GB. If we don't provide a hard
    limit on heap size, it is not possible to operate Go programs in such
    an environment at all.
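    The arithmetic behind that example, using GOGC's standard meaning of heap goal = live × (1 + GOGC/100) (the 30% overshoot figure is taken from the earlier message in this thread):

```go
package main

import "fmt"

func main() {
	const (
		liveGB  = 3.0
		gogc    = 100.0 // default: heap may grow to 2x the live set
		limitGB = 8.0   // hypothetical container hard limit
	)
	goal := liveGB * (1 + gogc/100) // 6 GB target heap size
	// A 30% overshoot of the goal puts the heap at 7.8 GB,
	// dangerously close to the 8 GB container limit.
	overshoot := goal * 1.3
	fmt.Printf("goal=%.1fGB overshoot=%.1fGB limit=%.0fGB\n",
		goal, overshoot, limitGB)
}
```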

  • Dmitry Vyukov at Mar 11, 2015 at 4:52 pm

    On Wed, Mar 11, 2015 at 7:34 PM, Austin Clements wrote:
    On Wed, Mar 11, 2015 at 12:06 PM, Dmitry Vyukov wrote:
    "Instead, the runtime will adjust hT based on an estimate of what the
    heap growth would have been if GC CPU utilization was ug=0.25".

    What if GC actually scans with Ug=0.9 because of idle time? If my math
    is correct, this formula will emphatically schedule GC way ahead of
    ideal time.

    I'm not quite sure what you're asking. u_g is a constant (0.25) and u_a
    excludes idle time slurped up by background GC. (You might be right that
    something isn't quite right with idle time handling. I haven't checked that
    part as carefully as I have the other parts and it's the one part I haven't
    modeled in my simulator.)

    You schedule GC assuming that GC will consume only 25% of CPU. It can
    easily happen that GC consumes more because of idle time. Then you
    will consistently schedule GC too early.

  • Austin Clements at Mar 11, 2015 at 10:15 pm

    On Wed, Mar 11, 2015 at 12:51 PM, Dmitry Vyukov wrote:
    You schedule GC assuming that GC will consume only 25% of CPU. It can
    easily happen that GC consumes more because of idle time. Then you
    will consistently schedule GC too early.
    If GC consumes more CPU because of idle time (and does so consistently for
    a few GC cycles), then the trigger controller will adapt to this and move
    the trigger higher. It's true that if the mutator goes from very idle to
    very busy, then the trigger may be too high, so the next few cycles may
    miss their CPU utilization goals (assuming the mutator also ramps up its
    allocation rate), but as in other cases, the controller will pull the
    trigger lower and adapt after a few GC cycles.

    BTW, I added idle time modeling to my simulator and confirmed that the
    formulas do the right thing, given that u_a excludes idle time background
    GC. (If u_a were to include idle time background GC, then the error term in
    the trigger controller would penalize background GC for using idle time and
    pull the trigger lower, just like it would if mutator assists were
    overloading the CPU. With mutator assists, this has the effect of spreading
    their utilization over a longer period in the next cycle. But, idle time
    background GC is different. Assuming the next cycle has the same idle time,
    the background GC will continue to make use of the idle time and now it
    will just finish early. So it's important that u_a *not* include idle time
    background GC.)
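    A simplified sketch of the trigger controller being described: a proportional controller that nudges the trigger heap-growth ratio hT toward the point where GC finishes at the goal growth with the goal utilization. The error term and the gain of 0.5 are my paraphrase of the proposal, not a verbatim copy:

```go
package main

import "fmt"

// reviseTrigger adjusts the heap-growth ratio at which the next GC
// cycle starts. If actual utilization ran above the goal (e.g. assists
// overloaded the CPU), the error goes negative and the trigger is
// pulled lower so GC starts earlier; if GC finished with CPU budget to
// spare, the trigger moves higher. Idle-time background GC is excluded
// from uActual, per the discussion above.
func reviseTrigger(hT, hGoal, hActual, uActual, uGoal float64) float64 {
	e := hGoal - hT - (uActual/uGoal)*(hActual-hT)
	return hT + 0.5*e // proportional gain of 0.5 smooths the response
}

func main() {
	// GC triggered at 70% heap growth, aimed to finish by 100%, and
	// actually finished at 90% growth using 30% CPU against a 25% goal.
	fmt.Printf("%.3f\n", reviseTrigger(0.7, 1.0, 0.9, 0.30, 0.25))
}
```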

  • Dmitry Vyukov at Mar 12, 2015 at 7:47 am

    On Thu, Mar 12, 2015 at 1:15 AM, Austin Clements wrote:
    On Wed, Mar 11, 2015 at 12:51 PM, Dmitry Vyukov wrote:

    On Wed, Mar 11, 2015 at 7:34 PM, Austin Clements <austin@google.com>
    wrote:
    On Wed, Mar 11, 2015 at 12:06 PM, Dmitry Vyukov <dvyukov@google.com>
    wrote:
    On Tue, Mar 10, 2015 at 6:43 PM, 'Austin Clements' via golang-dev
    wrote:
    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes
    quickly
    and
    on time without sacrificing concurrency. Here's my proposal for the
    garbage
    collector "pacing" algorithm I'm planning to implement to address
    this
    in Go
    1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

    "This includes time in the background collector and assists from the
    mutator, but not time in write barriers (simply because the accounting
    would increase write barrier overhead) or secondary effects like
    increased cache misses."

    This assumes that write barriers and cache missed do not introduce
    significant slowdown, right? Do we have any estimations? Enabled write
    barrier can introduce some overhead, and user generally does not care
    whether the additional latency is introduced by assists or by write
    barriers. Probably we need to shoot for something like 21% of measured
    overhead and 4% of aux overheads.

    We haven't seen significant slowdowns from write barriers yet, but that
    could simply be because other things are higher on the list. There is
    some
    low-hanging fruit for speeding up write barriers when the time comes.
    And if
    they still turn out to be non-trivial, then I think we can account for
    them
    by doing exactly what you suggested and lowering the goal utilization to
    give some margin.
    "this idle utilization does not count against the GC CPU budget"

I am not sure I understand this correctly. Do you mean that GC will
eat 25% on top of the 50% that it eats from idle time? Idle utilization
should count towards the total CPU budget. For example, if a program
consumes 50% of CPU time, GC should just eat the idle 50% rather than
eat the idle 50% and then 25% more.
    Similarly, "Goal 2. Minimize ug - ua(n)" should be "Minimize max(Ug,
    Uidle) - Ua".

GC will consume its 25% and *then* whatever's left of idle time. That is,
if the mutator is consuming 50% of the CPU (not counting assists), and
assists are taking x%, then background GC will be scheduled for (50-x)% and
"billed" for (25-x)% (assuming x<25).
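For concreteness, here is that accounting as a small Go sketch. The function name and structure are mine, purely illustrative; 0.25 is the proposal's GC CPU goal u_g.

```go
package main

import "fmt"

// gcShares is an illustrative helper (not runtime code): given the
// fraction of CPU the mutator proper and the assists are using, it
// returns how much CPU background GC is scheduled for (the rest of
// the idle time) and how much of that is "billed" against the 25%
// GC CPU budget.
func gcShares(mutator, assist float64) (bgScheduled, bgBilled float64) {
	const goal = 0.25                    // u_g, the GC CPU utilization goal
	bgScheduled = 1.0 - mutator - assist // background GC soaks up the idle time
	bgBilled = goal - assist             // only (goal - assist) counts against the budget
	if bgBilled > bgScheduled {
		bgBilled = bgScheduled
	}
	return
}

func main() {
	// The example above: mutator at 50%, assists at x=5%.
	sched, billed := gcShares(0.50, 0.05)
	fmt.Printf("background GC scheduled for %.0f%%, billed for %.0f%%\n",
		sched*100, billed*100)
}
```

With x=5 this reproduces the (50-x)% = 45% scheduled and (25-x)% = 20% billed figures from the paragraph above.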
    "We(n)=w(n)Hm(n-1)"

    How is it used? I only see a reference in the "ideal assist work",
    which you discard and replace with credit/debit system.

The credit/debit system is still driven by the ideal assist work. The only
change is that rather than performing A(x) work directly, the mutator first
tries to debit A(x) work from the background GC's credit, and if it can't
debit all of the work, it does some itself.
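A minimal sketch of that debit-first behavior (types and names are mine, and the real runtime would do this with atomics and batching rather than a plain struct):

```go
package main

import "fmt"

// gcCredit models the bank of scan work deposited by background GC.
// A mutator that owes A(x) assist work debits the bank first and only
// performs whatever the bank could not cover.
type gcCredit struct {
	credit int64 // scan work banked by background GC
}

func (c *gcCredit) deposit(work int64) { c.credit += work }

// assist debits up to `ideal` units of work from the bank and returns
// how much scan work the mutator must still do itself.
func (c *gcCredit) assist(ideal int64) (selfWork int64) {
	debited := ideal
	if debited > c.credit {
		debited = c.credit
	}
	c.credit -= debited
	return ideal - debited
}

func main() {
	var c gcCredit
	c.deposit(100)            // background GC got ahead by 100 units
	fmt.Println(c.assist(60)) // fully covered by credit: prints 0
	fmt.Println(c.assist(60)) // only 40 units of credit left: mutator does 20
}
```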
    "This system of work credit is quite flexible."

    How exactly does it work?
It is possible that the background scanner has created a full load of
credit, but we still don't want mutators to use the credit but rather
stop where they are and assist, otherwise they will over-allocate.

Since the assist is still being driven by the ideal work (assuming the
scan work estimate is accurate), mutators still won't be able to
over-allocate.

    Does that clarify?
    "Instead, the runtime will adjust hT based on an estimate of what the
    heap growth would have been if GC CPU utilization was ug=0.25".

    What if GC actually scans with Ug=0.9 because of idle time? If my math
    is correct, this formula will emphatically schedule GC way ahead of
    ideal time.

I'm not quite sure what you're asking. u_g is a constant (0.25) and u_a
excludes idle time slurped up by background GC. (You might be right that
something isn't quite right with idle time handling. I haven't checked that
part as carefully as I have the other parts, and it's the one part I haven't
modeled in my simulator.)

    You schedule GC assuming that GC will consume only 25% of CPU. It can
    easily happen that GC consumes more because of idle time. Then you
    will consistently schedule GC too early.

    If GC consumes more CPU because of idle time (and does so consistently for a
    few GC cycles), then the trigger controller will adapt to this and move the
    trigger higher. It's true that if the mutator goes from very idle to very
    busy, then the trigger may be too high, so the next few cycles may miss
    their CPU utilization goals (assuming the mutator also ramps up its
    allocation rate), but as in other cases, the controller will pull the
    trigger lower and adapt after a few GC cycles.

    BTW, I added idle time modeling to my simulator and confirmed that the
    formulas do the right thing, given that u_a excludes idle time background
    GC. (If u_a were to include idle time background GC, then the error term in
    the trigger controller would penalize background GC for using idle time and
    pull the trigger lower, just like it would if mutator assists were
    overloading the CPU. With mutator assists, this has the effect of spreading
    their utilization over a longer period in the next cycle. But, idle time
    background GC is different. Assuming the next cycle has the same idle time,
    the background GC will continue to make use of the idle time and now it will
    just finish early. So it's important that u_a *not* include idle time
    background GC.)
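For reference, the trigger controller under discussion is, as I read the proposal, a simple proportional controller. A sketch, where the gain Kp is my own illustrative choice rather than a value from the document:

```go
package main

import "fmt"

// nextTrigger sketches a proportional trigger controller in the spirit
// of the proposal: the trigger heap-growth ratio h_T is adjusted by an
// error term that is zero when the cycle ends exactly at the heap goal
// h_g with GC CPU utilization exactly at the goal u_g. ua is the actual
// utilization for the cycle (excluding idle-time background GC, per the
// discussion above) and ha the actual heap growth at completion.
func nextTrigger(hT, hg, ua, ha float64) float64 {
	const (
		ug = 0.25 // GC CPU utilization goal
		Kp = 0.5  // proportional gain (illustrative, not from the proposal)
	)
	e := hg - hT - (ua/ug)*(ha-hT)
	return hT + Kp*e
}

func main() {
	// Steady state: utilization and heap growth both hit their goals,
	// so the trigger stays put.
	fmt.Println(nextTrigger(0.8, 1.0, 0.25, 1.0)) // prints 0.8
	// Assists overloaded the CPU (u_a > u_g): the error goes negative
	// and the trigger is pulled lower, starting the next cycle earlier.
	fmt.Println(nextTrigger(0.8, 1.0, 0.40, 1.0))
}
```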

    I am probably thinking in different terms.
    Do I understand it correctly that:
    - we start GOMAXPROCS/4 background GC threads, this gives us u_g of at
    least 25%, so u_g cannot be lower than 25% (GOMAXPROCS%4 != 0 aside)
    - mutator assists add to u_g, mutator assists are always above the cap of 25%
    - idle time GC is not included in u_g
    ?

  • Dmitry Vyukov at Mar 11, 2015 at 5:16 pm

    On Tue, Mar 10, 2015 at 6:43 PM, 'Austin Clements' via golang-dev wrote:
    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes quickly and
    on time without sacrificing concurrency. Here's my proposal for the garbage
    collector "pacing" algorithm I'm planning to implement to address this in Go
    1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

    The proposal puts strong mathematical foundation under the GC, but the
    problem is that it operates with lots of unknown values (or at least I
    don't see how to measure them). For example:
    - we don't know the alloc rate during the coming GC cycle
    - we don't know the actual scanning speed and heap configuration
    during the coming GC cycle
    - we don't know live heap size during the coming GC cycle
    - we don't know how much idle time we will have during the coming GC cycle
    - we don't know how OS will schedule threads during the coming GC cycle
    - we can't precisely measure CPU time consumed by GC (how?)
And each of these values seems to be critical for the math.
So I suspect that this algorithm can introduce overhead and
complexity into the runtime and at the same time fail to do the right
thing.
    What is hot-path overhead of the proposed scheme? As far as I see it
    adds work to both scanner and mutator.
    It will work in steady state, but there are simpler ways to schedule
    GC in steady state.

    I would consider the following dumber approach.
    Start GOMAXPROCS/4 background scanning threads.
Idle procs help GC. Procs don't ensure that they have drained the whole
system; instead they check the local workqueue, global workqueue and poller;
if all of those are empty and GC is in progress, they help GC before
resorting to stealing.
If we finish GC significantly earlier than the target heap size, we start
the next GC a bit later.
If we hit the target heap size during GC, we start the next GC a bit earlier.
    To avoid over-allocation mutators start helping GC when heap_alloc
    approaches next_gc. The input data for the decision: live heap after
    the previous GC, current heap size, target heap size, size of the
    scanned objects during the current GC. We don't know the critical
    piece here: current live heap size. Without it we can aim only for:
    (1) no assisting in steady state and (2) no over-allocation. These
    goals should be easy to meet. Assisting happens in coarse-grained
pieces only when mallocgc allocates from the heap (we don't have the
critical information to make the right decision in a non-steady state
anyway).
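The feedback rule in those first two adjustments could be as simple as the following sketch, where the 90% headroom threshold and 5-point step are mine, chosen only for illustration:

```go
package main

import "fmt"

// adjustTrigger nudges the GC trigger point, expressed as a percentage
// of the target heap size, based on where the previous cycle finished
// marking.
func adjustTrigger(trigger, finishedAt, target int) int {
	const step = 5
	switch {
	case finishedAt >= target:
		return trigger - step // hit the target while marking: start earlier
	case finishedAt*10 < target*9:
		return trigger + step // finished with lots of headroom: start later
	}
	return trigger // close enough: leave it alone
}

func main() {
	fmt.Println(adjustTrigger(70, 100, 100)) // hit target: prints 65
	fmt.Println(adjustTrigger(70, 80, 100))  // big headroom: prints 75
	fmt.Println(adjustTrigger(70, 95, 100))  // close enough: prints 70
}
```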

  • Austin Clements at Mar 11, 2015 at 7:46 pm

    On Wed, Mar 11, 2015 at 1:16 PM, Dmitry Vyukov wrote:

    On Tue, Mar 10, 2015 at 6:43 PM, 'Austin Clements' via golang-dev
    wrote:
    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes quickly and
    on time without sacrificing concurrency. Here's my proposal for the garbage
    collector "pacing" algorithm I'm planning to implement to address this in Go
    1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

    The proposal puts strong mathematical foundation under the GC, but the
    problem is that it operates with lots of unknown values (or at least I
    don't see how to measure them). For example:
    - we don't know the alloc rate during the coming GC cycle
    - we don't know the actual scanning speed and heap configuration
    during the coming GC cycle
    - we don't know live heap size during the coming GC cycle
    - we don't know how much idle time we will have during the coming GC cycle
    - we don't know how OS will schedule threads during the coming GC cycle
    - we can't precisely measure CPU time consumed by GC (how?)
    And each of these values seem to be critical for the math.
    Many of these are handled by the trigger controller. If we're in steady
    state, it will have figured them out (or, rather, figured out their
    effects). If the mutator changes phases, the controller will find the new
    optimum within a few GC cycles. If the mutator is continuously changing its
    behavior, then any scheduling algorithm is going to have a hard time
    figuring out what to do.

    So I suspect that this algorithm can introduce runtime overheads and
    complexity into runtime and at the same time fail to do the right
    thing.
    What is hot-path overhead of the proposed scheme? As far as I see it
    adds work to both scanner and mutator.
    I believe it will be extremely small. All of the computations are simple.
    Updating the state variables only needs to be done once per GC cycle, so
    these computations can be effectively ignored. Things like tracking scan
    work will be done per P and aggregated at the end of a cycle. Assist work
    will be done in batches (probably when we fetch a new span) so the overhead
    of going into the assist path will be amortized. Deciding when to schedule
    background GC will be done on the basis of periodically aggregated per-P
    counters.

    The win of a well-scheduled garbage collector, on the other hand, will be
    significant. For example, our current heuristic is to start the garbage
    collector at 7/8 of GOGC (not because we thought this was a good idea, just
    because we needed something until we had a real answer). A consequence of
    this is that speeding up the garbage collector often paradoxically slows
    down the overall application because we perform more GC cycles!

    It will work in steady state, but there are simpler ways to schedule
    GC in steady state.

    I would consider the following dumber approach.
    Start GOMAXPROCS/4 background scanning threads.
    Idle procs help GC. Procs don't ensure that they drained the whole
    system, instead check local workqueue, global workqueue and poller; if
    all that is empty and GC is in progress they help GC before resorting
    to stealing.
    If we finish GC significantly earlier target heap size, we start next
    GC a bit later.
    If we hit target heap size during GC, we start next GC a bit earlier.
    To avoid over-allocation mutators start helping GC when heap_alloc
    approaches next_gc. The input data for the decision: live heap after
    the previous GC, current heap size, target heap size, size of the
    scanned objects during the current GC. We don't know the critical
    piece here: current live heap size. Without it we can aim only for:
    (1) no assisting in steady state and (2) no over-allocation. These
    goals should be easy to meet. Assisting happens in coarse-grained
    pieces only when mallocgc allocates from heap (we don't have the
    critical information to do the right decision in non-steady state
    anyway).
    At this high level, this doesn't seem significantly different from what I'm
    proposing. One difference is that you're proposing assists only kick in if
    we're nearing the heap goal, but that poses the danger of significantly
    overloading the CPU toward the end of a cycle when there's little time left
    to correct for this overload.

    One thing you may or may not have intended in your proposal (I may just be
    filling in details in my own way here) is the idea that how much work
    assists have to do is based on how much work has happened in the current GC
    cycle. This is possible because, in your proposal, assists don't kick in
until well into the cycle, so this estimate would be pretty accurate. In
    contrast, in my current proposal, this estimate is based solely on past
    cycles. I like the idea of using more information from the current cycle.
    In the terms from my proposal, I think this could easily be incorporated by
    revising the scan work ratio w (and the things derived from it) as the
    cycle progresses based on the scan work ratio observed so far in this
cycle. I think this would give the stronger heap size guarantees that you're
going for as it converges towards the "truth" for this cycle, while also
    smoothing out the CPU utilization.
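One way to fold in current-cycle information, with the caveat that this particular blending rule is my own guess rather than anything in the proposal:

```go
package main

import "fmt"

// revisedRatio blends last cycle's scan work ratio with the ratio
// observed so far this cycle, weighted by how far through the cycle we
// are. Early on we trust history; near the end we trust this cycle's
// own measurement.
func revisedRatio(wPast, wObserved, progress float64) float64 {
	// progress in [0,1]: fraction of the cycle's expected allocation done
	return (1-progress)*wPast + progress*wObserved
}

func main() {
	// Early in the cycle: stays close to the historical estimate.
	fmt.Println(revisedRatio(0.5, 0.7, 0.1))
	// Late in the cycle: dominated by what we actually observed.
	fmt.Println(revisedRatio(0.5, 0.7, 0.9))
}
```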

  • Dmitry Vyukov at Mar 12, 2015 at 8:38 am

    On Wed, Mar 11, 2015 at 10:46 PM, Austin Clements wrote:
    On Wed, Mar 11, 2015 at 1:16 PM, Dmitry Vyukov wrote:

    On Tue, Mar 10, 2015 at 6:43 PM, 'Austin Clements' via golang-dev
    wrote:
    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes quickly
    and
    on time without sacrificing concurrency. Here's my proposal for the
    garbage
    collector "pacing" algorithm I'm planning to implement to address this
    in Go
    1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

    The proposal puts strong mathematical foundation under the GC, but the
    problem is that it operates with lots of unknown values (or at least I
    don't see how to measure them). For example:
    - we don't know the alloc rate during the coming GC cycle
    - we don't know the actual scanning speed and heap configuration
    during the coming GC cycle
    - we don't know live heap size during the coming GC cycle
    - we don't know how much idle time we will have during the coming GC cycle
    - we don't know how OS will schedule threads during the coming GC cycle
    - we can't precisely measure CPU time consumed by GC (how?)
    And each of these values seem to be critical for the math.

    Many of these are handled by the trigger controller.
    If we're in steady
    state, it will have figured them out (or, rather, figured out their
    effects). If the mutator changes phases, the controller will find the new
    optimum within a few GC cycles. If the mutator is continuously changing its
    behavior, then any scheduling algorithm is going to have a hard time
    figuring out what to do.
    Right.
And so my proposal below is to rely more on the trigger controller
rather than on complex computations based on unknown values.
If we are in a steady state, the trigger controller will do its thing
eventually and we don't need anything else; if we are not in a steady
state, computations based on unknown values have little value, and we
can just as well wait for the trigger controller to autotune.


    So I suspect that this algorithm can introduce runtime overheads and
    complexity into runtime and at the same time fail to do the right
    thing.
    What is hot-path overhead of the proposed scheme? As far as I see it
    adds work to both scanner and mutator.

    I believe it will be extremely small. All of the computations are simple.
    Updating the state variables only needs to be done once per GC cycle, so
    these computations can be effectively ignored. Things like tracking scan
    work will be done per P and aggregated at the end of a cycle. Assist work
    will be done in batches (probably when we fetch a new span) so the overhead
    of going into the assist path will be amortized. Deciding when to schedule
    background GC will be done on the basis of periodically aggregated per-P
    counters.

    ack
    The win of a well-scheduled garbage collector, on the other hand, will be
    significant. For example, our current heuristic is to start the garbage
    collector at 7/8 of GOGC (not because we thought this was a good idea, just
    because we needed something until we had a real answer). A consequence of
    this is that speeding up the garbage collector often paradoxically slows
    down the overall application because we perform more GC cycles!
    It will work in steady state, but there are simpler ways to schedule
    GC in steady state.

    I would consider the following dumber approach.
    Start GOMAXPROCS/4 background scanning threads.
    Idle procs help GC. Procs don't ensure that they drained the whole
    system, instead check local workqueue, global workqueue and poller; if
    all that is empty and GC is in progress they help GC before resorting
    to stealing.
    If we finish GC significantly earlier target heap size, we start next
    GC a bit later.
    If we hit target heap size during GC, we start next GC a bit earlier.
    To avoid over-allocation mutators start helping GC when heap_alloc
    approaches next_gc. The input data for the decision: live heap after
    the previous GC, current heap size, target heap size, size of the
    scanned objects during the current GC. We don't know the critical
    piece here: current live heap size. Without it we can aim only for:
    (1) no assisting in steady state and (2) no over-allocation. These
    goals should be easy to meet. Assisting happens in coarse-grained
    pieces only when mallocgc allocates from heap (we don't have the
    critical information to do the right decision in non-steady state
    anyway).

    At this high level, this doesn't seem significantly different from what I'm
    proposing.
Yes, but I propose to rely more on the trigger controller and not do
computations of work, utilization and credits/debits. See above why.

    One difference is that you're proposing assists only kick in if
    we're nearing the heap goal, but that poses the danger of significantly
    overloading the CPU toward the end of a cycle when there's little time left
    to correct for this overload.
    Yes, and it is the only option that makes sense.
If you had enough information at the beginning of GC to figure out that
you need to start assists earlier, why didn't you just schedule GC
earlier? We already did our best, using all the information we have, to
    trigger GC at such a point that assists are not required at all. Now,
    even if we look at how GC progresses, we still don't know whether we
    need to add assists or not (because we don't know actual live heap
    size). So the only point when assists make sense is very late during
    GC.

    So what I am thinking about is:
    - tune trigger point based on feedback aiming at finishing marking at
    0.9 of target heap size
    - enable assists when we cross 0.9 point
    - stop the world when we cross 1.0 point

The exact numbers are, of course, subject to tuning. For example, it
    may be a good idea to aim for modest assisting at the end of marking
    (just to slow down mutators and being able to set expected marking
    completion closer to 1.0). Also assists can be more aggressive as we
    approach 1.0.
Another simpler and useful lever that we have is the number of GC threads.
For example, when we cross the 0.95 point we can dedicate GOMAXPROCS/2
threads to GC. This both speeds up marking and slows down mutators in a
simple, coarse-grained, cache-friendly and _latency_-friendly way.
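Putting the thresholds from the last two paragraphs together as a table (the response labels are mine; heapFrac is the current heap as a fraction of the target size):

```go
package main

import "fmt"

// gcResponse maps how close the heap is to the target size onto the
// staged responses sketched above: background marking only, then
// mutator assists at 0.9, extra dedicated GC threads at 0.95, and a
// stop-the-world as the hard backstop at 1.0.
func gcResponse(heapFrac float64) string {
	switch {
	case heapFrac >= 1.0:
		return "stop the world"
	case heapFrac >= 0.95:
		return "dedicate GOMAXPROCS/2 threads to GC"
	case heapFrac >= 0.9:
		return "enable mutator assists"
	default:
		return "background marking only"
	}
}

func main() {
	for _, f := range []float64{0.5, 0.92, 0.97, 1.0} {
		fmt.Printf("%.2f: %s\n", f, gcResponse(f))
	}
}
```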

    One thing you may or may not have intended in your proposal (I may just be
    filling in details in my own way here) is the idea that how much work
    assists have to do is based on how much work has happened in the current GC
    cycle. This is possible because, in your proposal, assists don't kick in
    until well in to the cycle, so this estimate would be pretty accurate. In
    contrast, in my current proposal, this estimate is based solely on past
    cycles. I like the idea of using more information from the current cycle. In
    the terms from my proposal, I think this could easily be incorporated by
    revising the scan work ratio w (and the things derived from it) as the cycle
    progresses based on the scan work ratio observed so far in this cycle. I
    think this would give stronger heap size guarantees that you're going for as
    it converged towards the "truth" for this cycle, while also smoothing out
    the CPU utilization.
    Difficult to say. This again boils down to computations based on
assumptions that are not necessarily true. For example, heap scanning
speed can be radically different for large []*int and []byte. So we
can observe that we've scanned only 0.25 of the planned memory in 0.75
of the time, and decide that we need to start assisting like crazy. Or we
    can observe that we've scanned 0.9 in 0.8 time, so we are ahead of
    schedule. Both of these decisions can be incorrect.

I would implement the mandatory part first (like the trigger point
controller), then see what problems happen in real programs, and then
investigate mechanisms to address them.

  • Rlh at Mar 12, 2015 at 1:50 pm
When the GC cycle is triggered is based on our best estimate of the GC
threads being able to keep ahead of the mutators. Mutator assists are
required when that estimate is wrong, to help (softly) bound any overrun.
The earlier in the cycle we provide mutator assists, the less disruptive
the assists will be. Delaying the assists will adversely affect our MMU/MUT
latency numbers, since it will push the assists into the last portion of
the cycle, when we also have to do the STW mark termination phase, which
dominates the MMU/MUT numbers.

As to your last point about the time spent scanning []byte vs []*int, the
assist credits are in terms of pointers, not bytes.



  • Dmitry Vyukov at Mar 12, 2015 at 2:14 pm

    On Thu, Mar 12, 2015 at 4:50 PM, wrote:
    When the GC cycle is triggered is our best estimate of the GC threads
    being able to keep ahead of the mutators. The mutator assist is required
    when that estimate is wrong, and helps (softly) bound any overrun.
    The earlier in the cycle we provide mutator assist, the less disruptive
    the assist will be. Delaying the assist will adversely affect our
    MMU/MUT latency numbers, since it will push the assists into the last
    portion of the cycle, when we also have to do the STW mark termination
    phase, which dominates the MMU/MUT numbers.
    But we don't know when to start assists, or whether we need them at all.
    If our best estimate is correct, then we don't need assists at all.
    If it is wrong, then everything we know about work/idle procs/live
    heap/etc is also wrong. One guess would be to always start assists
    instantly, but this introduces unnecessary latency. Another would be to
    always start assists very late (when we are sure that our estimates are
    wrong, which is what I propose). We can't make a well-founded decision
    to do something in between.


    As to your last point about the time spent scanning []byte vs []*int,
    the assist credits are in terms of pointers and not bytes.
    This does introduce overhead onto the scanning hot path.
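    For reference, the hot-path cost under discussion can be held to a local
    counter increment if credits are flushed to a shared pool in batches. A
    hypothetical sketch (the type, batch size, and flush rule are invented
    here for illustration, not the runtime's implementation):

```go
package main

import "fmt"

// flushEvery is the batch size: how many pointer slots a P scans before
// publishing its credit. Chosen arbitrarily for illustration.
const flushEvery = 1024

// pCredit is per-P assist-credit accounting. The scanner bumps a local
// counter per pointer slot scanned and only touches shared state every
// flushEvery slots, keeping the common case to one increment.
type pCredit struct {
	local  int64
	global *int64
}

func (c *pCredit) scanPointer() {
	c.local++
	if c.local >= flushEvery {
		*c.global += c.local // the runtime would use an atomic add here
		c.local = 0
	}
}

func main() {
	var pool int64
	c := pCredit{global: &pool}
	for i := 0; i < 2500; i++ {
		c.scanPointer()
	}
	fmt.Println(pool, c.local) // two flushes of 1024; 452 still local
}
```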

    On Thursday, March 12, 2015 at 4:37:48 AM UTC-4, Dmitry Vyukov wrote:

    On Wed, Mar 11, 2015 at 10:46 PM, Austin Clements <aus...@google.com>
    wrote:
    On Wed, Mar 11, 2015 at 1:16 PM, Dmitry Vyukov <dvy...@google.com>
    wrote:
    On Tue, Mar 10, 2015 at 6:43 PM, 'Austin Clements' via golang-dev
    wrote:
    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes quickly
    and on time without sacrificing concurrency. Here's my proposal for the
    garbage collector "pacing" algorithm I'm planning to implement to
    address this in Go 1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

    The proposal puts a strong mathematical foundation under the GC, but the
    problem is that it operates with lots of unknown values (or at least I
    don't see how to measure them). For example:
    - we don't know the alloc rate during the coming GC cycle
    - we don't know the actual scanning speed and heap configuration during
    the coming GC cycle
    - we don't know the live heap size during the coming GC cycle
    - we don't know how much idle time we will have during the coming GC
    cycle
    - we don't know how the OS will schedule threads during the coming GC
    cycle
    - we can't precisely measure the CPU time consumed by GC (how?)
    And each of these values seems to be critical for the math.

    Many of these are handled by the trigger controller. If we're in steady
    state, it will have figured them out (or, rather, figured out their
    effects). If the mutator changes phases, the controller will find the
    new optimum within a few GC cycles. If the mutator is continuously
    changing its behavior, then any scheduling algorithm is going to have a
    hard time figuring out what to do.
    Right.
    And so my proposal below is to rely more on the trigger controller,
    rather than on complex computations based on unknown values.
    If we are in steady state, the trigger controller will do its thing
    eventually and we don't need anything else; if we are not in steady
    state, computations based on unknown values have little value, and we
    might as well just wait for the trigger controller to autotune.


    So I suspect that this algorithm can introduce runtime overheads and
    complexity into the runtime and at the same time fail to do the right
    thing.
    What is the hot-path overhead of the proposed scheme? As far as I can
    see, it adds work to both the scanner and the mutator.

    I believe it will be extremely small. All of the computations are
    simple. Updating the state variables only needs to be done once per GC
    cycle, so these computations can be effectively ignored. Things like
    tracking scan work will be done per P and aggregated at the end of a
    cycle. Assist work will be done in batches (probably when we fetch a new
    span) so the overhead of going into the assist path will be amortized.
    Deciding when to schedule background GC will be done on the basis of
    periodically aggregated per-P counters.
    ack
    The win of a well-scheduled garbage collector, on the other hand, will
    be significant. For example, our current heuristic is to start the
    garbage collector at 7/8 of GOGC (not because we thought this was a good
    idea, just because we needed something until we had a real answer). A
    consequence of this is that speeding up the garbage collector often
    paradoxically slows down the overall application because we perform more
    GC cycles!
    It will work in steady state, but there are simpler ways to schedule
    GC in steady state.

    I would consider the following dumber approach.
    Start GOMAXPROCS/4 background scanning threads.
    Idle procs help GC. Procs don't ensure that they have drained the whole
    system; instead they check the local workqueue, the global workqueue and
    the poller, and if all of those are empty and GC is in progress they
    help GC before resorting to stealing.
    If we finish GC significantly earlier than the target heap size, we
    start the next GC a bit later.
    If we hit the target heap size during GC, we start the next GC a bit
    earlier.
    To avoid over-allocation, mutators start helping GC when heap_alloc
    approaches next_gc. The input data for the decision: live heap after
    the previous GC, current heap size, target heap size, and the size of
    the objects scanned during the current GC. We don't know the critical
    piece here: the current live heap size. Without it we can aim only for:
    (1) no assisting in steady state and (2) no over-allocation. These
    goals should be easy to meet. Assisting happens in coarse-grained
    pieces only when mallocgc allocates from the heap (we don't have the
    critical information to make the right decision in the non-steady state
    anyway).
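    The feedback rule above (finish well early: trigger later; hit the
    target during GC: trigger earlier) amounts to a tiny step controller. A
    sketch with invented names and constants, purely for illustration:

```go
package main

import "fmt"

// adjustTrigger nudges the heap fraction at which the next GC cycle
// starts, based on how the previous cycle ended. finishedAt is the heap
// size at mark completion as a fraction of the target heap size. The
// step size and clamp range are illustrative; a real controller would
// be tuned.
func adjustTrigger(trigger, finishedAt float64) float64 {
	const step = 0.05
	switch {
	case finishedAt >= 1.0: // hit the target during GC: start earlier
		trigger -= step
	case finishedAt < 0.9: // finished well early: start later
		trigger += step
	}
	// Clamp to a sane range.
	if trigger < 0.5 {
		trigger = 0.5
	}
	if trigger > 0.95 {
		trigger = 0.95
	}
	return trigger
}

func main() {
	t := 0.75
	t = adjustTrigger(t, 1.0) // overran the target: trigger moves earlier
	fmt.Println(t)
	t = adjustTrigger(t, 0.8) // finished early: trigger drifts back later
	fmt.Println(t)
}
```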

    At this high level, this doesn't seem significantly different from what
    I'm proposing.
    Yes, but I propose to rely more on the trigger controller and not do
    computations of work, utilization and credits/debits. See above for why.

    One difference is that you're proposing assists only kick in if we're
    nearing the heap goal, but that poses the danger of significantly
    overloading the CPU toward the end of a cycle when there's little time
    left to correct for this overload.
    Yes, and it is the only option that makes sense.
    If you had enough information at the beginning of GC to figure out that
    you need to start assists earlier, why didn't you just schedule GC
    earlier? We already did our best, using all the information we have, to
    trigger GC at such a point that assists are not required at all. Now,
    even if we look at how GC progresses, we still don't know whether we
    need to add assists or not (because we don't know the actual live heap
    size). So the only point when assists make sense is very late during
    GC.

    So what I am thinking about is:
    - tune the trigger point based on feedback, aiming at finishing marking
    at 0.9 of the target heap size
    - enable assists when we cross the 0.9 point
    - stop the world when we cross the 1.0 point

    The exact numbers are, of course, subject to tuning. For example, it
    may be a good idea to aim for modest assisting at the end of marking
    (just to slow down mutators and be able to set the expected marking
    completion closer to 1.0). Also, assists can be more aggressive as we
    approach 1.0.
    Another simple and useful lever that we have is the number of GC
    threads. For example, when we cross the 0.95 point we can dedicate
    GOMAXPROCS/2 threads to GC. This both speeds up marking and slows down
    mutators in a simple, coarse-grained, cache-friendly and
    _latency_-friendly way.
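    These thresholds can be read as a single decision function over heap
    progress. A sketch with made-up names, with the 0.9/0.95/1.0 points
    hard-coded purely for illustration (as noted above, they would be
    tuned):

```go
package main

import "fmt"

// pacingAction says what the runtime should be doing given how far the
// heap has grown toward the target: assists past 0.9, extra dedicated GC
// threads past 0.95, stop-the-world at 1.0. An illustrative sketch, not
// the Go runtime's actual logic.
func pacingAction(heapFrac float64, gomaxprocs int) (assist bool, gcThreads int, stopTheWorld bool) {
	gcThreads = gomaxprocs / 4 // baseline background scanners
	if gcThreads < 1 {
		gcThreads = 1
	}
	switch {
	case heapFrac >= 1.0:
		return true, gomaxprocs, true
	case heapFrac >= 0.95:
		return true, gomaxprocs / 2, false
	case heapFrac >= 0.9:
		return true, gcThreads, false
	default:
		return false, gcThreads, false
	}
}

func main() {
	assist, threads, stw := pacingAction(0.92, 8)
	fmt.Println(assist, threads, stw) // true 2 false
	assist, threads, stw = pacingAction(0.97, 8)
	fmt.Println(assist, threads, stw) // true 4 false
}
```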

    One thing you may or may not have intended in your proposal (I may just
    be filling in details in my own way here) is the idea that how much work
    assists have to do is based on how much work has happened in the current
    GC cycle. This is possible because, in your proposal, assists don't kick
    in until well into the cycle, so this estimate would be pretty accurate.
    In contrast, in my current proposal, this estimate is based solely on
    past cycles. I like the idea of using more information from the current
    cycle. In the terms from my proposal, I think this could easily be
    incorporated by revising the scan work ratio w (and the things derived
    from it) as the cycle progresses, based on the scan work ratio observed
    so far in this cycle. I think this would give the stronger heap size
    guarantees that you're going for, as it converges towards the "truth"
    for this cycle, while also smoothing out the CPU utilization.
    Difficult to say. This again boils down to computations based on
    assumptions that are not necessarily true. For example, heap scanning
    speed can be radically different for a large []*int and a []byte. So we
    could observe that we've scanned only 0.25 of the planned memory in 0.75
    of the time, and decide that we need to start assisting like crazy. Or we
    could observe that we've scanned 0.9 in 0.8 of the time, and conclude
    that we are ahead of schedule. Both of these decisions can be incorrect.

    I would implement the mandatory part first (like the trigger point
    controller), then see what problems come up on real programs, and then
    investigate mechanisms to address them.
  • Rick Hudson at Mar 12, 2015 at 2:53 pm
    We are repeating ourselves. The basic framework in the paper is complete
    and well enough articulated to implement, which is what I propose we do.
    If we see use cases (and we will) where our estimates are wrong, we can
    adjust them within the framework. If we can't, then the framework
    presented will need to be adjusted.
  • Dmitry Vyukov at Mar 12, 2015 at 3:13 pm
    The proposal lacks information on why this approach is better than
    simpler approaches. You seem to have done some modelling. Can you share
    the results of the modelling on how the framework reacts to: (1)
    different amounts of idle time available, (2) different sizes of live
    heap, (3) different heap structure/uneven scanning speed, (4) different
    allocation rates?
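    A toy discrete-time model of one mark cycle is enough to exercise at
    least axes (2) and (4) of this question. Everything below (the names,
    the units, and the stop-the-world-on-overrun rule) is invented here for
    illustration, not taken from the proposal:

```go
package main

import "fmt"

// simulateCycle is a toy model of one concurrent mark cycle: the mutator
// allocates allocRate bytes per tick, background GC scans scanRate bytes
// per tick, and marking completes once liveHeap bytes have been scanned.
// If the heap reaches targetHeap first, we model a stop-the-world. It
// returns the heap size at the end of the cycle as a fraction of the
// target. Purely illustrative numbers and names.
func simulateCycle(startHeap, targetHeap, liveHeap, allocRate, scanRate float64) float64 {
	heap, scanned := startHeap, 0.0
	for scanned < liveHeap {
		heap += allocRate
		scanned += scanRate
		if heap >= targetHeap { // would overrun: stop the world
			break
		}
	}
	return heap / targetHeap
}

func main() {
	// GC keeps up: marking finishes well before the target heap size.
	fmt.Println(simulateCycle(70, 100, 50, 1, 5))
	// High allocation rate: the heap hits the target before marking is
	// done, forcing a stop-the-world.
	fmt.Println(simulateCycle(70, 100, 50, 5, 2))
}
```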
  • Russ Cox at Mar 13, 2015 at 1:58 pm

    On Thu, Mar 12, 2015 at 11:13 AM, 'Dmitry Vyukov' via golang-dev wrote:

    The proposal lacks information on why this approach is better than
    simpler approaches. You seem to do some modelling. Can you share
    results of the modelling on how the framework reacts on: (1) different
    amount of idle time available, (2) different size of live heap, (3)
    different heap structure/uneven scanning speed, (4) different
    allocation rate?
    Austin has proposed something that seems completely reasonable (and much
    more principled than just about anything else in the runtime). I don't
    believe he needs to justify everything he _didn't_ choose to do in order
    to move forward, implement, see how it works on real workloads, and iterate
    from there.

    Russ

  • 2paul De at Mar 17, 2015 at 9:12 am

    On Friday, March 13, 2015 at 2:58:19 PM UTC+1, rsc wrote:
    On Thu, Mar 12, 2015 at 11:13 AM, 'Dmitry Vyukov' via golang-dev <
    golan...@googlegroups.com> wrote:
    The proposal lacks information on why this approach is better than
    simpler approaches. You seem to do some modelling. Can you share
    results of the modelling on how the framework reacts on: (1) different
    amount of idle time available, (2) different size of live heap, (3)
    different heap structure/uneven scanning speed, (4) different
    allocation rate?
    Austin has proposed something that seems completely reasonable (and much
    more principled than just about anything else in the runtime). I don't
    believe he needs to justify everything he _didn't_ choose to do in order
    to move forward, implement, see how it works on real workloads, and iterate
    from there.

    Russ
    "See how it works on real workloads, and iterate from there" ... that's
    what I was thinking too. Maybe translated that means: monitor, profile,
    optimize. I also think there has to be a bias in the math depending on
    application type (realtime, batch, medium latency, etc.).

    Overall, fantastic work everybody. As always.

  • Austin Clements at Sep 3, 2015 at 11:09 pm
    Now that 1.5 is out I've gone back and added an appendix to the original GC
    pacer design proposal documenting what changed between the proposal and the
    final implementation. The overall feedback system didn't change between the
    proposal and the final implementation, but many of the details were revised
    in response to experimentation with the real implementation on real
    workloads.
    On Tue, Mar 10, 2015 at 11:43 AM, Austin Clements wrote:

    Go 1.5's concurrent garbage collector poses new challenges for how to
    schedule garbage collection work so that the collector finishes quickly and
    on time without sacrificing concurrency. Here's my proposal for the garbage
    collector "pacing" algorithm I'm planning to implement to address this in
    Go 1.5: https://golang.org/s/go15gcpacing

    Comments welcome (please comment in this thread).

Discussion Overview
groupgolang-dev @
categoriesgo
postedMar 10, '15 at 3:43p
activeSep 3, '15 at 11:09p
posts30
users7
websitegolang.org