So after committing the latest round of parameterized-plan hacking,
I was dismayed to see the buildfarm breaking out in pink, thanks to
some of the members producing a different plan than expected for one
test query. I eventually managed to reproduce that (on the fourth
machine I tried locally), and after some quality time with gdb
I understand what is happening, to wit: the two alternative plans have
exactly the same cost so far as our cost model is concerned. On my
main development machine, the two plans look like this to add_path:

$13 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08,
param_info = 0x40194458, rows = 5, startup_cost = 0,
total_cost = 47.628499999999988, pathkeys = 0x0}

$16 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08,
param_info = 0x40194458, rows = 5, startup_cost = 0,
total_cost = 47.628499999999981, pathkeys = 0x0}

so it picks the second one on the basis that its total_cost is better at
the sixteenth decimal place. On the other machine, the same two paths
look like this:

$12 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0,
param_info = 0x895c198, rows = 5, startup_cost = 0,
total_cost = 47.578499999999977, pathkeys = 0x0}

Breakpoint 2, add_path (parent_rel=0x895b9e0, new_path=0x895c208)
$15 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0,
param_info = 0x895c198, rows = 5, startup_cost = 0,
total_cost = 47.578499999999977, pathkeys = 0x0}

and add_path is coded to arbitrarily keep the first one when two
paths are exactly the same on all its preference measures.

Now, the fact that the two machines get different costs at the third
decimal place isn't very interesting here; that's a pretty standard
cross-platform difference arising from different MAXALIGN values.
The important point is that the total_costs of the two paths are
exactly the same on one machine, and on the other one different only
by a microscopic amount that probably arises from a slightly different
floating-point calculation sequence chosen by a different compiler.

So, as our code stands, we're inevitably going to have very platform-
and compiler-specific decisions as to which plan to prefer. I'm a bit
surprised that it's taken us this long to trip over this type of
situation, because it's surely not specific to parameterized paths.

We could deal with this either by giving up on showing the selected
plan in the regression test, or by creating multiple "expected" files,
but neither of those alternatives is very appealing.

The idea that I'm toying with is to try to make the choice a bit less
platform-specific, by removing the exact cost test that add_path uses
as its last-ditch comparison step, essentially this:

  /*
   * Same pathkeys and outer rels, and fuzzily
   * the same cost, so keep just one; to decide
   * which, first check rows and then do an
   * exact cost comparison.
   */
  if (new_path->rows < old_path->rows)
      remove_old = true;  /* new dominates old */
- else if (new_path->rows > old_path->rows)
-     accept_new = false; /* old dominates new */
- else if (compare_path_costs(new_path, old_path,
-                             TOTAL_COST) < 0)
-     remove_old = true;  /* new dominates old */
  else
      accept_new = false; /* old equals or dominates new */

This change would mean that, when two paths have the same pathkeys,
parameterization, and rowcount, and fuzzily the same cost, that we
arbitrarily keep the first-submitted one rather than looking at low
order digits of the costs. Since the order in which different paths
are generated is a property of our code and not platform-specific,
this should eliminate platform dependencies in cases where two paths
are essentially identical to the cost model.

A variant idea would be to replace the exact cost comparison with a
second round of fuzzy cost comparison, but with a much tighter fuzz
factor, maybe 1e-6 instead of 0.01.
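
As a rough standalone illustration (this is not the actual add_path code;
the helper and the hard-wired 1e-6 factor exist only for the sketch), here
is how such a tighter fuzzy test would see the two total_costs from the gdb
output above:

#include <stdbool.h>
#include <stdio.h>

#define TIGHT_FUZZ 1.0e-6       /* hypothetical second-round fuzz factor */

/* "new" displaces "old" only if it is cheaper by more than TIGHT_FUZZ */
static bool
new_fuzzily_cheaper(double new_total, double old_total)
{
    return new_total < old_total * (1.0 - TIGHT_FUZZ);
}

int
main(void)
{
    /* the two totals from the first machine above */
    double a = 47.628499999999988;
    double b = 47.628499999999981;

    /* neither path wins, so add_path would keep the first-submitted one */
    printf("%d %d\n", new_fuzzily_cheaper(b, a), new_fuzzily_cheaper(a, b));
    return 0;
}

Under the exact comparison the second path wins on its trailing digits;
under the tighter fuzzy test neither wins, and the tie is broken by
submission order instead.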

Now, neither of these fixes is perfect: what they would do is remove
platform-specific instability from where the costs are basically equal
and add some more in the range where the costs differ by almost exactly
the fuzz factor. But the behavior near that point is platform-specific
already, just not quite as much, and it's surely something we're
unlikely to trip over in the regression tests.

Thoughts, better ideas?

regards, tom lane

  • Jim Nasby at Apr 19, 2012 at 11:54 pm

    On 4/19/12 5:39 PM, Tom Lane wrote:
    Now, neither of these fixes is perfect: what they would do is remove
    platform-specific instability from where the costs are basically equal
    and add some more in the range where the costs differ by almost exactly
    the fuzz factor. But the behavior near that point is platform-specific
    already, just not quite as much, and it's surely something we're
    unlikely to trip over in the regression tests.
    I can't help but think of complaints we get from users regarding plan stability, even though this is a case of taking that to an extreme. Because this case is extreme (changing plans due to 1e-16 of difference) it's fairly easy to fix this specific case. In order to get 9.2 out the door maybe fixing just this case is the right thing to do. But ISTM this is just an example of a bigger problem.

    One of the complaints we've seen is that the planner will sometimes choose a plan that has a marginally lower cost (where marginally in this case is significantly more than 1e-16 ;) even though that plan will perform really poorly if the stats are off. I have wondered if that could be addressed by introducing the concept of an error range to each plan. My idea is that each node would predict how much the cost estimate would change if the stats were off by some amount. If two plans are close to the same cost, you would want to choose the plan that had the lower error range, trading off a small amount of theoretical performance for less risk of getting a horrible plan if the stats assumptions proved to be wrong.

    I believe that would fix this specific case because even though two plans might come out with a nearly identical cost it is unlikely that they would also have a nearly identical error range.
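
    To sketch that idea (nothing like this exists in PostgreSQL; the struct,
    pick_plan(), and all the numbers below are invented for illustration),
    a comparison that prefers the lower-risk plan when the costs are close
    might look roughly like this:

    #include <stdio.h>

    typedef struct HypotheticalPlan
    {
        const char *label;
        double      cost;           /* estimated total cost */
        double      error_range;    /* how far the cost could swing if stats are off */
    } HypotheticalPlan;

    /* prefer the lower-risk plan when the costs are within "closeness" */
    static const HypotheticalPlan *
    pick_plan(const HypotheticalPlan *a, const HypotheticalPlan *b,
              double closeness)
    {
        double hi = (a->cost > b->cost) ? a->cost : b->cost;
        double lo = (a->cost > b->cost) ? b->cost : a->cost;

        if (hi <= lo * (1.0 + closeness))
            return (a->error_range <= b->error_range) ? a : b;
        return (a->cost <= b->cost) ? a : b;   /* otherwise cheaper wins, as today */
    }

    int
    main(void)
    {
        HypotheticalPlan nestloop = {"nestloop", 100.0, 500.0}; /* cheap but fragile */
        HypotheticalPlan hashjoin = {"hashjoin", 103.0, 20.0};  /* dearer but robust */

        /* costs within 5% of each other, so the robust plan is chosen */
        printf("chosen: %s\n", pick_plan(&nestloop, &hashjoin, 0.05)->label);
        return 0;
    }
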
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Tom Lane at Apr 20, 2012 at 2:55 am

    Jim Nasby writes:
    [ add some error ranges to cost estimates ]
    I believe that would fix this specific case because even though two plans might come out with a nearly identical cost it is unlikely that they would also have a nearly identical error range.
    Actually, I think that *wouldn't* fix this specific case --- did you
    look at the details? The two formulations of the plan are really pretty
    nearly equivalent; you can do the two nestloops in either order and it's
    not clear it'll make much difference. I'm suspicious that the addition
    of parameterized planning might open up more scope for this type of
    thing, even though in principle it was always possible.

    regards, tom lane
  • Simon Riggs at Apr 20, 2012 at 7:53 am

    On Thu, Apr 19, 2012 at 11:39 PM, Tom Lane wrote:

    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.
    The fuzz factor is a better idea, IMHO. I would like to see that as a
    user-set parameter.

    Jim is right that plan stability is a wide problem, which could be
    addressed by setting a higher fuzz factor when plan instability is
    observed. Instability normally occurs for queries near a decision
    point in the cost models, which can fluctuate as the exact stats change.
    For those queries, it would be much better to address the instability
    directly via a fuzz factor than to attempt to fiddle with the enable_*
    parameters as is now common practice.

    Doc entry for the parameter...

    plan_choice_factor (float) - plans that vary in cost by less than this
    factor are considered equal by the planner, leading to a consistent
    choice of the first-derived plan, even across different types of
    hardware. This has a tendency to favour indexed plans when they are
    available. This is designed to address problems related to plan
    stability and may not produce desired results if used as a
    hinting/tweaking mechanism.
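
    As a toy standalone sketch of how such a factor could drive the planner's
    "treat as equal" test (plan_choice_factor is not an existing GUC, and this
    is not planner code):

    #include <stdbool.h>
    #include <stdio.h>

    static double plan_choice_factor = 1.01;  /* hypothetical: <1% differences are ties */

    /* cost1 displaces cost2 only if it is cheaper by more than the factor */
    static bool
    clearly_cheaper(double cost1, double cost2)
    {
        return cost1 * plan_choice_factor < cost2;
    }

    int
    main(void)
    {
        /* 126 vs 127 is a tie at the default factor; the first-derived plan stays */
        printf("%d\n", clearly_cheaper(126.0, 127.0));

        /* at a factor of 1.10, even a 120-vs-126 gap no longer changes the choice */
        plan_choice_factor = 1.10;
        printf("%d\n", clearly_cheaper(120.0, 126.0));
        return 0;
    }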

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Tom Lane at Apr 20, 2012 at 5:19 pm

    Simon Riggs writes:
    On Thu, Apr 19, 2012 at 11:39 PM, Tom Lane wrote:
    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.
    The fuzz factor is a better idea, IMHO. I would like to see that as a
    user-set parameter.
    I can see the case for exposing the fuzz factor for the existing round
    of fuzzy cost comparisons (in fact there's been a comment there for
    years questioning whether we should do so). Essentially the point of
    that would be to let the user trade off planning speed against
    resolution, because a larger fuzz factor would result in more plans
    getting discarded by add_path as being indistinguishably different in
    cost, so it would go faster but you'd probably get a plan that was only
    within so many percent of being the best available.

    (However, I'd like to see somebody do some experimentation to show
    that this actually behaves usefully before we invent Yet Another GUC.
    In particular it's not clear whether the error would accumulate
    unpleasantly over many levels of joining, if you tried to use a large
    fuzz factor.)
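
    As a back-of-the-envelope illustration of that worry (my own numbers, not
    anything measured): if every join level may keep a plan up to "fuzz" worse
    than the best one it saw, the worst case compounds multiplicatively:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double fuzz = 0.05;     /* a hypothetically generous user-set fuzz factor */
        int    levels;

        /* worst case: each level keeps a plan (1 + fuzz) times the local best */
        for (levels = 1; levels <= 8; levels++)
            printf("%d levels: up to %.0f%% worse than the cheapest plan\n",
                   levels, (pow(1.0 + fuzz, levels) - 1.0) * 100.0);
        return 0;
    }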

    If we have a second round of fuzzy comparisons, it would only be to
    dodge platform-specific roundoff differences, so I would think that
    locking its fuzz factor to 1e-6 or 1e-10 or so would be plenty good
    enough.
    Jim is right that plan stability is a wide problem, which could be
    addressed by setting a higher fuzz factor when plan instability is
    observed.
    User-visible plan instability (ie, instability on a given installation)
    usually arises from changes in the contents of pg_statistic. I am
    really dubious that changing the fuzz factor would help that in any
    useful way. It's possible I guess, but I'd want to see some
    experimental proof before we advertise it as being helpful for that.

    regards, tom lane
  • Albe Laurenz at Apr 20, 2012 at 8:29 am
    Tom Lane wrote:
    So after committing the latest round of parameterized-plan hacking,
    I was dismayed to see the buildfarm breaking out in pink, thanks to
    some of the members producing a different plan than expected for one
    test query. I eventually managed to reproduce that (on the fourth
    machine I tried locally), and after some quality time with gdb
    I understand what is happening, to wit: the two alternative plans have
    exactly the same cost so far as our cost model is concerned.
    The idea that I'm toying with is to try to make the choice a bit less
    platform-specific, by removing the exact cost test that add_path uses
    as its last-ditch comparison step, essentially this:

    /*
     * Same pathkeys and outer rels, and fuzzily
     * the same cost, so keep just one; to decide
     * which, first check rows and then do an
     * exact cost comparison.
     */
    if (new_path->rows < old_path->rows)
        remove_old = true;  /* new dominates old */
    - else if (new_path->rows > old_path->rows)
    -     accept_new = false; /* old dominates new */
    - else if (compare_path_costs(new_path, old_path,
    -                             TOTAL_COST) < 0)
    -     remove_old = true;  /* new dominates old */
    else
        accept_new = false; /* old equals or dominates new */
    This change would mean that, when two paths have the same pathkeys,
    parameterization, and rowcount, and fuzzily the same cost, that we
    arbitrarily keep the first-submitted one rather than looking at low
    order digits of the costs. Since the order in which different paths
    are generated is a property of our code and not platform-specific,
    this should eliminate platform dependencies in cases where two paths
    are essentially identical to the cost model.

    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.

    Now, neither of these fixes is perfect: what they would do is remove
    platform-specific instability from where the costs are basically equal
    and add some more in the range where the costs differ by almost exactly
    the fuzz factor. But the behavior near that point is platform-specific
    already, just not quite as much, and it's surely something we're
    unlikely to trip over in the regression tests.
    What if you remove the exact cost comparison, but leave the part where
    old dominates new based on rows?
    That should not add any new instabilities compared to the code as it is.
    Of course, if the row count estimates are subject to the same problem,
    it would solve nothing. Perhaps it is worth trying on the buildfarm.

    Yours,
    Laurenz Albe
  • Tom Lane at Apr 20, 2012 at 3:04 pm

    "Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
    What if you remove the exact cost comparison, but leave the part where
    old dominates new based on rows?
    Um, that is what the proposed patch does.

    regards, tom lane
  • Albe Laurenz at Apr 20, 2012 at 3:29 pm

    Tom Lane wrote:
    What if you remove the exact cost comparison, but leave the part where
    old dominates new based on rows?
    Um, that is what the proposed patch does.
    I was referring to the first two lines that the patch removes.
    I guess I don't understand why they should go.

    Anyway, I fail to see how this patch could introduce new
    platform-specific instabilities, so it seems safe to me.

    Yours,
    Laurenz Albe
  • Tom Lane at Apr 20, 2012 at 3:49 pm

    "Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
    Tom Lane wrote:
    Um, that is what the proposed patch does.
    I was referring to the first two lines that the patch removes.
    I guess I don't understand why they should go.
    What we'd have left after the proposed removal is

    if (new_path->rows < old_path->rows)
        remove_old = true;  /* new dominates old */
    else
        accept_new = false; /* old equals or dominates new */

    There's no need to make more than one test on the rows values.

    regards, tom lane
  • Albe Laurenz at Apr 21, 2012 at 8:35 am

    Tom Lane wrote:
    Um, that is what the proposed patch does.
    I was referring to the first two lines that the patch removes.
    I guess I don't understand why they should go.
    What we'd have left after the proposed removal is

    if (new_path->rows < old_path->rows)
        remove_old = true;  /* new dominates old */
    else
        accept_new = false; /* old equals or dominates new */

    There's no need to make more than one test on the rows values.
    I see, thanks for explaining.

    Yours,
    Laurenz Albe
  • Dean Rasheed at Apr 20, 2012 at 9:14 am

    On 19 April 2012 23:39, Tom Lane wrote:
    The idea that I'm toying with is to try to make the choice a bit less
    platform-specific, by removing the exact cost test that add_path uses
    as its last-ditch comparison step, essentially this:

    /*
    * Same pathkeys and outer rels, and fuzzily
    * the same cost, so keep just one; to decide
    * which, first check rows and then do an
    * exact cost comparison.
    */
    if (new_path->rows < old_path->rows)
    remove_old = true;  /* new dominates old */
    -                               else if (new_path->rows > old_path->rows)
    -                                   accept_new = false; /* old dominates new */
    -                               else if (compare_path_costs(new_path, old_path,
    -                                                      TOTAL_COST) < 0)
    -                                   remove_old = true;  /* new dominates old */
    else
    accept_new = false; /* old equals or dominates new */

    This change would mean that, when two paths have the same pathkeys,
    parameterization, and rowcount, and fuzzily the same cost, that we
    arbitrarily keep the first-submitted one rather than looking at low
    order digits of the costs.  Since the order in which different paths
    are generated is a property of our code and not platform-specific,
    this should eliminate platform dependencies in cases where two paths
    are essentially identical to the cost model.
    I think that the fuzz factor here is large enough that people would
    notice, and it would inevitably lead to questions like "why did it
    choose this plan whose cost is 127 over this one whose cost is 126?".

    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.
    That seems like a better approach. Given the accuracy of floating
    point numbers, you could comfortably reduce that down to 1e-12 or even
    less, which would be less user-visible but still remove any
    platform-specific instability.

    Regards,
    Dean
  • Stephen Frost at Apr 20, 2012 at 3:30 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    This change would mean that, when two paths have the same pathkeys,
    parameterization, and rowcount, and fuzzily the same cost, that we
    arbitrarily keep the first-submitted one rather than looking at low
    order digits of the costs.
    +1 on this approach from me.
    Since the order in which different paths
    are generated is a property of our code and not platform-specific,
    this should eliminate platform dependencies in cases where two paths
    are essentially identical to the cost model.
    And the above is why.
    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.
    Not impressed with this idea- the notion that our model is good enough
    to produce valid values out to that many digits is, well, unlikely.

    I have to disagree about users noticing this and complaining about it
    too, to be honest, that strikes me as very unlikely.. For starters,
    they'd have to be debugging the planner sufficiently to see that there
    are two nearly-identical plans under consideration and that we picked
    one over the other based on which came first..

    Thanks,

    Stephen
  • Tom Lane at Apr 20, 2012 at 3:58 pm

    Stephen Frost writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.
    Not impressed with this idea- the notion that our model is good enough
    to produce valid values out to that many digits is, well, unlikely.
    I have to disagree about users noticing this and complaining about it
    too, to be honest, that strikes me as very unlikely.. For starters,
    they'd have to be debugging the planner sufficiently to see that there
    are two nearly-identical plans under consideration and that we picked
    one over the other based on which came first..
    Yeah, I'm pretty dubious about that too. If there is really a reason
    to care which one gets picked, it must be that the actual difference in
    cost is much more than 1%. In which case, the appropriate fix is in the
    cost estimates, not in the details of how add_path resolves near-ties.

    regards, tom lane
  • Tom Lane at Apr 21, 2012 at 4:56 am

    Stephen Frost writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    A variant idea would be to replace the exact cost comparison with a
    second round of fuzzy cost comparison, but with a much tighter fuzz
    factor, maybe 1e-6 instead of 0.01.
    Not impressed with this idea- the notion that our model is good enough
    to produce valid values out to that many digits is, well, unlikely.
    While I remain convinced of the abstract truth of that position, I've
    committed a patch that uses a second round of fuzzy comparison. It
    turned out that simply removing the exact comparison as per my first
    proposal resulted in a surprisingly large number of changes in the
    regression test results; apparently, a lot more cases than we realized
    have multiple plans with indistinguishable costs. I didn't feel like
    going through and validating the test result changes, and besides that
    this could have resulted in changes in behavior-in-the-field that people
    would complain about.

    regards, tom lane
