The examples that Kevin Grittner put up a while back included several
uses of EXISTS() in places where it couldn't be turned into a semijoin,
eg in the query's targetlist. I was musing a bit about whether we could
improve those scenarios. I would like to get 8.4 to the point where we
could say as a blanket performance recommendation "prefer EXISTS over
IN". The semantic gotchas associated with NOT IN make it hard to
optimize well, not to mention being a perennial bane of novices; so if
we could just point people in the other direction without qualification
I think we'd be better off. But how much work would it be to get there?

The single place where IN wins over EXISTS as of CVS HEAD is that a
non-join-optimizable IN clause can still be turned into a "hashed
subplan", which greatly reduces the cost of making IN tests for a large
number of upper-query rows. It looks to me as though, with the current
planning infrastructure, it wouldn't be very difficult to turn EXISTS
(with hashable comparisons to upper variables in its WHERE) into a
similar kind of plan. The problem is that *that isn't necessarily a win*.
Consider something like

SELECT x, y, EXISTS(SELECT * FROM tab1 WHERE tab1.a = tab2.z)
FROM tab2 WHERE ...;

Given that there's an index on tab1.a, the current planning for this
will produce what's essentially a nestloop-with-inner-indexscan plan:
for each tab2 row selected by the outer query, we'll do an indexscan
probe into tab1 to see if there's a match. This is an ideal plan as
long as the outer query doesn't select very many tab2 rows.

We could transform this into the equivalent of a hashed implementation
of

SELECT x, y, z IN (SELECT a FROM tab1)
FROM tab2 WHERE ...;

which would result in loading all of tab1 into a hashtable and then
probing the hashtable for each tab2 row. Now, that wins big if there
are many selected tab2 rows (and tab1 isn't too big to fit in an
in-memory hashtable). But it loses big if the outer query only needs
to probe for a few values --- we won't repay the cost of building
the hashtable.
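For concreteness, the equivalence of the two formulations can be checked with a small sketch (Python with sqlite3 standing in for the real tables; the tab1/tab2 contents are invented, and a Python set plays the role of the hashed subplan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tab1 (a INTEGER)")
cur.execute("CREATE TABLE tab2 (x INTEGER, y INTEGER, z INTEGER)")
cur.executemany("INSERT INTO tab1 VALUES (?)", [(1,), (2,), (3,)])
cur.executemany("INSERT INTO tab2 VALUES (?, ?, ?)",
                [(10, 20, 1), (11, 21, 5)])

# "Retail" form: correlated EXISTS probe, one subquery execution per outer row.
exists_rows = cur.execute(
    "SELECT x, y, EXISTS(SELECT * FROM tab1 WHERE tab1.a = tab2.z) FROM tab2"
).fetchall()

# "Hashed" form: load tab1.a into an in-memory hash table once, then probe it
# for each outer row.
hashtable = {a for (a,) in cur.execute("SELECT a FROM tab1")}
hashed_rows = [(x, y, int(z in hashtable))
               for (x, y, z) in cur.execute("SELECT x, y, z FROM tab2")]

assert exists_rows == hashed_rows  # same answer, very different cost profiles
```

The sketch deliberately sidesteps the NULL semantics that make IN tricky (neither column contains NULLs); it only illustrates the two cost profiles, not the full planner transformation.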

So I think it'd be a really bad idea to make this change blindly.
For everyone whose query got sped up, someone else's would be slowed
down --- in fact, for apps that are tuned to PG's existing behavior,
you could expect that it'd mostly be the latter case.

The problem then is to make the choice of plan with some intelligence.
The bit of information that we lack in order to do that is an idea of
how many times the outer query will call the EXISTS subquery. Right
now, all subqueries that can't be pulled up as joins are planned fully
during SS_process_sublinks(), which is called long before we can have
any idea about that. I looked into whether it's feasible to postpone
planning subqueries till later on in planning. I think it's probably
structurally possible without an enormous amount of work, but it's not
exactly trivial either.

Even given that we postpone planning/costing subqueries until we really
need to know the cost, we're not out of the woods. For an EXISTS
appearing in a join condition, it's entirely possible that different
join sequences will result in executing the EXISTS wildly different
numbers of times. Re-planning the EXISTS subquery each time we consider
a different upper-query join sequence seems entirely impractical on
speed grounds.

So it seems like what we'd need to do is

* During planner startup, generate Paths (we'd need no greater level of
detail) for both the "retail" and hashed version of each EXISTS
subquery. From these, estimate the startup cost of the hashed version
(ie, time to execute the un-qualified subquery once and load the hash
table) and the per-upper-row costs of each approach. Stash these costs
somewhere handy.

* While forming upper-query paths, estimate the costs of each approach
on-the-fly for every path, based on the estimated number of rows in the
input paths. Use the cheaper case while figuring the cost of that
upper path.

* While building the final Plan, instantiate whichever subquery Plan
is a winner in the context of the chosen upper path.
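The per-path comparison in the second step reduces to simple arithmetic; here is a sketch with invented cost numbers (the function name and cost units are hypothetical, not planner API):

```python
def choose_exists_strategy(est_outer_rows: float,
                           retail_cost_per_row: float,
                           hash_startup_cost: float,
                           hash_cost_per_row: float) -> str:
    """Pick the cheaper EXISTS implementation for an estimated number of
    upper-query rows, mirroring the Path-level comparison described above."""
    retail_total = est_outer_rows * retail_cost_per_row
    hashed_total = hash_startup_cost + est_outer_rows * hash_cost_per_row
    return "retail" if retail_total <= hashed_total else "hashed"

# With an index probe at 4.0 units/row versus a 1000-unit hash build plus
# 0.01 units/probe, the break-even point is about 251 outer rows.
assert choose_exists_strategy(10, 4.0, 1000.0, 0.01) == "retail"
assert choose_exists_strategy(100000, 4.0, 1000.0, 0.01) == "hashed"
```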

I don't think any of this is out of reach, but it'd be a nontrivial
bit of implementation effort (maybe a week or three) and it also looks
like there might be a measurable planning slowdown for any query
involving subqueries. I'm not sure yet how much of this is just moving
existing work around and how much will actually represent new planning
expense. But certainly, costing EXISTS subqueries two different ways
isn't going to be free.

So ... I'm wondering if this actually touches anyone's hot-button,
or if we should just file it in the overflowing pile of Things That
Might Be Nice To Do Someday.

Comments?

regards, tom lane


  • Alvaro Herrera at Aug 19, 2008 at 2:57 am

    Tom Lane wrote:

    I don't think any of this is out of reach, but it'd be a nontrivial
    bit of implementation effort (maybe a week or three) and it also looks
    like there might be a measurable planning slowdown for any query
    involving subqueries. I'm not sure yet how much of this is just moving
    existing work around and how much will actually represent new planning
    expense. But certainly, costing EXISTS subqueries two different ways
    isn't going to be free.
    The typical comment around here is that it's usually a huge win when a
    bit of CPU time is spent in buying I/O savings. Since most servers are
    I/O bound anyway, and since most servers nowadays are typically
    oversized in CPU terms, this strikes me as a good tradeoff to be
    making.

    In any case, most of the time EXISTS queries are expensive queries, so
    spending more time planning them is probably good. It'd be a shame to
    spend a lot of time planning queries that are trivial in execution
    costs, but I wouldn't expect this to be the case here.

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Stephen Frost at Aug 19, 2008 at 1:06 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    So ... I'm wondering if this actually touches anyone's hot-button,
    or if we should just file it in the overflowing pile of Things That
    Might Be Nice To Do Someday.
    What bugs me the most about having IN() be faster than EXISTS() in
    certain situations is that it ends up being counter-intuitive and not
    really what you'd expect to happen. That being said, we can always tell
    people that they can use IN() as a work-around for these situations. In
    the long run, I think it's definitely worth it to spend a bit of extra
    time in planning the query for this case. Not knowing what else is on
    your plate for 8.4, I don't know where I'd rank this, but it wouldn't be
    at the top.

    Thanks,

    Stephen
  • Kevin Grittner at Aug 19, 2008 at 2:44 pm

    Tom Lane wrote:
    The examples that Kevin Grittner put up a while back included several
    uses of EXISTS() in places where it couldn't be turned into a semijoin,
    eg in the query's targetlist. I was musing a bit about whether we could
    improve those scenarios. I would like to get 8.4 to the point where we
    could say as a blanket performance recommendation "prefer EXISTS over
    IN". The semantic gotchas associated with NOT IN make it hard to
    optimize well, not to mention being a perennial bane of novices; so if
    we could just point people in the other direction without qualification
    I think we'd be better off.

    Agreed.
    So ... I'm wondering if this actually touches anyone's hot-button,
    or if we should just file it in the overflowing pile of Things That
    Might Be Nice To Do Someday.

    Comments?
    I'm in the position of trying to influence programmers here to write
    queries using set logic. Way too many of the queries here are coded
    with a cursor for a "primary" select, with a bunch of lower level
    cursors to navigate around and get the related rows one at a time.
    Results are often stuck into a work table as this progresses, with the
    work table massaged a bit here and there in this procedural process,
    and the final results selected out. It should surprise nobody here
    that this is not fast to write, easy to maintain, efficient to run, or
    generally free from subtle errors. I point out that they should write
    queries which state what they want, regardless of how complex those
    rules are, instead of writing how to get it. The optimizer, I argue,
    has tricks available which they don't.

    Usually, a rewrite into set logic has a fraction of the number of
    lines, runs much faster, and loses a bug or two that was hidden within
    the procedural spaghetti. On the other hand, sometimes they write a
    perfectly good "set logic" query (from the point of view of stating
    what they want), and the optimizer falls down, and I have to come in
    and say "Oh, it has trouble with EXISTS; you can use IN here." When I
    tell them to use IN instead of EXISTS, then I need to have all these
    caveats about the sizes of tables and the possibilities of NULL on the
    NOT EXISTS. At this point I tend to lose a big part of my audience.

    So I'd be very happy to see this work done, not because I can't find a
    workaround, but because trying to teach all the programmers tricky
    hand-optimizations is a losing battle, and if I lose that battle the
    queries degenerate into spaghetti-land.

    As with others, I can't say where this fits on a priority list, but I
    would hate to see it drift off onto a "someday" list.

    -Kevin
  • Tom Lane at Aug 20, 2008 at 5:46 pm

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    Tom Lane wrote:
    [ complicated scheme for improving planning of EXISTS ]
    So I'd be very happy to see this work done, not because I can't find a
    workaround, but because trying to teach all the programmers tricky
    hand-optimizations is a losing battle, and if I lose that battle the
    queries degenerate into spaghetti-land.
    I spent some time looking at this, and soon grew rather discouraged:
    even the very first step of what I'd had in mind, which was to delay
    replacement of uplevel Vars with Params until late in the planning
    process, looks like it will destabilize large amounts of code that
    aren't particularly related to the problem at hand. (Most of the
    planner blithely assumes that it will never see an uplevel Var, and
    tends to just treat any Var as being of the current query level.)

    So I backed off and thought some more, and eventually came to this
    conclusion: when we have an EXISTS that could be done both ways,
    why not just generate plans for both ways, and leave the decision
    which to use until later? Like maybe even execution time?

    We have speculated in the past about having alternative plans that
    could be conditionally executed based on information not available
    at planning time. This could be seen as a first experiment in that
    direction. I am not thinking of a general-purpose AlternativePlan
    kind of execution node, because SubPlans aren't actually part of the
    main plan-node tree, but an AlternativeSubPlans expression node
    type might work.

    The two issues that would obviously have to be faced to make this
    work are:

    1. While the planner is estimating evaluation costs of the qual
    conditions for the upper query, which EXISTS implementation do we assume
    will be used? It might be that we could still use my original idea of
    providing cost_qual_eval() with some context about the likely number of
    calls, but what I'm thinking at the moment is that it's not worth the
    trouble, because it isn't going to matter that much. Either possibility
    is expensive enough compared to ordinary qual conditions that the
    planner will be driven in the direction of plans that minimize the
    number of EXISTS evaluations, and that's all that we really care about.
    So I'd be inclined to just use the numbers for the base (non hashed)
    implementation and be done with it.

    2. How will the executor make the decision which to use? Well, it's
    got access to the overall rowcount estimates that the planner made.
    What I'm thinking of doing is having the AlternativeSubPlans node
    look at the rowcount estimate of its immediate parent Plan node.
    This is actually exactly the right number for a subplan in the
    targetlist of the Plan node. For a subplan in the qual list, it's
    an underestimate, but probably not an enormous underestimate.
    (Assuming that the subplan is at the end of the qual list, which is
    where it'd normally be, the expected number of calls of the subplan
    would be the output rowcount estimate divided by the estimated
    selectivity of the subplan qual --- but at present the latter is always
    0.5 ...)
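    A quick worked example of the arithmetic in that parenthetical, with a
    made-up output estimate:

    ```python
    # Expected subplan calls when the subplan qual sits at the end of the
    # qual list: the rows that survive it (the output estimate) divided by
    # its selectivity gives the number of rows that reached it.
    output_rowcount_estimate = 500.0
    subplan_qual_selectivity = 0.5   # the planner's current default
    expected_calls = output_rowcount_estimate / subplan_qual_selectivity
    assert expected_calls == 1000.0  # twice the output estimate, not half
    ```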

    Another technique that we could play with is to have the
    AlternativeSubPlans node track the actual number of calls it gets,
    and switch from the "retail" implementation to the "hashed"
    implementation if that exceeds a threshold. This'd provide some
    robustness in the face of bad estimates, although of course it's
    not optimal compared to having made the right choice to start with.
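    The call-counting fallback could look roughly like this (a pure-Python
    sketch, not executor code; the class name, hooks, and threshold are all
    invented for illustration):

    ```python
    class AlternativeSubPlan:
        """Sketch of an EXISTS subplan that starts with per-call ("retail")
        probes and switches to a prebuilt hash table past a call threshold."""

        def __init__(self, probe_fn, load_all_fn, threshold=100):
            self.probe_fn = probe_fn        # retail: run subquery for one value
            self.load_all_fn = load_all_fn  # hashed: fetch all inner values once
            self.threshold = threshold
            self.calls = 0
            self.hashtable = None

        def exists(self, value) -> bool:
            self.calls += 1
            if self.hashtable is None and self.calls > self.threshold:
                # Too many probes: pay the startup cost once, then hash-probe.
                self.hashtable = set(self.load_all_fn())
            if self.hashtable is not None:
                return value in self.hashtable
            return self.probe_fn(value)

    inner = {1, 2, 3}
    plan = AlternativeSubPlan(probe_fn=lambda v: v in inner,
                              load_all_fn=lambda: inner,
                              threshold=2)
    results = [plan.exists(v) for v in (1, 5, 2, 9)]
    assert results == [True, False, True, False]
    assert plan.hashtable is not None  # switched after the threshold
    ```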

    Thoughts?

    regards, tom lane
  • Kevin Grittner at Aug 20, 2008 at 10:48 pm

    Tom Lane wrote:
    when we have an EXISTS that could be done both ways,
    why not just generate plans for both ways, and leave the decision
    which to use until later?
    That seems good to me. The costs for the slower plan generally come
    out much higher. When the run times are close, the one that edges out
    the other doesn't always win, but that's to be expected. EXISTS is
    hardly unique in that respect. Competing on costs seems better than
    some more mechanical approach.
    Like maybe even execution time?

    We have speculated in the past about having alternative plans that
    could be conditionally executed based on information not available
    at planning time. This could be seen as a first experiment in that
    direction. I am not thinking of a general-purpose AlternativePlan
    kind of execution node, because SubPlans aren't actually part of the
    main plan-node tree, but an AlternativeSubPlans expression node
    type might work.

    The two issues that would obviously have to be faced to make this
    work are:

    1. While the planner is estimating evaluation costs of the qual
    conditions for the upper query, which EXISTS implementation do we assume
    will be used? It might be that we could still use my original idea of
    providing cost_qual_eval() with some context about the likely number of
    calls, but what I'm thinking at the moment is that it's not worth the
    trouble, because it isn't going to matter that much. Either possibility
    is expensive enough compared to ordinary qual conditions that the
    planner will be driven in the direction of plans that minimize the
    number of EXISTS evaluations, and that's all that we really care about.
    So I'd be inclined to just use the numbers for the base (non hashed)
    implementation and be done with it.
    Seems reasonable from this point of view: it seems like you'd never
    choose a plan worse than in current releases, although you might
    sometimes miss a plan that would be even faster than the suggested
    improvement finds. I think it makes sense to defer this until such
    time (if ever) that it is shown to be worth the effort.
    2. How will the executor make the decision which to use? Well, it's
    got access to the overall rowcount estimates that the planner made.
    What I'm thinking of doing is having the AlternativeSubPlans node
    look at the rowcount estimate of its immediate parent Plan node.
    This is actually exactly the right number for a subplan in the
    targetlist of the Plan node. For a subplan in the qual list, it's
    an underestimate, but probably not an enormous underestimate.
    (Assuming that the subplan is at the end of the qual list, which is
    where it'd normally be, the expected number of calls of the subplan
    would be the output rowcount estimate divided by the estimated
    selectivity of the subplan qual --- but at present the latter is always
    0.5 ...)
    If you meant multiplied by 0.5, I think I followed that. Made sense.
    Another technique that we could play with is to have the
    AlternativeSubPlans node track the actual number of calls it gets,
    and switch from the "retail" implementation to the "hashed"
    implementation if that exceeds a threshold. This'd provide some
    robustness in the face of bad estimates, although of course it's
    not optimal compared to having made the right choice to start with.
    That sounds interesting, but unless it has value as a prototype for
    other runtime adaptivity, it sounds like a lot of work for the
    benefit. I'm not that unhappy with the estimates I'm getting in a
    properly tuned database. And the execution-time work to process some
    number of rows this way seems likely to far exceed the work to refine
    the estimates and costing used to choose a plan.

    -Kevin
  • Robert Haas at Aug 21, 2008 at 1:13 am

    Another technique that we could play with is to have the
    AlternativeSubPlans node track the actual number of calls it gets,
    and switch from the "retail" implementation to the "hashed"
    implementation if that exceeds a threshold. This'd provide some
    robustness in the face of bad estimates, although of course it's
    not optimal compared to having made the right choice to start with.
    Ideally you'd want to set that threshold dynamically. If you expect x
    calls and midway through execution notice that you're already up to 2x
    calls, the right thing to do depends a lot on whether you're 1% done
    or 99% done.

    Logic of this type also opens a bit of a can of worms, in that there
    are probably many other situations in which it's possible to notice
    that your estimates are off and shift gears in mid-query, but how much
    are you willing to slow down the queries where there isn't a problem
    to speed up the ones where there is?

    ...Robert
  • Decibel! at Aug 22, 2008 at 4:28 pm

    On Aug 20, 2008, at 12:43 PM, Tom Lane wrote:
    We have speculated in the past about having alternative plans that
    could be conditionally executed based on information not available
    at planning time. This could be seen as a first experiment in that
    direction. I am not thinking of a general-purpose AlternativePlan
    kind of execution node, because SubPlans aren't actually part of the
    main plan-node tree, but an AlternativeSubPlans expression node
    type might work.
    Something I think we could also use is the ability to grab certain
    information before planning takes place. The big case that comes to
    mind is:

    SELECT ... FROM big_table b
    JOIN small_lookup_table s USING (small_lookup_id)
    WHERE s.some_name = 'alpha';

    ... or where we're doing s.some_name IN ('a','b','c'). In many cases,
    translating the some_name lookup into actual _id values that you can
    then look at in pg_stats for big_table results in a huge improvement
    in rowcount estimates. If this is then joining to 5 other tables,
    that rowcount information can have a huge impact on the query plan.
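    The manual version of that idea — resolving the lookup names to ids
    before the main query is issued — can be sketched in application code
    (schema and values invented; sqlite3 stands in for PostgreSQL):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE small_lookup_table"
                " (small_lookup_id INTEGER, some_name TEXT)")
    cur.execute("CREATE TABLE big_table"
                " (small_lookup_id INTEGER, payload TEXT)")
    cur.executemany("INSERT INTO small_lookup_table VALUES (?, ?)",
                    [(1, 'alpha'), (2, 'beta')])
    cur.executemany("INSERT INTO big_table VALUES (?, ?)",
                    [(1, 'row1'), (2, 'row2'), (1, 'row3')])

    # Step 1: resolve the name to concrete ids with a cheap lookup query.
    ids = [i for (i,) in cur.execute(
        "SELECT small_lookup_id FROM small_lookup_table WHERE some_name = ?",
        ('alpha',))]

    # Step 2: query the big table with the literal ids, so the planner can
    # consult per-column statistics on big_table.small_lookup_id directly.
    placeholders = ",".join("?" * len(ids))
    rows = cur.execute(
        f"SELECT payload FROM big_table WHERE small_lookup_id IN ({placeholders})",
        ids).fetchall()
    assert rows == [('row1',), ('row3',)]
    ```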
    Another technique that we could play with is to have the
    AlternativeSubPlans node track the actual number of calls it gets,
    and switch from the "retail" implementation to the "hashed"
    implementation if that exceeds a threshold. This'd provide some
    robustness in the face of bad estimates, although of course it's
    not optimal compared to having made the right choice to start with.

    In many systems, having the most optimal plan isn't that important;
    not having a really bad plan is. I expect that giving the executor
    the ability to decide the planner made a mistake and shift gears
    would go a long way to reducing the impact of bad plans. I wonder if
    any other databases have that ability... maybe this will be a first. :)
    --
    Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
    Give your computer some brain candy! www.distributed.net Team #1828

Discussion Overview
group: pgsql-hackers @ postgresql
posted: Aug 19, '08 at 2:15a
active: Aug 22, '08 at 4:28p
posts: 8
users: 6