This is all on Postgres 9.0.0:

I'm working on the definition for a view that, as part of its output,
includes three columns that each contain a sum of values in a table
that are in one of a few different states -- essentially: for each
parent, give me the sum of children in the foo state, a second sum of
the children in the bar state, and a third sum of the children in the
baz state.

Thinking I'd be clever and avoid multiple almost-identical subqueries
to for each SUM, I decided to do it in a single subquery that returns
an array of all 3 sums and then split it out into its component parts
higher up (see the example below if you don't understand what I mean).
Unfortunately, the query planner doesn't quite handle this case: For
each reference to the subquery value, it duplicates the subquery and
thus its plan:

CREATE TABLE parent (
id INT PRIMARY KEY
);
INSERT INTO parent(id) SELECT GENERATE_SERIES(1, 1000);

CREATE TABLE child (
parent_id INT,
v1 INT,
v2 INT,
PRIMARY KEY(parent_id, v1)
);

INSERT INTO child(parent_id, v1, v2)
SELECT p.id, v1.id, RANDOM() * 500
FROM
GENERATE_SERIES(1, 1000) AS p(id)
CROSS JOIN GENERATE_SERIES(1, 20) AS v1(id)
;

EXPLAIN SELECT *, child_data[1] AS foo, child_data[2] AS bar,
child_data[3] AS baz FROM (
SELECT p.id,
(
SELECT
ARRAY[
SUM(c.v2),
SUM(CASE WHEN c.v1 > 15 THEN c.v2 ELSE 0 END),
SUM(CASE WHEN c.v1 > 5 THEN c.v2 ELSE 0 END)
]
FROM child AS c
WHERE c.parent_id=p.id
) AS child_data
FROM parent AS p
) AS t;

produces:

Seq Scan on wings_sky.parent p (cost=0.00..161113.12 rows=1000 width=4)
Output: p.id, (SubPlan 1), ((SubPlan 2))[1], ((SubPlan 3))[2], ((SubPlan 4))[3]
SubPlan 1
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)
SubPlan 2
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)
SubPlan 3
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)
SubPlan 4
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)

There's another method of writing this that is more efficient in this
PARTICULAR case:


EXPLAIN VERBOSE SELECT parent.id, t.foo, t.bar, t.baz
FROM
parent
INNER JOIN (
SELECT
child.parent_id,
SUM(child.v2) AS foo,
SUM(CASE WHEN child.v1 > 15 THEN child.v2 ELSE 0 END) AS bar,
SUM(CASE WHEN child.v1 > 5 THEN child.v2 ELSE 0 END) AS baz
FROM
child
GROUP BY
child.parent_id
) AS t ON t.parent_id=parent.id

Hash Join (cost=547.12..556.62 rows=200 width=28)
Output: parent.id, (sum(child.v2)), (sum(CASE WHEN (child.v1 > 15)
THEN child.v2 ELSE 0 END)), (sum(CASE WHEN (child.v1 > 5) THEN
child.v2 ELSE 0 END))
Hash Cond: (child.parent_id = parent.id)
-> HashAggregate (cost=483.12..487.62 rows=200 width=12)
Output: child.parent_id, sum(child.v2), sum(CASE WHEN (child.v1 > 15)
THEN child.v2 ELSE 0 END), sum(CASE WHEN (child.v1 > 5) THEN child.v2
ELSE 0 END)
-> Seq Scan on wings_sky.child (cost=0.00..291.06 rows=19206 width=12)
Output: child.parent_id, child.v1, child.v2
-> Hash (cost=34.00..34.00 rows=2400 width=4)
Output: parent.id
-> Seq Scan on wings_sky.parent (cost=0.00..34.00 rows=2400 width=4)
Output: parent.id


However, the second plan performs poorly in cases in the case where
parent_id can be more than one possible value (i.e. there is no WHERE
parent_id=1 clause) -- it takes nearly as long in that instance as it
does with no WHERE clause altogether. Both plans run the same speed
with one parent_id. The first plan starts losing speed gradually as
the number of parents increase; the second plan is either
all-or-nothing.



In the first case, it seems inefficient to duplicate the subplan for
each reference -- I'd think the (corrected) plan should look something
like this:

Seq Scan on wings_sky.parent p (cost=0.00..161113.12 rows=1000 width=4)
Output: p.id, (SubPlan 1), ((SubPlan 1))[1], ((SubPlan 1))[2], ((SubPlan 1))[3]
SubPlan 1
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)

Is there any chance this might be looked at in a future release?

--
Daniel Grace
AGE, LLC
System Administrator and Software Developer
dgrace@wingsnw.com // www.wingsnw.com

Search Discussions

  • Robert Haas at Oct 4, 2010 at 6:47 pm

    On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace wrote:
    Is there any chance this might be looked at in a future release?
    This is another interesting example of a case where an inlining-type
    optimization (which is effectively what's happening here, I think)
    turns out to be a negative. We had one a while back that involved
    actual function inlining, which is not quite what's happening here,
    but it's close. It doesn't seem too hard to figure out whether or not
    inlining is a win (non-trivial subexpressions should probably never
    get duplicated), but nobody's gotten around to writing the logic to
    make it work yet. One useful technique is to stick "OFFSET 0" into
    the subquery; that prevents it from being inlined and gives you
    something more like the plan you were hoping for.

    Whether or not this will get looked at in a future release is a tough
    question to answer. It's possible that someone (most likely, Tom)
    will get motivated to fix this out of sheer annoyance with the current
    behavior, or will notice a way to improve it incidentally while making
    some other change. But of course the only way to make sure it gets
    fixed is to do it yourself (or pay someone to do it).

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Daniel Grace at Oct 4, 2010 at 8:23 pm
    As a theoretical question (I'm not savvy on Postgres's code but might
    be intrigued enough to beat on it anyways), is it feasible to do an
    additional pass on the query plan that essentially goes:

    - Are these two subplans identical?
    - Are they at the same part of the tree?

    and if both of these conditions are true, discard one subplan and
    rewrite all references to point to the other one?

    Assuming it IS possible, are there any particular situations where it
    wouldn't work?
    On Mon, Oct 4, 2010 at 11:47 AM, Robert Haas wrote:
    On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace wrote:
    Is there any chance this might be looked at in a future release?
    This is another interesting example of a case where an inlining-type
    optimization (which is effectively what's happening here, I think)
    turns out to be a negative.  We had one a while back that involved
    actual function inlining, which is not quite what's happening here,
    but it's close.  It doesn't seem too hard to figure out whether or not
    inlining is a win (non-trivial subexpressions should probably never
    get duplicated), but nobody's gotten around to writing the logic to
    make it work yet.  One useful technique is to stick "OFFSET 0" into
    the subquery; that prevents it from being inlined and gives you
    something more like the plan you were hoping for.

    Whether or not this will get looked at in a future release is a tough
    question to answer.  It's possible that someone (most likely, Tom)
    will get motivated to fix this out of sheer annoyance with the current
    behavior, or will notice a way to improve it incidentally while making
    some other change.  But of course the only way to make sure it gets
    fixed is to do it yourself (or pay someone to do it).

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company


    --
    Daniel Grace
    AGE, LLC
    System Administrator and Software Developer
    dgrace@wingsnw.com // (425)327-0079 // www.wingsnw.com
  • Robert Haas at Oct 4, 2010 at 9:50 pm

    On Mon, Oct 4, 2010 at 4:15 PM, Daniel Grace wrote:
    As a theoretical question (I'm not savvy on Postgres's code but might
    be intrigued enough to beat on it anyways), is it feasible to do an
    additional pass on the query plan that essentially goes:

    - Are these two subplans identical?
    - Are they at the same part of the tree?

    and if both of these conditions are true, discard one subplan and
    rewrite all references to point to the other one?

    Assuming it IS possible, are there any particular situations where it
    wouldn't work?
    Well, volatile functions would be a problem, but I don't think that's
    really the way to go anyway. I think that deciding whether or not to
    collapse subqueries into the main query (which is what's happening
    here) seems like it could be done by counting the number of times any
    given subexpression is referenced, which seems like it would be a lot
    cheaper than checking things against each other pairwise to see if
    they match up. Slowing down the planner to detect cases like this
    wouldn't be very appealing:

    SELECT (SELECT SUM(a) FROM bob, SELECT SUM(b) FROM bob, SELECT SUM(c) FROM bob);

    ...because very few people are going to write that query that way
    anyway, and unless it's almost free to notice the duplication, it
    doesn't make sense to spend planning time on it.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Tom Lane at Oct 5, 2010 at 12:55 am

    Robert Haas writes:
    On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace wrote:
    Is there any chance this might be looked at in a future release?
    This is another interesting example of a case where an inlining-type
    optimization (which is effectively what's happening here, I think)
    turns out to be a negative. We had one a while back that involved
    actual function inlining, which is not quite what's happening here,
    but it's close. It doesn't seem too hard to figure out whether or not
    inlining is a win (non-trivial subexpressions should probably never
    get duplicated), but nobody's gotten around to writing the logic to
    make it work yet.
    Actually this case has some history behind it. The reason that the
    sub-sub-query gets duplicated is that we flatten the sub-query, resulting
    in duplicating its output expressions wherever they're referenced.
    Now before you say that that's stupid, please notice that this case
    got disabled in 7.3 as a bug workaround, and that almost immediately
    produced howls of pain:
    http://archives.postgresql.org/pgsql-general/2002-12/msg00410.php
    http://archives.postgresql.org/pgsql-performance/2002-12/msg00214.php
    so I undid it as soon as I could:
    http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=de97072e3c88e104a55b0d5c67477f1b0097c003

    While just shutting off the pull-up unconditionally would merely require
    reintroducing a contains_subplans() restriction in is_simple_subquery(),
    I'm unwilling to do that because of the earlier complaints. And
    "figuring out whether it's a win" is a lot harder than you opine above:
    at the point where this decision has to be taken, we have no cost
    information whatsoever, and not even any clear idea whether the subquery
    outputs are referenced at all let alone multiply referenced.

    My thought about the current shortest path to a solution would be to
    wrap subquery-containing output expressions in PlaceHolderVars.
    Per this comment elsewhere in is_simple_subquery(),

    /*
    * Don't pull up a subquery that has any volatile functions in its
    * targetlist. Otherwise we might introduce multiple evaluations of these
    * functions, if they get copied to multiple places in the upper query,
    * leading to surprising results. (Note: the PlaceHolderVar mechanism
    * doesn't quite guarantee single evaluation; else we could pull up anyway
    * and just wrap such items in PlaceHolderVars ...)
    */

    that doesn't entirely work at the moment, but it might work if we were
    to rejigger the PHV mechanism a bit. This would be attractive because
    the added overhead of a PHV would be nearly nil, so we could afford to
    do this without making any predictions about relative costs.

    As a workaround until somebody gets around to looking at that, I'd
    suggest that Daniel plaster his sub-query with the good old all-purpose
    optimization fence, "OFFSET 0". That will prevent flattening and thus
    prevent the sub-sub-query from getting duplicated.

    regards, tom lane

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-bugs @
categoriespostgresql
postedSep 27, '10 at 9:16p
activeOct 5, '10 at 12:55a
posts5
users3
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase