I have a query which seems to be taking an extraordinarily long time
(many minutes, at least) when seemingly equivalent queries have
different plans and execute in seconds. naturally, I'd like to know why.

Version is Postgresql 8.4.8. The table, "t", is

Column | Type | Modifiers
--------+---------+-----------
y | integer | not null
x | integer | not null
k | integer | not null
j | integer | not null
z | integer | not null
Indexes:
"t_pkey" PRIMARY KEY, btree (j, k, x, y, z)

The table population, in pseudocode, is this:
for x in 0..9
for y in 0..9999
for z in 0..29
INSERT INTO t VALUES(y,x,0,0,z)

So the table has 300000 entries, with j and k always 0.

The query is:

SELECT *
FROM (
SELECT * FROM t GROUP BY j,k,x,z,y
) AS f
NATURAL JOIN t;

The plan:

Merge Join (cost=44508.90..66677.96 rows=1 width=20)
Merge Cond: ((public.t.j = public.t.j) AND (public.t.k = public.t.k)
AND (public.t.x = public.t.x))
Join Filter: ((public.t.y = public.t.y) AND (public.t.z = public.t.z))
-> Group (cost=44508.90..49008.90 rows=30000 width=20)
-> Sort (cost=44508.90..45258.90 rows=300000 width=20)
Sort Key: public.t.j, public.t.k, public.t.x, public.t.z,
public.t.y
-> Seq Scan on t (cost=0.00..4911.00 rows=300000 width=20)
-> Index Scan using t_pkey on t (cost=0.00..14877.18 rows=300000
width=20)

This query runs at least 20 minutes, with postmaster CPU utilization at
99%, without completing. System is a 3.2GHz Zeon, 3GB memory, and not
much else running.

By contrast, placing an intermediate result in a table "u" provides a
result in about 3 seconds:

CREATE TEMPORARY TABLE u AS SELECT * FROM t GROUP BY j,k,x,z,y;
SELECT * FROM u NATURAL JOIN t;

Changing the order of the GROUP BY clause varies the plan, sometimes
yielding shorter execution times. For example, this ordering executes in
about 1.5 seconds:

SELECT *
FROM (
SELECT * FROM t GROUP BY j,k,x,y,z
) AS f
NATURAL JOIN t;

With 120 permutations, I didn't try them all.

I should note that the plans tend to have similar costs, so the query
planner presumably does not know that some permutations have
significantly greater execution times.

Clem Dickey

Search Discussions

  • Clem Dickey at Jul 7, 2011 at 1:08 am
    On 07/05/2011 07:26 PM, Clem Dickey wrote:

    Updates after belatedly reading the "slow queries" guidelines:

    Version: PostgreSQL 8.4.8 on x86_64-redhat-linux-gnu, compiled by GCC
    gcc (GCC) 4.4.5 20101112 (Red Hat 4.4.5-2), 64-bit

    The query has always been slow; the table for this test case is never
    updated. I don't run VACUUM but do run ANALYZE.

    Originally all database config parameters were the default. Since
    yesterday I have changed
    shared_buffers = 224MB
    effective_cache_size = 1024MB
    but seen no change in behavior.
    Column | Type | Modifiers
    --------+---------+-----------
    y | integer | not null
    x | integer | not null
    k | integer | not null
    j | integer | not null
    z | integer | not null
    Indexes:
    "t_pkey" PRIMARY KEY, btree (j, k, x, y, z)

    The table population, in pseudocode, is this:
    for x in 0..9
    for y in 0..9999
    for z in 0..29
    INSERT INTO t VALUES(y,x,0,0,z)
    The query is:

    SELECT *
    FROM (
    SELECT * FROM t GROUP BY j,k,x,z,y
    ) AS f
    NATURAL JOIN t;
    The EXPLAIN ANALYZE output is http://explain.depesz.com/s/KGk

    Notes on the analysis:
    1. I see that the planner estimates that GROUP BY will reduce 300K rows
    to 30K, a bit odd because every row which the planner could examine is
    in a unique group.
    2. The JOIN is expected to produce one row. I'm not sure how the planner
    came up with that estimate.
    By contrast, placing an intermediate result in a table "u" provides a
    result in about 3 seconds:
    => EXPLAIN ANALYZE CREATE TABLE u AS SELECT * FROM t GROUP BY
    j,k,x,z,y;EXPLAIN ANALYZE SELECT * FROM u NATURAL JOIN t;DROP TABLE u;
    QUERY PLAN

    ----------------------------------------------------------------------------------------------------------------------
    Group (cost=44508.90..49008.90 rows=30000 width=20) (actual
    time=1305.381..2028.385 rows=300000 loops=1)
    -> Sort (cost=44508.90..45258.90 rows=300000 width=20) (actual
    time=1305.374..1673.843 rows=300000 loops=1)
    Sort Key: j, k, x, z, y
    Sort Method: external merge Disk: 8792kB
    -> Seq Scan on t (cost=0.00..4911.00 rows=300000 width=20)
    (actual time=0.008..62.935 rows=300000 loops=1)
    Total runtime: 2873.590 ms
    (6 rows)

    QUERY PLAN

    ---------------------------------------------------------------------------------------------------------------------------------
    Merge Join (cost=46229.86..72644.38 rows=1 width=20) (actual
    time=1420.527..2383.507 rows=300000 loops=1)
    Merge Cond: ((t.j = u.j) AND (t.k = u.k) AND (t.x = u.x) AND (t.y =
    u.y) AND (t.z = u.z))
    -> Index Scan using t_pkey on t (cost=0.00..14877.18 rows=300000
    width=20) (actual time=0.013..118.244 rows=300000 loops=1)
    -> Materialize (cost=46229.86..50123.52 rows=311493 width=20)
    (actual time=1420.498..1789.864 rows=300000 loops=1)
    -> Sort (cost=46229.86..47008.59 rows=311493 width=20)
    (actual time=1420.493..1692.988 rows=300000 loops=1)
    Sort Key: u.j, u.k, u.x, u.y, u.z
    Sort Method: external merge Disk: 8784kB
    -> Seq Scan on u (cost=0.00..5025.93 rows=311493
    width=20) (actual time=0.018..78.850 rows=300000 loops=1)
    Total runtime: 2424.870 ms
    (9 rows)

    (Adding an "ANALYZE" on the temporary table improves the JOIN estimated
    fow count from 1 to about 299500, but does not change the plan.)

    Clem Dickey
  • Clem Dickey at Jul 9, 2011 at 1:38 am

    On 07/06/2011 05:59 PM, Clem Dickey wrote:
    On 07/05/2011 07:26 PM, Clem Dickey wrote:

    Column | Type | Modifiers
    --------+---------+-----------
    y | integer | not null
    x | integer | not null
    k | integer | not null
    j | integer | not null
    z | integer | not null
    Indexes:
    "t_pkey" PRIMARY KEY, btree (j, k, x, y, z)

    The table population, in pseudocode, is this:
    for x in 0..9
    for y in 0..9999
    for z in 0..29
    INSERT INTO t VALUES(y,x,0,0,z)
    The query is:

    SELECT *
    FROM (
    SELECT * FROM t GROUP BY j,k,x,z,y
    ) AS f
    NATURAL JOIN t;
    The EXPLAIN ANALYZE output is http://explain.depesz.com/s/KGk

    Notes on the analysis:
    1. I see that the planner estimates that GROUP BY will reduce 300K rows
    to 30K, a bit odd because every row which the planner could examine is
    in a unique group.
    GROUP BY assumes an average 10-element grouping in cases with more than
    one GROUP BY expression. Wrong for this test case, but probably OK in
    general.
    2. The JOIN is expected to produce one row. I'm not sure how the planner
    came up with that estimate.
    The winning Join (merge join) had a very poor estimate of its
    performance. Like a low-ball contract bid. :-)

    a. The Join cost estimators could have been given more information

    The functions which estimate JOIN selectivity (e.g. the chance that
    tuples will match in an equijoin, for instance) use data produced by
    ANALYZE. But the SELECT .. GROUP BY does not propagate ANALYZE data from
    the columns of its input relation to its output relation. That is too
    bad, because the column value statistics (number of unique values) would
    have improved selectivity estimates for all three join plans (merge
    join, nested loop, and hash join).

    b. the Merge Join cost estimator did a poor job with the data it was given:

    In function eqjoinsel_inner there are two cases (1) ANALYZE data is
    available for both sides of the join and (2) ANALYZE data is missing for
    one or both sides. Due to the GROUP BY processing described above,
    ANALYZE data was available for "t" but not for "SELECT * FROM t GROUP BY
    ...".

    The logic in that case is "use the column with the most distinct values"
    to estimate selectivity. The default number of distinct values for a
    column with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number
    of values was:

    col in GROUP BY in table t
    j 200 1
    k 200 1
    x 200 10
    y 200 1000
    z 200 30

    In 4 of the 5 columns the default value had more distinct values, and
    the combined selectivity (chance that two arbitrary rows would have a
    join match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that
    the code does not distinguish known numbers from default numbers. A
    comment in the code acknowledges this:

    "XXX Can we be smarter if we have an MCV list for just one side?"

    But it concludes

    "It seems that if we assume equal distribution for the other side, we
    end up with the same answer anyway."

    I don't think that is the case. Preferring a known value, where one
    exists, would provide a better estimate of the actual range of the data.
    Indeed, the var_eq_non_const in the same file (used by the nested loop
    join estimator) does essentially that.

    - Clem Dickey
  • Robert Haas at Aug 3, 2011 at 1:29 pm

    On Fri, Jul 8, 2011 at 9:33 PM, Clem Dickey wrote:
    a. The Join cost estimators could have been given more information

    The functions which estimate JOIN selectivity (e.g. the chance that tuples
    will match in an equijoin, for instance) use data produced by ANALYZE. But
    the SELECT .. GROUP BY does not propagate ANALYZE data from the columns of
    its input relation to its output relation. That is too bad, because the
    column value statistics (number of unique values) would have improved
    selectivity estimates for all three join plans (merge join, nested loop, and
    hash join).
    Yeah, I've had this same thought. In fact, I think that it would
    probably be an improvement to pass through not just the number of
    unique values but the MCVs and frequencies of the non-GROUP-BY
    columns. Of course, for the grouping columns, we ought to let
    n_distinct = -1 pop out. Granted, the GROUP BY might totally change
    the data distribution, so relying on the input column statistics to be
    meaningful could be totally wrong, but on average it seems more likely
    to give a useful answer than a blind stab in the dark. I haven't
    gotten around to doing anything about this, but it seems like a good
    idea.
    b. the Merge Join cost estimator did a poor job with the data it was given:

    In function eqjoinsel_inner there are two cases (1) ANALYZE data is
    available for both sides of the join and (2) ANALYZE data is missing for one
    or both sides. Due to the GROUP BY processing described above, ANALYZE data
    was available for "t" but not for "SELECT * FROM t GROUP BY ...".

    The logic in that case is "use the column with the most distinct values" to
    estimate selectivity. The default number of distinct values for a column
    with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number of values
    was:

    col  in GROUP BY   in table t
    j      200            1
    k      200            1
    x      200           10
    y      200         1000
    z      200           30

    In 4 of the 5 columns the default value had more distinct values, and the
    combined selectivity (chance that two arbitrary rows would have a join
    match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that the code
    does not distinguish known numbers from default numbers. A comment in the
    code acknowledges this:

    "XXX Can we be smarter if we have an MCV list for just one side?"

    But it concludes

    "It seems that if we assume equal distribution for the other side, we end up
    with the same answer anyway."

    I don't think that is the case. Preferring a known value, where one exists,
    would provide a better estimate of the actual range of the data. Indeed, the
    var_eq_non_const in the same file (used by the nested loop join estimator)
    does essentially that.
    I'm not sure I understand what you're getting at here, unless the idea
    is to make get_variable_numdistinct() somehow indicate to the caller
    whether it had to punt. That might be worth doing.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Clem Dickey at Aug 4, 2011 at 1:55 am

    On 08/03/2011 06:29 AM, Robert Haas wrote:
    b. the Merge Join cost estimator did a poor job with the data it was given:

    In function eqjoinsel_inner there are two cases (1) ANALYZE data is
    available for both sides of the join and (2) ANALYZE data is missing for one
    or both sides. Due to the GROUP BY processing described above, ANALYZE data
    was available for "t" but not for "SELECT * FROM t GROUP BY ...".

    The logic in that case is "use the column with the most distinct values" to
    estimate selectivity. The default number of distinct values for a column
    with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number of values
    was:

    col in GROUP BY in table t
    j 200 1
    k 200 1
    x 200 10
    y 200 1000
    z 200 30

    In 4 of the 5 columns the default value had more distinct values, and the
    combined selectivity (chance that two arbitrary rows would have a join
    match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that the code
    does not distinguish known numbers from default numbers. A comment in the
    code acknowledges this:
    I'm not sure I understand what you're getting at here, unless the idea
    is to make get_variable_numdistinct() somehow indicate to the caller
    whether it had to punt. That might be worth doing.
    Yes, the first step is to make "punt" a separate indicator. The second
    would be to make good use of that indicator. As it is now, with "punt"
    being a possible data value, there two types of errors:

    False negative (code treats DEFAULT_NUM_DISTINCT as ordinary case when
    it is special):

    I wanted eqjoinsel_inner() to treat "punt" specially: to use the value
    from the known side of the JOIN when the other side is unknown. The
    current behavior, although not ideal, is the expected use of a default
    value.

    False positive (code treats DEFAULT_NUM_DISTINCT as special case when it
    is ordinary):

    eqjoinsel_semi() and estimate_hash_bucketsize() treat
    DEFAULT_NUM_DISTINCT specially. This behavior is less defensible than
    false positive, since a valid numeric value is being re-used as a flag.


    I suggest wrapping the value in a struct (to avoid accidental use) and
    using macros for read access.

    typedef struct {
    double value; // negative means "unknown"
    } num_distinct_t;

    #define IS_NUM_DISTINCT_DEFINED(nd) ((nd).value >= 0)
    #define NUM_DISTINCT_VALUE(nd) ((nd).value)

    - Clem Dickey

    P.S. Congratulations on displacing MySQL in Mac OS X Lion Server.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedJul 6, '11 at 2:38a
activeAug 4, '11 at 1:55a
posts5
users2
websitepostgresql.org
irc#postgresql

2 users in discussion

Clem Dickey: 4 posts Robert Haas: 1 post

People

Translate

site design / logo © 2022 Grokbase