Originally, I posted to -general but I found some time to write some
samples, and realized it's probably more of a performance question.

The original post is here:
http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php

I was hoping that somebody could help me understand the differences
between three plans.
All of the plans are updating a table using a second table, and should
be logically equivalent.
Two of the plans use joins, and one uses an exists subquery.
One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
this plan which has really awful performance.
Clearly it is due to the nested loop, but why would the planner choose
that approach?

I also don't understand why in the 'exists' plan the planner thinks
the index scan will provide 1019978 rows, when there are only 1000000,
but that is a lesser issue.

Here is a sample SQL file which demonstrates the issues and includes
all three variations.

begin;
create temporary table t7 (
i BIGINT NOT NULL,
k BIGINT
);

create temporary table t8 (
i BIGINT NOT NULL,
j INT
);

CREATE FUNCTION populate_t8()
RETURNS VOID
LANGUAGE SQL
AS
$$
truncate t8;
insert into t8
SELECT i, 1 from t7
ORDER BY i LIMIT 10000;

insert into t8
SELECT i, 2 from t7
WHERE i > 10000
ORDER BY i LIMIT 10000;

SELECT i, 3 from t7
WHERE i > 20000
ORDER BY i LIMIT 20000;

analyze t8;
$$;

INSERT INTO t7
select x, x + 10 from generate_series(1,1000000) as x ;
analyze t7;

select populate_t8();

explain analyze verbose
update
t7
SET
k = 1
FROM
t8
WHERE
t7.i = t8.i
AND
(
t8.j = 2
OR
t8.j = 1
);

select populate_t8();

explain analyze verbose
update
t7
SET
k = 1
WHERE
EXISTS (
SELECT 1 FROM t8
WHERE t8.i = t7.i
AND
(
t8.j = 2
OR
t8.j = 1
)
);

select populate_t8();

explain
update
t7
SET
k = 1
FROM
t8
WHERE
ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i)
AND
(
t8.j = 2
OR
t8.j = 1
);

explain analyze verbose
update
t7
SET
k = 1
FROM
t8
WHERE
ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i)
AND
(
t8.j = 2
OR
t8.j = 1
);

rollback;





--
Jon

Search Discussions

  • Tom Lane at Mar 7, 2011 at 8:00 pm

    Jon Nelson writes:
    I was hoping that somebody could help me understand the differences
    between three plans.
    All of the plans are updating a table using a second table, and should
    be logically equivalent.
    Two of the plans use joins, and one uses an exists subquery.
    One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
    this plan which has really awful performance.
    Clearly it is due to the nested loop, but why would the planner choose
    that approach?
    IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be
    an indexqual, merge join qual, or hash join qual. So it's not
    surprising that you get a sucky plan for it. Possibly somebody will
    work on improving that someday.

    As for your other questions, what PG version are you using? Because I
    do get pretty much the same plan (modulo a plain join versus a semijoin)
    for the first two queries, when using 9.0 or later. And the results of
    ANALYZE are only approximate, so you shouldn't be surprised at all if a
    rowcount estimate is off by a couple percent. Most of the time, you
    should be happy if it's within a factor of 2 of reality.

    regards, tom lane
  • Jon Nelson at Mar 7, 2011 at 8:07 pm

    On Mon, Mar 7, 2011 at 2:00 PM, Tom Lane wrote:
    Jon Nelson <jnelson+pgsql@jamponi.net> writes:
    I was hoping that somebody could help me understand the differences
    between three plans.
    All of the plans are updating a table using a second table, and should
    be logically equivalent.
    Two of the plans use joins, and one uses an exists subquery.
    One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
    this plan which has really awful performance.
    Clearly it is due to the nested loop, but why would the planner choose
    that approach?
    IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be
    an indexqual, merge join qual, or hash join qual.  So it's not
    surprising that you get a sucky plan for it.  Possibly somebody will
    work on improving that someday.

    As for your other questions, what PG version are you using?  Because I
    do get pretty much the same plan (modulo a plain join versus a semijoin)
    for the first two queries, when using 9.0 or later.  And the results of
    ANALYZE are only approximate, so you shouldn't be surprised at all if a
    rowcount estimate is off by a couple percent.  Most of the time, you
    should be happy if it's within a factor of 2 of reality.
    Sorry - I had stated in the original post that I was using 8.4.5 on 64
    bit openSUSE and CentOS 5.5, and had forgotten to carry that
    information over into the second post.

    What is the difference between a plain join and a semi join?

    --
    Jon
  • Kevin Grittner at Mar 7, 2011 at 8:24 pm

    Jon Nelson wrote:

    What is the difference between a plain join and a semi join?
    As soon as a semi join finds a match it stops looking for more.

    -Kevin
  • Merlin Moncure at Mar 7, 2011 at 8:05 pm

    On Mon, Mar 7, 2011 at 1:07 PM, Jon Nelson wrote:
    Originally, I posted to -general but I found some time to write some
    samples, and realized it's probably more of a performance question.

    The original post is here:
    http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php

    I was hoping that somebody could help me understand the differences
    between three plans.
    All of the plans are updating a table using a second table, and should
    be logically equivalent.
    Two of the plans use joins, and one uses an exists subquery.
    One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
    this plan which has really awful performance.
    The problem is really coming from SQL: it requires row wise
    comparisons to be of all fields in left to right order and the fact
    that you can't match NULL to NULL with =.

    If you have a table with a,b,c, (1,1,NULL) is not distinct from (1,2,3) becomes:
    Filter: ((NOT (a IS DISTINCT FROM 1)) AND (NOT (b IS DISTINCT FROM 1))
    AND (NOT (c IS DISTINCT FROM NULL::integer)))

    At present postgresql does not have the facilities to turn that into
    an index lookup. SQL doesn't allow the way you'd want to write this
    the way you'd really like to:

    select * from v where (a,b,c) = (1,1,NULL);

    because the comparison can't be applied from a row to another row but
    only between the member fields. You can cheat the system, but only if
    you reserve a special index for that purpose:

    create table v(a int, b int, c int);
    create index on v(v);
    select * from v where v = (1,1, NULL) will match as 'is not distinct
    from' does, using the index. This is because composite type
    comparison (as opposed to its fields) follows a different code path.
    Confused yet? You can also use the above trick with a type if you are
    not comparing all fields of 'v':

    create type foo(a int, b int);
    create index on v(((a,b)::foo));
    select * from v where (a,b)::foo = (1,1);

    will get you field subset comparison with index. Note if you do the
    above, the index can only match on the entire composite, not
    particular fields...

    merlin

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedMar 7, '11 at 7:15p
activeMar 7, '11 at 8:24p
posts5
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase