Hi,

I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
subqueries:

DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM bar
cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type != 'o');

The plan produced for this is:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using foo_type_index on foo (cost=17851.93..1271633830.75
rows=66410 width=6)
Index Cond: (type = 'o'::bpchar)
Filter: ((NOT (subplan)) AND (NOT (subplan)))
SubPlan
-> Materialize (cost=6077.87..10238.57 rows=299170 width=8)
-> Seq Scan on bar cqc (cost=0.00..4609.70 rows=299170 width=8)
-> Materialize (cost=11774.06..15728.45 rows=284339 width=8)
-> Seq Scan on foo car (cost=0.00..10378.73 rows=284339 width=8)
Filter: (type <> 'o'::bpchar)
(9 rows)


Unfortunately, when these tables get large-ish, the materilzations get
really expensive to re-scan for every tuple (see cost above). At the
moment, I have ~500k rows in foo and ~300k rows in bar. The
selectivity of type = 'o' is ~50%. I've tried to re-write the query as
follows:

DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as b FROM
(SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM foo
cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND cqar2.type
!= 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER JOIN bar ON
(candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS NULL AND
bar.b IS NULL);

This gives the more sensible plan:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Hash IN Join (cost=48999.81..71174.41 rows=66410 width=6)
Hash Cond: (foo.b = cqar1.b)
-> Seq Scan on foo (cost=0.00..9003.78 rows=549978 width=14)
-> Hash (cost=47909.68..47909.68 rows=66410 width=8)
-> Hash Left Join (cost=24562.29..47909.68 rows=66410 width=8)
Hash Cond: (cqar1.b = bar.b)
Filter: (bar.b IS NULL)
-> Hash Left Join (cost=15043.96..33635.58 rows=132820 width=8)
Hash Cond: (cqar1.b = cqar2.b)
Filter: (cqar2.b IS NULL)
-> Seq Scan on foo cqar1 (cost=0.00..10378.73
rows=265639 width=8)
Filter: (type = 'o'::bpchar)
-> Hash (cost=10378.73..10378.73 rows=284339 width=8)
-> Seq Scan on foo cqar2
(cost=0.00..10378.73 rows=284339 width=8)
Filter: (type <> 'o'::bpchar)
-> Hash (cost=4609.70..4609.70 rows=299170 width=8)
-> Seq Scan on bar (cost=0.00..4609.70
rows=299170 width=8)
(17 rows)


As far as I can tell, the results are identical.

My questions

1. Is my rewrite valid?
2. Any way to reliably achieve the second plan (or really, any plan
that doesn't rescan ~~500k tuples per each of ~250k tuples) by
tweaking (per-session) planner constants? I've played with this a
little, but without much success. As with any rewrite situation, I'd
prefer to stick with the simpler, more explicit original query.

Here is a SQL script to reproduce the problem:

\set ON_ERROR_STOP

drop schema if exists not_in_test cascade;
create schema not_in_test;

set search_path to not_in_test;

create table foo (
a oid not null,
b bigint not null,
type char not null,
ts timestamp without time zone not null
);
create index "foo_b_a_type_index" on foo (b, a, type);
create index "foo_a_index" on foo (a);
create index "foo_type_index" on foo(type);

create table bar (
b bigint unique not null,
c timestamp with time zone not null
);
create index "bar_b_index" on bar(b);

insert into foo select (random()*10)::integer,
generate_series(1,550000), case when random() > 0.5 then 'o' else 'x'
end, now();
insert into bar select val, now() from generate_series(1,1200000) as
vals(val) where random() > 0.75;

analyze foo;
analyze bar;

EXPLAIN DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b
FROM bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type
!= 'o');
EXPLAIN DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as
b FROM (SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM
foo cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND
cqar2.type != 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER
JOIN bar ON (candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS
NULL AND bar.b IS NULL);


Thanks,
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

Search Discussions

  • Kevin Grittner at Aug 2, 2010 at 7:29 pm

    Maciek Sakrejda wrote:

    DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM
    bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type !=
    'o');
    Can "b" be null in any of these tables? If not, then you can
    rewrite your query to us NOT EXISTS and have the same semantics.
    That will often be much faster. Something like:

    DELETE FROM foo
    WHERE type = 'o'
    AND NOT EXISTS (SELECT * FROM bar cqc where cqc.b = foo.b)
    AND NOT EXISTS (SELECT * FROM foo car WHERE car.b = foo.b
    AND car.type <> 'o');

    -Kevin
  • Maciek Sakrejda at Aug 2, 2010 at 7:42 pm

    Can "b" be null in any of these tables? If not, then you can
    rewrite your query to us NOT EXISTS and have the same semantics.
    That will often be much faster.
    Thanks, Kevin.

    No NULLs. It looks like it's a good deal slower than the LOJ version,
    but a good deal faster than the original. Since the equivalence of
    semantics is much easier to verify here, we may go with this (at least
    for the moment).
    ---
    Maciek Sakrejda | System Architect | Truviso

    1065 E. Hillsdale Blvd., Suite 230
    Foster City, CA 94404
    (650) 242-3500 Main
    www.truviso.com
  • Kevin Grittner at Aug 2, 2010 at 7:50 pm

    Maciek Sakrejda wrote:

    No NULLs. It looks like it's a good deal slower than the LOJ
    version, but a good deal faster than the original.
    On 8.4 and later the NOT EXISTS I suggested is a bit faster than
    your fast version, since Tom did some very nice work in this area,
    implementing semi join and anti join. If you've got much load with
    this kind of query, it might be worth upgrading.

    -Kevin
  • Dave Crooke at Aug 2, 2010 at 8:15 pm
    With Oracle, I've found an anti-union (MINUS in Oracle, EXCEPT in PGSQL) to
    be often a bit better than an anti-join, which is in turn faster than NOT
    IN. Depends of course on row distribution and index layouts, and a bunch of
    other details.

    Depending on what you're returning, it can pay to make sure this computation
    is done with the shortest possible rows, if necessary using a subquery.

    Cheers
    Dave

    On Mon, Aug 2, 2010 at 2:49 PM, Kevin Grittner
    wrote:
    Maciek Sakrejda wrote:
    No NULLs. It looks like it's a good deal slower than the LOJ
    version, but a good deal faster than the original.
    On 8.4 and later the NOT EXISTS I suggested is a bit faster than
    your fast version, since Tom did some very nice work in this area,
    implementing semi join and anti join. If you've got much load with
    this kind of query, it might be worth upgrading.

    -Kevin

    --
    Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-performance
  • Kevin Grittner at Aug 2, 2010 at 9:03 pm

    Dave Crooke wrote:

    With Oracle, I've found an anti-union (MINUS in Oracle, EXCEPT in
    PGSQL) to be often a bit better than an anti-join, which is in
    turn faster than NOT IN. Depends of course on row distribution and
    index layouts, and a bunch of other details.
    I found that assertion intriguing, so I tested the "fast" query from
    the original post against my suggestion and a version using EXCEPT.
    (This was against the development HEAD, not any release.)

    OP "fast": 32.9 seconds
    NOT EXISTS: 11.2 seconds
    EXCEPT: 7.7 seconds

    That last was using this query, which just might work OK on 8.3:

    DELETE FROM foo
    where foo.b in (
    select b from foo WHERE type = 'o'
    except SELECT b FROM bar
    except SELECT b FROM foo where type <> 'o');

    I wonder whether this could make a reasonable alternative plan for
    the optmizer to consider some day....

    -Kevin
  • Kevin Grittner at Aug 2, 2010 at 9:24 pm

    "Kevin Grittner" wrote:

    DELETE FROM foo
    where foo.b in (
    select b from foo WHERE type = 'o'
    except SELECT b FROM bar
    except SELECT b FROM foo where type <> 'o');
    Oops. Maybe before I get excited I should try it with a query which
    is actually logically equivalent. :-(

    -Kevin
  • Kevin Grittner at Aug 2, 2010 at 9:38 pm

    Kevin Grittner wrote:

    Maybe before I get excited I should try it with a query which is
    actually logically equivalent.
    Fixed version:

    DELETE FROM foo
    where type = 'o' and foo.b in (
    select b from foo WHERE type = 'o'
    except SELECT b FROM bar
    except SELECT b FROM foo where type <> 'o');

    The change didn't affect run time significantly; it still beats the
    others.

    -Kevin
  • Maciek Sakrejda at Aug 2, 2010 at 10:37 pm

    Maybe before I get excited I should try it with a query which is
    actually logically equivalent.
    Yes, the joys of manual rewrites...
    Fixed version:

    DELETE FROM foo
    where type = 'o' and foo.b in (
    select b from foo WHERE type = 'o'
    except SELECT b FROM bar
    except SELECT b FROM foo where type <> 'o');

    The change didn't affect run time significantly; it still beats the
    others.
    On my 8.3, it still performs a little worse than your original
    correlated EXCEPT (which is actually on par with the antijoin in 8.3,
    but significantly better in 8.4). In 8.4, this EXCEPT version does
    seem somewhat better.

    It looks like according to Andres, though, I should not be depending
    on these plans with 8.3, so I may want to stick with the manual
    antijoin.

    ---
    Maciek Sakrejda | System Architect | Truviso

    1065 E. Hillsdale Blvd., Suite 215
    Foster City, CA 94404
    (650) 242-3500 Main
    www.truviso.com
  • Andres Freund at Aug 2, 2010 at 7:53 pm
    Hi,
    On Mon, Aug 02, 2010 at 12:12:51PM -0700, Maciek Sakrejda wrote:
    I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
    subqueries:
    With 8.3 you will have to use manual antijoins (i.e LEFT JOIN
    ... WHERE NULL). If you use 8.4 NOT EXISTS() will do that
    automatically in many cases (contrary to NOT IN () which has strange
    NULL semantics).

    Anbdres
  • Maciek Sakrejda at Aug 2, 2010 at 8:06 pm
    All fields involved are declared NOT NULL, but thanks for the heads up.
    ---
    Maciek Sakrejda | System Architect | Truviso

    1065 E. Hillsdale Blvd., Suite 215
    Foster City, CA 94404
    (650) 242-3500 Main
    www.truviso.com
  • Andres Freund at Aug 2, 2010 at 8:22 pm

    On Mon, Aug 02, 2010 at 01:06:00PM -0700, Maciek Sakrejda wrote:
    All fields involved are declared NOT NULL, but thanks for the heads up.
    Afair the planner doesnt use that atm.

    Andres
  • Maciek Sakrejda at Aug 2, 2010 at 9:02 pm

    All fields involved are declared NOT NULL, but thanks for the heads up.
    Afair the planner doesnt use that atm.
    I was referring to not having to care about the strange NULL semantics
    (as per your original comment), since I have no NULLs. Given that, I
    think the NOT EXISTS could be a good solution, even on 8.3 (we're
    planning to upgrade, but it's not a feasible solution to this
    particular problem), no?

    Basically, it seems like the main issue with the current plans is the
    per-tuple seq scans on the full materializations. Adding correlation
    (by rewriting NOT IN as NOT EXISTS) prevents materialization, hence
    getting rid of the biggest performance problem.

    Thanks,
    ---
    Maciek Sakrejda | System Architect | Truviso

    1065 E. Hillsdale Blvd., Suite 215
    Foster City, CA 94404
    (650) 242-3500 Main
    www.truviso.com
  • Andres Freund at Aug 2, 2010 at 10:19 pm

    On Mon, Aug 02, 2010 at 01:35:13PM -0700, Maciek Sakrejda wrote:
    All fields involved are declared NOT NULL, but thanks for the heads up.
    Afair the planner doesnt use that atm.
    I was referring to not having to care about the strange NULL semantics
    (as per your original comment), since I have no NULLs. Given that, I
    think the NOT EXISTS could be a good solution, even on 8.3 (we're
    planning to upgrade, but it's not a feasible solution to this
    particular problem), no?
    The point is that only 8.4 will optimize that case properly. 8.3 will
    generate plans which are inefficient in many (or most) cases for both
    variants. I would suggest to use manual antijoins...

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedAug 2, '10 at 7:13p
activeAug 2, '10 at 10:37p
posts14
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase