Hi,

I have a quite complex, performance sensitive query in a system with a
few (7) joins:
select .... from t1 left join t2 .... WHERE id IN (select ....)

For this query the planner evaluates the IN with a hash semi join last,
and all the joining is done by hash joins for all rows contained in t1.

However when I specify the ids manually (IN (1, 2, 3, 4, 5) the
planner first does an index lookup on the primary key column id,
and subsequently does nested loop joins using an index on t2 - which
gives way better results.

Is there any way to guide the planner to evaluate the IN condition
first, instead of last?
Why is the planner behaving this way? (postgresql 8.4.??)

Thank you in advance, Clemens


Query plan with IN(select):

Sort (cost=165.77..165.77 rows=2 width=16974) (actual
time=13.459..13.460 rows=2 loops=1)
Sort Key: this_.id
Sort Method: quicksort Memory: 26kB
-> Hash Semi Join (cost=123.09..165.76 rows=2 width=16974)
(actual time=12.741..13.432 rows=2 loops=1)
Hash Cond: (this_.id = kladdenent0_.id)
-> Hash Left Join (cost=119.17..160.90 rows=348
width=16974) (actual time=8.765..13.104 rows=342 loops=1)
Hash Cond: (flugzeug2_.flugzeugtyp_id = flugzeugty3_.id)
-> Hash Left Join (cost=118.10..155.08 rows=348
width=16454) (actual time=8.724..12.412 rows=342 loops=1)
Hash Cond: (flugzeug2_.zaehlertyp_id = bmintype4_.id)
-> Hash Left Join (cost=117.06..152.71 rows=348
width=15934) (actual time=8.660..11.786 rows=342 loops=1)
Hash Cond: (this_.lehrerid = pilot5_.id)
-> Hash Left Join (cost=96.66..130.46
rows=348 width=8912) (actual time=6.395..8.899 rows=342 loops=1)
Hash Cond: (this_.nachid = flugplatz6_.id)
-> Hash Left Join
(cost=93.89..122.90 rows=348 width=8370) (actual time=6.354..8.429
rows=342 loops=1)
Hash Cond: (this_.flugzeugid =
flugzeug2_.id)
-> Hash Left Join
(cost=23.17..47.04 rows=348 width=7681) (actual time=1.992..3.374
rows=342 loops=1)
Hash Cond: (this_.pilotid
= pilot7_.id)
-> Hash Left Join
(cost=2.78..22.04 rows=348 width=659) (actual time=0.044..0.548
rows=342 loops=1)
Hash Cond:
(this_.vonid = flugplatz8_.id)
-> Seq Scan on
startkladde this_ (cost=0.00..14.48 rows=348 width=117) (actual
time=0.004..0.074 rows=342 loops=1)
-> Hash
(cost=1.79..1.79 rows=79 width=542) (actual time=0.032..0.032 rows=79
loops=1)
-> Seq Scan
on flugplatz flugplatz8_ (cost=0.00..1.79 rows=79 width=542) (actual
time=0.003..0.010 rows=79 loops=1)
-> Hash
(cost=15.73..15.73 rows=373 width=7022) (actual time=1.938..1.938
rows=375 loops=1)
-> Seq Scan on
pilot pilot7_ (cost=0.00..15.73 rows=373 width=7022) (actual
time=0.006..0.769 rows=375 loops=1)
-> Hash (cost=51.43..51.43
rows=1543 width=689) (actual time=4.351..4.351 rows=1543 loops=1)
-> Seq Scan on flugzeug
flugzeug2_ (cost=0.00..51.43 rows=1543 width=689) (actual
time=0.006..1.615 rows=1543 loops=1)
-> Hash (cost=1.79..1.79 rows=79
width=542) (actual time=0.031..0.031 rows=79 loops=1)
-> Seq Scan on flugplatz
flugplatz6_ (cost=0.00..1.79 rows=79 width=542) (actual
time=0.003..0.011 rows=79 loops=1)
-> Hash (cost=15.73..15.73 rows=373
width=7022) (actual time=2.236..2.236 rows=375 loops=1)
-> Seq Scan on pilot pilot5_
(cost=0.00..15.73 rows=373 width=7022) (actual time=0.005..0.781
rows=375 loops=1)
-> Hash (cost=1.02..1.02 rows=2 width=520)
(actual time=0.005..0.005 rows=2 loops=1)
-> Seq Scan on bmintype bmintype4_
(cost=0.00..1.02 rows=2 width=520) (actual time=0.003..0.004 rows=2
loops=1)
-> Hash (cost=1.03..1.03 rows=3 width=520) (actual
time=0.004..0.004 rows=3 loops=1)
-> Seq Scan on flugzeugtype flugzeugty3_
(cost=0.00..1.03 rows=3 width=520) (actual time=0.002..0.002 rows=3
loops=1)
-> Hash (cost=3.90..3.90 rows=2 width=4) (actual
time=0.239..0.239 rows=2 loops=1)
-> Limit (cost=0.00..3.88 rows=2 width=4) (actual
time=0.202..0.236 rows=2 loops=1)
-> Index Scan using startkladde_pkey on
startkladde kladdenent0_ (cost=0.00..56.24 rows=29 width=4) (actual
time=0.200..0.233 rows=2 loops=1)
Filter: ((status > 0) OR (id = (-1)))

Search Discussions

  • Tom Lane at May 16, 2011 at 11:22 pm

    Clemens Eisserer writes:
    I have a quite complex, performance sensitive query in a system with a
    few (7) joins:
    select .... from t1 left join t2 .... WHERE id IN (select ....)
    Does it work as expected with one less join? If so, try increasing
    join_collapse_limit ...

    regards, tom lane
  • Clemens Eisserer at May 17, 2011 at 7:39 am
    Hi,
    select .... from t1 left join t2 .... WHERE id IN (select ....)
    Does it work as expected with one less join?  If so, try increasing
    join_collapse_limit ...
    That did the trick - thanks a lot. I only had to increase
    join_collapse_limit a bit and now get an almost perfect plan.
    Instead of hash-joining all the data, the planner generates
    nested-loop-joins with index only on the few rows I fetch.

    Using = ANY(array(select... )) also seems to work, I wonder which one
    works better. Does ANY(ARRAY(...)) force the optimizer to plan the
    subquery seperated from the main query?

    Thanks, Clemens
  • Clemens Eisserer at May 18, 2011 at 8:46 am
    Hi,

    Does anybody know why the planner treats "= ANY(ARRAY(select ...))"
    differently than "IN(select ...)"?
    Which one is preferable, when I already have a lot of joins?

    Thanks, Clemens

    2011/5/17 Clemens Eisserer <linuxhippy@gmail.com>:
    Hi,
    select .... from t1 left join t2 .... WHERE id IN (select ....)
    Does it work as expected with one less join?  If so, try increasing
    join_collapse_limit ...
    That did the trick - thanks a lot. I only had to increase
    join_collapse_limit a bit and now get an almost perfect plan.
    Instead of hash-joining all the data, the planner generates
    nested-loop-joins with index only on the few rows I fetch.

    Using = ANY(array(select... )) also seems to work, I wonder which one
    works better. Does ANY(ARRAY(...)) force the optimizer to plan the
    subquery seperated from the main query?

    Thanks, Clemens
  • Dave Johansen at May 18, 2011 at 2:01 pm

    On Wed, May 18, 2011 at 1:46 AM, Clemens Eisserer wrote:

    Hi,

    Does anybody know why the planner treats "= ANY(ARRAY(select ...))"
    differently than "IN(select ...)"?
    Which one is preferable, when I already have a lot of joins?

    Thanks, Clemens

    2011/5/17 Clemens Eisserer <linuxhippy@gmail.com>:
    Hi,
    select .... from t1 left join t2 .... WHERE id IN (select ....)
    Does it work as expected with one less join? If so, try increasing
    join_collapse_limit ...
    That did the trick - thanks a lot. I only had to increase
    join_collapse_limit a bit and now get an almost perfect plan.
    Instead of hash-joining all the data, the planner generates
    nested-loop-joins with index only on the few rows I fetch.

    Using = ANY(array(select... )) also seems to work, I wonder which one
    works better. Does ANY(ARRAY(...)) force the optimizer to plan the
    subquery seperated from the main query?

    Thanks, Clemens
    --
    Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-performance

    I'm just a user so I don't have definitive knowledge of this, but my
    experience seems to indicate that the = ANY(ARRAY(SELECT ...)) does the
    select and turns it into an array and then uses that in the where clause in
    a manner similar to a hard coded list of values, like IN (1, 2, 3, ...). In
    theory, the planner could do the same sort of things with the IN (SELECT
    ...) but my experience seems to indicate that in some cases it decides not
    to use an index that it could.

    One specific example I know of is that at least in PostgreSQL 8.3, a view
    with a UNION/UNION ALL will push the = ANY(ARRAY(SELECT ...)) down into the
    two sub-queries, but the IN (SELECT ...) will be applied after the UNION
    ALL.

    Dave
  • Scott Carey at May 18, 2011 at 5:51 pm

    On 5/17/11 12:38 AM, "Clemens Eisserer" wrote:
    Hi,
    select .... from t1 left join t2 .... WHERE id IN (select ....)
    Does it work as expected with one less join? If so, try increasing
    join_collapse_limit ...
    That did the trick - thanks a lot. I only had to increase
    join_collapse_limit a bit and now get an almost perfect plan.
    Instead of hash-joining all the data, the planner generates
    nested-loop-joins with index only on the few rows I fetch.

    Using = ANY(array(select... )) also seems to work, I wonder which one
    works better. Does ANY(ARRAY(...)) force the optimizer to plan the
    subquery seperated from the main query?

    I'm not sure exactly what happens with ANY(ARRAY()).

    I am fairly confident that the planner simply transforms an IN(select ...)
    to a join, since they are equivalent.

    Because "foo IN (select ...)" is just a join, it counts towards
    join_collapse_limit.

    Thanks, Clemens

    --
    Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-performance
  • Dave Johansen at May 17, 2011 at 2:44 am

    On Mon, May 16, 2011 at 3:30 PM, Clemens Eisserer wrote:

    Hi,

    I have a quite complex, performance sensitive query in a system with a
    few (7) joins:
    select .... from t1 left join t2 .... WHERE id IN (select ....)

    For this query the planner evaluates the IN with a hash semi join last,
    and all the joining is done by hash joins for all rows contained in t1.

    However when I specify the ids manually (IN (1, 2, 3, 4, 5) the
    planner first does an index lookup on the primary key column id,
    and subsequently does nested loop joins using an index on t2 - which
    gives way better results.

    Is there any way to guide the planner to evaluate the IN condition
    first, instead of last?
    Why is the planner behaving this way? (postgresql 8.4.??)

    Thank you in advance, Clemens


    Query plan with IN(select):

    Sort (cost=165.77..165.77 rows=2 width=16974) (actual
    time=13.459..13.460 rows=2 loops=1)
    Sort Key: this_.id
    Sort Method: quicksort Memory: 26kB
    -> Hash Semi Join (cost=123.09..165.76 rows=2 width=16974)
    (actual time=12.741..13.432 rows=2 loops=1)
    Hash Cond: (this_.id = kladdenent0_.id)
    -> Hash Left Join (cost=119.17..160.90 rows=348
    width=16974) (actual time=8.765..13.104 rows=342 loops=1)
    Hash Cond: (flugzeug2_.flugzeugtyp_id = flugzeugty3_.id)
    -> Hash Left Join (cost=118.10..155.08 rows=348
    width=16454) (actual time=8.724..12.412 rows=342 loops=1)
    Hash Cond: (flugzeug2_.zaehlertyp_id = bmintype4_.id)
    -> Hash Left Join (cost=117.06..152.71 rows=348
    width=15934) (actual time=8.660..11.786 rows=342 loops=1)
    Hash Cond: (this_.lehrerid = pilot5_.id)
    -> Hash Left Join (cost=96.66..130.46
    rows=348 width=8912) (actual time=6.395..8.899 rows=342 loops=1)
    Hash Cond: (this_.nachid = flugplatz6_.id)
    -> Hash Left Join
    (cost=93.89..122.90 rows=348 width=8370) (actual time=6.354..8.429
    rows=342 loops=1)
    Hash Cond: (this_.flugzeugid =
    flugzeug2_.id)
    -> Hash Left Join
    (cost=23.17..47.04 rows=348 width=7681) (actual time=1.992..3.374
    rows=342 loops=1)
    Hash Cond: (this_.pilotid
    = pilot7_.id)
    -> Hash Left Join
    (cost=2.78..22.04 rows=348 width=659) (actual time=0.044..0.548
    rows=342 loops=1)
    Hash Cond:
    (this_.vonid = flugplatz8_.id)
    -> Seq Scan on
    startkladde this_ (cost=0.00..14.48 rows=348 width=117) (actual
    time=0.004..0.074 rows=342 loops=1)
    -> Hash
    (cost=1.79..1.79 rows=79 width=542) (actual time=0.032..0.032 rows=79
    loops=1)
    -> Seq Scan
    on flugplatz flugplatz8_ (cost=0.00..1.79 rows=79 width=542) (actual
    time=0.003..0.010 rows=79 loops=1)
    -> Hash
    (cost=15.73..15.73 rows=373 width=7022) (actual time=1.938..1.938
    rows=375 loops=1)
    -> Seq Scan on
    pilot pilot7_ (cost=0.00..15.73 rows=373 width=7022) (actual
    time=0.006..0.769 rows=375 loops=1)
    -> Hash (cost=51.43..51.43
    rows=1543 width=689) (actual time=4.351..4.351 rows=1543 loops=1)
    -> Seq Scan on flugzeug
    flugzeug2_ (cost=0.00..51.43 rows=1543 width=689) (actual
    time=0.006..1.615 rows=1543 loops=1)
    -> Hash (cost=1.79..1.79 rows=79
    width=542) (actual time=0.031..0.031 rows=79 loops=1)
    -> Seq Scan on flugplatz
    flugplatz6_ (cost=0.00..1.79 rows=79 width=542) (actual
    time=0.003..0.011 rows=79 loops=1)
    -> Hash (cost=15.73..15.73 rows=373
    width=7022) (actual time=2.236..2.236 rows=375 loops=1)
    -> Seq Scan on pilot pilot5_
    (cost=0.00..15.73 rows=373 width=7022) (actual time=0.005..0.781
    rows=375 loops=1)
    -> Hash (cost=1.02..1.02 rows=2 width=520)
    (actual time=0.005..0.005 rows=2 loops=1)
    -> Seq Scan on bmintype bmintype4_
    (cost=0.00..1.02 rows=2 width=520) (actual time=0.003..0.004 rows=2
    loops=1)
    -> Hash (cost=1.03..1.03 rows=3 width=520) (actual
    time=0.004..0.004 rows=3 loops=1)
    -> Seq Scan on flugzeugtype flugzeugty3_
    (cost=0.00..1.03 rows=3 width=520) (actual time=0.002..0.002 rows=3
    loops=1)
    -> Hash (cost=3.90..3.90 rows=2 width=4) (actual
    time=0.239..0.239 rows=2 loops=1)
    -> Limit (cost=0.00..3.88 rows=2 width=4) (actual
    time=0.202..0.236 rows=2 loops=1)
    -> Index Scan using startkladde_pkey on
    startkladde kladdenent0_ (cost=0.00..56.24 rows=29 width=4) (actual
    time=0.200..0.233 rows=2 loops=1)
    Filter: ((status > 0) OR (id = (-1)))

    --
    Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-performance
    In some cases, I've seen improved results when replacing the IN (...) with =
    ANY(ARRAY(...)).
    Dave

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedMay 16, '11 at 10:30p
activeMay 18, '11 at 5:51p
posts7
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase