I have two identical databases that run the same query each morning. Starting this morning, the first db began using a different execution plan for the query, resulting in much worse performance. I have tried several things this morning, but I am currently stumped as to what could be causing the different execution plans.

The query and the results of EXPLAIN ANALYZE on the two dbs:

db1=> explain analyze
select
t1.bn,
t2.mu,
t1.nm,
t1.root,
t1.suffix,
t1.type
from
t1,
t2
where
t2.eff_dt = current_date
and t1.active = true
and t1.bn = t2.sn;

The slower plan used on db1:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=145.12..38799.61 rows=7876 width=47) (actual time=6.494..352.166 rows=8437 loops=1)
  ->  Bitmap Heap Scan on t2  (cost=145.12..19464.74 rows=10898 width=22) (actual time=6.472..22.684 rows=12204 loops=1)
        Recheck Cond: (eff_dt = ('now'::text)::date)
        ->  Bitmap Index Scan on t2_nu1  (cost=0.00..142.40 rows=10898 width=0) (actual time=4.013..4.013 rows=24482 loops=1)
              Index Cond: (eff_dt = ('now'::text)::date)
  ->  Index Scan using t1_uc2 on t1  (cost=0.00..1.76 rows=1 width=32) (actual time=0.012..0.026 rows=1 loops=12204)
        Index Cond: ((t1.bn)::text = (t2.sn)::text)
        Filter: active
Total runtime: 353.629 ms
(9 rows)

Time: 354.795 ms


And the faster plan from db2:


QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join  (cost=21371.63..21720.78 rows=7270 width=47) (actual time=60.412..80.865 rows=8437 loops=1)
  Merge Cond: ("outer"."?column6?" = "inner"."?column3?")
  ->  Sort  (cost=8988.56..9100.55 rows=44794 width=32) (actual time=30.685..33.370 rows=8438 loops=1)
        Sort Key: (t1.bn)::text
        ->  Seq Scan on t1  (cost=0.00..5528.00 rows=44794 width=32) (actual time=0.008..18.280 rows=8439 loops=1)
              Filter: active
  ->  Sort  (cost=12383.07..12409.32 rows=10500 width=22) (actual time=29.718..33.515 rows=12204 loops=1)
        Sort Key: (t2.sn)::text
        ->  Index Scan using t2_nu1 on t2  (cost=0.00..11681.77 rows=10500 width=22) (actual time=0.052..13.295 rows=12204 loops=1)
              Index Cond: (eff_dt = ('now'::text)::date)
Total runtime: 83.385 ms
(11 rows)

t2.eff_dt is defined as a date, t1.active is a boolean, all other fields are varchar. Table t1 has a unique index (uc2) on field bn and a second unique index (uc3) on fields (root, suffix). Table t2 has a unique index (uc1) on (sn, eff_dt), and a non-unique index (nu1) on eff_dt.
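For reference, the schema described above can be sketched as DDL. This is a reconstruction from the prose, not the actual definitions: the varchar lengths and the index names (beyond the uc2/uc3/uc1/nu1 suffixes given) are assumptions.

```sql
-- Sketch of the schema as described; column widths are assumptions.
create table t1 (
    bn      varchar(20) not null,
    nm      varchar(40),
    root    varchar(20),
    suffix  varchar(10),
    type    varchar(10),
    active  boolean not null
);
create unique index t1_uc2 on t1 (bn);
create unique index t1_uc3 on t1 (root, suffix);

create table t2 (
    sn      varchar(20) not null,
    mu      varchar(20),
    eff_dt  date not null
);
create unique index t2_uc1 on t2 (sn, eff_dt);
create index t2_nu1 on t2 (eff_dt);
```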

Table t1 has 12204 rows. Table t2 has 7.1M rows, 12204 of which have eff_dt = current_date.

Both databases have autovacuum turned on, and both have been vacuumed and analyzed in the last 24 hours.

Any ideas as to what could cause the first db to opt for the slower nested loop rather than the merge join?

Thanks in advance.


  • Tom Lane at Sep 29, 2008 at 4:42 pm

    Doug Eck writes:
    Any ideas as to what could cause the first db to opt for the slower nested loop rather than the merge join?
    Not from the information given. Presumably db1 thinks that the
    mergejoin plan would be slower, but why it thinks that isn't clear yet.
    Try setting enable_nestloop = off (and enable_hashjoin = off if it then
    wants a hashjoin) and then post the EXPLAIN ANALYZE results.

    regards, tom lane
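Tom's suggestion amounts to disabling plan types for the current session only. A sketch of the procedure (the SET values revert when the session ends, or on RESET):

```sql
-- Session-local planner toggles: these heavily penalize the named
-- plan type in the cost model rather than removing it outright.
set enable_nestloop = off;
-- re-run EXPLAIN ANALYZE on the query here

-- If the planner now switches to a hash join, disable that too:
set enable_hashjoin = off;
-- re-run EXPLAIN ANALYZE; the merge join should now be chosen

-- Restore the defaults afterwards:
reset enable_nestloop;
reset enable_hashjoin;
```

The point of the exercise is diagnostic: forcing the merge join exposes the planner's cost estimates for it, which can then be compared against the fast server's.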
  • Doug Eck at Sep 29, 2008 at 6:18 pm
    Setting enable_nestloop = off did result in a hash join, so I also set enable_hashjoin = off.

    The new plan from the slower db:
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------
    Merge Join  (cost=20195.54..46442.99 rows=7876 width=47) (actual time=136.531..478.708 rows=8437 loops=1)
      Merge Cond: ((t1.bn)::text = "inner"."?column3?")
      ->  Index Scan using t1_uc2 on t1  (cost=0.00..25604.74 rows=204906 width=32) (actual time=0.061..327.285 rows=8438 loops=1)
            Filter: active
      ->  Sort  (cost=20195.54..20222.79 rows=10898 width=22) (actual time=136.461..138.621 rows=12204 loops=1)
            Sort Key: (t2.sn)::text
            ->  Bitmap Heap Scan on t2  (cost=145.12..19464.74 rows=10898 width=22) (actual time=7.580..120.144 rows=12204 loops=1)
                  Recheck Cond: (eff_dt = ('now'::text)::date)
                  ->  Bitmap Index Scan on t2_nu1  (cost=0.00..142.40 rows=10898 width=0) (actual time=4.964..4.964 rows=24483 loops=1)
                        Index Cond: (eff_dt = ('now'::text)::date)
    Total runtime: 480.344 ms
    (11 rows)

    And the faster one:

    QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------------------------------------
    Merge Join  (cost=21371.63..21720.78 rows=7270 width=47) (actual time=60.435..80.604 rows=8437 loops=1)
      Merge Cond: ("outer"."?column6?" = "inner"."?column3?")
      ->  Sort  (cost=8988.56..9100.55 rows=44794 width=32) (actual time=30.498..33.093 rows=8438 loops=1)
            Sort Key: (t1.bn)::text
            ->  Seq Scan on t1  (cost=0.00..5528.00 rows=44794 width=32) (actual time=0.010..17.950 rows=8439 loops=1)
                  Filter: active
      ->  Sort  (cost=12383.07..12409.32 rows=10500 width=22) (actual time=29.928..33.658 rows=12204 loops=1)
            Sort Key: (t2.sn)::text
            ->  Index Scan using t2_nu1 on t2  (cost=0.00..11681.77 rows=10500 width=22) (actual time=0.062..13.356 rows=12204 loops=1)
                  Index Cond: (eff_dt = ('now'::text)::date)
    Total runtime: 83.054 ms
    (11 rows)

    And the query again:

    explain analyze
    select
    t1.bn,
    t2.mu,
    t1.nm,
    t1.root,
    t1.suffix,
    t1.type
    from
    t1,
    t2
    where
    t2.eff_dt = current_date
    and t1.active = true
    and t1.bn = t2.sn;

    Thanks.



  • Tom Lane at Sep 29, 2008 at 11:20 pm

    Doug Eck writes:
    The new plan from the slower db:
    -> Index Scan using t1_uc2 on t1 (cost=0.00..25604.74 rows=204906 width=32) (actual time=0.061..327.285 rows=8438 loops=1)
    Filter: active
    This seems a bit fishy. In the first place, with such a simple filter
    condition it shouldn't be that far off on the rowcount estimate. In
    the second place, the cost estimate is more than twice what the other
    server estimates to do a seqscan and sort of the same data, and the
    rowcount estimate is five times as much. So there's something really
    significantly different about the t1 tables in the two cases.

    The first thing you ought to do is to look at the pg_class.relpages
    and reltuples entries for t1 in both databases. What I am suspecting is
    that for some reason the "slow" db has suffered a lot of bloat in that
    table, leading to a corresponding increase in the cost of a seqscan.
    If so, a VACUUM FULL or CLUSTER should fix it, though you'll next need
    to look into why routine vacuumings weren't happening. (It looks like
    t2 may be a bit bloated as well.)

    If that's not it, we'll need to probe deeper ...

    regards, tom lane
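The pg_class check Tom describes can be done with a query along these lines, run on both databases and compared side by side:

```sql
-- Planner statistics for the two tables. A relpages count far larger
-- than the live row data could plausibly occupy (roughly
-- reltuples * row width / 8kB block size) points to table bloat.
select relname, relpages, reltuples
from pg_class
where relname in ('t1', 't2');
```

Note that relpages and reltuples are estimates refreshed by VACUUM and ANALYZE, not live counters, so they reflect the statistics the planner is actually costing with.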
  • Doug Eck at Sep 30, 2008 at 12:17 am
    Tom,

    You nailed it. The t1 table was using 9600 relpages versus 410 after the vacuum full. The two databases are now showing similar execution plans and times.

    Thanks for your help. It is greatly appreciated.

    Doug Eck
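For anyone hitting the same symptom, the fix applied here was roughly the following. The CLUSTER alternative needs an index to order by; using t1_uc2 is an assumption for illustration.

```sql
-- Rewrite the table to reclaim dead space, then refresh statistics.
-- Note: VACUUM FULL takes an exclusive lock on the table.
vacuum full verbose t1;
analyze t1;

-- Alternatively, CLUSTER rewrites the table in index order:
-- cluster t1 using t1_uc2;
```

With the table shrunk from 9600 pages to 410, the seqscan-and-sort path is cheap again and both databases converge on the merge join.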




Discussion Overview

Group: pgsql-performance @ postgresql.org
Posted: Sep 29, 2008 at 4:00pm
Active: Sep 30, 2008 at 12:17am
Posts: 5 by 2 users (Doug Eck: 3, Tom Lane: 2)
