Hi,

I have the following query, which has been running very slowly. After a lot of testing/trial and error, I found an execution plan that ran the query in a fraction of the time (and then lost the statistics that produced it). What I wish to know is how to force the query to use the faster execution plan.

Query:
SELECT count(*) as count FROM
(
SELECT *
FROM transaction t, merchant m
WHERE t.merchant_id = m.id
AND m.id = 198
AND t.transaction_date >= '20050101'
AND t.transaction_date <= '20060925'
AND credit_card_no LIKE '1111%111'

UNION ALL
SELECT *
FROM transaction t, merchant m
WHERE t.merchant_id = m.id
AND m.parent_merchant_id = 198
AND t.transaction_date >= '20050101'
AND t.transaction_date <= '20060925'
AND credit_card_no LIKE '1111%111'
) AS foobar

Desired Execution Plan:
Aggregate  (cost=97377.90..97377.90 rows=1 width=0)
  ->  Subquery Scan foobar  (cost=0.00..97377.86 rows=16 width=0)
        ->  Append  (cost=0.00..97377.70 rows=16 width=636)
              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..10304.81 rows=3 width=636)
                    ->  Nested Loop  (cost=0.00..10304.78 rows=3 width=636)
                          ->  Index Scan using pk_merchant on merchant m  (cost=0.00..5.11 rows=1 width=282)
                                Index Cond: (id = 198)
                          ->  Index Scan using ix_transaction_merchant_id on "transaction" t  (cost=0.00..10299.64 rows=3 width=354)
                                Index Cond: (198 = merchant_id)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
              ->  Subquery Scan "*SELECT* 2"  (cost=13.86..87072.89 rows=13 width=636)
                    ->  Hash Join  (cost=13.86..87072.76 rows=13 width=636)
                          Hash Cond: ("outer".merchant_id = "inner".id)
                          ->  Seq Scan on "transaction" t  (cost=0.00..87052.65 rows=1223 width=354)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
                          ->  Hash  (cost=13.85..13.85 rows=4 width=282)
                                ->  Index Scan using ix_merchant_parent_merchant_id on merchant m  (cost=0.00..13.85 rows=4 width=282)
                                      Index Cond: (parent_merchant_id = 198)

Undesired Execution Plan:
Aggregate  (cost=88228.82..88228.82 rows=1 width=0)
  ->  Subquery Scan foobar  (cost=0.00..88228.73 rows=35 width=0)
        ->  Append  (cost=0.00..88228.38 rows=35 width=631)
              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1137.61 rows=1 width=631)
                    ->  Nested Loop  (cost=0.00..1137.60 rows=1 width=631)
                          ->  Index Scan using ix_transaction_merchant_id on "transaction" t  (cost=0.00..1132.47 rows=1 width=349)
                                Index Cond: (198 = merchant_id)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
                          ->  Index Scan using pk_merchant on merchant m  (cost=0.00..5.11 rows=1 width=282)
                                Index Cond: (id = 198)
              ->  Subquery Scan "*SELECT* 2"  (cost=20.90..87090.77 rows=34 width=631)
                    ->  Hash Join  (cost=20.90..87090.43 rows=34 width=631)
                          Hash Cond: ("outer".merchant_id = "inner".id)
                          ->  Seq Scan on "transaction" t  (cost=0.00..87061.04 rows=1632 width=349)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
                          ->  Hash  (cost=20.88..20.88 rows=8 width=282)
                                ->  Seq Scan on merchant m  (cost=0.00..20.88 rows=8 width=282)
                                      Filter: (parent_merchant_id = 198)



Thanks for any help/ideas


Tim

  • Jochem van Dieten at Sep 27, 2006 at 12:34 pm

    Tim Truman wrote:
    Query:
    SELECT count(*) as count FROM
    (
    SELECT *
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
    AND m.id = 198
    AND t.transaction_date >= '20050101'
    AND t.transaction_date <= '20060925'
    AND credit_card_no LIKE '1111%111'

    UNION ALL
    SELECT *
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
    AND m.parent_merchant_id = 198
    AND t.transaction_date >= '20050101'
    AND t.transaction_date <= '20060925'
    AND credit_card_no LIKE '1111%111'
    ) AS foobar
    Actually, I think the best course of action is to rewrite the query into a
    faster alternative: aggregating inside each branch means only the two
    counts travel up through the Append, instead of the full width-636 rows
    produced by SELECT *. What you can try is:
    SELECT SUM(count) AS count FROM
    (
    SELECT count(*) AS count
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
    AND m.id = 198
    AND t.transaction_date >= '20050101'
    AND t.transaction_date <= '20060925'
    AND credit_card_no LIKE '1111%111'

    UNION ALL
    SELECT count(*) AS count
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
    AND m.parent_merchant_id = 198
    AND t.transaction_date >= '20050101'
    AND t.transaction_date <= '20060925'
    AND credit_card_no LIKE '1111%111'
    ) AS foobar;


    The next optimization is to merge the two subqueries into one. If your
    schema is such that m.id cannot be the same as m.parent_merchant_id, I
    think your query can be reduced to:
    SELECT count(*) AS count
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
    AND
    (
    m.id = 198
    OR
    m.parent_merchant_id = 198
    )
    AND t.transaction_date >= '20050101'
    AND t.transaction_date <= '20060925'
    AND credit_card_no LIKE '1111%111'


    If m.id can be the same as m.parent_merchant_id, a merchant that is its
    own parent would have matched both branches of the UNION ALL, so to
    preserve that double-counting you need something like:
    SELECT SUM(
    CASE WHEN m.id = m.parent_merchant_id THEN 2 ELSE 1 END
    ) AS count
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
    AND
    (
    m.id = 198
    OR
    m.parent_merchant_id = 198
    )
    AND t.transaction_date >= '20050101'
    AND t.transaction_date <= '20060925'
    AND credit_card_no LIKE '1111%111'

    Jochem
  • Dave Dutcher at Sep 27, 2006 at 3:51 pm

    -----Original Message-----
    From: pgsql-performance-owner@postgresql.org
    On Behalf Of Tim Truman
    Hi,

    I have the following query, which has been running very slowly. After a
    lot of testing/trial and error, I found an execution plan that ran the
    query in a fraction of the time (and then lost the statistics that
    produced it). What I wish to know is how to force the query to use the
    faster execution plan.
    It would be a bit easier to diagnose the problem if you posted EXPLAIN
    ANALYZE rather than just EXPLAIN. The two plans you posted looked very
    similar except for the order of the nested loop in subquery 1 and an index
    scan rather than a seq scan in subquery 2.

    My guess would be that the order of the nested loop is determined mostly by
    estimates of matching rows. If you ran an EXPLAIN ANALYZE you could tell if
    the planner is estimating correctly. If it is not, you could try increasing
    your statistics target and running ANALYZE.
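
    For example, something along these lines (a sketch; 100 is only an
    illustrative target, and the column names are taken from the query above):

    ALTER TABLE "transaction" ALTER COLUMN merchant_id SET STATISTICS 100;
    ALTER TABLE "transaction" ALTER COLUMN transaction_date SET STATISTICS 100;
    ALTER TABLE "transaction" ALTER COLUMN credit_card_no SET STATISTICS 100;
    ANALYZE "transaction";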

    To make the planner prefer an index scan over a seq scan, I would first
    check the statistics again. Then you can try setting enable_seqscan to
    false (enable_seqscan is meant more for testing than production), or you
    could try reducing random_page_cost, but you should test that against a
    range of queries before putting it in production.
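
    For instance, in a test session (the random_page_cost value here is only
    an illustration, not a recommendation):

    SET enable_seqscan = off;   -- for testing only; don't leave this off in production
    EXPLAIN ANALYZE SELECT ...; -- rerun the query here and compare the plan
    SET enable_seqscan = on;
    SET random_page_cost = 2;   -- default is 4; lower values make index scans look cheaper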

    Dave
  • Jim C. Nasby at Sep 27, 2006 at 8:14 pm

    On Wed, Sep 27, 2006 at 10:51:26AM -0500, Dave Dutcher wrote:
    To make the planner prefer an index scan over a seq scan, I would first
    check the statistics again. Then you can try setting enable_seqscan to
    false (enable_seqscan is meant more for testing than production), or you
    could try reducing random_page_cost, but you should test that against a
    range of queries before putting it in production.
    Index scans are also pretty picky about correlation. If you have really
    low correlation you don't want to index scan, but I think our current
    estimates make it too eager to switch to a seqscan.
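
    You can check the planner's correlation estimate for the relevant columns
    in the pg_stats view (table and column names taken from the thread):

    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'transaction'
    AND attname IN ('merchant_id', 'transaction_date');

    Values near 1 or -1 mean the heap is well ordered on that column; values
    near 0 mean matching rows are scattered across the table.
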
    --
    Jim Nasby jim@nasby.net
    EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
  • Ron Mayer at Oct 4, 2006 at 12:10 am

    Jim C. Nasby wrote:

    Index scans are also pretty picky about correlation. If you have really
    low correlation you don't want to index scan,
    I still don't think "correlation" is the right metric
    at all for making this decision.

    If you have a list of addresses clustered by "zip", the "correlation" of
    State, City, County, etc. will all be zero (since the zip codes don't
    match the alphabetical order of state or city names), but index scans are
    still big wins because the data for any given state or city will be
    packed on the same few pages - and in fact the pages could be read mostly
    sequentially.
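
    As a hypothetical illustration of that scenario (the table and index
    names are invented for the example):

    CLUSTER ix_addresses_zip ON addresses;  -- physically order the table by zip (pre-8.3 syntax)
    ANALYZE addresses;
    SELECT attname, correlation FROM pg_stats WHERE tablename = 'addresses';

    After the CLUSTER, the correlation reported for state or city can be near
    zero even though all the rows for any one state sit on a handful of
    adjacent pages.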
  • Jim C. Nasby at Oct 4, 2006 at 12:56 am
    Adding -performance back in.
    On Tue, Oct 03, 2006 at 05:10:04PM -0700, Ron Mayer wrote:
    Jim C. Nasby wrote:
    Index scans are also pretty picky about correlation. If you have really
    low correlation you don't want to index scan,
    I still don't think "correlation" is the right metric
    at all for making this decision.

    If you have a list of addresses clustered by "zip", the "correlation" of
    State, City, County, etc. will all be zero (since the zip codes don't
    match the alphabetical order of state or city names), but index scans are
    still big wins because the data for any given state or city will be
    packed on the same few pages - and in fact the pages could be read mostly
    sequentially.
    That's a good point that I don't think has been considered before. I
    think correlation is still somewhat important, but what's probably far
    more important is data localization.

    One possible way to calculate this would be to note the location of
    every tuple with a given value in the heap. Calculate the geometric mean
    of those locations (I think you could essentially average all the
    ctids), and divide that by the average distance of each tuple from that
    mean (or maybe the reciprocal of that would be more logical).

    Obviously we don't want to scan the whole table to do that, but there
    should be some way to do it via sampling as well.

    Or perhaps someone knows of a research paper with real data on how to do
    this instead of hand-waving. :)
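
    For a rough, hand-rolled look at locality today, you can count how many
    distinct heap blocks hold the matching rows via ctid (illustrative only -
    it scans the table, and the tid-to-text cast may need a reasonably recent
    release):

    SELECT count(*) AS tuples,
           count(DISTINCT (ctid::text::point)[0]) AS heap_blocks
    FROM "transaction"
    WHERE merchant_id = 198;

    Many tuples packed into few blocks means high locality, so an index scan
    touches few pages even when correlation is low.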
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Tim Truman at Oct 3, 2006 at 3:58 am
    Here is an "explain analyze" for the query that performs slowly. I hope
    this helps; unfortunately I can't reproduce the version of the query that
    ran quickly, and therefore can't provide an "explain analyze" for it.

    "Aggregate (cost=88256.32..88256.32 rows=1 width=0) (actual
    time=55829.000..55829.000 rows=1 loops=1)"
    " -> Subquery Scan foobar (cost=0.00..88256.23 rows=35 width=0) (actual
    time=19235.000..55829.000 rows=24 loops=1)"
    " -> Append (cost=0.00..88255.88 rows=35 width=631) (actual
    time=19235.000..55829.000 rows=24 loops=1)"
    " -> Subquery Scan "*SELECT* 1" (cost=0.00..1165.12 rows=1
    width=631) (actual time=16.000..16.000 rows=0 loops=1)"
    " -> Nested Loop (cost=0.00..1165.11 rows=1 width=631)
    (actual time=16.000..16.000 rows=0 loops=1)"
    " -> Index Scan using ix_transaction_merchant_id
    on "transaction" t (cost=0.00..1159.98 rows=1 width=349) (actual
    time=16.000..16.000 rows=0 loops=1)"
    " Index Cond: (198 = merchant_id)"
    " Filter: ((transaction_date >=
    '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND
    ((credit_card_no)::text ~~ '4564%549'::text))"
    " -> Index Scan using pk_merchant on merchant m
    (cost=0.00..5.11 rows=1 width=282) (never executed)"
    " Index Cond: (id = 198)"
    " -> Subquery Scan "*SELECT* 2" (cost=20.90..87090.76 rows=34
    width=631) (actual time=19219.000..55813.000 rows=24 loops=1)"
    " -> Hash Join (cost=20.90..87090.42 rows=34 width=631)
    (actual time=19219.000..55813.000 rows=24 loops=1)"
    " Hash Cond: ("outer".merchant_id = "inner".id)"
    " -> Seq Scan on "transaction" t
    (cost=0.00..87061.04 rows=1630 width=349) (actual time=234.000..55797.000
    rows=200 loops=1)"
    " Filter: ((transaction_date >=
    '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND
    ((credit_card_no)::text ~~ '4564%549'::text))"
    " -> Hash (cost=20.88..20.88 rows=8 width=282)
    (actual time=16.000..16.000 rows=0 loops=1)"
    " -> Seq Scan on merchant m
    (cost=0.00..20.88 rows=8 width=282) (actual time=0.000..16.000 rows=7
    loops=1)"
    " Filter: (parent_merchant_id = 198)"
    "Total runtime: 55829.000 ms"

    Once again any help much appreciated.

    Tim

  • Tom Lane at Oct 3, 2006 at 4:19 am

    "Tim Truman" <tim@advam.com> writes:
    Here is an "explain analyze" for the query that performs slowly,
    This shows that the planner is exactly correct in thinking that all
    the runtime is going into the seqscan on transaction:
    "Aggregate (cost=88256.32..88256.32 rows=1 width=0) (actual
    time=55829.000..55829.000 rows=1 loops=1)"
    ...
    " -> Seq Scan on "transaction" t
    (cost=0.00..87061.04 rows=1630 width=349) (actual time=234.000..55797.000
    rows=200 loops=1)"
    " Filter: ((transaction_date >=
    '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND
    ((credit_card_no)::text ~~ '4564%549'::text))"
    Since that component of the plan was identical in your two original
    plans ("desired" and "undesired") it seems pretty clear that you have
    not correctly identified what your problem is.

    regards, tom lane
  • Tim Truman at Oct 3, 2006 at 7:00 am
    Thanks Tom
    The time difference did distract me from the issue. Setting enable_seqscan
    to off reduced the runtime greatly, so I am now adjusting the
    effective_cache_size and random_page_cost settings to favor indexes over
    seq scans.
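
    For reference, the knobs being adjusted can be tried per session like
    this before changing postgresql.conf (the values are placeholders to
    experiment with, not recommendations):

    SET random_page_cost = 2.0;        -- default 4.0; lower makes index scans look cheaper
    SET effective_cache_size = 131072; -- in 8kB pages on versions of this era (~1GB of cache)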

    Regards,
    Tim


