Hi,

Is there any inherent benefit of using a the IN operator versus
joining a temporary table? Should they offer near equal performance?
It appears bitmap scan's aren't done when matching across a small
temporary table.

I have a temporary table with 5 integers in it that I'm matching
against mildly complex view that has 5 joins. I've analyzed the
database after the temporary table was created.

Matching against the temporary table takes: 36492.836 ms.
Matching using the IN operator with the same content takes: 2.732 ms.

These measurements are after the query has been run a few times, so
the data should be in cache.

It would appear that the temporary table's join isn't evaluated deep
enough in the query plan to prevent the more expensive joins from
running, is there a way for force it? Could some setting be wrong
that telling the planner to make this decision? The same thing
happens when I perform the join without the view.

select * from foo;
oid
--------
161007
161008
161000
161009
161002
(5 rows)


Plan for IN match:

=# explain analyze select * from crawled_url_full_view where
crawled_url_full_view.oid in (161007, 161008, 161000, 161009, 161002);

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------
Hash Left Join (cost=14.50..94.11 rows=5 width=538) (actual
time=1.025..1.522 rows=5 loops=1)
Hash Cond: ("outer".classification_set_id = "inner".id)
Join Filter: ("outer".classification_set_id IS NOT NULL)
-> Hash Left Join (cost=13.30..92.86 rows=5 width=526) (actual
time=0.794..1.251 rows=5 loops=1)
Hash Cond: ("outer".charset_id = "inner".id)
Join Filter: ("outer".charset_id IS NOT NULL)
-> Hash Left Join (cost=12.21..91.70 rows=5 width=515)
(actual time=0.631..1.048 rows=5 loops=1)
Hash Cond: ("outer".http_error_description_id =
"inner".id)
Join Filter: ("outer".http_error_description_id IS
NOT NULL)
-> Hash Left Join (cost=11.13..90.59 rows=5
width=472) (actual time=0.488..0.868 rows=5 loops=1)
Hash Cond: ("outer".content_type_id = "inner".id)
Join Filter: ("outer".content_type_id IS NOT NULL)
-> Nested Loop Left Join (cost=10.02..89.41
rows=5 width=443) (actual time=0.244..0.578 rows=5 loops=1)
Join Filter: ("outer".redirect_url_id IS
NOT NULL)
-> Nested Loop Left Join
(cost=10.02..59.56 rows=5 width=339) (actual time=0.225..0.488 rows=5
loops=1)
-> Bitmap Heap Scan on
crawled_url (cost=10.02..29.71 rows=5 width=235) (actual
time=0.170..0.217 rows=5 loops=1)
Recheck Cond: ((oid = 161007)
OR (oid = 161008) OR (oid = 161000) OR (oid = 161009) OR (oid = 161002))
-> BitmapOr
(cost=10.02..10.02 rows=5 width=0) (actual time=0.137..0.137 rows=0
loops=1)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.061..0.061 rows=1 loops=1)
Index Cond: (oid
= 161007)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.013..0.013 rows=1 loops=1)
Index Cond: (oid
= 161008)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.013..0.013 rows=1 loops=1)
Index Cond: (oid
= 161000)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.014..0.014 rows=1 loops=1)
Index Cond: (oid
= 161009)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.012..0.012 rows=1 loops=1)
Index Cond: (oid
= 161002)
-> Index Scan using url_pkey on
url (cost=0.00..5.96 rows=1 width=108) (actual time=0.031..0.036
rows=1 loops=5)
Index Cond: (url.url_id =
"outer".url_id)
-> Index Scan using url_pkey on url r1
(cost=0.00..5.96 rows=1 width=108) (actual time=0.004..0.004 rows=0
loops=5)
Index Cond: (r1.url_id =
"outer".redirect_url_id)
-> Hash (cost=1.09..1.09 rows=9 width=33)
(actual time=0.130..0.130 rows=9 loops=1)
-> Seq Scan on content_types
(cost=0.00..1.09 rows=9 width=33) (actual time=0.017..0.062 rows=9
loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=47) (actual
time=0.088..0.088 rows=6 loops=1)
-> Seq Scan on http_error_descriptions
(cost=0.00..1.06 rows=6 width=47) (actual time=0.010..0.040 rows=6
loops=1)
-> Hash (cost=1.08..1.08 rows=8 width=15) (actual
time=0.103..0.103 rows=8 loops=1)
-> Seq Scan on charsets (cost=0.00..1.08 rows=8
width=15) (actual time=0.011..0.048 rows=8 loops=1)
-> Hash (cost=1.16..1.16 rows=16 width=16) (actual
time=0.175..0.175 rows=16 loops=1)
-> Seq Scan on classification_sets (cost=0.00..1.16
rows=16 width=16) (actual time=0.012..0.088 rows=16 loops=1)
Total runtime: 2.743 ms
(41 rows)



Plan for temp table match:


=# explain analyze select * from foo, crawled_url_full_view where
crawled_url_full_view.oid = foo.oid;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------
Hash IN Join (cost=35667.15..145600.71 rows=5 width=538) (actual
time=22371.445..36482.823 rows=5 loops=1)
Hash Cond: ("outer".oid = "inner".oid)
-> Hash Left Join (cost=35666.09..143698.61 rows=380198
width=538) (actual time=9901.782..35218.758 rows=360531 loops=1)
Hash Cond: ("outer".classification_set_id = "inner".id)
Join Filter: ("outer".classification_set_id IS NOT NULL)
-> Hash Left Join (cost=35664.89..140493.61 rows=380198
width=526) (actual time=9901.456..32363.212 rows=360531 loops=1)
Hash Cond: ("outer".charset_id = "inner".id)
Join Filter: ("outer".charset_id IS NOT NULL)
-> Hash Left Join (cost=35663.79..135684.27
rows=380198 width=515) (actual time=9901.257..29400.189 rows=360531
loops=1)
Hash Cond: ("outer".http_error_description_id =
"inner".id)
Join Filter: ("outer".http_error_description_id
IS NOT NULL)
-> Hash Left Join (cost=35662.71..133782.19
rows=380198 width=472) (actual time=9901.080..26691.473 rows=360531
loops=1)
Hash Cond: ("outer".content_type_id =
"inner".id)
Join Filter: ("outer".content_type_id IS
NOT NULL)
-> Hash Left Join
(cost=35661.60..128972.84 rows=380198 width=443) (actual
time=9900.802..23743.323 rows=360531 loops=1)
Hash Cond: ("outer".redirect_url_id
= "inner".url_id)
Join Filter:
("outer".redirect_url_id IS NOT NULL)
-> Hash Left Join
(cost=17830.80..66680.80 rows=380198 width=339) (actual
time=4592.701..14466.994 rows=360531 loops=1)
Hash Cond: ("outer".url_id =
"inner".url_id)
-> Seq Scan on crawled_url
(cost=0.00..10509.98 rows=380198 width=235) (actual
time=0.026..2976.911 rows=360531 loops=1)
-> Hash
(cost=10627.04..10627.04 rows=377104 width=108) (actual
time=4591.703..4591.703 rows=382149 loops=1)
-> Seq Scan on url
(cost=0.00..10627.04 rows=377104 width=108) (actual
time=0.041..2142.702 rows=382149 loops=1)
-> Hash (cost=10627.04..10627.04
rows=377104 width=108) (actual time=5307.540..5307.540 rows=382149
loops=1)
-> Seq Scan on url r1
(cost=0.00..10627.04 rows=377104 width=108) (actual
time=0.138..2503.577 rows=382149 loops=1)
-> Hash (cost=1.09..1.09 rows=9
width=33) (actual time=0.144..0.144 rows=9 loops=1)
-> Seq Scan on content_types
(cost=0.00..1.09 rows=9 width=33) (actual time=0.020..0.068 rows=9
loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=47)
(actual time=0.108..0.108 rows=6 loops=1)
-> Seq Scan on http_error_descriptions
(cost=0.00..1.06 rows=6 width=47) (actual time=0.015..0.049 rows=6
loops=1)
-> Hash (cost=1.08..1.08 rows=8 width=15) (actual
time=0.129..0.129 rows=8 loops=1)
-> Seq Scan on charsets (cost=0.00..1.08
rows=8 width=15) (actual time=0.014..0.058 rows=8 loops=1)
-> Hash (cost=1.16..1.16 rows=16 width=16) (actual
time=0.234..0.234 rows=16 loops=1)
-> Seq Scan on classification_sets (cost=0.00..1.16
rows=16 width=16) (actual time=0.014..0.107 rows=16 loops=1)
-> Hash (cost=1.05..1.05 rows=5 width=4) (actual
time=0.092..0.092 rows=5 loops=1)
-> Seq Scan on foo (cost=0.00..1.05 rows=5 width=4)
(actual time=0.022..0.044 rows=5 loops=1)
Total runtime: 36492.836 ms
(35 rows)


Definition of the view:

create view crawled_url_full_view as
select crawled_url.*,
url.url,
r1.url as redirect_url,
content_types.type as content_type,
http_error_descriptions.error as http_error_description,
charsets.name as charset,
classification_sets.name as classification_set
from crawled_url left join url on url.url_id = crawled_url.url_id
left join url as r1 on (r1.url_id = crawled_url.redirect_url_id and
crawled_url.redirect_url_id is not null)
left join content_types on (content_types.id =
crawled_url.content_type_id and crawled_url.content_type_id is not null)
left join http_error_descriptions on (http_error_descriptions.id =
crawled_url.http_error_description_id and
crawled_url.http_error_description_id is not null)
left join charsets on (charsets.id = crawled_url.charset_id and
crawled_url.charset_id is not null)
left join classification_sets on (classification_sets.id =
crawled_url.classification_set_id and
crawled_url.classification_set_id is not null);


Version is: PostgreSQL 8.1.4 on i686-pc-linux-gnu, compiled by GCC
gcc (GCC) 4.0.1 20050727 (Red Hat 4.0.1-5)

work_mem=30000
shared_buffers=5000
effective_cache_size=15000

Thanks for any help,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

Search Discussions

  • Tom Lane at Aug 5, 2006 at 2:15 am

    Rusty Conover writes:
    Is there any inherent benefit of using a the IN operator versus
    joining a temporary table? Should they offer near equal performance?
    It appears bitmap scan's aren't done when matching across a small
    temporary table.
    I believe the problem you're facing is that existing PG releases
    don't know how to rearrange join order in the face of outer joins,
    and your view is full of outer joins. So the join against the temp
    table happens after forming the full output of the view, whereas you
    desperately need it to happen at the bottom of the join stack.

    CVS tip (8.2-to-be) has some ability to rearrange outer joins, and
    I'm interested to know whether it's smart enough to fix your problem.
    But you have not provided enough info to let someone else duplicate
    your test case. Would you be willing to download CVS or a recent
    nightly snapshot and see what it does with your problem?

    regards, tom lane
  • Rusty Conover at Aug 5, 2006 at 2:57 am

    On Aug 4, 2006, at 8:15 PM, Tom Lane wrote:

    Rusty Conover <rconover@infogears.com> writes:
    Is there any inherent benefit of using a the IN operator versus
    joining a temporary table? Should they offer near equal performance?
    It appears bitmap scan's aren't done when matching across a small
    temporary table.
    I believe the problem you're facing is that existing PG releases
    don't know how to rearrange join order in the face of outer joins,
    and your view is full of outer joins. So the join against the temp
    table happens after forming the full output of the view, whereas you
    desperately need it to happen at the bottom of the join stack.

    CVS tip (8.2-to-be) has some ability to rearrange outer joins, and
    I'm interested to know whether it's smart enough to fix your problem.
    But you have not provided enough info to let someone else duplicate
    your test case. Would you be willing to download CVS or a recent
    nightly snapshot and see what it does with your problem?

    regards, tom lane

    Absolutely, I'll attempt to run the test against the current CVS HEAD.

    Do I need to pg_dump and restore from 8.1.4?

    What other information would be helpful in the meantime?

    Thanks,

    Rusty
    --
    Rusty Conover
    InfoGears Inc.
    Web: http://www.infogears.com
  • Tom Lane at Aug 5, 2006 at 3:27 am

    Rusty Conover writes:
    Absolutely, I'll attempt to run the test against the current CVS HEAD.
    Do I need to pg_dump and restore from 8.1.4?
    Yup, fraid so.

    regards, tom lane

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedAug 4, '06 at 10:38p
activeAug 5, '06 at 3:27a
posts4
users2
websitepostgresql.org
irc#postgresql

2 users in discussion

Rusty Conover: 2 posts Tom Lane: 2 posts

People

Translate

site design / logo © 2022 Grokbase