Hi guys,

I met with the problem that when I was using WITH clause to reuse a subquery, I got a huge performance penalty because of query planner.

Here are the details, the original query is

EXPLAIN ANALYZE WITH latest_identities AS
(
SELECT DISTINCT ON (memberid) memberid, username, changedate
FROM t_username_history
WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' ' || substring(lastname,1,1) = 'Eddie T')
ORDER BY memberid, changedate DESC
)
SELECT t_member.email as email, t_member.username as username, t_member.location as location, t_member.locale as locale, t_member.status as status, t_member.creationdate as creationdate, t_forum_member.pos\
ts as posts, t_forum_member.expertlocations as expertlocations, t_nexus_member.pages_created as pages_created, t_nexus_member.pages_edited as pages_edited, t_member_contributions.hotel_reviews as hotel_rev\
iews, t_member_contributions.restaurant_reviews as restaurant_reviews, t_member_contributions.attraction_reviews as attraction_reviews, t_member_contributions.geo_reviews as geo_reviews, t_member_contribut\
ions.photos as photos, t_member_contributions.videos as videos, t_recent_contribution.recent_contribution_date as recent_contribution_date, t_recent_contribution.recent_contribution_type as recent_contribu\
tion_type, t_owner_member.memberid as owner_memberid, t_member_interaction.flags as interaction_flags, t_media.path as ta_avatar_path, t_external_member.externalid as facebookid, latest_identities.username\
as latest_identity
FROM t_member left join t_forum_member on (t_member.memberid = t_forum_member.memberid) left join t_nexus_member on (t_member.memberid = t_nexus_member.memberid) left join t_member_contributions on (t_memb\
er.memberid = t_member_contributions.memberid) left join t_recent_contribution on (t_member.memberid = t_recent_contribution.memberid) left join t_owner_member on (t_member.memberid = t_owner_member.member\
id) left join t_member_interaction on (t_member.memberid = t_member_interaction.memberid) left join t_media on (t_member.avatar = t_media.id) left join t_external_member on (t_member.memberid = t_external_\
member.memberid AND t_external_member.idtype = 'FB') left join latest_identities on (t_member.memberid = latest_identities.memberid)
WHERE t_member.firstname || ' ' || substring(t_member.lastname,1,1) = 'Eddie T';

The may seems scary, but what it really does is searching for members with certain name and joining with a bunch of other tables on memberid. The t_username_history table has multiple rows for a memberid therefore I just get the most recent record for each memberid that I am interested in before the join.

Here is the link to explain:
http://explain.depesz.com/s/ZKb

Since the red part looks suboptimal to me, I changed it using WITH subquery:

EXPLAIN WITH memberids AS
(
SELECT memberid FROM t_member WHERE firstname || ' ' || substring(lastname,1,1) = 'Eddie T'
),
latest_identities AS
(
SELECT DISTINCT ON (memberid) memberid, username, changedate
FROM t_username_history
WHERE memberid IN (SELECT memberid FROM memberids)
ORDER BY memberid, changedate DESC
)
SELECT t_member.email as email, t_member.username as username, t_member.location as location, t_member.locale as locale, t_member.status as status, t_member.creationdate as creationdate, t_forum_member.pos\
ts as posts, t_forum_member.expertlocations as expertlocations, t_nexus_member.pages_created as pages_created, t_nexus_member.pages_edited as pages_edited, t_member_contributions.hotel_reviews as hotel_rev\
iews, t_member_contributions.restaurant_reviews as restaurant_reviews, t_member_contributions.attraction_reviews as attraction_reviews, t_member_contributions.geo_reviews as geo_reviews, t_member_contribut\
ions.photos as photos, t_member_contributions.videos as videos, t_recent_contribution.recent_contribution_date as recent_contribution_date, t_recent_contribution.recent_contribution_type as recent_contribu\
tion_type, t_owner_member.memberid as owner_memberid, t_member_interaction.flags as interaction_flags, t_media.path as ta_avatar_path, t_external_member.externalid as facebookid, latest_identities.username\
as latest_identity
FROM t_member left join t_forum_member on (t_member.memberid = t_forum_member.memberid) left join t_nexus_member on (t_member.memberid = t_nexus_member.memberid) left join t_member_contributions on (t_memb\
er.memberid = t_member_contributions.memberid) left join t_recent_contribution on (t_member.memberid = t_recent_contribution.memberid) left join t_owner_member on (t_member.memberid = t_owner_member.member\
id) left join t_member_interaction on (t_member.memberid = t_member_interaction.memberid) left join t_media on (t_member.avatar = t_media.id) left join t_external_member on (t_member.memberid = t_external_\
member.memberid AND t_external_member.idtype = 'FB') left join latest_identities on (t_member.memberid = latest_identities.memberid)
WHERE t_member.memberid IN (SELECT memberid FROM memberids)

However, this query runs forever because (I think) the planner join the tables before filter by where clause.

Here is the explain link:
http://explain.depesz.com/s/v2K

Anyone knows why the planner is doing this?

Regards,
Li

Search Discussions

  • Tom Lane at Jul 29, 2011 at 5:40 pm

    Li Jin writes:
    Anyone knows why the planner is doing this?
    WITH is an optimization fence. This is intentional and documented.

    regards, tom lane
  • Robert Klemme at Jul 30, 2011 at 1:10 pm

    On Thu, Jul 28, 2011 at 11:00 PM, Li Jin wrote:
    I met with the problem that when I was using WITH clause to reuse a
    subquery, I got a huge performance penalty because of query planner.
    Here are the details, the original query is
    EXPLAIN ANALYZE WITH latest_identities AS
    (
    SELECT DISTINCT ON (memberid) memberid, username, changedate
    FROM t_username_history
    WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
    substring(lastname,1,1) = 'Eddie T')
    ORDER BY memberid, changedate DESC
    )
    Another observation: That criterion looks suspicious to me. I would
    expect any RDBMS to be better able to optimize this:

    WHERE firstname = 'Eddie' AND lastname like 'T%'

    I know it's semantically not the same but I would assume this is good
    enough for the common usecase. Plus, if there is an index on
    (firstname, lastname) then that could be used.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
  • Merlin Moncure at Aug 2, 2011 at 9:48 pm

    On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme wrote:
    On Thu, Jul 28, 2011 at 11:00 PM, Li Jin wrote:
    I met with the problem that when I was using WITH clause to reuse a
    subquery, I got a huge performance penalty because of query planner.
    Here are the details, the original query is
    EXPLAIN ANALYZE WITH latest_identities AS
    (
    SELECT DISTINCT ON (memberid) memberid, username, changedate
    FROM t_username_history
    WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
    substring(lastname,1,1) = 'Eddie T')
    ORDER BY memberid, changedate DESC
    )
    Another observation: That criterion looks suspicious to me. I would
    expect any RDBMS to be better able to optimize this:

    WHERE firstname = 'Eddie' AND lastname like 'T%'

    I know it's semantically not the same but I would assume this is good
    enough for the common usecase.  Plus, if there is an index on
    (firstname, lastname) then that could be used.
    disagree. just one of the ways that could be stymied would to change
    the function behind the '||' operator.

    merlin
  • Robert Klemme at Aug 3, 2011 at 7:18 am

    On Tue, Aug 2, 2011 at 11:48 PM, Merlin Moncure wrote:
    On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme
    wrote:
    On Thu, Jul 28, 2011 at 11:00 PM, Li Jin wrote:
    I met with the problem that when I was using WITH clause to reuse a
    subquery, I got a huge performance penalty because of query planner.
    Here are the details, the original query is
    EXPLAIN ANALYZE WITH latest_identities AS
    (
    SELECT DISTINCT ON (memberid) memberid, username, changedate
    FROM t_username_history
    WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
    substring(lastname,1,1) = 'Eddie T')
    ORDER BY memberid, changedate DESC
    )
    Another observation: That criterion looks suspicious to me. I would
    expect any RDBMS to be better able to optimize this:

    WHERE firstname = 'Eddie' AND lastname like 'T%'

    I know it's semantically not the same but I would assume this is good
    enough for the common usecase.  Plus, if there is an index on
    (firstname, lastname) then that could be used.
    disagree. just one of the ways that could be stymied would to change
    the function behind the '||' operator.
    I don't understand what you mean. Can you please elaborate?

    To explain my point a bit: I meant that by querying individual fields
    separately instead of applying a criterion on a function of the two
    the RDBMS has a better chance to use indexes and come up with a better
    plan for this part of the query.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
  • Li Jin at Aug 3, 2011 at 1:27 pm
    Robert,

    I've built an index on this expression firstname || ' ' || substring(lastname,1,1). I believe this is the best index for this particular query. Correct me if I am wrong.

    Li
    On Aug 3, 2011, at 3:18 AM, Robert Klemme wrote:
    On Tue, Aug 2, 2011 at 11:48 PM, Merlin Moncure wrote:
    On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme
    wrote:
    On Thu, Jul 28, 2011 at 11:00 PM, Li Jin wrote:
    I met with the problem that when I was using WITH clause to reuse a
    subquery, I got a huge performance penalty because of query planner.
    Here are the details, the original query is
    EXPLAIN ANALYZE WITH latest_identities AS
    (
    SELECT DISTINCT ON (memberid) memberid, username, changedate
    FROM t_username_history
    WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
    substring(lastname,1,1) = 'Eddie T')
    ORDER BY memberid, changedate DESC
    )
    Another observation: That criterion looks suspicious to me. I would
    expect any RDBMS to be better able to optimize this:

    WHERE firstname = 'Eddie' AND lastname like 'T%'

    I know it's semantically not the same but I would assume this is good
    enough for the common usecase. Plus, if there is an index on
    (firstname, lastname) then that could be used.
    disagree. just one of the ways that could be stymied would to change
    the function behind the '||' operator.
    I don't understand what you mean. Can you please elaborate?

    To explain my point a bit: I meant that by querying individual fields
    separately instead of applying a criterion on a function of the two
    the RDBMS has a better chance to use indexes and come up with a better
    plan for this part of the query.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
  • Robert Klemme at Aug 3, 2011 at 4:21 pm

    On Wed, Aug 3, 2011 at 3:27 PM, Li Jin wrote:
    Robert,
    I've built an index on this expression firstname || ' ' ||
    substring(lastname,1,1). I believe this is the best index for this
    particular query. Correct me if I am wrong.
    Maybe, maybe not. Difficult to tell from a distance. I would have an
    index on (firstname, lastname). You could try that and look at the
    plan for the other query. That's the only ultimate test which will
    give you hard facts.

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
  • Robert Klemme at Aug 3, 2011 at 5:13 pm

    On Wed, Aug 3, 2011 at 3:27 PM, Li Jin wrote:
    Robert,
    I've built an index on this expression firstname || ' ' ||
    substring(lastname,1,1). I believe this is the best index for this
    particular query. Correct me if I am wrong.
    Maybe, maybe not. Difficult to tell from a distance. I would have an
    index on (firstname, lastname). You could try that and look at the
    plan for the other query. That's the only ultimate test which will
    give you hard facts.

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
  • Merlin Moncure at Aug 3, 2011 at 4:24 pm

    On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme wrote:
    Another observation: That criterion looks suspicious to me. I would
    expect any RDBMS to be better able to optimize this:

    WHERE firstname = 'Eddie' AND lastname like 'T%'

    I know it's semantically not the same but I would assume this is good
    enough for the common usecase.  Plus, if there is an index on
    (firstname, lastname) then that could be used.
    disagree. just one of the ways that could be stymied would to change
    the function behind the '||' operator.
    I don't understand what you mean.  Can you please elaborate?

    To explain my point a bit: I meant that by querying individual fields
    separately instead of applying a criterion on a function of the two
    the RDBMS has a better chance to use indexes and come up with a better
    plan for this part of the query.
    Yes, but your assuming that it is safe and generally advantageous to
    do that. Both assumptions I think are false.

    The || operator is trivially hacked:
    create or replace function funky_concat(l text, r text) returns text as
    $$
    select textcat(textcat($1, 'abc'), $2);
    $$ language sql immutable ;

    update pg_operator set oprcode = 'funky_concat' where oid = 654;

    postgres=# select 'a' || 'b';
    ?column?
    ----------
    aabcb
    (1 row)

    Also even ignoring the above it's not free to have the database try
    and analyze every instance of the || operator to see if it can be
    decomposed to boolean field operations.

    merlin
  • Robert Klemme at Aug 3, 2011 at 5:30 pm

    On Wed, Aug 3, 2011 at 6:24 PM, Merlin Moncure wrote:
    On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme
    wrote:
    Another observation: That criterion looks suspicious to me. I would
    expect any RDBMS to be better able to optimize this:

    WHERE firstname = 'Eddie' AND lastname like 'T%'

    I know it's semantically not the same but I would assume this is good
    enough for the common usecase.  Plus, if there is an index on
    (firstname, lastname) then that could be used.
    disagree. just one of the ways that could be stymied would to change
    the function behind the '||' operator.
    I don't understand what you mean.  Can you please elaborate?

    To explain my point a bit: I meant that by querying individual fields
    separately instead of applying a criterion on a function of the two
    the RDBMS has a better chance to use indexes and come up with a better
    plan for this part of the query.
    Yes, but your assuming that it is safe and generally advantageous to
    do that.  Both assumptions I think are false.
    I am not sure why you say I assume this is _safe_. I said it is "good
    enough for the common usecase". And it is certainly good enough for
    this particular query.

    As for the "generally advantageous" I'd say that an index on "raw"
    column values is usually useful for more queries than an index on a
    specific function. That's why I'd say generally an index on column
    values is more versatile and I would prefer it. Of course you might
    achieve orders of magnitude of speedup for individual queries with an
    index on a function tailored to that particular query but if you need
    to do that for multiple queries you pay a higher penalty for updates.
    The || operator is trivially hacked:
    create or replace function funky_concat(l text, r text) returns text as
    $$
    select textcat(textcat($1, 'abc'), $2);
    $$ language sql immutable ;

    update pg_operator set oprcode = 'funky_concat' where oid = 654;

    postgres=# select 'a' || 'b';
    ?column?
    ----------
    aabcb
    (1 row)

    Also even ignoring the above it's not free to have the database try
    and analyze every instance of the || operator to see if it can be
    decomposed to boolean field operations.
    Even with your hacked operator you would need an index on the
    expression to make it efficient. That could be done with the original
    as well. But my point was to query
    WHERE a = 'foo' and b like 'b%'
    instead of WHERE a || ' ' || substring(b, 1, 1) = 'foo b'

    to use an index on (a,b). That index would also be useful for queries like

    WHERE a = 'foo'
    WHERE a like 'fo%'
    WHERE a = 'foo' and b = 'bar'

    and probably also

    WHERE a > 'foo'
    WHERE a > 'foo' and b like 'b%'
    WHERE a > 'foo' and b = 'bar'

    Kind regards

    robert


    PS: Sorry for the earlier duplicate. Gmail had a hickup.

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedJul 28, '11 at 9:06p
activeAug 3, '11 at 5:30p
posts10
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase