Hi everybody,

I've got two queries that need optimizing. Actually, there are others,
but these are pretty representative.

You can see the queries and the corresponding plans at

http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf

or

http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex

if you prefer text (a LaTeX file, effectively plain text in this case).

The background to this is at
http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf

If more details are required, let me know and I can add them. I'd
appreciate suggestions about how to make these queries go faster.

Please CC this email address on any replies.

Regards, Faheem.

  • Thom Brown at Nov 23, 2009 at 11:27 pm

    Hi Faheem,

    There appears to be a discrepancy between the 2 PDFs you provided. One says
    you're using PostgreSQL 8.3, and the other shows you using common table
    expressions, which are only available in 8.4+.
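
    (A quick, standard way to confirm which server version a session is
    actually talking to, independent of this particular schema:

    SELECT version();

    )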

    Thom
  • Faheem Mitha at Nov 23, 2009 at 11:49 pm

    On Mon, 23 Nov 2009, Thom Brown wrote:

    Hi Faheem,

    There appears to be a discrepancy between the 2 PDFs you provided.  One
    says you're using PostgreSQL 8.3, and the other shows you using common
    table expressions, which are only available in 8.4+.
    Yes, sorry. I'm using PostgreSQL 8.4. I guess I should go through diag.pdf
    and make sure all the information is current. Thanks for pointing out my
    error.

    Regards, Faheem.
  • Marcin mank at Nov 23, 2009 at 11:53 pm

    On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha wrote:
    Yes, sorry. I'm using PostgreSQL 8.4. I guess I should go through diag.pdf
    and make sure all the information is current. Thanks for pointing out my
    error.
    Excellent report!

    About the COPY problem: you seem to have created the primary key
    before doing the COPY (at least that's what the dump taken before the
    COPY says). This is bad. Create it after the COPY.
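
    As a minimal sketch of that load order (the staging table name, input
    file, and key columns below are invented for illustration, not taken
    from the report):

    -- Load the bulk data first, with no primary key or other indexes in
    -- place, then build the primary key once, after the COPY finishes.
    CREATE TABLE geno_stage (LIKE geno INCLUDING DEFAULTS);

    COPY geno_stage FROM '/tmp/geno.csv' CSV;

    ALTER TABLE geno_stage ADD PRIMARY KEY (idlink_id, anno_id);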

    Greetings
    Marcin
  • Sebastian Jörgensen at Nov 24, 2009 at 12:07 am
    How often are the tables you query from updated?

    Rgds
    Sebastian

  • Faheem Mitha at Nov 24, 2009 at 4:49 am

    On Tue, 24 Nov 2009, Sebastian Jörgensen wrote:

    How often are the tables you query from updated?
    Quite rarely. Once in a while. The large tables, e.g. geno, are basically
    static.

    Regards, Faheem.


  • Robert Haas at Nov 25, 2009 at 5:27 pm

    I've found that a good way to approach optimizing queries of this type
    is to look at the EXPLAIN ANALYZE results and figure out which parts
    of the query are slow. Then simplify the rest of the query as much as
    possible without eliminating the slowness. Then try to figure out how
    to optimize the simplified query: rewrite the logic, add indices,
    change the schema, etc. Lastly start adding the other bits back in.

    It looks like the dedup_patient_anno CTE is part of your problem. Try
    pulling that piece out and optimizing it separately. I wonder if that
    could be rewritten to use SELECT DISTINCT ON (...) and whether that
    would be any faster. If not, you might want to look at some way of
    pre-marking the non-duplicate rows so that you don't have to recompute
    that each time. Then you might be able to use the underlying table
    directly in the next CTE, which will usually permit better
    optimization, more use of indices, etc. It seems pretty unfortunate
    that dedup_patient_anno joins against geno and then patient_geno does
    what appears to be the same join again. Is there some way to
    eliminate that? If so it will probably help.
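
    As a rough, untested sketch of the DISTINCT ON idea (column names are
    taken from the CTE quoted later in the thread, so treat them as an
    assumption about the schema):

    -- One row per anno.rsid, keeping the row with the smallest anno.id,
    -- which mimics row_number() OVER (PARTITION BY rsid ORDER BY id) = 1.
    SELECT DISTINCT ON (anno.rsid) anno.*
    FROM anno
    INNER JOIN geno ON anno.id = geno.anno_id
    WHERE geno.idlink_id = (SELECT MIN(id) FROM idlink)
    ORDER BY anno.rsid, anno.id;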

    Once you've got those parts of the query as well-optimized as you can,
    add the next pieces in and start hacking on those.

    ...Robert
  • Faheem Mitha at Nov 25, 2009 at 11:51 pm
    Hi Robert,

    Thanks very much for your suggestions.
    On Wed, 25 Nov 2009, Robert Haas wrote:
    I've found that a good way to approach optimizing queries of this type
    is to look at the EXPLAIN ANALYZE results and figure out which parts
    of the query are slow. Then simplify the rest of the query as much as
    possible without eliminating the slowness. Then try to figure out how
    to optimize the simplified query: rewrite the logic, add indices,
    change the schema, etc. Lastly start adding the other bits back in.
    Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough
    to figure out which bits are slow. :-)
    It looks like the dedup_patient_anno CTE is part of your problem. Try
    pulling that piece out and optimizing it separately. I wonder if that
    could be rewritten to use SELECT DISTINCT ON (...) and whether that
    would be any faster.
    Isn't SELECT DISTINCT supposed to be evil, since in general the result is
    not deterministic? I think I had SELECT DISTINCT earlier, and removed it
    because of that, with the help of Andrew (RhodiumToad on #postgresql). I
    didn't compare the corresponding subqueries separately, so I don't know
    what speed difference this made.
    If not, you might want to look at some way of pre-marking the
    non-duplicate rows so that you don't have to recompute that each time.
    What are the options re pre-marking?
    Then you might be able to use the underlying table directly in the next
    CTE, which will usually permit better optimization, more use of indices,
    etc. It seems pretty unfortunate that dedup_patient_anno joins against
    geno and then patient_geno does what appears to be the same join again.
    Is there some way to eliminate that? If so it will probably help.
    You don't say whether you are looking at the PED or TPED query, so I'll
    assume PED. They are similar anyway.

    I see your point re the joins. You mean

    anno INNER JOIN geno

    followed by

    geno INNER JOIN dedup_patient_anno

    ? I think the point of the first join is to reduce the anno table based on
    information from the geno table. The result is basically a subset of the
    anno table with some potential duplication removed, which is then
    re-joined to the geno table. I agree this seems a bit suboptimal, and
    there might be a better way to do this.
    Once you've got those parts of the query as well-optimized as you can,
    add the next pieces in and start hacking on those.
    Regards, Faheem.
  • Robert Haas at Nov 25, 2009 at 11:57 pm

    On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha wrote:
    Hi Robert,

    Thanks very much for your suggestions.
    I've found that a good way to approach optimizing queries of this type
    is to look at the EXPLAIN ANALYZE results and figure out which parts
    of the query are slow.  Then simplify the rest of the query as much as
    possible without eliminating the slowness.  Then try to figure out how
    to optimize the simplified query: rewrite the logic, add indices,
    change the schema, etc.  Lastly start adding the other bits back in.
    Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to
    figure out which bits are slow. :-)
    Well, you basically just look for the big numbers. The "actual"
    numbers are in ms, and each node includes the times for the things
    beneath it, so usually my approach is to just look at lower and lower
    levels of the tree (i.e. the parts that are more indented) until I
    find the lowest level that is slow. Then I look at the query bits
    presented there to figure out which piece of the SQL it corresponds
    to.

    Looking at the estimates (which are not in ms or any other particular
    unit) can be helpful too, in that it can help you find places where
    the planner thought it would be fast but it was actually slow. To do
    this, look at the top level of the query and get a sense of what the
    ratio between estimated-cost-units and actual-ms is. Then look for
    big (order of magnitude) deviations from this throughout the plan.
    Those are places where you want to either gather better statistics, or
    rewrite the query so that it can make better use of statistics. The
    latter is more of an art than a science - I or someone else on this
    list can help you with it if we find a specific case to look at.
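
    For instance, a hypothetical way to apply this to the CTE in question,
    pulling its body out and timing it on its own (table and column names
    are assumed from the queries in the thread):

    EXPLAIN ANALYZE
    SELECT anno.id
    FROM anno
    INNER JOIN geno ON anno.id = geno.anno_id
    WHERE geno.idlink_id = (SELECT MIN(id) FROM idlink);
    -- In the output, the "actual time" figures are in ms; compare them
    -- against the estimated costs node by node and look for
    -- order-of-magnitude mismatches.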
    It looks like the dedup_patient_anno CTE is part of your problem.  Try
    pulling that piece out and optimizing it separately.  I wonder if that
    could be rewritten to use SELECT DISTINCT ON (...) and whether that
    would be any faster.
    Isn't SELECT DISTINCT supposed to be evil, since in general the result is
    not deterministic? I think I had SELECT DISTINCT earlier, and removed it
    because of that, with the help of Andrew (RhodiumToad on #postgresql). I
    didn't compare the corresponding subqueries separately, so I don't know
    what speed difference this made.
    Well, any method of DISTINCT-ifying is likely to be somewhat slow, but
    I've had good luck with SELECT DISTINCT ON (...) in the past, as
    compared with other methods. YMMV - the only way to find out is to
    benchmark it. I don't think it's non-deterministic if you order by
    the DISTINCT-ON columns and enough extras to break any ties - you
    should get the first one of each set.
    If not, you might want to look at some way of pre-marking the
    non-duplicate rows so that you don't have to recompute that each time.
    What are the options re pre-marking?
    Well, what I usually do is - if I'm going to do the same
    distinct-ification frequently, I add an extra column (say, a boolean)
    and set it to true for all and only those rows which will pass the
    distinct-ification filter. Then I can just say WHERE <that column
    name>.
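
    A minimal sketch of that pre-marking idea (the flag column and the
    marking rule are invented here, and this version only captures the
    per-rsid dedup, not the idlink condition):

    ALTER TABLE anno ADD COLUMN dedup_keep boolean NOT NULL DEFAULT false;

    -- Mark exactly those rows that would survive the dedup.
    UPDATE anno SET dedup_keep = true
    WHERE id IN (SELECT DISTINCT ON (rsid) id
                 FROM anno
                 ORDER BY rsid, id);

    -- Later queries can then filter cheaply:
    --   ... FROM anno WHERE anno.dedup_keep ...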
    Then you might be able to use the underlying table directly in the next
    CTE, which will usually permit better optimization, more use of indices,
    etc.  It seems pretty unfortunate that dedup_patient_anno joins against geno
    and then patient_geno does what appears to be the same join again. Is there
    some way to eliminate that?  If so it will probably help.
    You don't say whether you are looking at the PED or TPED query, so I'll
    assume PED. They are similar anyway.

    I see your point re the joins. You mean

    anno INNER JOIN geno

    followed by

    geno INNER JOIN dedup_patient_anno

    ? I think the point of the first join is to reduce the anno table based on
    information from the geno table. The result is basically a subset of the
    anno table with some potential duplication removed, which is then re-joined
    to the geno table. I agree this seems a bit suboptimal, and there might be a
    better way to do this.
    Yeah, I didn't think about it in detail, but it looks like it should
    be possible. Eliminating joins can sometimes have *dramatic* effects
    on query performance, and it never hurts.

    ...Robert
  • Faheem Mitha at Nov 27, 2009 at 9:47 pm

    On Wed, 25 Nov 2009, Robert Haas wrote:

    On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha wrote:
    Well, any method of DISTINCT-ifying is likely to be somewhat slow, but
    I've had good luck with SELECT DISTINCT ON (...) in the past, as
    compared with other methods. YMMV - the only way to find out is to
    benchmark it. I don't think it's non-deterministic if you order by
    the DISTINCT-ON columns and enough extras to break any ties - you
    should get the first one of each set.
    Right, but adding enough extras to break ties is up to the user, and the
    language doesn't guarantee anything, so it feels more fragile.
    If not, you might want to look at some way of pre-marking the
    non-duplicate rows so that you don't have to recompute that each time.
    What are the options re pre-marking?
    Well, what I usually do is - if I'm going to do the same
    distinct-ification frequently, I add an extra column (say, a boolean)
    and set it to true for all and only those rows which will pass the
    distinct-ification filter. Then I can just say WHERE <that column
    name>.
    Yes, I see. The problem with premarking is that the selection is
    somewhat dynamic, in the sense that it depends on the idlink table,
    which depends on patient data, which can change.
    Then you might be able to use the underlying table directly in the next
    CTE, which will usually permit better optimization, more use of indices,
    etc.  It seems pretty unfortunate that dedup_patient_anno joins against geno
    and then patient_geno does what appears to be the same join again. Is there
    some way to eliminate that?  If so it will probably help.
    You don't say whether you are looking at the PED or TPED query, so I'll
    assume PED. They are similar anyway.

    I see your point re the joins. You mean

    anno INNER JOIN geno

    followed by

    geno INNER JOIN dedup_patient_anno

    ? I think the point of the first join is to reduce the anno table based on
    information from the geno table. The result is basically a subset of the
    anno table with some potential duplication removed, which is then re-joined
    to the geno table. I agree this seems a bit suboptimal, and there might be a
    better way to do this.
    Yeah, I didn't think about it in detail, but it looks like it should
    be possible. Eliminating joins can sometimes have *dramatic* effects
    on query performance, and it never hurts.
    Failing all else, couldn't I smoosh together the two queries and do a
    triple join? For reference, the two CTEs in question, from the PED query,
    are as follows.

    dedup_patient_anno AS
    ( SELECT *
      FROM
        ( SELECT *,
                 row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
          FROM anno
          INNER JOIN geno
                  ON anno.id = geno.anno_id
          WHERE idlink_id =
                ( SELECT MIN(id)
                  FROM idlink
                )
        ) AS s
      WHERE row_number = '1'
    ),
    patient_geno AS
    ( SELECT geno.idlink_id  AS idlink_id,
             geno.anno_id    AS anno_id,
             geno.snpval_id  AS snpval_id,
             allelea_id, alleleb_id
      FROM geno
      INNER JOIN dedup_patient_anno
              ON geno.anno_id = dedup_patient_anno.id
    ),

    Regards, Faheem.
  • Robert Haas at Nov 30, 2009 at 3:41 pm

    On Fri, Nov 27, 2009 at 4:47 PM, Faheem Mitha wrote:
    If not, you might want to look at some way of pre-marking the
    non-duplicate rows so that you don't have to recompute that each time.
    What are the options re pre-marking?
    Well, what I usually do is - if I'm going to do the same
    distinct-ification frequently, I add an extra column (say, a boolean)
    and set it to true for all and only those rows which will pass the
    distinct-ification filter.  Then I can just say WHERE <that column
    name>.
    Yes, I see. The problem with premarking is that the selection is somewhat
    dynamic, in the sense that it depends on the idlink table, which depends
    on patient data, which can change.
    Yeah. For things like this I find you have to think hard about how to
    organize your schema so that you can optimize the queries you care
    about. There are no "just do this and it works" solutions to
    performance problems of this type. Still, many of them are solvable
    by making the right decisions elsewhere. Sometimes you can use
    triggers to recompute your premarks when the data in the other table
    changes. Another strategy is to keep a cache of precomputed results
    somewhere. When the underlying data changes, you use triggers to
    invalidate anything in the cache that might now be wrong, and set
    things up so that it will be recomputed when next it is used. But in
    either case you have to figure out the right place to do the
    computation so that it gains you more than it costs you, and adjusting
    your schema is often necessary.
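
    One hedged illustration of the trigger approach (all object names are
    invented, and the dedup_keep flag is the one assumed in the earlier
    pre-marking sketch):

    CREATE OR REPLACE FUNCTION invalidate_dedup_marks() RETURNS trigger AS $$
    BEGIN
        -- idlink changed, so the premarked flags may no longer be valid;
        -- clear them and let a later step recompute them.
        UPDATE anno SET dedup_keep = false;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER idlink_changed
        AFTER INSERT OR UPDATE OR DELETE ON idlink
        FOR EACH STATEMENT
        EXECUTE PROCEDURE invalidate_dedup_marks();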
    Then you might be able to use the underlying table directly in the next
    CTE, which will usually permit better optimization, more use of indices,
    etc. It seems pretty unfortunate that dedup_patient_anno joins against geno
    and then patient_geno does what appears to be the same join again. Is there
    some way to eliminate that? If so it will probably help.
    You don't say whether you are looking at the PED or TPED query, so I'll
    assume PED. They are similar anyway.

    I see your point re the joins. You mean

    anno INNER JOIN geno

    followed by

    geno INNER JOIN dedup_patient_anno

    ? I think the point of the first join is to reduce the anno table based on
    information from the geno table. The result is basically a subset of the
    anno table with some potential duplication removed, which is then re-joined
    to the geno table. I agree this seems a bit suboptimal, and there might be a
    better way to do this.
    Yeah, I didn't think about it in detail, but it looks like it should
    be possible.  Eliminating joins can sometimes have *dramatic* effects
    on query performance, and it never hurts.
    Failing all else, couldn't I smoosh together the two queries and do a triple
    join? For reference, the two CTEs in question, from the PED query, are as
    follows.

    dedup_patient_anno AS
    ( SELECT *
      FROM
        ( SELECT *,
                 row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
          FROM anno
          INNER JOIN geno
                  ON anno.id = geno.anno_id
          WHERE idlink_id =
                ( SELECT MIN(id)
                  FROM idlink
                )
        ) AS s
      WHERE row_number = '1'
    ),
    patient_geno AS
    ( SELECT geno.idlink_id  AS idlink_id,
             geno.anno_id    AS anno_id,
             geno.snpval_id  AS snpval_id,
             allelea_id, alleleb_id
      FROM geno
      INNER JOIN dedup_patient_anno
              ON geno.anno_id = dedup_patient_anno.id
    ),
    If that will give the same results, which I'm not immediately certain
    about, then I highly recommend it. In general I would recommend only
    using CTEs to express concepts that can't sensibly be expressed in
    other ways, not to beautify your queries. Keep in mind that joins can
    be reordered and/or executed using different methods but most other
    operations can't be, so trying to get your joins together in one place
    is usually a good strategy, in my experience. And of course if that
    lets you reduce the total number of joins, that's even better.
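
    As a sketch of what folding the two CTEs into one might look like
    (untested, and using DISTINCT ON in place of row_number(); whether it
    returns exactly the same rows would need to be verified):

    patient_geno AS
    ( SELECT geno.idlink_id  AS idlink_id,
             geno.anno_id    AS anno_id,
             geno.snpval_id  AS snpval_id,
             allelea_id, alleleb_id
      FROM geno
      INNER JOIN
          ( SELECT DISTINCT ON (anno.rsid) anno.id
            FROM anno
            INNER JOIN geno ON anno.id = geno.anno_id
            WHERE geno.idlink_id = (SELECT MIN(id) FROM idlink)
            ORDER BY anno.rsid, anno.id
          ) AS dedup_patient_anno
              ON geno.anno_id = dedup_patient_anno.id
    ),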

    ...Robert
