On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha wrote:
Hi Robert,
Thanks very much for your suggestions.
I've found that a good way to approach optimizing queries of this type
is to look at the EXPLAIN ANALYZE results and figure out which parts
of the query are slow. Then simplify the rest of the query as much as
possible without eliminating the slowness. Then try to figure out how
to optimize the simplified query: rewrite the logic, add indices,
change the schema, etc. Lastly start adding the other bits back in.
Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to
figure out which bits are slow. :-)
Well, you basically just look for the big numbers. The "actual"
numbers are in ms, and each node includes the times for the things
beneath it, so usually my approach is to just look at lower and lower
levels of the tree (i.e. the parts that are more indented) until I
find the lowest level that is slow. Then I look at the query bits
presented there to figure out which piece of the SQL it corresponds
to.
Looking at the estimates (which are not in ms or any other particular
unit) can be helpful too, in that it can help you find places where
the planner thought it would be fast but it was actually slow. To do
this, look at the top level of the query and get a sense of what the
ratio between estimated-cost-units and actual-ms is. Then look for
big (order of magnitude) deviations from this throughout the plan.
Those are places where you want to either gather better statistics or
rewrite the query so that it can make better use of statistics. The
latter is more of an art than a science - I or someone else on this
list can help you with it if we find a specific case to look at.
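For instance, here is a completely made-up miniature plan (not from your
query - the numbers and column names are invented) just to show where to
look:

    Hash Join  (cost=25.00..780.00 rows=120 width=16) (actual time=3.2..640.5 rows=95000 loops=1)
      Hash Cond: (g.anno_id = a.id)
      ->  Seq Scan on geno g  (cost=0.00..510.00 rows=30000 width=12) (actual time=0.01..95.0 rows=30000 loops=1)
      ->  Hash  (cost=15.00..15.00 rows=800 width=8) (actual time=2.9..2.9 rows=800 loops=1)
            ->  Seq Scan on anno a  (cost=0.00..15.00 rows=800 width=8) (actual time=0.01..1.5 rows=800 loops=1)
    Total runtime: 655.02 ms

The second number in each "actual time=x..y" pair is the cumulative time in
ms for that node and everything under it. Here the two scans are cheap, so
the join itself accounts for most of the 640 ms, and it was estimated to
produce 120 rows but actually produced 95000 - exactly the kind of
estimate/actual mismatch worth chasing.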
It looks like the dedup_patient_anno CTE is part of your problem. Try
pulling that piece out and optimizing it separately. I wonder if that
could be rewritten to use SELECT DISTINCT ON (...) and whether that
would be any faster.
Isn't SELECT DISTINCT supposed to be evil, since in general the result is
not deterministic? I think I had SELECT DISTINCT earlier, and removed it
because of that, with the help of Andrew (RhodiumToad on #postgresql). I
didn't compare the corresponding subqueries separately, so I don't know
what speed difference this made.
Well, any method of DISTINCT-ifying is likely to be somewhat slow, but
I've had good luck with SELECT DISTINCT ON (...) in the past, as
compared with other methods. YMMV - the only way to find out is to
benchmark it. I don't think it's non-deterministic if you order by
the DISTINCT-ON columns and enough extras to break any ties - you
should get the first one of each set.
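For example, something along these lines (hypothetical column names, since
I don't have your schema in front of me):

    -- keep one row per (patient_id, snp_id); the trailing ORDER BY column
    -- breaks ties, so which row survives is well-defined
    SELECT DISTINCT ON (patient_id, snp_id)
           patient_id, snp_id, annotation
    FROM   anno
    ORDER  BY patient_id, snp_id, annotation;

PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER
BY expressions, and the extra trailing column is what pins down which row
of each set you get.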
If not, you might want to look at some way of pre-marking the
non-duplicate rows so that you don't have to recompute that each time.
What are the options re pre-marking?
Well, if I'm going to do the same distinct-ification frequently, what I
usually do is add an extra column (say, a boolean) and set it to true for
all and only those rows which will pass the
distinct-ification filter. Then I can just say WHERE <that column
name>.
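Roughly like this, say (hypothetical names again - an id primary key, with
patient_id/snp_id as the columns you de-duplicate on):

    -- one-off (or periodic) marking step
    ALTER TABLE anno ADD COLUMN is_canonical boolean NOT NULL DEFAULT false;

    -- mark one stable row per (patient_id, snp_id) group
    UPDATE anno a
    SET    is_canonical = true
    WHERE  a.id = (SELECT min(a2.id)
                   FROM   anno a2
                   WHERE  a2.patient_id = a.patient_id
                   AND    a2.snp_id     = a.snp_id);

    -- optional partial index so the filtered scans stay cheap
    CREATE INDEX anno_canonical_idx ON anno (patient_id, snp_id)
           WHERE is_canonical;

    -- downstream queries then just filter on the flag
    SELECT * FROM anno WHERE is_canonical;

The cost is that the flag has to be kept up to date when rows are added or
changed, either with a trigger or by re-running the marking step.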
Then you might be able to use the underlying table directly in the next
CTE, which will usually permit better optimization, more use of indices,
etc. It seems pretty unfortunate that dedup_patient_anno joins against geno
and then patient_geno does what appears to be the same join again. Is there
some way to eliminate that? If so it will probably help.
You don't say whether you are looking at the PED or TPED query, so I'll
assume PED. They are similar anyway.
I see your point re the joins. You mean
anno INNER JOIN geno
followed by
geno INNER JOIN dedup_patient_anno?
I think the point of the first join is to reduce the anno table based on
information from the geno table. The result is basically a subset of the
anno table with some potential duplication removed, which is then re-joined
to the geno table. I agree this seems a bit suboptimal, and there might be a
better way to do this.
Yeah, I didn't think about it in detail, but it looks like it should
be possible. Eliminating joins can sometimes have *dramatic* effects
on query performance, and it never hurts.
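For what it's worth, the shape I'd experiment with is doing the dedup and
the join in a single pass instead of dedup-then-rejoin, something like this
sketch (hypothetical join and dedup columns - whether it really computes
the same thing as dedup_patient_anno depends on your schema):

    -- join once and de-duplicate in the same step, rather than building a
    -- deduplicated CTE and then joining geno against it a second time
    SELECT DISTINCT ON (g.patient_id, g.snp_id)
           g.patient_id, g.snp_id, a.annotation
    FROM   geno g
           JOIN anno a ON a.snp_id = g.snp_id
    ORDER  BY g.patient_id, g.snp_id, a.annotation;

If that returns the same rows as the current two-step version, you save a
whole extra scan and join of geno.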
...Robert