Hi. I'm interested in understanding the differences between
CLUSTERing a table and making a dedicated one.

We have a table with about 1 million records. On a given day, only
about 1% of them are of interest. That 1% changes every day (it's
WHERE active_date = today), and so we index and cluster on it.

Even so, the planner shows a very large cost for the Index Scan: about
3500. If I instead do a SELECT INTO temp_table FROM big_table WHERE
active_date = today, and then do SELECT * FROM temp_table, I get a
planned cost of 65. Yet, the actual time for both queries is almost
identical.

Questions:
1. Why is there such a discrepancy between the planner's estimate and
the actual cost?

2. In a case like this, will I in general see a performance gain by
doing a daily SELECT INTO and then querying from that table? My ad hoc
test doesn't indicate I would (despite the planner's prediction), and
I'd rather avoid this if it won't help.

3. In general, does CLUSTER provide all the performance benefits of a
dedicated table? If it doesn't, what does it lack?

Thank you.

Search Discussions

  • Mark at Jun 2, 2011 at 1:18 am

    -----Original Message-----
    From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
    owner@postgresql.org] On Behalf Of Robert James
    Sent: Wednesday, June 01, 2011 5:55 PM
    To: pgsql-performance@postgresql.org
    Subject: [PERFORM] CLUSTER versus a dedicated table

    Hi. I'm interested in understanding the differences between
    CLUSTERing a table and making a dedicated one.

    We have a table with about 1 million records. On a given day, only
    about 1% of them are of interest. That 1% changes every day (it's
    WHERE active_date = today), and so we index and cluster on it.

    Even so, the planner shows a very large cost for the Index Scan: about
    3500. If I instead do a SELECT INTO temp_table FROM big_table WHERE
    active_date = today, and then do SELECT * FROM temp_table, I get a
    planned cost of 65. Yet, the actual time for both queries is almost
    identical.

    Questions:
    1. Why is there such a discrepancy between the planner's estimate and
    the actual cost?

    2. In a case like this, will I in general see a performance gain by
    doing a daily SELECT INTO and then querying from that table? My ad hoc
    test doesn't indicate I would (despite the planner's prediction), and
    I'd rather avoid this if it won't help.

    3. In general, does CLUSTER provide all the performance benefits of a
    dedicated table? If it doesn't, what does it lack?

    Thank you.
    Start here:
    http://wiki.postgresql.org/wiki/SlowQueryQuestions

    1: there could be many reasons for the planner to come up a grossly
    inaccurate ESTIMATE for some values. Last time the table was analyzed, is
    usually where people start.

    2: look at table partitioning it's pretty straight forward and sounds like
    it might be a good fit for you. It will however involve some triggers or
    rules and check constraints. Table partitioning has some downsides though,
    you should be aware of what they are before you commit to it.

    3: clustering, from a high level, just reorders the data on disk by a given
    index. Depending on your setup keeping it close to that clustered ordering
    might be trivial or it might not be. Big tables are relative to different
    people, 1M rows might be a big table or it might not be, since you didn't
    post the size of the table and indexes we can only guess. Table
    partitioning helps most with table maintenance, IMO, but can be very useful
    it the constraint exclusion can eliminate a large number of child tables
    right off so it doesn't have to traverse large indexes or do lots of random
    IO.


    You will need to post at lot more specific info if you want more specific
    help. The guide to reporting slow queries or guide to reporting problems
    and start gathering specific information and then post back to the list.



    -Mark
    --
    Sent via pgsql-performance mailing list (pgsql-
    performance@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-performance
  • Merlin Moncure at Jun 2, 2011 at 1:33 am

    On Wed, Jun 1, 2011 at 7:54 PM, Robert James wrote:
    Hi.  I'm interested in understanding the differences between
    CLUSTERing a table and making a dedicated one.

    We have a table with about 1 million records.  On a given day, only
    about 1% of them are of interest.  That 1% changes every day (it's
    WHERE active_date = today), and so we index and cluster on it.

    Even so, the planner shows a very large cost for the Index Scan: about
    3500.  If I instead do a SELECT INTO temp_table FROM big_table WHERE
    active_date = today, and then do SELECT * FROM temp_table, I get a
    planned cost of 65.  Yet, the actual time for both queries is almost
    identical.

    Questions:
    1. Why is there such a discrepancy between the planner's estimate and
    the actual cost?

    2. In a case like this, will I in general see a performance gain by
    doing a daily SELECT INTO and then querying from that table? My ad hoc
    test doesn't indicate I would (despite the planner's prediction), and
    I'd rather avoid this if it won't help.

    3. In general, does CLUSTER provide all the performance benefits of a
    dedicated table? If it doesn't, what does it lack?
    no. i suspect you may be over thinking the problem -- what led you to
    want to cluster in the first place?

    merlin

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedJun 1, '11 at 11:54p
activeJun 2, '11 at 1:33a
posts3
users3
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase