The following bug has been logged online:

Bug reference: 3979
Logged by: David Lee
Email address: david_lee@bigfix.com
PostgreSQL version: 8.2.6
Operating system: Ubuntu Feisty Server
Description: SELECT DISTINCT slow even on indexed column
Details:

\d x:

Column | Type | Modifiers
--------------+-----------------------------+-----------
a | integer | not null
b | integer | not null
time | timestamp without time zone | not null
remote_time | timestamp without time zone | not null
ip | inet | not null

The table has 20 million rows.

The table "x" has an index on ("a", "b").

I first tried:
SELECT DISTINCT a, b FROM x

but it was so slow.

I ran EXPLAIN and it showed that the path did not use the index, so I ran:

SET enable_seqscan = off;

and ran the query again.

Although it used the index, the query was still very slow.

Finally, I ran:
SELECT a, b FROM x GROUP BY a, b;

But it was still the same.

Next I created an index on ("a") and ran the query:
SELECT DISTINCT a FROM x

but the same thing happened (first didn't use the index; after turning
seq-scan off, was still slow; tried using GROUP BY, still slow).

The columns "a" and "b" are NOT NULL and has 100 distinct values each. The
indexes are all btree indexes.

Search Discussions

  • Jeff Davis at Feb 22, 2008 at 12:37 am

    On Thu, 2008-02-21 at 23:34 +0000, David Lee wrote:
    Finally, I ran:
    SELECT a, b FROM x GROUP BY a, b;

    But it was still the same.

    Next I created an index on ("a") and ran the query:
    SELECT DISTINCT a FROM x

    but the same thing happened (first didn't use the index; after turning
    seq-scan off, was still slow; tried using GROUP BY, still slow).

    The columns "a" and "b" are NOT NULL and has 100 distinct values each. The
    indexes are all btree indexes.
    If there are only 100 distinct values each, then that's only (at most)
    10k distinct (a,b) pairs.

    To me it sounds like it would be most efficient to use a HashAggregate,
    which can only be used by the "GROUP BY" variant of the query you ran
    (DISTINCT can't use that plan).

    First, try to force a HashAggregate and see what the results are. If
    that is faster, the planner is not choosing the right plan. Try ANALYZE
    to update the statistics, and if that doesn't work, post EXPLAIN
    results.

    Also, this post is somewhat off-topic for -bugs, try posting to -general
    or -performance with this type of question.

    Regards,
    Jeff Davis
  • Simon Riggs at Feb 25, 2008 at 10:33 am

    On Thu, 2008-02-21 at 23:34 +0000, David Lee wrote:

    I ran EXPLAIN and it showed that the path did not use the index, so I ran:
    Your expectation that this would use an index is unfortunately not
    correct.

    We need to check visibility on the table rows to do the query. We choose
    to do this by doing a sort and then a unique operation. That's the most
    efficient plan when there are potentially many unique values.

    In the case you mention it would be faster for us to skip through the
    index retrieving at most one row from each value in the index. We don't
    currently do that, but we could. However that plan would be restricted
    only to queries of the form
    SELECT DISTINCT column-list-of-index FROM table;

    so its probably not going to be optimised anytime soon.

    --
    Simon Riggs
    2ndQuadrant http://www.2ndQuadrant.com

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-bugs @
categoriespostgresql
postedFeb 21, '08 at 11:34p
activeFeb 25, '08 at 10:33a
posts3
users3
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase