Hi All,

It doesn't appear that ANALYZE uses the specified operator class for
producing statistics on an index when that operator class is not the
default for the data type. This appears to be leading to poor query
planning.

I'm using:

# select version();
-[ RECORD
1 ]---------------------------------------------------------------------
-------------------------
version | PostgreSQL 8.1.5 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 4.1.0 20060304 (Red Hat 4.1.0-3)

For speed of indexing a few million urls I'm indexing them with a
custom data type. The data type aptly named "urlhash" has:

A 2 byte host id (just the hostname, not many hosts are involved so
this is fine)
A 2 byte locality id (parts of the path)
A 16 byte MD5 of the full url all in a custom data type called aptly.

For a total length of 20 bytes. By default the text output of the
data type is of the form of ([data in hex]).

Everything is nice and fast (due to good locality) for the default
operator class that respects the entire data type. Everything is
unique so there is an n_distinct of -1 in pg_stats for the index.

Now, I created an operator class for urlhash that just does
comparisons on the host id "urlhash_host_ops" and I created the index
using:

# create index url_uh2_idx on url using btree (texttourlhash(url)
urlhash_host_ops);

After running vacuum analyze that statistics for the new index using
the second operator class still say that every value is unique. It
appears it's using the default operator class to produce the
statistics. By doing so, the planner doesn't work great since it
assumes that only one row will be returned.

Note: I prefixed all default operators with @ to denote the host class.

# explain select * from url where url @= 'http://
www.prolitegear.com/'::urlhash ;
QUERY PLAN
------------------------------------------------------------------------
-----------------
Index Scan using url_uh2_idx on url (cost=0.00..6.02 rows=1 width=105)
Index Cond: ((url)::urlhash @=
'(93d48c2ab505280c99322630a24f61c6533bc368)'::urlhash)


# select count(*) from url where url @= 'http://
www.prolitegear.com/'::urlhash;
count
-------
6992

There are hosts with orders of magnitude difference in total number
of rows (100,000 or more), this was just chosen as an example.

Here are the stats for the index:

# select * from pg_stats where tablename = 'url_uh2_idx';
-[ RECORD 1 ]-----
+-----------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------
schemaname | public
tablename | url_uh2_idx
attname | pg_expression_1
null_frac | 0
avg_width | 20
n_distinct | -1
most_common_vals |
most_common_freqs |
histogram_bounds | {(11890a55087088877a5a01ea5b8a8459e62491c9),
(3c983e95c87b3ba8669ade6fe1be797495be671f),
(4ac4698bc9a61b94eef527c6599217a0bbcd1ad6),
(4ac4b7c874b9b16ceab038e93646510a8fb756ca),
(60d74740f0a49e68e89e01232373ed2da0ca4c2f),
(67fe336eb73e327491a4e8bf224bce5857879188),
(7975336e56ad4ee211ebee9ff2a6f1e4c9e43ca5),
(88a9336e7a8e5419ff63ed1cec6985d91cd15688),
(909b76d12cf6c84255a33a008bf91f0cf0f711c4),
(d687336e90c41fc92776dc35896797ef873c38f5),
(f4aaba84e7e86a4c7d7c4d5930128cb3a5889cbf)}
correlation | -0.109073

Should I just create a cast function rather then an entire operator
class that just respects the host part? Is this bad style? If you
have questions I'd be happy to explain more.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

Search Discussions

  • Simon Riggs at Nov 6, 2006 at 8:54 pm

    On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote:

    It doesn't appear that ANALYZE uses the specified operator class for
    producing statistics on an index when that operator class is not the
    default for the data type. This appears to be leading to poor query
    planning.
    For speed of indexing a few million urls I'm indexing them with a
    custom data type. The data type aptly named "urlhash" has:
    Have you read the CREATE TYPE man page, specifically with regard to the
    analyze_function clause? Basically, if you want anything different, you
    have to write an analysis function yourself. This is what PostGIS
    (www.postgis.org) does, if you want to look for specific code examples.

    ANALYZE collects stats for tables, not indexes, using the default
    operator class for the datatype. So even though you've clearly specified
    an opclass for the index, no stats will be collected using it.

    Alternatively, perhaps you have fallen foul of this situation?

    /*
    * Can't analyze if the opclass uses a storage type
    * different from the expression result type. We'd get
    * confused because the type shown in pg_attribute for
    * the index column doesn't match what we are getting
    * from the expression. Perhaps this can be fixed
    * someday, but for now, punt.
    */

    It's in the analyze.c code, but not in the docs.

    Be interested in a full report of your research, once you're done.

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Rusty Conover at Nov 6, 2006 at 9:48 pm

    On Nov 6, 2006, at 1:53 PM, Simon Riggs wrote:
    On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote:

    It doesn't appear that ANALYZE uses the specified operator class for
    producing statistics on an index when that operator class is not the
    default for the data type. This appears to be leading to poor query
    planning.
    For speed of indexing a few million urls I'm indexing them with a
    custom data type. The data type aptly named "urlhash" has:
    Have you read the CREATE TYPE man page, specifically with regard to
    the
    analyze_function clause? Basically, if you want anything different,
    you
    have to write an analysis function yourself. This is what PostGIS
    (www.postgis.org) does, if you want to look for specific code
    examples.

    I have read it and works great, because it says it uses the default
    operators = and < to build the statistics which should work great for
    the default operator class. I don't think I need to write my own
    analyze function because the default behavior works so well. I just
    want the ANALYZE call to use the index's opclass definitions of = and
    < if the index is created with a custom operator class that is not
    the default for the data type.

    ANALYZE collects stats for tables, not indexes, using the default
    operator class for the datatype. So even though you've clearly
    specified
    an opclass for the index, no stats will be collected using it.
    I don't think thats correct because, with indexes based on functions
    there are statistics that clearly can't be gathered just directly
    from the table. Analyze does look at indexes to build up the
    histogram statistics and correlation. See compute_index_stats in
    analyze.c.

    Alternatively, perhaps you have fallen foul of this situation?
    /*
    * Can't analyze if the opclass uses a storage type
    * different from the expression result type. We'd get
    * confused because the type shown in pg_attribute for
    * the index column doesn't match what we are getting
    * from the expression. Perhaps this can be fixed
    * someday, but for now, punt.
    */

    It's in the analyze.c code, but not in the docs.

    Be interested in a full report of your research, once you're done.
    I don't think this applies because the storage type is the same its
    just a separate operator class being used.

    I think the problem may be with the "ordering_oper" and
    "equality_oper" only being passed the type rather then the type and
    the operator class that's being used for the index. Looking more it
    goes back into typcache.c which only caches the default operators
    from the default btree or hash operator classes. Changing this looks
    like it would require a bit of effort if it is the right path to pursue.

    Thanks,

    Rusty
    --
    Rusty Conover
    InfoGears Inc.
    Web: http://www.infogears.com
  • Simon Riggs at Nov 6, 2006 at 10:21 pm

    On Mon, 2006-11-06 at 14:47 -0700, Rusty Conover wrote:

    I just
    want the ANALYZE call to use the index's opclass definitions of = and
    < if the index is created with a custom operator class that is not
    the default for the data type.
    Which is exactly what the manual specifically says it doesn't do,
    therefore you'll need the analyze_function. That capability was put
    there deliberately to help you out, in this situation.
    I don't think I need to write my own analyze function
    That is the solution, not a workaround.

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Rusty Conover at Nov 6, 2006 at 10:54 pm

    On Nov 6, 2006, at 3:20 PM, Simon Riggs wrote:
    On Mon, 2006-11-06 at 14:47 -0700, Rusty Conover wrote:

    I just
    want the ANALYZE call to use the index's opclass definitions of = and
    < if the index is created with a custom operator class that is not
    the default for the data type.
    Which is exactly what the manual specifically says it doesn't do,
    therefore you'll need the analyze_function. That capability was put
    there deliberately to help you out, in this situation.

    Since the analyze function is part of the type not the operator class
    (making it type specific not operator class specific), it doesn't
    appear that it will be able to do what I ask since you can only have
    one analyze function per type and its not passed the operator context
    for the particular run.

    from analyze.h:

    /*----------
    * ANALYZE builds one of these structs for each attribute (column)
    that is
    * to be analyzed. The struct and subsidiary data are in
    anl_context,
    * so they live until the end of the ANALYZE operation.
    *
    * The type-specific typanalyze function is passed a pointer to this
    struct
    * and must return TRUE to continue analysis, FALSE to skip analysis
    of this
    * column. In the TRUE case it must set the compute_stats and
    minrows fields,
    * and can optionally set extra_data to pass additional info to
    compute_stats.
    * minrows is its request for the minimum number of sample rows to be
    gathered
    * (but note this request might not be honored, eg if there are fewer
    rows
    * than that in the table).
    *
    */

    I still think this is a deficiency in the analyze function to not use
    the operator_class that the index uses when producing statistics for
    that index.

    Thanks,

    Rusty
    --
    Rusty Conover
    InfoGears Inc.
    Web: http://www.infogears.com
  • Simon Riggs at Nov 6, 2006 at 11:11 pm

    On Mon, 2006-11-06 at 15:54 -0700, Rusty Conover wrote:

    I still think this is a deficiency in the analyze function to not use
    the operator_class that the index uses when producing statistics for
    that index.
    Agreed, but that isn't the way it works right now, AFAICS. TODO...

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Bruce Momjian at Nov 23, 2006 at 3:48 am

    Simon Riggs wrote:
    On Mon, 2006-11-06 at 15:54 -0700, Rusty Conover wrote:

    I still think this is a deficiency in the analyze function to not use
    the operator_class that the index uses when producing statistics for
    that index.
    Agreed, but that isn't the way it works right now, AFAICS. TODO...
    Do people want this as a TODO? What would the wording be?

    --
    Bruce Momjian bruce@momjian.us
    EnterpriseDB http://www.enterprisedb.com

    + If your life is a hard drive, Christ can be your backup. +
  • David Fetter at Nov 6, 2006 at 11:10 pm

    On Mon, Nov 06, 2006 at 08:53:28PM +0000, Simon Riggs wrote:
    On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote:

    It doesn't appear that ANALYZE uses the specified operator class for
    producing statistics on an index when that operator class is not the
    default for the data type. This appears to be leading to poor query
    planning.
    For speed of indexing a few million urls I'm indexing them with a
    custom data type. The data type aptly named "urlhash" has:
    Have you read the CREATE TYPE man page, specifically with regard to the
    analyze_function clause? Basically, if you want anything different, you
    have to write an analysis function yourself. This is what PostGIS
    (www.postgis.org) does, if you want to look for specific code examples.

    ANALYZE collects stats for tables, not indexes, using the default
    operator class for the datatype. So even though you've clearly specified
    an opclass for the index, no stats will be collected using it.

    Alternatively, perhaps you have fallen foul of this situation?

    /*
    * Can't analyze if the opclass uses a storage type
    * different from the expression result type. We'd get
    * confused because the type shown in pg_attribute for
    * the index column doesn't match what we are getting
    * from the expression. Perhaps this can be fixed
    * someday, but for now, punt.
    */

    It's in the analyze.c code, but not in the docs.
    Should a doc patch be in the offing here?

    Cheers,
    D
    --
    David Fetter <david@fetter.org> http://fetter.org/
    phone: +1 415 235 3778 AIM: dfetter666
    Skype: davidfetter

    Remember to vote!

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-bugs @
categoriespostgresql
postedNov 3, '06 at 10:16p
activeNov 23, '06 at 3:48a
posts8
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase