FAQ
Today I ran into some interesting consequences of the xml data type
being without an "=" operator. One I thought I'd post here because it
has a *possible* planner impact. I'm not sure it is actually a bug as
such, but this seemed the best forum to post in initially:

test=# \d bug
Table "public.bug"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
val | xml |

test=# explain select val::text from bug;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)


Note the width estimate. However a more realistic estimate for width is:

test=# select 8192/(reltuples/relpages) as width from pg_class where
relname='bug';
width
------------------
394.130431739976

So we are going to massively underestimate the "size" of such a dataset.
Now this appears to be a consequence of no "=" operator (std_typanalyze
in analyze.c bails if there isn't one), so the planner has no idea about
how wide 'val' actually is. I'm wondering if it is worth having at least
an "=" operator to enable some minimal stats to be available for xml
columns.

regards

Mark

Search Discussions

  • Mark Kirkwood at May 27, 2010 at 1:38 am

    On 25/05/10 16:43, Mark Kirkwood wrote:
    Today I ran into some interesting consequences of the xml data type
    being without an "=" operator. One I thought I'd post here because it
    has a *possible* planner impact. I'm not sure it is actually a bug as
    such, but this seemed the best forum to post in initially:

    test=# \d bug
    Table "public.bug"
    Column | Type | Modifiers
    --------+---------+-----------
    id | integer |
    val | xml |

    test=# explain select val::text from bug;
    QUERY PLAN
    --------------------------------------------------------------
    Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)


    Note the width estimate. However a more realistic estimate for width is:

    test=# select 8192/(reltuples/relpages) as width from pg_class where
    relname='bug';
    width
    ------------------
    394.130431739976

    So we are going to massively underestimate the "size" of such a
    dataset. Now this appears to be a consequence of no "=" operator
    (std_typanalyze in analyze.c bails if there isn't one), so the planner
    has no idea about how wide 'val' actually is. I'm wondering if it is
    worth having at least an "=" operator to enable some minimal stats to
    be available for xml columns.
    Adding a minimal = op (see attached) and an analyze results in:

    test=# explain select val::text from bug;
    QUERY PLAN
    ---------------------------------------------------------------
    Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385)

    which gives a much better indication of dataset size.
  • Mark Kirkwood at Jun 3, 2010 at 11:17 pm

    On 27/05/10 13:37, Mark Kirkwood wrote:
    On 25/05/10 16:43, Mark Kirkwood wrote:
    Today I ran into some interesting consequences of the xml data type
    being without an "=" operator. One I thought I'd post here because it
    has a *possible* planner impact. I'm not sure it is actually a bug as
    such, but this seemed the best forum to post in initially:

    test=# \d bug
    Table "public.bug"
    Column | Type | Modifiers
    --------+---------+-----------
    id | integer |
    val | xml |

    test=# explain select val::text from bug;
    QUERY PLAN
    --------------------------------------------------------------
    Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)


    Note the width estimate. However a more realistic estimate for width is:

    test=# select 8192/(reltuples/relpages) as width from pg_class where
    relname='bug';
    width
    ------------------
    394.130431739976

    So we are going to massively underestimate the "size" of such a
    dataset. Now this appears to be a consequence of no "=" operator
    (std_typanalyze in analyze.c bails if there isn't one), so the
    planner has no idea about how wide 'val' actually is. I'm wondering
    if it is worth having at least an "=" operator to enable some minimal
    stats to be available for xml columns.
    Adding a minimal = op (see attached) and an analyze results in:

    test=# explain select val::text from bug;
    QUERY PLAN
    ---------------------------------------------------------------
    Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385)

    which gives a much better indication of dataset size.

    Maybe I gave this guy a bad title - is it a concern that the 'width'
    estimate is so far off for xml datatypes (because of no = op)? It seemed
    to me that this could result in some bad plan choices (e.g in subqueries
    etc).

    regards

    Mark
  • Robert Haas at Jun 9, 2010 at 3:22 am

    On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood wrote:
    Maybe I gave this guy a bad title - is it a concern that the 'width'
    estimate is so far off for xml datatypes (because of no = op)? It seemed to
    me that this could result in some bad plan choices (e.g in subqueries etc).
    It's possible. I don't really see a reason not to add an = operator
    for XML - does anyone else?

    It would need to be done by updating src/include/catalog/pg_*.h,
    rather than via SQL, of course.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Mark Kirkwood at Jun 9, 2010 at 3:29 am

    On 09/06/10 15:22, Robert Haas wrote:
    On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood
    wrote:
    Maybe I gave this guy a bad title - is it a concern that the 'width'
    estimate is so far off for xml datatypes (because of no = op)? It seemed to
    me that this could result in some bad plan choices (e.g in subqueries etc).
    It's possible. I don't really see a reason not to add an = operator
    for XML - does anyone else?

    It would need to be done by updating src/include/catalog/pg_*.h,
    rather than via SQL, of course.
    Heh, sure should - I merely included the SQL stuff in case anyone else
    wanted to reproduce what I was seeing!

    Cheers

    Mark
  • Tom Lane at Jun 9, 2010 at 5:15 am

    Robert Haas writes:
    It's possible. I don't really see a reason not to add an = operator
    for XML - does anyone else?
    Yes, that was considered and rejected, IIRC. What is your definition
    of equality for xml?

    regards, tom lane
  • Mark Kirkwood at Jun 9, 2010 at 5:42 am

    On 09/06/10 17:14, Tom Lane wrote:
    Robert Haas<robertmhaas@gmail.com> writes:
    It's possible. I don't really see a reason not to add an = operator
    for XML - does anyone else?
    Yes, that was considered and rejected, IIRC. What is your definition
    of equality for xml?
    Yes - but in that previous discussion the optimizer (lack of)
    information was not considered (or known I suspect), so maybe a rethink
    is worthwhile?

    It seems that the nub of this issue is that there are conceptually two
    types of =, one for datatype specific comparison, and one for optimizer
    statistical information calculation. However the system allows only the
    first, so if you don't (or can't) have one then you lose some possibly
    important optimization data.

    regards

    Mark
  • Tom Lane at Jun 9, 2010 at 2:18 pm

    Mark Kirkwood writes:
    It seems that the nub of this issue is that there are conceptually two
    types of =, one for datatype specific comparison, and one for optimizer
    statistical information calculation. However the system allows only the
    first, so if you don't (or can't) have one then you lose some possibly
    important optimization data.
    Nonsense. ANALYZE and the optimizer work with the datatype's usual
    notion of '=', whatever it is.

    It's possible that we should install a simplified code path in analyze.c
    that can collect width data for a column even in the absence of any '='
    operator. I'm less than convinced that it's worth the trouble though.
    Do you have an actual example where such data would have affected a
    plan choice?

    regards, tom lane
  • Mark Kirkwood at Jun 10, 2010 at 12:42 am

    On 10/06/10 02:17, Tom Lane wrote:
    Mark Kirkwood<mark.kirkwood@catalyst.net.nz> writes:
    It seems that the nub of this issue is that there are conceptually two
    types of =, one for datatype specific comparison, and one for optimizer
    statistical information calculation. However the system allows only the
    first, so if you don't (or can't) have one then you lose some possibly
    important optimization data.
    Nonsense. ANALYZE and the optimizer work with the datatype's usual
    notion of '=', whatever it is.
    Slow down the reading Tom... and read what I was actually saying - note
    the"conceptually". Of course the code uses the datatype's defined "=".
    It's possible that we should install a simplified code path in analyze.c
    that can collect width data for a column even in the absence of any '='
    operator.
    Yeah I was thinking along the same lines.
    Do you have an actual example where such data would have affected a
    plan choice?

    Not at the moment, I was thinking that anywhere that used such datatypes
    in a subquery of similar might be a likely case. I guess I was looking
    at this as a case of "this is an area where we have less accurate
    optimizer data that we could have", and thinking of ways to improve it.

    regards

    Mark
  • Robert Haas at Jun 9, 2010 at 2:53 pm

    On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    It's possible.  I don't really see a reason not to add an = operator
    for XML - does anyone else?
    Yes, that was considered and rejected, IIRC.  What is your definition
    of equality for xml?
    I'd vote for !memcmp().

    There can be (and probably already are) other ways to test for other
    kinds of equality, too, of course.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Tom Lane at Jun 9, 2010 at 3:08 pm

    Robert Haas writes:
    On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane wrote:
    Yes, that was considered and rejected, IIRC.  What is your definition
    of equality for xml?
    I'd vote for !memcmp().
    Surely not. xml is not text.

    regards, tom lane

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-bugs @
categoriespostgresql
postedMay 25, '10 at 4:44a
activeJun 10, '10 at 12:42a
posts11
users3
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase