We sometimes transform IN-clauses to a list of ORs:

postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
QUERY PLAN
------------------------------------------------------
Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
Filter: ((a = b) OR (a = c))
(2 rows)

But what if you replace "a" with a volatile function? It doesn't seem
legal to do that transformation in that case, but we do it:

postgres=# explain SELECT * FROM foo WHERE (random()*2)::integer IN (b, c);
QUERY PLAN


-------------------------------------------------------------------------------------------------
-------------------
Seq Scan on foo (cost=0.00..68.20 rows=19 width=12)
Filter: ((((random() * 2::double precision))::integer = b) OR
(((random() * 2::double precision))::integer = c))
(2 rows)

I tried to read the SQL spec to see if it has anything to say about
that, but I couldn't find anything. My common sense says that that
transformation is not legal.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Search Discussions

  • Gianni Ciolli at Apr 1, 2011 at 12:10 pm

    On Fri, Apr 01, 2011 at 02:24:53PM +0300, Heikki Linnakangas wrote:

    I tried to read the SQL spec to see if it has anything to say about
    that, but I couldn't find anything. My common sense says that that
    transformation is not legal.
    Your feeling is correct; I would motivate it as follows.

    random() IN (b,c)

    is not equivalent to

    (random() = b) OR (random() = c)

    because the two random() will evaluate to two different numbers. So,
    for instance, if you define random_boolean() as either true or false
    randomly (and VOLATILEly), then

    random_boolean() IN (true, false)

    is always true, while

    (random_boolean() = true) OR (random_boolean() = false)

    is not (has probability 75%). For instance, the first random_boolean()
    might return false while the second one returns true.

    Best regards,
    Dr. Gianni Ciolli - 2ndQuadrant Italia
    PostgreSQL Training, Services and Support
    gianni.ciolli@2ndquadrant.it | www.2ndquadrant.it
  • Robert Haas at Apr 1, 2011 at 2:37 pm

    On Fri, Apr 1, 2011 at 7:24 AM, Heikki Linnakangas wrote:
    My common sense says that that transformation
    is not legal.
    +1.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Apr 2, 2011 at 6:09 pm

    Heikki Linnakangas writes:
    We sometimes transform IN-clauses to a list of ORs:
    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    ------------------------------------------------------
    Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))
    (2 rows)
    But what if you replace "a" with a volatile function? It doesn't seem
    legal to do that transformation in that case, but we do it:
    This is the fault of transformAExprIn(). But please let's *not* fix
    this by adding volatility to the set of heuristics used there. Looking
    at this again, it seems to me that most of the problem with this code
    is that we're trying to make optimization decisions in the parser.

    I think what we ought to do is have the parser emit a full-fledged
    InExpr node type (with semantics rather like CaseExpr) and then teach
    the planner to optimize that to something else when it seems
    safe/prudent to do so. One nontrivial advantage of that is that
    rules/views containing IN constructs would start to reverse-parse
    in the same fashion, instead of introducing weird substitute
    expressions.

    regards, tom lane
  • Heikki Linnakangas at Apr 5, 2011 at 8:24 am

    On 02.04.2011 20:48, Tom Lane wrote:
    Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
    We sometimes transform IN-clauses to a list of ORs:
    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    ------------------------------------------------------
    Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))
    (2 rows)
    But what if you replace "a" with a volatile function? It doesn't seem
    legal to do that transformation in that case, but we do it:
    This is the fault of transformAExprIn(). But please let's *not* fix
    this by adding volatility to the set of heuristics used there. Looking
    at this again, it seems to me that most of the problem with this code
    is that we're trying to make optimization decisions in the parser.
    Agreed. The history of this is that before 8.2 all IN clauses were
    transformed to OR clauses straight in the grammar. 8.2 added the code to
    represent IN clause as a ScalarArrayOpExpr, but it was changed in 8.2.10
    to use the OR-form again for Vars
    (http://archives.postgresql.org/pgsql-hackers/2008-10/msg01269.php)
    I think what we ought to do is have the parser emit a full-fledged
    InExpr node type (with semantics rather like CaseExpr) and then teach
    the planner to optimize that to something else when it seems
    safe/prudent to do so. One nontrivial advantage of that is that
    rules/views containing IN constructs would start to reverse-parse
    in the same fashion, instead of introducing weird substitute
    expressions.
    Here's my first cut at that. The lefthand expression is now evaluated
    only once, and stored in econtext->caseValue. Parse analysis turns the
    righthand expressions into a list of comparison expressions like
    "CaseTestExpr = value1". It's perhaps time that we rename CaseTestExpr
    into something more generic, now that it's used not only in CASE
    expressions, but also in IN and in UPDATE targets, but I didn't do that
    in this patch.

    eval_const_expressions checks the lefthand expression for volatile
    functions. If there aren't any, it transform the InExprs to a list of ORs.

    This isn't finished, because it doesn't yet do the transformation to
    ScalarArrayOpExpr. The OR form is much slower to plan, which is why the
    ScalarArrayOpExpr transformation was introduced in 8.2. I'll continue
    hacking on that, but please let me know if you have a better idea on how
    to handle that. One alternative is to teach the machinery that matches
    restrictinfos to usable indexes to handle InExpr like it does
    ScalarArrayOpExprs, but I don't know that code very well.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Marti Raudsepp at Apr 5, 2011 at 10:19 am

    On Fri, Apr 1, 2011 at 14:24, Heikki Linnakangas wrote:
    We sometimes transform IN-clauses to a list of ORs:

    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    Seq Scan on foo  (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))

    But what if you replace "a" with a volatile function? It doesn't seem legal
    to do that transformation in that case, but we do it:

    postgres=# explain SELECT * FROM foo WHERE (random()*2)::integer IN (b, c);
    QUERY PLAN

    Seq Scan on foo  (cost=0.00..68.20 rows=19 width=12)
    Filter: ((((random() * 2::double precision))::integer = b) OR (((random()
    * 2::double precision))::integer = c))
    Is there a similar problem with the BETWEEN clause transformation into
    AND expressions?

    marti=> explain verbose select random() between 0.25 and 0.75;
    Result (cost=0.00..0.02 rows=1 width=0)
    Output: ((random() >= 0.25::double precision) AND (random() <=
    0.75::double precision))

    As expected, I get a statistical skew of 0.4375 / 0.5625, whereas the
    "correct" would be 50/50:

    marti=> select random() between 0.25 and 0.75 as result, count(*) from
    generate_series(1,1000000) i group by 1;
    result | count
    --------+--------
    f | 437262
    t | 562738

    I also always noticed that BETWEEN with subqueries produces two
    subplan nodes, this seems suboptimal.

    marti=> explain verbose select (select random()) between 0.25 and 0.75;
    Result (cost=0.03..0.04 rows=1 width=0)
    Output: (($0 >= 0.25::double precision) AND ($1 <= 0.75::double precision))
    InitPlan 1 (returns $0)
    -> Result (cost=0.00..0.01 rows=1 width=0)
    Output: random()
    InitPlan 2 (returns $1)
    -> Result (cost=0.00..0.01 rows=1 width=0)
    Output: random()


    Regards,
    Marti
  • Heikki Linnakangas at Apr 5, 2011 at 3:42 pm

    On 05.04.2011 13:19, Marti Raudsepp wrote:
    On Fri, Apr 1, 2011 at 14:24, Heikki Linnakangas
    wrote:
    We sometimes transform IN-clauses to a list of ORs:

    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))

    But what if you replace "a" with a volatile function? It doesn't seem legal
    to do that transformation in that case, but we do it:

    postgres=# explain SELECT * FROM foo WHERE (random()*2)::integer IN (b, c);
    QUERY PLAN

    Seq Scan on foo (cost=0.00..68.20 rows=19 width=12)
    Filter: ((((random() * 2::double precision))::integer = b) OR (((random()
    * 2::double precision))::integer = c))
    Is there a similar problem with the BETWEEN clause transformation into
    AND expressions?

    marti=> explain verbose select random() between 0.25 and 0.75;
    Result (cost=0.00..0.02 rows=1 width=0)
    Output: ((random()>= 0.25::double precision) AND (random()<=
    0.75::double precision))
    Yes, good point.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at Apr 11, 2011 at 3:42 pm

    On 05.04.2011 18:42, Heikki Linnakangas wrote:
    On 05.04.2011 13:19, Marti Raudsepp wrote:
    On Fri, Apr 1, 2011 at 14:24, Heikki Linnakangas
    wrote:
    We sometimes transform IN-clauses to a list of ORs:

    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))

    But what if you replace "a" with a volatile function? It doesn't seem
    legal
    to do that transformation in that case, but we do it:

    postgres=# explain SELECT * FROM foo WHERE (random()*2)::integer IN
    (b, c);
    QUERY PLAN

    Seq Scan on foo (cost=0.00..68.20 rows=19 width=12)
    Filter: ((((random() * 2::double precision))::integer = b) OR
    (((random()
    * 2::double precision))::integer = c))
    Is there a similar problem with the BETWEEN clause transformation into
    AND expressions?

    marti=> explain verbose select random() between 0.25 and 0.75;
    Result (cost=0.00..0.02 rows=1 width=0)
    Output: ((random()>= 0.25::double precision) AND (random()<=
    0.75::double precision))
    Yes, good point.
    Hmm, the SQL specification explicitly says that

    X BETWEEN Y AND Z

    is equal to

    X >= Y AND X <= Z

    It doesn't say anything about side-effects of X. Seems like an oversight
    in the specification. I would not expect X to be evaluated twice, and I
    think we should change BETWEEN to not do that.


    Does anyone object to making BETWEEN and IN more strict about the data
    types? At the moment, you can do this:

    postgres=# SELECT '1234' BETWEEN '10001'::text AND 10002::int4;
    ?column?
    ----------
    t
    (1 row)

    I'm thinking that it should throw an error. Same with IN, if the values
    in the IN-list can't be coerced to a common type. That will probably
    simplify the code a lot, and is what the SQL standard assumes anyway AFAICS.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Apr 11, 2011 at 4:09 pm

    Heikki Linnakangas wrote:
    On 05.04.2011 18:42, Heikki Linnakangas wrote:
    On 05.04.2011 13:19, Marti Raudsepp wrote:
    On Fri, Apr 1, 2011 at 14:24, Heikki Linnakangas
    wrote:
    We sometimes transform IN-clauses to a list of ORs:

    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))

    But what if you replace "a" with a volatile function? It
    doesn't seem legal to do that transformation in that case, but
    we do it:

    postgres=# explain SELECT * FROM foo WHERE
    (random()*2)::integer IN (b, c);
    QUERY PLAN

    Seq Scan on foo (cost=0.00..68.20 rows=19 width=12)
    Filter: ((((random() * 2::double precision))::integer = b) OR
    (((random()
    * 2::double precision))::integer = c))
    Is there a similar problem with the BETWEEN clause
    transformation into AND expressions?

    marti=> explain verbose select random() between 0.25 and 0.75;
    Result (cost=0.00..0.02 rows=1 width=0)
    Output: ((random()>= 0.25::double precision) AND (random()<=
    0.75::double precision))
    Yes, good point.
    Hmm, the SQL specification explicitly says that

    X BETWEEN Y AND Z

    is equal to

    X >= Y AND X <= Z

    It doesn't say anything about side-effects of X. Seems like an
    oversight in the specification. I would not expect X to be
    evaluated twice, and I think we should change BETWEEN to not do
    that.
    Does the SQL spec explicitly say anything about how many times X
    should be evaluated if you were to code it as?:

    X >= Y AND X <= Z

    If it does, evaluating it a different number of times for BETWEEN
    would seem to be a deviation from standard. Evaluating it once seem
    less surprising, but if we're going to deviate from the standard in
    doing that, it at least deserves a clear note to that effect in the
    docs.

    Evaluating X once for BETWEEN seems better from a POLA perspective,
    unless you happen to be massaging a query to another form and
    trusting that the equivalence defined in the standard will always
    hold.
    Does anyone object to making BETWEEN and IN more strict about the
    data types? At the moment, you can do this:

    postgres=# SELECT '1234' BETWEEN '10001'::text AND 10002::int4;
    ?column?
    ----------
    t
    (1 row)

    I'm thinking that it should throw an error. Same with IN, if the
    values in the IN-list can't be coerced to a common type. That will
    probably simplify the code a lot, and is what the SQL standard
    assumes anyway AFAICS.
    +1 for more strict.

    -Kevin
  • Heikki Linnakangas at Apr 11, 2011 at 4:33 pm

    On 11.04.2011 19:06, Kevin Grittner wrote:
    Heikki Linnakangaswrote:
    On 05.04.2011 18:42, Heikki Linnakangas wrote:
    On 05.04.2011 13:19, Marti Raudsepp wrote:
    On Fri, Apr 1, 2011 at 14:24, Heikki Linnakangas
    wrote:
    We sometimes transform IN-clauses to a list of ORs:

    postgres=# explain SELECT * FROM foo WHERE a IN (b, c);
    QUERY PLAN
    Seq Scan on foo (cost=0.00..39.10 rows=19 width=12)
    Filter: ((a = b) OR (a = c))

    But what if you replace "a" with a volatile function? It
    doesn't seem legal to do that transformation in that case, but
    we do it:

    postgres=# explain SELECT * FROM foo WHERE
    (random()*2)::integer IN (b, c);
    QUERY PLAN

    Seq Scan on foo (cost=0.00..68.20 rows=19 width=12)
    Filter: ((((random() * 2::double precision))::integer = b) OR
    (((random()
    * 2::double precision))::integer = c))
    Is there a similar problem with the BETWEEN clause
    transformation into AND expressions?

    marti=> explain verbose select random() between 0.25 and 0.75;
    Result (cost=0.00..0.02 rows=1 width=0)
    Output: ((random()>= 0.25::double precision) AND (random()<=
    0.75::double precision))
    Yes, good point.
    Hmm, the SQL specification explicitly says that

    X BETWEEN Y AND Z

    is equal to

    X>= Y AND X<= Z

    It doesn't say anything about side-effects of X. Seems like an
    oversight in the specification. I would not expect X to be
    evaluated twice, and I think we should change BETWEEN to not do
    that.
    Does the SQL spec explicitly say anything about how many times X
    should be evaluated if you were to code it as?:

    X>= Y AND X<= Z
    Not explicitly. However, it does say that:

    "
    NOTE 258 — Since <between predicate> is an ordering operation, the
    Conformance Rules of Subclause 9.12, “Ordering
    operations”, also apply.
    "

    If I'm reading those ordering operation conformance rules correctly, it
    only allows the operand to be a simple column or an expression that's
    specified in the ORDER BY or similar, not an arbitrary expression. Which
    seems quite restrictive, but it would dodge the whole issue..

    The spec also has that:

    “X BETWEEN SYMMETRIC Y AND Z” is equivalent to “((X BETWEEN ASYMMETRIC Y
    AND Z)
    OR (X BETWEEN ASYMMETRIC Z AND Y))”.

    So if you take that into account too, X is evaluated four times. The SQL
    standard can be funny sometimes, but I can't believe that they intended
    that.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at Apr 19, 2011 at 6:50 pm

    On 11.04.2011 19:33, Heikki Linnakangas wrote:
    On 11.04.2011 19:06, Kevin Grittner wrote:
    Heikki Linnakangaswrote:
    Hmm, the SQL specification explicitly says that

    X BETWEEN Y AND Z

    is equal to

    X>= Y AND X<= Z

    It doesn't say anything about side-effects of X. Seems like an
    oversight in the specification. I would not expect X to be
    evaluated twice, and I think we should change BETWEEN to not do
    that.
    Does the SQL spec explicitly say anything about how many times X
    should be evaluated if you were to code it as?:

    X>= Y AND X<= Z
    Not explicitly. However, it does say that:

    "
    NOTE 258 — Since <between predicate> is an ordering operation, the
    Conformance Rules of Subclause 9.12, “Ordering
    operations”, also apply.
    "

    If I'm reading those ordering operation conformance rules correctly, it
    only allows the operand to be a simple column or an expression that's
    specified in the ORDER BY or similar, not an arbitrary expression. Which
    seems quite restrictive, but it would dodge the whole issue..
    Another data point on this: DB2 disallow volatile left-operand to BETWEEN

    db2 => SELECT * FROM atable WHERE smallint(rand()*10) BETWEEN 4 AND 5
    SQL0583N The use of routine or expression "SYSFUN.RAND" is invalid
    because it
    is not deterministic or has an external action. SQLSTATE=42845

    I'd like us to still fix this so that there's no multiple evaluation -
    that would actually make BETWEEN more useful than it is today. I'm
    working on a patch to handle both BETWEEN and IN.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Apr 19, 2011 at 7:27 pm

    Heikki Linnakangas writes:
    I'd like us to still fix this so that there's no multiple evaluation -
    that would actually make BETWEEN more useful than it is today. I'm
    working on a patch to handle both BETWEEN and IN.
    One other issue here is that the parser has traditionally handled
    BETWEEN by multiply linking the same expression tree. I've wanted to
    get rid of that behavior for a long time, but never got round to it.
    It causes a lot of headaches for later processing, because you have to
    be wary of multiply processing the same tree. If we switch BETWEEN
    to something with a dedicated representation we could probably get rid
    of all multiple-linking in the parser's output, allowing ensuing
    simplifications downstream.

    regards, tom lane
  • Tom Lane at Apr 11, 2011 at 4:35 pm

    Heikki Linnakangas writes:
    Does anyone object to making BETWEEN and IN more strict about the data
    types? At the moment, you can do this:
    postgres=# SELECT '1234' BETWEEN '10001'::text AND 10002::int4;
    ?column?
    ----------
    t
    (1 row)
    I'm thinking that it should throw an error. Same with IN, if the values
    in the IN-list can't be coerced to a common type.
    You *will* get push-back on that ... maybe from people with badly coded
    applications, but I guarantee there will be complaints.

    regards, tom lane

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedApr 1, '11 at 11:25a
activeApr 19, '11 at 7:27p
posts13
users6
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase