hello everybody,

we came across an issue which turned out to be more serious than previously expected.
imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

i tried this one with 5000 unindexed tables (just one col):

test=# \timing
Timing is on.
test=# prepare x(int4) AS select * from t_data order by id desc;
PREPARE
Time: 361.552 ms

you will see similar or higher planning times with 500 partitions and a handful of indexes.
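for reference, a setup along these lines reproduces the effect (a sketch only, not the exact script behind the numbers above; the parent/child names and the single id column are guessed from the query):

create table t_data (id int4);

DO $$DECLARE i int;
BEGIN
    FOR i IN 1..5000 LOOP
        -- one-column, unindexed children inheriting from t_data
        EXECUTE 'create table t_data_' || i || ' () INHERITS (t_data)';
    END LOOP;
END$$;

\timing
prepare x(int4) AS select * from t_data order by id desc;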

does anybody see a potential way to take a shortcut through the planner?
a prepared query is no solution here, as constraint exclusion would be dead in this case (making the entire thing an even bigger drama).

did anybody think of a solution to this problem?
or, more precisely: can there be a solution to this problem?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


  • Stephen Frost at Sep 3, 2010 at 12:04 pm

    * PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    did anybody think of a solution to this problem.
    or more precisely: can there be a solution to this problem?
    Please post to the correct list (-performance) and provide information
    like PG version, postgresql.conf, the actual table definition, the
    resulting query plan, etc, etc...

    Thanks,

    Stephen
  • Hans-Jürgen Schönig at Sep 3, 2010 at 12:16 pm

    On Sep 3, 2010, at 2:04 PM, Stephen Frost wrote:

    * PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    did anybody think of a solution to this problem.
    or more precisely: can there be a solution to this problem?
    Please post to the correct list (-performance) and provide information
    like PG version, postgresql.conf, the actual table definition, the
    resulting query plan, etc, etc...

    Thanks,

    Stephen


    hello stephen,

    this seems more like a developer question to me than a -performance one.
    it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
    it applies to postgres 9 and most likely to everything before.
    postgresql.conf is not relevant at all at this point.

    the plan is pretty fine.
    the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
    also, it is not the point if my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planning process.

    many thanks,

    hans

    --
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
  • Stephen Frost at Sep 3, 2010 at 12:27 pm

    * PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    this seems more like a developer question to me than a -performance one.
    it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
    it applies to postgres 9 and most likely to everything before.
    postgresql.conf is not relevant at all at this point.
    Really? What's constraint_exclusion set to? Is GEQO getting used?
    What are the GEQO parameters set to? Do you have any CHECK constraints
    on the tables?

    If you want someone else to build a test case and start looking into it,
    it's best to not make them have to guess at what you've done.
    the plan is pretty fine.
    the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
    also, it is not the point if my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planning process.
    Coming up with cases where PG doesn't perform well in a completely
    contrived unrealistic environment isn't likely to impress anyone to
    do anything.

    A small (but complete..) test case which mimics a real world environment
    and real world problem would go a lot farther. I can certainly believe
    that people out there have partitions set up with lots of tables and
    that it will likely grow - but they probably also have CHECK constraints,
    have tweaked what constraint_exclusion is set to, have adjusted their
    work_mem and related settings, maybe tweaked some planner GUCs, etc,
    etc.
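    For reference, the settings in question can be pulled straight from a psql session (a sketch; nothing here beyond standard GUCs and the pg_settings view):

    SELECT version();
    SHOW constraint_exclusion;
    SHOW geqo;
    SHOW geqo_threshold;
    SHOW work_mem;
    -- or everything that differs from the built-in defaults:
    SELECT name, setting FROM pg_settings WHERE source <> 'default';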

    This is an area I'm actually interested in and curious about, I'd rather
    work together on it than be told that the questions I'm asking aren't
    relevant.

    Thanks,

    Stephen
  • Robert Haas at Sep 3, 2010 at 1:27 pm

    2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
    i tried this one with 5000 unindexed tables (just one col):

    test=# \timing
    Timing is on.
    test=# prepare x(int4) AS select * from t_data order by id desc;
    PREPARE
    Time: 361.552 ms

    you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
    I'd like to see (1) a script to reproduce your test environment (as
    Stephen also requested) and (2) gprof or oprofile results.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Robert Haas at Sep 8, 2010 at 1:54 pm

    On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan wrote:
    Hi,

    Robert Haas írta:
    2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
    i tried this one with 5000 unindexed tables (just one col):

    test=# \timing
    Timing is on.
    test=# prepare x(int4) AS select * from t_data order by id desc;
    PREPARE
    Time: 361.552 ms

    you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
    I'd like to see (1) a script to reproduce your test environment (as
    Stephen also requested) and (2) gprof or oprofile results.
    attached are the test scripts, create_tables.sql and childtables.sql.
    The following query takes 4.7 seconds according to psql with \timing on:
    EXPLAIN SELECT * FROM qdrs
    WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
    ORDER BY streamhash;
    Neat. Have you checked what effect this has on memory consumption?

    Also, don't forget to add it to
    https://commitfest.postgresql.org/action/commitfest_view/open

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Hans-Jürgen Schönig at Sep 8, 2010 at 2:14 pm
    hello ...

    no, we have not checked memory consumption.
    there is still some stuff left to optimize away - it seems we are getting close to O(n^2) somewhere.
    "equal" is called really often in our sample case as well:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds      calls  s/call  s/call  name
    18.87      0.80     0.80       4812    0.00    0.00  make_canonical_pathkey
    15.33      1.45     0.65       4811    0.00    0.00  get_eclass_for_sort_expr
    14.15      2.05     0.60    8342410    0.00    0.00  equal
     6.13      2.31     0.26     229172    0.00    0.00  SearchCatCache
     3.66      2.47     0.16    5788835    0.00    0.00  _equalList
     3.07      2.60     0.13    1450043    0.00    0.00  hash_search_with_hash_value
     2.36      2.70     0.10    2272545    0.00    0.00  AllocSetAlloc
     2.12      2.79     0.09     811460    0.00    0.00  hash_any
     1.89      2.87     0.08    3014983    0.00    0.00  list_head
     1.89      2.95     0.08     574526    0.00    0.00  _bt_compare
     1.77      3.02     0.08   11577670    0.00    0.00  list_head
     1.42      3.08     0.06       1136    0.00    0.00  tzload
     0.94      3.12     0.04    2992373    0.00    0.00  AllocSetFreeIndex


    look at the number of calls ...
    "equal" is scary ...

    make_canonical_pathkey seems to be fixed.
    get_eclass_for_sort_expr seems a little more complex to fix.

    great you like it ...

    regards,

    hans


    On Sep 8, 2010, at 3:54 PM, Robert Haas wrote:
    On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan wrote:
    Hi,

    Robert Haas írta:
    2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
    i tried this one with 5000 unindexed tables (just one col):

    test=# \timing
    Timing is on.
    test=# prepare x(int4) AS select * from t_data order by id desc;
    PREPARE
    Time: 361.552 ms

    you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
    I'd like to see (1) a script to reproduce your test environment (as
    Stephen also requested) and (2) gprof or oprofile results.
    attached are the test scripts, create_tables.sql and childtables.sql.
    The following query takes 4.7 seconds according to psql with \timing on:
    EXPLAIN SELECT * FROM qdrs
    WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
    ORDER BY streamhash;
    Neat. Have you checked what effect this has on memory consumption?

    Also, don't forget to add it to
    https://commitfest.postgresql.org/action/commitfest_view/open

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company


    --
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt
    Web: http://www.postgresql-support.de
  • Stephen Frost at Sep 8, 2010 at 2:18 pm

    * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    no, we have not checked memory consumption.
    there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
    "equal" is called really often in our sample case as well:
    Did the mail with the scripts, etc, get hung up due to size or
    something..? I didn't see it on the mailing list nor in the archives..
    If so, could you post them somewhere so others could look..?

    Thanks,

    Stephen
  • Hans-Jürgen Schönig at Sep 8, 2010 at 2:26 pm
    here is the patch again.
    we accidentally attached a wrong test file to the original posting so it grew too big. we had to revoke it from the moderator queue (this happens if you code from 8am to 10pm).
    here is just the patch - it is nice and small.

    you can easily test it by making yourself a nice parent table, many subtables (hundreds or thousands) and a decent number of indexes per partition.
    then run PREPARE with \timing to see what happens.
    you should get scary planning times. the more potential indexes and tables, the scarier it will be.
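    something along these lines will do (a sketch with made-up names, not our actual test script; any parent with a few hundred children and several indexes per child shows the effect):

    create table parent (id int4, a int4, b int4, c int4);

    DO $$DECLARE i int;
    BEGIN
        FOR i IN 1..500 LOOP
            EXECUTE 'create table child_' || i || ' () INHERITS (parent)';
            -- several indexes per partition, most of them irrelevant to the query below
            EXECUTE 'create index child_' || i || '_a ON child_' || i || ' (a)';
            EXECUTE 'create index child_' || i || '_b ON child_' || i || ' (b)';
            EXECUTE 'create index child_' || i || '_c ON child_' || i || ' (c)';
        END LOOP;
    END$$;

    \timing
    prepare p(int4) AS select * from parent order by a desc;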

    using this wonderful RB tree the time for this function call goes down to basically zero.
    i hope this is something which is useful to some folks out there.

    many thanks,

    hans
  • Tom Lane at Sep 8, 2010 at 3:42 pm

    Hans-Jürgen Schönig <postgres@cybertec.at> writes:
    here is the patch again.
    This patch seems remarkably documentation-free. What is it trying to
    accomplish and what is it doing to the planner data structures?
    (Which do have documentation BTW.) Also, what will it do to runtime
    in normal cases where the pathkeys list isn't that long?

    regards, tom lane
  • Stephen Frost at Sep 8, 2010 at 2:57 pm

    * Robert Haas (robertmhaas@gmail.com) wrote:
    Neat. Have you checked what effect this has on memory consumption?

    Also, don't forget to add it to
    https://commitfest.postgresql.org/action/commitfest_view/open
    Would be good to have the patch updated to be against HEAD before
    posting to the commitfest.

    Thanks,

    Stephen
  • Hans-Jürgen Schönig at Sep 8, 2010 at 3:09 pm

    On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:

    * Robert Haas (robertmhaas@gmail.com) wrote:
    Neat. Have you checked what effect this has on memory consumption?

    Also, don't forget to add it to
    https://commitfest.postgresql.org/action/commitfest_view/open
    Would be good to have the patch updated to be against HEAD before
    posting to the commitfest.

    Thanks,

    Stephen


    we will definitely provide something against HEAD.
    but, it seems the problem we are looking at is not sufficiently fixed yet.
    in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

    regards,

    hans


    --
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt
    Web: http://www.postgresql-support.de
  • Stephen Frost at Sep 8, 2010 at 3:27 pm

    * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    but, it seems the problem we are looking at is not sufficiently fixed yet.
    in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
    An 18% increase is certainly nice, provided it doesn't slow down or
    break other things.. I'm looking through the patch now actually and
    I'm not really happy with the naming, comments, or some of the code
    flow, but I think the concept looks reasonable.

    Thanks,

    Stephen
  • Alvaro Herrera at Sep 8, 2010 at 3:39 pm

    Excerpts from Stephen Frost's message of mié sep 08 11:26:55 -0400 2010:
    * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    but, it seems the problem we are looking at is not sufficiently fixed yet.
    in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
    An 18% increase is certainly nice, provided it doesn't slow down or
    break other things.. I'm looking through the patch now actually and
    I'm not really happy with the naming, comments, or some of the code
    flow, but I think the concept looks reasonable.
    I don't understand the layering between pg_tree and rbtree. Why does it
    exist at all? At first I thought this was another implementation of
    rbtrees, but then I noticed it sits on top of it. Is this really
    necessary?

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Boszormenyi Zoltan at Sep 8, 2010 at 3:58 pm

    Alvaro Herrera írta:
    Excerpts from Stephen Frost's message of mié sep 08 11:26:55 -0400 2010:
    * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
    but, it seems the problem we are looking at is not sufficiently fixed yet.
    in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
    An 18% increase is certainly nice, provided it doesn't slow down or
    break other things.. I'm looking through the patch now actually and
    I'm not really happy with the naming, comments, or some of the code
    flow, but I think the concept looks reasonable.
    I don't understand the layering between pg_tree and rbtree. Why does it
    exist at all? At first I thought this was another implementation of
    rbtrees, but then I noticed it sits on top of it. Is this really
    necessary?
    No, if it's acceptable to omit PlannerInfo from outfuncs.c.
    Or at least its canon_pathkeys member. Otherwise yes, it's
    necessary. We need to store (Node *) in a fast searchable way.

    This applies to anything else that may need to be converted
    from list to tree to decrease planning time. Like ec_members
    in EquivalenceClass.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Tom Lane at Sep 8, 2010 at 4:08 pm

    Boszormenyi Zoltan writes:
    This applies to anything else that may need to be converted
    from list to tree to decrease planning time. Like ec_members
    in EquivalenceClass.
    AFAIR, canonical pathkeys are the *only* thing in the planner where pure
    pointer equality is interesting. So I doubt this hack is of any use for
    EquivalenceClass, even if the lists were likely to be long which they
    aren't.

    regards, tom lane
  • Boszormenyi Zoltan at Sep 8, 2010 at 5:06 pm

    Tom Lane írta:
    Boszormenyi Zoltan <zb@cybertec.at> writes:
    This applies to anything else that may need to be converted
    from list to tree to decrease planning time. Like ec_members
    in EquivalenceClass.
    AFAIR, canonical pathkeys are the *only* thing in the planner where pure
    pointer equality is interesting. So I doubt this hack is of any use for
    EquivalenceClass, even if the lists were likely to be long which they
    aren't.

    regards, tom lane
    No, for EquivalenceClass->ec_member, I need to do something
    funnier, like implement compare(Node *, Node *) and use that
    instead of equal(Node *, Node *)... Something like nodeToString()
    on both Node * and strcmp() the resulting strings.

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Tom Lane at Sep 8, 2010 at 5:19 pm

    Boszormenyi Zoltan writes:
    Tom Lane írta:
    AFAIR, canonical pathkeys are the *only* thing in the planner where pure
    pointer equality is interesting. So I doubt this hack is of any use for
    EquivalenceClass, even if the lists were likely to be long which they
    aren't.
    No, for EquivalenceClass->ec_member, I need to do something
    funnier, like implement compare(Node *, Node *) and use that
    instead of equal(Node *, Node *)... Something like nodeToString()
    on both Node * and strcmp() the resulting strings.
    Well, (a) that doesn't work (hint: there are fields in nodes that are
    intentionally ignored by equal()), and (b) I still don't believe that
    there's an actual bottleneck there. ECs generally aren't very big.

    regards, tom lane
  • Boszormenyi Zoltan at Sep 8, 2010 at 6:51 pm

    Tom Lane írta:
    Boszormenyi Zoltan <zb@cybertec.at> writes:
    Tom Lane írta:
    AFAIR, canonical pathkeys are the *only* thing in the planner where pure
    pointer equality is interesting. So I doubt this hack is of any use for
    EquivalenceClass, even if the lists were likely to be long which they
    aren't.
    No, for EquivalenceClass->ec_member, I need to do something
    funnier, like implement compare(Node *, Node *) and use that
    instead of equal(Node *, Node *)... Something like nodeToString()
    on both Node * and strcmp() the resulting strings.
    Well, (a) that doesn't work (hint: there are fields in nodes that are
    intentionally ignored by equal()),
    Then this compare() needs to work like equal() and can ignore
    the same fields that equal() ignores, too.
    nodeToString would need more space anyway and comparing
    non-equal Nodes can return the !0 result faster.
    and (b) I still don't believe that
    there's an actual bottleneck there. ECs generally aren't very big.
    Actually, PlannerInfo->eq_classes needs to be a Tree somehow,
    the comparator function is not yet final in my head.

    equal() is called over 8 million times with or without our patch:

    without

      %   cumulative   self              self     total
     time   seconds   seconds      calls  s/call  s/call  name
    18.87      0.80     0.80       4812    0.00    0.00  make_canonical_pathkey
    15.33      1.45     0.65       4811    0.00    0.00  get_eclass_for_sort_expr
    14.15      2.05     0.60    8342410    0.00    0.00  equal
     6.13      2.31     0.26     229172    0.00    0.00  SearchCatCache
     3.66      2.47     0.16    5788835    0.00    0.00  _equalList
     3.07      2.60     0.13    1450043    0.00    0.00  hash_search_with_hash_value
     2.36      2.70     0.10    2272545    0.00    0.00  AllocSetAlloc
     2.12      2.79     0.09     811460    0.00    0.00  hash_any
     1.89      2.87     0.08    3014983    0.00    0.00  list_head
     1.89      2.95     0.08     574526    0.00    0.00  _bt_compare
     1.77      3.02     0.08   11577670    0.00    0.00  list_head
     1.42      3.08     0.06       1136    0.00    0.00  tzload
     0.94      3.12     0.04    2992373    0.00    0.00  AllocSetFreeIndex
     0.94      3.16     0.04      91427    0.00    0.00  _bt_checkkeys
    ...

    with

      %   cumulative   self              self     total
     time   seconds   seconds      calls  s/call  s/call  name
    24.51      0.88     0.88       4811    0.00    0.00  get_eclass_for_sort_expr
    14.21      1.39     0.51    8342410    0.00    0.00  equal
     8.22      1.69     0.30    5788835    0.00    0.00  _equalList
     5.29      1.88     0.19     229172    0.00    0.00  SearchCatCache
     2.51      1.97     0.09       1136    0.00    0.00  tzload
     2.23      2.05     0.08    3014983    0.00    0.00  list_head
     2.23      2.13     0.08    2283130    0.00    0.00  AllocSetAlloc
     2.09      2.20     0.08     811547    0.00    0.00  hash_any
     2.09      2.28     0.08   11577670    0.00    0.00  list_head
     1.95      2.35     0.07    1450180    0.00    0.00  hash_search_with_hash_value
     1.39      2.40     0.05     640690    0.00    0.00  _bt_compare
     1.39      2.45     0.05     157944    0.00    0.00  LockAcquireExtended
     1.39      2.50     0.05      11164    0.00    0.00  AllocSetCheck
     1.11      2.54     0.04    3010547    0.00    0.00  AllocSetFreeIndex
     1.11      2.58     0.04     874975    0.00    0.00  AllocSetFree
     1.11      2.62     0.04      66211    0.00    0.00  heap_form_tuple
     0.84      2.65     0.03     888128    0.00    0.00  LWLockRelease
    ...

    The number of calls is the same for equal and _equalList
    in both cases as you can see.

    And if you compare the number of AllocSetAlloc calls,
    it's actually smaller with our patch, so it's not that the
    conversion to Tree caused this.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Tom Lane at Sep 8, 2010 at 7:27 pm

    Boszormenyi Zoltan writes:
    equal() is called over 8 million times with or without our patch:
    From where, though? You've provided not a shred of evidence that
    searching large ec_member lists is the problem.

    Also, did the test case you're using ever make it to the list?

    regards, tom lane
  • Boszormenyi Zoltan at Sep 10, 2010 at 9:15 am

    Tom Lane írta:
    Boszormenyi Zoltan <zb@cybertec.at> writes:
    equal() is called over 8 million times with or without our patch:
    From where, though? You've provided not a shred of evidence that
    searching large ec_member lists is the problem.
    Indeed. I have put elog(NOTICE) calls in there to see how long
    each list is. It turned out that the length of ec_members is either 0
    or 1, mostly 1, but the length of eq_classes is constantly growing.
    This is what I need to attack then.
    Also, did the test case you're using ever make it to the list?
    No, because it was too large and because the test case
    accidentally contained confidential information, we asked
    Bruce to delete it from the moderator queue.

    Attached is a shortened test case that does the same and
    also shows the same problem. The create_table.sql
    creates the parent table, the table_generator.sql produces
    the list of constraint excluded child tables and their indexes.

    The gprof output of this is only slightly modified, the
    number of equal() calls is still over 8 million, it is also
    attached.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Boszormenyi Zoltan at Oct 19, 2010 at 1:39 pm
    Hi,

    attached is a WIP patch against 9.1 current GIT that converts
    eq_classes and canon_pathkeys in PlannerInfo to trees.

    Also attached is the test case again; the slow query is:

    explain select * from inh_parent
    where timestamp1 between '2010-04-06' and '2010-06-25'
    order by timestamp2;

    There is intentionally no data; it is the planning time that is slow.
    The current GIT version plans this query in 2.4 seconds,
    the patched version does it in 0.59 seconds according to
    gprof. The gprof outputs are also attached.

    There is one problem with the patch, it doesn't survive
    "make check". One of the regression tests fails the
    Assert(!cur_em->em_is_child);
    line in process_equivalence() in equivclass.c, but I couldn't
    yet find what causes it. The "why" is vaguely clear:
    something modifies the ec_members list in the eq_classes'
    tree nodes while the node is in the tree. Because I didn't find
    the offender yet, I couldn't fix it, so I am sending this patch as is.
    I'll try to fix it unless someone beats me to it. :)

    The query produces the same EXPLAIN output for both the
    stock and the patched version; they were checked with diff.
    I didn't attach it to this mail because of the size constraints.
    Almost all files are compressed because of this.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Boszormenyi Zoltan at Oct 19, 2010 at 7:51 pm
    Hi,

    Boszormenyi Zoltan írta:
    There is one problem with the patch, it doesn't survive
    "make check". One of the regression tests fails the
    Assert(!cur_em->em_is_child);
    line in process_equivalence() in equivclass.c, but I couldn't
    yet find what causes it. The "why" is vaguely clear:
    something modifies the ec_members list in the eq_classes'
    tree nodes while the node is in the tree. Because I didn't find
    the offender yet, I couldn't fix it, so I am sending this patch as is.
    I'll try to fix it unless someone beats me to it. :)
    I am a little closer to this bug, maybe I even found the cause of it.
    I found that process_equivalence() is called from:

    path/equivclass.c:
    reconsider_outer_join_clause()
    reconsider_full_join_clause()
    plan/initsplan.c:
    distribute_qual_to_rels()

    The problem is with the two functions in path/equivclass.c,
    as process_equivalence() and those functions all walk
    the tree, and the current RBTree code can only deal with
    one walk at a time. We need to push/pop the iterator state
    to be able to serve more than one walker.

    Also, we need to split out the tree modifying part from
    process_equivalence() somehow, as the tree walking
    also cannot deal with node additions and deletions.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Tom Lane at Oct 19, 2010 at 8:15 pm

    Boszormenyi Zoltan writes:
    The problem is with the two functions in path/equivclass.c,
    as process_equivalence() and those functions all walk
    the tree, and the current RBTree code can only deal with
    one walk at a time. We need to push/pop the iterator state
    to be able to serve more than one walker.
    Good luck with that --- the iteration state is embedded in the rbtree.
    Also, we need to split out the tree modifying part from
    process_equivalence() somehow, as the tree walking
    also cannot deal with node additions and deletions.
    That's not happening either. Maybe you need to think of some other data
    structure to use. Hashtable maybe? dynahash.c at least has reasonably
    well-defined limitations in this area.

    regards, tom lane
  • Boszormenyi Zoltan at Oct 26, 2010 at 3:35 pm
    Hi,

    Tom Lane írta:
    Boszormenyi Zoltan <zb@cybertec.at> writes:
    The problem is with the two functions in path/equivclass.c,
    as process_equivalence() and those functions all walk
    the tree, and the current RBTree code can only deal with
    one walk at a time. We need to push/pop the iterator state
    to be able to serve more than one walker.
    Good luck with that --- the iteration state is embedded in the rbtree.

    Also, we need to split out the tree modifying part from
    process_equivalence() somehow, as the tree walking
    also cannot deal with node additions and deletions.
    That's not happening either. Maybe you need to think of some other data
    structure to use. Hashtable maybe? dynahash.c at least has reasonably
    well-defined limitations in this area.

    regards, tom lane
    thank you very much for pointing me to dynahash, here is the
    next version that finally seems to work.

    Two patches are attached, the first is the absolute minimum for
    making it work, this still has the Tree type for canon_pathkeys
    and eq_classes got the same treatment as join_rel_list/join_rel_hash
    has in the current sources: if the list grows larger than 32, a hash table
    is created. It seems to be enough to do this for
    get_eclass_for_sort_expr()
    only, the other users of eq_classes aren't bothered by this change.

    The total speedup figure is in the 70+ percent range from these
    two changes; a slightly later GIT version than the tree
    I tested with before shows 1.74 vs. 0.41 seconds runtime for the
    example query. These are with asserts and profiling enabled of
    course. Without asserts and profiling enabled, the "time psql"
    figures are:

    $ time psql -p 54321 -c "explain select * from inh_parent where
    timestamp1 between '2010-04-06' and '2010-06-25' order by timestamp2" \
        > /dev/null
    real 0m1.932s
    user 0m0.035s
    sys 0m0.002s

    vs.

    real 0m0.630s
    user 0m0.033s
    sys 0m0.002s

    The second patch contains extra infrastructure for the Tree type,
    it's currently unused, it was created for experimenting with eq_classes
    being a tree. It may be useful for someone, though.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Heikki Linnakangas at Oct 27, 2010 at 3:41 pm

    On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
    thank you very much for pointing me to dynahash, here is the
    next version that finally seems to work.

    Two patches are attached, the first is the absolute minimum for
    making it work, this still has the Tree type for canon_pathkeys
    and eq_classes got the same treatment as join_rel_list/join_rel_hash
    has in the current sources: if the list grows larger than 32, a hash table
    is created. It seems to be enough to do this for
    get_eclass_for_sort_expr()
    only, the other users of eq_classes aren't bothered by this change.
    That's better, but can't you use dynahash for canon_pathkeys as well?


    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Boszormenyi Zoltan at Oct 28, 2010 at 9:33 am

    Heikki Linnakangas írta:
    On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
    thank you very much for pointing me to dynahash, here is the
    next version that finally seems to work.

    Two patches are attached, the first is the absolute minimum for
    making it work, this still has the Tree type for canon_pathkeys
    and eq_classes got the same treatment as join_rel_list/join_rel_hash
    has in the current sources: if the list grows larger than 32, a hash
    table
    is created. It seems to be enough to do this for
    get_eclass_for_sort_expr()
    only, the other users of eq_classes aren't bothered by this change.
    That's better, but can't you use dynahash for canon_pathkeys as well?
    Here's a purely dynahash solution. It's somewhat slower than
    the tree version, 0.45 vs 0.41 seconds in the cached case for the
    previously posted test case.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Boszormenyi Zoltan at Oct 28, 2010 at 9:40 am

    Boszormenyi Zoltan írta:
    Heikki Linnakangas írta:
    On 26.10.2010 18:34, Boszormenyi Zoltan wrote:

    thank you very much for pointing me to dynahash, here is the
    next version that finally seems to work.

    Two patches are attached, the first is the absolute minimum for
    making it work, this still has the Tree type for canon_pathkeys
    and eq_classes got the same treatment as join_rel_list/join_rel_hash
    has in the current sources: if the list grows larger than 32, a hash
    table
    is created. It seems to be enough to do this for
    get_eclass_for_sort_expr()
    only, the other users of eq_classes aren't bothered by this change.
    That's better, but can't you use dynahash for canon_pathkeys as well?
    Here's a purely dynahash solution. It's somewhat slower than
    the tree version, 0.45 vs 0.41 seconds in the cached case for the
    previously posted test case.
    And now in context diff, sorry for my affection towards unified diffs. :-)
    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Boszormenyi Zoltan at Oct 28, 2010 at 10:55 am

    Boszormenyi Zoltan írta:
    Boszormenyi Zoltan írta:
    Heikki Linnakangas írta:

    On 26.10.2010 18:34, Boszormenyi Zoltan wrote:

    thank you very much for pointing me to dynahash, here is the
    next version that finally seems to work.

    Two patches are attached, the first is the absolute minimum for
    making it work, this still has the Tree type for canon_pathkeys
    and eq_classes got the same treatment as join_rel_list/join_rel_hash
    has in the current sources: if the list grows larger than 32, a hash
    table
    is created. It seems to be enough to do this for
    get_eclass_for_sort_expr()
    only, the other users of eq_classes aren't bothered by this change.
    That's better, but can't you use dynahash for canon_pathkeys as well?
    Here's a purely dynahash solution. It's somewhat slower than
    the tree version, 0.45 vs 0.41 seconds in the cached case for the
    previously posted test case.
    And now in context diff, sorry for my affection towards unified diffs. :-)
    A little better version, no need for the heavy hash_any, hash_uint32
    on the lower 32 bits on pk_eclass is enough. The profiling runtime
    is now 0.42 seconds vs the previous 0.41 seconds for the tree version.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Boszormenyi Zoltan at Oct 28, 2010 at 11:29 am

    Boszormenyi Zoltan írta:
    Boszormenyi Zoltan írta:
    Boszormenyi Zoltan írta:

    Heikki Linnakangas írta:


    On 26.10.2010 18:34, Boszormenyi Zoltan wrote:


    thank you very much for pointing me to dynahash, here is the
    next version that finally seems to work.

    Two patches are attached, the first is the absolute minimum for
    making it work, this still has the Tree type for canon_pathkeys
    and eq_classes got the same treatment as join_rel_list/join_rel_hash
    has in the current sources: if the list grows larger than 32, a hash
    table
    is created. It seems to be enough to do this for
    get_eclass_for_sort_expr()
    only, the other users of eq_classes aren't bothered by this change.

    That's better, but can't you use dynahash for canon_pathkeys as well?

    Here's a purely dynahash solution. It's somewhat slower than
    the tree version, 0.45 vs 0.41 seconds in the cached case for the
    previously posted test case.

    And now in context diff, sorry for my affection towards unified diffs. :-)
    A little better version, no need for the heavy hash_any, hash_uint32
    on the lower 32 bits on pk_eclass is enough. The profiling runtime
    is now 0.42 seconds vs the previous 0.41 seconds for the tree version.

    Best regards,
    Zoltán Böszörményi
    Btw, the top entries in the current gprof output are:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds      calls  ms/call  ms/call  name
    19.05      0.08     0.08        482     0.17     0.29  add_child_rel_equivalences
    11.90      0.13     0.05    1133447     0.00     0.00  bms_is_subset
     9.52      0.17     0.04     331162     0.00     0.00  hash_search_with_hash_value
     7.14      0.20     0.03     548971     0.00     0.00  AllocSetAlloc
     4.76      0.22     0.02       2858     0.01     0.01  get_tabstat_entry
     4.76      0.24     0.02       1136     0.02     0.02  tzload

    This means add_child_rel_equivalences() still takes
    too much time, the previously posted test case calls this
    function 482 times, it's called for almost every 10th entry
    added to eq_classes. The elog() I put into this function says
    that at the last call list_length(eq_classes) == 4754.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Tom Lane at Oct 28, 2010 at 1:37 pm

    Boszormenyi Zoltan writes:
    This means add_child_rel_equivalences() still takes
    too much time, the previously posted test case calls this
    function 482 times, it's called for almost every 10th entry
    added to eq_classes. The elog() I put into this function says
    that at the last call list_length(eq_classes) == 4754.
    That seems like a ridiculously large number of ECs. What is the
    test query again?

    regards, tom lane
  • Boszormenyi Zoltan at Oct 28, 2010 at 5:11 pm

    Tom Lane írta:
    Boszormenyi Zoltan <zb@cybertec.at> writes:
    This means add_child_rel_equivalences() still takes
    too much time, the previously posted test case calls this
    function 482 times, it's called for almost every 10th entry
    added to eq_classes. The elog() I put into this function says
    that at the last call list_length(eq_classes) == 4754.
    That seems like a ridiculously large number of ECs. What is the
    test query again?

    regards, tom lane
    The test case is here:
    http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

    create_table.sql for the main table plus childtables.sql.gz, the EXPLAIN
    query is in the message body.

    Basically, it's a model for a lot of data for three months, partitioned by
    4 hour intervals for every day. Imagine the call list handled by a
    phone company in a large country.
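    In outline the generator does something like this (a sketch only, not the attached scripts; the column names come from the test query, while the exact date range and index set are assumptions):

    create table inh_parent (timestamp1 timestamptz, timestamp2 timestamptz);

    DO $$
    DECLARE ts timestamptz := '2010-04-01';
            i  int := 0;
    BEGIN
        -- three months in 4-hour slices, each child constraint-excludable on timestamp1
        WHILE ts < '2010-07-01' LOOP
            EXECUTE 'create table inh_child_' || i
                 || ' ( CHECK (timestamp1 >= ' || quote_literal(ts)
                 || ' AND timestamp1 < ' || quote_literal(ts + interval '4 hours')
                 || ') ) INHERITS (inh_parent)';
            -- the real childtables.sql also creates several indexes per child, e.g.:
            EXECUTE 'create index inh_child_' || i || '_ts1 ON inh_child_' || i || ' (timestamp1)';
            EXECUTE 'create index inh_child_' || i || '_ts2 ON inh_child_' || i || ' (timestamp2)';
            ts := ts + interval '4 hours';
            i  := i + 1;
        END LOOP;
    END$$;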

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Tom Lane at Oct 28, 2010 at 10:57 pm

    Boszormenyi Zoltan writes:
    Tom Lane írta:
    That seems like a ridiculously large number of ECs. What is the
    test query again?
    The test case is here:
    http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at
    After poking through that a bit, I think that the real issue is in this
    division of labor:

    index_pathkeys = build_index_pathkeys(root, index,
                                          ForwardScanDirection);
    useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                index_pathkeys);

    If you trace what is happening here, the index pathkeys that actually
    survive the "usefulness" test all refer to exactly ONE equivalence
    class, namely the one arising from the query's "order by timestamp2"
    clause. All the other pathkeys that get created are immediately
    discarded as being irrelevant to the query. The reason that we end up
    with so many equivalence classes is that there is nothing causing the
    variables of the different child tables to be recognized as all
    sort-equivalent. Maybe that's a bug in itself, but I would argue that
    the right way to make this faster is to refactor things so that we
    don't generate useless equivalence classes in the first place, or
    at least don't keep them around in the planner's lists once we realize
    they're useless.

    I like Heikki's hack to cut down on searching in make_canonical_pathkey,
    but I think that complicating the data structure searching beyond that
    is just a band-aid. Reasonably-sized queries shouldn't contain very
    many equivalence classes: they should only come from equality clauses
    or sort conditions that appeared in the query text. Therefore, there
    also shouldn't be all that many distinct pathkeys.

    regards, tom lane
  • Tom Lane at Oct 29, 2010 at 6:09 am

    I wrote:
    the right way to make this faster is to refactor things so that we
    don't generate useless equivalence classes in the first place, or
    at least don't keep them around in the planner's lists once we realize
    they're useless.
    After a bit of hacking, I propose the attached patch.
    I like Heikki's hack to cut down on searching in make_canonical_pathkey,
    but I think that complicating the data structure searching beyond that
    is just a band-aid.
    With the given test case and this patch, we end up with exactly two
    canonical pathkeys referencing a single EquivalenceClass. So as far
    as I can tell there's not a lot of point in refining the pathkey
    searching. Now, the EquivalenceClass has got 483 members, which
    means that there's still some O(N^2) behavior in
    get_eclass_for_sort_expr. There might be some use in refining the
    search for a matching eclass member. It's not sticking out in
    profiling like it did before though.

    regards, tom lane
  • Boszormenyi Zoltan at Oct 29, 2010 at 8:00 am

    Tom Lane írta:
    I wrote:
    the right way to make this faster is to refactor things so that we
    don't generate useless equivalence classes in the first place, or
    at least don't keep them around in the planner's lists once we realize
    they're useless.
    After a bit of hacking, I propose the attached patch.

    I like Heikki's hack to cut down on searching in make_canonical_pathkey,
    but I think that complicating the data structure searching beyond that
    is just a band-aid.
    With the given test case and this patch, we end up with exactly two
    canonical pathkeys referencing a single EquivalenceClass. So as far
    as I can tell there's not a lot of point in refining the pathkey
    searching. Now, the EquivalenceClass has got 483 members, which
    means that there's still some O(N^2) behavior in
    get_eclass_for_sort_expr. There might be some use in refining the
    search for a matching eclass member. It's not sticking out in
    profiling like it did before though.

    regards, tom lane
    Thanks, this patch made get_eclass_for_sort_expr almost
    disappear, and make_canonical_pathkey and add_child_rel_equivalences
    completely disappear, from the gprof timing.

    +1 for including this into 9.1.

    On the other hand, if I use a similar test case to my original one
    (i.e. the tables are much wider) then the query planning takes
    1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
    as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
    output now looks like this:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds      calls  s/call  s/call  name
    21.13      0.30     0.30     235091    0.00    0.00  SearchCatCache
     7.04      0.40     0.10    1507206    0.00    0.00  hash_search_with_hash_value
     3.52      0.45     0.05    2308219    0.00    0.00  AllocSetAlloc
     3.52      0.50     0.05     845776    0.00    0.00  hash_any
     3.52      0.55     0.05     341637    0.00    0.00  HeapTupleSatisfiesNow
     3.52      0.60     0.05       1136    0.00    0.00  tzload
     2.82      0.64     0.04        547    0.00    0.00  get_rel_data_width
     2.11      0.67     0.03     669414    0.00    0.00  hash_search
     2.11      0.70     0.03     235091    0.00    0.00  SearchSysCache
     2.11      0.73     0.03     192590    0.00    0.00  copyObject
     2.11      0.76     0.03     164457    0.00    0.00  pgstat_initstats
     2.11      0.79     0.03     152999    0.00    0.00  index_getnext
    ...

    Use the attached synthetic create_table_wide.sql together with the
    previous childtables.sql. The full compressed gprof output is attached.
    Your patch creates a 70% speedup in planning time, which is excellent.

    Best regards,
    Zoltán Böszörményi

    --
    ----------------------------------
    Zoltán Böszörményi
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
    http://www.postgresql.at/
  • Leonardo Francalanci at Oct 29, 2010 at 8:49 am

    On the other hand, if I use a similar test case to my original one
    (i.e. the tables are much wider) then the query planning takes
    1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
    as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
    output now looks like this:
    Hi,

    I'm really interested in this patch.

    I tried a simple test case:

    create table t (a integer, b text);

    DO $$DECLARE i int;
    BEGIN
        FOR i IN 0..9000 LOOP
            EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
                    ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
            EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
        END LOOP;
    END$$;

    explain select * from t where a > 1060 and a < 1090;

    but I don't get any gain from the patch... explain time is still around 250 ms.

    Tried with 9000 partitions, time is still 2 secs.


    Maybe I've completely missed the purpose of the patch?


    (I tried the test case at

    http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

    and that, in fact, gets a boost with this patch).



    Leonardo
  • Leonardo Francalanci at Oct 29, 2010 at 8:57 am

    but I don't get any gain from the patch... explain time is still around 250
    ms.
    Tried with 9000 partitions, time is still 2 secs.

    Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
    and got 250ms with both versions, with 9000 partitions 2 secs (again
    no gain from the patch)
  • Tom Lane at Oct 29, 2010 at 3:23 pm

    Leonardo Francalanci writes:
    I tried a simple test case:
    create table t (a integer, b text);
    DO $$DECLARE i int;
    BEGIN
        FOR i IN 0..9000 LOOP
            EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
                    ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
            EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
        END LOOP;
    END$$;
    explain select * from t where a > 1060 and a < 1090;
    This is going to be dominated by constraint exclusion checking. There's
    basically no fix for that except a more explicit representation of the
    partitioning rules. If the planner has to make 8999 theorem proofs to
    remove the 8999 unwanted partitions from the plan, it's gonna take
    awhile.
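    For a quick sense of how much of that is the exclusion proving itself, one can time the same EXPLAIN with constraint_exclusion set to partition and to off (a sketch; with exclusion off nothing is proved away and all the children simply show up in the plan):

    \timing
    SET constraint_exclusion = partition;  -- the default since 8.4: prove exclusion for inheritance children
    EXPLAIN SELECT * FROM t WHERE a > 1060 AND a < 1090;

    SET constraint_exclusion = off;        -- skip the proofs; every child stays in the plan
    EXPLAIN SELECT * FROM t WHERE a > 1060 AND a < 1090;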

    regards, tom lane
  • Leonardo Francalanci at Oct 29, 2010 at 3:41 pm

    This is going to be dominated by constraint exclusion checking. There's
    basically no fix for that except a more explicit representation of the
    partitioning rules.
    Damn, I knew that was going to be more complicated :)

    So in which case does this patch help? I guess in a multi-index
    scenario? childtables.sql is kind of hard to read (I think a FOR loop
    would have been more self-explanatory?).
  • Tom Lane at Oct 29, 2010 at 5:16 pm

    I wrote:
    This is going to be dominated by constraint exclusion checking.
    Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
    like this:

    samples % symbol name
    447433 47.1553 get_tabstat_entry
    185458 19.5456 find_all_inheritors
    53064 5.5925 SearchCatCache
    33864 3.5690 pg_strtok
    26301 2.7719 hash_search_with_hash_value
    22577 2.3794 AllocSetAlloc
    6696 0.7057 MemoryContextAllocZeroAligned
    6250 0.6587 expression_tree_walker
    5141 0.5418 LockReleaseAll
    4779 0.5037 get_relation_info
    4506 0.4749 MemoryContextAlloc
    4467 0.4708 expression_tree_mutator
    4136 0.4359 pgstat_initstats
    3914 0.4125 relation_excluded_by_constraints

    get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
    the number of tables they have to deal with. However, the constant
    factors are small enough that you need a heck of a lot of tables
    before they become significant consumers of runtime. I'm not convinced
    that we should be optimizing for 9000-child-table cases.

    It'd be worth fixing these if we can do it without either introducing a
    lot of complexity, or slowing things down for typical cases that have
    only a few tables. Offhand not sure about how to do either.

    regards, tom lane
  • Leonardo Francalanci at Oct 29, 2010 at 5:22 pm

    Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
    like this:

    Well, the 9000-table-test-case was meant to check the difference in
    performance with/without the patch... I don't see the reason for trying
    to optimize such an unrealistic case.

    BTW can someone explain to me which are the cases where the
    patch actually helps?
  • Tom Lane at Oct 29, 2010 at 5:31 pm

    Leonardo Francalanci writes:
    BTW can someone explain to me which are the cases where the
    patch actually helps?
    Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
    per child table, only one of which was relevant to the query. In your
    test case there are no irrelevant indexes, which is why the runtime
    didn't change.

    regards, tom lane
  • Leonardo Francalanci at Oct 29, 2010 at 8:09 pm

    Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
    per child table, only one of which was relevant to the query. In your
    test case there are no irrelevant indexes, which is why the runtime
    didn't change.


    Mmh... I must be doing something wrong. It looks to me it's not just
    the irrelevant indexes: it's the "order by" that counts. If I remove it,
    the times are the same with and without the patch:

    using the test case:

    explain select * from inh_parent
    where timestamp1 between '2010-04-06' and '2010-06-25'

    this one runs in the same time with and without the patch; but adding:


    order by timestamp2

    made the non-patched version run 3 times slower.

    My test case:

    create table t (a integer, b integer, c integer, d integer, e text);

    DO $$DECLARE i int;
    BEGIN
        FOR i IN 0..2000 LOOP
            EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
                    ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
            EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
            EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
            EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
            EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
        END LOOP;
    END$$;


    explain select * from t where a > 1060 and a < 109000

    this runs in 1.5 secs with and without the patch. But if I add

    order by b

    the non-patched version runs in 10 seconds.

    Am I getting it wrong?
  • Tom Lane at Oct 29, 2010 at 8:23 pm

    Leonardo Francalanci writes:
    Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
    per child table, only one of which was relevant to the query. In your
    test case there are no irrelevant indexes, which is why the runtime
    didn't change.
    Mmh... I must be doing something wrong. It looks to me it's not just
    the irrelevant indexes: it's the "order by" that counts.
    Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
    any mergejoinable join clauses, then the possibly_useful_pathkeys test
    in find_usable_indexes figures out that we aren't interested in the sort
    ordering of *any* indexes, so the whole thing gets short-circuited.
    You need at least the possibility of interest in sorted output from an
    indexscan before any of this code runs.

    regards, tom lane
  • Bruce Momjian at Nov 13, 2010 at 3:47 am

    Tom Lane wrote:
    Leonardo Francalanci <m_lists@yahoo.it> writes:
    Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
    per child table, only one of which was relevant to the query. In your
    test case there are no irrelevant indexes, which is why the runtime
    didn't change.
    Mmh... I must be doing something wrong. It looks to me it's not just
    the irrelevant indexes: it's the "order by" that counts.
    Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
    any mergejoinable join clauses, then the possibly_useful_pathkeys test
    in find_usable_indexes figures out that we aren't interested in the sort
    ordering of *any* indexes, so the whole thing gets short-circuited.
    You need at least the possibility of interest in sorted output from an
    indexscan before any of this code runs.
    FYI, I always wondered if the rare use of mergejoins justified the extra
    planning time of carrying around all those joinpaths.

    --
    Bruce Momjian <bruce@momjian.us> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
  • Tom Lane at Nov 13, 2010 at 3:56 am

    Bruce Momjian writes:
    FYI, I always wondered if the rare use of mergejoins justified the extra
    planning time of carrying around all those joinpaths.
    They're hardly rare.

    regards, tom lane
  • Robert Haas at Nov 13, 2010 at 10:26 pm

    On Fri, Nov 12, 2010 at 10:55 PM, Tom Lane wrote:
    Bruce Momjian <bruce@momjian.us> writes:
    FYI, I always wondered if the rare use of mergejoins justified the extra
    planning time of carrying around all those joinpaths.
    They're hardly rare.
    They're fairly rare in the sorts of queries I normally issue, but I'd
    quibble with the statement on other grounds: IME, we generate far more
    nestloop paths than anything else. The comment in
    match_unsorted_outer() says it all:

    * We always generate a nestloop path for each available outer path.
    * In fact we may generate as many as five: one on the cheapest-total-cost
    * inner path, one on the same with materialization, one on the
    * cheapest-startup-cost inner path (if different), one on the
    * cheapest-total inner-indexscan path (if any), and one on the
    * cheapest-startup inner-indexscan path (if different).

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Alvaro Herrera at Oct 29, 2010 at 7:12 pm

    Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010:
    I wrote:
    This is going to be dominated by constraint exclusion checking.
    Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
    like this:

    samples % symbol name
    447433 47.1553 get_tabstat_entry
    Is there a reason for keeping the pgstat info in plain lists? We could
    use dynahash there too, I think. It would have more palloc overhead
    that way, though (hmm, but perhaps that can be fixed by having a smart
    "alloc" function for it, preallocating a bunch of entries instead of one
    by one).

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Tom Lane at Oct 29, 2010 at 7:37 pm

    Alvaro Herrera writes:
    Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010:
    samples % symbol name
    447433 47.1553 get_tabstat_entry
    Is there a reason for keeping the pgstat info in plain lists?
    Yeah: anything else loses for small numbers of tables per query, which
    is the normal case. I'd guess you'd need ~100 tables touched in
    a single transaction before a hashtable is even worth thinking about.

    We could possibly adopt a solution similar to the planner's approach for
    joinrels: start with a simple list, and switch over to hashing if the
    list gets too long. But I'm really doubtful that it's worth the code
    space. Even with Zoltan's 500-or-so-table case, this wasn't on the
    radar screen.

    regards, tom lane
  • Tom Lane at Nov 1, 2010 at 2:18 pm

    I wrote:
    samples % symbol name
    447433 47.1553 get_tabstat_entry
    185458 19.5456 find_all_inheritors
    53064 5.5925 SearchCatCache
    33864 3.5690 pg_strtok
    get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
    the number of tables they have to deal with. However, the constant
    factors are small enough that you need a heck of a lot of tables
    before they become significant consumers of runtime. I'm not convinced
    that we should be optimizing for 9000-child-table cases.
    It'd be worth fixing these if we can do it without either introducing a
    lot of complexity, or slowing things down for typical cases that have
    only a few tables. Offhand not sure about how to do either.
    I had a thought about how to make get_tabstat_entry() faster without
    adding overhead: what if we just plain remove the search, and always
    assume that a new entry has to be added to the tabstat array?

    The existing code seems to be designed to make no assumptions about
    how it's being used, but that's a bit silly. We know that the links are
    coming from the relcache, which will have only one entry per relation,
    and that the relcache is designed to hang onto the links for (at least)
    the life of a transaction. So rather than optimizing for the case where
    the relcache fails to remember the tabstat link, maybe we should
    optimize for the case where it does remember.

    The worst-case consequence AFAICS would be multiple tabstat entries for
    the same relation, which seems pretty noncritical anyway.

    Thoughts?

    regards, tom lane
  • Tom Lane at Nov 5, 2010 at 2:59 am
    [ for the archives' sake ]

    I wrote:
    I had a thought about how to make get_tabstat_entry() faster without
    adding overhead: what if we just plain remove the search, and always
    assume that a new entry has to be added to the tabstat array?
    I spent some time looking into this idea. It doesn't really work,
    because there are places that will break if a transaction has more than
    one tabstat entry for the same relation. The one that seems most
    difficult to fix is that pgstat_recv_tabstat() clamps its n_live_tuples
    and n_dead_tuples values to be nonnegative after adding in each delta
    received from a backend. That is a good idea because it prevents insane
    results if some messages get lost --- but if a transaction's updates get
    randomly spread into several tabstat items, the intermediate counts
    might get clamped, resulting in a wrong final answer even though nothing
    was lost.

    I also added some instrumentation printouts and found that in our
    regression tests:
    * about 10% of get_tabstat_entry() calls find an existing entry
    for the relation OID. This seems to happen only when a
    relcache entry gets flushed mid-transaction, but that does
    happen, and not so infrequently either.
    * about half of the transactions use as many as 20 tabstats,
    and 10% use 50 or more; but it drops off fast after that.
    Almost no transactions use as many as 100 tabstats.
    It's not clear that these numbers are representative of typical
    database applications, but they're something to start with anyway.

    I also did some testing to compare the cost of get_tabstat_entry's
    linear search against a dynahash.c table with OID key. As I suspected,
    a hash table would make this code a *lot* slower for small numbers of
    tabstat entries: about a factor of 10 slower. You need upwards of 100
    tabstats touched in a transaction before the hash table begins to pay
    for itself. This is largely because dynahash doesn't have any cheap way
    to reset a hashtable to empty, so you have to initialize and destroy the
    table for each transaction. By the time you've eaten that overhead,
    you've already expended as many cycles as the linear search takes to
    handle several dozen entries.

    I conclude that if we wanted to do something about this, the most
    practical solution would be the one of executing linear searches until
    we get to 100+ tabstat entries in a transaction, and then building a
    hashtable for subsequent searches. However, it's exceedingly unclear
    that it will ever be worth the effort or code space to do that.

    regards, tom lane
