The patch implements much more accurate cost estimation for GIN index scans
than the generic cost estimation function does. The basic idea is described
in the PGCon 2010 talk:
http://www.sai.msu.su/~megera/postgres/talks/pgcon-2010-1.pdf, pages 48-54.

After discussion at PGCon, the storage of the additional statistics was moved
from the pg_class table to the meta-page of the index. The statistics are
cached in Relation->rd_amcache to avoid frequent reads of the meta-page.
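
As a rough sketch of the block-level numbers involved (the struct and helper
below are illustrative, modeled on the four counters that appear in the
debugging NOTICE output later in this thread; they are not the patch's actual
definitions):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the block-level statistics the patch stores in
 * the GIN meta-page.  Field names follow the counters printed in the
 * NOTICE output quoted later in this thread; the real patch's layout is
 * not shown here.
 */
typedef struct GinIndexStats
{
    uint32_t    nEntryPages;    /* pages in the entry tree */
    uint32_t    nDataPages;     /* pages in posting trees */
    uint32_t    nPendingPages;  /* pages in the fast-update pending list */
    uint64_t    nEntries;       /* distinct keys stored in the index */
} GinIndexStats;

/* Total index pages accounted for by the cached statistics. */
static uint64_t
gin_stats_total_pages(const GinIndexStats *stats)
{
    return (uint64_t) stats->nEntryPages
         + (uint64_t) stats->nDataPages
         + (uint64_t) stats->nPendingPages;
}
```

With the numbers from the test below (49297 entry pages, 16951 data pages,
no pending pages), the statistics account for 66248 pages of the index.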

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/


  • Jan Urbański at Jul 25, 2010 at 9:41 pm

    On 02/07/10 14:33, Teodor Sigaev wrote:
    The patch implements much more accurate cost estimation for GIN index
    scans than the generic cost estimation function.
    Hi,

    I'm reviewing this patch, and to begin with it I tried to reproduce the
    problem that originally came up on -performance in
    http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php

    The links from that mail are now dead, so I set up my own test environment:
    * one table testfts(id serial, body text, body_fts tsvector)
    * 50000 rows, each with 1000 random words taken from
    /usr/share/dict/british-english-insane (the wbritish-insane Debian
    package) separated by a single space
    * each row also had the word "commonterm" at the end, 80% had
    commonterm80, 60% had commonterm60 etc (using the same methodology as
    Jesper, that commonterm60 can appear only if commonterm80 is in the row)
    * a GIN index on the tsvectors

    I was able to reproduce his issue, that is: select id from ftstest where
    body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
    which was resulting in much longer execution than the bitmap index plan
    that I got after disabling seqscans.

    I then applied the patch, recompiled PG and tried again... and nothing
    changed. I first tried running ANALYSE and then dropping and recreating
    the GIN index, but the planner still chooses the seq scan.

    Full explains below (the NOTICE is a debugging aid from the patch, which
    I temporarily enabled to see if it's picking up the code).

    I'll continue reading the code and trying to understand what it does,
    but in the meantime: am I doing something wrong that I don't see the
    planner switching to the bitmap index plan? I see that the difference in
    costs is small, so maybe I just need to tweak the planner knobs a bit?
    Is the output below expected?

    Cheers,
    Jan


    wulczer=# explain analyse select id from ftstest where body_fts @@
    to_tsquery('commonterm80');
    NOTICE: GIN stats: nEntryPages: 49297.000000 nDataPages: 16951.000000
    nPendingPages :0.000000 nEntries: 277521.000000
    QUERY PLAN

    ------------------------------------------------------------------------------------------------------------------
    Seq Scan on ftstest (cost=0.00..1567.00 rows=39890 width=4) (actual
    time=221.893..33179.794 rows=39923 loops=1)
    Filter: (body_fts @@ to_tsquery('commonterm80'::text))
    Total runtime: 33256.661 ms
    (3 rows)

    wulczer=# set enable_seqscan to false;
    SET
    Time: 0.257 ms
    wulczer=# explain analyse select id from ftstest where body_fts @@
    to_tsquery('commonterm80');
    NOTICE: GIN stats: nEntryPages: 49297.000000 nDataPages: 16951.000000
    nPendingPages :0.000000 nEntries: 277521.000000
    QUERY PLAN

    ------------------------------------------------------------------------------------------------------------------------------------
    Bitmap Heap Scan on ftstest (cost=449.15..1864.50 rows=39890 width=4)
    (actual time=107.421..181.284 rows=39923 loops=1)
    Recheck Cond: (body_fts @@ to_tsquery('commonterm80'::text))
    -> Bitmap Index Scan on ftstest_gin_idx (cost=0.00..439.18
    rows=39890 width=0) (actual time=97.057..97.057 rows=39923 loops=1)
    Index Cond: (body_fts @@ to_tsquery('commonterm80'::text))
    Total runtime: 237.218 ms
    (5 rows)

    Time: 237.999 ms
  • Oleg Bartunov at Jul 26, 2010 at 10:58 am
  • Jan Urbański at Jul 26, 2010 at 11:06 am

    On 26/07/10 12:58, Oleg Bartunov wrote:
    Jan,
    On Sun, 25 Jul 2010, Jan Urbański wrote:
    On 02/07/10 14:33, Teodor Sigaev wrote:
    The patch implements much more accurate cost estimation for GIN index
    scans than the generic cost estimation function.
    I was able to reproduce his issue, that is: select id from ftstest where
    body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
    which was resulting in much longer execution than the bitmap index plan
    that I got after disabling seqscans.

    I then applied the patch, recompiled PG and tried again... and nothing
    changed. I first tried running ANALYSE and then dropping and recreating
    the GIN index, but the planner still chooses the seq scan.
    read thread
    http://archives.postgresql.org/pgsql-hackers/2010-04/msg01407.php
    There is always a fuzz factor, as Tom said, about 1% in path cost
    comparisons.
    You may compare plans for 'commonterm60', 'commonterm40'.
    OK, I thought this might be the case, as with the patch the sequential
    scan is winning only by a small margin.

    Thanks,
    Jan
  • Jan Urbański at Jul 30, 2010 at 5:16 pm
    OK, here's a review, as much as I was able to do it without
    understanding deeply how GIN works.

    The patch is in context diff format, applies cleanly to HEAD, compiles
    without warnings, and passes regression tests.

    Using the script from
    http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php I
    was able to get an index scan with commonterm40, while with the
    unpatched source I was getting an index scan only for commonterm20, so
    it indeed improves the situation as far as cost estimation is concerned.

    Codewise I have one question: the patch changes a loop in
    ginvacuumcleanup from

    for (blkno = GIN_ROOT_BLKNO + 1; blkno < npages; blkno++)

    to

    for (blkno = GIN_ROOT_BLKNO; blkno < npages; blkno++)

    why should it now go through all blocks? I couldn't immediately see why
    it was not OK to do that before and why it is OK to do it now, but I
    don't really get how GIN works internally. I guess a comment would be
    good to have there in any case.

    The patch has lots of statements like if ( GinPageIsLeaf(page) ), that
    is with extra space between the outer parenthesis and the condition,
    which AFAIK is not the project style. I guess pgindent fixes that, so
    it's no big deal.

    There are lines with elog(NOTICE) that are #if 0, they probably should
    either become elog(DEBUGX) or get removed.

    As for performance, I tried running the attached script a couple of
    times. I used the standard config file, only changed checkpoint_segments
    to 30 and shared_buffers to 512MB. The timings were:

    HEAD

    INSERT 0 500000
    Time: 13487.450 ms
    VACUUM
    Time: 337.673 ms

    INSERT 0 500000
    Time: 13751.110 ms
    VACUUM
    Time: 315.812 ms

    INSERT 0 500000
    Time: 12691.259 ms
    VACUUM
    Time: 312.320 ms

    HEAD + gincostestimate

    INSERT 0 500000
    Time: 13961.894 ms
    VACUUM
    Time: 355.798 ms

    INSERT 0 500000
    Time: 14114.975 ms
    VACUUM
    Time: 341.822 ms

    INSERT 0 500000
    Time: 13679.871 ms
    VACUUM
    Time: 340.576 ms

    so there is no immediate slowdown for a quick test with one client.

    Since there were no stability or performance issues and it solves the
    problem, I am marking this as ready for committer, although it might be
    beneficial if someone more acquainted with GIN takes another look at it
    before the committer review.

    I will be travelling during the whole of August and will only have
    intermittent email access, so in case of any questions regarding the
    review, the response time may be a few days.

    Cheers,
    Jan
  • Robert Haas at Jul 30, 2010 at 6:04 pm

    On Fri, Jul 30, 2010 at 1:19 PM, Jan Urbański wrote:
    The patch has lots of statements like if ( GinPageIsLeaf(page) ), that is
    with extra space between the outer parenthesis and the condition, which
    AFAIK is not the project style. I guess pgindent fixes that, so it's no big
    deal.
    It's better if these get cleaned up. pgindent will fix it eventually,
    but the less stuff pgindent has to touch, the less likelihood there is
    of breaking outstanding patches when it's run.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Tom Lane at Aug 6, 2010 at 9:46 pm

    Jan Urbański <wulczer@wulczer.org> writes:
    [ review of gincostestimate-0.19 ]
    I went through this patch, re-synced with current HEAD, and did some
    minor editorializing; a new version is attached. (Caution: I have not
    tested this beyond verifying that it still compiles.)
    Codewise I have one question: the patch changes a loop in
    ginvacuumcleanup from
    for (blkno = GIN_ROOT_BLKNO + 1; blkno < npages; blkno++)
    to
    for (blkno = GIN_ROOT_BLKNO; blkno < npages; blkno++)
    why should it now go through all blocks?
    I think this is correct. Before, vacuum had nothing useful to do on the
    root page so it just skipped it. Now, it needs to count the root page
    in the appropriate way in the stats it's gathering. The previous coding
    maybe could have used a comment, but this version is unsurprising.
    The patch has lots of statements like if ( GinPageIsLeaf(page) ), that
    is with extra space between the outer parenthesis and the condition,
    which AFAIK is not the project style. I guess pgindent fixes that, so
    it's no big deal.
    There are lines with elog(NOTICE) that are #if 0, they probably should
    either become elog(DEBUGX) or get removed.
    I fixed the latter and cleaned up some of the formatting violations,
    though not all. I dunno if anyone feels like running pgindent on the
    patch at the moment.

    I think there are two big problems left before this patch can be
    applied:

    1. The use of rd_amcache is very questionable. There's no provision for
    updates executed by one session to get reflected into stats already
    cached in another session. You could fix that by forcing relcache
    flushes whenever you update the stats, as btree does:

    /* send out relcache inval for metapage change */
    if (!InRecovery)
        CacheInvalidateRelcache(rel);

    However I think that's probably a Really Bad Idea, because it would
    result in extremely frequent relcache flushes, and those are expensive.
    (The reason this mechanism is good for btree is it only needs to update
    its cache after a root page split, which is infrequent.) My advice is
    to drop the use of rd_amcache completely, and just have the code read
    the metapage every time gincostestimate runs. If you don't like that
    then you're going to need to think hard about how often to update the
    cache and what can drive that operation at the right times.

    BTW, another problem that would need to be fixed if you keep this code
    is that ginUpdateStatInPlace wants to force the new values into
    rd_amcache, which (a) is pretty useless and (b) risks a PANIC on
    palloc failure, because it's called from a critical section.

    2. Some of the calculations in gincostestimate seem completely bogus.
    In particular, this bit:

    /* calc partial match: we don't need a search but an index scan */
    *indexStartupCost += partialEntriesInQuals * numEntryPages / numEntries;

    is surely wrong, because the number being added to indexStartupCost is
    a pure ratio not scaled by any cost unit. I don't understand what this
    number is supposed to be, so it's not clear what cost variable ought to
    be included.
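
To make the dimensional problem concrete: the term being added is a pure
page-count ratio, and attaching a unit means multiplying by some per-page
cost constant. The sketch below is illustrative only; DEFAULT_PAGE_COST is a
hypothetical stand-in, not a planner parameter the patch actually uses.

```c
#include <assert.h>

/*
 * The questioned term, transcribed from the patch: a ratio of pages to
 * entries, with no cost unit attached.
 */
static double
startup_term_unscaled(double partialEntries, double numEntryPages,
                      double numEntries)
{
    return partialEntries * numEntryPages / numEntries;
}

/*
 * A dimensionally consistent variant would multiply by a per-page cost.
 * DEFAULT_PAGE_COST is a stand-in for whichever cost unit (sequential or
 * random page cost) the corrected formula ought to use.
 */
#define DEFAULT_PAGE_COST 4.0

static double
startup_term_scaled(double partialEntries, double numEntryPages,
                    double numEntries)
{
    return startup_term_unscaled(partialEntries, numEntryPages, numEntries)
           * DEFAULT_PAGE_COST;
}
```

Without the scaling, the term is typically a small fraction that gets lost
next to cost-unit-scaled terms in indexStartupCost.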

    This equation doesn't seem amazingly sane either:

    /* cost to scan data pages for each matched entry */
    pagesFetched = ceil((searchEntriesInQuals + partialEntriesInQuals) *
                        numDataPages / numEntries);

    This has pagesFetched *decreasing* as numEntries increases, which cannot
    be right can it?

    Also, right after that step, there's a bit of code that re-estimates
    pagesFetched from selectivity and uses the larger value. Fine, but why
    are you only applying that idea here and not to the entry-pages
    calculation done a few lines earlier?

    regards, tom lane
  • Tom Lane at Aug 7, 2010 at 2:36 pm

    I wrote:
    1. The use of rd_amcache is very questionable.
    Attached is an alternate patch that I think you should give serious
    consideration to. The basic idea here is to only update the metapage
    stats data during VACUUM, and not bother with incremental updates during
    other operations. That gets rid of a lot of the complexity and
    opportunities for bugs-of-omission in the original approach, and also
    reduces contention for the metapage as well as WAL traffic.
    gincostestimate can compensate fairly well for index growth since the
    last VACUUM by scaling up the recorded values by the known growth ratio
    of the overall index size. (Note that the index->pages count passed to
    gincostestimate is accurate, having been recently gotten from
    RelationGetNumberOfBlocks.) Of course, this is only approximate, but
    considering that the equations the values are going to be fed into are
    even more approximate, I don't see a problem with that.
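
The compensation described above could be sketched as follows (scale_stat is
a hypothetical helper, not code from the attached patch; current_pages would
come from RelationGetNumberOfBlocks):

```c
#include <assert.h>

/*
 * Hypothetical sketch of scaling a value recorded at the last VACUUM by
 * the index's growth since then.  current_pages is accurate (recently
 * obtained via RelationGetNumberOfBlocks); pages_at_vacuum is the total
 * page count recorded in the meta-page at VACUUM time.
 */
static double
scale_stat(double recorded_value, double pages_at_vacuum,
           double current_pages)
{
    if (pages_at_vacuum <= 0.0 || current_pages <= pages_at_vacuum)
        return recorded_value;  /* no baseline, or index hasn't grown */
    return recorded_value * (current_pages / pages_at_vacuum);
}
```

For example, if the index grew from 1000 to 1500 pages since the last
VACUUM, a recorded entry-page count of 800 would scale up to 1200.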

    I also dropped the use of rd_amcache, instead having ginGetStats()
    just read the metapage every time. Since the planner stats per se
    are now only updated during vacuum, it would be reasonable to use
    rd_amcache to remember them, but there's still a problem with
    nPendingPages. I think that keeping it simple is the way to go,
    at least until someone can show a performance problem with this way.

    I didn't do anything about the questionable equations in
    gincostestimate. Those need to either be fixed, or documented as
    to why they're correct. Other than that I think this could be
    committed.

    regards, tom lane

    PS: I still haven't tested this further than running the regression
    tests, since I see little point in trying to check its estimation
    behavior until those equations are fixed. However, the hstore
    regression test did expose a core dump in gincostestimate (failing
    to guard against null partial_matches), which I have fixed here.
  • Teodor Sigaev at Sep 7, 2010 at 4:06 pm

    I also dropped the use of rd_amcache, instead having ginGetStats()
    OK, I agree.
    I didn't do anything about the questionable equations in
    gincostestimate. Those need to either be fixed, or documented as
    to why they're correct. Other than that I think this could be
    committed.
    Fixed, and slightly reworked to be more clear.
    Attached patch is based on your patch.

    --
    Teodor Sigaev E-mail: teodor@sigaev.ru
    WWW: http://www.sigaev.ru/
  • Itagaki Takahiro at Oct 6, 2010 at 9:57 am

    On Wed, Sep 8, 2010 at 1:02 AM, Teodor Sigaev wrote:
    Fixed, and slightly reworked to be more clear.
    Attached patch is based on your patch.
    The patch improves the accuracy of plans that use GIN indexes.
    It only adds block-level statistics to the meta-pages of GIN
    indexes. Data-level statistics are not collected by the patch,
    and there are no changes in pg_statistic.

    The statistics in the meta-page are updated only by VACUUM; ANALYZE
    doesn't update the information at all. In addition, REINDEX, VACUUM FULL,
    and CLUSTER reset the information to zero, which is not desirable.
    Is it possible to fill the statistics fields at bulk index build?
    No one wants to run VACUUM after VACUUM FULL just to update the GIN stats.

    We don't have any method to dump the meta information at all.
    It may be internal information, but some developers and debuggers
    might want such a tool. Contrib/pageinspect might be a good place
    for such a function; it already has bt_metap().

    The patch applies cleanly, compiles with no warnings, and passes all
    existing regression tests. There is no additional documentation and
    there are no regression tests -- I'm not sure whether we should have
    them. If the patch is an internal improvement, docs are not needed.

    --
    Itagaki Takahiro
  • Tom Lane at Oct 18, 2010 at 12:57 am

    Itagaki Takahiro writes:
    On Wed, Sep 8, 2010 at 1:02 AM, Teodor Sigaev wrote:
    Fixed, and slightly reworked to be more clear.
    Attached patch is based on your patch.
    The stats page is updated only in VACUUM. ANALYZE doesn't update
    the information at all.
    ANALYZE doesn't scan indexes, so it's not realistic to expect ANALYZE to
    update these numbers.

    In addition, REINDEX, VACUUM FULL, and
    CLUSTER reset the information to zero, but the reset is not preferable.
    Is it possible to fill the statistic fields at bulk index-build?
    I fixed that and committed it. It probably wouldn't be a bad idea for
    Teodor or Oleg to double-check where I put in the counter increments;
    although I did test that they matched the results of VACUUM for a
    reasonably large GIN index.
    We don't have any methods to dump the meta information at all.
    They might be internal information, but some developers and
    debuggers might want such kinds of tools. Contrib/pageinspect
    might be a good location to have such function; it has bt_metap().
    That seems like a good idea, but I haven't done it.

    regards, tom lane

Discussion Overview
group: pgsql-hackers @ postgresql
posted: Jul 2, '10 at 12:33p
active: Oct 18, '10 at 12:57a
posts: 11
users: 6
website: postgresql.org...
irc: #postgresql
