Please find attached a patch implementing a basic version of
index-only scans. This patch is the work of my colleague Ibrar Ahmed
and myself, and also incorporates some code from previous patches
posted by Heikki Linnakangas.

I'm able to demonstrate a significant performance benefit from this
patch, even in a situation where it doesn't save any disk I/O, due to
reduced thrashing of shared_buffers. Configuration settings:

max_connections = 100
shared_buffers = 400MB
maintenance_work_mem = 256MB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
checkpoint_warning = 90s
seq_page_cost = 1.0
random_page_cost = 1.0
effective_cache_size = 3GB

Test setup:

pgbench -i -s 50
create table sample_data as select (random()*5000000)::int as aid,
repeat('a', 1000) as filler from generate_series(1,100000);

Test queries:

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

On my laptop, the first query executes in about 555 ms, while the
second one takes about 1125 ms. Inspection via pg_buffercache reveals
that the second one thrashes shared_buffers something fierce, while
the first one does not. You can actually get the time for the first
query down to about 450 ms if you can persuade PostgreSQL to cache the
entire sample_data table - which is difficult, due to the
BufferAccessStrategy stuff - and as soon as you run the second query,
it blows the table out of cache, so practically speaking you're not
going to get that faster time very often. I expect that you could get
an even larger benefit from this type of query if you could avoid
actual disk I/O, rather than just buffer cache thrashing, but I
haven't come up with suitable test cases for that yet (volunteers?).
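
For reference, the shared_buffers churn mentioned above can be watched
with a query along these lines against the pg_buffercache contrib view
(a sketch, run before and after each test query; assumes the contrib
module is installed):

```sql
-- Sketch: count how many shared buffers each relation in the current
-- database occupies, largest consumers first.
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```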

There are several things about this patch that probably need further
thought and work, or at least some discussion.

1. The way that nodeIndexscan.c builds up the faux heap tuple is
perhaps susceptible to improvement. I thought about building a
virtual tuple, but then what do I do with an OID column, if I have
one? Or maybe this should be done some other way altogether.

2. Suppose we scan one tuple on a not-all-visible page followed by 99
tuples on all-visible pages. The code as written will hold the pin on
the first heap page for the entire scan. As soon as we hit the end of
the scan or another tuple where we have to actually visit the page,
the old pin will be released, but until then we hold onto it. This
isn't totally stupid: the next tuple that's on a not-all-visible page
could easily be on the same not-all-visible page we already have
pinned. And in 99% of cases, holding the pin for slightly longer is
probably completely harmless. On the flip side, it could increase the
chances of interfering with VACUUM. Then again, a non-index-only scan
would still interfere with the same VACUUM, so maybe we don't care.

3. The code in create_index_path() builds up a bitmapset of heap
attributes that get used for any purpose anywhere in the query, and
hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
index under consideration. However, if it were somehow possible to
have the rel involved without using any attributes at all, we'd
rebuild the cache over and over, since it would never become non-NULL.
I don't think that can happen right now, but future planner changes
might make it possible.

4. There are a couple of cases that use index-only scans even though
the EXPLAIN output sort of makes it look like they shouldn't. For
example, in the above queries, an index-only scan is chosen even
though the query does "SELECT *" from the table being scanned. Even
though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
that the target list of an EXISTS query is in fact discarded, e.g.:

create or replace function error() returns int as $$begin return 1/0;
end$$ language plpgsql;
select * from pgbench_accounts a where exists (select error() from
pgbench_branches b where b.bid = a.aid); -- doesn't error out!

Along similar lines, COUNT(*) does not preclude an index-only scan,
because the * is apparently just window dressing. You'll still get
just a seq-scan unless you have an indexable qual in the query
somewhere, because...
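
For instance, a count(*) with an indexable qual on pgbench_accounts's
primary key column (standard pgbench schema) is the kind of query that
could qualify, since the * contributes nothing to the target list:

```sql
-- Illustrative sketch: the qual on aid is indexable via the pgbench
-- primary key, so under this patch the count can in principle be
-- satisfied by an index-only scan.
SELECT count(*) FROM pgbench_accounts WHERE aid BETWEEN 1 AND 100000;
```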

5. We haven't made any planner changes at all, not even for cost
estimation. It is not clear to me what the right way to do cost
estimation here is. It seems that it would be useful to know what
percentage of the table is all-visible at planning time, but even if
we did, there could (and probably often will) be correlation between
the pages the query is interested in and which visibility map bits are
set. So I'm concerned about being overly optimistic about the cost
savings. Also, there's the problem of figuring out how to keep the
percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe
that's a job for ANALYZE but I haven't thought about it much yet.

6. I'm sure there are some statements in the documentation that need
updating, but I haven't tracked 'em down yet.

Comments, testing, review appreciated...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

  • Greg Sabino Mullane at Aug 11, 2011 at 8:57 pm

    1. The way that nodeIndexscan.c builds up the faux heap tuple is
    perhaps susceptible to improvement. I thought about building a
    virtual tuple, but then what do I do with an OID column, if I have
    one? Or maybe this should be done some other way altogether.
    Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?

    - --
    Greg Sabino Mullane greg@turnstep.com
    End Point Corporation http://www.endpoint.com/
    PGP Key: 0x14964AC8 201108111654
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • Joshua D. Drake at Aug 11, 2011 at 9:02 pm

    On 08/11/2011 01:57 PM, Greg Sabino Mullane wrote:

    1. The way that nodeIndexscan.c builds up the faux heap tuple is
    perhaps susceptible to improvement. I thought about building a
    virtual tuple, but then what do I do with an OID column, if I have
    one? Or maybe this should be done some other way altogether.
    Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?
    As a user-space option, certainly. However, they are also used to track
    system objects; I don't know that we need to get rid of them for that
    purpose. We just need to remove WITH/WITHOUT OIDS for user tables.

    JD
    --
    Command Prompt, Inc. - http://www.commandprompt.com/
    PostgreSQL Support, Training, Professional Services and Development
    The PostgreSQL Conference - http://www.postgresqlconference.org/
    @cmdpromptinc - @postgresconf - 509-416-6579
  • Robert Haas at Aug 11, 2011 at 9:06 pm

    On Thu, Aug 11, 2011 at 4:57 PM, Greg Sabino Mullane wrote:
    1. The way that nodeIndexscan.c builds up the faux heap tuple is
    perhaps susceptible to improvement.  I thought about building a
    virtual tuple, but then what do I do with an OID column, if I have
    one?  Or maybe this should be done some other way altogether.
    Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?
    I thought about just not supporting that for index-only scans, but
    system catalogs use them pretty extensively, and it doesn't seem out
    of the question that that could matter to people who have lots of SQL
    objects floating around. I am *definitely* not volunteering to
    reengineer OIDs out of the system catalogs. For that amount of work,
    I could probably implement half a dozen major features, any single one
    of which would be more valuable than the tiny amount of notational
    convenience such a change would buy.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Greg Sabino Mullane at Aug 12, 2011 at 1:45 am

    Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?
    I thought about just not supporting that for index-only scans, but
    system catalogs use them pretty extensively, and it doesn't seem out
    of the question that that could matter to people who have lots of SQL
    objects floating around.
    Right - when I said remove, I meant for all but system catalogs. I
    would think those are generally small enough that for most people
    the lack of index-only scans on those would not matter. Heck, the
    system catalogs are already "special" in lots of ways other than
    having OIDs (most anyway), so it's not as though we'd be breaking
    sacred ground with an index-only exception. :)

    I guess the question that should be asked is "we are going to finally
    remove OIDs someday, right?". If so, and if it's potentially blocking a
    major new feature, why not now?

    - --
    Greg Sabino Mullane greg@turnstep.com
    End Point Corporation http://www.endpoint.com/
    PGP Key: 0x14964AC8 201108112140
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • Andrew Dunstan at Aug 12, 2011 at 1:51 am

    On 08/11/2011 09:44 PM, Greg Sabino Mullane wrote:
    I guess the question that should be asked is "we are going to finally
    remove OIDs someday, right?". If so, and if it's potentially blocking a
    major new feature, why not now?
    It seems a bit odd then that we added "ALTER TABLE SET WITH OIDS" just a
    couple of years ago.

    cheers

    andrew
  • Robert Haas at Aug 12, 2011 at 3:22 am

    On Thu, Aug 11, 2011 at 9:44 PM, Greg Sabino Mullane wrote:
    Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?
    I thought about just not supporting that for index-only scans, but
    system catalogs use them pretty extensively, and it doesn't seem out
    of the question that that could matter to people who have lots of SQL
    objects floating around.
    Right - when I said remove, I meant for all but system catalogs. I
    would think those are generally small enough that for most people
    the lack of index-only scans on those would not matter. Heck, the
    system catalogs are already "special" in lots of ways other than
    having OIDs (most anyway), so it's not as though we'd be breaking
    sacred ground with an index-only exception. :)
    Yeah, maybe. But since there's no evidence that we actually need that
    exception for performance, I'm not in favor of adding it at this
    point.
    I guess the question that should be asked is "we are going to finally
    remove OIDs someday, right?".
    I don't necessarily see a reason to do that. I wouldn't object to
    turning the system OID columns in the system catalogs into normal OID
    columns, but that would be a lot of work and it doesn't seem to me to
    be the most important problem we need to solve (or even in the top
    25).
    If so, and if it's potentially blocking a
    major new feature, why not now?
    It's not blocking a major new feature, except to the extent that we're
    having a conversation about it, instead of talking about the major new
    feature.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Cédric Villemain at Aug 11, 2011 at 9:39 pm

    2011/8/11 Robert Haas <robertmhaas@gmail.com>:
    Please find attached a patch implementing a basic version of
    index-only scans.  This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakangas.
    Great!
    1. The way that nodeIndexscan.c builds up the faux heap tuple is
    perhaps susceptible to improvement.  I thought about building a
    virtual tuple, but then what do I do with an OID column, if I have
    one?  Or maybe this should be done some other way altogether.
    Can this faux heap tuple be extended with the data from another index
    once it has been created? (if the query involves those 2 indexes)

    --
    Cédric Villemain +33 (0)6 20 30 22 52
    http://2ndQuadrant.fr/
    PostgreSQL: Support 24x7 - Développement, Expertise et Formation
  • Robert Haas at Aug 12, 2011 at 2:39 am

    On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain wrote:
    2011/8/11 Robert Haas <robertmhaas@gmail.com>:
    Please find attached a patch implementing a basic version of
    index-only scans.  This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakangas.
    Great!
    Glad you like it...!
    Can this faux heap tuple be extended with the data from another index
    once it has been created? (if the query involves those 2 indexes)
    I don't see how to make that work. In general, a query like "SELECT
    a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
    use a bitmap index scan on each followed by a bitmapand and then a
    bitmap heap scan. However, this patch only touches the index-scan
    path, which only knows how to use one index for any given query.

    Actually, I can see a possible way to allow an index-only type
    optimization to be used for bitmap scans. As you scan the index, any
    tuples that can be handled index-only get returned immediately; the
    rest are thrown into a bitmap. Once you're done examining the index,
    you then do a bitmap heap scan to get the tuples that couldn't be
    handled index-only. This seems like it might be our best hope for a
    "fast count(*)" type optimization, especially if you could combine it
    with some method of scanning the index in physical order rather than
    logical order.

    But even that trick would only work with a single index. I'm not sure
    there's any good way to assemble pieces of tuples from two different
    indexes, at least not efficiently.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Cédric Villemain at Aug 12, 2011 at 10:20 am

    2011/8/12 Robert Haas <robertmhaas@gmail.com>:
    On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain
    wrote:
    2011/8/11 Robert Haas <robertmhaas@gmail.com>:
    Please find attached a patch implementing a basic version of
    index-only scans.  This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakangas.
    Great!
    Glad you like it...!
    Can this faux heap tuple be extended with the data from another index
    once it has been created? (if the query involves those 2 indexes)
    I don't see how to make that work.  In general, a query like "SELECT
    a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
    use a bitmap index scan on each followed by a bitmapand and then a
    bitmap heap scan.  However, this patch only touches the index-scan
    path, which only knows how to use one index for any given query.
    I thought of something like that: 'select a,b from foo where a=1
    order by b limit 100' (or: where a=1 and b < now())
    Actually, I can see a possible way to allow an index-only type
    optimization to be used for bitmap scans.  As you scan the index, any
    tuples that can be handled index-only get returned immediately; the
    rest are thrown into a bitmap.  Once you're done examining the index,
    you then do a bitmap heap scan to get the tuples that couldn't be
    handled index-only.  This seems like it might be our best hope for a
    "fast count(*)" type optimization, especially if you could combine it
    with some method of scanning the index in physical order rather than
    logical order.
    IIRC some ideas along those lines have been floated before, yes
    (optimizing bitmap scans).

    Maybe a question that will help me understand the feature's
    limitations (if any):
    Will an index-only scan on a table with no visibility-map bits set
    cost more (in duration, I/O, ...) than a normal index scan?
    But even that trick would only work with a single index.  I'm not sure
    there's any good way to assemble pieces of tuples from two different
    indexes, at least not efficiently.
    okay.
    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company


    --
    Cédric Villemain +33 (0)6 20 30 22 52
    http://2ndQuadrant.fr/
    PostgreSQL: Support 24x7 - Développement, Expertise et Formation
  • Robert Haas at Aug 12, 2011 at 1:10 pm

    On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain wrote:
    Can this faux heap tuple be extended with the data from another index
    once it has been created? (if the query involves those 2 indexes)
    I don't see how to make that work.  In general, a query like "SELECT
    a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
    use a bitmap index scan on each followed by a bitmapand and then a
    bitmap heap scan.  However, this patch only touches the index-scan
    path, which only knows how to use one index for any given query.
    I thought of something like that:  'select a,b from foo where a=1
    order by b limit 100' (or: where a=1 and b< now() )
    Well... PostgreSQL can only use the index on a or the index on b, not
    both. This patch doesn't change that. I'm not trying to use indexes
    in some completely new way; I'm just trying to make them faster by
    optimizing away the heap access.
    Actually, I can see a possible way to allow an index-only type
    optimization to be used for bitmap scans.  As you scan the index, any
    tuples that can be handled index-only get returned immediately; the
    rest are thrown into a bitmap.  Once you're done examining the index,
    you then do a bitmap heap scan to get the tuples that couldn't be
    handled index-only.  This seems like it might be our best hope for a
    "fast count(*)" type optimization, especially if you could combine it
    with some method of scanning the index in physical order rather than
    logical order.
    IIRC we expose some ideas around that, yes. (optimizing bitmap)

    Maybe a question that will help me understand the feature's
    limitations (if any):
    Will an index-only scan on a table with no visibility-map bits set
    cost more (in duration, I/O, ...) than a normal index scan?
    Yeah, it'll do a bit of extra work - the btree AM will cough up the
    tuple uselessly, and we'll check the visibility map, also uselessly.
    Then we'll end up doing it the regular way anyhow. I haven't measured
    that effect yet; hopefully it's fairly small.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Cédric Villemain at Aug 12, 2011 at 1:32 pm

    2011/8/12 Robert Haas <robertmhaas@gmail.com>:
    On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain
    wrote:
    Can this faux heap tuple be extended with the data from another index
    once it has been created? (if the query involves those 2 indexes)
    I don't see how to make that work.  In general, a query like "SELECT
    a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
    use a bitmap index scan on each followed by a bitmapand and then a
    bitmap heap scan.  However, this patch only touches the index-scan
    path, which only knows how to use one index for any given query.
    I thought of something like that:  'select a,b from foo where a=1
    order by b limit 100' (or: where a=1 and b< now() )
    Well... PostgreSQL can only use the index on a or the index on b, not
    both.  This patch doesn't change that.  I'm not trying to use indexes
    in some completely new way; I'm just trying to make them faster by
    optimizing away the heap access.
    For this kind of plan:
    Bitmap Heap Scan
      Recheck Cond
      -> BitmapAnd
           -> Bitmap Index Scan
           -> Bitmap Index Scan

    It could prevent useless heap fetches during the "Bitmap Heap Scan",
    couldn't it?
    Actually, I can see a possible way to allow an index-only type
    optimization to be used for bitmap scans.  As you scan the index, any
    tuples that can be handled index-only get returned immediately; the
    rest are thrown into a bitmap.  Once you're done examining the index,
    you then do a bitmap heap scan to get the tuples that couldn't be
    handled index-only.  This seems like it might be our best hope for a
    "fast count(*)" type optimization, especially if you could combine it
    with some method of scanning the index in physical order rather than
    logical order.
    IIRC we expose some ideas around that, yes. (optimizing bitmap)

    Maybe a question that will help me understand the feature's
    limitations (if any):
    Will an index-only scan on a table with no visibility-map bits set
    cost more (in duration, I/O, ...) than a normal index scan?
    Yeah, it'll do a bit of extra work - the btree AM will cough up the
    tuple uselessly, and we'll check the visibility map, also uselessly.
    Then we'll end up doing it the regular way anyhow.  I haven't measured
    that effect yet; hopefully it's fairly small.
    If it is small, or if we can reduce it until it is nearly absent,
    then why do we need to distinguish an Index Scan at all? (or an
    Index-*Only* Scan, which isn't 100% 'only', by the way)
    It would then just be an internal optimisation of how we access/use
    an index. (A really good and promising one.)

    --
    Cédric Villemain +33 (0)6 20 30 22 52
    http://2ndQuadrant.fr/
    PostgreSQL: Support 24x7 - Développement, Expertise et Formation
  • Robert Haas at Aug 12, 2011 at 3:14 pm

    On Fri, Aug 12, 2011 at 9:31 AM, Cédric Villemain wrote:
    Well... PostgreSQL can only use the index on a or the index on b, not
    both.  This patch doesn't change that.  I'm not trying to use indexes
    in some completely new way; I'm just trying to make them faster by
    optimizing away the heap access.
    For this kind of plan:
    Bitmap Heap Scan
      Recheck Cond
      -> BitmapAnd
           -> Bitmap Index Scan
           -> Bitmap Index Scan

    It could prevent useless heap fetches during the "Bitmap Heap Scan",
    couldn't it?
    The input to the bitmap heap scan is just a TID bitmap. It's too late
    to decide we want the index tuples at that point; we've forgotten
    where they are, and even if we remembered it, it wouldn't necessarily be
    any cheaper than checking the heap. We could optimize this to avoid a
    heap fetch in the case where we don't need any of the tuple data at
    all, but that's going to be somewhat rare, I think.
    Actually, I can see a possible way to allow an index-only type
    optimization to be used for bitmap scans.  As you scan the index, any
    tuples that can be handled index-only get returned immediately; the
    rest are thrown into a bitmap.  Once you're done examining the index,
    you then do a bitmap heap scan to get the tuples that couldn't be
    handled index-only.  This seems like it might be our best hope for a
    "fast count(*)" type optimization, especially if you could combine it
    with some method of scanning the index in physical order rather than
    logical order.
    IIRC we expose some ideas around that, yes. (optimizing bitmap)

    Maybe a question that will help me understand the feature's
    limitations (if any):
    Will an index-only scan on a table with no visibility-map bits set
    cost more (in duration, I/O, ...) than a normal index scan?
    Yeah, it'll do a bit of extra work - the btree AM will cough up the
    tuple uselessly, and we'll check the visibility map, also uselessly.
    Then we'll end up doing it the regular way anyhow.  I haven't measured
    that effect yet; hopefully it's fairly small.
    If it is small, or if we can reduce it until it is nearly absent,
    then why do we need to distinguish an Index Scan at all? (or an
    Index-*Only* Scan, which isn't 100% 'only', by the way)
    It would then just be an internal optimisation of how we access/use
    an index. (A really good and promising one.)
    Right, that's kind of what I was going for. But the overhead isn't
    going to be exactly zero, so I think it makes sense to disable it in
    the cases where it clearly can't work. The question is really more
    whether we need to get more fine-grained than that, and actually do a
    cost-based comparison of an index-only scan vs. a regular index scan.
    I hope not, but I haven't tested it yet.

    One other thing to think about is that the choice to use an index-scan
    is not free of externalities. The index-only scan is hopefully faster
    than touching all the heap pages, but even if it weren't, not touching
    all of the heap pages potentially means avoiding evicting things from
    shared_buffers that some *other* query might need.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Oleg Bartunov at Aug 12, 2011 at 1:10 pm
    Robert,

    Imagine if we stored positional information in the GIN index and
    returned tuples in relevance order - instant full-text search!
    Great work, guys!

    Oleg
    On Thu, 11 Aug 2011, Robert Haas wrote:

    Please find attached a patch implementing a basic version of
    index-only scans. This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakanagas.

    I'm able to demonstrate a significant performance benefit from this
    patch, even in a situation where it doesn't save any disk I/O, due to
    reduced thrashing of shared_buffers. Configuration settings:

    max_connections = 100
    shared_buffers = 400MB
    maintenance_work_mem = 256MB
    synchronous_commit = off
    checkpoint_segments = 100
    checkpoint_timeout = 30min
    checkpoint_completion_target = 0.9
    checkpoint_warning = 90s
    seq_page_cost = 1.0
    random_page_cost = 1.0
    effective_cache_size = 3GB

    Test setup:

    pgbench -i -s 50
    create table sample_data as select (random()*5000000)::int as aid,
    repeat('a', 1000) as filler from generate_series(1,100000);

    Test queries:

    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

    On my laptop, the first query executes in about 555 ms, while the
    second one takes about 1125 ms. Inspection via pg_buffercache reveals
    that the second one thrashes shared_buffers something fierce, while
    the first one does not. You can actually get the time for the first
    query down to about 450 ms if you can persuade PostgreSQL to cache the
    entire sample_data table - which is difficult, due the
    BufferAccessStrategy stuff - and as soon as you run the second query,
    it blows the table out of cache, so practically speaking you're not
    going to get that faster time very often. I expect that you could get
    an even larger benefit from this type of query if you could avoid
    actual disk I/O, rather than just buffer cache thrashing, but I
    haven't come up with a suitable test cases for that yet (volunteers?).

    There are several things about this patch that probably need further
    thought and work, or at least some discussion.

    1. The way that nodeIndexscan.c builds up the faux heap tuple is
    perhaps susceptible to improvement. I thought about building a
    virtual tuple, but then what do I do with an OID column, if I have
    one? Or maybe this should be done some other way altogether.

    2. Suppose we scan one tuple on a not-all-visible page followed by 99
    tuples on all-visible pages. The code as written will hold the pin on
    the first heap page for the entire scan. As soon as we hit the end of
    the scan or another tuple where we have to actually visit the page,
    the old pin will be released, but until then we hold onto it. This
    isn't totally stupid: the next tuple that's on a not-all-visible page
    could easily be on the same not-all-visible page we already have
    pinned. And in 99% cases holding the pin for slightly longer is
    probably completely harmless. On the flip side, it could increase the
    chances of interfering with VACUUM. Then again, a non-index-only scan
    would still interfere with the same VACUUM, so maybe we don't care.

    3. The code in create_index_path() builds up a bitmapset of heap
    attributes that get used for any purpose anywhere in the query, and
    hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
    index under consideration. However, if it were somehow possible to
    have the rel involved without using any attributes at all, we'd
    rebuild the cache over and over, since it would never become non-NULL.
    I don't think that can happen right now, but future planner changes
    might make it possible.

    4. There are a couple of cases that use index-only scans even though
    the EXPLAIN output sort of makes it look like they shouldn't. For
    example, in the above queries, an index-only scan is chosen even
    though the query does "SELECT *" from the table being scanned. Even
    though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
    that the target list of an EXISTS query is in fact discarded, e.g.:

    create or replace function error() returns int as $$begin select 1/0;
    end$$ language plpgsql;
    select * from pgbench_accounts a where exists (select error() from
    pgbench_branches b where b.bid = a.aid); -- doesn't error out!

    Along similar lines, COUNT(*) does not preclude an index-only scan,
    because the * is apparently just window dressing. You'll still get
    just a seq-scan unless you have an indexable qual in the query
    somewhere, because...

    5. We haven't made any planner changes at all, not even for cost
    estimation. It is not clear to me what the right way to do cost
    estimation here is. It seems that it would be useful to know what
    percentage of the table is all-visible at planning time, but even if
    we did, there could (and probably often will) be correlation between
    the pages the query is interested in and which visibility map bits are
    set. So I'm concerned about being overly optimistic about the cost
    savings. Also, there's the problem of figuring out how to keep the
    percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe
    that's a job for ANALYZE but I haven't thought about it much yet.
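    For illustration only (not the patch's costing, which does not exist
    yet), the simplest way such a statistic could enter the estimate,
    ignoring exactly the correlation problem described above:

```python
# Naive sketch: every matching tuple on a page whose visibility-map bit
# is unset costs one extra heap fetch; correlation is assumed away.

def expected_heap_fetches(ntuples, all_visible_frac):
    return ntuples * (1.0 - all_visible_frac)

print(expected_heap_fetches(10000, 0.9))   # ~1000 with 90% all-visible
print(expected_heap_fetches(10000, 0.0))   # 10000.0, i.e. no savings at all
```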

    6. I'm sure there are probably some statements in the documentation
    that need updating, but I haven't tracked 'em down yet.

    Comments, testing, review appreciated...
  • Heikki Linnakangas at Aug 12, 2011 at 8:03 pm

    On 11.08.2011 23:06, Robert Haas wrote:
    Comments, testing, review appreciated...
    I would've expected this to use an index-only scan:

    postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;
    SELECT 100000
    postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
    CREATE INDEX
    postgres=# VACUUM ANALYZE foo;
    VACUUM
    postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
    QUERY PLAN
    -----------------------------------------------------------------
    Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4)
    Index Cond: (id = 10)
    (2 rows)

    If it's not a predicate index, then it works:

    postgres=# DROP INDEX i_foo;
    DROP INDEX
    postgres=# CREATE INDEX i_foo2 ON foo (id);
    CREATE INDEX
    postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
    QUERY PLAN
    -----------------------------------------------------------------------
    Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4)
    Index Cond: (id = 10)
    (2 rows)

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Robert Haas at Aug 12, 2011 at 8:06 pm

    On Fri, Aug 12, 2011 at 4:03 PM, Heikki Linnakangas wrote:
    On 11.08.2011 23:06, Robert Haas wrote:

    Comments, testing, review appreciated...
    I would've expected this to use an index-only scan:

    postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;
    Ugh. I think there's something wrong with this test:

    + if (index->amcanreturn && !index->predOK && !index->indexprs)

    I think that should probably say something like (ind->indpred == NIL ||
    ind->predOK).
    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Hans-Jürgen Schönig at Aug 12, 2011 at 8:21 pm

    On Aug 12, 2011, at 10:03 PM, Heikki Linnakangas wrote:
    On 11.08.2011 23:06, Robert Haas wrote:
    Comments, testing, review appreciated...
    I would've expected this to use an index-only scan:

    postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;
    SELECT 100000
    postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
    CREATE INDEX
    postgres=# VACUUM ANALYZE foo;
    VACUUM
    postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
    QUERY PLAN
    -----------------------------------------------------------------
    Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4)
    Index Cond: (id = 10)
    (2 rows)

    If it's not a predicate index, then it works:

    postgres=# DROP INDEX i_foo;
    DROP INDEX
    postgres=# CREATE INDEX i_foo2 ON foo (id);
    CREATE INDEX
    postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
    QUERY PLAN
    -----------------------------------------------------------------------
    Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4)
    Index Cond: (id = 10)
    (2 rows)

    is there any plan to revise the cost for index only scans compared to what it is now?

    many thanks,

    hans

    --
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt, Austria
    Web: http://www.postgresql-support.de
  • Robert Haas at Aug 12, 2011 at 9:26 pm

    2011/8/12 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
    is there any plan to revise the cost for index only scans compared to what it is now?
    That's one of the points I asked for feedback on in my original email.
    "How should the costing be done?"

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Aug 12, 2011 at 9:39 pm

    Robert Haas wrote:

    That's one of the points I asked for feedback on in my original
    email. "How should the costing be done?"
    It seems pretty clear that there should be some cost adjustment. If
    you can't get good numbers somehow on what fraction of the heap
    accesses will be needed, I would suggest using a magic number based
    on the assumption that half the heap access otherwise necessary will
    be done. It wouldn't be the worst magic number in the planner. Of
    course, real numbers are always better if you can get them.

    -Kevin
  • Robert Haas at Aug 13, 2011 at 7:33 pm

    On Fri, Aug 12, 2011 at 5:39 PM, Kevin Grittner wrote:
    Robert Haas wrote:
    That's one of the points I asked for feedback on in my original
    email.  "How should the costing be done?"
    It seems pretty clear that there should be some cost adjustment.  If
    you can't get good numbers somehow on what fraction of the heap
    accesses will be needed, I would suggest using a magic number based
    on the assumption that half the heap access otherwise necessary will
    be done.  It wouldn't be the worst magic number in the planner.  Of
    course, real numbers are always better if you can get them.
    It wouldn't be that difficult (I think) to make VACUUM and/or ANALYZE
    gather some statistics; what I'm worried about is that we'd have
    correlation problems. Consider a wide table with an index on (id,
    name), and the query:

    SELECT name FROM tab WHERE id = 12345

    Now, suppose that we know that 50% of the heap pages have their
    visibility map bits set. What's the chance that this query won't need
    a heap fetch? Well, the chance is 50% *if* you assume that a row
    which has been quiescent for a long time is just as likely to be
    queried as one that has been recently inserted or updated. However,
    in many real-world use cases, nothing could be farther from the truth.

    What do we do about that?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kääriäinen Anssi at Aug 13, 2011 at 8:58 pm
    """
    Now, suppose that we know that 50% of the heap pages have their
    visibility map bits set. What's the chance that this query won't need
    a heap fetch? Well, the chance is 50% *if* you assume that a row
    which has been quiescent for a long time is just as likely to be
    queried as one that has been recently inserted or updated. However,
    in many real-world use cases, nothing could be farther from the truth.

    What do we do about that?
    """

    The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query. Wouldn't it be great if there was something like pg_stat_statements that would know the statistics per query, derived from usage...

    Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weighted visibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage would minimize the average cost of the plans much better than just the non-weighted visibility percentage.

    For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibility percentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentage would be incorrect for the queries reading old rows, by definition of the weighted visibility percentage there would not be too many of them.
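    In rough numbers (assuming reads of the N latest rows never find the
    bit set and reads of old rows always do):

```python
# The 90/10 example above: weighted visibility is driven by what is
# actually read, not by the whole-table fraction.

workload = [(0.9, 0.0),   # 90% of page reads: recent pages, bit not set
            (0.1, 1.0)]   # 10% of page reads: old pages, bit set

weighted = sum(share * visible_frac for share, visible_frac in workload)
print(weighted)   # 0.1, versus a whole-table visibility fraction near 0.9
```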

    The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.

    Of course, keeping such statistic could be more expensive than the benefit it gives. On the other hand, page hit percentage is already available...

    - Anssi
  • Heikki Linnakangas at Aug 13, 2011 at 9:31 pm

    On 13.08.2011 23:35, Kääriäinen Anssi wrote:
    """
    Now, suppose that we know that 50% of the heap pages have their
    visibility map bits set. What's the chance that this query won't need
    a heap fetch? Well, the chance is 50% *if* you assume that a row
    which has been quiescent for a long time is just as likely to be
    queried as one that has been recently inserted or updated. However,
    in many real-world use cases, nothing could be farther from the truth.

    What do we do about that?
    """

    The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query.
    That is somewhat compensated by the fact that tuples that are accessed
    more often are also more likely to be in cache. Fetching the heap tuple
    to check visibility is very cheap when the tuple is in cache.

    I'm not sure how far that compensates it, though. I'm sure there's
    typically nevertheless a fairly wide range of pages that have been
    modified since the last vacuum, but not in cache anymore.
    Wouldn't it be great if there was something like pg_stat_statements that would know the statistics per query, derived from usage...

    Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weighted visibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage would minimize the average cost of the plans much better than just the non-weighted visibility percentage.

    For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibility percentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentage would be incorrect for the queries reading old rows, by definition of the weighted visibility percentage there would not be too many of them.

    The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.
    Yeah, I'm not excited about making the planner and statistics more
    dynamic. Feedback loops and plan instability are not fun. I think we
    should rather be more aggressive in setting the visibility map bits.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Jim Nasby at Aug 15, 2011 at 9:53 pm

    On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote:
    The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query.
    That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache.

    I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore.
    http://xkcd.org/937/ :)

    Could something be added to pg_stats that tracks visibility map usefulness on a per-attribute basis? Perhaps another set of stats buckets that show visibility percentages for each stats bucket?
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Anssi Kääriäinen at Aug 16, 2011 at 10:02 am

    On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:
    The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.
    Yeah, I'm not excited about making the planner and statistics more
    dynamic. Feedback loops and plan instability are not fun.
    I might be a little out of my league here... But I was thinking about
    the cache hit ratio and feedback loops. I understand automatic tuning
    would be hard. But making automatic tuning easier (by using pg_tune for
    example) would be a big plus for most use cases.

    To make it easier to tune the page read costs automatically, it would be
    nice if there would be four variables instead of the current two:
    - random_page_cost is the cost of reading a random page from storage.
    Currently it is not; it is the cost of accessing a random page, taking
    into account that it might be in memory.
    - seq_page_cost is the cost of reading pages sequentially from storage
    - memory_page_cost is the cost of reading a page in memory
    - cache_hit_ratio is the expected cache hit ratio

    memory_page_cost would be server global, random and seq page costs
    tablespace specific, and cache_hit_ratio relation specific. You would
    get the current behavior by tuning *_page_costs realistically, and
    setting cache_hit_ratio globally so that the expected random_page_cost /
    seq_page_cost stays the same as now.
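    As a sketch, the four knobs would reduce to a single effective page
    cost like this (names are from the proposal above, not existing GUCs):

```python
# Blending storage and memory page costs by the expected cache hit ratio.

def effective_page_cost(storage_page_cost, memory_page_cost, cache_hit_ratio):
    return (cache_hit_ratio * memory_page_cost
            + (1.0 - cache_hit_ratio) * storage_page_cost)

# e.g. a storage random read costing 400x a memory page, 99% cached:
print(effective_page_cost(40.0, 0.1, 0.99))   # ~0.5
```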

    The biggest advantage of this would be that the correct values are much
    easier to detect automatically compared to the current situation. This can
    be done using pg_statio_* views and IO speed testing. They should not be
    tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as
    that leads to the possibility of feedback loops and plan instability.
    The variables would also be much easier to understand.

    There is the question if one should be allowed to tune the *_page_costs
    at all. If I am not missing something, it is possible to detect the
    correct values programmatically and they do not change if you do not
    change the hardware. Cache hit ratio is the real reason why they are
    currently so important for tuning.

    An example why the current random_page_cost and seq_page_cost tuning is
    not adequate is that you can only set random_page_cost per tablespace.
    That makes perfect sense if random_page_cost would be the cost of
    accessing a page in storage. But it is not, it is a combination of that
    and caching effects, so that it actually varies per relation (and over
    time). How do you set it correctly for a query where one relation is
    fully cached and another one not?

    Another problem is that if you use random_page_cost == seq_page_cost,
    you are effectively saying that everything is in cache. But if
    everything is in cache, the cost of page access relative to cpu_*_costs
    is way off. The more random_page_cost and seq_page_cost are different,
    the more they mean the storage access costs. When they are the same,
    they mean the memory page cost. There can be an order of magnitude of
    difference between a storage page cost and a memory page cost. So it is hard
    to tune the cpu_*_costs realistically for cases where sometimes data is
    in cache and sometimes not.

    Ok, enough hand waving for one post :) Sorry if this all is obvious /
    discussed before. My googling didn't turn up anything directly related,
    although these have some similarity:
    - Per-table random_page_cost for tables that we know are always cached
    [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]
    - Script to compute random page cost
    [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
    - The science of optimization in practical terms?
    [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php],
    getting really interesting starting from here:
    [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

    - Anssi
  • Cédric Villemain at Sep 23, 2011 at 2:35 pm

    2011/8/16 Anssi Kääriäinen <anssi.kaariainen@thl.fi>:
    On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:

    The same idea could of course be used to calculate the effective cache
    hit ratio for each table. Cache hit ratio would have the problem of feedback
    loops, though.
    Yeah, I'm not excited about making the planner and statistics more
    dynamic. Feedback loops and plan instability are not fun.
    I might be a little out of my league here... But I was thinking about the
    cache hit ratio and feedback loops. I understand automatic tuning would be
    hard. But making automatic tuning easier (by using pg_tune for example)
    would be a big plus for most use cases.

    To make it easier to tune the page read costs automatically, it would be
    nice if there would be four variables instead of the current two:
    - random_page_cost is the cost of reading a random page from storage.
    Currently it is not, it is the cost of accessing a random page, taking in
    account it might be in memory.
    - seq_page_cost is the cost of reading pages sequentially from storage
    - memory_page_cost is the cost of reading a page in memory
    - cache_hit_ratio is the expected cache hit ratio

    memory_page_cost would be server global, random and seq page costs
    tablespace specific, and cache_hit_ratio relation specific. You would get
    the current behavior by tuning *_page_costs realistically, and setting
    cache_hit_ratio globally so that the expected random_page_cost /
    seq_page_cost stays the same as now.

    The biggest advantage of this would be that the correct values are much
    easier to detect automatically compared to current situation. This can be
    done using pg_statio_* views and IO speed testing. They should not be tuned
    automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads
    to the possibility of feedback loops and plan instability. The variables
    would also be much easier to understand.

    There is the question if one should be allowed to tune the *_page_costs at
    all. If I am not missing something, it is possible to detect the correct
    values programmatically and they do not change if you do not change the
    hardware. Cache hit ratio is the real reason why they are currently so
    important for tuning.

    An example why the current random_page_cost and seq_page_cost tuning is not
    adequate is that you can only set random_page_cost per tablespace. That
    makes perfect sense if random_page_cost would be the cost of accessing a
    page in storage. But it is not, it is a combination of that and caching
    effects, so that it actually varies per relation (and over time). How do you
    set it correctly for a query where one relation is fully cached and another
    one not?

    Another problem is that if you use random_page_cost == seq_page_cost, you
    are effectively saying that everything is in cache. But if everything is in
    cache, the cost of page access relative to cpu_*_costs is way off. The more
    random_page_cost and seq_page_cost are different, the more they mean the
    storage access costs. When they are the same, they mean the memory page
    cost. There can be an order of magnitude in difference of a storage page
    cost and a memory page cost. So it is hard to tune the cpu_*_costs
    realistically for cases where sometimes data is in cache and sometimes not.

    Ok, enough hand waving for one post :) Sorry if this all is obvious /
    discussed before. My googling didn't turn out anything directly related,
    although these have some similarity:
    - Per-table random_page_cost for tables that we know are always cached
    [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]
    - Script to compute random page cost
    [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
    -  The science of optimization in practical terms?
    [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting
    really interesting starting from here:
    [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]
    late reply.
    You can add this link to your list:
    http://archives.postgresql.org/pgsql-hackers/2011-06/msg01140.php


    --
    Cédric Villemain +33 (0)6 20 30 22 52
    http://2ndQuadrant.fr/
    PostgreSQL: Support 24x7 - Développement, Expertise et Formation
  • Greg Stark at Sep 23, 2011 at 2:43 pm

    On Tue, Aug 16, 2011 at 11:24 AM, Anssi Kääriäinen wrote:
    There is the question if one should be allowed to tune the *_page_costs at
    all. If I am not missing something, it is possible to detect the correct
    values programmatically and they do not change if you do not change the
    hardware. Cache hit ratio is the real reason why they are currently so
    important for tuning.
    Unfortunately, things are a tad more complex than this picture. There are
    multiple levels of cache involved here. There's the Postgres buffer
    cache, the filesystem buffer cache, and then the raid controller or
    drives often have cache as well.

    Also the difference between seq_page_cost and random_page_cost is
    hiding another cache effect. The reason sequential reads are faster is
    twofold, there's no seek but also there's an increased chance the
    buffer is already in the filesystem cache due to having been
    prefetched. Actually it's hardly even probabilistic -- only every nth
    page needs to do i/o when doing sequential reads.
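    Back-of-envelope for "only every nth page": with 8 kB PostgreSQL pages
    and, say, a 128 kB OS readahead window (an assumption; the Linux
    default is tunable), one physical read services 16 page requests:

```python
postgres_page_kb = 8
readahead_kb = 128                # assumed OS readahead window
pages_per_physical_read = readahead_kb // postgres_page_kb
print(pages_per_physical_read)    # 16: only 1 in 16 sequential page
                                  # reads actually touches the disk
```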


    --
    greg
  • Marti Raudsepp at Sep 25, 2011 at 6:44 pm

    On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas wrote:
    That is somewhat compensated by the fact that tuples that are accessed more
    often are also more likely to be in cache. Fetching the heap tuple to check
    visibility is very cheap when the tuple is in cache.

    I'm not sure how far that compensates it, though. I'm sure there's typically
    nevertheless a fairly wide range of pages that have been modified since the
    last vacuum, but not in cache anymore.
    Would it make sense to re-evaluate the visibility bit just before a
    page gets flushed out from shared buffers? On a system with no long
    transactions, it seems likely that a dirty page is already all-visible
    by the time bgwriter (or shared buffers memory pressure) gets around
    to writing it out. That way we don't have to wait for vacuum to do it
    and would make your observation hold more often.

    Regards,
    Marti
  • Robert Haas at Sep 26, 2011 at 12:46 am

    On Sun, Sep 25, 2011 at 2:43 PM, Marti Raudsepp wrote:
    On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas
    wrote:
    That is somewhat compensated by the fact that tuples that are accessed more
    often are also more likely to be in cache. Fetching the heap tuple to check
    visibility is very cheap when the tuple is in cache.

    I'm not sure how far that compensates it, though. I'm sure there's typically
    nevertheless a fairly wide range of pages that have been modified since the
    last vacuum, but not in cache anymore.
    Would it make sense to re-evaluate the visibility bit just before a
    page gets flushed out from shared buffers? On a system with no long
    transactions, it seems likely that a dirty page is already all-visible
    by the time bgwriter (or shared buffers memory pressure) gets around
    to writing it out. That way we don't have to wait for vacuum to do it
    and would make your observation hold more often.
    This has been suggested before, and, sure, there might be cases where
    it helps. But you need to choose your test case fairly carefully.
    For example, if you're doing a large sequential scan on a table, the
    ring-buffer logic causes processes to evict their own pages, and so
    the background writer doesn't get a chance to touch any of those
    pages. You need some kind of a workload where pages are being evicted
    from shared buffers slowly enough that it ends up being the background
    writer, rather than the individual backends, that does the work. But if
    you have that kind of workload, then we can infer that most of your
    working set fits into shared buffers. And in that case you don't
    really need index-only scans in the first place. The main point of
    index only scans is to optimize the case where you have a gigantic
    table and you're trying to avoid swamping the system with random I/O.
    I'm not saying that such a change would be a particularly bad idea,
    but I'm not personally planning to work on it any time soon because I
    can't convince myself that it really helps all that much.

    I think the real solution to getting visibility map bits set is to
    vacuum more frequently, but have it be cheaper each time. Our default
    autovacuum settings vacuum the table when the number of updates and
    deletes reaches 20% of the table size. But those settings were put in
    place under the assumption that we'll have to scan the entire heap,
    dirtying every page that contains dead tuples, scan all the indexes
    and remove the associated index pointers, and then scan and dirty the
    heap pages that contain now-dead line pointers a second time to remove
    those. The visibility map has eroded those assumptions to some
    degree, because now we probably won't have to scan the entire heap
    every time we vacuum; and I'm hoping we're going to see some further
    erosion. Pavan has a pending patch which, if we can work out the
    details, will eliminate the second heap scan; and we've also talked
    about doing the index scan only when there are enough dead line
    pointers to justify the effort. That, it seems to me, would open the
    door to lowering the scale factor, maybe by quite a bit - which, in
    turn, would help us control bloat better and get visibility map bits
    set sooner.
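    For scale, the trigger Robert describes: autovacuum vacuums a table
    once its dead tuples exceed autovacuum_vacuum_threshold +
    autovacuum_vacuum_scale_factor * reltuples (defaults 50 and 0.2, the
    "20% of the table" above):

```python
def autovacuum_dead_tuple_trigger(reltuples, threshold=50, scale_factor=0.2):
    # Defaults match the 20%-of-the-table behavior discussed above.
    return threshold + scale_factor * reltuples

print(autovacuum_dead_tuple_trigger(1_000_000))   # about 200,050 dead tuples
```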

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Greg Smith at Aug 15, 2011 at 11:37 pm

    On 08/11/2011 04:06 PM, Robert Haas wrote:
    On my laptop, the first query executes in about 555 ms, while the
    second one takes about 1125 ms...I expect that you could get
    an even larger benefit from this type of query if you could avoid
    actual disk I/O, rather than just buffer cache thrashing, but I
    haven't come up with a suitable test case for that yet (volunteers?).
    That part I can help with, using a Linux test that kills almost every
    cache. I get somewhat faster times on my desktop here running the cached
    version like you were doing (albeit with debugging options on, so I
    wouldn't read too much into this set of numbers):

    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
    sum
    --------------
    250279412983
    (1 row)

    Time: 472.778 ms

    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------
    Aggregate  (cost=133325.00..133325.01 rows=1 width=4)
      ->  Nested Loop Semi Join  (cost=0.00..133075.00 rows=100000 width=4)
            ->  Seq Scan on sample_data a1  (cost=0.00..15286.00 rows=100000 width=4)
            ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.00..1.17 rows=1 width=4)
                  Index Cond: (aid = a1.aid)
                  Filter: (aid <> 1234567)

    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
    sum
    --------------
    250279412983

    Time: 677.902 ms
    explain select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
    QUERY PLAN
    ------------------------------------------------------------------------------------------------------------
    Aggregate  (cost=133325.00..133325.01 rows=1 width=4)
      ->  Nested Loop Semi Join  (cost=0.00..133075.00 rows=100000 width=4)
            ->  Seq Scan on sample_data a1  (cost=0.00..15286.00 rows=100000 width=4)
            ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.00..1.17 rows=1 width=4)
                  Index Cond: (aid = a1.aid)
                  Filter: (bid <> 1234567)

    If I set up my gsmith account to be able to start and stop the server
    with pg_ctl and a valid PGDATA, and drop these two scripts in that home
    directory:

    == index-only-1.sql ==

    \timing
    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

    explain select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

    == index-only-2.sql ==

    \timing
    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

    explain select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

    I can then run this script as root:

    #!/bin/bash
    ME="gsmith"
    su - $ME -c "pg_ctl stop -w"
    echo 3 > /proc/sys/vm/drop_caches
    su - $ME -c "pg_ctl start -w"
    su - $ME -c "psql -ef index-only-1.sql"
    su - $ME -c "pg_ctl stop -w"
    echo 3 > /proc/sys/vm/drop_caches
    su - $ME -c "pg_ctl start -w"
    su - $ME -c "psql -ef index-only-2.sql"

    And get results that start with zero information cached in RAM, showing
    a much more dramatic difference. Including some snippets of interesting
    vmstat too, the index-only one gets faster as it runs while the regular
    one is pretty flat:

    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
    Time: 31677.683 ms

    $ vmstat 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd     free  buff  cache   si   so    bi    bo   in   cs us sy id wa
     0  1      0 15807288  4388 126440    0    0  4681   118 1407 2432  1  1 89 10
     1  1      0 15779388  4396 154448    0    0  3587    17 1135 2058  1  0 86 13
     0  1      0 15739956  4396 193672    0    0  5800     0 1195 2056  1  0 87 12
     0  1      0 15691844  4396 241832    0    0  7053     3 1299 2044  1  0 86 13
     0  1      0 15629736  4396 304096    0    0  7995    37 1391 2053  1  0 87 12
     0  1      0 15519244  4400 414268    0    0 11639    14 1448 2189  1  0 87 12

    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
    Time: 172381.235 ms

    $ vmstat 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd     free  buff  cache   si   so    bi    bo   in   cs us sy id wa
     0  1      0 15736500  4848 196444    0    0  3142    22 1092 1989  1  0 86 13
     0  1      0 15711948  4852 221072    0    0  3411     1 1039 1943  0  0 88 12
     0  1      0 15685412  4852 247496    0    0  3618    34 1111 1997  0  0 86 13
    [this is the middle part, rate doesn't vary too much]

    That's 5.4X as fast; not too shabby! Kind of interesting how different
    the I/O pattern is on the index-only version. I ran this test
    against a 3-disk RAID0 set with a 256MB BBWC, so there's some
    possibility of caching here. But given that each query blows away a
    large chunk of the other's data, I wouldn't expect that to be a huge
    factor here:

    gsmith=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
    pg_size_pretty
    ----------------
    640 MB

    gsmith=# select pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'));
    pg_size_pretty
    ----------------
    107 MB

    gsmith=# select pg_size_pretty(pg_relation_size('sample_data'));
    pg_size_pretty
    ----------------
    112 MB

    And with the large difference in response time, things appear to be
    working as hoped even in this situation. If you try this on your
    laptop, where drive cache size and random I/O are likely to be even
    slower, you might see an even larger difference.

    --
    Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
  • Robert Haas at Aug 16, 2011 at 12:52 am

    On Mon, Aug 15, 2011 at 7:37 PM, Greg Smith wrote:
    That's 5.4X as fast; not too shabby! Awesome!
    And with the large difference in response time, things appear to be working
    as hoped even in this situation.  If you try this on your laptop, where
    drive cache size and random I/O are likely to be even slower, you might see
    an even larger difference.
    Hah! Just in case a 5.4X performance improvement isn't good enough? :-)

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Oct 6, 2011 at 7:15 pm

    Robert Haas writes:
    Please find attached a patch implementing a basic version of
    index-only scans. This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakangas.
    I'm starting to review this patch now. Has any further work been done
    since the first version was posted? Also, was any documentation
    written? I'm a tad annoyed by having to reverse-engineer the changes
    in the AM API specification from the code.

    regards, tom lane
  • Robert Haas at Oct 6, 2011 at 7:46 pm

    On Thu, Oct 6, 2011 at 3:15 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Please find attached a patch implementing a basic version of
    index-only scans.  This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakangas.
    I'm starting to review this patch now. Thanks!
    Has any further work been done
    since the first version was posted?  Also, was any documentation
    written?  I'm a tad annoyed by having to reverse-engineer the changes
    in the AM API specification from the code.
    Not really. We have detected a small performance regression when both
    heap and index fit in shared_buffers and an index-only scan is used.
    I have a couple of ideas for improving this. One is to store a
    virtual tuple into the slot instead of building a regular tuple, but
    what do we do about tuples with OIDs? Another is to avoid locking the
    index buffer multiple times - right now it locks the index buffer to
    get the TID, and then relocks it to extract the index tuple (after
    checking that nothing disturbing has happened meanwhile). It seems
    likely that with some refactoring we could get this down to a single
    lock/unlock cycle, but I haven't figured out exactly where the TID
    gets copied out.

    With regard to the AM API, the basic idea is we're just adding a
    Boolean to say whether the AM is capable of returning index tuples.
    If it's not, then we don't ever try an index-only scan. If it is,
    then we'll set the want_index_tuple flag if an index-only scan is
    possible. This requests that the AM attempt to return the tuple; but
    the AM is also allowed to fail and not return the tuple whenever it
    wants. This is more or less the interface that Heikki suggested a
    couple years back, but it might well be vulnerable to improvement.
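    The contract Robert describes - a per-AM capability flag, a per-scan
    request, and permission for the AM to decline on any given tuple - can be
    sketched in miniature. Everything below (struct and function names
    included) is illustrative scaffolding, not the patch's actual code:

    ```c
    /* Toy model of the proposed AM API: an access method advertises whether
     * it can ever return index tuples, and even then a given fetch may fail,
     * forcing the executor to fall back to the heap. Illustrative only. */
    #include <assert.h>
    #include <stdbool.h>

    typedef struct IndexTupleStub { int datum; } IndexTupleStub;

    typedef struct AmRoutineStub {
        bool amcanreturn;          /* can this AM ever hand back index tuples? */
    } AmRoutineStub;

    typedef struct ScanStub {
        const AmRoutineStub *am;
        bool want_index_tuple;     /* planner chose an index-only scan */
        IndexTupleStub tuple;
        bool tuple_valid;          /* the AM may decline per tuple */
    } ScanStub;

    /* Returns true if the executor may use scan->tuple; false means
     * fall back to fetching the heap tuple. */
    static bool try_index_only_fetch(const ScanStub *scan)
    {
        if (!scan->am->amcanreturn || !scan->want_index_tuple)
            return false;
        return scan->tuple_valid;  /* "allowed to fail" per tuple */
    }

    int main(void)
    {
        AmRoutineStub btree_like = { .amcanreturn = true };
        AmRoutineStub lossy_am   = { .amcanreturn = false };

        ScanStub ok       = { &btree_like, true, { 42 }, true };
        ScanStub declined = { &btree_like, true, { 0 },  false };
        ScanStub never    = { &lossy_am,   true, { 0 },  true };

        assert(try_index_only_fetch(&ok));
        assert(!try_index_only_fetch(&declined)); /* must visit heap */
        assert(!try_index_only_fetch(&never));    /* AM can't return tuples */
        return 0;
    }
    ```

    The "allowed to fail" branch is the part Tom later objects to; the sketch
    just makes explicit where that escape hatch sits in the control flow.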

    Incidentally, if you happen to feel the urge to beat this around and
    send it back rather than posting a list of requested changes, feel
    free.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Oct 6, 2011 at 8:11 pm

    Robert Haas writes:
    Not really. We have detected a small performance regression when both
    heap and index fit in shared_buffers and an index-only scan is used.
    I have a couple of ideas for improving this. One is to store a
    virtual tuple into the slot instead of building a regular tuple, but
    what do we do about tuples with OIDs?
    Yeah, I was just looking at that. I think it's going to be a clear win
    to use a virtual tuple slot instead of constructing and deconstructing
    a HeapTuple. The obvious solution is to decree that you can't use an
    index-only scan if the query requires fetching OID (or any other system
    column). This would be slightly annoying for catalog fetches but I'm
    not sure I believe that catalog queries are that important a use-case.

    I was also toying with the notion of pushing the slot fill-in into the
    AM, so that the AM API is to return a filled TupleSlot not an
    IndexTuple. This would not save any cycles AFAICT --- at least in
    btree, we still have to make a palloc'd copy of the IndexTuple so that
    we can release lock on the index page. The point of it would be to
    avoid the assumption that the index's internal storage has exactly the
    format of IndexTuple. Don't know whether we'd ever have any actual use
    for that flexibility, but it seems like it wouldn't cost much to
    preserve the option.
    Another is to avoid locking the
    index buffer multiple times - right now it locks the index buffer to
    get the TID, and then relocks it to extract the index tuple (after
    checking that nothing disturbing has happened meanwhile). It seems
    likely that with some refactoring we could get this down to a single
    lock/unlock cycle, but I haven't figured out exactly where the TID
    gets copied out.
    Yeah, maybe, but let's get the patch committed before we start looking
    for second-order optimizations.
    With regard to the AM API, the basic idea is we're just adding a
    Boolean to say whether the AM is capable of returning index tuples.
    If it's not, then we don't ever try an index-only scan. If it is,
    then we'll set the want_index_tuple flag if an index-only scan is
    possible. This requests that the AM attempt to return the tuple; but
    the AM is also allowed to fail and not return the tuple whenever it
    wants.
    It was the "allowed to fail" bit that I was annoyed to discover only by
    reading some well-buried code.
    Incidentally, if you happen to feel the urge to beat this around and
    send it back rather than posting a list of requested changes, feel
    free.
    You can count on that ;-). I've already found a lot of things I didn't
    care for.

    regards, tom lane
  • Thom Brown at Oct 6, 2011 at 8:19 pm

    On 6 October 2011 21:11, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Not really.  We have detected a small performance regression when both
    heap and index fit in shared_buffers and an index-only scan is used.
    I have a couple of ideas for improving this.  One is to store a
    virtual tuple into the slot instead of building a regular tuple, but
    what do we do about tuples with OIDs?
    Yeah, I was just looking at that.  I think it's going to be a clear win
    to use a virtual tuple slot instead of constructing and deconstructing
    a HeapTuple.  The obvious solution is to decree that you can't use an
    index-only scan if the query requires fetching OID (or any other system
    column).  This would be slightly annoying for catalog fetches but I'm
    not sure I believe that catalog queries are that important a use-case.

    I was also toying with the notion of pushing the slot fill-in into the
    AM, so that the AM API is to return a filled TupleSlot not an
    IndexTuple.  This would not save any cycles AFAICT --- at least in
    btree, we still have to make a palloc'd copy of the IndexTuple so that
    we can release lock on the index page.  The point of it would be to
    avoid the assumption that the index's internal storage has exactly the
    format of IndexTuple.  Don't know whether we'd ever have any actual use
    for that flexibility, but it seems like it wouldn't cost much to
    preserve the option.
    Another is to avoid locking the
    index buffer multiple times - right now it locks the index buffer to
    get the TID, and then relocks it to extract the index tuple (after
    checking that nothing disturbing has happened meanwhile).  It seems
    likely that with some refactoring we could get this down to a single
    lock/unlock cycle, but I haven't figured out exactly where the TID
    gets copied out.
    Yeah, maybe, but let's get the patch committed before we start looking
    for second-order optimizations.
    +1

    Been testing this today with a few regression tests and haven't seen
    anything noticeably broken. Does what it says on the tin.

    --
    Thom Brown
    Twitter: @darkixion
    IRC (freenode): dark_ixion
    Registered Linux user: #516935

    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Oct 8, 2011 at 3:52 pm

    I wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Not really. We have detected a small performance regression when both
    heap and index fit in shared_buffers and an index-only scan is used.
    I have a couple of ideas for improving this. One is to store a
    virtual tuple into the slot instead of building a regular tuple, but
    what do we do about tuples with OIDs?
    [ that's done ]
    I was also toying with the notion of pushing the slot fill-in into the
    AM, so that the AM API is to return a filled TupleSlot not an
    IndexTuple. This would not save any cycles AFAICT --- at least in
    btree, we still have to make a palloc'd copy of the IndexTuple so that
    we can release lock on the index page. The point of it would be to
    avoid the assumption that the index's internal storage has exactly the
    format of IndexTuple. Don't know whether we'd ever have any actual use
    for that flexibility, but it seems like it wouldn't cost much to
    preserve the option.
    BTW, I concluded that that would be a bad idea, because it would involve
    the index AM in some choices that we're likely to want to change. In
    particular it would foreclose ever doing anything with expression
    indexes, without an AM API change. Better to just define the AM's
    responsibility as to hand back a tuple defined according to the index's
    columns.
    Another is to avoid locking the
    index buffer multiple times - right now it locks the index buffer to
    get the TID, and then relocks it to extract the index tuple (after
    checking that nothing disturbing has happened meanwhile). It seems
    likely that with some refactoring we could get this down to a single
    lock/unlock cycle, but I haven't figured out exactly where the TID
    gets copied out.
    Yeah, maybe, but let's get the patch committed before we start looking
    for second-order optimizations.
    On reflection I'm starting to think that the above would be a good idea
    because there are a couple of bogosities in the basic choices this patch
    made. In particular, I'm thinking about how we could use an index on
    f(x) to avoid recalculating f() in something like

    select f(x) from tab where f(x) < 42;

    assuming that f() is expensive but immutable. The planner side of this
    is already a bit daunting, because it's not clear how to recognize that
    an index on f(x) is a covering index (the existing code is going to
    think that x itself needs to be available). But the executor side is a
    real problem, because it will fail to make use of the f() value fetched
    from the index anytime the heap visibility test fails.

    I believe that we should rejigger things so that when an index-only scan
    is selected, the executor *always* works from the data supplied by the
    index. Even if it has to visit the heap --- it will do that but just to
    consult the tuple's visibility data, and then use what it got from the
    index anyway. This means we'd build the plan node's filter quals and
    targetlist to reference the index tuple columns not the underlying
    table's. (Which in passing gets rid of the behavior you were
    complaining about that EXPLAIN VERBOSE shows a lot of columns that
    aren't actually being computed.)

    In order to make this work, we have to remove the API wart that says the
    index AM is allowed to choose to not return the index tuple. And that
    ties into what you were saying above. What we ought to do is have
    _bt_readpage() save off copies of the whole tuples not only the TIDs
    when it is extracting data from the page. This is no more net copying
    work than what happens now (assuming that all the tuples get fetched)
    since we won't need the per-tuple memcpy that occurs now in
    bt_getindextuple. The tuples can go into a page-sized workspace buffer
    associated with the BTScanPosData structure, and then just be referenced
    in-place in that workspace with no second copy step.
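    The copy-once-into-a-workspace idea can be sketched as follows; the
    buffer size, struct, and function names are illustrative, not nbtree's
    actual code:

    ```c
    /* Sketch of "copy once into a workspace": while the page lock is held,
     * append each tuple's bytes into one preallocated buffer and remember
     * its offset; after the lock is released the tuples are read in place,
     * with no second per-tuple copy. Illustrative only. */
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    #define WORKSPACE_SIZE 8192    /* one page's worth of tuple data */
    #define MAX_ITEMS 64

    typedef struct ScanPosStub {
        char workspace[WORKSPACE_SIZE];
        size_t used;
        size_t item_off[MAX_ITEMS]; /* offset of each saved tuple */
        size_t item_len[MAX_ITEMS];
        int nitems;
    } ScanPosStub;

    /* Called while the page lock is held: save one tuple's bytes. */
    static int save_item(ScanPosStub *pos, const void *tup, size_t len)
    {
        if (pos->nitems >= MAX_ITEMS || pos->used + len > WORKSPACE_SIZE)
            return -1;
        memcpy(pos->workspace + pos->used, tup, len);
        pos->item_off[pos->nitems] = pos->used;
        pos->item_len[pos->nitems] = len;
        pos->used += len;
        return pos->nitems++;
    }

    /* Called after the lock is released: reference the copy in place. */
    static const void *get_item(const ScanPosStub *pos, int i)
    {
        return pos->workspace + pos->item_off[i];
    }

    int main(void)
    {
        ScanPosStub pos = { .used = 0, .nitems = 0 };
        int a = save_item(&pos, "alpha", 6);   /* include the NUL */
        int b = save_item(&pos, "beta", 5);
        assert(strcmp(get_item(&pos, a), "alpha") == 0);
        assert(strcmp(get_item(&pos, b), "beta") == 0);
        return 0;
    }
    ```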

    I'm inclined to do the last part immediately, since there's a
    performance argument for it, and then work on revising the executor
    implementation.

    regards, tom lane
  • Robert Haas at Oct 8, 2011 at 10:06 pm

    On Oct 8, 2011, at 11:52 AM, Tom Lane wrote:
    I'm inclined to do the last part immediately, since there's a
    performance argument for it, and then work on revising the executor
    implementation.
    Sounds great. Thanks for working on this.

    ...Robert
  • Tom Lane at Oct 9, 2011 at 8:03 pm

    I wrote:
    I believe that we should rejigger things so that when an index-only scan
    is selected, the executor *always* works from the data supplied by the
    index. Even if it has to visit the heap --- it will do that but just to
    consult the tuple's visibility data, and then use what it got from the
    index anyway. This means we'd build the plan node's filter quals and
    targetlist to reference the index tuple columns not the underlying
    table's.
    I've been studying this a bit. The key decision is how to represent
    Vars that reference columns of the index. We really have to have
    varattno equal to the index column number, else ExecEvalVar will pull
    the wrong column from the tuple. However, varno is not so clear cut.
    There are at least four things we could do:

    1. Keep varno = table's rangetable index. The trouble with this is that
    a Var referencing index column N would look exactly like a Var
    referencing table column N; so the same Var would mean something
    different in an index-only scan node than it does in any other type of
    scan node for the same table. We could maybe make that work, but it
    seems confusing and fragile as heck. The executor isn't going to care
    much, but inspection of the plan tree by e.g. EXPLAIN sure will.

    2. Set varno = OUTER (or maybe INNER). This is safe because there's no
    other use for OUTER/INNER in a table scan node. We would have to hack
    things so that the index tuple gets put into econtext->ecxt_outertuple
    (resp. ecxt_innertuple) at runtime, but that seems like no big problem.
    In both setrefs.c and ruleutils.c, it would be desirable to have a
    TargetEntry list somewhere representing the index columns, which setrefs
    would want so it could set up the special Var nodes with fix_upper_expr,
    and ruleutils would want so it could interpret the Vars using existing
    machinery. I'm not sure whether to hang that list on the index-only
    plan node or expect EXPLAIN to regenerate it at need.

    3. Invent another special varno value similar to OUTER/INNER but
    representing an index reference. This is just about like #2 except that
    we could still put the index tuple into econtext->ecxt_scantuple, and
    ExecEvalVar would do the right thing as it stands.

    4. Create a rangetable entry specifically representing the index,
    and set varno equal to that RTE's number. This has some attractiveness
    in terms of making the meaning of the Vars clear, but an RTE that
    represents an index rather than a table seems kind of ugly otherwise.
    It would likely require changes in unrelated parts of the code.


    One point here is that we have historically used solution #1 to
    represent the index keys in index qual expressions. We avoid the
    ambiguity issues by not asking EXPLAIN to try to interpret the indexqual
    tree at all: it works from indexqualorig which contains ordinary Vars.
    So one way to dodge the disadvantages of solution #1 would be to add
    untransformed "targetlistorig" and "qualorig" fields to an index-only
    plan node, and use those for EXPLAIN. However, those fields would be
    totally dead weight if the plan were never EXPLAINed, whereas
    indexqualorig has a legitimate use for rechecking indexquals against the
    heap tuple in case of a lossy index. (BTW, if we go with any solution
    other than #1, I'm strongly inclined to change the representation of
    indexqual to match. See the comments in fix_indexqual_operand.)

    At the moment I'm leaning to approach #3, but I wonder if anyone has
    a different opinion or another idea altogether.
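    A toy dispatch makes the appeal of approach #3 concrete: a distinct
    sentinel varno means the varno alone says which tuple to read, so an
    index-column Var can never be mistaken for a table Var. The sentinel
    values and struct names below are illustrative, not PostgreSQL's:

    ```c
    /* Toy ExecEvalVar-style dispatch on special varno sentinels.
     * With approach #3 the index tuple still lives in the scan tuple slot,
     * so the INDEX case needs no new slot plumbing. Illustrative only. */
    #include <assert.h>

    enum { OUTER_SENTINEL = -1, INNER_SENTINEL = -2, INDEX_SENTINEL = -3 };

    typedef struct TupleStub { const int *values; } TupleStub;

    typedef struct ContextStub {
        TupleStub scantuple;   /* with approach #3, also holds the index tuple */
        TupleStub outertuple;
        TupleStub innertuple;
    } ContextStub;

    static int eval_var(const ContextStub *cx, int varno, int varattno)
    {
        const TupleStub *t;
        switch (varno) {
            case OUTER_SENTINEL: t = &cx->outertuple; break;
            case INNER_SENTINEL: t = &cx->innertuple; break;
            case INDEX_SENTINEL: /* index tuple lives in scantuple here */
            default:             t = &cx->scantuple;  break;
        }
        return t->values[varattno - 1];   /* attnos are 1-based */
    }

    int main(void)
    {
        int index_cols[] = { 10, 20 };
        int outer_cols[] = { 7 };
        ContextStub cx = { { index_cols }, { outer_cols }, { 0 } };
        assert(eval_var(&cx, INDEX_SENTINEL, 2) == 20);
        assert(eval_var(&cx, OUTER_SENTINEL, 1) == 7);
        return 0;
    }
    ```

    Approach #1's hazard is visible by omission: if index Vars kept the
    table's varno, nothing in the dispatch could tell the two apart.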

    regards, tom lane
  • Greg Stark at Oct 9, 2011 at 9:32 pm

    On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane wrote:
    At the moment I'm leaning to approach #3, but I wonder if anyone has
    a different opinion or another idea altogether.
    Would any of these make it more realistic to talk about the crazy
    plans Heikki suggested like doing two index scans, doing the join
    between the index tuples, and only then looking up the visibility
    information and remaining columns for the tuple on the matching rows?

    --
    greg
  • Tom Lane at Oct 9, 2011 at 9:54 pm

    Greg Stark writes:
    On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane wrote:
    At the moment I'm leaning to approach #3, but I wonder if anyone has
    a different opinion or another idea altogether.
    Would any of these make it more realistic to talk about the crazy
    plans Heikki suggested like doing two index scans, doing the join
    between the index tuples, and only then looking up the visibility
    information and remaining columns for the tuple on the matching rows?
    I don't think it's particularly relevant --- we would not want to use
    weird representations of the Vars outside the index scan nodes. Above
    the scan they'd be just like any other upper-level Vars.

    (FWIW, that idea isn't crazy; I remember having discussions of it back
    in 2003 or so.)

    regards, tom lane
  • Greg Stark at Oct 9, 2011 at 10:23 pm

    On Sun, Oct 9, 2011 at 10:54 PM, Tom Lane wrote:
    I don't think it's particularly relevant --- we would not want to use
    weird representations of the Vars outside the index scan nodes.  Above
    the scan they'd be just like any other upper-level Vars.
    I can't say I fully understand the planner data structures and the
    implications of the options. I guess what I was imagining was that
    being able to reference the indexes as regular rangetable entries
    would make it more convenient for the rest of the planner to keep
    working as if nothing had changed when working with values extracted
    from index tuples.


    --
    greg
  • Tom Lane at Oct 10, 2011 at 1:47 am

    I wrote:
    There are at least four things we could do: ...
    2. Set varno = OUTER (or maybe INNER). This is safe because there's no
    other use for OUTER/INNER in a table scan node. We would have to hack
    things so that the index tuple gets put into econtext->ecxt_outertuple
    (resp. ecxt_innertuple) at runtime, but that seems like no big problem.
    In both setrefs.c and ruleutils.c, it would be desirable to have a
    TargetEntry list somewhere representing the index columns, which setrefs
    would want so it could set up the special Var nodes with fix_upper_expr,
    and ruleutils would want so it could interpret the Vars using existing
    machinery. I'm not sure whether to hang that list on the index-only
    plan node or expect EXPLAIN to regenerate it at need.
    3. Invent another special varno value similar to OUTER/INNER but
    representing an index reference. This is just about like #2 except that
    we could still put the index tuple into econtext->ecxt_scantuple, and
    ExecEvalVar would do the right thing as it stands.
    I have mostly-working code for approach #3, but I haven't tried to make
    EXPLAIN work yet. While looking at that I realized that there's a
    pretty good argument for adding the above-mentioned explicit TargetEntry
    list representing the index columns to index-only plan nodes. Namely,
    that if we don't do it, EXPLAIN will have to go to the catalogs to find
    out what's in that index, and this will fall down for "hypothetical"
    indexes injected into the planner by index advisors. We could imagine
    adding some more hooks to let the advisor inject bogus catalog data at
    EXPLAIN time, but on the whole it seems easier and less fragile to just
    have the planner include a data structure it has to build anyway into
    the finished plan.

    The need for this additional node list field also sways me in a
    direction that I'd previously been on the fence about, namely that
    I think index-only scans need to be their own independent plan node type
    instead of sharing a node type with regular indexscans. It's just too
    weird that a simple boolean indexonly property would mean completely
    different contents/interpretation of the tlist and quals.

    I've run into one other thing that's going to need to be hacked up
    a bit: index-only scans on "name" columns fall over with this modified
    code, because there's now tighter checking of the implied tuple
    descriptors:

    regression=# select relname from pg_class where relname = 'tenk1';
    ERROR: attribute 1 has wrong type
    DETAIL: Table has type cstring, but query expects name.

    The reason for this is the hack we put in some time back to conserve
    space in system catalog indexes by having "name" columns be indexed as
    though they were "cstring", cf commit
    5f6f840e93a3649e0d07e85bad188d163e96ec0e. We will probably need some
    compensatory hack in index-only scans, unless we can think of a less
    klugy way of representing that optimization. (Basically, the index-only
    code is assuming that btrees don't have storage type distinct from input
    type, and that's not the case for the name opclass. I had kind of
    expected the original patch to have some issues with that too, and I'm
    still not fully convinced that there aren't corner cases where it'd be
    an issue even with the currently committed code.)

    regards, tom lane
  • Greg Stark at Oct 10, 2011 at 2:20 am

    On Mon, Oct 10, 2011 at 2:47 AM, Tom Lane wrote:
    The need for this additional node list field also sways me in a
    direction that I'd previously been on the fence about, namely that
    I think index-only scans need to be their own independent plan node type
    instead of sharing a node type with regular indexscans
    At a superficial PR level it'll go over quite well to have a special
    plan node be visible in the explain output. People will love to see
    Fast Index Scan or Covering Index Scan or whatever you call it in
    their plans.

    --
    greg
  • Tom Lane at Oct 11, 2011 at 4:20 am

    I wrote:
    I have mostly-working code for approach #3, but I haven't tried to make
    EXPLAIN work yet. While looking at that I realized that there's a
    pretty good argument for adding the above-mentioned explicit TargetEntry
    list representing the index columns to index-only plan nodes. Namely,
    that if we don't do it, EXPLAIN will have to go to the catalogs to find
    out what's in that index, and this will fall down for "hypothetical"
    indexes injected into the planner by index advisors. We could imagine
    adding some more hooks to let the advisor inject bogus catalog data at
    EXPLAIN time, but on the whole it seems easier and less fragile to just
    have the planner include a data structure it has to build anyway into
    the finished plan.
    The need for this additional node list field also sways me in a
    direction that I'd previously been on the fence about, namely that
    I think index-only scans need to be their own independent plan node type
    instead of sharing a node type with regular indexscans. It's just too
    weird that a simple boolean indexonly property would mean completely
    different contents/interpretation of the tlist and quals.
    Attached is a draft patch for this. It needs some more review before
    committing, but it does pass regression tests now.

    One point worth commenting on is that I chose to rename the OUTER and
    INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with
    adding a new special varno INDEX_VAR. It's bothered me for some time
    that those macro names were way too generic/susceptible to collision;
    and since I had to look at all the uses anyway to see if the INDEX case
    needed to be added, this seemed like a good time to rename them.

    regards, tom lane
  • Robert Haas at Oct 11, 2011 at 10:36 am

    On Tue, Oct 11, 2011 at 12:19 AM, Tom Lane wrote:
    I wrote:
    I have mostly-working code for approach #3, but I haven't tried to make
    EXPLAIN work yet.  While looking at that I realized that there's a
    pretty good argument for adding the above-mentioned explicit TargetEntry
    list representing the index columns to index-only plan nodes.  Namely,
    that if we don't do it, EXPLAIN will have to go to the catalogs to find
    out what's in that index, and this will fall down for "hypothetical"
    indexes injected into the planner by index advisors.  We could imagine
    adding some more hooks to let the advisor inject bogus catalog data at
    EXPLAIN time, but on the whole it seems easier and less fragile to just
    have the planner include a data structure it has to build anyway into
    the finished plan.
    The need for this additional node list field also sways me in a
    direction that I'd previously been on the fence about, namely that
    I think index-only scans need to be their own independent plan node type
    instead of sharing a node type with regular indexscans.  It's just too
    weird that a simple boolean indexonly property would mean completely
    different contents/interpretation of the tlist and quals.
    Attached is a draft patch for this.  It needs some more review before
    committing, but it does pass regression tests now.

    One point worth commenting on is that I chose to rename the OUTER and
    INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with
    adding a new special varno INDEX_VAR.  It's bothered me for some time
    that those macro names were way too generic/susceptible to collision;
    and since I had to look at all the uses anyway to see if the INDEX case
    needed to be added, this seemed like a good time to rename them.
    +1 for that renaming, for sure.

    Have you given any thought to what would be required to support
    index-only scans on non-btree indexes - particularly, GIST? ISTM we
    might have had a thought that some GIST opclasses but not others would
    be able to regurgitate tuples, or maybe even that it might vary from
    index tuple to index tuple. But that discussion was a long time ago,
    and my memory is fuzzy.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Alexander Korotkov at Oct 11, 2011 at 12:16 pm

    On Tue, Oct 11, 2011 at 2:36 PM, Robert Haas wrote:

    Have you given any thought to what would be required to support
    index-only scans on non-btree indexes - particularly, GIST? ISTM we
    might have had a thought that some GIST opclasses but not others would
    be able to regurgitate tuples, or maybe even that it might vary from
    index tuple to index tuple. But that discussion was a long time ago,
    and my memory is fuzzy.
    In some GiST opclasses the original tuple can't be restored from the
    index tuple: for example, a polygon can't be restored from its MBR. In
    other opclasses, such as box_ops and point_ops, the original tuple can
    easily be restored from the index tuple.
    I can't remember any implementation where this possibility varies from
    index tuple to index tuple. The GiST opclass for tsvector uses a
    different leaf index tuple representation depending on its length, but
    at most it stores an array of hashes of words, so it's lossy anyway.
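    The polygon-versus-MBR point can be made concrete with a minimal lossy
    "compress" in miniature: two different shapes collapse to the same
    bounding box, so no decompress function could recover the original, and
    such an opclass can't feed an index-only scan. Illustrative code only:

    ```c
    /* A lossy compress step in miniature: keep only a point set's bounding
     * box (its MBR). Distinct inputs can share one key, so the original
     * is unrecoverable. Illustrative only. */
    #include <assert.h>
    #include <stdbool.h>

    typedef struct Box { int x1, y1, x2, y2; } Box;

    static Box compress_points(int pts[][2], int n)
    {
        Box b = { pts[0][0], pts[0][1], pts[0][0], pts[0][1] };
        for (int i = 1; i < n; i++) {
            if (pts[i][0] < b.x1) b.x1 = pts[i][0];
            if (pts[i][1] < b.y1) b.y1 = pts[i][1];
            if (pts[i][0] > b.x2) b.x2 = pts[i][0];
            if (pts[i][1] > b.y2) b.y2 = pts[i][1];
        }
        return b;
    }

    static bool box_eq(Box a, Box b)
    {
        return a.x1 == b.x1 && a.y1 == b.y1 && a.x2 == b.x2 && a.y2 == b.y2;
    }

    int main(void)
    {
        int tri[3][2]  = { {0,0}, {4,0}, {0,4} };        /* a triangle */
        int quad[4][2] = { {0,0}, {4,0}, {4,4}, {0,4} }; /* a square   */
        /* Distinct shapes, identical MBR: the compression is lossy. */
        assert(box_eq(compress_points(tri, 3), compress_points(quad, 4)));
        return 0;
    }
    ```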
    I think the GiST interface could be extended with an optional function
    that restores the original tuple. But there is an additional difficulty
    when some opclasses of a composite index provide this function and
    others don't.

    ------
    With best regards,
    Alexander Korotkov.
  • Tom Lane at Oct 11, 2011 at 1:22 pm

    Robert Haas writes:
    Have you given any thought to what would be required to support
    index-only scans on non-btree indexes - particularly, GIST? ISTM we
    might have had a thought that some GIST opclasses but not others would
    be able to regurgitate tuples, or maybe even that it might vary from
    index tuple to index tuple. But that discussion was a long time ago,
    and my memory is fuzzy.
    It would have to be a per-opclass property, for sure, since the basic
    criterion is whether the opclass's "compress" function is lossy.
    I don't think we can tolerate a situation where some of the index tuples
    might be able to yield data and others not --- that would undo all the
    improvements I've been working on over the past few days.

    I haven't thought as far ahead as how we might get the information
    needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
    CLASS might be the only way.

    regards, tom lane
  • Alexander Korotkov at Oct 11, 2011 at 1:36 pm

    On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane wrote:
    I haven't thought as far ahead as how we might get the information
    needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
    CLASS might be the only way.
    Shouldn't it be implemented through an additional interface function?
    There are situations where restoring the original tuple requires some
    transformation. For example, in point_ops we store a box in the leaf
    index tuple, while the point can easily be restored from the box.

    ------
    With best regards,
    Alexander Korotkov.
  • Tom Lane at Oct 11, 2011 at 8:35 pm

    Alexander Korotkov writes:
    On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane wrote:
    I haven't thought as far ahead as how we might get the information
    needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
    CLASS might be the only way.
    Shouldn't it be implemented through an additional interface function?
    There are situations where restoring the original tuple requires some
    transformation. For example, in point_ops we store a box in the leaf
    index tuple, while the point can easily be restored from the box.
    Hm. I had been supposing that lossless compress functions would just be
    no-ops. If that's not necessarily the case then we might need something
    different from the opclass's decompress function to get back the
    original data. However, that doesn't really solve the problem I'm
    concerned about, because the existence and use of such a function would
    be entirely internal to GiST. There still needs to be a way for the
    planner to know which opclasses support data retrieval. And I do *not*
    want to see us hard-wire "the presence of opclass function 8 means a
    GiST opclass can return data" into the planner.

    Maybe, instead of a simple constant amcanreturn column, we need an AM
    API function that says whether the index can return data.

    regards, tom lane
  • Alexander Korotkov at Oct 11, 2011 at 8:45 pm

    On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane wrote:
    Hm. I had been supposing that lossless compress functions would just be
    no-ops. If that's not necessarily the case then we might need something
    different from the opclass's decompress function to get back the
    original data. However, that doesn't really solve the problem I'm
    concerned about, because the existence and use of such a function would
    be entirely internal to GiST. There still needs to be a way for the
    planner to know which opclasses support data retrieval. And I do *not*
    want to see us hard-wire "the presence of opclass function 8 means a
    GiST opclass can return data" into the planner.

    Maybe, instead of a simple constant amcanreturn column, we need an AM
    API function that says whether the index can return data.
    I like the idea of such an AM API function. Since a single multicolumn index
    can use multiple opclasses, the AM API function should also say *what* data
    the index can return.

    ------
    With best regards,
    Alexander Korotkov.
  • Tom Lane at Oct 11, 2011 at 9:02 pm

    Alexander Korotkov writes:
    On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane wrote:
    Maybe, instead of a simple constant amcanreturn column, we need an AM
    API function that says whether the index can return data.
    I like the idea of such an AM API function. Since a single multicolumn index
    can use multiple opclasses, the AM API function should also say *what* data
    the index can return.
    I was thinking more like "amcanreturn(index, column_number) returns bool",
    which says whether the index can return the data for that column. The AM
    would still have to return a full IndexTuple at runtime, but it'd be
    allowed to insert nulls or garbage for columns it hadn't promised to
    return.
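    Tom's proposed per-column callback can be sketched in miniature. This is a
    hypothetical model, not the real access-method interface: the function
    pointer type, gist_canreturn, and index_only_possible are all illustrative
    names, but they show how a planner-side check would consult the AM per
    column before choosing an index-only scan.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical per-AM callback, modeling the proposed
     * amcanreturn(index, column_number) -> bool. */
    typedef bool (*amcanreturn_fn)(int column_number);

    /* A GiST-like AM where, for illustration, only the first column's
     * opclass can reconstruct the original value. */
    static bool gist_canreturn(int column_number) {
        return column_number == 1;
    }

    /* Planner-side check: an index-only scan is only possible if every
     * column the query needs is one the AM has promised to return. */
    static bool index_only_possible(amcanreturn_fn canreturn,
                                    const int *needed_cols, int ncols) {
        for (int i = 0; i < ncols; i++)
            if (!canreturn(needed_cols[i]))
                return false;
        return true;
    }

    int main(void) {
        int only_first[] = { 1 };
        int both[] = { 1, 2 };
        assert(index_only_possible(gist_canreturn, only_first, 1));
        assert(!index_only_possible(gist_canreturn, both, 2));
        return 0;
    }
    ```

    The per-column signature addresses Alexander's multicolumn concern directly:
    the planner asks about each column it needs rather than relying on a single
    boolean for the whole index.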

    BTW, if we do this, I'm rather strongly tempted to get rid of the
    name-versus-cstring hack (see index_descriptor_hack() in HEAD) by
    defining btree name_ops as not capable of returning data. I don't
    trust that hack much at all.

    regards, tom lane
  • Dimitri Fontaine at Oct 11, 2011 at 8:54 pm

    Tom Lane writes:
    I haven't thought as far ahead as how we might get the information
    needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
    CLASS might be the only way.
    It looks to me like it's related to the RECHECK property. Maybe it's
    just too late, though.

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
