I've wanted for a long time to have EXPLAIN ANALYZE output per-node I/O usage.
This would be especially useful if we could distinguish hardware versus
logical I/O though. And I always thought that would be very hard.

My thought in the past was that would could do it on Solaris by having
Postgres use DTrace directly via its (undocumented but existing) programmatic
interface.

For other operating systems it was tempting to suggest just timing the read(2)
call to see if it took too long to be a logical operation. The problem there
is that gettimeofday would impose far too much overhead to make that practical
(or even be precise enough to work properly).

But it occurred to me just now that the hardware instruction counter available
on just about every platform would be good enough for a heuristic guess at
whether the read(2) was cached. The problem generally with using the hardware
instruction counter is that it's not necessarily in sync between processors
and might therefore run backwards or skip time forwards. This is a problem for
profiling but if all we care about is a boolean guess at whether the request
was satisfied quickly from cache then any such skipping forward or backward
would represent a context switch which we could just toss in the hardware i/o
bucket. It doesn't matter exactly how long the hardware i/o took, only that
there was one.

To that end I would love to see something like:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on h (cost=8.52..16.45 rows=2 width=512) (actual time=78.926..87.708 rows=2 loops=1 logical-I/O=2 physical-I/O=1)
Recheck Cond: (i = ANY ('{100,1000}'::integer[]))
-> Bitmap Index Scan on hi (cost=0.00..8.52 rows=2 width=0) (actual time=74.539..74.539 rows=2 loops=1 logical-I/O=2 physical-I/O=2))
Index Cond: (i = ANY ('{100,1000}'::integer[]))
Total runtime: 87.820 ms


--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

Search Discussions

  • Neil Conway at Dec 15, 2007 at 5:21 am

    On Fri, 2007-12-14 at 15:47 +0000, Gregory Stark wrote:
    I've wanted for a long time to have EXPLAIN ANALYZE output per-node I/O usage.
    This would be especially useful if we could distinguish hardware versus
    logical I/O though. And I always thought that would be very hard.

    My thought in the past was that would could do it on Solaris by having
    Postgres use DTrace directly via its (undocumented but existing) programmatic
    interface.

    For other operating systems it was tempting to suggest just timing the read(2)
    call to see if it took too long to be a logical operation. The problem there
    is that gettimeofday would impose far too much overhead to make that practical
    (or even be precise enough to work properly).

    But it occurred to me just now that the hardware instruction counter available
    on just about every platform would be good enough for a heuristic guess at
    whether the read(2) was cached.
    I'm skeptical that this would be reliable enough to be very useful,
    especially in the face of concurrent, unpredictable system activity on a
    busy system. I agree that it would be useful information, though.
    Perhaps a useful first step would be to teach EXPLAIN ANALYZE to report
    the number of logical and physical I/Os from Postgres' perspective (i.e.
    physical I/O just means we need to go to the kernel).
    The problem generally with using the hardware
    instruction counter is that it's not necessarily in sync between processors
    and might therefore run backwards or skip time forwards. This is a problem for
    profiling but if all we care about is a boolean guess at whether the request
    was satisfied quickly from cache then any such skipping forward or backward
    would represent a context switch which we could just toss in the hardware i/o
    bucket. It doesn't matter exactly how long the hardware i/o took, only that
    there was one.

    To that end I would love to see something like:

    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
    Bitmap Heap Scan on h (cost=8.52..16.45 rows=2 width=512) (actual time=78.926..87.708 rows=2 loops=1 logical-I/O=2 physical-I/O=1)
    Recheck Cond: (i = ANY ('{100,1000}'::integer[]))
    -> Bitmap Index Scan on hi (cost=0.00..8.52 rows=2 width=0) (actual time=74.539..74.539 rows=2 loops=1 logical-I/O=2 physical-I/O=2))
    Index Cond: (i = ANY ('{100,1000}'::integer[]))
    Total runtime: 87.820 ms
  • Gokulakannan Somasundaram at Dec 15, 2007 at 7:49 am
    Hi,
    I already made a discussion about it. We can view the Logical I/Os. If
    we enable the log_statement_stats in the conf file and apply the following
    patch, it is possible. But putting it in Explain analyze makes more sense to
    me.



    *** postgresql-8.3beta1/src/backend/storage/buffer/bufmgr.c Tue Sep 25
    18:11:48 2007
    --- postgresql-8.3patch/src/backend/storage/buffer/bufmgr.c Fri Oct 19
    23:18:36 2007
    ***************
    *** 1470,1477 ****
    localhitrate = (float) LocalBufferHitCount *100.0 /
    ReadLocalBufferCount;

    appendStringInfo(&str,
    ! "!\tShared blocks: %10ld read, %10ld written, buffer hit rate =
    %.2f%%\n",
    ! ReadBufferCount - BufferHitCount,
    BufferFlushCount, hitrate);
    appendStringInfo(&str,
    "!\tLocal blocks: %10ld read, %10ld written, buffer hit rate =
    %.2f%%\n",
    ReadLocalBufferCount -
    LocalBufferHitCount,
    LocalBufferFlushCount, localhitrate);
    --- 1470,1477 ----
    localhitrate = (float) LocalBufferHitCount *100.0 /
    ReadLocalBufferCount;

    appendStringInfo(&str,
    ! "!\tShared blocks: %10ld Logical Reads, %10ld Physical Reads, %10ld
    written, buffer hit rate = %.2f%%\n",
    ! ReadBufferCount, ReadBufferCount -
    BufferHitCount,
    BufferFlushCount, hitrate);
    appendStringInfo(&str,
    "!\tLocal blocks: %10ld read, %10ld written, buffer hit rate =
    %.2f%%\n",
    ReadLocalBufferCount -
    LocalBufferHitCount,
    LocalBufferFlushCount, localhitrate);





    --
    Thanks,
    Gokul.
    CertoSQL Project,
    Allied Solution Group.
    (www.alliedgroups.com)
  • Gregory Stark at Dec 15, 2007 at 9:28 am

    "Gokulakannan Somasundaram" <gokul007@gmail.com> writes:

    Hi,
    I already made a discussion about it. We can view the Logical I/Os. If
    we enable the log_statement_stats in the conf file and apply the following
    patch, it is possible. But putting it in Explain analyze makes more sense to
    me.
    I was going to say that I'm really only interested in physical I/O. Logical
    I/O which is satisfied by the kernel cache is only marginally interesting and
    buffer fetches from Postgres's shared buffer is entirely uninteresting from
    the point of view of trying to figure out what is slowing down a query.

    However I suppose that's not true. There are other reasons why buffer fetches
    could be interesting. In particular I imagine when users post explain analyzes
    it would give us a good idea of whether their tables or bloated or their
    tuples are extremely wide (in cases where the planner gets it wrong).

    But I do think that showing logical I/Os without even an heuristic based
    measurement of actual physical i/o is pretty useless. It will make people
    think they want to grow their shared buffers to cover all of memory.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
    Ask me about EnterpriseDB's PostGIS support!
  • Gokulakannan Somasundaram at Dec 15, 2007 at 1:20 pm
    I was going to say that I'm really only interested in physical I/O. Logical
    I/O which is satisfied by the kernel cache is only marginally interesting
    and
    buffer fetches from Postgres's shared buffer is entirely uninteresting
    from
    the point of view of trying to figure out what is slowing down a query.

    Ok the Physical I/Os are already visible, if you enable log_statement_stats.
    Again i accept that it would be more helpful, if it gets displayed with
    Explain Analyze.

    However I suppose that's not true. There are other reasons why buffer
    fetches
    could be interesting. In particular I imagine when users post explain
    analyzes
    it would give us a good idea of whether their tables or bloated or their
    tuples are extremely wide (in cases where the planner gets it wrong).

    I have used it a lot for query tuning. If we re-write a query in such a way,
    the logical reads will come down, then it implies lesser physical reads in
    production. I think you would accept that there are some ways in which the
    query can be re-written only by humans and not by the optimizer. When we do
    that, instead of looking at the explain analyze time, it makes more sense
    for me to look at the logical reads


    But I do think that showing logical I/Os without even an heuristic based
    measurement of actual physical i/o is pretty useless. It will make people
    think they want to grow their shared buffers to cover all of memory.

    I just want to clarify that we should display both Logical reads and
    physical reads together. But increasing the shared buffer by looking at the
    performance of a query doesn't seem to be a good idea. But people should be
    aware that Logical reads is not for shared buffer management.


    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
    Ask me about EnterpriseDB's PostGIS support!


    --
    Thanks,
    Gokul.
    CertoSQL Project,
    Allied Solution Group.
    (www.alliedgroups.com)
  • Heikki Linnakangas at Dec 15, 2007 at 7:33 pm

    Gokulakannan Somasundaram wrote:
    I was going to say that I'm really only interested in physical I/O. Logical
    I/O which is satisfied by the kernel cache is only marginally interesting
    and
    buffer fetches from Postgres's shared buffer is entirely uninteresting
    from
    the point of view of trying to figure out what is slowing down a query.
    Ok the Physical I/Os are already visible, if you enable log_statement_stats.
    I think you missed the point. What log_statement_stats shows are not
    physical I/Os, they're read() system calls. Unfortunately there's no
    direct way to tell if a read() is satisfied from OS cache or not. Greg's
    suggestion was about how to do that.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Gokulakannan Somasundaram at Dec 17, 2007 at 7:28 am

    On Dec 16, 2007 1:03 AM, Heikki Linnakangas wrote:

    Gokulakannan Somasundaram wrote:
    I was going to say that I'm really only interested in physical I/O.
    Logical
    I/O which is satisfied by the kernel cache is only marginally
    interesting
    and
    buffer fetches from Postgres's shared buffer is entirely uninteresting
    from
    the point of view of trying to figure out what is slowing down a query.
    Ok the Physical I/Os are already visible, if you enable
    log_statement_stats.

    I think you missed the point. What log_statement_stats shows are not
    physical I/Os, they're read() system calls. Unfortunately there's no
    direct way to tell if a read() is satisfied from OS cache or not. Greg's
    suggestion was about how to do that.
    Oh OK. Thanks for clarifying..

    --
    Thanks,
    Gokul.
    CertoSQL Project,
    Allied Solution Group.
    (www.alliedgroups.com)
  • Decibel! at Dec 18, 2007 at 2:17 am

    On Dec 14, 2007, at 11:10 PM, Neil Conway wrote:
    But it occurred to me just now that the hardware instruction
    counter available
    on just about every platform would be good enough for a heuristic
    guess at
    whether the read(2) was cached.
    I'm skeptical that this would be reliable enough to be very useful,
    especially in the face of concurrent, unpredictable system activity
    on a
    busy system. I agree that it would be useful information, though.
    Perhaps a useful first step would be to teach EXPLAIN ANALYZE to
    report
    the number of logical and physical I/Os from Postgres' perspective
    (i.e.
    physical I/O just means we need to go to the kernel).

    *watches hands wave*

    If we assume that what we *really* want to see the difference of is a
    cache IO vs one that truly hits a platter, I don't see how you could
    come up with enough variation to account for that. Unless you're
    super lucky and the sector you need is just about to hit the head
    when you issue the request, you're going to have a delay measured in
    milliseconds. 1ms on a modern CPU is over 100,000 cycles. That's a
    lot of instructions...

    Someone want to throw together some code that actually measures this?
    Maybe something that keeps a histogram of how many instructions take
    place per I/O request? If it turns out that counters do vary too much
    between CPUs, there might be ways that we can account for that.

    Also, has anyone looked into adding a class of system calls that
    would actually tell us if the kernel issued physical IO? I find it
    hard to believe that other RDBMSes wouldn't like to have that info...
    --
    Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
    Give your computer some brain candy! www.distributed.net Team #1828
  • Greg Smith at Dec 18, 2007 at 2:56 am

    On Mon, 17 Dec 2007, Decibel! wrote:

    Someone want to throw together some code that actually measures this? Maybe
    something that keeps a histogram of how many instructions take place per I/O
    request? If it turns out that counters do vary too much between CPUs, there
    might be ways that we can account for that.
    I'd expect the easiest way to do a proof of concept here would be to use
    the Linux oprofile tool. That's already abstracted away the differences
    in hardware counters and provides a relatively simple interface to collect
    the data without getting bogged down with implementation there. Starting
    from scratch, going right to the hardware counters and building from
    there, is a big project--they've been hacking on oprofile for almost six
    years now and still aren't suggesting it's release quality yet.

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
  • Trevor Talbot at Dec 18, 2007 at 6:29 am

    On 12/17/07, Decibel! wrote:

    Also, has anyone looked into adding a class of system calls that
    would actually tell us if the kernel issued physical IO? I find it
    hard to believe that other RDBMSes wouldn't like to have that info...
    Non-blocking style interfaces can help here. On Windows, for instance,
    a read returns data at the call site if it was satisfied by cache,
    instead of invoking the asynchronous notification.
  • Gregory Stark at Dec 18, 2007 at 9:32 am

    "Decibel!" <decibel@decibel.org> writes:

    Also, has anyone looked into adding a class of system calls that would
    actually tell us if the kernel issued physical IO? I find it hard to believe
    that other RDBMSes wouldn't like to have that info...
    Yeah, I think that's called "DTrace"

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
    Ask me about EnterpriseDB's 24x7 Postgres support!
  • Decibel! at Dec 19, 2007 at 1:21 am

    On Dec 18, 2007, at 3:32 AM, Gregory Stark wrote:
    Also, has anyone looked into adding a class of system calls that
    would
    actually tell us if the kernel issued physical IO? I find it hard
    to believe
    that other RDBMSes wouldn't like to have that info...
    Yeah, I think that's called "DTrace"

    That seems like way overkill...

    When a read() call returns, surely the kernel knows whether it
    actually issued a physical read request to satisfy that. I don't see
    any reason why you couldn't have a version of read() that returns
    that information. I also rather doubt that we're the only userland
    software that would make use of that.
    --
    Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
    Give your computer some brain candy! www.distributed.net Team #1828
  • Gregory Stark at Dec 19, 2007 at 2:08 am

    "Decibel!" <decibel@decibel.org> writes:

    When a read() call returns, surely the kernel knows whether it actually issued
    a physical read request to satisfy that. I don't see any reason why you
    couldn't have a version of read() that returns that information. I also rather
    doubt that we're the only userland software that would make use of that.
    I'm told this exists on Windows for the async interface. But AFAIK it doesn't
    on Unix. The visibility into things like this is what makes DTrace so
    remarkable.

    I think there aren't many userland software interested in this. The only two
    cases I can think of are databases -- which use direct I/O partly because of
    this issue -- and real-time software like multimedia software -- which use
    aren't so much interested in measuring it as forcing things to be preloaded
    with stuff like posix_fadvise() or mlock().

    I don't think DTrace is overkill either. The programmatic interface is
    undocumented (but I've gotten Sun people to admit it exists -- I just have to
    reverse engineer it from the existing code samples) but should be more or less
    exactly what we need.

    But the lowest-common-denominator of just timing read() and seeing if it took
    long enough to involve either a context switch or sleeping on physical i/o
    should be a pretty close approximation. The case where it would be least
    accurate is when most or all of the data is actually in the cache. Then even
    with a low false-positive rate detecting cache misses it'll still dominate the
    true near-zero rate of cache misses.

    We could mitigate that somewhat by describing it in the plan as something like

    ................... (... I/O fast=nnn slow=nnn)

    instead of the more descriptive "physical" and "logical"

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
    Ask me about EnterpriseDB's 24x7 Postgres support!
  • Robert Lor at Jan 3, 2008 at 4:11 pm
    Greg,

    Gregory Stark wrote:
    I don't think DTrace is overkill either. The programmatic interface is
    undocumented (but I've gotten Sun people to admit it exists -- I just have to
    reverse engineer it from the existing code samples) but should be more or less
    exactly what we need.
    You probably know this already. There are existing commands that use
    the programmatic interface and would provide a good starting point. Here
    are a couple:

    http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/lockstat/lockstat.c
    http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/dtrace/dtrace.c

    One of my colleagues is in the process of putting a tutorial together
    for how to do this, so if you decided to pursue this approach and need
    assistance, please let me know.

    Regards,
    Robert

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedDec 14, '07 at 3:47p
activeJan 3, '08 at 4:11p
posts14
users8
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase