FAQ
I have some time now for working on the mystery of why there are no
block write statistics in the database. I hacked out the statistics
collection and reporting side already, rough patch attached if you want
to see the boring parts.

I guessed that there had to be a gotcha behind why this wasn't done
before now, and I've found one so far. All of the read statistics are
collected with a macro that expects to know a Relation number. Callers
to ReadBuffer pass one. On the write side, the two places that
increment the existing write counters (pg_stat_statements, vacuum) are
MarkBufferDirty and MarkBufferDirtyHint. Neither of those is given a
Relation though. Inspecting the Buffer passed can only find the buffer
tag's RelFileNode.

I've thought of two paths to get a block write count out of that so far:

-Provide a function to find the Relation from the RelFileNode. There is
a warning about the perils of assuming you can map that way from a
buftag value in buf_internals.h though:

"Note: the BufferTag data must be sufficient to determine where to write
the block, without reference to pg_class or pg_tablespace entries. It's
possible that the backend flushing the buffer doesn't even believe the
relation is visible yet (its xact may have started before the xact that
created the rel). The storage manager must be able to cope anyway."

-Modify every caller of MarkDirty* to include a relation when that
information is available. There are about 200 callers of those
functions around, so that won't be fun. I noted that many of them are
in the XLog replay functions, which won't have the relation--but those
aren't important to count anyway.

Neither of these options feels very good to me, so selecting between the
two feels like picking the lesser of two evils. I'm fine with chugging
through all of the callers to modify MarkDirty*, but I didn't want to do
that only to have the whole approach rejected as wrong. That's why I
stopped here to get some feedback.

The way that MarkDirty requires this specific underlying storage manager
behavior to work properly strikes me as as a bit of a layering violation
too. I'd like the read and write paths to have a similar API, but here
they don't even operate on the same type of inputs. Addressing that is
probably harder than just throwing a hack on the existing code though.

--
Greg Smith 2ndQuadrant US greg@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Search Discussions

  • Heikki Linnakangas at May 20, 2013 at 11:51 am

    On 19.05.2013 11:15, Greg Smith wrote:
    I've thought of two paths to get a block write count out of that so far:

    -Provide a function to find the Relation from the RelFileNode. There is
    a warning about the perils of assuming you can map that way from a
    buftag value in buf_internals.h though:

    "Note: the BufferTag data must be sufficient to determine where to write
    the block, without reference to pg_class or pg_tablespace entries. It's
    possible that the backend flushing the buffer doesn't even believe the
    relation is visible yet (its xact may have started before the xact that
    created the rel). The storage manager must be able to cope anyway."

    -Modify every caller of MarkDirty* to include a relation when that
    information is available. There are about 200 callers of those functions
    around, so that won't be fun. I noted that many of them are in the XLog
    replay functions, which won't have the relation--but those aren't
    important to count anyway.

    Neither of these options feels very good to me, so selecting between the
    two feels like picking the lesser of two evils. I'm fine with chugging
    through all of the callers to modify MarkDirty*, but I didn't want to do
    that only to have the whole approach rejected as wrong. That's why I
    stopped here to get some feedback.
    Adding a Relation argument to all the Mark* calls seems fine to me. I
    don't find it ugly at all. The biggest objection would be that if there
    are extensions out there that call those functions, they would need to
    be changed, but I think that's fine.
    The way that MarkDirty requires this specific underlying storage manager
    behavior to work properly strikes me as as a bit of a layering violation
    too. I'd like the read and write paths to have a similar API, but here
    they don't even operate on the same type of inputs. Addressing that is
    probably harder than just throwing a hack on the existing code though.
    To be honest, I don't understand what you mean by that. ?

    - Heikki
  • Greg Smith at May 23, 2013 at 11:10 pm

    On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
    The way that MarkDirty requires this specific underlying storage manager
    behavior to work properly strikes me as as a bit of a layering violation
    too. I'd like the read and write paths to have a similar API, but here
    they don't even operate on the same type of inputs. Addressing that is
    probably harder than just throwing a hack on the existing code though.
    To be honest, I don't understand what you mean by that. ?
    Let's say you were designing a storage layer API from scratch for what
    Postgres does. That might take a relation as its input, like ReadBuffer
    does. Hiding the details of how that turns into a physical file
    operation would be a useful goal of such a layer. I'd then consider it
    a problem if that exposed things like the actual mapping of relations
    into files to callers.

    What we actually have right now is this MarkDirty function that operates
    on BufferTag data, which points directly to the underlying file. I see
    that as cutting the storage API in half and calling a function in the
    middle of the implementation. It strikes me as kind of weird that the
    read side and write side are not more symmetrical.

    --
    Greg Smith 2ndQuadrant US greg@2ndquadrant.com Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
  • Heikki Linnakangas at May 24, 2013 at 1:33 am

    On 23.05.2013 19:10, Greg Smith wrote:
    On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
    The way that MarkDirty requires this specific underlying storage
    manager behavior to work properly strikes me as as a bit of a
    layering violation too. I'd like the read and write paths to have
    a similar API, but here they don't even operate on the same type
    of inputs. Addressing that is probably harder than just throwing
    a hack on the existing code though.
    To be honest, I don't understand what you mean by that. ?
    Let's say you were designing a storage layer API from scratch for
    what Postgres does. That might take a relation as its input, like
    ReadBuffer does. Hiding the details of how that turns into a physical
    file operation would be a useful goal of such a layer. I'd then
    consider it a problem if that exposed things like the actual mapping
    of relations into files to callers.
    Ok, got it.
    What we actually have right now is this MarkDirty function that
    operates on BufferTag data, which points directly to the underlying
    file. I see that as cutting the storage API in half and calling a
    function in the middle of the implementation.
    Well, no, the BufferTag struct is internal to the buffer manager
    implementation. It's not part of the API; it's an implementation detail
    of the buffer manager.
    It strikes me as kind of weird that the read side and write side are
    not more symmetrical.
    It might seem weird if you expect the API to be similar to POSIX read()
    and write(), where you can read() an arbitrary block at any location,
    and write() an arbitrary block at any location. A better comparison
    would be e.g open() and close(). open() returns a file descriptor, which
    you pass to other functions to operate on the file. When you're done,
    you call close(fd). The file descriptor is completely opaque to the user
    program, you do all the operations through the functions that take the
    fd as argument. Similarly, ReadBuffer() returns a Buffer, which is
    completely opaque to the caller, and you do all the operations through
    various functions and macros that operate on the Buffer. When you're
    done, you release the buffer with ReleaseBuffer().

    (sorry for the digression from the original topic, I don't have any
    problem with what adding an optional Relation argument to MarkBuffer if
    you need that :-) )

    - Heikki
  • Satoshi Nagayasu at Jul 1, 2013 at 7:10 am
    Hi,

    I'm looking into this patch as a reviewer.

    (2013/05/24 10:33), Heikki Linnakangas wrote:
    On 23.05.2013 19:10, Greg Smith wrote:
    On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
    It strikes me as kind of weird that the read side and write side are
    not more symmetrical.
    It might seem weird if you expect the API to be similar to POSIX read()
    and write(), where you can read() an arbitrary block at any location,
    and write() an arbitrary block at any location. A better comparison
    would be e.g open() and close(). open() returns a file descriptor, which
    you pass to other functions to operate on the file. When you're done,
    you call close(fd). The file descriptor is completely opaque to the user
    program, you do all the operations through the functions that take the
    fd as argument. Similarly, ReadBuffer() returns a Buffer, which is
    completely opaque to the caller, and you do all the operations through
    various functions and macros that operate on the Buffer. When you're
    done, you release the buffer with ReleaseBuffer().

    (sorry for the digression from the original topic, I don't have any
    problem with what adding an optional Relation argument to MarkBuffer if
    you need that :-) )
    Or should we add some pointer, which is accociated with the Relation,
    into BufferDesc? Maybe OID?

    I'm thinking of FlushBuffer() too. How can we count physical write
    for each relation in FlushBuffer()? Or any other appropriate place?

    BTW, Greg's first patch could not be applied to the latest HEAD.
    I guess it need to have some fix, I couldn't clafiry it though.

    Regards,
    --
    Satoshi Nagayasu <snaga@uptime.jp>
    Uptime Technologies, LLC. http://www.uptime.jp
  • Greg Smith at Jul 5, 2013 at 7:51 am

    On 7/1/13 3:10 AM, Satoshi Nagayasu wrote:
    Or should we add some pointer, which is accociated with the Relation,
    into BufferDesc? Maybe OID?
    That is the other option here, I looked at it but didn't like it. The
    problem is that at the point when a new page is created, it's not always
    clear yet what relation it's going to hold. That means that if the
    buffer page itself is where you look up the relation OID, every code
    path that manipulates relation IDs will need to worry about maintaining
    this information. It's possible to do it that way, but I don't know
    that it's any less work than making all the write paths keep track of
    it. It would need some extra locking around updating the relation tag
    in the buffer pages too, and I'd prefer not to

    The other thing that I like about the writing side is that I can
    guarantee the code is correct pretty easily. Yes, it's going to take
    days worth of modifying writing code. But the compile will force you to
    fix all of them, and once they're all updated I'd be surprised if
    something unexpected happened.

    If instead the data moves onto the buffer page header instead, we could
    be chasing bugs similar to the relcache ones forever. Also, as a tie
    breaker, buffer pages will get bigger and require more locking this way.
       Making writers tell us the relation doesn't need any of that.
    I'm thinking of FlushBuffer() too. How can we count physical write
    for each relation in FlushBuffer()? Or any other appropriate place?
    SMgrRelation is available there, so tagging the relation should be easy
    in this case.
    BTW, Greg's first patch could not be applied to the latest HEAD.
    I guess it need to have some fix, I couldn't clafiry it though.
    Since this is a WIP patch, what I was looking for was general design
    approach feedback, with my bit as a simple example. I didn't expect
    anyone to spend much time doing a formal review of that quick hack.
    You're welcome to try and play with that code to add things, I'm not
    very attached to it yet. I've basically gotten what I wanted to here
    from this submission; I'm marking it returned with feedback. If anyone
    else has comments, I'm still open to discussion here too.

    --
    Greg Smith 2ndQuadrant US greg@2ndquadrant.com Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedMay 19, '13 at 8:15a
activeJul 5, '13 at 7:51a
posts6
users3
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase