Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
came out 5 times faster. The benchmark isn't very thoroughly
described, but it turns out not to matter.

http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html

Apparently, the reason we're faster is that wal_sync_method =
open_datasync, which is the default on MacOS X, doesn't actually work.

[rhaas pgbench]$ pgbench -t 100000 -j 4 -c 4 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 1292.258304 (including connections establishing)
tps = 1292.281493 (excluding connections establishing)

Clearly we're not getting 1292 (or even 1292/4) fsyncs per second out
of whatever HD is in my laptop. So what happens if we change to
fsync_writethrough, which is the equivalent of what InnoDB apparently
does out of the box?

[rhaas pgsql]$ pg_ctl reload
server signaled
LOG: received SIGHUP, reloading configuration files
LOG: parameter "wal_sync_method" changed to "fsync_writethrough"
[rhaas pgbench]$ pgbench -t 100000 -j 4 -c 4 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 27.845797 (including connections establishing)
tps = 27.845809 (excluding connections establishing)
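
As a rough sanity check on numbers like these, a trivial loop that writes a
block and fsyncs it gives an upper bound on how many real flushes per second
a drive can absorb. This is only an illustrative sketch (the file name, block
size, and loop count are arbitrary), not the test_fsync tool discussed later
in the thread:

/* fsync_rate.c - rough fsyncs-per-second measurement (sketch only) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
    const int   loops = 1000;
    char        buf[8192];
    struct timeval start, stop;
    double      elapsed;
    int         fd = open("fsync_test_file", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 0, sizeof(buf));

    gettimeofday(&start, NULL);
    for (int i = 0; i < loops; i++)
    {
        /* rewrite the same block and force it out, like a WAL commit */
        if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf) || fsync(fd) != 0)
        {
            perror("write/fsync");
            return 1;
        }
    }
    gettimeofday(&stop, NULL);

    elapsed = (stop.tv_sec - start.tv_sec) +
              (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("%.0f fsyncs/sec\n", loops / elapsed);

    close(fd);
    unlink("fsync_test_file");
    return 0;
}

A laptop drive that actually honors the flush typically manages on the order
of a hundred or so flushes per second (roughly one per platter rotation),
which is why ~1292 tps from four clients is implausible unless something is
short-circuiting the flush.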

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

  • Dave Page at Sep 30, 2010 at 8:09 pm

    On Thu, Sep 30, 2010 at 8:26 PM, Robert Haas wrote:
    Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
    came out 5 times faster.  The benchmark isn't very thoroughly
    described, but it turns out not to matter.

    http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html

    Apparently, the reason we're faster is that wal_sync_method =
    open_datasync, which is the default on MacOS X, doesn't actually work.
    That might be true, but if you check the comments, Jayant replied to say:

    @Andrew : I am running linux - ubuntu 10.04 - kernel 2.6.32-24. Linux
    does not support fsync_writethrough
    http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

    So I don't think that invalidates his benchmark. Something else might
    of course...

    --
    Dave Page
    Blog: http://pgsnake.blogspot.com
    Twitter: @pgsnake

    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Robert Haas at Sep 30, 2010 at 8:27 pm

    On Thu, Sep 30, 2010 at 4:09 PM, Dave Page wrote:
    On Thu, Sep 30, 2010 at 8:26 PM, Robert Haas wrote:
    Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
    came out 5 times faster.  The benchmark isn't very thoroughly
    described, but it turns out not to matter.

    http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html

    Apparently, the reason we're faster is that wal_sync_method =
    open_datasync, which is the default on MacOS X, doesn't actually work.
    That might be true, but if you check the comments, Jayant replied to say:

    @Andrew : I am running linux - ubuntu 10.04 - kernel 2.6.32-24. Linux
    does not support fsync_writethrough
    http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

    So I don't think that invalidates his benchmark. Something else might
    of course...
    Oh, I missed that. Actually, I wasn't really so concerned with
    whether his benchmark is correct. I *am* concerned about being broken
    out of the box on MacOS X.

    (I also suspect problems with the benchmark. It's hard to believe
    we're 5x faster than InnoDB on an apples-to-apples comparison on
    trivial queries. I'd believe 20% either way, but 5x is a lot. But
    that's a question for another day.)

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Tom Lane at Sep 30, 2010 at 9:02 pm

    Robert Haas writes:
    Oh, I missed that. Actually, I wasn't really so concerned with
    whether his benchmark is correct. I *am* concerned about being broken
    out of the box on MacOS X.
    Actually, the problem with OSX is that OSX is broken out of the box,
    at least by that standard. The system's normal configuration is that
    fsync() does nothing, so it's hardly surprising that O_DSYNC is no
    better. You have to use wal_sync_method = fsync_writethrough to get
    actual bits-to-the-platter behavior.

    I'm not sure whether we should select fsync_writethrough as the default
    on OSX. We don't make an equivalent attempt to prevent OS or storage
    malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
    ahead of the game in that you *can* force writethrough without resorting
    to arcane hacks with hdparm or some such.

    We could definitely stand to be a bit more verbose about documenting
    the platform-specific issues in this area.

    regards, tom lane
  • A.M. at Sep 30, 2010 at 9:22 pm

    On Sep 30, 2010, at 5:02 PM, Tom Lane wrote:

    Robert Haas <robertmhaas@gmail.com> writes:
    Oh, I missed that. Actually, I wasn't really so concerned with
    whether his benchmark is correct. I *am* concerned about being broken
    out of the box on MacOS X.
    Actually, the problem with OSX is that OSX is broken out of the box,
    at least by that standard. The system's normal configuration is that
    fsync() does nothing,
    That is not correct. fsync and friends on Darwin synchronize I/O and flush dirty kernel caches to the disk, which meets the specification and is distinctly different from doing nothing.

    "The fsync() function can be used by an application to indicate that all data for the open file description named by fildes is to be transferred to the storage device associated with the file described by fildes in an implementation-dependent manner."
    http://opengroup.org/onlinepubs/007908799/xsh/fsync.html

    "On MacOS X, fsync() always has and always will flush all file data
    from host memory to the drive on which the file resides."
    http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html
    I'm not sure whether we should select fsync_writethrough as the default
    on OSX. We don't make an equivalent attempt to prevent OS or storage
    malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
    ahead of the game in that you *can* force writethrough without resorting
    to arcane hacks with hdparm or some such.

    We could definitely stand to be a bit more verbose about documenting
    the platform-specific issues in this area.
    Not only is this issue platform-specific, it is also bus-, controller- and disk-specific. Luckily, hardware that ships from Apple responds properly to F_FULLFSYNC. It's too bad there is no cross-platform way to ask what level of hardware-syncing is available.

    Cheers,
    M
  • Greg Stark at Sep 30, 2010 at 10:08 pm

    On Thu, Sep 30, 2010 at 2:22 PM, A.M. wrote:
    That is not correct. fsync and friends on Darwin synchronize I/O and flush dirty kernel caches to the disk, which meets the specification and is distinctly different from doing nothing.
    How exactly is it different from doing nothing? That is, in what
    situation does doing this change in any way the behaviour from the
    user's point of view versus doing nothing?

    People keep saying it "meets the specification" but it's a useless
    interpretation of the specification. And it requires a pretty
    malicious reading of "transferred to the storage device" to read it as
    "transferred it from one set of ram buffers to another more closely
    associated with the actual persistent storage".

    It's too bad there is no cross-platform way to ask what level of hardware-syncing is available.
    Why would the user want to ask this? As far as the user is concerned
    either there are only two "levels": synced or not synced. If it's not
    guaranteed to persist after a power failure it's not synced. It
    doesn't matter whether it's in kernel buffers, drive buffers, or
    anywhere else -- they're all the same from the user's point of view --
    they're non-persistent.

    The other useful syncing behaviour would be to get write barriers. But
    that's a whole different api, not just a behaviour that can be
    attached to fsync.



    --
    greg
  • Peter Eisentraut at Oct 1, 2010 at 7:50 am

    On tor, 2010-09-30 at 15:07 -0700, Greg Stark wrote:
    It's too bad there is no cross-platform way to ask what level of
    hardware-syncing is available.

    Why would the user want to ask this? As far as the user is concerned
    either there are only two "levels": synced or not synced. If it's not
    guaranteed to persist after a power failure it's not synced. It
    doesn't matter whether it's in kernel buffers, drive buffers, or
    anywhere else -- they're all the same from the user's point of view --
    they're non-persistent.
    Well, it's not really useful, but that's how it works "everywhere". On
    Linux, fsync carries the stuff from the kernel's RAM to the disk
    controller's RAM, and then it depends on some hdparm magic or something
    what happens next.
  • Robert Haas at Oct 5, 2010 at 3:41 am

    On Fri, Oct 1, 2010 at 3:50 AM, Peter Eisentraut wrote:
    On tor, 2010-09-30 at 15:07 -0700, Greg Stark wrote:
    It's too bad there is no cross-platform way to ask what level of
    hardware-syncing is available.

    Why would the user want to ask this? As far as the user is concerned
    either there are only two "levels": synced or not synced. If it's not
    guaranteed to persist after a power failure it's not synced. It
    doesn't matter whether it's in kernel buffers, drive buffers, or
    anywhere else -- they're all the same from the user's point of view --
    they're non-persistent.
    Well, it's not really useful, but that's how it works "everywhere".  On
    Linux, fsync carries the stuff from the kernel's RAM to the disk
    controller's RAM, and then it depends on some hdparm magic or something
    what happens next.
    That's a bit vaguer than I'd like. TFD says "The aim of WAL is to
    ensure that the log is written before database records are altered,
    but this can be subverted by disk drives that falsely report a
    successful write to the kernel, when in fact they have only cached the
    data and not yet stored it on the disk. A power failure in such a
    situation might lead to irrecoverable data corruption. Administrators
    should try to ensure that disks holding PostgreSQL's WAL log files do
    not make such false reports." This leaves open the question of how
    they should attempt to do this; we should say what we know about that.

    I also notice the following sentence in our documentation, which now
    appears to me to be flat-out wrong: "The wal_sync_method parameter
    determines how PostgreSQL will ask the kernel to force WAL updates
    out to disk. All the options should be the same in terms of
    reliability, but it's quite platform-specific which one will be the
    fastest." Obviously, we know now (if we didn't before) that this
    isn't the case, per my OP.
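
    For context on why the options are not interchangeable for reliability,
    here is a rough sketch (not the PostgreSQL source; the helper name and
    enum are made up for illustration) of the system calls each
    wal_sync_method value roughly corresponds to:

    #include <fcntl.h>
    #include <unistd.h>

    typedef enum { SYNC_OPEN_DATASYNC, SYNC_OPEN_SYNC, SYNC_FDATASYNC,
                   SYNC_FSYNC, SYNC_FSYNC_WRITETHROUGH } WalSyncMethod;

    /* Rough correspondence of wal_sync_method values to system calls. */
    static void
    flush_wal(int fd, WalSyncMethod method)
    {
        switch (method)
        {
            case SYNC_OPEN_DATASYNC:    /* WAL file opened with O_DSYNC */
            case SYNC_OPEN_SYNC:        /* WAL file opened with O_SYNC  */
                /* nothing extra: each write() is supposed to sync itself */
                break;
            case SYNC_FDATASYNC:
                fdatasync(fd);
                break;
            case SYNC_FSYNC:
                fsync(fd);
                break;
            case SYNC_FSYNC_WRITETHROUGH:
    #ifdef F_FULLFSYNC
                fcntl(fd, F_FULLFSYNC, 0); /* Darwin: flush the drive cache */
    #endif
                break;
        }
    }

    Which of these actually forces bits to the platter depends on the
    platform and the drive's cache settings, which is exactly the point.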

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Peter Eisentraut at Oct 5, 2010 at 12:11 pm

    On mån, 2010-10-04 at 23:41 -0400, Robert Haas wrote:
    Well, it's not really useful, but that's how it works "everywhere". On
    Linux, fsync carries the stuff from the kernel's RAM to the disk
    controller's RAM, and then it depends on some hdparm magic or something
    what happens next.
    That's a bit vaguer than I'd like. TFD says "The aim of WAL is to
    ensure that the log is written before database records are altered,
    but this can be subverted by disk drives that falsely report a
    successful write to the kernel, when in fact they have only cached the
    data and not yet stored it on the disk. A power failure in such a
    situation might lead to irrecoverable data corruption. Administrators
    should try to ensure that disks holding PostgreSQL's WAL log files do
    not make such false reports." This leaves open the question of how
    they should attempt to do this; we should say what we know about that.
    That is explained in section 29.1 "Reliability".
    I also notice the following sentence in our documentation, which now
    appears to me to be flat-out wrong: "The wal_sync_method parameter
    determines how PostgreSQL will ask the kernel to force WAL updates
    out to disk. All the options should be the same in terms of
    reliability, but it's quite platform-specific which one will be the
    fastest." Obviously, we know now (if we didn't before) that this
    isn't the case, per my OP.
    Right. It was true before fsync_writethrough was invented.
  • Robert Haas at Oct 7, 2010 at 3:11 pm

    On Tue, Oct 5, 2010 at 8:11 AM, Peter Eisentraut wrote:
    On mån, 2010-10-04 at 23:41 -0400, Robert Haas wrote:
    Well, it's not really useful, but that's how it works "everywhere".  On
    Linux, fsync carries the stuff from the kernel's RAM to the disk
    controller's RAM, and then it depends on some hdparm magic or something
    what happens next.
    That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
    ensure that the log is written before database records are altered,
    but this can be subverted by disk drives that falsely report a
    successful write to the kernel, when in fact they have only cached the
    data and not yet stored it on the disk. A power failure in such a
    situation might lead to irrecoverable data corruption. Administrators
    should try to ensure that disks holding PostgreSQL's WAL log files do
    not make such false reports."  This leaves open the question of how
    they should attempt to do this; we should say what we know about that.
    That is explained in section 29.1 "Reliability".
    I also notice the following sentence in our documentation, which now
    appears to me to be flat-out wrong: "The wal_sync_method parameter
    determines how PostgreSQL will ask the kernel to force WAL  updates
    out to disk. All the options should be the same in terms of
    reliability, but it's quite platform-specific which one will be the
    fastest."  Obviously, we know now (if we didn't before) that this
    isn't the case, per my OP.
    Right.  It was true before fsync_writethrough was invented.
    Proposed doc patch attached.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Greg Smith at Oct 7, 2010 at 3:45 pm

    Robert Haas wrote:
    Proposed doc patch attached.
    Looks accurate to me. I like the additional linking to the Reliability
    page you put in there too. Heavily referencing that important page from
    related areas is a good thing, particularly now that it's got a lot more
    details than it used to.

    --
    Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
    PostgreSQL Training, Services and Support www.2ndQuadrant.us
  • Robert Haas at Oct 7, 2010 at 4:26 pm

    On Thu, Oct 7, 2010 at 11:45 AM, Greg Smith wrote:
    Robert Haas wrote:
    Proposed doc patch attached.
    Looks accurate to me.  I like the additional linking to the Reliability page
    you put in there too.  Heavily referencing that important page from related
    areas is a good thing, particularly now that it's got a lot more details
    than it used to.
    Cool, thanks for the fast review. I suspect there are more details
    that could stand to be added to the WAL reliability page as well, but
    I don't know what they are so I can't add them.

    I still have the feeling that we have not put quite a large enough
    red, blinking light around this issue, but I don't have a concrete
    suggestion.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • A.M. at Oct 7, 2010 at 5:28 pm

    On Oct 7, 2010, at 12:26 PM, Robert Haas wrote:
    On Thu, Oct 7, 2010 at 11:45 AM, Greg Smith wrote:
    Robert Haas wrote:
    Proposed doc patch attached.
    Looks accurate to me. I like the additional linking to the Reliability page
    you put in there too. Heavily referencing that important page from related
    areas is a good thing, particularly now that it's got a lot more details
    than it used to.
    Cool, thanks for the fast review. I suspect there are more details
    that could stand to be added to the WAL reliability page as well, but
    I don't know what they are so I can't add them.

    I still have the feeling that we have not put quite a large enough
    red, blinking light around this issue, but I don't have a concrete
    suggestion.
    I think the general problem is that there is no simple way to verify that a PostgreSQL commit is pushing the bits to persistent storage. It would be helpful if there were a platform-specific, volume-specific tool to deduce this. Currently, there is no warning light that goes on when commits are not persistent.

    On Linux, a tool could check filesystem parameters, hdparm (if relevant), and hard drive and controller specs (possibly against a blacklist of known liars).

    Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.

    How else can a DBA today ensure that a commit is a commit?

    Cheers,
    M
  • Greg Smith at Oct 7, 2010 at 11:29 pm

    A.M. wrote:
    Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
    This is what the test_fsync utility that already ships with the database
    should be useful for. The way Bruce changed it to report numbers in
    commits/second for 9.0 makes it a lot easier to use for this purpose
    than it used to be. I think there's still some additional improvements
    that could be made there, but it's a tricky test to run accurately. The
    current code is probably too detailed in some ways (it delivers a lot of
    output not relevant to this use-case) and not detailed enough in
    others. Providing a summary that understands things like
    fsync_writethrough on platforms that support it was the first
    refactoring I had in my mind. If that thing came back and said
    "fsync_writethrough works for you, so don't even consider the other
    possibilities if you want reliability even though they are faster", that
    would be nice, for example.
    How else can a DBA today ensure that a commit is a commit?
    You can't ensure a commit is a commit without running a pull the plug
    test. And I think the best way to do that accurately is using a "remote
    witness" server focusing on finding this particular problem to look for
    glitches, rather than using the database as your test program and
    seeing if you happen to hit corruption or not. The documentation for
    9.0 now suggests running the diskchecker.pl program for this exact
    purpose. I've seen enough reports of it finding even subtle cache loss
    situations to believe that encouraging heavier use of that would be
    enough to make people much safer than they typically are today. What we
    probably need to do next is provide people with an exact walkthrough of
    setting up and using the program, showing what a passing result looks
    like, and what a failing one looks like.

    --
    Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
    PostgreSQL Training, Services and Support www.2ndQuadrant.us
  • Bruce Momjian at Oct 19, 2010 at 3:23 pm

    Greg Smith wrote:
    A.M. wrote:
    Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
    This is what the test_fsync utility that already ships with the database
    should be useful for. The way Bruce changed it to report numbers in
    commits/second for 9.0 makes it a lot easier to use for this purpose
    than it used to be. I think there's still some additional improvements
    that could be made there, but it's a tricky test to run accurately. The
    test_fsync was designed to test various things like whether several
    open-sync writes are better than two writes and an fsync, and whether you
    can fsync data written on a different file descriptor. It is really a
    catch-all test right now, not one specific for choosing sync methods.
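
    As an illustration of the kind of comparison test_fsync makes (a sketch
    only, not the actual test_fsync code; the file name and sizes are
    arbitrary and error checking is omitted for brevity), the two patterns
    mentioned above look roughly like this:

    #include <fcntl.h>
    #include <unistd.h>

    static char buf[8192];

    int
    main(void)
    {
        /* Pattern 1: two ordinary writes followed by a single fsync. */
        int fd = open("sync_test", O_RDWR | O_CREAT, 0600);

        write(fd, buf, sizeof(buf));
        write(fd, buf, sizeof(buf));
        fsync(fd);
        close(fd);

    #ifdef O_DSYNC
        /* Pattern 2: two writes through a descriptor opened with O_DSYNC,
         * so each write is supposed to be synchronous on its own. */
        fd = open("sync_test", O_RDWR | O_DSYNC);
        write(fd, buf, sizeof(buf));
        write(fd, buf, sizeof(buf));
        close(fd);
    #endif

        unlink("sync_test");
        return 0;
    }

    The real tool wraps each pattern in a timing loop; the question is which
    one the kernel and hardware can complete faster while still flushing.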

    --
    Bruce Momjian <bruce@momjian.us> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
  • A.M. at Oct 19, 2010 at 3:38 pm

    On Oct 19, 2010, at 11:22 AM, Bruce Momjian wrote:

    Greg Smith wrote:
    A.M. wrote:
    Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
    This is what the test_fsync utility that already ships with the database
    should be useful for. The way Bruce changed it to report numbers in
    commits/second for 9.0 makes it a lot easier to use for this purpose
    than it used to be. I think there's still some additional improvements
    that could be made there, but it's a tricky test to run accurately. The
    test_fsync was designed to test various things like whether several
    open-sync writes are better than two writes and an fsync, and whether you
    can fsync data written on a different file descriptor. It is really a
    catch-all test right now, not one specific for choosing sync methods.
    I am working on simplifying the test_fsync tool and making it a contrib function which can be run by the superuser based on the configured fsync method. That way, the list can ask a user to run it to report fsyncs-per-second for suspiciousness. The goal is to make it more accessible. I was also thinking about adding some notes along the lines of "Your drive fsync speed rates between a 5400 RPM SATA drive and a 7200 RPM SATA drive." or "Your drive fsync speed rates as high as RAM -- your fsync method may be wrong."

    Currently, the test tool is not even compiled by default.

    Thoughts?

    Cheers,
    M
  • Bruce Momjian at Oct 19, 2010 at 4:16 pm

    A.M. wrote:
    On Oct 19, 2010, at 11:22 AM, Bruce Momjian wrote:

    Greg Smith wrote:
    A.M. wrote:
    Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
    This is what the test_fsync utility that already ships with the database
    should be useful for. The way Bruce changed it to report numbers in
    commits/second for 9.0 makes it a lot easier to use for this purpose
    than it used to be. I think there's still some additional improvements
    that could be made there, but it's a tricky test to run accurately. The
    test_fsync was designed to test various things like whether several
    open-sync writes are better than two writes and an fsync, and whether you
    can fsync data written on a different file descriptor. It is really a
    catch-all test right now, not one specific for choosing sync methods.
    I am working on simplifying the test_fsync tool and making it a contrib function which can be run by the superuser based on the configured fsync method. That way, the list can ask a user to run it to report fsyncs-per-second for suspiciousness. The goal is to make it more accessible. I was also thinking about adding some notes along the lines of "Your drive fsync speed rates between a 5400 RPM SATA drive and a 7200 RPM SATA drive." or "Your drive fsync speed rates as high as RAM -- your fsync method may be wrong."

    Currently, the test tool is not even compiled by default.

    Thoughts?
    Agreed. Let me know if you have any questions.

    --
    Bruce Momjian <bruce@momjian.us> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
  • Tom Lane at Oct 7, 2010 at 3:52 pm

    Robert Haas writes:
    Proposed doc patch attached.
    "discusesed"? Otherwise +1

    regards, tom lane
  • Robert Haas at Oct 7, 2010 at 4:24 pm

    On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Proposed doc patch attached.
    "discusesed"?  Otherwise +1
    Woops, thanks. Committed with that change. I back-patched it back to
    8.3, which is as far as it applied with only minor conflicts.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Bruce Momjian at Oct 19, 2010 at 3:07 pm

    Robert Haas wrote:
    On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Proposed doc patch attached.
    "discusesed"? ?Otherwise +1
    Woops, thanks. Committed with that change. I back-patched it back to
    8.3, which is as far as it applied with only minor conflicts.
    I have applied the attached patch which mentions tools/fsync for testing
    fsync method performance, and clarified the new paragraph about sync
    methods.

    I am glad to see we are beefing up this area of the docs.

    --
    Bruce Momjian <bruce@momjian.us> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
  • Robert Haas at Oct 25, 2010 at 1:34 pm

    On Tue, Oct 19, 2010 at 11:06 AM, Bruce Momjian wrote:
    Robert Haas wrote:
    On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Proposed doc patch attached.
    "discusesed"? ?Otherwise +1
    Woops, thanks.  Committed with that change.  I back-patched it back to
    8.3, which is as far as it applied with only minor conflicts.
    I have applied the attached patch which mentions tools/fsync for testing
    fsync method performance, and clarified the new paragraph about sync
    methods.

    I am glad to see we are beefing up this area of the docs.
    It seems we're still missing some relevant details, because hdparm
    doesn't seem to work on SCSI devices. Is sdparm the right utility in
    that case? Does anyone know what the correct incantations look like?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Peter Eisentraut at Oct 25, 2010 at 4:51 pm

    On mån, 2010-10-25 at 09:33 -0400, Robert Haas wrote:
    It seems we're still missing some relevant details, because hdparm
    doesn't seem to work on SCSI devices. Is sdparm the right utility in
    that case? Does anyone know what the correct incantations look like?
    Search the sdparm man page for "Writeback Cache". It has detailed
    examples.
  • Robert Haas at Oct 27, 2010 at 1:07 am

    On Mon, Oct 25, 2010 at 12:51 PM, Peter Eisentraut wrote:
    On mån, 2010-10-25 at 09:33 -0400, Robert Haas wrote:
    It seems we're still missing some relevant details, because hdparm
    doesn't seem to work on SCSI devices.  Is sdparm the right utility in
    that case?  Does anyone know what the correct incantations look like?
    Search the sdparm man page for "Writeback Cache".  It has detailed
    examples.
    Here's a patch. This adds a few more details about sdparm and makes
    it clear that it applies to both FreeBSD and Linux. But, perhaps more
    significantly, it rearranges what is currently a fairly long paragraph
    into a bulleted list, which I think is more readable. Comments?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Greg Smith at Oct 1, 2010 at 1:03 am

    A.M. wrote:
    That is not correct. fsync and friends on Darwin synchronize I/O and flush dirty kernel caches to the disk, which meets the specification and is distinctly different from doing nothing...
    "On MacOS X, fsync() always has and always will flush all file data
    from host memory to the drive on which the file resides."
    http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html
    You didn't quote the next part of that, which says "fsync() is not
    sufficient to guarantee that your data is on stable
    storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC, to ask
    the drive to flush all buffered data to stable storage." That's exactly
    what turning on fsync_writethrough does in PostgreSQL. See
    http://archives.postgresql.org/pgsql-hackers/2005-04/msg00390.php as the
    first post on this topic that ultimately led to that behavior being
    implemented.

    From the perspective of the database, whether or not the behavior is
    standards compliant isn't the issue. Whether pages make it to physical
    disk or not when fsync is called, or when O_DSYNC writes are done on
    platforms that support them, is the important part. If the OS
    doesn't do that, it is doing nothing useful from the perspective of the
    database's expectations. And that's not true on Darwin unless you
    specify F_FULLFSYNC, which doesn't happen by default in PostgreSQL. It
    only does that when you switch wal_sync_method=fsync_writethrough.
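
    For reference, a minimal sketch of the Darwin-specific call involved
    (hypothetical helper and file name; not the actual PostgreSQL code path):
    where F_FULLFSYNC is available, flushing all the way to the platter means
    issuing that fcntl rather than relying on fsync() alone.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Flush a descriptor to stable storage, falling back to plain fsync()
     * where F_FULLFSYNC is not defined.  Sketch only. */
    static int
    flush_to_platter(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC, 0) == 0)
            return 0;
        /* Some filesystems reject F_FULLFSYNC; fall back below. */
    #endif
        return fsync(fd);
    }

    int
    main(void)
    {
        int fd = open("fullfsync_test", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || write(fd, "x", 1) != 1 || flush_to_platter(fd) != 0)
        {
            perror("fullfsync_test");
            return 1;
        }
        close(fd);
        unlink("fullfsync_test");
        return 0;
    }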

    --
    Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
    PostgreSQL Training, Services and Support www.2ndQuadrant.us
    Author, "PostgreSQL 9.0 High Performance" Pre-ordering at:
    https://www.packtpub.com/postgresql-9-0-high-performance/book
  • Darren Duncan at Oct 1, 2010 at 3:03 am

    Greg Smith wrote:
    You didn't quote the next part of that, which says "fsync() is not
    sufficient to guarantee that your data is on stable
    storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC, to ask
    the drive to flush all buffered data to stable storage." That's exactly
    what turning on fsync_writethrough does in PostgreSQL. See
    http://archives.postgresql.org/pgsql-hackers/2005-04/msg00390.php as the
    first post on this topic that ultimately led to that behavior being
    implemented.

    From the perspective of the database, whether or not the behavior is
    standards compliant isn't the issue. Whether pages make it to physical
    disk or not when fsync is called, or when O_DSYNC writes are done on
    platforms that support them, is the important part. If the OS
    doesn't do that, it is doing nothing useful from the perspective of the
    database's expectations. And that's not true on Darwin unless you
    specify F_FULLFSYNC, which doesn't happen by default in PostgreSQL. It
    only does that when you switch wal_sync_method=fsync_writethrough
    Greg Smith also wrote:
    The main downside to switching the default on either OS X or Windows is
    developers using those platforms for test deployments will suffer greatly from a
    performance drop for data they don't really care about. As those two in
    particular are much more likely to be client development platforms, too, that's
    a scary thing to consider.

    I think that, bottom line, Postgres should be defaulting to whatever the safest
    and most reliable behavior is, per each platform, because data integrity is the
    most important thing, ensuring that a returning commit has actually written data
    to disk. If performance is worse, then so what? Code that does nothing has the
    best performance of all, and is also generally useless.

    Whenever there is a tradeoff to be made, reliability for speed, then users
    should have to explicitly choose the less reliable option, which would
    demonstrate they know what they're doing. Let the testers explicitly choose a
    faster and less reliable option for the data they don't care about, and
    otherwise by default users who don't know better should get the safest option, for
    data they likely care about. That is a DBMS priority.

    This matter reminds me of a discussion on the SQLite list years ago about
    whether pragma synchronous=normal or synchronous=full should be the default, and
    thankfully 'full' won.

    -- Darren Duncan
  • Greg Smith at Oct 1, 2010 at 6:35 am

    Darren Duncan wrote:
    This matter reminds me of a discussion on the SQLite list years ago
    about whether pragma synchronous=normal or synchronous=full should be
    the default, and thankfully 'full' won.
    Right now, when I see deployments in the field, serious database servers
    set up by professional DBAs tend to use the right hardware and setup to
    do the correct thing with PostgreSQL. And trivial installs done for
    testing purposes cheat, but many of those users don't really care
    because they know they are not running a real server, and expect that
    their desktop is neither reliable nor fast at database work. The SQLite
    situation has a slightly different context than this, because the places
    it's put into don't so regularly have a DBA involved at all in
    situations where the data is important. It's often just system software
    sitting in the background nobody even is aware of.

    I also remember when SQLite did come out of the background, when it was
    crucified for being the cause of Firefox slowdowns actually linked to
    changed kernel fsync behavior. That's the sort of bad press this
    project really doesn't need right now, when it actually doesn't matter
    on so many production database servers. You may not be aware that
    there's already such a change floating around out there. PostgreSQL
    installs on Linux kernel 2.6.32 or later using ext4 are dramatically
    slower out of the box than they used to be, because the OS started doing
    the right thing by default; no change in the database code. I remain in
    mild terror that this news is going to break in a bad way and push this
    community into damage control. So far I've only seen that reported on
    Phoronix, and that included a testimony from a kernel developer that
    they introduced the regression so it wasn't so bad. The next such
    publicized report may not be so informed.

    Some of this works out to when to change things rather than what to
    change. PostgreSQL is at a somewhat critical spot right now. If people
    grab a new version, and performance sucks compared to earlier ones,
    they're not going to think "oh, maybe they changed an option and the new
    version is tuned for safety better". They're going to say "performance
    sucks on this database now" and give up on it. Many evals are done on
    hardware that isn't representative of a real database server, and if we
    make a change that only hurts those people--while not actually impacting
    production quality hardware--that needs to be done carefully. And
    that's exactly what I think would happen here if this was just changed
    all of a sudden.

    I don't think anyone is seriously opposed to changing the defaults for
    safety instead of performance. The problem is that said change would
    need to be *preceded* by a major update to the database documentation,
    and perhaps even some code changes to issue warnings when you create a
    cluster with what is going to turn out to now be a slow configuration.
    We'd need to make it really obvious to people who upgrade and notice
    that performance tanks that it's because of a configuration change made
    for safety reasons, one that they can undo for test deployments. That
    particular area, giving people better advice about what they should do
    to properly tune a new install for its intended workload, is something
    that's been making slow progress but still needs a lot of work. I think
    if some better tools come along there, so that most people are expected
    to follow a path that involves a tuning tool, it will be much easier to
    stomach the idea of changing the default--knowing that something that
    will undo that change is likely to appear to the user and suggest that
    the possibility is available.

    --
    Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
    PostgreSQL Training, Services and Support www.2ndQuadrant.us
    Author, "PostgreSQL 9.0 High Performance" Pre-ordering at:
    https://www.packtpub.com/postgresql-9-0-high-performance/book
  • Robert Haas at Sep 30, 2010 at 9:31 pm

    On Thu, Sep 30, 2010 at 5:02 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Oh, I missed that.  Actually, I wasn't really so concerned with
    whether his benchmark is correct.  I *am* concerned about being broken
    out of the box on MacOS X.
    Actually, the problem with OSX is that OSX is broken out of the box,
    at least by that standard.  The system's normal configuration is that
    fsync() does nothing, so it's hardly surprising that O_DSYNC is no
    better.  You have to use wal_sync_method = fsync_writethrough to get
    actual bits-to-the-platter behavior.

    I'm not sure whether we should select fsync_writethrough as the default
    on OSX.  We don't make an equivalent attempt to prevent OS or storage
    malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
    ahead of the game in that you *can* force writethrough without resorting
    to arcane hacks with hdparm or some such.

    We could definitely stand to be a bit more verbose about documenting
    the platform-specific issues in this area.
    I think some documentation is definitely in order, at the least. It's
    certainly astonishing that the default settings aren't crash-safe.
    I'd really like to understand how this shakes out on different
    platforms.

    Whether we should try to work around them is a trickier question, but
    I'm somewhat inclined to say yes. If we're trying to have the system
    be performant in the default config, turning off synchronous_commit
    would be saner than failing to make use of a system call which we know
    absolutely for sure to be necessary to avoid the possibility of
    database corruption.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Greg Smith at Oct 1, 2010 at 12:53 am

    Tom Lane wrote:
    I'm not sure whether we should select fsync_writethrough as the default
    on OSX. We don't make an equivalent attempt to prevent OS or storage
    malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
    ahead of the game in that you *can* force writethrough without resorting
    to arcane hacks with hdparm or some such.
    The same issue exists on Windows too, with the same workaround: it's
    normally unsafe by default, and you have to switch to fsync_writethrough
    in order to get commit safety while keeping write caching for regular
    writes; it's still ahead of most Unixes because at least it's that easy.

    The main downside to switching the default on either OS X or Windows is
    developers using those platforms for test deployments will suffer
    greatly from a performance drop for data they don't really care about.
    As those two in particular are much more likely to be client development
    platforms, too, that's a scary thing to consider.

    As for the documentation, I wrote
    http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm a
    while ago trying to address this better, but never got to merging some
    of those notes into the official docs. At the time, such heavy linking
    to external URLs was particularly frowned upon in the official docs.
    While that's eased up a bit, what I think I could do now is convert that
    article into something that's on the Wiki instead, and we could point
    the "Reliability" section of the docs toward there as something that
    could stay live as platform-specific changes happen. I'm thinking of
    the whole ext4 behavior change when I say that; stuff like that will
    only be addressable in a live document, since changes were happening in
    between major releases in that area.

    More recently, I've written a few things covering this area in what
    approaches excruciating detail, for this little document you might have
    been spammed in somebody's signature about. I'm going to ask the
    publisher to make that specific material the public sample chapter for
    the book, in hopes of making it easier for people to find a detailed
    discussion of this topic. Regardless of whether that works out, I have
    to clean up documentation in this whole area up for the checkpoint
    changes I'm working on for 9.1 anyway. That tinkers with the timing on
    fsync calls, and whether that works or not is very filesystem specific.
    I'm basically stuck with documenting exactly what happens for most
    possibilities in order to provide a comprehensive manual section
    covering that. I have a deadline for when I have to get my work in
    progress organized to share with the world now:
    https://www.postgresqlconference.org/content/righting-your-writes , so
    after I get back from that conference in November I'll see what I can do
    about the docs too.

    --
    Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
    PostgreSQL Training, Services and Support www.2ndQuadrant.us
    Author, "PostgreSQL 9.0 High Performance" Pre-ordering at:
    https://www.packtpub.com/postgresql-9-0-high-performance/book
