I've been trying to optimize a Linux system where benchmarking suggests
large performance differences between the various wal_sync_method options
(with o_sync being the big winner). I started that by using
src/tools/fsync/test_fsync to get an idea what I was dealing with (and to
spot which drives had write caching turned on). Since those results
didn't match what I was seeing in the benchmarks, I've been browsing the
backend source to figure out why. I noticed test_fsync appears to be,
ahem, out of sync with what the engine is doing.

It looks like V8.1 introduced O_DIRECT writes to the WAL, determined at
compile time by a series of preprocessor tests in
src/backend/access/transam/xlog.c. When O_DIRECT is available, the
O_SYNC/O_FSYNC/O_DSYNC writes use it. test_fsync doesn't do that.
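
For readers who have not looked at that part of xlog.c, the compile-time selection described above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: every DEMO_ name and the demo_open_sync_flags() helper are made up for this example, and the real macro names in xlog.c may differ.

```c
#define _GNU_SOURCE             /* on Linux, asks <fcntl.h> to expose O_DIRECT */
#include <fcntl.h>

/* Hypothetical names throughout; this only mirrors the pattern described. */

/* Fall back to 0 so the flag can be OR'ed in unconditionally. */
#ifdef O_DIRECT
#define DEMO_O_DIRECT O_DIRECT
#else
#define DEMO_O_DIRECT 0
#endif

/* Pick whichever "sync on write" open flag this platform provides. */
#if defined(O_SYNC)
#define DEMO_BARE_SYNC_FLAG O_SYNC
#elif defined(O_FSYNC)
#define DEMO_BARE_SYNC_FLAG O_FSYNC
#else
#define DEMO_BARE_SYNC_FLAG 0
#endif

/* Flags a benchmark would pass to open() to match an engine that adds
 * O_DIRECT to its synchronous opens whenever the platform has it. */
static int demo_open_sync_flags(void)
{
    return O_RDWR | DEMO_BARE_SYNC_FLAG | DEMO_O_DIRECT;
}
```

A benchmark that opens its test file with plain O_SYNC measures a different code path than one that uses the combined flags, which is the mismatch being reported here.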

I moved the new code (in 8.2 beta 3, lines 61-92 in xlog.c) into
test_fsync; all the flags had the same name so it dropped right in. You
can get the version I made at http://www.westnet.com/~gsmith/test_fsync.c
(it also fixes a compiler warning).

The results I get now look fishy. I'm not sure if I screwed up a step, or
if I'm seeing a real problem. The system here is running RedHat Linux,
RHEL ES 4.0 kernel 2.6.9, and the disk I'm writing to is a standard
7200RPM IDE drive. I turned off write caching with hdparm -W 0.

Here's an excerpt from the stock test_fsync:

Compare one o_sync write to two:
        one 16k o_sync write     8.717944
        two 8k o_sync writes    17.501980

Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write      17.018495
        write, fdatasync         8.842473
        write, fsync,            8.809117

And here's the version I tried to modify to include O_DIRECT support:

Compare one o_sync write to two:
        one 16k o_sync write     0.004995
        two 8k o_sync writes     0.003027

Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write       0.004978
        write, fdatasync         8.845498
        write, fsync,            8.834037

Obviously the o_sync writes aren't waiting for the disk. Is this a
problem with O_DIRECT under Linux? Or is my code just not correctly
testing this behavior?
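
The reason the modified numbers cannot be real disk writes can be checked with a little arithmetic. The sketch below assumes each reported figure is the total time in seconds for 1000 iterations (the post does not state the loop count, so that is an assumption), and uses the rule of thumb that a drive with write caching off completes at most about one synchronous write per platter rotation. The function name is invented for this example.

```c
/* Rough lower bound, in seconds, for n_writes synchronous writes that must
 * each wait for the platter: at best one write completes per rotation.
 * The one-write-per-rotation model is a rule of thumb, not a measurement. */
static double demo_min_sync_seconds(double rpm, long n_writes)
{
    double seconds_per_rotation = 60.0 / rpm;   /* 7200 RPM: about 8.33 ms */

    return seconds_per_rotation * (double) n_writes;
}
```

Under those assumptions, demo_min_sync_seconds(7200.0, 1000) comes to roughly 8.33 seconds, which is close to the 8.7 to 8.8 second figures from the stock program and about twice that for two writes per iteration. A total of 0.005 seconds is more than three orders of magnitude below that floor.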

Just as a sanity check, I did try this on another system, running SuSE
with drives connected to a cciss SCSI device, and I got exactly the same
results. I'm concerned that Linux users who use O_SYNC because they
notice it's faster will be losing their WAL integrity without being aware
of the problem, especially as the whole O_DIRECT business isn't even
mentioned in the WAL documentation--it really deserves to be brought up in
the wal_sync_method notes at
http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html

And while I'm mentioning improvements to that particular documentation
page...the wal_buffers notes there are so sparse they misled me initially.
They suggest only bumping it up for situations with very large
transactions; since I was testing with small ones I left it woefully
undersized initially. I would suggest copying the text from
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html to
here: "When full_page_writes is set and the system is very busy, setting
this value higher will help smooth response times during the period
immediately following each checkpoint." That seems to match what I found
in testing.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

  • Tom Lane at Nov 23, 2006 at 4:46 pm

    Greg Smith writes:
    > The results I get now look fishy.

    There are at least two things wrong with this program:

    * It does not respect the alignment requirement for O_DIRECT buffers
    (reportedly either 512 or 4096 bytes depending on filesystem).

    * It does not check for errors (if it had, you might have realized the
    other problem).

    regards, tom lane
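
Both points can be illustrated with a short sketch: allocate the write buffer with posix_memalign() and check the result of every call. This is an assumption-laden example, not the eventual test_fsync patch: the 4096-byte alignment is simply the larger of the two values mentioned above, and the DEMO_ constants and demo_ function names are invented here.

```c
#define _GNU_SOURCE             /* on Linux, asks <fcntl.h> to expose O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without O_DIRECT: plain O_SYNC */
#endif

#define DEMO_ALIGN 4096         /* assumed alignment; 512 may suffice on some filesystems */
#define DEMO_BLOCK 8192         /* one 8k block, as in the tests above */

/* True if ptr is a multiple of align bytes. */
static int demo_is_aligned(const void *ptr, size_t align)
{
    return ((uintptr_t) ptr % align) == 0;
}

/* Write one aligned 8k block with O_SYNC | O_DIRECT.
 * Returns 0 on success, -1 on any failure (reported on stderr). */
static int demo_direct_sync_write(const char *path)
{
    void   *buf = NULL;
    int     fd;
    int     rc;
    ssize_t n;

    /* posix_memalign returns an error number instead of setting errno. */
    rc = posix_memalign(&buf, DEMO_ALIGN, DEMO_BLOCK);
    if (rc != 0)
    {
        fprintf(stderr, "posix_memalign: %s\n", strerror(rc));
        return -1;
    }
    memset(buf, 'a', DEMO_BLOCK);

    fd = open(path, O_RDWR | O_CREAT | O_SYNC | O_DIRECT, 0600);
    if (fd < 0)
    {
        perror("open");
        free(buf);
        return -1;
    }

    n = write(fd, buf, DEMO_BLOCK);
    if (n != DEMO_BLOCK)
    {
        if (n < 0)
            perror("write");
        else
            fprintf(stderr, "short write: %zd bytes\n", n);
        close(fd);
        free(buf);
        return -1;
    }

    if (close(fd) != 0)
    {
        perror("close");
        free(buf);
        return -1;
    }
    free(buf);
    return 0;
}
```

With an unaligned buffer, an O_DIRECT write on Linux typically fails immediately with EINVAL. A timing loop that ignores the return value of write() would then report near-zero times, much like the modified results earlier in the thread.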
  • Greg Smith at Nov 23, 2006 at 6:10 pm

    On Thu, 23 Nov 2006, Tom Lane wrote:

    > * It does not check for errors (if it had, you might have realized the
    > other problem).

    All the test_fsync code needs to check for errors better; there have been
    multiple occasions where I've run that with questionable input and it
    didn't complain, it just happily ran and reported times that were almost
    0.

    Thanks for the note about alignment, I had seen something about that in
    the xlog.c but wasn't sure if that was important in this case.

    It's very important to the project I'm working on that I get this cleared
    up, and I think I'm in a good position to fix it myself now. I just
    wanted to report the issue and get some initial feedback on what's wrong.
    I'll try to rewrite that code with an eye toward the "Determine optimal
    fdatasync/fsync, O_SYNC/O_DSYNC options" to-do item, which is what I'd
    really like to have.

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
  • Bruce Momjian at Nov 24, 2006 at 4:06 am

    Greg Smith wrote:
    > On Thu, 23 Nov 2006, Tom Lane wrote:
    >
    > > * It does not check for errors (if it had, you might have realized the
    > > other problem).
    >
    > All the test_fsync code needs to check for errors better; there have been
    > multiple occasions where I've run that with questionable input and it
    > didn't complain, it just happily ran and reported times that were almost
    > 0.
    >
    > Thanks for the note about alignment, I had seen something about that in
    > the xlog.c but wasn't sure if that was important in this case.
    >
    > It's very important to the project I'm working on that I get this cleared
    > up, and I think I'm in a good position to fix it myself now. I just
    > wanted to report the issue and get some initial feedback on what's wrong.
    > I'll try to rewrite that code with an eye toward the "Determine optimal
    > fdatasync/fsync, O_SYNC/O_DSYNC options" to-do item, which is what I'd
    > really like to have.

    Please send an updated patch for test_fsync.c so we can get it working
    for 8.2.

    --
    Bruce Momjian bruce@momjian.us
    EnterpriseDB http://www.enterprisedb.com

    + If your life is a hard drive, Christ can be your backup. +

Discussion Overview
group: pgsql-performance
category: postgresql
posted: Nov 23, '06 at 6:40a
active: Nov 24, '06 at 4:06a
posts: 4
users: 3
website: postgresql.org
irc: #postgresql
