Hello.

We are trying to use an HP CISS controller (Smart Array E200i) with internal
cache memory (100M for write caching, with a built-in backup battery) together
with Postgres. Typically, under heavy load, Postgres runs the checkpoint fsync
very slowly:

checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms

(If we turn fsync off (fsync=0), the speed increases greatly.) Unfortunately,
this hurts the performance of the whole database during the checkpoint.
Here are the timings (in milliseconds) of a test transaction called multiple
times concurrently (6 threads) with fsync turned ON:

40.4
44.4
37.4
44.0
42.7
41.8
218.1
254.2
101.0
42.2
42.4
41.0
39.5

(you can see the significant slowdown during the checkpoint).
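For reference, here is a sketch of how such a timing loop can be driven
through libpq. This is illustrative only: the connection string and the
transaction body below are placeholders, and the real test ran the
application's own transaction from 6 concurrent threads:

/* Illustrative single-connection timing loop (placeholder transaction). */
#include <stdio.h>
#include <sys/time.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");  /* hypothetical connection string */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }
    for (int i = 0; i < 13; i++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        /* placeholder statements; the real test ran its own transaction */
        PGresult *res = PQexec(conn,
            "BEGIN; UPDATE t SET v = v + 1 WHERE id = 1; COMMIT;");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(res);
        gettimeofday(&t1, NULL);
        printf("%.1f\n", ((t1.tv_sec - t0.tv_sec)
                        + (t1.tv_usec - t0.tv_usec) / 1e6) * 1000.0);
    }
    PQfinish(conn);
    return 0;
}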
Here is the dstat disk write activity log for that test (only the write
column survived formatting):

284k
84k
276k
37M
208k
156k
I have written a small perl script to check how slow fsync is for the Smart
Array E200i controller. Theoretically, because of the write cache, fsync
should cost nothing, but in practice it is not true:

# cd /mnt/c0d1p1/
# perl -e '
    use Time::HiRes qw(gettimeofday tv_interval);
    system "sync";                        # flush anything already pending
    open F, ">bulk";
    print F ("a" x (1024 * 1024 * 20));   # write a 20M file
    close F;
    $t0 = [gettimeofday];
    system "sync";                        # time how long the flush takes
    print ">>> fsync took " . tv_interval($t0, [gettimeofday]) . " s\n";
    unlink "bulk";
'
>>> fsync took 0.247033 s
You see, a 20M block was synced in 0.25 s.
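
The same measurement can also be made with a real fsync(2) on the file's own
descriptor instead of the system-wide sync command. A minimal C sketch under
the same assumptions (the file name "bulk" and the 20M size are arbitrary):

/* Times fsync(2) on the written file's own descriptor.
 * Build with: cc -o fsynctest fsynctest.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    const size_t sz = 20 * 1024 * 1024;   /* same 20M payload as the perl test */
    char *buf = malloc(sz);
    if (buf == NULL) { perror("malloc"); return 1; }
    memset(buf, 'a', sz);

    int fd = open("bulk", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, buf, sz) != (ssize_t) sz) { perror("write"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    if (fsync(fd) != 0) { perror("fsync"); return 1; }   /* the call being timed */
    gettimeofday(&t1, NULL);

    printf(">>> fsync took %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    close(fd);
    unlink("bulk");
    free(buf);
    return 0;
}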

The question is: how do we solve this problem and make fsync run with no
delay? It seems to me that the controller's internal write cache is not being
used (strange, because all the configuration options look fine), but how can
we check that? Or maybe there is some other side effect?

  • Scott Marlowe at Aug 22, 2007 at 3:47 pm

    On 8/22/07, Dmitry Koterov wrote:
    Hello.
    You see, a 20M block was synced in 0.25 s.

    The question is: how do we solve this problem and make fsync run with no
    delay? It seems to me that the controller's internal write cache is not being
    used (strange, because all the configuration options look fine), but how can
    we check that? Or maybe there is some other side effect?
    I would suggest that either the controller is NOT configured correctly, OR
    there's some bug in how the OS is interacting with it.

    What options are there for this RAID controller, and what are they set to?
    Specifically, the writeback / writethrough options for the cache; it may be
    that if the controller doesn't properly detect a battery backup module, it
    refuses to go into writeback mode.
  • Dmitry Koterov at Aug 22, 2007 at 5:29 pm
    And here are the results of the built-in Postgres test script:

    Simple write timing:
    write 0.006355

    Compare fsync times on write() and non-write() descriptor:
    (If the times are similar, fsync() can sync data written
    on a different descriptor.)
    write, fsync, close 0.233793
    write, close, fsync 0.227444

    Compare one o_sync write to two:
    one 16k o_sync write 0.297093
    two 8k o_sync writes 0.402803

    Compare file sync methods with one 8k write:

    (o_dsync unavailable)
    write, fdatasync 0.228725
    write, fsync, 0.223302

    Compare file sync methods with 2 8k writes:
    (o_dsync unavailable)
    open o_sync, write 0.414954
    write, fdatasync 0.335280
    write, fsync, 0.327195

    (Also, I tried to manually specify the open_sync method in postgresql.conf,
    but after that the Postgres database crashed completely. :-)
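
    For reference, the kind of comparison test_fsync makes can be sketched in
    a few lines of C. This is a simplified illustration of the technique, not
    the actual test_fsync code:

    /* Simplified illustration of what test_fsync compares: an 8k write
     * followed by fsync() versus the same write on an O_SYNC descriptor. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double elapsed(struct timeval a, struct timeval b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    int main(void)
    {
        char buf[8192];
        struct timeval t0, t1;
        memset(buf, 'x', sizeof buf);

        /* method 1: plain write, then an explicit fsync */
        int fd = open("sync_probe", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        gettimeofday(&t0, NULL);
        write(fd, buf, sizeof buf);
        fsync(fd);
        gettimeofday(&t1, NULL);
        printf("write, fsync  %f\n", elapsed(t0, t1));
        close(fd);

        /* method 2: O_SYNC descriptor; the write itself must reach stable storage */
        fd = open("sync_probe", O_WRONLY | O_TRUNC | O_SYNC);
        if (fd < 0) { perror("open O_SYNC"); return 1; }
        gettimeofday(&t0, NULL);
        write(fd, buf, sizeof buf);
        gettimeofday(&t1, NULL);
        printf("o_sync write  %f\n", elapsed(t0, t1));
        close(fd);
        unlink("sync_probe");
        return 0;
    }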

    On 8/22/07, Dmitry Koterov wrote:

    All settings seem to be fine. The mode is writeback.

    We temporarily (for tests only, on a test machine!!!) put pg_xlog onto a
    RAM drive (to completely exclude xlog fsync from the statistics), but the
    slowdown during the checkpoint and the 5-10 second fsync during the
    checkpoint are still there.

    Here are some statistical data from the controller. Other report data is
    attached to the mail.

    ACCELERATOR STATUS:
    Logical Drive Disable Map: 0x00000000
    Read Cache Size: 24 MBytes
    Posted Write Size: 72 MBytes
    Disable Flag: 0x00
    Status: 0x00000001
    Disable Code: 0x0000
    Total Memory Size: 128 MBytes
    Battery Count: 1
    Battery Status: 0x0001
    Parity Read Errors: 0000
    Parity Write Errors: 0000
    Error Log: N/A
    Failed Batteries: 0x0000
    Board Present: Yes
    Accelerator Failure Map: 0x00000000
    Max Error Log Entries: 12
    NVRAM Load Status: 0x00
    Memory Size Shift Factor: 0x0a
    Non Battery Backed Memory: 0 MBytes
    Memory State: 0x00

  • Phoenix Kiula at Aug 22, 2007 at 5:39 pm
    Hi,

    On 23/08/07, Dmitry Koterov wrote:
    And here are the results of the built-in Postgres test script:


    Can you tell me how I can execute this script on my system? Where is
    this script?

    Thanks!
  • Dmitry Koterov at Aug 22, 2007 at 5:51 pm
    This script is here:
    postgresql-8.2.3/src/tools/fsync/test_fsync.c

  • Greg Smith at Aug 22, 2007 at 8:18 pm

    On Wed, 22 Aug 2007, Dmitry Koterov wrote:

    I have written a small perl script to check how slow fsync is for the Smart
    Array E200i controller. Theoretically, because of the write cache, fsync
    should cost nothing, but in practice it is not true
    That theory is fundamentally flawed; you don't know what else is in the
    operating system write cache in front of what you're trying to fsync, and
    you also don't know exactly what's in the controller's cache when you
    start. For all you know, the controller might be filled with cached reads
    and refuse to kick all of them out. This is a complicated area where
    tests are much more useful than trying to predict the behavior.

    You haven't mentioned any details yet about the operating system you're
    running on; Solaris? I'm guessing from the device name. There have been
    some comments passing by lately about write caching not being turned on by
    default in that operating system.

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
  • Dmitry Koterov at Aug 22, 2007 at 10:43 pm

    I have written a small perl script to check how slow fsync is for the Smart
    Array E200i controller. Theoretically, because of the write cache, fsync
    should cost nothing, but in practice it is not true
    That theory is fundamentally flawed; you don't know what else is in the
    operating system write cache in front of what you're trying to fsync, and
    you also don't know exactly what's in the controller's cache when you
    start. For all you know, the controller might be filled with cached reads
    and refuse to kick all of them out. This is a complicated area where
    tests are much more useful than trying to predict the behavior.


    Nobody else writes, nobody reads. The machine is for tests only; it is
    clean. I monitored dstat: for 5 minutes before the test there was no disc
    activity. So I assume the controller cache was already flushed before I
    ran the test.

    You haven't mentioned any details yet about the operating system you're
    running on; Solaris? I'm guessing from the device name. There have been
    some comments passing by lately about write caching not being turned on by
    default in that operating system.
    Linux CentOS x86_64. A lot of memory, 8 processors.
    The filesystem is ext2 (to reduce journalling side effects).
    OS write caching has been tested turned on, turned off, and set to flush
    once per second; none of these made any difference.

    The question is: MUST my test script report a near-zero fsync time if the
    controller has a large built-in write cache? If yes, something is wrong
    with the controller or drivers (how can I diagnose it?). If no, why not?

    There are a lot of discussions on this mailing list about fsync and
    battery-backed controllers; people say that a controller with built-in
    cache memory reduces the cost of fsync to nearly zero. I just want to
    achieve this.
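
    One way to turn that question into a test: with a working battery-backed
    write cache, each small write-plus-fsync cycle should complete in well
    under a millisecond, because the data only has to reach the controller's
    RAM, not the platters. A rough C probe along those lines (a sketch, not a
    definitive benchmark):

    /* Average latency of repeated 8k write+fsync cycles. With a working
     * battery-backed write cache this should be well under a millisecond per
     * cycle; at raw disk speed it takes several milliseconds (at least one
     * platter rotation). */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        char buf[8192];
        memset(buf, 'x', sizeof buf);

        int fd = open("wal_probe", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const int iters = 1000;
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < iters; i++) {
            if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t) sizeof buf) {
                perror("pwrite"); return 1;
            }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
        }
        gettimeofday(&t1, NULL);

        double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%d write+fsync cycles: %.3f s total, %.3f ms each\n",
               iters, total, total / iters * 1000.0);
        close(fd);
        unlink("wal_probe");
        return 0;
    }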
  • Dmitry Koterov at Aug 22, 2007 at 10:45 pm
    Also, the controller is configured to use 75% of its memory for write
    caching and 25% for read caching, so reads cannot flood out writes.

  • Ron Johnson at Aug 22, 2007 at 11:14 pm

    On 08/22/07 17:45, Dmitry Koterov wrote:
    Also, the controller is configured to use 75% of its memory for write
    caching and 25% for read caching, so reads cannot flood out writes.
    That seems to be a very extreme ratio. Most databases do *many*
    times more reads than writes.

    --
    Ron Johnson, Jr.
    Jefferson LA USA

    Give a man a fish, and he eats for a day.
    Hit him with a fish, and he goes away for good!
  • Greg Smith at Aug 23, 2007 at 3:16 am

    On Wed, 22 Aug 2007, Ron Johnson wrote:

    That seems to be a very extreme ratio. Most databases do *many*
    times more reads than writes.
    Yes, but the OS has a lot more memory to cache the reads for you, so you
    should be relying more heavily on it in cases like this where the card has
    a relatively small amount of memory. The main benefit for having a
    caching controller is fsync acceleration, the reads should pass right
    through the controller's cache and then stay in system RAM afterwards if
    they're needed again.

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
  • Scott Marlowe at Aug 22, 2007 at 11:49 pm

    On 8/22/07, Dmitry Koterov wrote:
    Also, the controller is configured to use 75% of its memory for write
    caching and 25% for read caching, so reads cannot flood out writes.
    128 Meg is a pretty small cache for a modern RAID controller. I
    wonder if this one is just a dog performer.

    Have you looked at things like the Areca or Escalade with 1g or more
    cache on them?
  • Greg Smith at Aug 23, 2007 at 3:57 am

    On Wed, 22 Aug 2007, Dmitry Koterov wrote:

    We are trying to use an HP CISS controller (Smart Array E200i)
    There have been multiple reports of general performance problems
    specifically with the cciss Linux driver for other HP cards. The E200i
    isn't from the same series, but I wouldn't expect their drivers to have
    gotten much better. Wander through the thread at
    http://svr5.postgresql.org/pgsql-performance/2006-07/msg00257.php to see
    one example I recall from last year; there are more in the archives if you
    search around a bit.
    I have written a small perl script to check how slow fsync is for the Smart
    Array E200i controller. Theoretically, because of the write cache, fsync
    should cost nothing, but in practice it is not true:
    fsync took 0.247033 s
    For comparison's sake, running your script 20 times against my system with
    an Areca ARC-1210 card with 256MB of cache gives the following minimum and
    maximum times (full details on my server config are at
    http://www.westnet.com/~gsmith/content/postgresql/serverinfo.htm ):
    fsync took 0.039676 s
    fsync took 0.041137 s
    And here's what the last set of test_fsync results look like on my system:

    Compare file sync methods with 2 8k writes:
    open o_sync, write 0.099819
    write, fdatasync 0.100054
    write, fsync, 0.094009

    So basically your card is running 3 (test_fsync) to 6 (your script) times
    slower than my Areca unit on these low-level tests. I don't know that it's
    possible to drive the fsync times completely to zero, but there's
    certainly a whole lot of room for improvement between where you are and
    what I'd expect from even a cheap caching controller like the one I'm
    using. I've got maybe $900 worth of hardware total in this box, and it's
    way faster than yours in this area.
    (Also, I tried to manually specify the open_sync method in postgresql.conf,
    but after that the Postgres database crashed completely. :-)
    This is itself a sign there's something really strange going on. There's
    something wrong with your system, your card, or the OS/driver you're using
    if open_sync doesn't work under Linux; in fact, it should be faster in
    practice even if it looks a little slower on test_fsync.
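
    A quick way to check whether O_SYNC itself misbehaves on that partition,
    independent of Postgres, is a minimal probe like this (a sketch; the file
    name is arbitrary):

    /* Minimal check that an O_SYNC open and write succeed at all. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192];
        memset(buf, 'x', sizeof buf);

        int fd = open("osync_probe", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
        if (fd < 0) { perror("open(O_SYNC)"); return 1; }
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf) {
            perror("write(O_SYNC)"); return 1;
        }
        printf("O_SYNC open and write succeeded\n");
        close(fd);
        unlink("osync_probe");
        return 0;
    }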

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
  • Lincoln Yeoh at Aug 23, 2007 at 5:33 pm

    At 11:28 PM 8/22/2007, Dmitry Koterov wrote:
    Hello.

    We are trying to use an HP CISS controller (Smart Array E200i) with
    internal cache memory (100M for write caching, with a built-in backup
    battery) together with Postgres. Typically, under heavy load, Postgres
    runs the checkpoint fsync very slowly:

    checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms

    (If we turn fsync off (fsync=0), the speed increases greatly.)
    Unfortunately, this hurts the performance of the whole database during the
    checkpoint. Here are the timings (in milliseconds) of a test transaction
    called multiple times concurrently (6 threads) with fsync turned ON:
    It's likely your controller is not actually doing write caching, or its
    write caching is simply slow (I've seen RAID controllers that are slower
    than software RAID).

    Have you actually configured your controller to do write caching? I
    wouldn't be surprised if it's on a conservative setting, meaning
    "write-through" rather than "write-back", even if there's a battery.

    BTW, what happens if someone replaced a faulty battery backed
    controller card on a "live" system with one from a "don't care test
    system" (identical hardware tho) that was powered down abruptly
    because people didn't care? Would the new card proceed to trash the
    "live" system?

    Probably not that important, but what are your mount options for the
    partition? Is the partition mounted noatime (or similar)?

    Regards,
    Link.
  • Greg Smith at Aug 23, 2007 at 5:53 pm

    On Fri, 24 Aug 2007, Lincoln Yeoh wrote:

    BTW, what happens if someone replaced a faulty battery backed controller card
    on a "live" system with one from a "don't care test system" (identical
    hardware tho) that was powered down abruptly because people didn't care?
    Would the new card proceed to trash the "live" system?
    All the caching controllers I've examined this behavior on give each disk
    a unique ID, so if you connect new disks to them they won't trash
    anything, because pending writes only go out to the original drives. What
    happens to the pending writes for drives that aren't there anymore is kind
    of undefined, though; presumably they'll just be thrown away. I don't know
    if any cards try to hang on to them in case the original disks are
    connected later.

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
