Using pgiosim realistically

John Rouillard
May 13, 2011 at 9:09 pm
Hi all:

I am adding pgiosim to our testing for new database hardware and I am
seeing something I don't quite get and I think it's because I am using
pgiosim incorrectly.

Specs:

OS: centos 5.5 kernel: 2.6.18-194.32.1.el5
memory: 96GB
cpu: 2x Intel(R) Xeon(R) X5690 @ 3.47GHz (6 core, ht enabled)
disks: WD2003FYYS RE4
raid: lsi - 9260-4i with 8 disks in raid 10 configuration
1MB stripe size
raid cache enabled w/ bbu
disk caches disabled
filesystem: ext3 created with -E stride=256

I am seeing really poor (70) iops with pgiosim. According to:
http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-8.html in the
database benchmark they are seeing ~170 iops on a single disk for
these drives. I would expect an 8 disk raid 10 to do better than
3x the single disk rate (assuming the data is randomly distributed).
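(With raid 10 the reads can be spread across both halves of each mirror, so
in theory an 8 disk array could approach 8x a single disk, on the order of
8 x 170 = 1360 iops, which is why 3x seems like a conservative floor.)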

To test I am using 5 100GB files with

sudo ~/pgiosim -c -b 100G -v file?

I am using 100G sizes to make sure that the data read and the file sizes
exceed the memory size of the system.

However, if I use 5 1GB files (and still read 100GB of data) I see 200+ to
400+ iops by the time 50% of the 100GB has been read, which I assume means
that the data is cached in the OS cache and I am not really getting a
measurement of hard drive/raid iops.

However, IIUC postgres will never have an index file greater than 1GB
in size
(http://www.postgresql.org/docs/8.4/static/storage-file-layout.html)
and will just add 1GB segments, so the 1GB files seem to be more
realistic.

So do I want 100 (or, more likely, 2 or 3 times that, say 300) 1GB files to
feed pgiosim? That way I will have enough data that not all of it can
be cached in memory and the file sizes (and file operations:
open/close) more closely match what postgres is doing with index
files?
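
If that is the right approach, I'd presumably just generate the files with
dd (the names here are only placeholders), e.g.:

for i in $(seq -w 1 300); do dd if=/dev/zero of=file$i bs=1M count=1024; done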

Also in the output of pgiosim I see:

25.17%, 2881 read, 0 written, 2304.56kB/sec 288.07 iops

which I interpret (left to right) as the % of the 100GB that has been
read, the number of read operations over some time period, the rate of
bytes read/written, and the io operations/sec. Iops always seems to be
1/10th of the read number (rounded up to an integer). Is this
expected, and if so, does anybody know why?
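
(For what it's worth, 2881 reads spread over a 10 second window works out to
about 288 iops, so the 1/10 ratio would make sense if the read count covers
roughly ten seconds of activity - but that is just a guess on my part.)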

While this is running, if I also run "iostat -p /dev/sdc 5" I see:

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc 166.40 2652.80 4.80 13264 24
sdc1 2818.80 1.20 999.20 6 4996

which I am interpreting as 2818 I/O operations per second (corresponding more
or less to the read count in the pgiosim output) against the partition, of
which only 166 are actually going to the drive??? with the rest handled from
the OS cache.

However, the tps isn't increasing even when pgiosim is reporting:

48.47%, 4610 read, 0 written, 3687.62kB/sec 460.95 iops

iostat output from around the same time shows:

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc 165.87 2647.50 4.79 13264 24
sdc1 2812.97 0.60 995.41 3 4987

so I am not sure if there is a correlation between the pgiosim read
numbers and the iostat tps values.
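
(I can also collect the extended stats, e.g. "iostat -dxk 5 /dev/sdc", which
report per-device await and %util; that might make it easier to line the two
outputs up.)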

Also I am assuming the blocks written are filesystem metadata, although that
seems like a lot of data.

If I stop pgiosim, the iostat output drops to 0 writes and reads as
expected.

So does anybody have any comments on how to test with pgiosim and how
to correlate the iostat and pgiosim outputs?

Thanks for your feedback.
--
-- rouilj

John Rouillard System Administrator
Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111

6 responses

  • Ktm at May 14, 2011 at 5:16 pm

    On Fri, May 13, 2011 at 09:09:41PM +0000, John Rouillard wrote:
    Hi all:

    I am adding pgiosim to our testing for new database hardware and I am
    seeing something I don't quite get and I think it's because I am using
    pgiosim incorrectly.

    Specs:

    OS: centos 5.5 kernel: 2.6.18-194.32.1.el5
    memory: 96GB
    cpu: 2x Intel(R) Xeon(R) X5690 @ 3.47GHz (6 core, ht enabled)
    disks: WD2003FYYS RE4
    raid: lsi - 9260-4i with 8 disks in raid 10 configuration
    1MB stripe size
    raid cache enabled w/ bbu
    disk caches disabled
    filesystem: ext3 created with -E stride=256

    I am seeing really poor (70) iops with pgiosim. According to:
    http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-8.html in the
    database benchmark they are seeing ~170 iops on a single disk for
    these drives. I would expect an 8 disk raid 10 to do better than
    3x the single disk rate (assuming the data is randomly distributed).

    Hi John,

    Those drives are 7200 rpm drives, which would give you a maximum write
    rate of 120/sec at best with the cache disabled. I actually think your
    70/sec is closer to reality and what you should anticipate in real use.
    I do not see how they could make 170/sec. Did they strap a jet engine to
    the drive? :)
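    (7200 rpm is 120 revolutions per second, and every random request costs
    a seek plus rotational latency, so per spindle you are bounded by roughly
    that order of magnitude.)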

    Regards,
    Ken
  • John Rouillard at May 16, 2011 at 1:17 pm

    On Sat, May 14, 2011 at 12:07:02PM -0500, ktm@rice.edu wrote:
    On Fri, May 13, 2011 at 09:09:41PM +0000, John Rouillard wrote:
    I am seeing really poor (70) iops with pgiosim. According to:
    http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-8.html in the
    database benchmark they are seeing ~170 iops on a single disk for
    these drives. I would expect an 8 disk raid 10 to do better than
    3x the single disk rate (assuming the data is randomly distributed).
    Those drives are 7200 rpm drives, which would give you a maximum write
    rate of 120/sec at best with the cache disabled. I actually think your
    70/sec is closer to reality and what you should anticipate in real use.
    I do not see how they could make 170/sec. Did they strap a jet engine to
    the drive? :)
    Hmm, I stated the disk cache was disabled. I should have said the disk
    write cache, but it's possible the readahead cache is disabled as well
    (not quite sure how to tell on the lsi cards). Also there isn't a lot
    of detail in what the database test mix is, and I haven't tried
    researching the site to see if they spec the exact test. If it included
    a lot of writes and they were being handled by a cache, then that could
    explain it.

    However, in my case I have an 8 disk raid 10 with a read only load (in
    this testing configuration). Shouldn't I expect more iops than a
    single disk can provide? Maybe pgiosim is hitting some other boundary
    than just i/o?

    Also it turns out that pgiosim can only handle 64 files. I haven't
    checked to see if this is a compile-time changeable item or not.

    --
    -- rouilj

    John Rouillard System Administrator
    Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111
  • Jeff at May 16, 2011 at 4:23 pm

    On May 16, 2011, at 9:17 AM, John Rouillard wrote:


    I am seeing really poor (70) iops with pgiosim. According to:
    http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-8.html in the
    database benchmark they are seeing ~170 iops on a single disk for
    these drives. I would expect an 8 disk raid 10 to do better than
    3x the single disk rate (assuming the data is randomly distributed).
    Those drives are 7200 rpm drives, which would give you a maximum write
    rate of 120/sec at best with the cache disabled. I actually think your
    70/sec is closer to reality and what you should anticipate in real use.
    I do not see how they could make 170/sec. Did they strap a jet engine to
    the drive? :)
    also you are reading with a worst case scenario for the mechanical
    disk - randomly seeking around everywhere, which will lower
    performance drastically.
    Hmm, I stated the disk cache was disabled. I should have said the disk
    write cache, but it's possible the readahead cache is disabled as well
    (not quite sure how to tell on the lsi cards). Also there isn't a lot
    of detail in what the database test mix is, and I haven't tried
    researching the site to see if they spec the exact test. If it included
    a lot of writes and they were being handled by a cache, then that could
    explain it.
    you'll get some extra from the os readahead and potentially the drive's
    own readahead.
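    if you want to check what the os readahead is set to, "blockdev --getra /dev/sdc"
    will show it (and --setra changes it).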

    However, in my case I have an 8 disk raid 10 with a read only load (in
    this testing configuration). Shouldn't I expect more iops than a
    single disk can provide? Maybe pgiosim is hitting some other boundary
    than just i/o?
    given your command line you are only running a single thread - use the
    -t argument to add more threads and that'll increase concurrency. a
    single process can only have so much in flight at once, and with multiple
    threads requesting different things the drives will actually be able to
    respond faster since they will have more work queued.
    I tend to test various levels - usually a single thread (-t 1, the
    default) to get a baseline, then -t (#drives / 2), -t (#drives), up to
    probably 4x the number of drives (you'll see iops level off).
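    something like the following (file names assumed) makes it easy to sweep
    the thread counts:

    for t in 1 4 8 16 32; do ~/pgiosim -c -b 100G -t $t file*; done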
    Also it turns out that pgiosim can only handle 64 files. I haven't
    checked to see if this is a compile-time changeable item or not.
    that is a #define in pgiosim.c
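    (i'm going from memory on the macro name - if it's something like
    MAX_FILES you can just bump the value in pgiosim.c and rebuild, e.g.
    sed -i 's/MAX_FILES 64/MAX_FILES 512/' pgiosim.c && make)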

    also, are you running the latest pgiosim from pgfoundry?

    the -w param to pgiosim has it rewrite blocks out as it runs. (it is a
    percentage).
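    e.g. "~/pgiosim -c -b 100G -t 8 -w 20 file*" should make roughly 20% of
    the operations rewrites.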
  • John Rouillard at May 16, 2011 at 5:06 pm

    On Mon, May 16, 2011 at 12:23:13PM -0400, Jeff wrote:
    On May 16, 2011, at 9:17 AM, John Rouillard wrote:
    However, in my case I have an 8 disk raid 10 with a read only load (in
    this testing configuration). Shouldn't I expect more iops than a
    single disk can provide? Maybe pgiosim is hitting some other boundary
    than just i/o?
    given your command line you are only running a single thread - use
    the -t argument to add more threads and that'll increase
    concurrency. a single process can only process so much at once and
    with multiple threads requesting different things the drive will
    actually be able to respond faster since it will have more work to
    do.
    I tend to test various levels - usually a single (-t 1 - the
    default) to get a base line, then -t (drives / 2), -t (#drives) up
    to probably 4x drives (you'll see iops level off).
    Ok cool. I'll try that.
    Also it turns out that pgiosim can only handle 64 files. I haven't
    checked to see if this is a compile-time changeable item or not.
    that is a #define in pgiosim.c
    So which is a better test: modifying the #define to allow specifying
    200-300 1GB files, or using 64 files but increasing the size of each
    file to 2-3GB so that the total size is two or three times the
    memory in my server (96GB)?
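    (64 files at 3GB each is about 192GB, i.e. roughly twice the 96GB of RAM
    in the box.)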
    also, are you running the latest pgiosim from pgfoundry?
    yup version 0.5 from the foundry.
    the -w param to pgiosim has it rewrite blocks out as it runs. (it is
    a percentage).
    Yup, I was running with that and getting low enough numbers that I
    switched to pure read tests. It looks like I just need multiple
    threads so I can have multiple reads/writes in flight at the same
    time.

    --
    -- rouilj

    John Rouillard System Administrator
    Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111
  • Jeff at May 16, 2011 at 5:54 pm

    On May 16, 2011, at 1:06 PM, John Rouillard wrote:

    that is a #define in pgiosim.c
    So which is a better test, modifying the #define to allow specifying
    200-300 1GB files, or using 64 files but increasing the size of my
    files to 2-3GB for a total bytes in the file two or three times the
    memory in my server (96GB)?
    I tend to make 10G chunks with dd and run pgiosim over that.
    dd if=/dev/zero of=bigfile bs=1M count=10240
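    to get well past your 96GB of ram you can just make a pile of them, e.g.:

    for i in $(seq 1 30); do dd if=/dev/zero of=bigfile$i bs=1M count=10240; done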
    the -w param to pgiosim has it rewrite blocks out as it runs. (it is
    a percentage).
    Yup, I was running with that and getting low enough numbers that I
    switched to pure read tests. It looks like I just need multiple
    threads so I can have multiple reads/writes in flight at the same
    time.
    Yep - you need multiple threads to get max throughput of your io.
  • John Rouillard at May 17, 2011 at 5:35 pm

    On Mon, May 16, 2011 at 01:54:06PM -0400, Jeff wrote:
    Yep - you need multiple threads to get max throughput of your io.
    I am running:

    ~/pgiosim -c -b 100G -v -t4 file[0-9]*

    Will each thread move 100GB of data? I am seeing:

    158.69%, 4260 read, 0 written, 3407.64kB/sec 425.95 iops

    Maybe the completion target percentage is off because of the threads?

    --
    -- rouilj

    John Rouillard System Administrator
    Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111
