I'm trying to convert from Oracle to Postgres (Postgres is billions
of times more straightforward and pragmatically clean than Oracle),
but I'm having some severe performance problems on what naively
appears to be a very straightforward, dead-simple test.

The test consists of two parts: a write part, which accumulates
(sums) numbers by distinct keys, and a read part, which searches for
keys in the database (some of which will be present, some of which
will not). In a more realistic scenario, both would be happening all
of the time, but we can start off easy.

However, performance is terrible: around 39 write transactions/second
and 69 searches/second. Oracle, by comparison, writes at 314 and
reads at 395--practically an order of magnitude better performance.
Both are using the same hardware (obviously not at the same time): a
dual-processor AMD 2000+ with 3GB of memory, with both Oracle and
Postgres loaded on a 105GB ``MD'' stripe (no redundancy) across 2
SCSI disks running an ext3 fs (no special flags) under Linux
2.4.18-10smp.

I actually have seven different schemes for performing the writes
using Postgres:

----------------------------------------------------------------------
"normal" C libpq 39 t/s
"normal" Perl DBI 39 t/s
"DBI Prepared Statement" Perl DBI 39 t/s
"Batching" Perl DBI 45 t/s
"arrays" Perl DBI 26 t/s
"server-side function" Perl DBI 39 t/s
"server-side trigger" Perl DBI 39 t/s
"normal" Perl DBI read 69 t/s
"normal" Perl DBI for Oracle 314 t/s
"normal" Perl DBI read for Oracle 395 t/s
----------------------------------------------------------------------

Only batching showed a statistically significant improvement, and
even that wasn't major. I couldn't use true Postgres prepared
statements, since you cannot yet determine the success/failure of
each statement. I was planning on using arrays as well, but the
additional 33% performance impact is not amusing (though I suppose it
is only an additional 3% if you consider the 87% performance drop
from Oracle to Postgres).
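
For concreteness, the "DBI Prepared Statement" variant amounts to
something like the following sketch (the exact code is in the
attached tarball; this is just the shape of it):

----------------------------------------------------------------------
# Sketch of the prepared-statement method.  execute() returns the
# rows-affected count for non-SELECT statements, with "0E0" meaning
# "zero but true", so the update-then-insert logic still works.
my $upd = $dbh->prepare("UPDATE test SET accum = accum + ? WHERE val = ?");
my $ins = $dbh->prepare("INSERT INTO test VALUES ( ?, ? )");

while (<>)
{
    chomp;
    my ($val, $amount) = split;
    if ($upd->execute($amount, $val) eq "0E0")
    {
        $ins->execute($val, $amount);
    }
}
----------------------------------------------------------------------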

I'll include all methods in the attached file, but since there was no
significant difference, I'll concentrate on the basic one:

Example table:
----------------------------------------------------------------------
CREATE TABLE test (
val BIGINT PRIMARY KEY, -- "val" may be between 0 and 2^32-1
accum INTEGER
);
----------------------------------------------------------------------

Basic algorithm for writes
----------------------------------------------------------------------
while (<>)
{
    chomp;
    my @A = split;    # $A[0] = key, $A[1] = amount to accumulate

    # dosql() returns the DBI rows-affected count; "0E0" means
    # "zero but true", i.e. the UPDATE matched no existing row.
    if (dosql($dbh, "UPDATE test SET accum = accum + $A[1] WHERE val = '$A[0]';", 0) eq "0E0")
    {
        dosql($dbh, "INSERT INTO test VALUES ( $A[0], $A[1] );");
    }
}
----------------------------------------------------------------------
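
The "Batching" row above presumably owes its edge to amortizing the
per-COMMIT cost over many rows; a minimal sketch of that variant
(assuming batching means turning AutoCommit off and committing every
1000 rows):

----------------------------------------------------------------------
# Sketch of the batched writer: same update-then-insert logic,
# but many rows per transaction instead of one commit per statement.
$dbh->{AutoCommit} = 0;
my $pending = 0;

while (<>)
{
    chomp;
    my @A = split;
    if (dosql($dbh, "UPDATE test SET accum = accum + $A[1] WHERE val = '$A[0]';", 0) eq "0E0")
    {
        dosql($dbh, "INSERT INTO test VALUES ( $A[0], $A[1] );");
    }
    $dbh->commit if ((++$pending % 1000) == 0);    # flush a batch
}
$dbh->commit;    # commit the final partial batch
----------------------------------------------------------------------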

Basic algorithm for reads
----------------------------------------------------------------------
while (<>)
{
    chomp;
    my @A = split;    # $A[0] = key to look up
    # NB: $A[0] is interpolated unquoted here; this turns out to matter
    # for index use on the BIGINT column (see below).
    my $sth = querysql($dbh, "SELECT accum FROM test WHERE val = $A[0];");
    $hit++ if ($sth && ($row = $sth->fetchrow_arrayref));
    $tot++;
}
----------------------------------------------------------------------
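
The dosql() and querysql() calls above are thin wrappers around DBI;
minimal sketches of what they presumably look like (the real versions
are in the attached tarball):

----------------------------------------------------------------------
# Hypothetical reconstructions of the helper wrappers.
sub dosql
{
    my ($dbh, $sql, $quiet) = @_;
    my $rv = $dbh->do($sql);  # rows affected; "0E0" if none; undef on error
    warn($dbh->errstr) if (!defined($rv) && !$quiet);
    return $rv;
}

sub querysql
{
    my ($dbh, $sql) = @_;
    my $sth = $dbh->prepare($sql) or return undef;
    $sth->execute or return undef;
    return $sth;    # caller fetches the rows
}
----------------------------------------------------------------------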

What could be simpler?

In my randomly generated write data, I usually have about 18K inserts
and 82K updates. In my randomly generated read data, I have 100K keys
which will be found and 100K keys which will not be found.

The postgresql.conf file is default (my sysadmin nuked all of my
changes when he upgraded to 7.3.1--grr) and there are some shared
memory configs: kernel.sem = 250 32000 100 128, kernel.shmmax =
2147483648, kernel.shmmni = 100, kernel.shmmax = 134217728. The
WAL is not separated (but see below).

A "vacuum analyze" is performed between the write phase and the read
phase. However, for your analysis pleasure, here are the results
of a full verbose analyze and some explain results (both before and after).

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
seth=> explain update test set accum = accum + 53 where val = '5';
QUERY PLAN
-----------------------------------------------------
Seq Scan on test (cost=0.00..0.00 rows=1 width=18)
Filter: (val = 5::bigint)
(2 rows)
seth=> explain insert into test values (5, 53);
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
(1 row)
seth=> vacuum full verbose analyze test;
INFO: --Relation public.test--
INFO: Pages 541: Changed 2, reaped 540, Empty 0, New 0; Tup 18153: Vac 81847, Keep/VTL 0/0, UnUsed 0, MinLen 40, MaxLen 40; Re-using: Free/Avail. Space 3294932/3294932; EndEmpty/Avail. Pages 0/541.
CPU 0.00s/0.03u sec elapsed 0.02 sec.
INFO: Index test_pkey: Pages 355; Tuples 18153: Deleted 81847.
CPU 0.03s/0.34u sec elapsed 0.65 sec.
INFO: Rel test: Pages: 541 --> 99; Tuple(s) moved: 18123.
CPU 1.01s/0.31u sec elapsed 9.65 sec.
INFO: Index test_pkey: Pages 355; Tuples 18153: Deleted 18123.
CPU 0.02s/0.06u sec elapsed 0.19 sec.
INFO: Analyzing public.test
VACUUM

seth=> explain select accum from test where val = 5;
QUERY PLAN
----------------------------------------------------------------------
Index Scan using test_pkey on test (cost=0.00..5.99 rows=1 width=4)
Index Cond: (val = 5)
(2 rows)
seth=> explain update test set accum = accum + 53 where val = '5';
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using test_pkey on test (cost=0.00..5.99 rows=1 width=18)
Index Cond: (val = 5::bigint)
(2 rows)
seth=> explain insert into test values (5, 53);
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
(1 row)
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

I certainly understand that using an index scan might well speed
things up WRT the updates, but considering that the search performance
above is post-analyze (pre-analyze it is even more deadly slow), I am
dubious that an index scan during the updates will get me within
striking distance of Oracle, since read performance has got to be
better than write performance, right? This is also why I am dubious
that moving the WAL to another filesystem or futzing with the fsync
policy will do anything.

I will include below a compressed tarball of the programs I used (and
the corresponding RUNME script) in case you wish to play along at
home. I don't claim they are pretty, BTW :-)

-Seth Robertson

  • Stephan Szabo at Jan 21, 2003 at 9:46 pm

    On Tue, 21 Jan 2003, Seth Robertson wrote:

    > The postgresql.conf file is default (my sysadmin nuked all of my
    > changes when he upgraded to 7.3.1--grr) and there are some shared
    > memory configs: kernel.sem = 250 32000 100 128, kernel.shmmax =
    > 2147483648, kernel.shmmni = 100, kernel.shmmax = 134217728. The
    > WAL is not separated (but see below).

    You almost certainly want to raise shared_buffers from the default (64?)
    to say 1k-10k. I'm not sure how much that'll help but it should help
    some.

    > A "vacuum analyze" is performed between the write phase and the read
    > phase. However, for your analysis pleasure, here are the results
    > of a full verbose analyze and some explain results (both before and after).

    BTW: what does explain analyze (rather than plain explain) show?
  • Seth Robertson at Jan 21, 2003 at 10:09 pm
    In message <20030121134242.Q84028-100000@megazone23.bigpanda.com>, Stephan Szabo writes:
    > On Tue, 21 Jan 2003, Seth Robertson wrote:
    >
    > > The postgresql.conf file is default (my sysadmin nuked all of my
    > > changes when he upgraded to 7.3.1--grr) and there are some shared
    > > memory configs: kernel.sem = 250 32000 100 128, kernel.shmmax =
    > > 2147483648, kernel.shmmni = 100, kernel.shmmax = 134217728. The
    > > WAL is not separated (but see below).
    >
    > You almost certainly want to raise shared_buffers from the default (64?)
    > to say 1k-10k. I'm not sure how much that'll help but it should help
    > some.

    I'll try that and report back later, but I was under the (false?)
    impression that it was primarily important when you had multiple
    database connections using the same table.

    > > A "vacuum analyze" is performed between the write phase and the
    > > read phase. However, for your analysis pleasure, here are the
    > > results of a full verbose analyze and some explain results (both
    > > before and after).
    >
    > BTW: what does explain analyze (rather than plain explain) show?


    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
    seth=> explain analyze select accum from test where val = 5;
    QUERY PLAN
    ----------------------------------------------------------------------------------------------
    Seq Scan on test (cost=0.00..0.00 rows=1 width=4) (actual time=94.55..94.55 rows=0 loops=1)
    Filter: (val = 5)
    Total runtime: 99.20 msec
    (3 rows)

    seth=> explain analyze update test set accum = accum + 53 where val = '5';
    QUERY PLAN
    -----------------------------------------------------------------------------------------------
    Seq Scan on test (cost=0.00..0.00 rows=1 width=18) (actual time=31.95..31.95 rows=0 loops=1)
    Filter: (val = 5::bigint)
    Total runtime: 32.04 msec
    (3 rows)

    seth=> explain analyze insert into test values (5, 53);
    QUERY PLAN
    ----------------------------------------------------------------------------------
    Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=1)
    Total runtime: 7.50 msec
    (2 rows)

    seth=> vacuum full verbose analyze test
    seth-> ;
    INFO: --Relation public.test--
    INFO: Pages 541: Changed 1, reaped 539, Empty 0, New 0; Tup 18071: Vac 81930, Keep/VTL 0/0, UnUsed 0, MinLen 40, MaxLen 40; Re-using: Free/Avail. Space 3298208/3298176; EndEmpty/Avail. Pages 0/540.
    CPU 0.03s/0.00u sec elapsed 0.02 sec.
    INFO: Index test_pkey: Pages 355; Tuples 18071: Deleted 81930.
    CPU 0.04s/0.41u sec elapsed 1.96 sec.
    INFO: Rel test: Pages: 541 --> 98; Tuple(s) moved: 18046.
    CPU 0.95s/0.42u sec elapsed 12.74 sec.
    INFO: Index test_pkey: Pages 355; Tuples 18071: Deleted 18046.
    CPU 0.02s/0.05u sec elapsed 0.31 sec.
    INFO: Analyzing public.test
    VACUUM
    seth=> explain analyze select accum from test where val = 5;
    QUERY PLAN
    -----------------------------------------------------------------------------------------------
    Seq Scan on test (cost=0.00..323.89 rows=1 width=4) (actual time=0.13..14.20 rows=1 loops=1)
    Filter: (val = 5)
    Total runtime: 14.26 msec
    (3 rows)

    seth=> explain analyze select accum from test where val = 2147483648;
    QUERY PLAN
    --------------------------------------------------------------------------------------------------------------
    Index Scan using test_pkey on test (cost=0.00..5.99 rows=1 width=4) (actual time=0.11..0.11 rows=0 loops=1)
    Index Cond: (val = 2147483648::bigint)
    Total runtime: 0.16 msec
    (3 rows)

    seth=> explain analyze update test set accum = accum + 53 where val = '5';
    QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------
    Index Scan using test_pkey on test (cost=0.00..5.99 rows=1 width=18) (actual time=0.24..0.24 rows=1 loops=1)
    Index Cond: (val = 5::bigint)
    Total runtime: 0.39 msec
    (3 rows)

    seth=> explain analyze insert into test values (6, 53);
    QUERY PLAN
    ----------------------------------------------------------------------------------
    Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=1)
    Total runtime: 0.08 msec
    (2 rows)

    seth=> explain analyze insert into test values (2147483647, 53);
    QUERY PLAN
    ----------------------------------------------------------------------------------
    Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=1)
    Total runtime: 0.33 msec
    (2 rows)
    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

    -Seth Robertson
  • Tom Lane at Jan 21, 2003 at 10:33 pm

    Seth Robertson writes:
    > I'll try that and report back later, but I was under the (false?)
    > impression that it was primarily important when you had multiple
    > database connections using the same table.

    Definitely false. shared_buffers needs to be 1000 or so for
    production-grade performance. There are varying schools of thought
    about whether it's useful to raise it even higher, but in any case
    64 is just a toy-installation setting.

    > seth=> explain analyze select accum from test where val = 5;
    > QUERY PLAN
    > -----------------------------------------------------------------------------------------------
    > Seq Scan on test (cost=0.00..323.89 rows=1 width=4) (actual time=0.13..14.20 rows=1 loops=1)
    > Filter: (val = 5)
    > Total runtime: 14.26 msec
    > (3 rows)
    >
    > seth=> explain analyze update test set accum = accum + 53 where val = '5';
    > QUERY PLAN
    > ---------------------------------------------------------------------------------------------------------------
    > Index Scan using test_pkey on test (cost=0.00..5.99 rows=1 width=18) (actual time=0.24..0.24 rows=1 loops=1)
    > Index Cond: (val = 5::bigint)
    > Total runtime: 0.39 msec
    > (3 rows)

    The quotes are important when you are dealing with BIGINT indexes.
    You won't get an indexscan if the constant looks like int4 rather than int8.

    regards, tom lane
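
    To make the point concrete, a sketch (not from the original test
    code) of the client-side fix -- keep the quotes, or cast, when
    interpolating the key:

    ----------------------------------------------------------------------
    # On 7.3, an int4-looking constant won't match a BIGINT index:
    #   WHERE val = 5           -- seqscan
    #   WHERE val = '5'         -- untyped literal resolves to int8: indexscan
    #   WHERE val = 5::bigint   -- explicit cast: indexscan
    my $key = 5;
    my $sth = $dbh->prepare("SELECT accum FROM test WHERE val = '$key'");
    $sth->execute;
    ----------------------------------------------------------------------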
  • Brian Hirt at Jan 22, 2003 at 1:44 am
    Tom and others:

    There has been a lot of talk about shared memory size recently, along
    with many conflicting statements from various people. Earlier threads
    said that setting the shared buffer to a high value (like 512MB on a
    2GB dedicated DB server) is not a good idea. A couple of reasons were
    mentioned: a) potential inefficiencies with the kernel and VM system,
    b) modern kernels' aggressive caching with all free memory, and c) the
    shared memory stealing from memory the kernel would use to cache, etc.

    So my question is: if the kernel is caching all this data, what's the
    benefit of setting this to 1000 or higher? Why wouldn't I just set it
    to 0 if I believe my kernel is doing a good job?


    From all the discussion on this topic, it's still not clear to me how
    to calculate what value this should be set at and why. I've read these
    documents and others and have yet to find explanations and
    recommendations that I can use.

    http://www.postgresql.org/docs/momjian/hw_performance.pdf
    http://www.postgresql.org/idocs/index.php?runtime-config.html
    http://www.postgresql.org/idocs/index.php?kernel-resources.html
    http://www.postgresql.org/idocs/index.php?performance-tips.html
    http://www.ca.postgresql.org/docs/momjian/hw_performance/node6.html
    http://www.ca.postgresql.org/docs/momjian/hw_performance/node5.html
    http://www.ca.postgresql.org/docs/faq-english.html#3.6

    This is such a common topic, it would be nice to see a more definitive
    and comprehensive section in the docs for tuning. Google searches for
    "shared_buffers site:www.postgresql.org" and "tuning
    site:www.postgresql.org" come up with little info.

    FYI: I've been running our database, which is mostly read-only, with
    1500 buffers. On the whole, we see very little IO. PostgreSQL performs
    many, many million queries a day, many simple, many complex, though the
    database is relatively small, around 3GB.

    --brian
    On Tuesday, January 21, 2003, at 03:31 PM, Tom Lane wrote:

    > Seth Robertson <pgsql-performance@sysd.com> writes:
    > > I'll try that and report back later, but I was under the (false?)
    > > impression that it was primarily important when you had multiple
    > > database connections using the same table.
    >
    > Definitely false. shared_buffers needs to be 1000 or so for
    > production-grade performance. There are varying schools of thought
    > about whether it's useful to raise it even higher, but in any case
    > 64 is just a toy-installation setting.
    >
    > > seth=> explain analyze select accum from test where val = 5;
    > > QUERY PLAN
    > > -----------------------------------------------------------------------------------------------
    > > Seq Scan on test (cost=0.00..323.89 rows=1 width=4) (actual time=0.13..14.20 rows=1 loops=1)
    > > Filter: (val = 5)
    > > Total runtime: 14.26 msec
    > > (3 rows)
    > >
    > > seth=> explain analyze update test set accum = accum + 53 where val = '5';
    > > QUERY PLAN
    > > ---------------------------------------------------------------------------------------------------------------
    > > Index Scan using test_pkey on test (cost=0.00..5.99 rows=1 width=18) (actual time=0.24..0.24 rows=1 loops=1)
    > > Index Cond: (val = 5::bigint)
    > > Total runtime: 0.39 msec
    > > (3 rows)
    >
    > The quotes are important when you are dealing with BIGINT indexes.
    > You won't get an indexscan if the constant looks like int4 rather than
    > int8.
    >
    > regards, tom lane

  • Andrew Sullivan at Jan 22, 2003 at 12:05 pm

    On Tue, Jan 21, 2003 at 06:44:57PM -0700, Brian Hirt wrote:

    > So my question is: if the kernel is caching all this data, what's the
    > benefit of setting this to 1000 or higher? Why wouldn't I just set it
    > to 0 if I believe my kernel is doing a good job?
    If Postgres tries to fetch a bit of data which is in its own shared
    buffer, it does not even need to make a system call in order to fetch
    it. The data fetch is extremely fast.

    The problem is that managing that shared memory comes at some cost.

    If the data is not in a shared buffer, then Postgres makes exactly
    the same call, no matter what, to the OS kernel, asking for the data
    from disk. It might happen, however, that the kernel will have the
    data in its disk cache. The total cost of the operation,
    therefore, is much lower in case the data is in the kernel's disk
    cache than in the case where it is actually on the disk. It is
    nevertheless still higher (atomically speaking) than fetching the
    data from Postgres's own shared buffer.

    So the question is this: where is the "sweet spot", where it costs
    little enough for Postgres to manage the shared buffer that the
    reduced cost of a system call is worth it? (As you point out, this
    calculation is complicated by the potential to waste memory by
    caching the data twice -- once in the shared buffer and once in the
    disk cache. Some systems, like Solaris, allow you to turn off the
    disk cache, so the problem may not be one you face.) The trouble is
    that there is no agreement on the answer to that question, and
    precious little evidence which seems to settle the question.

    The only way to determine the best setting, then, is to use your
    system with more-or-less simulated production loads, and see where
    the best setting lies. You have to do this over time, because
    sometimes inefficiencies turn up only after running for a while. In
    an experiment we tried, we used a 2G shared buffer on a 12G machine.
    It looked brilliantly fast at first, but 48 hours later was
    _crawling_; that indicates a problem with shared-buffer management on
    the part of Postgres, I guess, but it was hard to say more than that.
    We ultimately settled on a value somewhere less than 1 G as
    appropriate for our use. But if I had to justify the number I picked
    (as opposed to one a few hundred higher or lower), I'd have a tough
    time.
    > From all the discussion on this topic, it's still not clear to me how
    > to calculate what value this should be set at and why. I've read these
    > documents and others and have yet to find explanations and
    > recommendations that I can use.
    I'm afraid what I'm saying is that it's a bit of a black art. The
    pg_autotune project is an attempt to help make this a little more
    scientific. It relies on pgbench, which has its own problems,
    however.

    Hope that's helpful, but I fear it doesn't give you the answer you'd
    like.

    A

    --
    ----
    Andrew Sullivan 204-4141 Yonge Street
    Liberty RMS Toronto, Ontario Canada
    <andrew@libertyrms.info> M2P 2A8
    +1 416 646 3304 x110
  • Neil Conway at Jan 30, 2003 at 11:18 pm

    On Wed, 2003-01-22 at 07:05, Andrew Sullivan wrote:
    > (As you point out, this calculation is complicated by the potential to
    > waste memory by caching the data twice
    If we had a good buffer replacement algorithm (which we currently do
    not), ISTM that hot pages retained in PostgreSQL's buffer cache would
    never get loaded from the OS's IO cache, thus causing those pages to
    eventually be evicted from the OS's cache. So the "cache the data twice"
    problem doesn't apply in all circumstances.
    > Some systems, like Solaris, allow you to turn off the
    > disk cache, so the problem may not be one you face.)
    I think it would be interesting to investigate disabling the OS' cache
    for all relation I/O (i.e. heap files, index files). That way we could
    both improve performance (by moving all the caching into PostgreSQL's
    domain, where there is more room for optimization), as well as make
    configuration simpler: in an ideal world, it would remove the need to
    consider the OS' caching when configuring the amount of shared memory to
    allocate to PostgreSQL.

    Can this be done using O_DIRECT? If so, is it portable?

    BTW, if anyone has any information on actually *using* O_DIRECT, I'd be
    interested in it. I tried to quickly hack PostgreSQL to use it, without
    success...

    Cheers,

    Neil
    --
    Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
  • Curt Sampson at Jan 31, 2003 at 4:04 am

    On Fri, 30 Jan 2003, Neil Conway wrote:

    > If we had a good buffer replacement algorithm (which we currently do
    > not), ISTM that hot pages retained in PostgreSQL's buffer cache would
    > never get loaded from the OS's IO cache, thus causing those pages to
    > eventually be evicted from the OS's cache. So the "cache the data twice"
    > problem doesn't apply in all circumstances.

    No, but it does apply to every block at some point, since during the
    initial load it's present in both caches, and it has to be flushed from
    the OS's cache at some point.

    > > Some systems, like Solaris, allow you to turn off the
    > > disk cache, so the problem may not be one you face.)
    >
    > I think it would be interesting to investigate disabling the OS' cache
    > for all relation I/O (i.e. heap files, index files). That way we could
    > both improve performance (by moving all the caching into PostgreSQL's
    > domain, where there is more room for optimization)...

    I'm not so sure that there is all that much more room for optimization.
    But take a look at what Solaris and FFS do now, and consider how much
    work it would be to rewrite it, and then see if you even want to do that
    before adding stuff to improve performance.

    > ..., as well as make configuration simpler: in an ideal world, it would
    > remove the need to consider the OS' caching when configuring the
    > amount of shared memory to allocate to PostgreSQL.

    We could do that much more simply by using mmap.

    > Can this be done using O_DIRECT?

    It can, but you're going to lose some of the advantages that you'd get
    from using raw devices instead. In particular, you have no way to know
    the physical location of blocks on the disk, because those locations are
    often different from the location in the file.

    > If so, is it portable?

    O_DIRECT is not all that portable, I don't think. Certainly not as
    portable as mmap.

    cjs
    --
    Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
  • Bruce Momjian at Feb 2, 2003 at 11:00 am

    Curt Sampson wrote:
    > > > Some systems, like Solaris, allow you to turn off the
    > > > disk cache, so the problem may not be one you face.)
    > >
    > > I think it would be interesting to investigate disabling the OS' cache
    > > for all relation I/O (i.e. heap files, index files). That way we could
    > > both improve performance (by moving all the caching into PostgreSQL's
    > > domain, where there is more room for optimization)...
    >
    > I'm not so sure that there is all that much more room for optimization.
    > But take a look at what Solaris and FFS do now, and consider how much
    > work it would be to rewrite it, and then see if you even want to do that
    > before adding stuff to improve performance.

    We need free-behind for large sequential scans, like Solaris has. Do we
    have LRU-2 or LRU-K now?

    > > If so, is it portable?
    >
    > O_DIRECT is not all that portable, I don't think. Certainly not as
    > portable as mmap.

    As I remember, DIRECT doesn't return until the data hits the disk
    (because there is no OS cache), so if you want to write a page so you
    can reuse the buffer, DIRECT would be quite slow.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Neil Conway at Feb 2, 2003 at 5:01 pm

    On Sun, 2003-02-02 at 05:39, Bruce Momjian wrote:
    > We need free-behind for large sequential scans, like Solaris has. Do we
    > have LRU-2 or LRU-K now?

    No.

    > As I remember, DIRECT doesn't return until the data hits the disk
    > (because there is no OS cache), so if you want to write a page so you
    > can reuse the buffer, DIRECT would be quite slow.
    Why? If there is a finite amount of memory for doing buffering, the data
    needs to be written to disk at *some* point, anyway. And if we didn't
    use the OS cache, the size of the PostgreSQL shared buffer would be much
    larger (I'd think 80% or more of the physical RAM in a typical high-end
    machine, for dedicated PostgreSQL usage).

    One possible problem would be the fact that it might mean that writing
    out dirty pages would become part of some key code paths in PostgreSQL
    (rather than assuming that the OS can write out dirty pages in the
    background, as it chooses to). But there are lots of ways to work around
    this, notably by using a daemon to periodically write out some of the
    pages in the background.

    Cheers,

    Neil
    --
    Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
  • Curt Sampson at Feb 2, 2003 at 5:16 pm

    On Mon, 2 Feb 2003, Neil Conway wrote:

    > > As I remember, DIRECT doesn't return until the data hits the disk
    > > (because there is no OS cache), so if you want to write a page so you
    > > can reuse the buffer, DIRECT would be quite slow.
    > ...
    > One possible problem would be the fact that it might mean that writing
    > out dirty pages would become part of some key code paths in PostgreSQL
    > (rather than assuming that the OS can write out dirty pages in the
    > background, as it chooses to). But there are lots of ways to work around
    > this, notably by using a daemon to periodically write out some of the
    > pages in the background.
    If you're doing blocking direct I/O, you really have to have and use
    what I guess I'd call "scatter-scatter I/O": you need to choose a large
    number of blocks scattered over various positions in all your open
    files, and be able to request a write for all of them at once.

    If you write one by one, with each write going to disk before your
    request returns, you're going to be forcing the physical order of the
    writes with no knowledge of where the blocks physically reside on the
    disk, and you stand a snowball's chance in you-know-where of getting
    a write ordering that will maximize your disk throughput.

    This is why systems that use direct I/O, for the most part, use a raw
    partition and their own "filesystem" as well; you need to know the
    physical layout of the blocks to create efficient write strategies.

    (MS SQL Server on Windows NT is a notable exception to this. They do,
    however, make you pre-create the data file in advance, and they suggest
    doing it on an empty partition, which at the very least would get you
    long stretches of the file in contiguous order. They may also be using
    tricks to make sure the file gets created in contiguous order, or they
    may be able to get information from the OS about the physical block
    numbers corresponding to logical block numbers in the file.)

    cjs
    --
    Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
  • Josh Berkus at Feb 2, 2003 at 7:32 pm
    Curt,
    > (MS SQL Server on Windows NT is a notable exception to this. They do,
    > however, make you pre-create the data file in advance, and they suggest
    > doing it on an empty partition, which at the very least would get you
    > long stretches of the file in contiguous order. They may also be using
    > tricks to make sure the file gets created in contiguous order, or they
    > may be able to get information from the OS about the physical block
    > numbers corresponding to logical block numbers in the file.)
    MSSQL is, in fact, doing some kind of direct-block-addressing. If you
    attempt to move a partition on the disk containing MSSQL databases, SQL
    Server will crash on restart and be unrecoverable ... even if the other files
    on that disk are fine. Nor can you back up MSSQL by using disk imaging,
    unless you can recover to an identical model disk/array.

    --
    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Bruce Momjian at Feb 3, 2003 at 12:41 am

    Neil Conway wrote:
    > On Sun, 2003-02-02 at 05:39, Bruce Momjian wrote:
    > > We need free-behind for large sequential scans, like Solaris has. Do we
    > > have LRU-2 or LRU-K now?
    >
    > No.
    >
    > > As I remember, DIRECT doesn't return until the data hits the disk
    > > (because there is no OS cache), so if you want to write a page so you
    > > can reuse the buffer, DIRECT would be quite slow.
    >
    > Why? If there is a finite amount of memory for doing buffering, the data
    > needs to be written to disk at *some* point, anyway. And if we didn't
    > use the OS cache, the size of the PostgreSQL shared buffer would be much
    > larger (I'd think 80% or more of the physical RAM in a typical high-end
    > machine, for dedicated PostgreSQL usage).
    >
    > One possible problem would be the fact that it might mean that writing
    > out dirty pages would become part of some key code paths in PostgreSQL
    > (rather than assuming that the OS can write out dirty pages in the
    > background, as it chooses to). But there are lots of ways to work around
    > this, notably by using a daemon to periodically write out some of the
    > pages in the background.
    Right. This is what we _don't_ want to do. If we need a buffer, we
    need it now. We can't wait for some other process to write the buffer
    directly to disk, nor do we want to group the writes somehow.

    And the other person mentioning that we have to group writes again raises
    the same issue --- we are bypassing the kernel buffers, which know more
    than we do. I can see the advantage of preventing double buffering being
    _quickly_ overtaken by the extra overhead of direct I/O.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Tom Lane at Jan 22, 2003 at 4:21 pm

    Brian Hirt writes:
    > So my question is: if the kernel is caching all this data, what's the
    > benefit of setting this to 1000 or higher? Why wouldn't I just set it
    > to 0 if I believe my kernel is doing a good job?
    Well, setting it to 0 won't work ;-). There's some minimum number of
    buffers needed for PG to work at all; depending on complexity of your
    queries and number of active backends it's probably around 25-100.
    (You'll get "out of buffers" failures if too few.) But more to the
    point, when shared_buffers is too small you'll waste CPU cycles on
    unnecessary data transfers between kernel and user memory. It seems to
    be pretty well established that 64 is too small for most applications.
    I'm not sure how much is enough, but I suspect that a few thousand is
    plenty to get past the knee of the performance curve in most scenarios.

    regards, tom lane
  • Seth Robertson at Jan 22, 2003 at 7:19 am
    In message <13165.1043188295@sss.pgh.pa.us>, Tom Lane writes:

    > Seth Robertson <pgsql-performance@sysd.com> writes:
    > > I'll try that and report back later, but I was under the (false?)
    > > impression that it was primarily important when you had multiple
    > > database connections using the same table.
    >
    > Definitely false. shared_buffers needs to be 1000 or so for
    > production-grade performance. There are varying schools of thought
    > about whether it's useful to raise it even higher, but in any case
    > 64 is just a toy-installation setting.

    Increasing the setting to 4096 improved write performance by 20%.
    Increasing the setting to 8192 had no additional effect. I could try
    a few more probes if anyone cares.

    > The quotes are important when you are dealing with BIGINT indexes.
    > You won't get an indexscan if the constant looks like int4 rather
    > than int8.

    You are not kidding!!!! Changing this increased the search
    performance to 2083 transactions/second. This is 30 times faster than
    before, and 5 times faster than Oracle! Go Tom Lane!!!

    Unfortunately, the update already happened to use the quoting, so
    this tip did not directly help the write case. However, it did
    inspire me to check some other suggestions I have read, since
    obviously there was performance to be had.

    ----------------------------------------------------------------------
    Oracle read performance:                  395
    Original read performance:                 69
    shared_buffers = 4096:                    118
    + quoted where (WHERE val = '5'):        2083
    ----------------------------------------------------------------------

    ----------------------------------------------------------------------
    Oracle write performance:                                      314
    Original write performance:                                     39
    shared_buffers = 4096:                                          47
    + occasional (@ 10K & 60K vectors) vacuum analyze in bg:       121
    + periodic (every 10K vectors) vacuum analyze in bg:           124
    + wal_buffers = 24:                                            125
    + wal_sync_method = fdatasync:                                 127
    + wal_sync_method = open_sync:                                 248
    + wal_sync_method = open_datasync:                   Not Supported
    + fsync = false:                                               793
    ----------------------------------------------------------------------

    Just to round out my report: using the fastest safe combination I was
    able to find (open_sync *is* safe, isn't it?), I reran all seven
    performance tests to see if there was any difference between the
    access methods:

    ----------------------------------------------------------------------
    "normal" C libpq                         256 t/s
    "normal" Perl DBI                        251 t/s
    "DBI Prepared Statement" Perl DBI        254 t/s
    "Batching" Perl DBI                     1149 t/s
    "arrays" Perl DBI                         43 t/s
    "server-side function" Perl DBI           84 t/s
    "server-side trigger" Perl DBI            84 t/s
    "normal" Perl DBI read                  1960 t/s
    "normal" Perl DBI for Oracle             314 t/s
    "normal" Perl DBI read for Oracle        395 t/s
    ----------------------------------------------------------------------

    With a batching update rate of 1149 transactions per second (a 2900%
    improvement), I am willing to call it a day unless anyone else has any
    brilliant ideas. However, it looks like my hope to use arrays is
    doomed; I'm not sure I can handle the performance penalty.
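
    For reference, that "fastest safe combination" corresponds to a
    postgresql.conf along these lines (a sketch using the 7.3 parameter
    names, with the values from the tables above):

    ----------------------------------------------------------------------
    shared_buffers  = 4096        # 32MB of 8K buffers; the default 64 is tiny
    wal_buffers     = 24
    wal_sync_method = open_sync   # the biggest safe win for the write test
    fsync           = true        # fsync = false was ~3x faster, but unsafe
    ----------------------------------------------------------------------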

    -Seth Robertson
  • Seth Robertson at Jan 22, 2003 at 6:15 pm
    Seth Robertson writes:


    > However, it looks like my hope to use arrays is
    > doomed; I'm not sure I can handle the performance penalty.

    Just in case I've gotten the person who implemented arrays annoyed or
    worried: I did not properly modify the "array" test, and was vacuuming
    the wrong table every 10000 vectors during the test. I realized that
    this morning, and the new array results are listed below. I also
    experimented with batching read operations, and I was surprised to find
    that this helps a great deal as well.

    ----------------------------------------------------------------------
    "normal" C libpq                         256 t/s
    "normal" Perl DBI                        251 t/s
    "DBI Prepared Statement" Perl DBI        254 t/s
    "Batching" Perl DBI                     1149 t/s
    "arrays" Perl DBI                        250 t/s (*)
    "arrays with batching" Perl DBI         1020 t/s (*)
    "server-side function" Perl DBI           84 t/s
    "server-side trigger" Perl DBI            84 t/s
    "normal" Perl DBI read                  1960 t/s
    "batched" Perl DBI read                 3076 t/s (*)
    "array" Perl DBI read                   1754 t/s (*)
    "batched array" Perl DBI read           2702 t/s (*)
    "normal" Perl DBI for Oracle             314 t/s
    "normal" Perl DBI read for Oracle        395 t/s
    ----------------------------------------------------------------------
    (*) New/updated from this morning

    This brings the array code to within 11% of the performance of batched
    non-arrays, which is close enough to be an option. I may well be doing
    something wrong with the server-side functions, but I don't see
    anything quite so obviously wrong.
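
    A plausible shape for the batched reader -- an assumption, the real
    code may differ -- is to probe many keys per SELECT with an IN list
    instead of issuing one statement per key:

    ----------------------------------------------------------------------
    # Sketch of a batched reader: look up 1000 keys per round trip.
    my @keys;
    while (<>)
    {
        chomp;
        my ($val) = split;
        push(@keys, "'$val'");    # quoted so the int8 index is used
        if (@keys >= 1000)
        {
            $hit += lookup_batch($dbh, \@keys);
            @keys = ();
        }
    }
    $hit += lookup_batch($dbh, \@keys) if (@keys);

    sub lookup_batch
    {
        my ($dbh, $keys) = @_;
        my $sth = $dbh->prepare("SELECT val FROM test WHERE val IN ("
                                . join(",", @$keys) . ")");
        $sth->execute;
        my $found = 0;
        $found++ while ($sth->fetchrow_arrayref);
        return $found;    # number of keys that were present
    }
    ----------------------------------------------------------------------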

    -Seth Robertson
