--- lnd@hnit.is wrote:
> You can roughly estimate time spent for just scanning the table using
> something like this:
>
> select sum(version) from ... where version is not null
>
> and just
>
> select sum(version) from ...
>
> The results would be interesting to compare.
To answer (I hope) everyone's questions at once:

1) Oracle and postmaster were not running at the same time
2) The queries were run once to cache as much as possible, then run again to
get the timing

3) Distinct vs. no distinct (i.e. sort performance).

select length(version) from vers where version is not null;

Time: 9748.174 ms

select distinct(version) from vers where version is not null;

Time: 67988.972 ms

So about an extra 60 seconds with the distinct on.

Here is the explain analyze output from psql:

# explain analyze select distinct version from vers where version is not null;

                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=117865.77..120574.48 rows=142 width=132) (actual time=63623.428..68269.111 rows=536 loops=1)
   ->  Sort  (cost=117865.77..119220.13 rows=541741 width=132) (actual time=63623.417..66127.641 rows=541741 loops=1)
         Sort Key: "version"
         ->  Seq Scan on vers  (cost=0.00..21367.41 rows=541741 width=132) (actual time=0.218..7214.903 rows=541741 loops=1)
               Filter: ("version" IS NOT NULL)
 Total runtime: 68324.215 ms
(6 rows)

Time: 68326.062 ms


And the non-default .conf parameters:

tcpip_socket = true
max_connections = 100
password_encryption = true
shared_buffers = 2000
sort_mem = 16384
vacuum_mem = 8192
effective_cache_size = 4000
syslog = 2

postgresql version is 7.4.3
compiled with GCC 3.3.2 on sun4u architecture.









  • Pginfo at Jun 18, 2004 at 1:52 pm
    Hi,
    I have a similar problem and found that the cause is the pg sort.
    It is extremely slow for me.

    In my case I was also trying to migrate a db from Oracle to pg.

    To work around this I dynamically set sort_mem to some large value.
    The sort then runs in RAM and is relatively fast.
    You can try this; remember that sort_mem is per sort, not per connection.

    In my migration I found the only advantage for Oracle is its very good sort.

    regards,
    ivan.


  • Richard Huxton at Jun 18, 2004 at 1:56 pm

    Gary Cowell wrote:
    3) Distinct vs. no distinct (i.e. sort performance).

    select length(version) from vers where version is not null;

    Time: 9748.174 ms

    select distinct(version) from vers where version is not null;

    Time: 67988.972 ms

    So about an extra 60 seconds with the distinct on.
    Which is basically the sorting time...
    Here is the explain analyze output from psql:

    # explain analyze select distinct version from vers where version is not null;

                                                                 QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------
     Unique  (cost=117865.77..120574.48 rows=142 width=132) (actual time=63623.428..68269.111 rows=536 loops=1)
       ->  Sort  (cost=117865.77..119220.13 rows=541741 width=132) (actual time=63623.417..66127.641 rows=541741 loops=1)
             Sort Key: "version"
             ->  Seq Scan on vers  (cost=0.00..21367.41 rows=541741 width=132) (actual time=0.218..7214.903 rows=541741 loops=1)
                   Filter: ("version" IS NOT NULL)
     Total runtime: 68324.215 ms
    (6 rows)

    Time: 68326.062 ms
    Yep - the seq-scan takes 7214.903 ms, there's a huge setup time for the
    sort (63623.417) and it's not finished until 66127.641ms have elapsed.
    And the non-default .conf parameters:

    tcpip_socket = true
    max_connections = 100
    password_encryption = true
    shared_buffers = 2000
    sort_mem = 16384
    vacuum_mem = 8192
    effective_cache_size = 4000
    syslog = 2
    Well, I'd probably up vacuum_mem, and check how much RAM is being used
    for disk cache - I'd guess it's more than 32MB (4000 * 8kb).

    You might want to up the shared_buffers, but that's going to depend on
    the load.

    Try increasing sort_mem temporarily, and see if that makes a difference:
    SET sort_mem = 64000;
    EXPLAIN ANALYSE ...
    The only thing I can think is that you're getting disk activity to get a
    sort that slow. I'd be expecting a hash-sort if PG thought it could fit
    the distinct values in memory.
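
    Spelled out against the table in this thread, that test might look like the
    following (a sketch only; Richard's 64000 kB figure is a starting point, not
    a tuned value):

        SET sort_mem = 64000;        -- in kB, session-local, applies to each sort
        EXPLAIN ANALYZE
          SELECT DISTINCT version FROM vers WHERE version IS NOT NULL;
        RESET sort_mem;              -- back to the postgresql.conf setting

    A related workaround, not tried in this thread, is to rewrite the DISTINCT
    as a GROUP BY: 7.4 can use its new hashed aggregation for GROUP BY but not
    for DISTINCT, so with only a few hundred distinct values the planner may be
    able to skip the big sort entirely (worth verifying with EXPLAIN ANALYZE):

        EXPLAIN ANALYZE
          SELECT version FROM vers WHERE version IS NOT NULL GROUP BY version;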

    --
    Richard Huxton
    Archonet Ltd
  • Tom Lane at Jun 18, 2004 at 2:57 pm

    Gary Cowell <gary_cowell@yahoo.co.uk> writes:
    -> Sort (cost=117865.77..119220.13 rows=541741 width=132) (actual time=63623.417..66127.641 rows=541741 loops=1)
    This is clearly where the time is going.
    sort_mem = 16384
    Probably not enough for this problem. The estimated data size is
    upwards of 60 meg (132 bytes * half a mil rows); allowing for per-row
    overhead I suspect that you'd need sort_mem approaching 100 meg for
    a fully-in-memory sort. (Also I'd take the width=132 with a *big*
    grain of salt, unless you have reason to know that it's accurate.)
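
    As a rough back-of-the-envelope check using the plan's own numbers (and
    keeping in mind that width estimate may well be off):

        541,741 rows * 132 bytes  ~ 71.5 MB of raw sort input
        sort_mem = 16384 kB       = 16 MB allowed per sort

    so even before any per-row overhead the input is several times larger than
    sort_mem, which is why the sort spills to disk.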

    The on-disk sorting algorithm that we use is designed to favor minimum
    disk space consumption over speed. It has a fairly nonrandom access
    pattern that can be pretty slow if your disks don't have good seek-time
    specs.

    I don't know whether Oracle's performance advantage is because they're
    not swapping the sort to disk at all, or because they use a different
    on-disk sort method with a more sequential access pattern.

    [... thinks for awhile ...] It seems possible that they may use sort
    code that knows it is performing a DISTINCT operation and discards
    duplicates on sight. Given that there are only 534 distinct values,
    the sort would easily stay in memory if that were happening.

    It would be interesting to compare Oracle and PG times for a straight
    sort of half a million rows, without the DISTINCT part; that would
    give us a clue whether they simply have much better sort technology,
    or whether they have a special optimization for sort+unique.

    regards, tom lane
  • Pginfo at Jun 18, 2004 at 3:15 pm
    Hi,

    Tom Lane wrote:
    It would be interesting to compare Oracle and PG times for a straight
    sort of half a million rows, without the DISTINCT part; that would
    give us a clue whether they simply have much better sort technology,
    or whether they have a special optimization for sort+unique.
    I have tested this situation and found that Oracle is also much faster in
    this case (in some cases 10x) compared to pg.
    For in-memory sorts Oracle is also faster, but the difference is not as big.
    I have Oracle 8 and Oracle 10 installed (and also pg - it is my primary
    platform) and can run some tests.
    I am ready to help in this direction; if you can send any example I will
    run it and post the result.

    regards,
    ivan.

  • Stephen Frost at Jun 18, 2004 at 4:30 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    [... thinks for awhile ...] It seems possible that they may use sort
    code that knows it is performing a DISTINCT operation and discards
    duplicates on sight. Given that there are only 534 distinct values,
    the sort would easily stay in memory if that were happening.
    Could this optimization be added to PostgreSQL? It sounds like a very
    reasonable thing to do. Hopefully there wouldn't be too much complexity
    needed to add it.

    Stephen
  • Tom Lane at Jun 18, 2004 at 5:01 pm

    Stephen Frost writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    [... thinks for awhile ...] It seems possible that they may use sort
    code that knows it is performing a DISTINCT operation and discards
    duplicates on sight. Given that there are only 534 distinct values,
    the sort would easily stay in memory if that were happening.
    Could this optimization be added to PostgreSQL? It sounds like a very
    reasonable thing to do.
    That's what I was wondering about too. But first I'd like to get
    some kind of reading on how effective it would be. If someone can
    demonstrate that Oracle can do sort-and-drop-dups a lot faster than
    it can do a straight sort of the same amount of input data, that
    would be a strong indication that it's worth doing. At this point
    we don't know if that's the source of their win or not.

    regards, tom lane
  • Stephen Frost at Jun 18, 2004 at 5:53 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    Stephen Frost <sfrost@snowman.net> writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    [... thinks for awhile ...] It seems possible that they may use sort
    code that knows it is performing a DISTINCT operation and discards
    duplicates on sight. Given that there are only 534 distinct values,
    the sort would easily stay in memory if that were happening.
    Could this optimization be added to PostgreSQL? It sounds like a very
    reasonable thing to do.
    That's what I was wondering about too. But first I'd like to get
    some kind of reading on how effective it would be. If someone can
    demonstrate that Oracle can do sort-and-drop-dups a lot faster than
    it can do a straight sort of the same amount of input data, that
    would be a strong indication that it's worth doing. At this point
    we don't know if that's the source of their win or not.
    Alright, I did a couple of tests; these are different systems with
    different hardware, but in the end I think the difference is clear:

    tsf=# explain analyze select distinct access_type_id from p_gen_dom_dedicated_swc_access ;
                                                                           QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------------------------------
     Unique  (cost=321591.00..333205.56 rows=16 width=10) (actual time=32891.141..37420.429 rows=16 loops=1)
       ->  Sort  (cost=321591.00..327398.28 rows=2322912 width=10) (actual time=32891.137..35234.810 rows=2322912 loops=1)
             Sort Key: access_type_id
             ->  Seq Scan on p_gen_dom_dedicated_swc_access  (cost=0.00..55492.12 rows=2322912 width=10) (actual time=0.013..3743.470 rows=2322912 loops=1)
     Total runtime: 37587.519 ms
    (5 rows)

    tsf=# explain analyze select access_type_id from p_gen_dom_dedicated_swc_access order by access_type_id;
                                                                        QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------------------------
     Sort  (cost=321591.00..327398.28 rows=2322912 width=10) (actual time=32926.696..35278.847 rows=2322912 loops=1)
       Sort Key: access_type_id
       ->  Seq Scan on p_gen_dom_dedicated_swc_access  (cost=0.00..55492.12 rows=2322912 width=10) (actual time=0.014..3753.443 rows=2322912 loops=1)
     Total runtime: 36737.628 ms
    (4 rows)

    So, about the same from postgres in each case. From Oracle:

    (select access_type_id from p_gen_dom_dedicated_swc_access order by access_type_id)
    sauron:/home/sfrost> time sqlplus mci_vendor/mci @test.sql > /dev/null

    real 3m55.12s
    user 2m25.87s
    sys 0m10.59s

    (select distinct access_type_id from p_gen_dom_dedicated_swc_access)
    sauron:/home/sfrost> time sqlplus mci_vendor/mci @test.sql > /dev/null

    real 0m5.08s
    user 0m3.86s
    sys 0m0.95s

    All the queries were run multiple times, though there wasn't all that
    much difference in the times. Both systems are pretty speedy, but I
    tend to feel the Postgres box is faster in CPU/disk access time, which
    is probably why the Oracle system took 4 minutes to do what the Postgres
    system does in 40 seconds. My only other concern is the Oracle system
    having to do the write I/O while the postgres one doesn't... I don't
    see an obvious way to get around that though, and I'm not sure if it'd
    really make *that* big of a difference.

    Stephen
  • Stephen Frost at Jun 18, 2004 at 6:16 pm

    * Stephen Frost (sfrost@snowman.net) wrote:
    system does in 40 seconds. My only other concern is the Oracle system
    having to do the write I/O while the postgres one doesn't... I don't
    see an obvious way to get around that though, and I'm not sure if it'd
    really make *that* big of a difference.
    Alright, after talking with some people on #postgresql I found that in
    Oracle you can do 'set autotrace traceonly', which removes the I/O
    factor from the Oracle query. Doing this I also discovered that it
    appears Oracle actually uses an index on that field that it knows about
    to derive what the distinct results would be. That probably invalidates
    this test for what we were specifically looking for, but, hey, using the
    index to figure out what the distinct values for the key are isn't
    exactly a bad idea. :)

    Here's the new results:

    (select access_type_id from p_gen_dom_dedicated_swc_access order by access_type_id;)
    -----------------------------------------------------------------------------------
    sauron:/home/sfrost> time sqlplus mci_vendor/mci @test.sql

    SQL*Plus: Release 9.2.0.1.0 - Production on Fri Jun 18 14:10:12 2004

    Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.


    Connected to:
    Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.1.0 - Production


    2322912 rows selected.


    Execution Plan
    ----------------------------------------------------------
    0       SELECT STATEMENT Optimizer=CHOOSE (Cost=11459 Card=1303962 Bytes=16951506)

    1    0    SORT* (ORDER BY) (Cost=11459 Card=1303962 Bytes=16951506)               :Q457001
    2    1      TABLE ACCESS* (FULL) OF 'P_GEN_DOM_DEDICATED_SWC_ACCESS'              :Q457000
                (Cost=1550 Card=1303962 Bytes=16951506)



    1 PARALLEL_TO_SERIAL    SELECT A1.C0 C0 FROM :Q457000 A1 ORDER BY A1.C0

    2 PARALLEL_TO_PARALLEL  SELECT /*+ NO_EXPAND ROWID(A1) */ A1."ACCESS_TYPE_ID" C0 FROM "P_GEN_DOM_DEDICAT



    Statistics
    ----------------------------------------------------------
    32 recursive calls
    1594 db block gets
    64495 consistent gets
    105975 physical reads
    0 redo size
    40109427 bytes sent via SQL*Net to client
    1704111 bytes received via SQL*Net from client
    154862 SQL*Net roundtrips to/from client
    2 sorts (memory)
    4 sorts (disk)
    2322912 rows processed

    Disconnected from Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.1.0 - Production

    real 1m38.55s
    user 0m23.36s
    sys 0m9.61s

    -----------------------------------------------------------------------------------
    (select distinct access_type_id from p_gen_dom_dedicated_swc_access)
    -----------------------------------------------------------------------------------
    sauron:/home/sfrost> time sqlplus mci_vendor/mci @test.sql

    SQL*Plus: Release 9.2.0.1.0 - Production on Fri Jun 18 14:13:54 2004

    Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.


    Connected to:
    Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.1.0 - Production


    16 rows selected.


    Execution Plan
    ----------------------------------------------------------
    0       SELECT STATEMENT Optimizer=CHOOSE (Cost=44874 Card=1303962 Bytes=16951506)

    1    0    SORT (UNIQUE) (Cost=44874 Card=1303962 Bytes=16951506)
    2    1      INDEX (FAST FULL SCAN) OF 'TABLE_8111_DUPLICATE_CHECK' (UNIQUE) (Cost=4 Card=1303962 Bytes=16951506)





    Statistics
    ----------------------------------------------------------
    0 recursive calls
    0 db block gets
    47069 consistent gets
    47067 physical reads
    0 redo size
    841 bytes sent via SQL*Net to client
    662 bytes received via SQL*Net from client
    3 SQL*Net roundtrips to/from client
    1 sorts (memory)
    0 sorts (disk)
    16 rows processed

    Disconnected from Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.1.0 - Production

    real 0m5.36s
    user 0m0.04s
    sys 0m0.07s
    -----------------------------------------------------------------------------------

    Stephen
  • Mischa Sandberg at Jun 22, 2004 at 5:12 am
    Don't know about Oracle, but select-distinct in MSSQL2K will indeed throw
    away duplicates, which chops the CPU time. Very easy to see in the graphic
    query plan, both in terms of CPU and the number of rows retrieved from a
    single-node or nested-loop subtree. Definitely a worthwhile optimization.

    "Tom Lane" <tgl@sss.pgh.pa.us> wrote in message
    news:22514.1087578111@sss.pgh.pa.us...
    Stephen Frost <sfrost@snowman.net> writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    [... thinks for awhile ...] It seems possible that they may use sort
    code that knows it is performing a DISTINCT operation and discards
    duplicates on sight. Given that there are only 534 distinct values,
    the sort would easily stay in memory if that were happening.
    Could this optimization be added to PostgreSQL? It sounds like a very
    reasonable thing to do.
    That's what I was wondering about too. But first I'd like to get
    some kind of reading on how effective it would be. If someone can
    demonstrate that Oracle can do sort-and-drop-dups a lot faster than
    it can do a straight sort of the same amount of input data, that
    would be a strong indication that it's worth doing. At this point
    we don't know if that's the source of their win or not.

    regards, tom lane

    ---------------------------(end of broadcast)---------------------------
    TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
