Dear All,

I have some problems with regexp query performance - common sense tells me
that my queries should run faster than they do.

The table in question has 590K records and its size is 3.5GB. I am
effectively querying a single attribute, "subject", which has an average
size of 2KB, so we are running the query over ~1GB of data. The query looks
more or less like this:

SELECT T.tender_id FROM archive_tender T WHERE
( T.subject !~* '\\mpattern1.*\\M' ) AND ( T.subject ~* '\\mpattern2\\M' OR
[4-5 more similar terms] ) AND T.erased = 0 AND T.rejected = 0
ORDER BY
tender_id DESC
LIMIT
10000;

The planner shows a seq scan on subject, which is expected for a regexp
match.

Now, the query above takes about 60sec to execute; exactly: 70s for the
first run and 60s for the next runs. In my opinion this is too long: it
should take 35 s to read the whole table into RAM (assuming 100 MB/s
transfer - half the benchmarked HDD speed). With 12 GB of RAM the whole
table should be easily buffered at the operating system level. A regexp
match on 1 GB of data takes 1-2 s (I benchmarked it with a simple pcre
test). The system is not in production mode, so there is no additional
database activity (no reads, no updates; effectively the db is read-only).

To summarize: any ideas on how to speed up this query? (Please don't
suggest regexp indexing - in this application it would be too
time-consuming to implement, and besides, as noted above, I think Postgres
should do better here even with a seq scan.)

Server parameters:
RAM: 12 GB
Cores: 8
HDD: SATA; shows 200 MB/s transfer speed
OS: Linux 64bit; Postgres 8.4


Some performance params from postgresql.conf:
max_connections = 16
shared_buffers = 24MB
temp_buffers = 128MB
max_prepared_transactions = 50
work_mem = 128MB
maintenance_work_mem = 1GB
effective_cache_size = 8GB

Database is vacuumed.


Regards,

Greg


  • Tomas Vondra at Aug 10, 2011 at 3:08 pm

    On 10 August 2011, 16:26, Grzegorz Blinowski wrote:
    > Now, the query above takes about 60sec to execute; exactly: 70s for the
    > first run and 60s for the next runs. In my opinion this is too long: it
    > should take 35 s to read the whole table into RAM (assuming 100 MB/s
    > transfer - half the benchmarked HDD speed). With 12 GB of RAM the whole
    > table should be easily buffered at the operating system level.
    And is it really in the page cache? I'm not an expert in this field, but
    I'd guess no. Check if it really gets the data from cache using iostat or
    something like that. Use fincore to see what's really in the cache; it's
    available here:

    http://code.google.com/p/linux-ftools/
    > Some performance params from postgresql.conf:
    > max_connections = 16
    > shared_buffers = 24MB
    Why just 24MB? Have you tried with more memory here, e.g. 256MB or 512MB?
    I'm not suggesting the whole table should fit there (a seq scan uses a
    small ring buffer anyway), but 24MB is just the bare minimum to start the
    DB.
    > Database is vacuumed.
    Just vacuumed or compacted? A plain vacuum just marks dead tuples as
    reusable; it does not compact the database. So if you've made a lot of
    changes and then just run vacuum, the table may still occupy a lot of
    space on disk. How did you determine that the table size is 3.5GB? Is
    that the size of the raw data, or have you used pg_relation_size or
    something else?
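
    For example, to compare the main heap with the total including TOAST and
    indexes:

    SELECT pg_size_pretty(pg_relation_size('archive_tender')) AS heap_size,
           pg_size_pretty(pg_total_relation_size('archive_tender')) AS total_size;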

    Tomas
  • Pasman Pasmański at Aug 10, 2011 at 3:09 pm
    Try to use a single regular expression.
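
    For instance (keeping the hypothetical placeholder names from the
    original query), the OR'ed terms can be collapsed into one alternation,
    so each subject value is scanned once instead of once per term:

    SELECT T.tender_id FROM archive_tender T
    WHERE ( T.subject !~* '\\mpattern1.*\\M' )
      AND ( T.subject ~* '\\m(pattern2|pattern3|pattern4)\\M' )
      AND T.erased = 0 AND T.rejected = 0
    ORDER BY tender_id DESC
    LIMIT 10000;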


    --
    ------------
    pasman
  • Kevin Grittner at Aug 10, 2011 at 3:28 pm

    Grzegorz Blinowski wrote:

    > Some performance params from postgresql.conf:
    Please paste the result of running the query on this page:

    http://wiki.postgresql.org/wiki/Server_Configuration
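
    (For reference, the query on that page is essentially:

    SELECT name, current_setting(name), source
      FROM pg_settings
      WHERE source NOT IN ('default', 'override');

    which lists every setting changed from its default.)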

    For a start, the general advice is usually to start with
    shared_buffers at the lesser of 25% of system RAM or 8 GB, and
    adjust from there based on benchmarks. So you might want to try 4GB
    for that one.

    Just to confirm: are you using two-phase commit? (People sometimes
    mistake the max_prepared_transactions setting for something related
    to prepared statements.)

    I concur with previous advice that using one regular expression
    which matches all of the terms is going to be a lot faster than
    matching each small regular expression separately and then combining
    them.

    -Kevin
  • Grzegorz Blinowski at Aug 10, 2011 at 5:01 pm
    Thanks for all the help so far. I increased the shared_buffers config
    parameter (Postgres didn't accept values higher than the default, so I
    had to raise the system-wide shared memory limit). The current config
    (as suggested by Kevin Grittner) is as follows:

    version | PostgreSQL 8.4.7 on x86_64-redhat-linux-gnu,
    compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50), 64-bit
    autovacuum | off
    client_encoding | LATIN2
    effective_cache_size | 8GB
    lc_collate | en_US.UTF-8
    lc_ctype | en_US.UTF-8
    listen_addresses | *
    log_rotation_age | 1d
    log_rotation_size | 0
    log_truncate_on_rotation | on
    logging_collector | on
    maintenance_work_mem | 1GB
    max_connections | 16
    max_prepared_transactions | 50
    max_stack_depth | 8MB
    port | 5432
    server_encoding | UTF8
    shared_buffers | 1GB
    statement_timeout | 25min
    temp_buffers | 16384
    TimeZone | Europe/Berlin
    work_mem | 128MB


    However, changing shared_buffers didn't help. We also checked system I/O
    stats during the query - and in fact there is almost no I/O (even with
    the suboptimal shared_buffers). So the problem is not disk
    transfer/access but rather the way Postgres handles regexp queries... As
    I have written, it is difficult to rewrite the query syntax (the SQL
    generation in this app is quite complex), but it should be relatively
    easy to at least join all the OR clauses into one regexp; I can try this
    from the psql CLI. I will post an update if anything interesting
    happens...

    Cheers,

    Greg

  • Tomas Vondra at Aug 10, 2011 at 5:15 pm

    On 10.8.2011 19:01, Grzegorz Blinowski wrote:
    > However, changing shared_buffers didn't help. We also checked system
    > I/O stats during the query - and in fact there is almost no I/O (even
    > with the suboptimal shared_buffers). So the problem is not disk
    > transfer/access but rather the way Postgres handles regexp queries...
    Can you post EXPLAIN ANALYZE, preferably using explain.depesz.com?
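
    On 8.4 that is simply (reusing the placeholder pattern from your first
    post):

    EXPLAIN ANALYZE
    SELECT T.tender_id FROM archive_tender T
    WHERE T.subject ~* '\\mpattern2\\M'
      AND T.erased = 0 AND T.rejected = 0
    ORDER BY tender_id DESC
    LIMIT 10000;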

    Tomas
  • Kevin Grittner at Aug 10, 2011 at 5:17 pm

    Grzegorz Blinowski wrote:

    > the problem is not disk transfer/access but rather the way
    > Postgres handles regexp queries.
    As a diagnostic step, could you figure out some non-regexp way to
    select about the same percentage of rows with about the same
    distribution across the table, and compare times? So far I haven't
    seen any real indication that the time is spent in evaluating the
    regular expressions, versus just loading pages from the OS into
    shared buffers and picking out individual tuples and columns from
    the table. For all we know, the time is mostly spent decompressing
    the 2K values. Perhaps you need to save them without compression.
    If they are big enough after compression to be stored out-of-line by
    default, you might want to experiment with having them in-line in
    the tuple.

    http://www.postgresql.org/docs/8.4/interactive/storage-toast.html
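
    A rough sketch of such a test (hypothetical predicates; length() still
    forces the value to be detoasted, so the gap between these two times
    roughly isolates the regexp cost):

    -- baseline: touch subject without a regexp
    SELECT count(*) FROM archive_tender WHERE length(subject) > 100;
    -- regexp version for comparison
    SELECT count(*) FROM archive_tender WHERE subject ~* '\\mpattern2\\M';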

    -Kevin
  • Kevin Grittner at Aug 10, 2011 at 6:10 pm

    "Kevin Grittner" wrote:

    > So far I haven't seen any real indication that the time is spent
    > in evaluating the regular expressions
    Just as a reality check here, I ran some counts against a
    moderately-sized table (half a million rows). Just counting the
    rows unconditionally was about five times as fast as having to pick
    out even a small column for a compare. Taking a substring of a
    bigger (but normally non-TOASTed) value and doing a compare was only
    a little slower. Using a regular expression anchored to the front
    of the string to do the equivalent of the compare to the substring
    took about twice as long as the substring approach. For a
    non-anchored regular expression where it would normally need to scan
    in a bit, it took twice as long as the anchored regular expression.
    These times seem like they might leave some room for improvement,
    but it doesn't seem too outrageous.

    Each test was run three times.

    select count(*) from "Case";
    count
    --------
    527769
    (1 row)

    Time: 47.696 ms
    Time: 47.858 ms
    Time: 47.687 ms

    select count(*) from "Case" where "filingCtofcNo" = '0878';
    count
    --------
    198645
    (1 row)

    Time: 219.233 ms
    Time: 225.410 ms
    Time: 226.723 ms

    select count(*) from "Case"
    where substring("caption" from 1 for 5) = 'State';
    count
    --------
    178142
    (1 row)

    Time: 238.160 ms
    Time: 237.114 ms
    Time: 240.388 ms

    select count(*) from "Case" where "caption" ~ '^State';
    count
    --------
    178142
    (1 row)

    Time: 532.821 ms
    Time: 535.341 ms
    Time: 529.121 ms

    select count(*) from "Case" where "caption" ~ 'Wisconsin';
    count
    --------
    157483
    (1 row)

    Time: 1167.433 ms
    Time: 1172.282 ms
    Time: 1170.562 ms

    -Kevin
  • Grzegorz Blinowski at Aug 11, 2011 at 8:39 am
    A small followup regarding the suggestion to turn off compression - I used:

    ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE EXTERNAL

    to turn off compression; however, I get the impression that nothing
    happened. When exactly does this alteration take effect? Perhaps I
    should reload the entire db from backup to change the storage method?

    Regards,

    greg

  • Kevin Grittner at Aug 11, 2011 at 1:56 pm

    Grzegorz Blinowski wrote:

    > A small followup regarding the suggestion to turn off compression
    > - I used:
    >
    > ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE
    > EXTERNAL
    >
    > to turn off compression; however, I get the impression that nothing
    > happened. When exactly does this alteration take effect? Perhaps I
    > should reload the entire db from backup to change the storage
    > method?
    Yeah, the storage option just affects future storage of values; it
    does not perform a conversion automatically. There are various ways
    you could cause the rows to be re-written so that they use the new
    TOAST policy for the column. One of the simplest would be to do a
    data-only dump of the table, truncate the table, and restore the
    data. If that table is a big enough portion of the database, a
    pg_dump of the whole database might be about as simple.
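
    Another option, as an untested sketch: force a rewrite in place with a
    no-op update. A plain "SET subject = subject" may just copy the old
    TOAST pointer, so append an empty string to force a new datum:

    -- rewrites every row under the current storage policy;
    -- note: this temporarily doubles the table until dead tuples are vacuumed
    UPDATE archive_tender SET subject = subject || '';
    VACUUM ANALYZE archive_tender;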

    -Kevin
  • Grzegorz Blinowski at Aug 12, 2011 at 3:07 pm
    To summarize this thread:

    We have tried most of the suggestions and found two of them effective:

    1) Collapsing the OR expressions in the WHERE clause into one
    '(...)|(...)' regexp resulted in about 60% better search time.
    2) Changing the long attribute's storage to EXTERNAL gave 30% better
    search time (but only on the first search - i.e. before data is cached).

    Surprisingly, changing shared_buffers from 24MB to 1GB had no apparent
    effect.

    Thanks once again for all your help!!!

    Regards,

    Greg

  • Kevin Grittner at Aug 12, 2011 at 3:25 pm

    Grzegorz Blinowski wrote:

    > 2) Changing the long attribute's storage to EXTERNAL gave 30% better
    > search time (but only on the first search - i.e. before data is
    > cached)
    That suggests that all of the following are true:

    (1) The long value was previously being compressed and stored
    in-line.

    (2) It's now being stored uncompressed, out-of-line in the TOAST
    table.

    (3) Following the TOAST pointers on cached tuples isn't
    significantly more or less expensive than decompressing the data.

    (4) The smaller base tuple caused fewer page reads from disk, even
    with the out-of-line storage for the large value.

    The first three aren't surprising; that last one is. Unless there
    is significant bloat of the table, I'm having trouble seeing why
    that first run is cheaper this way. Make sure your vacuum policy is
    aggressive enough; otherwise you will probably see a slow but steady
    deterioration in performance.
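
    For example, a minimal sketch in postgresql.conf (8.4 parameter names;
    the thresholds are only a starting point to tune against your workload):

    autovacuum = on
    autovacuum_vacuum_scale_factor = 0.1
    autovacuum_analyze_scale_factor = 0.05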

    -Kevin
  • Kevin Grittner at Aug 10, 2011 at 5:22 pm

    Grzegorz Blinowski wrote:

    > autovacuum | off
    BTW, that's generally not a good idea -- it leaves you much more
    vulnerable to bloat which could cause performance problems to
    manifest in any number of ways. You might want to calculate your
    heap bloat on this table.
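
    A very rough sketch of that check (assumes reltuples and relpages are
    current from a recent ANALYZE; compare the heap size against what the
    row count and average row width would predict):

    SELECT relname, relpages, reltuples::bigint AS approx_rows,
           pg_size_pretty(pg_relation_size(oid)) AS heap_size
      FROM pg_class
     WHERE relname = 'archive_tender';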

    -Kevin
