New to Postgres; I am prototyping a migration from Oracle to Postgres 9.0.1 on Linux, starting with the data warehouse. The current process is
to load the data from our OLTP system (Oracle), dump it into another Oracle instance for staging and manipulation, then extract it and load it
into Infobright. I am trying to replace the Oracle staging instance with Postgres. Kettle (PDI), a Java ETL tool, drives this process.

I came across a problem I find perplexing. I recreated the dimensional tables from Oracle, and the fields that are integers in Oracle became
integers in Postgres. I was experiencing terrible performance during the load and narrowed it down to a particular dimensional lookup. The table
dim_carrier holds about 80k rows. You can see the actual query issued by Kettle below; basically I am looking up rows by the business key from
our OLTP system. That field is carrier_source_id, and it is indexed as shown below. If I change this field from an integer to a real, I get
about a 70x improvement in the query's performance. The EXPLAIN ANALYZE output is nearly identical, except that the literal 1 is cast to real when
the column is real. In real life this query is bound and parameterized, but I wanted to simplify things a bit here (and don't yet know how to
EXPLAIN ANALYZE a parameterized query). In actual numbers, the same query executed about 25k times takes 7 seconds with the real column and
500 seconds with the integer column.

What gives here? Integer (or serial) seems like a pretty common choice for primary key columns, so what I'm experiencing must be an anomaly.
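(As an aside on the parameterized case: one way to get a plan for a bound query is PREPARE/EXECUTE. A sketch, reusing the query below — the statement name is made up:)

```sql
-- Sketch: plan a parameterized lookup via a prepared statement.
PREPARE carrier_lookup (integer) AS
  SELECT carrier_id, dim_version
    FROM hits_olap.dim_carrier
   WHERE carrier_source_id = $1
     AND now() >= dim_effect_date
     AND now() <  dim_expire_date;

EXPLAIN ANALYZE EXECUTE carrier_lookup(1);
DEALLOCATE carrier_lookup;
```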



             Table "hits_olap.dim_carrier"
      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 carrier_id        | integer                     | not null
 dim_version       | smallint                    |
 dim_effect_date   | timestamp without time zone |
 dim_expire_date   | timestamp without time zone |
 carrier_source_id | integer                     |
 carrier_name      | character varying(30)       |
 carrier_type      | character varying(30)       |
 carrier_scac      | character varying(4)        |
 carrier_currency  | character varying(3)        |
 current_row       | smallint                    | default 0
Indexes:
    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
    "idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE
AND now() < DIM_EXPIRE_DATE;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.076..0.077 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))
 Total runtime: 0.108 ms
(4 rows)

ALTER TABLE
ALTER TABLE
             Table "hits_olap.dim_carrier"
      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 carrier_id        | integer                     | not null
 dim_version       | smallint                    |
 dim_effect_date   | timestamp without time zone |
 dim_expire_date   | timestamp without time zone |
 carrier_source_id | real                        |
 carrier_name      | character varying(30)       |
 carrier_type      | character varying(30)       |
 carrier_scac      | character varying(4)        |
 carrier_currency  | character varying(3)        |
 current_row       | smallint                    | default 0
Indexes:
    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
    "idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE
AND now() < DIM_EXPIRE_DATE;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.068..0.069 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1::real)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))
 Total runtime: 0.097 ms
(4 rows)



Thanks for the help,

Dave Greco


  • Kevin Grittner at Jan 26, 2011 at 9:53 pm

    David Greco wrote:

    If I change this field from an integer to a real, I get about a
    70x increase in performance of the query.
    I wished to simplify things a bit here (and don't yet know how to
    EXPLAIN ANALYZE a parameterized query).
    carrier_source_id | integer |
    runtime: 0.108 ms
    carrier_source_id | real |
    runtime: 0.097 ms
    This doesn't show the problem, so it's hard to guess the cause.
    Perhaps you can do it with a prepared statement?:

    http://www.postgresql.org/docs/9.0/interactive/sql-prepare.html

    Also, plans can be completely different based on the number of rows,
    width of the rows, distribution of values, etc. You may want to
    select against the actual tables where you've seen the problem.

    One tip -- if size permits, try to CLUSTER both tables to avoid any
    bloat issues, and VACUUM ANALYZE the tables to ensure that hint bits
    are set and statistics are up to date before running the tests. Run
    each test several times in a row to see what effect caching has on
    the issue.

    -Kevin
  • Tom Lane at Jan 26, 2011 at 10:12 pm

    David Greco writes:
    Came across a problem I find perplexing. I recreated the dimensional tables in Oracle and the fields that are integers in Oracle became integers
    in Postgres. Was experiencing terrible performance during the load and narrowed down to a particular dimensional lookup problem. The table
    dim_carrier holds about 80k rows. You can see the actual query issued by Kettle below, but basically I am looking up using the business key from
    our OLTP system. This field is carrier_source_id and is indexed as you can see below. If I change this field from an integer to a real, I get
    about a 70x increase in performance of the query.
    That's really, really hard to believe, given that all else is equal ---
    so I'm betting it isn't. I suspect that what is really happening is
    that you're passing non-integral comparison constants in your queries.
    For example, if carrier_id is an integer, then

    SELECT ... WHERE carrier_id = 42

    is indexable, but this isn't:

    SELECT ... WHERE carrier_id = 42.0

    The latter case however *would* be indexable if carrier_id were float.

    The examples you show fail to show any performance difference at all,
    but that's probably because you used quoted literals ('42' not 42),
    which prevents the parser from deciding that a cross-type comparison
    is demanded.

    I believe Oracle handles such things differently, so running into this
    type of issue during an Oracle port isn't too surprising.
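    (A minimal sketch of the two cases Tom describes, reusing the table from above; the plan behavior is what one would expect on 9.0:)

    ```sql
    -- With carrier_id declared integer, an integer literal can use the btree index:
    EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier WHERE carrier_id = 42;

    -- A non-integral literal is taken as numeric; there is no integer = numeric
    -- operator, so the column is compared as numeric and the index is not usable:
    EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier WHERE carrier_id = 42.0;
    ```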
    In real life, this query is actually bound and parameterized,
    In that case, an EXPLAIN using literal constants is next door to useless
    in terms of telling you what will happen in real life. You need to pay
    attention to exactly how the parameterization is done. Again, I'm
    suspecting a wrong datatype indication.

    regards, tom lane
  • David Greco at Jan 27, 2011 at 2:04 pm
    Right you are. Kettle is turning the number(11) field from Oracle into a BigNumber, which is a decimal. If I cast the field to an Integer in Kettle and keep the field an integer in Postgres, I get good performance. I suspect the correct course of action would simply be to make number(11) fields in Oracle numeric(11,0) fields in Postgres.
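    (Another possible workaround, sketched — if the client insists on sending a decimal parameter, cast it back on the server side so the comparison stays int = int; the statement name is made up:)

    ```sql
    -- Accept the decimal the client sends, but compare as integer:
    PREPARE carrier_lookup (numeric) AS
      SELECT carrier_id, dim_version
        FROM hits_olap.dim_carrier
       WHERE carrier_source_id = $1::integer;
    ```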


    -----Original Message-----
    From: Tom Lane
    Sent: Wednesday, January 26, 2011 5:12 PM
    To: David Greco
    Cc: [email protected]
    Subject: Re: [PERFORM] Real vs Int performance

    David Greco <[email protected]> writes:
    Came across a problem I find perplexing. I recreated the dimensional tables in Oracle and the fields that are integers in Oracle became integers
    in Postgres. Was experiencing terrible performance during the load and narrowed down to a particular dimensional lookup problem. The table
    dim_carrier holds about 80k rows. You can see the actual query issued by Kettle below, but basically I am looking up using the business key from
    our OLTP system. This field is carrier_source_id and is indexed as you can see below. If I change this field from an integer to a real, I get
    about a 70x increase in performance of the query.
    That's really, really hard to believe, given that all else is equal ---
    so I'm betting it isn't. I suspect that what is really happening is
    that you're passing non-integral comparison constants in your queries.
    For example, if carrier_id is an integer, then

    SELECT ... WHERE carrier_id = 42

    is indexable, but this isn't:

    SELECT ... WHERE carrier_id = 42.0

    The latter case however *would* be indexable if carrier_id were float.

    The examples you show fail to show any performance difference at all,
    but that's probably because you used quoted literals ('42' not 42),
    which prevents the parser from deciding that a cross-type comparison
    is demanded.

    I believe Oracle handles such things differently, so running into this
    type of issue during an Oracle port isn't too surprising.
    In real life, this query is actually bound and parameterized,
    In that case, an EXPLAIN using literal constants is next door to useless
    in terms of telling you what will happen in real life. You need to pay
    attention to exactly how the parameterization is done. Again, I'm
    suspecting a wrong datatype indication.

    regards, tom lane
  • Tom Lane at Jan 27, 2011 at 2:18 pm

    David Greco writes:
    Right you are. Kettle is turning the number(11) field from Oracle into
    a BigNumber, which is a decimal. If I cast the field into an Integer
    in Kettle and keep the field an integer in Postgres, I get good
    performance. Suspect the correct course of action would simply be to
    make number(11) fields in Oracle numeric(11,0) fields in Postgres.
    Not if you can persuade the client-side code to output integers as
    integers. "numeric" type is orders of magnitude slower than integers.

    regards, tom lane
  • Shaun Thomas at Jan 27, 2011 at 2:30 pm

    On 01/27/2011 08:18 AM, Tom Lane wrote:

    Not if you can persuade the client-side code to output integers as
    integers. "numeric" type is orders of magnitude slower than integers.
    I sadly have to vouch for this. My company converted an old Oracle app
    and they changed all their primary keys (and foreign keys, and random
    larger int fields) to NUMERIC(19)'s. I've convinced them all new stuff
    should be BIGINT if they need that level of coverage, but the damage is
    already done.

    I'm not sure about orders of magnitude on the storage/index side, but my
    tests gave us a 10% boost if just the keys are switched over to INT or
    BIGINT.

    --
    Shaun Thomas
    OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
    312-676-8870
    [email protected]

  • Mladen Gogala at Jan 27, 2011 at 3:11 pm

    On 1/27/2011 9:30 AM, Shaun Thomas wrote:
    I'm not sure about orders of magnitude on the storage/index side, but my
    tests gave us a 10% boost if just the keys are switched over to INT or
    BIGINT.
    Well, it depends on what you're doing. Searching by an integer vs.
    searching by a text string will probably not make much of a difference.
    However, if you are calculating sums or averages, there will be a huge
    difference.
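    (A rough way to see the aggregate cost difference from psql — a sketch, absolute timings will vary:)

    ```sql
    \timing on
    -- Sum a million integers, then the same values as numeric:
    SELECT sum(g)          FROM generate_series(1, 1000000) AS g;
    SELECT sum(g::numeric) FROM generate_series(1, 1000000) AS g;  -- noticeably slower
    ```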

    --
    Mladen Gogala
    Sr. Oracle DBA
    1500 Broadway
    New York, NY 10036
    (212) 329-5251
    www.vmsinfo.com
  • Igor Neyman at Jan 27, 2011 at 4:17 pm

    -----Original Message-----
    From: Tom Lane
    Sent: Wednesday, January 26, 2011 5:12 PM
    To: David Greco
    Cc: [email protected]
    Subject: Re: Real vs Int performance

    David Greco <[email protected]> writes:
    Came across a problem I find perplexing. I recreated the dimensional
    tables in Oracle and the fields that are integers in Oracle became
    integers in Postgres. Was experiencing terrible performance during
    the load and narrowed down to a particular dimensional lookup
    problem.
    .......................................
    In real life, this query is actually bound and parameterized,

    In that case, an EXPLAIN using literal constants is next door
    to useless in terms of telling you what will happen in real
    life. You need to pay attention to exactly how the
    parameterization is done. Again, I'm suspecting a wrong
    datatype indication.

    regards, tom lane
    To see what happens with a parameterized query in "real life" you could
    try the "auto_explain" contrib module.

    Regards,
    Igor Neyman

Discussion Overview
group: pgsql-performance @ postgresql
posted: Jan 26, '11 at 7:47p
active: Jan 27, '11 at 4:17p
posts: 8
users: 6
website: postgresql.org
irc: #postgresql
