Trying to determine the best overall approach for the following
scenario:

Each month our primary table accumulates some 30 million rows (which
could very well hit 60+ million rows per month by year's end). Basically
there will end up being a lot of historical data with little value
beyond archival.

The question then arises as to the best approach, of which I have enumerated
three:

1) Just allow the records to accumulate and maintain constant vacuuming,
etc., allowing for the fact that most queries will only be against a recent
subset of data and should be mostly cached.

2) Each month:
SELECT * INTO 3monthsago_dynamically_named_table FROM bigtable WHERE
targetdate < $3monthsago;
DELETE FROM bigtable where targetdate < $3monthsago;
VACUUM ANALYZE bigtable;
pg_dump 3monthsago_dynamically_named_table for archiving;

3) Each month:
CREATE TABLE newmonth_dynamically_named_table (like mastertable) INHERITS
(mastertable);
modify the copy.sql script to copy newmonth_dynamically_named_table;
pg_dump 3monthsago_dynamically_named_table for archiving;
drop table 3monthsago_dynamically_named_table;
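
As a rough sketch, one month's rotation under option 3 might look like this
(the child table names and the pg_dump target below are hypothetical, and the
pg_dump step runs from the shell rather than psql):

CREATE TABLE mastertable_2005_03 (like mastertable) INHERITS (mastertable);
-- repoint the monthly copy.sql load at mastertable_2005_03
-- archive the child that is now three months old, e.g.:
--   pg_dump -t mastertable_2004_12 mydb > mastertable_2004_12.sql
DROP TABLE mastertable_2004_12;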

Any takes on which approach makes the most sense from a performance and/or
maintenance point of view, and are there other options I may have missed?

Sven Willenberger


  • Tom Lane at Mar 1, 2005 at 1:07 am

    Sven Willenberger writes:
    3) Each month:
    CREATE TABLE newmonth_dynamically_named_table (like mastertable) INHERITS
    (mastertable);
    modify the copy.sql script to copy newmonth_dynamically_named_table;
    pg_dump 3monthsago_dynamically_named_table for archiving;
    drop table 3monthsago_dynamically_named_table;
    A number of people use the above approach. It's got some limitations,
    mainly that the planner isn't super bright about what you are doing
    --- in particular, joins involving such a table may work slowly.

    On the whole I'd probably go with the other approach (one big table).
    A possible win is to use CLUSTER rather than VACUUM ANALYZE to recover
    space after your big deletes; however this assumes that you can schedule
    downtime to do the CLUSTERs in.
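
    A minimal sketch of that suggestion (the index name is hypothetical, and
    note that CLUSTER takes an exclusive lock on the table, hence the need for
    downtime; the 8.0-era syntax puts the index name first):

    DELETE FROM bigtable WHERE targetdate < $3monthsago;
    CLUSTER bigtable_targetdate_idx ON bigtable;  -- rewrites the table compactly
    ANALYZE bigtable;  -- CLUSTER does not refresh planner statistics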

    regards, tom lane
  • John Arbash Meinel at Mar 1, 2005 at 1:41 am

    Sven Willenberger wrote:
    Trying to determine the best overall approach for the following
    scenario:

    Each month our primary table accumulates some 30 million rows (which
    could very well hit 60+ million rows per month by year's end). Basically
    there will end up being a lot of historical data with little value
    beyond archival.


    If this statement is true, then 2 seems the best plan.
    2) Each month:
    SELECT * INTO 3monthsago_dynamically_named_table FROM bigtable WHERE
    targetdate < $3monthsago;
    DELETE FROM bigtable where targetdate < $3monthsago;
    VACUUM ANALYZE bigtable;
    pg_dump 3monthsago_dynamically_named_table for archiving;

    It seems like this method would force the table to stay small, and would
    keep your queries fast. But if you ever actually *need* the old data,
    then you start having problems.

    ...

    I think (3) would tend to force a whole bunch of joins (one for each
    child table), rather than just one join against 3 months of data.
    Any takes on which approach makes most sense from a performance and/or
    maintenance point of view and are there other options I may have missed?

    Sven Willenberger
    If you can get away with it, 2 is the best.

    John
    =:->
  • Sven Willenberger at Mar 1, 2005 at 4:27 pm

    On Tue, 2005-03-01 at 09:48 -0600, John Arbash Meinel wrote:
    Sven Willenberger wrote:
    Trying to determine the best overall approach for the following
    scenario:

    Each month our primary table accumulates some 30 million rows (which
    could very well hit 60+ million rows per month by year's end). Basically
    there will end up being a lot of historical data with little value
    beyond archival.

    The question arises then as the best approach of which I have enumerated
    three:
    I just thought of another possibility. You could create each table
    month-by-month, and then use a view to combine them, and possibly a rule
    to keep things clean.

    So you would do something like:

    I will assume you already have the data in one big table to show the
    easiest way to create the small tables.

    create table tblname_2005_01 as select * from orig_tbl where day >=
    '2005-01-01' and day < '2005-02-01';
    create table tblname_2005_02 as select * from orig_tbl where day >=
    '2005-02-01' and day < '2005-03-01';
    create table tblname_2005_03 as select * from orig_tbl where day >=
    '2005-03-01' and day < '2005-04-01';
    -- create appropriate indices, rules, constraints on these tables

    Then you create a view which includes all of these tables.

    create or replace view tblname as
    select * from tblname_2005_01
    union all select * from tblname_2005_02
    union all select * from tblname_2005_03
    ;

    Then create insert and update rules which determine which table gets the
    new data.

    create rule up_tblname as on update to tblname do instead
    update tblname_2005_03 set
    col1 = NEW.col1,
    col2 = NEW.col2,
    ...
    where id = NEW.id;
    -- This assumes that you have a unique id on your tables. This is just
    -- whatever your primary key is, so it should be a decent assumption.

    create rule ins_tblname as on insert to tblname do instead
    insert into tblname_2005_03 (col1, col2, ...)
    values (new.col1, new.col2, ...);

    Now the downside of this method is that every month you need to create
    a new table and then update the view and the rules. The update rules
    are pretty straightforward, though.
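
    For instance, the April rollover might look roughly like this (a sketch
    only, assuming the naming above and hypothetical columns id, col1, col2;
    CREATE OR REPLACE lets you redefine the view and rules in place):

    create table tblname_2005_04 (like tblname_2005_03);
    create or replace view tblname as
    select * from tblname_2005_01
    union all select * from tblname_2005_02
    union all select * from tblname_2005_03
    union all select * from tblname_2005_04;
    create or replace rule ins_tblname as on insert to tblname do instead
    insert into tblname_2005_04 (id, col1, col2)
    values (new.id, new.col1, new.col2);
    -- the update rule gets redefined the same way, pointing at tblname_2005_04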

    The nice thing is that it keeps your data partitioned, and you don't
    ever have a large select/delete step. You probably will want a small one
    each month to keep the data exactly aligned by month. You don't really
    have to have exact alignments, but as humans, we tend to like that stuff. :)

    Probably this is more overhead than you would like, especially if you
    know that you can get away with method 2 (keep one big table, and just
    remove old rows from it every month).

    This method means that all of your data stays live, while queries with
    appropriate restrictions should stay fast. You also have the ability
    (with v8.0) to move the individual tables onto separate disks.
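
    That last point might look something like this (tablespace name and path
    are hypothetical; the directory must already exist and be writable by the
    postgres user):

    create tablespace archive_space location '/mnt/disk2/pg_archive';
    alter table tblname_2005_01 set tablespace archive_space;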

    One more time, though, if you can get away with removing old data and
    just archiving it, do so. But if you want to keep the data live, there
    are a couple of alternatives.
    Actually that was the thought behind my using inheritance; when querying
    the <bigtable>, it basically does a union all. Also, I think it would be
    quicker to insert directly into the child table (simply by modifying my
    query once a month) rather than incur the overhead of the rule.

    Since the child tables are individual tables, all the benefits you
    cite above still hold.
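
    As a small sketch of that behavior (names are hypothetical, following the
    option 3 naming): a select on the parent automatically scans the children,
    ONLY restricts it to the parent, and inserts can target a child directly:

    select count(*) from mastertable;       -- parent plus all child tables
    select count(*) from only mastertable;  -- parent table alone
    insert into mastertable_2005_03 (id, col1, col2) values (1, 'a', 'b');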

    Thanks for the input on this ... will have to try a couple things to see
    which is most manageable.

    Sven
  • John Arbash Meinel at Mar 1, 2005 at 4:41 pm

    Sven Willenberger wrote:
    On Tue, 2005-03-01 at 09:48 -0600, John Arbash Meinel wrote:

    [snip]
    Actually that was the thought behind my using inheritance; when querying
    the <bigtable>, it basically does a union all. Also, I think it would be
    quicker to insert directly into the child table (simply by modifying my
    query once a month) rather than incur the overhead of the rule.

    Since the child tables are individual tables, all the benefits you
    cite above still hold.

    Thanks for the input on this ... will have to try a couple things to see
    which is most manageable.

    Sven
    You're right, child tables do act like that. I just recall that at least
    at one point, postgres didn't handle indexes with child tables very
    well. That's more just what someone else ran into, so he could have been
    doing something wrong.
    I agree, if child tables end up doing a union all, then it is much
    easier to maintain. A select against the master table should
    automatically get all of the child tables.
    It might just be that you need to create a new index on the child table
    whenever you create it, and then postgres can use that new index to do
    the filtering.
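
    A minimal sketch of that step, since indexes are not inherited and each
    new child needs its own (index and column names are hypothetical):

    create index mastertable_2005_03_targetdate_idx
    on mastertable_2005_03 (targetdate);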

    John
    =:->
  • Gaetano Mendola at Mar 2, 2005 at 12:58 am

    Sven Willenberger wrote:
    Trying to determine the best overall approach for the following
    scenario:

    Each month our primary table accumulates some 30 million rows (which
    could very well hit 60+ million rows per month by year's end). Basically
    there will end up being a lot of historical data with little value
    beyond archival.

    The question arises then as the best approach of which I have enumerated
    three:

    1) Just allow the records to accumulate and maintain constant vacuuming,
    etc allowing for the fact that most queries will only be from a recent
    subset of data and should be mostly cached.

    2) Each month:
    SELECT * INTO 3monthsago_dynamically_named_table FROM bigtable WHERE
    targetdate < $3monthsago;
    DELETE FROM bigtable where targetdate < $3monthsago;
    VACUUM ANALYZE bigtable;
    pg_dump 3monthsago_dynamically_named_table for archiving;

    In my experience, copying and deleting 60+ million rows in a single
    transaction is not feasible, at least on my 1 GB RAM, 2-way CPU box.
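
    One possible workaround, sketched here with illustrative dates and batch
    boundaries, is to split the delete into smaller transactions, for example
    one day of data at a time, vacuuming as you go:

    delete from bigtable
    where targetdate >= '2004-11-01' and targetdate < '2004-11-02';
    vacuum bigtable;
    -- repeat for each day (or week) that is older than three months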



    Regards
    Gaetano Mendola
