On 15 Jan 2011, at 1:41, bubba postgres wrote:
I've been googling, but haven't found a good answer to what I should do if I want to store time series in Postgres.
My current solution is store serialized (compressed) blobs of data.
(So for example store 1 day worth of 1 minute samples (~1440 samples) stored as one row in a bytea. (Plus meta data)
It would be nice if I could use 1 sample per column,(because updating individual columns/samples is clear to me) but postgres doesn't compress the row (which is bad because of high amount of repetitive data.. Easily 10X bigger.
Not an easy problem to solve without knowing more about your data, like what the nature of this repetition is, how repetitive it is, etc. It also much depends on how you intend to use this data later on. Are there gaps in your data ranges? How accurate does your data need to be stored? Etc.
For a decision you're probably the only person who'll be close enough to the data and how it will be used.
That said, assuming your data has significant _sequential_ repetition (That's just how I would phrase it; meaning that a range of values in sequence are the same), it may be an idea to just store differences larger than 0 and the length of a sample of equal values. If such ranges are small, this would just add overhead though.
You could extend that approach by fitting curves to ranges of values with a similar tendency and store those curves instead. Retrieving single values from those is of course a little harder, plus you probably will get some inaccuracy when you can't fit curves exactly.
Hard to say how effective this would be though.
I've been considering a Double array, which would get compressed, but before I start down that path (I suppose I need to make some storedprocs to update individual samples), has anyone built anything like this? Any open source projects I should look at?
This is the safest approach I think. You don't have to rely on any assumptions on how your data behaves over time. It's also quite predictable in terms of database and storage requirements, you won't run into surprises here. Just the usual time-related stuff (DST-changes and such).
One change I'd probably make is to store them like this:
CREATE TABLE sample (
That way you don't need to reserve space for longer gaps. For example, if a day ends "early" because data stopped coming in, you can just store a shorter day (length < 1440 minutes), or if you missed data at the start of the day you can make it start later (once the first sample arrives).
Querying them isn't too tough either, although determining the array index you need based on a timestamp is somewhat tricky I just found out.
Finding the record is pretty easy though, that's just:
WHERE now() BETWEEN start AND start + length * interval '1 minute';
I'd advise putting an index on the latter formula ;)
There is one possible problem I'm seeing with this approach though: Updating values requires the whole row to be rewritten (MVCC).
If your arrays are wide enough, then updating those rows will take longer and longer because the amount of data that needs to be written each time is increasing.
Seeing that your data-samples apparently arrive at 1-minute intervals you'll probably be safe, but if you're going to add significant processing around updating a row, then you could run into a race where the previous version of the record is still being processed.
There are ways around that, but they only complicate matters more and probably hurt performance, so that's probably best left for if it's needed at all.
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.