I've been googling, but haven't found a good answer to what I should do if I
want to store time series in Postgres.
My current solution is store serialized (compressed) blobs of data.
(So for example store 1 day worth of 1 minute samples (~1440 samples) stored
as one row in a bytea. (Plus meta data)
It would be nice if I could use 1 sample per column,(because updating
individual columns/samples is clear to me) but postgres doesn't compress the
row (which is bad because of high amount of repetitive data.. Easily 10X
bigger.

I've been considering a Double[] array, which would get compressed, but
before I start down that path (I suppose I need to make some storedprocs to
update individual samples), has anyone built anything like this? Any open
source projects I should look at?

Thanks.

Search Discussions

  • Whit Armstrong at Jan 15, 2011 at 12:49 am
    I think you want to look at kdb, onetick, and LIM (those are
    commercial). or potentially mongoDB where you could probably store a
    compressed ts directly in the db if you want.

    If you're not going to store each observation as a row, then why use a
    db at all. why not stick to flat files?

    -Whit


    On Fri, Jan 14, 2011 at 7:41 PM, bubba postgres
    wrote:
    I've been googling, but haven't found a good answer to what I should do if I
    want to store time series in Postgres.
    My current solution is store serialized (compressed) blobs of data.
    (So for example store 1 day worth of 1 minute samples (~1440 samples) stored
    as one row in a bytea. (Plus meta data)
    It would be nice if I could use 1 sample per column,(because updating
    individual columns/samples is clear to me) but postgres doesn't compress the
    row (which is bad because of high amount of repetitive data.. Easily 10X
    bigger.

    I've been considering a Double[] array, which would get compressed, but
    before I start down that path (I suppose I need to make some storedprocs to
    update individual samples), has anyone built anything like this? Any open
    source projects I should look at?

    Thanks.
  • Jasen Betts at Jan 15, 2011 at 2:50 am

    On 2011-01-15, bubba postgres wrote:
    --000e0cd332267ac2e60499d7ceb2
    Content-Type: text/plain; charset=ISO-8859-1

    I've been googling, but haven't found a good answer to what I should do if I
    want to store time series in Postgres.
    It would be nice if I could use 1 sample per column,(because updating
    individual columns/samples is clear to me) but postgres doesn't compress the
    row (which is bad because of high amount of repetitive data.. Easily 10X
    bigger.
    if there are lots of equal readings perhaps you can drop them and only
    record when the reading changes.

    this would probably require one table per source.

    --
    ⚂⚃ 100% natural
  • Alban Hertroys at Jan 15, 2011 at 11:55 am

    On 15 Jan 2011, at 1:41, bubba postgres wrote:

    I've been googling, but haven't found a good answer to what I should do if I want to store time series in Postgres.
    My current solution is store serialized (compressed) blobs of data.
    (So for example store 1 day worth of 1 minute samples (~1440 samples) stored as one row in a bytea. (Plus meta data)
    It would be nice if I could use 1 sample per column,(because updating individual columns/samples is clear to me) but postgres doesn't compress the row (which is bad because of high amount of repetitive data.. Easily 10X bigger.
    Not an easy problem to solve without knowing more about your data, like what the nature of this repetition is, how repetitive it is, etc. It also much depends on how you intend to use this data later on. Are there gaps in your data ranges? How accurate does your data need to be stored? Etc.

    For a decision you're probably the only person who'll be close enough to the data and how it will be used.

    That said, assuming your data has significant _sequential_ repetition (That's just how I would phrase it; meaning that a range of values in sequence are the same), it may be an idea to just store differences larger than 0 and the length of a sample of equal values. If such ranges are small, this would just add overhead though.

    You could extend that approach by fitting curves to ranges of values with a similar tendency and store those curves instead. Retrieving single values from those is of course a little harder, plus you probably will get some inaccuracy when you can't fit curves exactly.

    Hard to say how effective this would be though.
    I've been considering a Double[] array, which would get compressed, but before I start down that path (I suppose I need to make some storedprocs to update individual samples), has anyone built anything like this? Any open source projects I should look at?

    This is the safest approach I think. You don't have to rely on any assumptions on how your data behaves over time. It's also quite predictable in terms of database and storage requirements, you won't run into surprises here. Just the usual time-related stuff (DST-changes and such).

    One change I'd probably make is to store them like this:
    CREATE TABLE sample (
    start timestamp,
    length int,
    values double[]
    );

    That way you don't need to reserve space for longer gaps. For example, if a day ends "early" because data stopped coming in, you can just store a shorter day (length < 1440 minutes), or if you missed data at the start of the day you can make it start later (once the first sample arrives).

    Querying them isn't too tough either, although determining the array index you need based on a timestamp is somewhat tricky I just found out.
    Finding the record is pretty easy though, that's just:

    WHERE now() BETWEEN start AND start + length * interval '1 minute';

    I'd advise putting an index on the latter formula ;)


    There is one possible problem I'm seeing with this approach though: Updating values requires the whole row to be rewritten (MVCC).
    If your arrays are wide enough, then updating those rows will take longer and longer because the amount of data that needs to be written each time is increasing.
    Seeing that your data-samples apparently arrive at 1-minute intervals you'll probably be safe, but if you're going to add significant processing around updating a row, then you could run into a race where the previous version of the record is still being processed.
    There are ways around that, but they only complicate matters more and probably hurt performance, so that's probably best left for if it's needed at all.

    Alban Hertroys

    --
    If you can't see the forest for the trees,
    cut the trees and you'll see there is no forest.


    !DSPAM:737,4d318b0811872055413410!

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-general @
categoriespostgresql
postedJan 15, '11 at 12:41a
activeJan 15, '11 at 11:55a
posts4
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase