I have a reporting query that is taking nearly all of its time in aggregate
functions and I'm trying to figure out how to optimize it. The query takes
approximately 170ms when run with "select *", but when run with all the
aggregate functions the query takes 18 seconds. The slowness comes from our
attempt to find distribution data using selects of the form:

SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

repeated across many different x,y values and fields to build out several
histograms of the data. The main culprit appears to be the CASE statement,
but I'm not sure what to use instead. I'm sure other people have had
similar queries and I was wondering what methods they used to build out data
like this?
Thanks for your help,
Doug

  • Merlin Moncure at Oct 22, 2009 at 12:39 am

    On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:
    I have a reporting query that is taking nearly all of its time in aggregate
    functions and I'm trying to figure out how to optimize it.  The query takes
    approximately 170ms when run with "select *", but when run with all the
    aggregate functions the query takes 18 seconds.  The slowness comes from our
    attempt to find distribution data using selects of the form:

    SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

    repeated across many different x,y values and fields to build out several
    histograms of the data.  The main culprit appears to be the CASE statement,
    but I'm not sure what to use instead.  I'm sure other people have had
    similar queries and I was wondering what methods they used to build out data
    like this?
    have you tried:

    count(*) where field >= x AND field < y;

    ??
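
    That is, one standalone query per bin, along these lines (bigtable and the
    bounds here are illustrative):

    SELECT count(*) FROM bigtable WHERE field >= 0 AND field < 10;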

    merlin
  • Doug Cole at Oct 22, 2009 at 2:21 am

    On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure wrote:
    have you tried:

    count(*) where field >= x AND field < y;

    ??

    merlin
    Unless I'm misunderstanding you, that would require breaking each bin
    into a separate SQL statement, and since I'm trying to calculate more
    than 100 bins across the different fields, any improvement in the
    aggregate functions would be overwhelmed by the cost of the actual
    query, which is about 170ms.
    Thanks,
    Doug
  • Nikolas Everett at Oct 22, 2009 at 2:48 am
    So you've got a query like:
    SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) AS zeroToTen,
           SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) AS tenToTwenty,
           SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) AS twentyToThirty,
           ...
    FROM bigtable


    My guess is this is forcing a whole bunch of if checks and you're getting
    CPU bound. Could you try something like:

    SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) AS zeroToTen,
           SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) AS tenToTwenty,
           SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) AS twentyToThirty,
           ...
    FROM (SELECT field, count(*) AS count FROM bigtable GROUP BY field) AS counts

    which should allow a hash aggregate? You'd do a hash aggregate on the whole
    table, which should be quick, and then summarize your bins.

    This all supposes that you don't want to just query postgres's column
    statistics.
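
    For reference, the planner's own histogram for a column can be read from
    the pg_stats view (table and column names here are illustrative):

    SELECT histogram_bounds
    FROM pg_stats
    WHERE tablename = 'bigtable' AND attname = 'field';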
    On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole wrote:
    Unless I'm misunderstanding you, that would require breaking each bin
    into a separate sql statement and since I'm trying to calculate more
    than 100 bins between the different fields any improvement in the
    aggregate functions would be overwhelmed by the cost of the actual
    query, which is about 170ms.
    Thanks,
    Doug

  • Marc Mamin at Oct 26, 2009 at 9:40 am
    Hello,

    I didn't try it, but the following should be slightly faster:

    COUNT(CASE WHEN field >= x AND field < y THEN true END)
    instead of
    SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
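
    The reason this can help: COUNT() only counts non-NULL values, and a CASE
    with no ELSE branch yields NULL for non-matching rows, so those rows are
    skipped rather than summed as 0. A minimal sketch of this form (bigtable
    and the bounds are illustrative):

    SELECT COUNT(CASE WHEN field >= 0 AND field < 10 THEN true END) AS zero_to_ten,
           COUNT(CASE WHEN field >= 10 AND field < 20 THEN true END) AS ten_to_twenty
    FROM bigtable;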

    HTH,

    Marc Mamin


  • Robert Haas at Oct 22, 2009 at 1:27 pm

    On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole wrote:
    Unless I'm misunderstanding you, that would require breaking each bin
    into a separate sql statement and since I'm trying to calculate more
    than 100 bins between the different fields any improvement in the
    aggregate functions would be overwhelmed by the cost of the actual
    query, which is about 170ms.
    Well, you might be able to use subselects to fetch all the results in
    a single query, but it might still be slow.
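
    A sketch of that shape, with illustrative table and bounds (each scalar
    subselect is still its own pass over the data, so it may not beat the
    CASE approach):

    SELECT
      (SELECT count(*) FROM bigtable WHERE field >= 0 AND field < 10) AS zero_to_ten,
      (SELECT count(*) FROM bigtable WHERE field >= 10 AND field < 20) AS ten_to_twenty;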

    ...Robert
  • David Wilson at Oct 22, 2009 at 2:47 am

    On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:
    repeated across many different x,y values and fields to build out several
    histograms of the data. The main culprit appears to be the CASE statement,
    but I'm not sure what to use instead. I'm sure other people have had
    similar queries and I was wondering what methods they used to build out data
    like this?
    Use GROUP BY with an appropriate division/rounding to create the buckets,
    if they're all the same size.

    select round(field/100) as bucket, count(*) as cnt from foo group by
    round(field/100);
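
    If the bucket label should be the bucket's lower bound, floor division
    works as well (a sketch assuming 10-wide buckets on a numeric column):

    SELECT floor(field / 10.0) * 10 AS bucket_low, count(*) AS cnt
    FROM foo
    GROUP BY 1
    ORDER BY 1;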

    --
    - David T. Wilson
    david.t.wilson@gmail.com
  • Kenneth Marshall at Oct 22, 2009 at 1:22 pm

    On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
    I have a reporting query that is taking nearly all of its time in aggregate
    functions and I'm trying to figure out how to optimize it. The query takes
    approximately 170ms when run with "select *", but when run with all the
    aggregate functions the query takes 18 seconds. The slowness comes from our
    attempt to find distribution data using selects of the form:

    SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

    repeated across many different x,y values and fields to build out several
    histograms of the data. The main culprit appears to be the CASE statement,
    but I'm not sure what to use instead. I'm sure other people have had
    similar queries and I was wondering what methods they used to build out data
    like this?
    Thanks for your help,
    Doug
    Hi Doug,

    Have you tried using the width_bucket() function? Here is a nice
    article describing its use for making histograms:

    http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html
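
    In short, width_bucket(value, low, high, nbuckets) maps a value into one
    of nbuckets equal-width buckets, with values below low landing in bucket
    0 and values at or above high in bucket nbuckets + 1, so a whole
    histogram becomes a single GROUP BY. A sketch with an illustrative range:

    SELECT width_bucket(field, 0, 100, 10) AS bucket, count(*) AS cnt
    FROM bigtable
    GROUP BY bucket
    ORDER BY bucket;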

    Regards,
    Ken
  • Doug Cole at Oct 29, 2009 at 10:25 pm

    On Thu, Oct 22, 2009 at 6:22 AM, Kenneth Marshall wrote:
    Hi Doug,

    Have you tried using the width_bucket() function? Here is a nice
    article describing its use for making histograms:

    http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html

    Regards,
    Ken
    Thanks Ken,
    I ended up going with this approach - it meant I had to break it
    into a lot more queries, one for each histogram, but even with that
    added overhead I cut the time down from 18 seconds to right around 1
    second.
    Doug
  • Scott Carey at Oct 22, 2009 at 9:49 pm

    On 10/21/09 3:51 PM, "Doug Cole" wrote:

    I have a reporting query that is taking nearly all of its time in aggregate
    functions and I'm trying to figure out how to optimize it.  The query takes
    approximately 170ms when run with "select *", but when run with all the
    aggregate functions the query takes 18 seconds.  The slowness comes from our
    attempt to find distribution data using selects of the form:

    SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

    repeated across many different x,y values and fields to build out several
    histograms of the data.  The main culprit appears to be the CASE statement,
    but I'm not sure what to use instead.  I'm sure other people have had similar
    queries and I was wondering what methods they used to build out data like
    this?
    You might be able to do this with plain aggregates: define a function that
    generates your partitions, group by it, then apply aggregate functions to
    the outputs.

    In either case, rather than each result being a column in one result row,
    each result will be its own row.

    Each row would have a column that defines the type of the result (that you
    grouped on), and one with the result value. If each is just a sum, it's
    easy; if there are lots of different calculation types, it would be harder.
    Potentially, you could wrap that in a subselect to pull each out into its
    own column, but that is a bit messy.

    Also, in 8.4 window functions could be helpful. PARTITION BY something that
    represents your buckets perhaps?
    http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html

    This will generally force a sort, but shouldn't be that bad.

    The function used for the group by or partition by could just be a big case
    statement to generate a unique int per bucket, or a truncate/rounding
    function. It just needs to spit out a unique result for each bucket for the
    group or partition.
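
    A sketch of that grouping-function idea, with the case expression written
    inline (the buckets are illustrative; it could equally live in a SQL
    function):

    SELECT CASE WHEN field < 10 THEN 1
                WHEN field < 20 THEN 2
                ELSE 3
           END AS bucket,
           count(*) AS cnt
    FROM bigtable
    GROUP BY 1;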

