Hi all,

I'm new to window functions. Recently I ran some simple queries and was
surprised to find that percent_rank is so much slower than rank. Could
anybody tell me why?

The table schema:
test=# \d inventory1
     Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                      List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:
test=# explain analyze select inv_date_sk, inv_item_sk, rank() over (partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank query result:
test=# explain analyze select inv_date_sk, inv_item_sk, percent_rank() over (partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all values of the partition key (inv_date_sk) are
the same, so there is only one window partition. I understand that
percent_rank needs to buffer all the tuples to get the total number of rows,
but why is it so expensive?

I'm using 8.4.4, with work_mem increased to 100MB and all other parameters
left untouched.

Thanks,
Li Jie

  • Tom Lane at Dec 9, 2010 at 9:01 pm

    Jie Li writes:
    I'm new to window functions. Recently I ran some simple queries and was
    surprised to find that percent_rank is so much slower than rank. Could
    anybody tell me why?
    Huh, interesting. I can reproduce this with toy data, such as

    create table inventory1 (inv_date_sk int, inv_item_sk int);
    insert into inventory1 select 1, random()* 100000 from generate_series(1,189000);
    explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;

    The example is *not* particularly slow if you leave work_mem at default.
    But if you bump up work_mem enough so that the WindowAgg's internal
    tuplestore fits into memory, it slows down like crazy. A bit of quality
    time with oprofile shows that all the time is going into this memmove()
    in tuplestore_trim():

    /*
     * Slide the array down and readjust pointers.  This may look pretty
     * stupid, but we expect that there will usually not be very many
     * tuple-pointers to move, so this isn't that expensive; and it keeps a
     * lot of other logic simple.
     *
     * In fact, in the current usage for merge joins, it's demonstrable that
     * there will always be exactly one non-removed tuple; so optimize that
     * case.
     */
    if (nremove + 1 == state->memtupcount)
        state->memtuples[0] = state->memtuples[nremove];
    else
        memmove(state->memtuples, state->memtuples + nremove,
                (state->memtupcount - nremove) * sizeof(void *));

    We're throwing away one tuple at a time as we advance forward through
    the tuplestore, and moving 100000+ tuple pointers each time. Ugh.
    This code was all right when written, because (IIRC) the mergejoin
    case was actually the only caller. But it's not all right for
    WindowAgg's less-predictable usage patterns.
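
    To put a number on it, here is a tiny standalone sketch (not PostgreSQL
    code; the array size just mirrors the 189000-row report above) of that
    access pattern: release one leading slot per advance and memmove() the
    rest down every time.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t  n = 189000;            /* tuples in the single partition */
        void  **memtuples = calloc(n, sizeof(void *));
        size_t  memtupcount = n;
        unsigned long long moved = 0;

        /* advance one row at a time, trimming exactly one slot per step */
        while (memtupcount > 1)
        {
            memmove(memtuples, memtuples + 1,
                    (memtupcount - 1) * sizeof(void *));
            moved += memtupcount - 1;
            memtupcount--;
        }

        /* about n*n/2 pointer copies: ~1.8e10 for n = 189000 */
        printf("pointer moves: %llu\n", moved);
        free(memtuples);
        return 0;
    }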

    I thought for a bit about changing things around so that the first-used
    tuple slot isn't necessarily state->memtuples[0], but just like the
    comment says, that complicates a lot of other logic. And there isn't
    any easy place to reclaim the wasted slots later.

    What seems like the best bet is to put in a heuristic to make
    tuplestore_trim simply not do anything until nremove reaches some
    reasonably large amount, perhaps 10% of the number of stored tuples.
    This wastes up to 10% of the allotted memory, but that seems tolerable.
    We could complicate things a bit more by remembering that so-and-so
    many slots are authorized to be removed, and forcing a trim operation
    to discard them if we find ourselves in memory trouble. I'm not sure
    that extra complication is worthwhile though. Comments?
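
    For illustration, a minimal sketch of that heuristic with made-up names
    (the real tuplestore_trim() works on a Tuplestorestate and carries more
    bookkeeping): remember how many leading slots have been released, and
    only pay for the slide once they reach roughly 10% of the stored tuples.

    #include <string.h>

    static void
    trim_lazily(void **memtuples, int *memtupcount, int *nreleased, int nremove)
    {
        *nreleased += nremove;

        /* memtuples[0 .. *nreleased-1] are logically dead here, but stay
         * untouched (and still fetchable) until the slide actually happens */
        if (*nreleased < *memtupcount / 10)
            return;

        memmove(memtuples, memtuples + *nreleased,
                (*memtupcount - *nreleased) * sizeof(void *));
        *memtupcount -= *nreleased;
        *nreleased = 0;
    }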

    regards, tom lane
  • Tom Lane at Dec 9, 2010 at 10:19 pm

    I wrote:
    We're throwing away one tuple at a time as we advance forward through
    the tuplestore, and moving 100000+ tuple pointers each time. Ugh.
    This code was all right when written, because (IIRC) the mergejoin
    case was actually the only caller. But it's not all right for
    WindowAgg's less-predictable usage patterns.
    I thought for a bit about changing things around so that the first-used
    tuple slot isn't necessarily state->memtuples[0], but just like the
    comment says, that complicates a lot of other logic. And there isn't
    any easy place to reclaim the wasted slots later.
    What seems like the best bet is to put in a heuristic to make
    tuplestore_trim simply not do anything until nremove reaches some
    reasonably large amount, perhaps 10% of the number of stored tuples.
    This wastes up to 10% of the allotted memory, but that seems tolerable.
    On reflection I think just not doing anything isn't a very good idea.
    The problem with that is that a mis-coded caller could try to fetch
    tuples that it had already told the tuplestore could be trimmed away;
    and this would work, most of the time, until you got unlucky and the
    trim operation had actually deleted them. I think it's pretty important
    for bug-catching purposes that the tuplestore enforce that those tuples
    are not available anymore.

    Hence the attached patch, which combines the two ideas by recycling
    tuples immediately but not sliding the pointer array until a reasonable
    amount of movement has occurred. This fixes the complained-of
    performance problem AFAICT.
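
    In rough outline (a simplified sketch rather than the patch itself; the
    struct, field names, and threshold here are illustrative), the combined
    behavior looks like this:

    #include <stdlib.h>
    #include <string.h>

    typedef struct
    {
        void  **memtuples;      /* in-memory tuple pointer array */
        int     memtupdeleted;  /* leading slots freed but not yet slid out */
        int     memtupcount;    /* slots in use, including the dead prefix */
    } SketchStore;

    /* nremove: leading slots no longer needed, counted from the array start */
    static void
    sketch_trim(SketchStore *st, int nremove)
    {
        int     i;

        /* free the trimmed tuples immediately, so a mis-coded caller that
         * tries to fetch one hits a NULL instead of reading stale data */
        for (i = st->memtupdeleted; i < nremove; i++)
        {
            free(st->memtuples[i]);     /* pfree() in the real code */
            st->memtuples[i] = NULL;
        }
        st->memtupdeleted = nremove;

        /* postpone the O(N) slide until the dead prefix is a sizable
         * fraction of the array */
        if (st->memtupdeleted < st->memtupcount / 8)
            return;

        memmove(st->memtuples, st->memtuples + st->memtupdeleted,
                (st->memtupcount - st->memtupdeleted) * sizeof(void *));
        st->memtupcount -= st->memtupdeleted;
        st->memtupdeleted = 0;
    }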

    I'm not sure whether or not to back-patch this into 9.0 and 8.4. The
    code in tuplestore.c hasn't changed at all since 8.4, so there's not
    much risk of cross-version bugs, but if I did miss anything we could
    be shipping a buggy version next week. Thoughts?

    regards, tom lane
  • Kevin Grittner at Dec 9, 2010 at 10:30 pm

    Tom Lane wrote:

    I'm not sure whether or not to back-patch this into 9.0 and 8.4.
    The code in tuplestore.c hasn't changed at all since 8.4, so
    there's not much risk of cross-version bugs, but if I did miss
    anything we could be shipping a buggy version next week.
    Thoughts?
    Is there a performance regression involved, or is it a new feature
    which hasn't performed well on this type of query until your patch?
    If the latter, I'd be inclined to give it some time on HEAD and
    release it in the *following* minor release unless you're *very*
    confident it couldn't break anything.

    It's an uphill battle to convince managers that it's safe to apply
    minor upgrades with minimal testing. It doesn't take too many slips
    for the boulder to roll all the way back to the bottom of that hill.

    -Kevin
  • Tom Lane at Dec 9, 2010 at 10:49 pm

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    Tom Lane wrote:
    I'm not sure whether or not to back-patch this into 9.0 and 8.4.
    The code in tuplestore.c hasn't changed at all since 8.4, so
    there's not much risk of cross-version bugs, but if I did miss
    anything we could be shipping a buggy version next week.
    Thoughts?
    Is there a performance regression involved, or is it a new feature
    which hasn't performed well on this type of query until your patch?
    Well, since window functions didn't exist before 8.4, it's difficult to
    argue that there was a regression. It's certainly a performance bug
    though: nobody would expect that giving a query *more* work_mem would
    cause it to run many times slower.
    If the latter, I'd be inclined to give it some time on HEAD and
    release it in the *following* minor release unless you're *very*
    confident it couldn't break anything.
    Well, I'm reasonably confident in the patch, and it does pass regression
    tests. But I've been wrong before.

    I'm not terribly thrilled with that suggestion though. Do you have
    reason to think that anybody is likely to exercise window functions in
    HEAD, beyond what the regression tests do, in the next couple of months?
    Slipping the application of the patch to back branches by a little bit
    doesn't make a lot of management sense to me. I think either we trust
    it and put it into back branches, or we don't trust it and put it only
    in HEAD, so it goes through a beta cycle before hitting production.

    regards, tom lane
  • Kevin Grittner at Dec 9, 2010 at 11:08 pm

    Tom Lane wrote:

    Do you have reason to think that anybody is likely to exercise
    window functions in HEAD, beyond what the regression tests do, in
    the next couple of months?
    Not specifically, no. From the description (not having read the
    patch) I was somewhat concerned that it might affect something
    outside that narrow use case. If that's not possible, then I'm more
    comfortable putting it in a release that soon after it hits the
    repository.

    It's a judgment call, and you're clearly in the best position to
    make it. You asked for thoughts, so I shared my concerns. :-)

    -Kevin
  • Ron Mayer at Dec 13, 2010 at 9:04 pm

    Tom Lane wrote:
    argue that there was a regression. It's certainly a performance bug
    though: nobody would expect that giving a query *more* work_mem would
    cause it to run many times slower.
    I wouldn't be that surprised - otherwise it'd just be hard-coded to
    something large.

    Especially since earlier in the thread:
    The example is *not* particularly slow if you leave work_mem at default.
    which makes me think it's arguably not quite a bug.
  • Kenneth Marshall at Dec 9, 2010 at 11:15 pm

    On Thu, Dec 09, 2010 at 05:18:57PM -0500, Tom Lane wrote:
    I wrote:
    We're throwing away one tuple at a time as we advance forward through
    the tuplestore, and moving 100000+ tuple pointers each time. Ugh.
    This code was all right when written, because (IIRC) the mergejoin
    case was actually the only caller. But it's not all right for
    WindowAgg's less-predictable usage patterns.
    I thought for a bit about changing things around so that the first-used
    tuple slot isn't necessarily state->memtuples[0], but just like the
    comment says, that complicates a lot of other logic. And there isn't
    any easy place to reclaim the wasted slots later.
    What seems like the best bet is to put in a heuristic to make
    tuplestore_trim simply not do anything until nremove reaches some
    reasonably large amount, perhaps 10% of the number of stored tuples.
    This wastes up to 10% of the allotted memory, but that seems tolerable.
    On reflection I think just not doing anything isn't a very good idea.
    The problem with that is that a mis-coded caller could try to fetch
    tuples that it had already told the tuplestore could be trimmed away;
    and this would work, most of the time, until you got unlucky and the
    trim operation had actually deleted them. I think it's pretty important
    for bug-catching purposes that the tuplestore enforce that those tuples
    are not available anymore.

    Hence the attached patch, which combines the two ideas by recycling
    tuples immediately but not sliding the pointer array until a reasonable
    amount of movement has occurred. This fixes the complained-of
    performance problem AFAICT.

    I'm not sure whether or not to back-patch this into 9.0 and 8.4. The
    code in tuplestore.c hasn't changed at all since 8.4, so there's not
    much risk of cross-version bugs, but if I did miss anything we could
    be shipping a buggy version next week. Thoughts?

    regards, tom lane
    +1 for back patching.

    Ken
  • Hitoshi Harada at Dec 10, 2010 at 5:36 pm

    2010/12/10 Tom Lane <tgl@sss.pgh.pa.us>:
    I wrote:
    We're throwing away one tuple at a time as we advance forward through
    the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
    This code was all right when written, because (IIRC) the mergejoin
    case was actually the only caller.  But it's not all right for
    WindowAgg's less-predictable usage patterns.
    I thought for a bit about changing things around so that the first-used
    tuple slot isn't necessarily state->memtuples[0], but just like the
    comment says, that complicates a lot of other logic.  And there isn't
    any easy place to reclaim the wasted slots later.
    What seems like the best bet is to put in a heuristic to make
    tuplestore_trim simply not do anything until nremove reaches some
    reasonably large amount, perhaps 10% of the number of stored tuples.
    This wastes up to 10% of the allotted memory, but that seems tolerable.
    On reflection I think just not doing anything isn't a very good idea.
    The problem with that is that a mis-coded caller could try to fetch
    tuples that it had already told the tuplestore could be trimmed away;
    and this would work, most of the time, until you got unlucky and the
    trim operation had actually deleted them.  I think it's pretty important
    for bug-catching purposes that the tuplestore enforce that those tuples
    are not available anymore.
    I see it's too late now that you've committed it, but it seems there
    was another way to avoid it by not trimming from percent_rank()
    individually. Once the whole partition fits in memory, you don't
    need to trim it, since it never grows. The trimming logic is for
    something like moving aggregates and (simple) rank(), which grow the
    tuplestore content as they advance. percent_rank() doesn't seem to
    match the optimization.
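
    Concretely, the idea would be for the function simply not to advance its
    mark position, so WindowAgg never has a reason to trim the tuplestore for
    it. A rough sketch against the 8.4 window-function API (windowapi.h),
    with peer/tie handling omitted, so it is only an illustration and not a
    drop-in percent_rank():

    #include "postgres.h"
    #include "fmgr.h"
    #include "windowapi.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(percent_rank_notrim);

    /* Like percent_rank() but ignoring peer groups, and with no
     * WinSetMarkPosition() call, so the tuplestore is never trimmed. */
    Datum
    percent_rank_notrim(PG_FUNCTION_ARGS)
    {
        WindowObject winobj = PG_WINDOW_OBJECT();
        int64        totalrows = WinGetPartitionRowCount(winobj);
        int64        pos = WinGetCurrentPosition(winobj);

        if (totalrows <= 1)
            PG_RETURN_FLOAT8(0.0);
        PG_RETURN_FLOAT8((float8) pos / (float8) (totalrows - 1));
    }

    (It would still need the usual CREATE FUNCTION ... WINDOW declaration to
    be callable.)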

    Regards,

    --
    Hitoshi Harada
  • Tom Lane at Dec 10, 2010 at 5:46 pm

    Hitoshi Harada writes:
    I see it's too late now that you've committed it,
    Patches can always be reverted...
    but it seems there
    was another way to avoid it by not trimming from percent_rank()
    individually. Once the whole partition fits in memory, you don't
    need to trim it, since it never grows. The trimming logic is for
    something like moving aggregates and (simple) rank(), which grow the
    tuplestore content as they advance. percent_rank() doesn't seem to
    match the optimization.
    I don't think this idea leads to a robust solution. When you have a
    combination of different window functions being used in the same scan,
    you can't expect any one of them to know the global situation. Having
    percent_rank lie about its requirements in order to avoid bad behavior
    in the tuplestore infrastructure is just going to create more problems
    down the road. We need to have the individual functions tell the truth
    and then do any optimization hacking in the WindowAgg code or
    infrastructure.

    regards, tom lane
  • Hitoshi Harada at Dec 10, 2010 at 5:54 pm

    2010/12/11 Tom Lane <tgl@sss.pgh.pa.us>:
    Hitoshi Harada <umi.tanuki@gmail.com> writes:
    I see it's too late now that you've committed it,
    Patches can always be reverted...
    but it seems there
    was another way to avoid it by not trimming from percent_rank()
    individually. Once the whole partition fits in memory, you don't
    need to trim it, since it never grows. The trimming logic is for
    something like moving aggregates and (simple) rank(), which grow the
    tuplestore content as they advance. percent_rank() doesn't seem to
    match the optimization.
    I don't think this idea leads to a robust solution.  When you have a
    combination of different window functions being used in the same scan,
    you can't expect any one of them to know the global situation.  Having
    percent_rank lie about its requirements in order to avoid bad behavior
    in the tuplestore infrastructure is just going to create more problems
    down the road.  We need to have the individual functions tell the truth
    and then do any optimization hacking in the WindowAgg code or
    infrastructure.
    Hm? Once percent_rank() scans to the partition end, any other window
    functions that scan row by row don't need to care about the memory
    reduction, do they? Or more generally, if the partition has been
    scanned to the end, we don't need to trim the tuplestore anymore. Am I
    misunderstanding?

    Regards,

    --
    Hitoshi Harada
  • Tom Lane at Dec 10, 2010 at 6:07 pm

    Hitoshi Harada writes:
    Hm? Once percent_rank() scans to the partition end, any other window
    functions that scan row by row don't need to care about the memory
    reduction, do they? Or more generally, if the partition has been
    scanned to the end, we don't need to trim the tuplestore anymore. Am I
    misunderstanding?
    Giving back the memory as we do the scan is still a good thing IMO;
    there might be other uses for it. In any case I don't see where you're
    going to put such a heuristic without breaking potentially interesting
    uses elsewhere. The tuplestore doesn't know anything about partitions
    being read to the end; and WindowAgg doesn't (or shouldn't) know about
    whether the tuplestore is all in memory.

    Furthermore, the performance problem would exist for any situation where
    the window functions had read far beyond the frame start, whether that
    was all the way to partition end or not. Consider a frame like ROWS
    BETWEEN 10000 PRECEDING AND 10000 FOLLOWING.

    In the end this is a local problem inside tuplestore, and kluging its
    callers to work around it is the wrong approach.

    regards, tom lane
