Greetings,

The array_accum example aggregate in the user documentation works
reasonably on small data sets but doesn't work too hot on large ones.
http://www.postgresql.org/docs/8.1/static/xaggr.html

Normally I wouldn't care particularly much but it turns out that PL/R
uses arrays for quite a bit (eg: histograms and other statistics
functions). I've also heard other complaints about the performance of
arrays, though I'm not sure if those were due to array_accum or
something else.

Long story short, I set out to build a faster array_accum. Much to my
suprise and delight, we already *had* one. accumArrayResult() and
makeArrayResult()/construct_md_array() appear to do a fantastic job.
I've created a couple of 'glue' functions to expose these functions so
they can be used in an aggregate. I'm sure they could be improved
upon and possibly made even smaller than they already are (90 lines
total for both) but I'd like to throw out the idea of including them
in core. The aggregate created with them could also be considered for
inclusion though I'm less concerned with that. I don't expect general
PostgreSQL users would have trouble creating the aggregate- I don't
know that the average user would be able or willing to write the C
functions.

For comparison, the new functions run with:
time psql -c "select aaccum(generate_series) from generate_series(1,1000000);" > /dev/null
4.24s real 0.34s user 0.06s system

Compared to:
time psql -c "select array_accum(generate_series) from generate_series(1,1000000);" > /dev/null
...

Well, it's still running and it's been over an hour.

The main differences, as I see it, are: accumArrayResult() works in
chunks of 64 elements, and uses repalloc(). array_accum uses
array_set() which works on individual elements and uses
palloc()/memcpy(). I appriciate that this is done because for most
cases of array_set() it's not acceptable to modify the input and am
not suggesting that be changed. An alternative might be to modify
array_set() to check if it is in an aggregate and change its behavior
but adding the seperate functions seemed cleaner and much less
intrusive to me.

Please find the functions attached.

Thanks,

Stephen

Search Discussions

  • Tom Lane at Oct 6, 2006 at 9:29 pm

    Stephen Frost writes:
    Long story short, I set out to build a faster array_accum. Much to my
    suprise and delight, we already *had* one. accumArrayResult() and
    makeArrayResult()/construct_md_array() appear to do a fantastic job.
    I've created a couple of 'glue' functions to expose these functions so
    they can be used in an aggregate. I'm sure they could be improved
    upon and possibly made even smaller than they already are (90 lines
    total for both) but I'd like to throw out the idea of including them
    in core. The aggregate created with them could also be considered for
    inclusion though I'm less concerned with that.
    Since you've set up the functions to only be usable inside an aggregate,
    I don't see much of a reason why we wouldn't provide the aggregate too.
    It looks like it should work to have just one polymorphic aggregate
    definition, eg, array_accum(anyelement) returns anyarray.

    As far as coding style goes, you're supposed to use makeArrayResult()
    with accumArrayResult(), not call construct_md_array() directly. And
    copying the ArrayBuildState around like that is just plain bizarre...

    Does the thing work without memory leakage (for a pass-by-ref datatype)
    in a GROUP BY situation?

    regards, tom lane
  • Stephen Frost at Oct 6, 2006 at 10:05 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    Stephen Frost <sfrost@snowman.net> writes:
    Long story short, I set out to build a faster array_accum. Much to my
    suprise and delight, we already *had* one. accumArrayResult() and
    makeArrayResult()/construct_md_array() appear to do a fantastic job.
    I've created a couple of 'glue' functions to expose these functions so
    they can be used in an aggregate. I'm sure they could be improved
    upon and possibly made even smaller than they already are (90 lines
    total for both) but I'd like to throw out the idea of including them
    in core. The aggregate created with them could also be considered for
    inclusion though I'm less concerned with that.
    Since you've set up the functions to only be usable inside an aggregate,
    I don't see much of a reason why we wouldn't provide the aggregate too.
    Sure, I don't see a reason these functions would be useful outside of an
    aggregate in the form they need to be in for the aggregate, either..
    It looks like it should work to have just one polymorphic aggregate
    definition, eg, array_accum(anyelement) returns anyarray.
    I was hoping to do that, but since it's an aggregate the ffunc format is
    pre-defined to require accepting the 'internal state' and nothing else,
    and to return 'anyelement' or 'anyarray' one of the inputs must be an
    'anyelement' or 'anyarray', aiui. That also implied to me that the type
    passed in was expected to be the type passed out, which wouldn't
    necessairly be the case here as the internal state variable is a bytea.
    It might be possible to do away with the internal state variable
    entirely though and make it an anyelement instead, if we can find
    somewhere else to put the pointer to the ArrayBuildState.
    As far as coding style goes, you're supposed to use makeArrayResult()
    with accumArrayResult(), not call construct_md_array() directly. And
    copying the ArrayBuildState around like that is just plain bizarre...
    I had attempted to use makeArrayResult() originally and ran into some
    trouble with the MemoryContextDelete() which is done during it. The
    context used was the AggState context and therefore was deleted
    elsewhere. That might have been a misunderstanding on my part though
    since I recall fixing at least one or two bugs afterwards, so it may be
    possible to go back and change to using makeArrayResult(). That'd
    certainly make the ffunc smaller.

    As for ArrayBuildState, I'm not actually copying the structure around,
    just the pointer is being copied around inside of the state transistion
    variable (which is a bytea). I'm not sure where else to store the
    pointer to the ArrayBuildState though, since multiple could be in play
    at a given time during an aggregation, aiui.
    Does the thing work without memory leakage (for a pass-by-ref datatype)
    in a GROUP BY situation?
    I expected that using the AggState context would handle free'ing the
    memory at the end of the aggregation as I believe that's the context
    used for the state variable itself as well. Honestly, I wasn't entirely
    sure how to create my own context or if that was the correct thing to do
    in this case. I'll see about changing it around to do that if it's
    acceptable to have a context created for each instance of a group-by
    aggregate, and I can figure out how. :)

    I'm not sure about memory leakage during a Sort+GroupAgg.. I don't know
    how that's done currently either, honestly. It seems like memory could
    be free'd during that once the ffunc is called, and an extra memory
    context could do that, is that what you're referring to?

    Thanks!

    Stephen
  • Tom Lane at Oct 7, 2006 at 3:39 am

    Stephen Frost writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    It looks like it should work to have just one polymorphic aggregate
    definition, eg, array_accum(anyelement) returns anyarray.
    I was hoping to do that, but since it's an aggregate the ffunc format is
    pre-defined to require accepting the 'internal state' and nothing else,
    and to return 'anyelement' or 'anyarray' one of the inputs must be an
    'anyelement' or 'anyarray', aiui.
    Hmm ... I hadn't been thinking about what the state type would need to
    be, but certainly "bytea" is a lie given what you're really doing.
    We've run into this same problem in contrib/intagg: sometimes you'd like
    to use a state data structure that isn't any regular SQL datatype, and
    in particular isn't just a single blob of memory. That's a problem from
    nodeAgg's point of view because it expects to be responsible for copying
    the state value from one context to another. Don't have an immediate
    idea for a solution ...

    regards, tom lane
  • Florian G. Pflug at Oct 7, 2006 at 3:24 pm

    Tom Lane wrote:
    Stephen Frost <sfrost@snowman.net> writes:
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    It looks like it should work to have just one polymorphic aggregate
    definition, eg, array_accum(anyelement) returns anyarray.
    I was hoping to do that, but since it's an aggregate the ffunc format is
    pre-defined to require accepting the 'internal state' and nothing else,
    and to return 'anyelement' or 'anyarray' one of the inputs must be an
    'anyelement' or 'anyarray', aiui.
    Hmm ... I hadn't been thinking about what the state type would need to
    be, but certainly "bytea" is a lie given what you're really doing.
    We've run into this same problem in contrib/intagg: sometimes you'd like
    to use a state data structure that isn't any regular SQL datatype, and
    in particular isn't just a single blob of memory. That's a problem from
    nodeAgg's point of view because it expects to be responsible for copying
    the state value from one context to another. Don't have an immediate
    idea for a solution ...
    I used an int8 as state type for an implementation of a kind of
    array_accum_unique aggregate. I know that it's a ugly hack, but
    it has been running on a production machine for months now, without
    a problem...

    The memory is alloc'd from the aggcontext, btw.

    Note that I only use this aggregate in one particular query -
    so there might be problems with my approach that just don't
    manifest in my particular situation. For example, the aggregate
    is used only on a table that is never updated, and it is only
    used in select queries. So there might be problems if the executor
    decides that it has to restart a query...

    greetings, Florian Pflug
  • Stephen Frost at Oct 9, 2006 at 4:55 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    Stephen Frost <sfrost@snowman.net> writes:
    I was hoping to do that, but since it's an aggregate the ffunc format is
    pre-defined to require accepting the 'internal state' and nothing else,
    and to return 'anyelement' or 'anyarray' one of the inputs must be an
    'anyelement' or 'anyarray', aiui.
    Hmm ... I hadn't been thinking about what the state type would need to
    be, but certainly "bytea" is a lie given what you're really doing.
    Indeed. I've updated the functions quite a bit to clean things up,
    including: Added many more comments, removed the unnecessary 'storage*'
    pointer being used, created my own structure for tracking state
    information, created a seperate memory context (tied to the AggContext),
    correctly handle NULL values, and changed the ffunc to use
    makeArrayResult.

    I also tried just tried using polymorphic types for the functions and
    for the aggregate and it appeared to just work:

    create function aaccum_sfunc (anyarray, anyelement) returns anyarray
    language 'C' AS 'aaccum.so', 'aaccum_sfunc'
    ;
    create function aaccum_ffunc (anyarray) returns anyarray language
    'C' AS '/data/sfrost/postgres/arrays/aaccum.so', 'aaccum_ffunc'
    ;
    create aggregate aaccum (
    sfunc = aaccum_sfunc,
    basetype = anyelement,
    stype = anyarray,
    finalfunc = aaccum_ffunc
    );

    select aaccum(generate_series) from generate_series(1,5);
    aaccum
    -------------
    {1,2,3,4,5}
    (1 row)

    (test is a table with one varchar column, abc)
    select aaccum(abc) from test;
    aaccum
    ---------
    {a,b,c}
    (1 row)

    (Added a column called 'hi', set to 'a', added b,b and c,b)
    select hi,aaccum(abc) from test group by hi;
    hi | aaccum
    ----+---------
    b | {b,c}
    a | {a,b,c}
    (2 rows)

    It makes some sense that it would work as an 'anyarray' is just a
    variable-length type internally and so long as nothing else attempts to
    make sense out of our 'fake array' everything should be fine.

    The latest version also appears to be a bit faster than the prior
    version. I'm going to be running a very large query shortly using
    this aaccum and will report back how it goes. Please let me know if
    there are any other improvments or changes I should make. I'd like to
    submit this to -patches w/ the appropriate entries to have it be
    included in the core distribution. Is it acceptable to reuse the
    'array_accum' name even though it was used in the documentation as an
    example? I'm thinking yes, but wanted to check.

    Thanks!

    Stephen
  • Stephen Frost at Oct 10, 2006 at 12:14 am

    * Stephen Frost (sfrost@snowman.net) wrote:
    I'm going to be running a very large query shortly using
    this aaccum and will report back how it goes.
    It went *very* well, actually much better than I had originally
    expected. This query used to take over 12 hours to complete (about 11
    of which was after the main table involved was sorted). With this new
    aaccum in place the whole query only took about an hour, most of which
    was the sort and join required by the query. The aggregation (aaccum)
    and r_hist() (R histogram function generating PNGs) took only a few
    minutes.

    Thanks!

    Stephen
  • Merlin Moncure at Oct 10, 2006 at 5:48 am

    On 10/10/06, Stephen Frost wrote:
    * Stephen Frost (sfrost@snowman.net) wrote:
    I'm going to be running a very large query shortly using
    this aaccum and will report back how it goes.
    It went *very* well, actually much better than I had originally
    expected. This query used to take over 12 hours to complete (about 11
    of which was after the main table involved was sorted). With this new
    aaccum in place the whole query only took about an hour, most of which
    was the sort and join required by the query. The aggregation (aaccum)
    and r_hist() (R histogram function generating PNGs) took only a few
    minutes.
    very cool, and definately useful imo. i use array_accum all the time,
    and have not always been happy with its performance. if your stuff
    passes muster, would be nice to see it move into core :-).

    merlin
  • Stephen Frost at Oct 6, 2006 at 10:06 pm

    * Stephen Frost (sfrost@snowman.net) wrote:
    For comparison, the new functions run with:
    time psql -c "select aaccum(generate_series) from generate_series(1,1000000);" > /dev/null
    4.24s real 0.34s user 0.06s system

    Compared to:
    time psql -c "select array_accum(generate_series) from generate_series(1,1000000);" > /dev/null
    ...

    Well, it's still running and it's been over an hour.
    Just to follow-up on this, the result was:

    7601.36s real 0.36s user 0.02s system

    Or about 2 hours.

    Enjoy,

    Stephen
  • Tom Lane at Oct 12, 2006 at 6:02 pm

    Stephen Frost writes:
    * Neil Conway (neilc@samurai.com) wrote:
    There is no guarantee why SQL NULL and PG_RETURN_XYZ(NULL) refer to the
    same thing -- use PG_RETURN_NULL() to return a SQL NULL value, or just
    make the function strict.
    Huh, alright. I'll probably just change it to PG_RETURN_NULL().
    Unless the function actually *needs* to be non-strict, you should mark
    it strict and omit the runtime test for null input altogether. This
    is the general way that it's done in existing backend C functions.
    Doing it the other way is needlessly inconsistent (thus distracting
    readers) and clutters the code.

    (However, now that we support nulls in arrays, meseems a more consistent
    definition would be that it allows null inputs and just includes them in
    the output. So probably you do need it non-strict.)

    Personally though I'm much more concerned about the state datatype.
    As-is I think it's not only ugly but probably a security hole. If you
    are declaring the state type as something other than what it really is
    then you have to defend against two sorts of problems: someone being
    able to crash the database by calling your function and passing it
    something it didn't expect, or crashing the database by using your
    function to pass some other function an input it didn't expect. For
    example, since you've got aaccum_sfunc declared to return anyarray when
    it returns no such thing, something like array_out(aaccum_sfunc(...))
    would trivially crash the backend. It's possible that the error check
    to insist on being called with an AggState context is a sufficient
    defense against that, but I feel nervous about it, and would much rather
    have a solution that isn't playing fast and loose with the type system.
    Particularly if it's going to go into core rather than contrib.

    I'm inclined to think that this example demonstrates a deficiency in the
    aggregate-function design: there should be a way to declare what we're
    really doing. But I don't know exactly what that should look like.

    regards, tom lane
  • Stephen Frost at Oct 12, 2006 at 8:29 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    (However, now that we support nulls in arrays, meseems a more consistent
    definition would be that it allows null inputs and just includes them in
    the output. So probably you do need it non-strict.)
    This was my intention.
    I'm inclined to think that this example demonstrates a deficiency in the
    aggregate-function design: there should be a way to declare what we're
    really doing. But I don't know exactly what that should look like.
    I agree and would much rather have a clean solution which works with the
    design than one which has to work outside it. When I first was trying
    to decide on the state-type I was looking through the PG catalogs for
    essentially a "complex C type" which translated to a void*. Perhaps
    such a type could be added. Unless that's considered along the lines of
    an 'any' type it'd cause problems for the polymorphism aspect.

    Another alternative would be to provide a seperate area for each
    aggregate to put any other information it needs. This would almost
    certainly only be available to C functions but would essentially be a
    void* which is provided through the AggState structure but tracked by
    the aggregator routines and reset for each aggregate function being
    run. If that's acceptable, I don't think it'd be all that difficult to
    implement. With that, aaccum_sfunc and aaccum_ffunc would ignore the
    state variable passed to them in favor of their custom structure
    available through fcinfo->AggState (I expect they'd just keep the
    state variable NULL and be marked non-strict, or set it to some constant
    if necessary). The pointer would have to be tracked somewhere and then
    copied in/out on each call, but that doesn't seem too difficult to me.
    After all, the state variable is already being tracked somewhere, this
    would just sit next to it, in my head anyway.

    I've got some time this weekend and would be happy to take a shot at
    the second proposal if that's generally acceptable.

    Thanks,

    Stephen
  • Tom Lane at Oct 12, 2006 at 10:45 pm

    Stephen Frost writes:
    Another alternative would be to provide a seperate area for each
    aggregate to put any other information it needs.
    I'm not convinced that that's necessary --- the cases we have at hand
    suggest that the transition function is perfectly capable of doing the
    storage management it wants. The problem is how to declare to CREATE
    AGGREGATE that we're using a transition function of this kind rather
    than the "stupid" functions it expects. When the function is doing its
    own storage management, we'd really rather that nodeAgg.c stayed out of
    the way and didn't try to do any datum copying at all; having it copy a
    placeholder bytea or anyarray or whatever is really a waste of cycles,
    not to mention obscuring what is going on. If nodeAgg just provided a
    pass-by-value Datum, which the transition function could use to store a
    pointer to storage it's handling, things would be a lot cleaner.

    After a little bit of thought I'm tempted to propose that we handle this
    by inventing a new pseudotype called something like "aggregate_state",
    which'd be declared in the catalogs as pass-by-value, thereby
    suppressing useless copying activity in nodeAgg.c. You'd declare the
    aggregate as having stype = aggregate_state, and the transition function
    would have signature
    sfunc(aggregate_state, ... aggregate-input-type(s) ...)
    returns aggregate_state
    and the final function of course
    ffunc(aggregate_state) returns aggregate-result-type

    aggregate_state would have no other uses in the system, and its input
    and output functions would raise an error, so type safety is assured
    --- there would be no way to call either the sfunc or ffunc "manually",
    except by passing a NULL value, which should be safe because that's what
    they'd expect as the aggregate initial condition.

    One advantage of doing it this way is that the planner could be taught
    to recognize aggregates with stype = aggregate_state specially, and make
    allowance for the fact that they'll use more workspace than meets the
    eye. If we don't have something like this then the planner is likely to
    try to use hash aggregation in scenarios where it'd be absolutely fatal
    to do so. I'm not sure whether we'd want to completely forbid hash
    aggregation when any stype = aggregate_state is present, but for sure we
    want to assume that there's some pretty large amount of per-aggregate
    state we don't know about.

    regards, tom lane
  • Tom Lane at Oct 12, 2006 at 10:59 pm

    I wrote:
    aggregate_state would have no other uses in the system, and its input
    and output functions would raise an error, so type safety is assured
    --- there would be no way to call either the sfunc or ffunc "manually",
    except by passing a NULL value, which should be safe because that's what
    they'd expect as the aggregate initial condition.
    Um, no, I take that back, unless you want to invent a separate
    pseudotype for each such aggregate. Otherwise you can crash it with

    my_ffunc(your_sfunc(null, whatever))

    because my_ffunc will be expecting a datastructure different from what
    it gets.

    Maybe having a check for AggState call context is enough of a defense for
    that, but I'm not really satisfied. Back to the drawing board ...

    regards, tom lane
  • Martijn van Oosterhout at Oct 13, 2006 at 8:06 am

    On Thu, Oct 12, 2006 at 06:58:52PM -0400, Tom Lane wrote:
    I wrote:
    aggregate_state would have no other uses in the system, and its input
    and output functions would raise an error, so type safety is assured
    --- there would be no way to call either the sfunc or ffunc "manually",
    except by passing a NULL value, which should be safe because that's what
    they'd expect as the aggregate initial condition.
    Um, no, I take that back, unless you want to invent a separate
    pseudotype for each such aggregate. Otherwise you can crash it with
    <snip>

    What this really calls for is a type that users are forbidden to
    interact with directly. Basically, the type may only be used by C
    functions and such C functions may not appear in an SQL query.

    Seems the only way to safely deal with datums you don't know the
    representation of.

    The name "internal" would be nice, but it's taken :(

    Have a nice day,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    From each according to his ability. To each according to his ability to litigate.
  • Tom Lane at Oct 13, 2006 at 1:56 pm

    Martijn van Oosterhout writes:
    What this really calls for is a type that users are forbidden to
    interact with directly. Basically, the type may only be used by C
    functions and such C functions may not appear in an SQL query.
    That's not really the flavor of solution I'd like to have. Ideally,
    it'd actually *work* to write

    my_ffunc(my_sfunc(my_sfunc(null, 1), 2))

    and get the same result as aggregating over the values 1 and 2. The
    trick is to make sure that my_sfunc and my_ffunc can only be used
    together. Maybe we do need a type for each such aggregate ...

    In any case, "internal" isn't quite the right model, because with that,
    the type system is basically disclaiming all knowledge of the data's
    properties. With an "aggregate_state" datatype, the type system would
    be asserting that it's OK to use this type (or more accurately, these
    functions) in an aggregate.

    regards, tom lane
  • Stephen Frost at Oct 13, 2006 at 3:33 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    That's not really the flavor of solution I'd like to have. Ideally,
    it'd actually *work* to write

    my_ffunc(my_sfunc(my_sfunc(null, 1), 2))

    and get the same result as aggregating over the values 1 and 2. The
    trick is to make sure that my_sfunc and my_ffunc can only be used
    together. Maybe we do need a type for each such aggregate ...
    In general I like this idea but there are some complications, the main
    one being where would the memory be allocated? I guess we could just
    always use the QueryContext. The other issue is, in the above scenario
    is it acceptable to modify the result of my_sfunc(null, 1) in the ,2
    call? Normally it's unacceptable to modify your input, aggregates get a
    special exception when called as an aggregate, but in this case you'd
    have to copy the entire custom structure underneath, I'd think.

    As for a type for each such aggregate, that seems reasonable to me,
    honestly. I'd like for the type creation to be reasonably simple
    though, if possible. ie: could we just provide NULLs for the
    input/output functions instead of having to implement functions that
    just return an error? Or perhaps have a polymorphic function which can
    be reused that just returns an error. Additionally, we'd have to be
    able to mark the types as being polymorhpic along the same lines as
    anyelement/anyarray. Hopefully that's already easy, but if not it'd be
    nice if it could be made easy...

    Thanks,

    Stephen
  • Tom Lane at Oct 13, 2006 at 4:02 pm

    Stephen Frost writes:
    That's not really the flavor of solution I'd like to have. Ideally,
    it'd actually *work* to write
    my_ffunc(my_sfunc(my_sfunc(null, 1), 2))
    In general I like this idea but there are some complications, the main
    one being where would the memory be allocated?
    In the agg context if called with that context, else
    CurrentMemoryContext will do fine.
    The other issue is, in the above scenario
    is it acceptable to modify the result of my_sfunc(null, 1) in the ,2
    call?
    Yes, because the only place a nonnull value of the type could have come
    from is a my_sfunc call; since it's a pseudotype, we don't allow it on
    disk. (We might also need a hack to prevent the type from being used as
    a record-type component ... not sure if that comes for free with being a
    pseudotype currently.)
    As for a type for each such aggregate, that seems reasonable to me,
    honestly.
    The ugly part is that we'd still need a way for the planner to recognize
    this class of types.
    Additionally, we'd have to be
    able to mark the types as being polymorhpic along the same lines as
    anyelement/anyarray.
    What for?

    regards, tom lane
  • Stephen Frost at Oct 13, 2006 at 4:44 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    Stephen Frost <sfrost@snowman.net> writes:
    The other issue is, in the above scenario
    is it acceptable to modify the result of my_sfunc(null, 1) in the ,2
    call?
    Yes, because the only place a nonnull value of the type could have come
    from is a my_sfunc call; since it's a pseudotype, we don't allow it on
    disk. (We might also need a hack to prevent the type from being used as
    a record-type component ... not sure if that comes for free with being a
    pseudotype currently.)
    Ah, ok.
    As for a type for each such aggregate, that seems reasonable to me,
    honestly.
    The ugly part is that we'd still need a way for the planner to recognize
    this class of types.
    Hrm, true. I would agree that they would need more memory than most
    aggregates. It seems likely to me that worst offenders are going to be
    of the array_accum type- store all tuples coming in. Therefore, my
    suggestion would be that the memory usage be estimated along those lines
    and allow for hashagg when there's enough memory to keep all the tuples
    (or rather the portion of the tuple going into the aggregate) in memory
    (plus some amount of overhead, maybe 10%). That doesn't help with how
    to get the planner to recognize this class of types though. I don't
    suppose we already have a class framework which the planner uses to
    group types which have similar characteristics?
    Additionally, we'd have to be
    able to mark the types as being polymorhpic along the same lines as
    anyelement/anyarray.
    What for?
    So that the finalfunc can be polymorphic along the lines of my suggested
    aaccum_sfunc(anyarray,anyelement) returns anyarray and
    aaccum_ffunc(anyarray) returns anyarray. If the state type isn't
    polymorphic then PG won't let me create a function from it which returns
    a polymorphic, aiui.

    Thanks,

    Stephen
  • Tom Lane at Oct 13, 2006 at 5:06 pm

    Stephen Frost writes:
    Additionally, we'd have to be
    able to mark the types as being polymorhpic along the same lines as
    anyelement/anyarray.
    What for?
    So that the finalfunc can be polymorphic along the lines of my suggested
    aaccum_sfunc(anyarray,anyelement) returns anyarray and
    aaccum_ffunc(anyarray) returns anyarray.
    Hmm ... there's not any real need for this at the level of the
    aggregate, because we resolve a polymorphic aggregate's output type
    directly from its input type(s), and we've already established that the
    general-purpose agg code doesn't need to be able to infer anything about
    the transition data type. However the function code is going to
    complain if you try to declare "my_sfunc(aggregate_state) returns
    anyarray" because that looks ill-formed ... and indeed that kinda kills
    the idea of being able to call my_sfunc standalone anyway.

    Maybe we need a more radical solution in which the sfunc/ffunc don't
    exist as separately callable entities at all. That would sidestep the
    whole need to have a type-system representation for the state data,
    as well as eliminate worries about whether we've sufficiently plugged
    the security risks of being able to call one of them in an unexpected
    context. Not sure what this should look like in the catalogs though.

    regards, tom lane
  • Bruce Momjian at Feb 14, 2007 at 12:43 am
    What is the status of this feature addition?

    ---------------------------------------------------------------------------

    Stephen Frost wrote:
    -- Start of PGP signed section.
    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    (However, now that we support nulls in arrays, meseems a more consistent
    definition would be that it allows null inputs and just includes them in
    the output. So probably you do need it non-strict.)
    This was my intention.
    I'm inclined to think that this example demonstrates a deficiency in the
    aggregate-function design: there should be a way to declare what we're
    really doing. But I don't know exactly what that should look like.
    I agree and would much rather have a clean solution which works with the
    design than one which has to work outside it. When I first was trying
    to decide on the state-type I was looking through the PG catalogs for
    essentially a "complex C type" which translated to a void*. Perhaps
    such a type could be added. Unless that's considered along the lines of
    an 'any' type it'd cause problems for the polymorphism aspect.

    Another alternative would be to provide a seperate area for each
    aggregate to put any other information it needs. This would almost
    certainly only be available to C functions but would essentially be a
    void* which is provided through the AggState structure but tracked by
    the aggregator routines and reset for each aggregate function being
    run. If that's acceptable, I don't think it'd be all that difficult to
    implement. With that, aaccum_sfunc and aaccum_ffunc would ignore the
    state variable passed to them in favor of their custom structure
    available through fcinfo->AggState (I expect they'd just keep the
    state variable NULL and be marked non-strict, or set it to some constant
    if necessary). The pointer would have to be tracked somewhere and then
    copied in/out on each call, but that doesn't seem too difficult to me.
    After all, the state variable is already being tracked somewhere, this
    would just sit next to it, in my head anyway.

    I've got some time this weekend and would be happy to take a shot at
    the second proposal if that's generally acceptable.

    Thanks,

    Stephen
    -- End of PGP section, PGP failed!

    --
    Bruce Momjian <bruce@momjian.us> http://momjian.us
    EnterpriseDB http://www.enterprisedb.com

    + If your life is a hard drive, Christ can be your backup. +

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedOct 6, '06 at 8:34p
activeFeb 14, '07 at 12:43a
posts20
users6
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2019 Grokbase