I'd like to propose adding a new GUC to limit the amount of memory a backend
can allocate for its own use. The problem this addresses is that sometimes
one needs to set work_mem fairly high to get good query plans for large joins.
However, some complex queries will then use huge amounts of memory so that
one or a few of them will consume all the memory on the host and run it deep
into swap or trigger the oom killer or worse.

I've attached a patch based on 8.4.1. It works by keeping a track of the
total memory allocated via malloc to AllocBlocks (aset.c). If this is not
shot down/up too badly I will rebase it on CVS and submit it for the next
commit fest.

I would also like to propose a similar limit on temp space use. It is quite
easy for an unintended cartesion product to use hundreds of gigabytes of
scratch space and cause other processes to fail due to lack of disk space.
If this is not objectionable, I'll work on it too.

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

Search Discussions

  • Tom Lane at Oct 1, 2009 at 2:36 pm

    daveg writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.

    regards, tom lane
  • Euler Taveira de Oliveira at Oct 1, 2009 at 3:39 pm

    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    What about plataforms (Windows) that don't have ulimit?


    --
    Euler Taveira de Oliveira
    http://www.timbira.com/
  • Bill Moran at Oct 1, 2009 at 3:46 pm

    In response to Euler Taveira de Oliveira <euler@timbira.com>:

    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    What about plataforms (Windows) that don't have ulimit?
    I have a hard time believing that Windows doesn't have a ulimit equivalent.

    I don't want to degrade this thread into MS-bashing, but if that were the
    case, it would make Windows a pretty crippled OS.
  • Tom Lane at Oct 1, 2009 at 3:47 pm

    Euler Taveira de Oliveira writes:
    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    What about plataforms (Windows) that don't have ulimit?
    Get a real operating system ;-)

    Seriously, the proposed patch introduces overhead into a place that is
    already a known hot spot, in return for not much of anything. It will
    *not* bound backend memory use very accurately, because there is no way
    to track raw malloc() calls. And I think that 99% of users will not
    find it useful.

    regards, tom lane
  • Robert Haas at Oct 1, 2009 at 3:51 pm

    On Thu, Oct 1, 2009 at 11:47 AM, Tom Lane wrote:
    Euler Taveira de Oliveira <euler@timbira.com> writes:
    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    What about plataforms (Windows) that don't have ulimit?
    Get a real operating system ;-)

    Seriously, the proposed patch introduces overhead into a place that is
    already a known hot spot, in return for not much of anything.  It will
    *not* bound backend memory use very accurately, because there is no way
    to track raw malloc() calls.  And I think that 99% of users will not
    find it useful.
    What WOULD be useful is to find a way to provide a way to configure
    work_mem per backend rather than per executor node. But that's a much
    harder problem.

    ...Robert
  • Tom Lane at Oct 1, 2009 at 4:18 pm

    Robert Haas writes:
    What WOULD be useful is to find a way to provide a way to configure
    work_mem per backend rather than per executor node. But that's a much
    harder problem.
    I think it's mostly a planner problem: how do you deal with the fact
    that that would make cost estimates for different sub-problems
    interrelated? The cost of a hash, for instance, depends a lot on how
    much memory you assume it can use.

    regards, tom lane
  • Euler Taveira de Oliveira at Oct 1, 2009 at 4:46 pm

    Tom Lane escreveu:
    Robert Haas <robertmhaas@gmail.com> writes:
    What WOULD be useful is to find a way to provide a way to configure
    work_mem per backend rather than per executor node. But that's a much
    harder problem.
    I think it's mostly a planner problem: how do you deal with the fact
    that that would make cost estimates for different sub-problems
    interrelated? The cost of a hash, for instance, depends a lot on how
    much memory you assume it can use.
    It could introduce some complexity but you could track (subtract) the memory
    usage as you're walking up the tree. Also, you need to decide what to do when
    you have more than one node per level. :( How do you deal with priority in
    this case?


    --
    Euler Taveira de Oliveira
    http://www.timbira.com/
  • Euler Taveira de Oliveira at Oct 1, 2009 at 4:23 pm

    Robert Haas escreveu:
    On Thu, Oct 1, 2009 at 11:47 AM, Tom Lane wrote:
    Euler Taveira de Oliveira <euler@timbira.com> writes:
    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    What about plataforms (Windows) that don't have ulimit?
    Get a real operating system ;-)

    Seriously, the proposed patch introduces overhead into a place that is
    already a known hot spot, in return for not much of anything. It will
    *not* bound backend memory use very accurately, because there is no way
    to track raw malloc() calls. And I think that 99% of users will not
    find it useful.
    What WOULD be useful is to find a way to provide a way to configure
    work_mem per backend rather than per executor node. But that's a much
    harder problem.
    I see. Tough problem is: how do we get per backend memory usage accurately? Is
    it relying on OS specific API the only way?


    --
    Euler Taveira de Oliveira
    http://www.timbira.com/
  • Tom Lane at Oct 1, 2009 at 4:37 pm

    Euler Taveira de Oliveira writes:
    I see. Tough problem is: how do we get per backend memory usage accurately? Is
    it relying on OS specific API the only way?
    Given all the third-party libraries (perl, python, libxml2, yadda yadda)
    that can be in use and won't go through palloc, I think that this would
    have to be done at the OS level to be very meaningful.

    The other problem is the one Robert touched on: what you actually *want*
    is something entirely different, namely for the backend to actively try
    to meet an overall target for its memory usage, rather than having
    queries fail ungracefully when they hit an arbitrary limit that the
    planner didn't even know about.

    regards, tom lane
  • Robert Haas at Oct 2, 2009 at 1:18 am

    On Thu, Oct 1, 2009 at 12:15 PM, Euler Taveira de Oliveira wrote:
    Robert Haas escreveu:
    On Thu, Oct 1, 2009 at 11:47 AM, Tom Lane wrote:
    Euler Taveira de Oliveira <euler@timbira.com> writes:
    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    What about plataforms (Windows) that don't have ulimit?
    Get a real operating system ;-)

    Seriously, the proposed patch introduces overhead into a place that is
    already a known hot spot, in return for not much of anything.  It will
    *not* bound backend memory use very accurately, because there is no way
    to track raw malloc() calls.  And I think that 99% of users will not
    find it useful.
    What WOULD be useful is to find a way to provide a way to configure
    work_mem per backend rather than per executor node.  But that's a much
    harder problem.
    I see. Tough problem is: how do we get per backend memory usage accurately? Is
    it relying on OS specific API the only way?
    As I see it, this is really a planning problem, not an executor
    problem, so measuring ACTUAL memory usage is not really important: the
    problem is taking memory usage into account during planning. The
    difficulty with adjusting work_mem right now is that the correct value
    depends not only on the number of queries that are concurrently
    executing (which isn't a constant) but also on the number of sort/hash
    operations being performed per query (which is also not a constant).
    So if your queries become more complex, a value of work_mem that was
    previously OK may start to cause swapping, which encourages setting
    work_mem conservatively. But setting it conservatively can cause the
    planner to pick plans that save memory at a LARGE performance cost.

    Fixing this isn't simple. Right now, when planning a particular
    joinrel, we only keep track of the best plans for each possible set of
    path keys, regardless of how much or little memory they use. So if we
    do something naive, like just track the total amount of memory that
    each candidate path is forecast to use and avoid letting it go above
    some ceiling, query planning might fail altogether, because the
    lower-level joinrels use as much memory as they want and the higher
    level nodes, which for some reason can't be done without memory, can't
    be planned. Or we might just end up with a badly suboptimal plan,
    because we pick a slightly cheaper plan lower down in the tree that
    uses a LOT more memory over a slightly more expensive one that uses
    much less. Later we'll wish we hadn't, but by that point it's too
    late.

    Another possible angle of attack is to try to give the planner a range
    for work_mem rather than a hard limit. The planner would ordinarily
    construct paths as though the lower end of the range was the limit,
    but for a sufficiently large cost savings it would be willing to adopt
    a path that used more memory. Potentially this willingness could also
    be conditioned on the amount of memory used by the path so far,
    although that has the same problems described above in kind if not in
    degree. I'm not really sure whether something like this can be made
    to work; I'm not sure there's really enough information available when
    constructing paths for any sort of local decision-making to prove
    fruitful.

    The other idea I have is to adopt a strategy where each plan node has
    upper and lower bounds on cost, as I previously suggested here with
    respect to index-only scans.

    http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php

    The idea would basically be to estimate the lower-bound for the cost
    of a sort based on the idea that we'll have the maximum possible
    amount of memory to work with (say, the budget for the whole query)
    and the upper-bound cost based on the idea that we'll have the minimum
    possible amount of memory (zero, or whatever the minimal amount is).
    We can also estimate the most memory we think we can usefully use (for
    example, a hash join with a smaller inner rel doesn't benefit from
    more memory than the amount required to hold the entire hash table in
    memory).

    After we complete the first round of planning, we look at the
    resulting paths and decide which sorts or hashes will get funded with
    how much memory. I'm hand-waving a little bit here, because there may
    be a knapsack problem in here (which is NP-complete), since the cost
    as a function of memory probably has sharp cliffs with not much change
    in between them - certainly for hashing, and I suspect for sorting as
    well, but it might be that in practice N is small enough not to
    matter, or we might be able to find an approximation that is good
    enough that we can live with it. Even if we can get past that hurdle,
    though, there's still all the caveats from the original email,
    principally that it's unclear that the necessary computations can be
    done without blowing planning time out of the water. Plus, if we used
    this strategy for multiple purposes, like position of heap fetch nodes
    and also allocation of work memory, there could be interdependencies
    that would turn the whole thing into a giant mess.

    So to reiterate my first comment: a MUCH harder problem.

    ...Robert
  • Daveg at Oct 1, 2009 at 7:56 pm

    On Thu, Oct 01, 2009 at 11:47:43AM -0400, Tom Lane wrote:
    Euler Taveira de Oliveira <euler@timbira.com> writes:
    Tom Lane escreveu:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    Seriously, the proposed patch introduces overhead into a place that is
    already a known hot spot, in return for not much of anything. It will
    The overhead is simply an integer addition and compare with values that are
    likely already in processor caches. And this only occurs when we actually
    call malloc() to get a new block, not on every palloc. So I suspect it will
    not be noticable. However, I welcome any suggestion on how to test this
    and actually measure the overhead if any. pg_bench? Something else?
    *not* bound backend memory use very accurately, because there is no way
    to track raw malloc() calls. And I think that 99% of users will
    not find it useful.
    The use case that motivated is a client that runs many postgresql instances
    with a mostly batch/large query workload. Some of the queries are code
    generated by an application and can be very complex. A few times a month
    one of these will run through 64GB of memory and oom the host. So it
    seriously hurts production. Setting work_mem low enough to prevent this
    results in poor query performance.

    This client does not use any outside libraries that call malloc() directly.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Daveg at Oct 1, 2009 at 7:40 pm

    On Thu, Oct 01, 2009 at 10:35:55AM -0400, Tom Lane wrote:
    daveg <daveg@sonic.net> writes:
    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use.
    Use ulimit.
    That was my initial thought too. However, ulimit() is documented as superceded
    by setrlimit(). Which has the option RLIMIT_DATA to limit the size of the data
    segment. Perfect!

    Except, RLIMIT_DATA does not appear to work on linux. The call succeeds and
    the new value can even be read back with getrlimit(), but it does not seem
    to do anything to actually limit the memory allocated. I tested this on
    SuSE 11: kernel 2.6.25, and Ubuntu Intrepid: kernel 2.6.28.

    Setting RLIMIT_AS to limit the total address space for a process works as
    expected. However this seems undesireable for postgresql as it can also cause
    stack expansion to fail, which would then force a general restart. Also,
    this limit would interact with the buffercache size setting as it includes
    the shared address space as well.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Alvaro Herrera at Oct 1, 2009 at 2:49 pm

    daveg wrote:

    I'd like to propose adding a new GUC to limit the amount of memory a backend
    can allocate for its own use. The problem this addresses is that sometimes
    one needs to set work_mem fairly high to get good query plans for large joins.
    However, some complex queries will then use huge amounts of memory so that
    one or a few of them will consume all the memory on the host and run it deep
    into swap or trigger the oom killer or worse.
    Oh, BTW, did anyone get interested in adding the bits to disable the OOM
    killer for postmaster on the various Linux initscripts? It needs some
    games with /proc/<pid>/oom_adj and requires root privileges, but I think
    an initscript is in an excellent position to do it.

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    The PostgreSQL Company - Command Prompt, Inc.
  • Tom Lane at Oct 1, 2009 at 5:13 pm

    Alvaro Herrera writes:
    Oh, BTW, did anyone get interested in adding the bits to disable the OOM
    killer for postmaster on the various Linux initscripts? It needs some
    games with /proc/<pid>/oom_adj and requires root privileges, but I think
    an initscript is in an excellent position to do it.
    I was imagining that this would be something for individual distros
    to tackle. It's probably not portable enough to go into the
    contrib/start-scripts examples. On the other hand, it'd make lots
    of sense to have the Fedora or Debian or whatever scripts do this,
    since they know what kernel version they're targeting. (If anyone
    wants to send me the fixes to make Fedora's script do this ...)

    regards, tom lane
  • Stephen Frost at Oct 1, 2009 at 5:16 pm

    * Tom Lane (tgl@sss.pgh.pa.us) wrote:
    I was imagining that this would be something for individual distros
    to tackle. It's probably not portable enough to go into the
    contrib/start-scripts examples. On the other hand, it'd make lots
    of sense to have the Fedora or Debian or whatever scripts do this,
    since they know what kernel version they're targeting. (If anyone
    wants to send me the fixes to make Fedora's script do this ...)
    I'm not exactly keen on Debian init scripts hacking kernel settings.
    Should it hack up the shared memory numbers too? This is not what I
    would consider 'init script' material for specific applications.

    Thanks,

    Stephen
  • Tom Lane at Oct 1, 2009 at 5:42 pm

    Stephen Frost writes:
    I'm not exactly keen on Debian init scripts hacking kernel settings.
    Should it hack up the shared memory numbers too? This is not what I
    would consider 'init script' material for specific applications.
    What was suggested was tweaking the oom_adj setting for the postmaster
    process only, not messing with any system-wide settings. Do you really
    find that unreasonable? The default OOM killer behavior is just about
    as unreasonable as can be :-(

    regards, tom lane
  • Alvaro Herrera at Oct 1, 2009 at 5:20 pm

    Tom Lane wrote:
    Alvaro Herrera <alvherre@commandprompt.com> writes:
    Oh, BTW, did anyone get interested in adding the bits to disable the OOM
    killer for postmaster on the various Linux initscripts? It needs some
    games with /proc/<pid>/oom_adj and requires root privileges, but I think
    an initscript is in an excellent position to do it.
    I was imagining that this would be something for individual distros
    to tackle. It's probably not portable enough to go into the
    contrib/start-scripts examples.
    Hmm? I think it should be just (as root)

    if [ -f /proc/$pid_of_postmaster/oom_adj ]; then
    echo -17 > /proc/$pid_of_postmaster/oom_adj
    fi

    This is supported from 2.6.11 onwards AFAIK. If the kernel is older
    than that, the file would not exist and this would be a noop.

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    The PostgreSQL Company - Command Prompt, Inc.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedOct 1, '09 at 10:34a
activeOct 2, '09 at 1:18a
posts18
users7
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase