Please find attached a first version of a patch to allow additional
"dynamic" shared memory segments; that is, shared memory segments that
are created after server startup, live for a period of time, and are
then destroyed when no longer needed. The main purpose of this patch
is to facilitate parallel query: if we've got multiple backends
working on the same query, they're going to need a way to communicate.
  Doing that through the main shared memory segment seems infeasible
because we could, for some applications, need to share very large
amounts of data. For example, for internal sort, we basically load
the data to be sorted into memory and then rearrange an array of
pointers to the items being sorted. For parallel internal sort, we
might want to do much the same thing, but with different backend
processes manipulating different parts of the array. I'm not exactly
sure how that's going to work out yet in detail, but it seems fair to
say that the amount of data we want to share between processes there
could be quite a bit larger than anything we'd feel comfortable
nailing down in the permanent shared memory segment. Other cases,
like parallel sequential scan, might require much smaller buffers,
since there might not be much point in letting the scan get too far
ahead if nothing's consuming the tuples it produces. With this
infrastructure, we can choose at run-time exactly how much memory to
allocate for a particular purpose and return it to the operating
system as soon as we're done with it.

Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type. Since we
currently require System V shared memory to be supported on all
platforms other than Windows, I have included a System V
implementation (shmget, shmctl, shmat, shmdt). However, as we know,
on many systems, System V shared memory limits are often low out of
the box and raising them is an annoyance for users. Therefore, I've
included an implementation based on POSIX shared memory facilities
(shm_open, shm_unlink), which is the default on systems where those
facilities are supported (some of the BSDs do not support them, I believe). We
will also need a Windows implementation, which I have not attempted,
but one of my colleagues at EnterpriseDB will be filling in that gap.
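
For those who haven't played with these APIs, the "posix" implementation
boils down to the usual shm_open dance at the OS level. Roughly (this is
an illustrative sketch, not the actual code in dsm_impl.c):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a named POSIX shared memory segment of the given size and map
 * it into our address space.  Error handling is simplified here.
 */
static void *
create_posix_segment(const char *name, size_t size)
{
    int     fd;
    void   *address;

    /* Create the named segment; fail if it already exists. */
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    /* A freshly created segment has size 0; give it the requested size. */
    if (ftruncate(fd, size) < 0)
    {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    /* Map it; the mapping stays valid after the descriptor is closed. */
    address = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (address == MAP_FAILED) ? NULL : address;
}

The System V path is the same idea with shmget/shmat, and cleanup means
shm_unlink (or shmctl with IPC_RMID) plus munmap once the last reference
goes away.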

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging. On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments. Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.

Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other. There are
ways, on most if not all platforms, to pass file descriptors through
sockets, but there's not automatically a socket connection between the
two processes either, so it gets hairy to think about making this
work. I did, however, include a "none" implementation which has the
effect of shutting the facility off altogether.
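
(For the record, the fd-passing mechanism alluded to above is the usual
SCM_RIGHTS dance over a Unix-domain socket.  Here's a sketch of the
sending side, just to show the kind of plumbing involved - none of this
is in the patch:)

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send file descriptor "fd" to the peer on Unix-domain socket "sock". */
static int
send_fd(int sock, int fd)
{
    struct msghdr   msg;
    struct iovec    iov;
    char            payload = 'F';      /* must transmit at least one byte */
    char            cmsgbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(cmsgbuf, 0, sizeof(cmsgbuf));

    iov.iov_base = &payload;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    /* Attach the descriptor as ancillary data. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}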

The actual implementation is split up into two layers. dsm_impl.c/h
encapsulate the implementation-dependent functionality at a very raw
level, while dsm.c/h wrap that functionality in a more palatable API.
Most of that wrapper layer is concerned with just one problem:
avoiding leaks. This turned out to require multiple levels of
safeguards, which I duly implemented. First, dynamic shared memory
segments need to be reference-counted, so that when the last mapping
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that). If a
backend is terminated uncleanly, the postmaster needs to remove all
leftover segments during the crash-and-restart process, just as it
needs to reinitialize the main shared memory segment. And if all
processes are terminated uncleanly, the next postmaster startup needs
to clean up any segments that still exist, again just as we already do
for the main shared memory segment. Neither POSIX shared memory nor
System V shared memory provide an API for enumerating all existing
shared memory segments, so we must keep track ourselves of what we
have outstanding. Second, we need to ensure, within the scope of an
individual process, that we only retain a mapping for as long as
necessary. Just as memory contexts, locks, buffer pins, and other
resources automatically go away at the end of a query or
(sub)transaction, dynamic shared memory mappings created for a purpose
such as parallel sort need to go away if we abort mid-way through. Of
course, if you have a user backend coordinating with workers, it seems
pretty likely that the workers are just going to exit if they hit an
error, so having the mapping be process-lifetime wouldn't necessarily
be a big deal; but the user backend may stick around for a long time
and execute other queries, and we can't afford to have it accumulate
mappings, not least because that's equivalent to a session-lifespan
memory leak.
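
Just to make the shape of the wrapper API concrete, here's roughly how I
expect a caller to use it (a sketch only; I'm assuming accessors along
the lines of dsm_segment_handle() and dsm_segment_address() for getting
the handle and mapped address out of a dsm_segment - the exact names may
differ from what's in the patch):

#include "postgres.h"
#include "storage/dsm.h"

static void
creator(void)
{
    dsm_segment *seg;
    dsm_handle   handle;
    char        *base;

    /* Create a 1MB segment; let the OS choose where to map it. */
    seg = dsm_create(1024 * 1024, NULL);

    handle = dsm_segment_handle(seg);   /* assumed accessor */
    base = dsm_segment_address(seg);    /* assumed accessor */

    /* ... advertise "handle" to cooperating backends, fill "base" ... */

    /*
     * By default the mapping is torn down with the current resource
     * owner; an explicit detach drops our reference right away.
     */
    dsm_detach(seg);
}

static void
cooperator(dsm_handle handle)
{
    dsm_segment *seg = dsm_attach(handle, NULL);

    /* ... read or write the shared data ... */

    dsm_detach(seg);
}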

To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly. If the whole operating system is rebooted, the
old control segment won't exist any more, but that's OK, because none
of the other segments will either - except under the
mmap-a-regular-file implementation, which handles cleanup by scanning
the relevant directory rather than relying on the control segment.
These precautions seem sufficient to ensure that dynamic shared memory
segments can't survive the postmaster itself short of a hard kill, and
that even after a hard kill we'll clean things up on a subsequent
postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).
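
For a rough idea of what lives in the control segment, it boils down to
an array of handle/refcount slots plus a couple of counters - something
like the sketch below (illustrative only; the field names are guesses,
not the exact struct in the patch):

#include <stdint.h>

typedef uint32_t dsm_handle;        /* assumption: an OS-level segment id */

typedef struct dsm_control_item
{
    dsm_handle  handle;         /* which segment this slot describes */
    uint32_t    refcnt;         /* 0 = slot free, 1 = pending destruction,
                                 * 2 or more = number of references + 1 */
} dsm_control_item;

typedef struct dsm_control_header
{
    uint32_t    magic;          /* sanity check, so an unrelated segment
                                 * reusing the name isn't mistaken for ours */
    uint32_t    nitems;         /* number of slots handed out so far */
    uint32_t    maxitems;       /* capacity, sized based on MaxBackends */
    dsm_control_item item[];    /* one slot per outstanding segment */
} dsm_control_header;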

There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch. Second, it doesn't make any
policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.

But, just for testing purposes, I did just that. The attached patch
includes contrib/dsm_demo, which lets you say
dsm_demo_create('something') in one session, and if you pass the return
value to dsm_demo_read() in the same or another session during the
lifetime of the first session, you'll read back the same value you
saved. This is not, by any stretch of the imagination, a
demonstration of the right way to use this facility - but as a crude
unit test, it suffices. Although I'm including it in the patch file,
I would anticipate removing it before commit. Hopefully, with a
little more functionality on top of what's included here, we'll soon
be in a position to build something that might actually be useful to
someone, but this layer itself is a bit too impoverished to build
something really cool, at least not without more work than I wanted to
put in as part of the development of this patch.

Using that crappy contrib module, I verified that the POSIX, System V,
and mmap implementations all work on my MacBook Pro (OS X 10.8.4) and
on Linux (Fedora 16). I wouldn't like to have to wager on having
gotten all of the details right to be absolutely portable everywhere,
so I wouldn't be surprised to see this break on other systems.
Hopefully that will be a matter of adjusting the configure tests a bit
rather than coping with substantive changes in available
functionality, but we'll see.

Finally, I'd like to thank Noah Misch for a lot of discussion and
thought that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

  • Andres Freund at Aug 27, 2013 at 2:07 pm
    Hi Robert,

    [just sending an email which sat in my outbox for two weeks]
    On 2013-08-13 21:09:06 -0400, Robert Haas wrote:
    ...
    Nice to see this coming. I think it will actually be interesting for
    quite some things outside parallel query, but we'll see.

    I've not yet looked at the code, so I just have some highlevel comments
    so far.
    To help solve these problems, I invented something called the "dynamic
    shared memory control segment". This is a dynamic shared memory
    segment created at startup (or reinitialization) time by the
    postmaster before any user process are created. It is used to store a
    list of the identities of all the other dynamic shared memory segments
    we have outstanding and the reference count of each. If the
    postmaster goes through a crash-and-reset cycle, it scans the control
    segment and removes all the other segments mentioned there, and then
    recreates the control segment itself. If the postmaster is killed off
    (e.g. kill -9) and restarted, it locates the old control segment and
    proceeds similarly.
    That way any corruption in that area will prevent restarts without
    reboot unless you use ipcrm, or such, right?
    Creating a shared memory segment is a somewhat operating-system
    dependent task. I decided that it would be smart to support several
    different implementations and to let the user choose which one they'd
    like to use via a new GUC, dynamic_shared_memory_type.
    I think we want that during development, but I'd rather not go there
    when releasing. After all, we don't support a manual choice between
    anonymous mmap/sysv shmem either.
    In addition, I've included an implementation based on mmap of a plain
    file. As compared with a true shared memory implementation, this
    obviously has the disadvantage that the OS may be more likely to
    decide to write back dirty pages to disk, which could hurt
    performance. However, I believe it's worthy of inclusion all the
    same, because there are a variety of situations in which it might be
    more convenient than one of the other implementations. One is
    debugging.
    Hm. Not sure what's the advantage over a corefile here.
    On MacOS X, for example, there seems to be no way to list
    POSIX shared memory segments, and no easy way to inspect the contents
    of either POSIX or System V shared memory segments.
    Shouldn't we ourselves know which segments are around?
    Another use case
    is working around an administrator-imposed or OS-imposed shared memory
    limit. If you're not allowed to allocate shared memory, but you are
    allowed to create files, then this implementation will let you use
    whatever facilities we build on top of dynamic shared memory anyway.
    I don't think we should try to work around limits like that.
    A third possible reason to use this implementation is
    compartmentalization. For example, you can put the directory that
    stores the dynamic shared memory segments on a RAM disk - which
    removes the performance concern - and then do whatever you like with
    that directory: secure it, put filesystem quotas on it, or sprinkle
    magic pixie dust on it. It doesn't even seem out of the question that
    there might be cases where there are multiple RAM disks present with
    different performance characteristics (e.g. on NUMA machines) and this
    would provide fine-grained control over where your shared memory
    segments get placed. To make a long story short, I won't be crushed
    if the consensus is against including this, but I think it's useful.
    -1 so far. Seems a bit handwavy to me.
    Other implementations are imaginable but not implemented here. For
    example, you can imagine using the mmap() of an anonymous file.
    However, since the point is that these segments are created on the fly
    by individual backends and then shared with other backends, that gets
    a little tricky. In order for the second backend to map the same
    anonymous shared memory segment that the first one mapped, you'd have
    to pass the file descriptor from one process to the other.
    It wouldn't even work. Several mappings of /dev/zero et al. do *not*
    result in the same virtual memory being mapped. Not even when using the
    same (passed around) fd.
    Believe me, I tried ;)

    There are quite a few problems that this patch does not solve. First,
    while it does give you a shared memory segment, it doesn't provide you
    with any help at all in figuring out what to put in that segment. The
    task of figuring out how to communicate usefully through shared memory
    is thus, for the moment, left entirely to the application programmer.
    While there may be cases where that's just right, I suspect there will
    be a wider range of cases where it isn't, and I plan to work on some
    additional facilities, sitting on top of this basic structure, next,
    though probably as a separate patch.
    Agreed.
    Second, it doesn't make any policy decisions about what is sensible either in terms of number of
    shared memory segments or the sizes of those segments, even though
    there are serious practical limits in both cases. Actually, the total
    number of segments system-wide is limited by the size of the control
    segment, which is sized based on MaxBackends. But there's nothing to
    keep a single backend from eating up all the slots, even though that's
    pretty both unfriendly and unportable, and there's no real limit to
    the amount of memory it can gobble up per slot, either. In other
    words, it would be a bad idea to write a contrib module that exposes a
    relatively uncooked version of this layer to the user.
    At this point I am rather unconcerned with this point to be
    honest.
    --- /dev/null
    +++ b/src/include/storage/dsm.h
    @@ -0,0 +1,40 @@
    +/*-------------------------------------------------------------------------
    + *
    + * dsm.h
    + * manage dynamic shared memory segments
    + *
    + * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
    + * Portions Copyright (c) 1994, Regents of the University of California
    + *
    + * src/include/storage/dsm.h
    + *
    + *-------------------------------------------------------------------------
    + */
    +#ifndef DSM_H
    +#define DSM_H
    +
    +#include "storage/dsm_impl.h"
    +
    +typedef struct dsm_segment dsm_segment;
    +
    +/* Initialization function. */
    +extern void dsm_postmaster_startup(void);
    +
    +/* Functions that create, update, or remove mappings. */
    +extern dsm_segment *dsm_create(uint64 size, char *preferred_address);
    +extern dsm_segment *dsm_attach(dsm_handle h, char *preferred_address);
    +extern void *dsm_resize(dsm_segment *seg, uint64 size,
    + char *preferred_address);
    +extern void *dsm_remap(dsm_segment *seg, char *preferred_address);
    +extern void dsm_detach(dsm_segment *seg);
    Why do we want to expose something unreliable as preferred_address to
    the external interface? I haven't read the code yet, so I might be
    missing something here.

    Greetings,

    Andres Freund

    --
      Andres Freund http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Aug 28, 2013 at 7:21 pm

    On Tue, Aug 27, 2013 at 10:07 AM, Andres Freund wrote:
    [just sending an email which sat in my outbox for two weeks]
    Thanks for taking a look.
    Nice to see this coming. I think it will actually be interesting for
    quite some things outside parallel query, but we'll see.
    Yeah, I hope so. The applications may be somewhat limited by the fact
    that there are apparently fairly small limits to how many shared
    memory segments you can map at the same time. I believe on one system
    I looked at (some version of HP-UX?) the limit was 11. So we won't be
    able to go nuts with this: using it definitely introduces all kinds of
    failure modes that we don't have today. But it will also let us do
    some pretty cool things that we CAN'T do today.
    To help solve these problems, I invented something called the "dynamic
    shared memory control segment". This is a dynamic shared memory
    segment created at startup (or reinitialization) time by the
    postmaster before any user process are created. It is used to store a
    list of the identities of all the other dynamic shared memory segments
    we have outstanding and the reference count of each. If the
    postmaster goes through a crash-and-reset cycle, it scans the control
    segment and removes all the other segments mentioned there, and then
    recreates the control segment itself. If the postmaster is killed off
    (e.g. kill -9) and restarted, it locates the old control segment and
    proceeds similarly.
    That way any corruption in that area will prevent restarts without
    reboot unless you use ipcrm, or such, right?
    The way I've designed it, no. If what we expect to be the control
    segment doesn't exist or doesn't conform to our expectations, we just
    assume that it's not really the control segment after all - e.g.
    someone rebooted, clearing all the segments, and then an unrelated
    process (malicious, perhaps, or just a completely different cluster)
    reused the same name. This is similar to what we do for the main
    shared memory segment.
    Creating a shared memory segment is a somewhat operating-system
    dependent task. I decided that it would be smart to support several
    different implementations and to let the user choose which one they'd
    like to use via a new GUC, dynamic_shared_memory_type.
    I think we want that during development, but I'd rather not go there
    when releasing. After all, we don't support a manual choice between
    anonymous mmap/sysv shmem either.
    That's true, but that decision has not been uncontroversial - e.g. the
    NetBSD guys don't like it, because they have a big performance
    difference between those two types of memory. We have to balance the
    possible harm of one more setting against the benefit of letting
    people do what they want without needing to recompile or modify code.
    In addition, I've included an implementation based on mmap of a plain
    file. As compared with a true shared memory implementation, this
    obviously has the disadvantage that the OS may be more likely to
    decide to write back dirty pages to disk, which could hurt
    performance. However, I believe it's worthy of inclusion all the
    same, because there are a variety of situations in which it might be
    more convenient than one of the other implementations. One is
    debugging.
    Hm. Not sure what's the advantage over a corefile here.
    You can look at it while the server's running.
    On MacOS X, for example, there seems to be no way to list
    POSIX shared memory segments, and no easy way to inspect the contents
    of either POSIX or System V shared memory segments.
    Shouldn't we ourselves know which segments are around?
    Sure, that's the point of the control segment. But listing a
    directory is a lot easier than figuring out what the current control
    segment contents are.
    Another use case
    is working around an administrator-imposed or OS-imposed shared memory
    limit. If you're not allowed to allocate shared memory, but you are
    allowed to create files, then this implementation will let you use
    whatever facilities we build on top of dynamic shared memory anyway.
    I don't think we should try to work around limits like that.
    I do. There's probably someone, somewhere in the world who thinks
    that operating system shared memory limits are a good idea, but I have
    not met any such person. There are multiple ways to create shared
    memory, and they all have different limits. Normally, System V limits
    are small, POSIX limits are large, and the inherited-anonymous-mapping
    trick we're now using for the main shared memory segment has no limits
    at all. It's very common to run into a system where you can allocate
    huge numbers of gigabytes of backend-private memory, but if you try to
    allocate 64MB of *shared* memory, you get the axe - or maybe not,
    depending on which API you use to create it.

    I would never advocate deliberately trying to circumvent a
    carefully-considered OS-level policy decision about resource
    utilization, but I don't think that's the dynamic here. I think if we
    insist on predetermining the dynamic shared memory implementation
    based on the OS, we'll just be inconveniencing people needlessly, or
    flat-out making things not work. I think this case is roughly similar
    to wal_sync_method: there really shouldn't be a performance or
    reliability difference between the ~6 ways of flushing a file to disk,
    but as it turns out, there is, so we have an option. If we're SURE
    that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
    100% of cases, and that a NetBSD user will always prefer "sysv" over
    "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
    But I'm not that sure.
    It wouldn't even work. Several mappings of /dev/zero et al. do *not*
    result in the same virtual memory being mapped. Not even when using the
    same (passed around) fd.
    Believe me, I tried ;)
    OK, well that's another reason I didn't do it that way, then. :-)
    At this point I am rather unconcerned with this point to be
    honest.
    I think that's appropriate; mostly, I wanted to emphasize that the
    wisdom of allocating any given amount of shared memory is outside the
    scope of this patch, which only aims to provide mechanism, not policy.
    Why do we want to expose something unreliable as preferred_address to
    the external interface? I haven't read the code yet, so I might be
    missing something here.
    I shared your opinion that preferred_address is never going to be
    reliable, although FWIW Noah thinks it can be made reliable with a
    large-enough hammer. But even if it isn't reliable, there doesn't
    seem to be all that much value in forbidding access to that part of
    the OS-provided API. In the world where it's not reliable, it may
    still be convenient to map things at the same address when you can, so
    that pointers can be used. Of course you'd have to have some
    fallback strategy for when you don't get the same mapping, and maybe
    that's painful enough that there's no point after all. Or maybe it's
    worth having one code path for relativized pointers and another for
    non-relativized pointers.
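
    By "relativized pointers" I mean something along these lines - store
    offsets from the segment's base address rather than raw pointers, so
    the data stays usable even when backends map the segment at different
    addresses (just a sketch; nothing like this is in the patch yet):

    #include <stdint.h>

    typedef uint64_t relptr;    /* offset from the segment's base address */

    /* Convert an in-segment pointer to a base-relative offset. */
    static inline relptr
    make_relptr(void *base, void *ptr)
    {
        return (relptr) ((char *) ptr - (char *) base);
    }

    /* Convert a base-relative offset back to a pointer in this mapping. */
    static inline void *
    resolve_relptr(void *base, relptr offset)
    {
        return (char *) base + offset;
    }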

    To be honest, I'm not real sure. I think it's clear enough that this
    will meet the minimal requirements for parallel query - ONE dynamic
    shared memory segment that's not guaranteed to be at the same address
    in every backend, and can't be resized after creation. And we could
    pare the API down to only support that. But I'd rather get some
    experience with this first before we start taking away options.
    Otherwise, we may never really find out the limits of what is possible
    in this area, and I think that would be a shame.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Andres Freund at Aug 30, 2013 at 3:45 pm
    Hi,
    On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
    That way any corruption in that area will prevent restarts without
    reboot unless you use ipcrm, or such, right?
    The way I've designed it, no. If what we expect to be the control
    segment doesn't exist or doesn't conform to our expectations, we just
    assume that it's not really the control segment after all - e.g.
    someone rebooted, clearing all the segments, and then an unrelated
    process (malicious, perhaps, or just a completely different cluster)
    reused the same name. This is similar to what we do for the main
    shared memory segment.
    The case I am mostly wondering about is some process crashing and
    overwriting random memory. We need to be pretty sure that we'll never
    fail partially through cleaning up old segments because they are
    corrupted or because we died halfway through our last cleanup attempt.
    I think we want that during development, but I'd rather not go there
    when releasing. After all, we don't support a manual choice between
    anonymous mmap/sysv shmem either.
    That's true, but that decision has not been uncontroversial - e.g. the
    NetBSD guys don't like it, because they have a big performance
    difference between those two types of memory. We have to balance the
    possible harm of one more setting against the benefit of letting
    people do what they want without needing to recompile or modify code.
    But then, it made them fix the issue afaik :P
    In addition, I've included an implementation based on mmap of a plain
    file. As compared with a true shared memory implementation, this
    obviously has the disadvantage that the OS may be more likely to
    decide to write back dirty pages to disk, which could hurt
    performance. However, I believe it's worthy of inclusion all the
    same, because there are a variety of situations in which it might be
    more convenient than one of the other implementations. One is
    debugging.
    Hm. Not sure what's the advantage over a corefile here.
    You can look at it while the server's running.
    That's what debuggers are for.
    On MacOS X, for example, there seems to be no way to list
    POSIX shared memory segments, and no easy way to inspect the contents
    of either POSIX or System V shared memory segments.
    Shouldn't we ourselves know which segments are around?
    Sure, that's the point of the control segment. But listing a
    directory is a lot easier than figuring out what the current control
    segment contents are.
    But without a good amount of tooling - like in a debugger... - it's not
    very interesting to look at those files either way? The mere presence of
    a segment doesn't tell you much and the contents won't be easily
    readable.
    Another use case is working around an administrator-imposed or
    OS-imposed shared memory limit. If you're not allowed to allocate
    shared memory, but you are allowed to create files, then this
    implementation will let you use whatever facilities we build on top
    of dynamic shared memory anyway.
    I don't think we should try to work around limits like that.
    I do. There's probably someone, somewhere in the world who thinks
    that operating system shared memory limits are a good idea, but I have
    not met any such person.
    "Let's drive users away from sysv shem" is the only one I heard so far ;)
    I would never advocate deliberately trying to circumvent a
    carefully-considered OS-level policy decision about resource
    utilization, but I don't think that's the dynamic here. I think if we
    insist on predetermining the dynamic shared memory implementation
    based on the OS, we'll just be inconveniencing people needlessly, or
    flat-out making things not work. [...]
    But using file-backed memory will *suck* performancewise. Why should we
    ever want to offer that to a user? That's what I was arguing about
    primarily.
    If we're SURE
    that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
    100% of cases, and that a NetBSD user will always prefer "sysv" over
    "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
    But I'm not that sure.
    I think posix shmem will be preferred to sysv shmem if present, in just
    about any relevant case. I don't know of any system with lower limits on
    posix shmem than on sysv.
    I think this case is roughly similar
    to wal_sync_method: there really shouldn't be a performance or
    reliability difference between the ~6 ways of flushing a file to disk,
    but as it turns out, there is, so we have an option.
    Well, most of them actually give different guarantees, so it makes sense
    to have differing performance...
    Why do we want to expose something unreliable as preferred_address to
    the external interface? I haven't read the code yet, so I might be
    missing something here.
    I shared your opinion that preferred_address is never going to be
    reliable, although FWIW Noah thinks it can be made reliable with a
    large-enough hammer.
    I think we need to have the arguments for that on list then. Those are
    pretty damn fundamental design decisions.
    I for one cannot see how you even remotely could make that work a) on
    windows (check the troubles we have to go through to get s_b
    consistently placed, and that's directly after startup) b) 32bit systems.
    But even if it isn't reliable, there doesn't seem to be all that much
    value in forbidding access to that part of the OS-provided API. In
    the world where it's not reliable, it may still be convenient to map
    things at the same address when you can, so that pointers can be
    used. Of course you'd have to have some fallback strategy for when
    you don't get the same mapping, and maybe that's painful enough that
    there's no point after all. Or maybe it's worth having one code path
    for relativized pointers and another for non-relativized pointers.
    It seems likely to me that will end up with untested code in that
    case. Or even unsupported platforms.
    To be honest, I'm not real sure. I think it's clear enough that this
    will meet the minimal requirements for parallel query - ONE dynamic
    shared memory segment that's not guaranteed to be at the same address
    in every backend, and can't be resized after creation. And we could
    pare the API down to only support that. But I'd rather get some
    experience with this first before we start taking away options.
    Otherwise, we may never really find out the limits of what is possible
    in this area, and I think that would be a shame.
    On the other hand, adding capabilities annoys people far less than
    deciding that we can't support them in the end and taking them away.

    Greetings,

    Andres Freund

    --
      Andres Freund http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Amit Kapila at Aug 31, 2013 at 5:12 am

    On Fri, Aug 30, 2013 at 9:15 PM, Andres Freund wrote:
    Hi,
    On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
    That way any corruption in that area will prevent restarts without
    reboot unless you use ipcrm, or such, right?
    The way I've designed it, no. If what we expect to be the control
    segment doesn't exist or doesn't conform to our expectations, we just
    assume that it's not really the control segment after all - e.g.
    someone rebooted, clearing all the segments, and then an unrelated
    process (malicious, perhaps, or just a completely different cluster)
    reused the same name. This is similar to what we do for the main
    shared memory segment.
    The case I am mostly wondering about is some process crashing and
    overwriting random memory. We need to be pretty sure that we'll never
    fail partially through cleaning up old segments because they are
    corrupted or because we died halfway through our last cleanup attempt.
    I think we want that during development, but I'd rather not go there
    when releasing. After all, we don't support a manual choice between
    anonymous mmap/sysv shmem either.
    That's true, but that decision has not been uncontroversial - e.g. the
    NetBSD guys don't like it, because they have a big performance
    difference between those two types of memory. We have to balance the
    possible harm of one more setting against the benefit of letting
    people do what they want without needing to recompile or modify code.
    But then, it made them fix the issue afaik :P
    In addition, I've included an implementation based on mmap of a plain
    file. As compared with a true shared memory implementation, this
    obviously has the disadvantage that the OS may be more likely to
    decide to write back dirty pages to disk, which could hurt
    performance. However, I believe it's worthy of inclusion all the
    same, because there are a variety of situations in which it might be
    more convenient than one of the other implementations. One is
    debugging.
    Hm. Not sure what's the advantage over a corefile here.
    You can look at it while the server's running.
    That's what debuggers are for.
    On MacOS X, for example, there seems to be no way to list
    POSIX shared memory segments, and no easy way to inspect the contents
    of either POSIX or System V shared memory segments.
    Shouldn't we ourselves know which segments are around?
    Sure, that's the point of the control segment. But listing a
    directory is a lot easier than figuring out what the current control
    segment contents are.
    But without a good amount of tooling - like in a debugger... - it's not
    very interesting to look at those files either way? The mere presence of
    a segment doesn't tell you much and the contents won't be easily
    readable.
    Another use case is working around an administrator-imposed or
    OS-imposed shared memory limit. If you're not allowed to allocate
    shared memory, but you are allowed to create files, then this
    implementation will let you use whatever facilities we build on top
    of dynamic shared memory anyway.
    I don't think we should try to work around limits like that.
    I do. There's probably someone, somewhere in the world who thinks
    that operating system shared memory limits are a good idea, but I have
    not met any such person.
    "Let's drive users away from sysv shem" is the only one I heard so far ;)
    I would never advocate deliberately trying to circumvent a
    carefully-considered OS-level policy decision about resource
    utilization, but I don't think that's the dynamic here. I think if we
    insist on predetermining the dynamic shared memory implementation
    based on the OS, we'll just be inconveniencing people needlessly, or
    flat-out making things not work. [...]
    But using file-backed memory will *suck* performancewise. Why should we
    ever want to offer that to a user? That's what I was arguing about
    primarily.
    If we're SURE
    that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
    100% of cases, and that a NetBSD user will always prefer "sysv" over
    "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
    But I'm not that sure.
    I think posix shmem will be preferred to sysv shmem if present, in just
    about any relevant case. I don't know of any system with lower limits on
    posix shmem than on sysv.
    I think this case is roughly similar
    to wal_sync_method: there really shouldn't be a performance or
    reliability difference between the ~6 ways of flushing a file to disk,
    but as it turns out, there is, so we have an option.
    Well, most of them actually give different guarantees, so it makes sense
    to have differing performance...
    Why do we want to expose something unreliable as preferred_address to
    the external interface? I haven't read the code yet, so I might be
    missing something here.
    I shared your opinion that preferred_address is never going to be
    reliable, although FWIW Noah thinks it can be made reliable with a
    large-enough hammer.
    I think we need to have the arguments for that on list then. Those are
    pretty damn fundamental design decisions.
    I for one cannot see how you even remotely could make that work a) on
    windows (check the troubles we have to go through to get s_b
    consistently placed, and that's directly after startup) b) 32bit systems.
    For Windows, I believe we are already doing something similar
    (attaching at a predefined address) for the main shared memory. It
    reserves memory at a particular address using
    pgwin32_ReserveSharedMemoryRegion() before actually starting
    (resuming a process created in suspended mode) a process, and then
    after starting, the backend attaches at the same address
    (PGSharedMemoryReAttach).

    I think one question here is what the use of exposing
    preferred_address is, to which I can think of only the following:

    a. The base OS APIs provide such a provision, so why don't we?
    b. While browsing, I found a few examples on an IBM site where they
       also show usage with a preferred address:
       http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.vacpp7a.doc%2Fproguide%2Fref%2Fcreate_heap.htm
    c. The user may wish to attach segments at the same base address, so
       that pointers in the memory-mapped file can be accessed, which
       otherwise would not be possible.
    But even if it isn't reliable, there doesn't seem to be all that much
    value in forbidding access to that part of the OS-provided API. In
    the world where it's not reliable, it may still be convenient to map
    things at the same address when you can, so that pointers can be
    used. Of course you'd have to have some fallback strategy for when
    you don't get the same mapping, and maybe that's painful enough that
    there's no point after all. Or maybe it's worth having one code path
    for relativized pointers and another for non-relativized pointers.
    It seems likely to me that will end up with untested code in that
    case. Or even unsupported platforms.
    To be honest, I'm not real sure. I think it's clear enough that this
    will meet the minimal requirements for parallel query - ONE dynamic
    shared memory segment that's not guaranteed to be at the same address
    in every backend, and can't be resized after creation. And we could
    pare the API down to only support that. But I'd rather get some
    experience with this first before we start taking away options.
    Otherwise, we may never really find out the limits of what is possible
    in this area, and I think that would be a shame.
    On the other hand, adding capabilities annoys people far less than
    deciding that we can't support them in the end and taking them away.

    With Regards,
    Amit Kapila.
    EnterpriseDB: http://www.enterprisedb.com
  • Robert Haas at Aug 31, 2013 at 12:27 pm

    On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund wrote:
    The way I've designed it, no. If what we expect to be the control
    segment doesn't exist or doesn't conform to our expectations, we just
    assume that it's not really the control segment after all - e.g.
    someone rebooted, clearing all the segments, and then an unrelated
    process (malicious, perhaps, or just a completely different cluster)
    reused the same name. This is similar to what we do for the main
    shared memory segment.
    The case I am mostly wondering about is some process crashing and
    overwriting random memory. We need to be pretty sure that we'll never
    fail partially through cleaning up old segments because they are
    corrupted or because we died halfway through our last cleanup attempt.
    Right. I had those considerations in mind and I believe I have nailed
    the hatch shut pretty tight. The cleanup code is designed never to
    die with an error. Of course it might, but it would have to be
    something like an out of memory failure or similar that isn't really
    what we're concerned about here. You are welcome to look for holes,
    but these issues are where most of my brainpower went during
    development.
    That's true, but that decision has not been uncontroversial - e.g. the
    NetBSD guys don't like it, because they have a big performance
    difference between those two types of memory. We have to balance the
    possible harm of one more setting against the benefit of letting
    people do what they want without needing to recompile or modify code.
    But then, it made them fix the issue afaik :P
    Pah. :-)
    You can look at it while the server's running.
    That's what debuggers are for.
    Tough crowd. I like it. YMMV.
    I would never advocate deliberately trying to circumvent a
    carefully-considered OS-level policy decision about resource
    utilization, but I don't think that's the dynamic here. I think if we
    insist on predetermining the dynamic shared memory implementation
    based on the OS, we'll just be inconveniencing people needlessly, or
    flat-out making things not work. [...]
    But using file-backed memory will *suck* performancewise. Why should we
    ever want to offer that to a user? That's what I was arguing about
    primarily.
    I see. There might be additional writeback traffic, but it might not
    be that bad in common cases. After all the data's pretty hot.
    If we're SURE
    that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
    100% of cases, and that a NetBSD user will always prefer "sysv" over
    "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
    But I'm not that sure.
    I think posix shmem will be preferred to sysv shmem if present, in just
    about any relevant case. I don't know of any system with lower limits on
    posix shmem than on sysv.
    OK, how about this.... SysV doesn't allow extending segments, but
    mmap does. The thing here is that you're saying "remove mmap and keep
    sysv" but Noah suggested to me that we remove sysv and keep mmap.
    This suggests to me that the picture is not so black and white as you
    think it is.
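
    (Growing a file-backed mapping is just "extend the file, then remap";
    here's a hypothetical helper, not code from the patch, written without
    mremap() so that it stays portable:)

    #include <sys/mman.h>
    #include <unistd.h>

    static void *
    grow_file_mapping(int fd, void *old_address, size_t old_size,
                      size_t new_size)
    {
        void   *new_address;

        /* Make the underlying file big enough first. */
        if (ftruncate(fd, new_size) < 0)
            return NULL;

        /* Drop the old mapping and establish a larger one. */
        if (munmap(old_address, old_size) < 0)
            return NULL;
        new_address = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        return (new_address == MAP_FAILED) ? NULL : new_address;
    }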
    I shared your opinion that preferred_address is never going to be
    reliable, although FWIW Noah thinks it can be made reliable with a
    large-enough hammer.
    I think we need to have the arguments for that on list then. Those are
    pretty damn fundamental design decisions.
    I for one cannot see how you even remotely could make that work a) on
    windows (check the troubles we have to go through to get s_b
    consistently placed, and that's directly after startup) b) 32bit systems.
    Noah?
    But even if it isn't reliable, there doesn't seem to be all that much
    value in forbidding access to that part of the OS-provided API. In
    the world where it's not reliable, it may still be convenient to map
    things at the same address when you can, so that pointers can be
    used. Of course you'd have to have some fallback strategy for when
    you don't get the same mapping, and maybe that's painful enough that
    there's no point after all. Or maybe it's worth having one code path
    for relativized pointers and another for non-relativized pointers.
    It seems likely to me that will end up with untested code in that
    case. Or even unsupported platforms.
    Maybe. I think for the amount of code we're talking about here, it's
    not worth getting excited about.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jim Nasby at Aug 30, 2013 at 12:12 am

    On 8/13/13 8:09 PM, Robert Haas wrote:
    is removed, the segment automatically goes away (we could allow for
    server-lifespan segments as well with only trivial changes, but I'm
    not sure whether there are compelling use cases for that).
    To clarify... you're talking about something that would intentionally survive postmaster restart? I don't see a use for that either...
    postmaster startup. The other problem, of making sure that segments
    get unmapped at the proper time, is solved using the resource owner
    mechanism. There is an API to create a mapping which is
    session-lifespan rather than resource-owner lifespan, but the default
    is resource-owner lifespan, which I suspect will be right for common
    uses. Thus, there are four separate occasions on which we remove
    shared memory segments: (1) resource owner cleanup, (2) backend exit
    (for any session-lifespan mappings and anything else that slips
    through the cracks), (3) postmaster exit (in case a child dies without
    cleaning itself up), and (4) postmaster startup (in case the
    postmaster dies without cleaning up).
    Ignorant question... is ResourceOwner related to memory contexts? If not, would memory contexts be a better way to handle memory segment cleanup?
    There are quite a few problems that this patch does not solve. First,
    It also doesn't provide any mechanism for notifying backends of a new segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be useful to have a standard method or three of doing that; perhaps just some convenience functions wrapping the methods mentioned in comments.
    Finally, I'd like to thank Noah Misch for a lot of discussion and
    thought on that enabled me to make this patch much better than it
    otherwise would have been. Although I didn't adopt Noah's preferred
    solutions to all of the problems, and although there are probably
    still some problems buried here, there would have been more if not for
    his advice. I'd also like to thank the entire database server team at
    EnterpriseDB for allowing me to dump large piles of work on them so
    that I could work on this, and my boss, Tom Kincaid, for not allowing
    other people to dump large piles of work on me.
    Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared memory is something we've needed forever! :)

    Other comments...

    + * If the state file is empty or the contents are garbled, it probably means
    + * that the operating system rebooted before the data written by the previous
    + * postmaster made it to disk. In that case, we can just ignore it; any shared
    + * memory from before the reboot should be gone anyway.

    I'm a bit concerned about this; I know it was possible in older versions for the global shared memory context to be left behind after a crash and need to be cleaned up by hand. Dynamic shared mem potentially multiplies that by 100 or more. I think it'd be worth changing dsm_write_state_file so it always writes a new file and then does an atomic mv (or something similar).
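
    Something along these lines (a hypothetical helper, not the patch's
    dsm_write_state_file - write the new contents to a temp file, flush
    it, and atomically rename it over the old state file):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int
    write_state_file_atomically(const char *path, const char *temp_path,
                                const void *contents, size_t length)
    {
        int     fd;

        fd = open(temp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        /* Write and flush the new contents before exposing them. */
        if (write(fd, contents, length) != (ssize_t) length ||
            fsync(fd) < 0)
        {
            close(fd);
            unlink(temp_path);
            return -1;
        }
        close(fd);

        /* rename() atomically replaces the old file on POSIX systems. */
        return rename(temp_path, path);
    }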

    + * If some other backend exited uncleanly, it might have corrupted the
    + * control segment while it was dying. In that case, we warn and ignore
    + * the contents of the control segment. This may end up leaving behind
    + * stray shared memory segments, but there's not much we can do about
    + * that if the metadata is gone.

    Similar concern... in this case, would it be possible to always write updates to an un-used slot and then atomically update a pointer? This would be more work than what I suggested above, so maybe just a TODO for now...

    Though... is there anything a dying backend could do that would corrupt the control segment to the point that it would screw up segments allocated by other backends and not related to the dead backend? Like marking a slot as not used when it is still in use and isn't associated with the dead backend? (I'm assuming that if a backend dies unexpectedly then all other backends using memory shared with that backend will need to handle themselves accordingly so that we don't need to worry about that in dsm.c.)


    I was able to simplify dsm_create a bit (depending on your definition of simplify...); I'm not sure the community is OK with using an ereport to exit a loop (that could safely go outside the loop, though...). In any case, I traded 5 lines of (mostly) duplicate code for an if{} and a break:

    + nitems = dsm_control->nitems;
    + for (i = 0; i <= nitems; ++i) /* Intentionally go one slot past what's currently been allocated */
    + {
    + if (dsm_control->item[i].refcnt == 0)
    + {
    + dsm_control->item[i].handle = seg->handle;
    + /* refcnt of 1 triggers destruction, so start at 2 */
    + dsm_control->item[i].refcnt = 2;
    + seg->control_slot = i;
    + if (i == nitems) /* We hit the end of the list */
    + {
    + /* Verify that we can support an additional mapping. */
    + if (nitems >= dsm_control->maxitems)
    + ereport(ERROR,
    + (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
    + errmsg("too many dynamic shared memory segments")));
    +
    + dsm_control->nitems++;
    + }
    + break;
    + }
    + }
    +
    + LWLockRelease(DynamicSharedMemoryControlLock);
    + return seg;



    Should this (in dsm_attach)

    + * If you're hitting this error, you probably want to use attempt to

    be

    + * If you're hitting this error, you probably want to attempt to

    ?


    Should dsm_impl_op sanity check the arguments after op? I didn't notice checks in the type-specific code but I also didn't read all of it... are we just depending on the OS to sanity-check?

    Also, does the GUC code enforce that the GUC must always be something that's supported? If not then the error in dsm_impl_op should be more user-friendly.

    I basically stopped reading after dsm_impl_op... the rest of the stuff was rather over my head.
    --
    Jim C. Nasby, Data Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Robert Haas at Aug 31, 2013 at 12:17 pm

    On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby wrote:
    On 8/13/13 8:09 PM, Robert Haas wrote:
    is removed, the segment automatically goes away (we could allow for
    server-lifespan segments as well with only trivial changes, but I'm
    not sure whether there are compelling use cases for that).
    To clarify... you're talking something that would intentionally survive
    postmaster restart? I don't see use for that either...
    No, I meant something that would live as long as the postmaster and
    die when it dies.
    Ignorant question... is ResourceOwner related to memory contexts? If not,
    would memory contexts be a better way to handle memory segment cleanup?
    Nope. :-)
    There are quite a few problems that this patch does not solve. First,
    It also doesn't provide any mechanism for notifying backends of a new
    segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be
    useful to have a standard method or three of doing that; perhaps just some
    convenience functions wrapping the methods mentioned in comments.
    I don't see that as being generally useful. Backends need to know
    more than "there's a new segment", and in fact most backends won't
    care about most new segments. A background worker needs to know about
    the new segment *that it should attach*, but we have bgw_main_arg. If
    we end up using this facility for any system-wide purposes, I imagine
    we'll do that by storing the segment ID in the main shared memory
    segment someplace.
    Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared
    memory is something we've needed forever! :)
    Thanks.
    Other comments...

    + * If the state file is empty or the contents are garbled, it probably
    means
    + * that the operating system rebooted before the data written by the
    previous
    + * postmaster made it to disk. In that case, we can just ignore it; any
    shared
    + * memory from before the reboot should be gone anyway.

    I'm a bit concerned about this; I know it was possible in older versions for
    the global shared memory context to be left behind after a crash and needing
    to clean it up by hand. Dynamic shared mem potentially multiplies that by
    100 or more. I think it'd be worth changing dsm_write_state_file so it
    always writes a new file and then does an atomic mv (or something similar).
    I agree that the possibilities for leftover shared memory segments are
    multiplied with this new facility, and I've done my best to address
    that. However, I don't agree that writing the state file in a
    different way would improve anything.
    + * If some other backend exited uncleanly, it might have corrupted
    the
    + * control segment while it was dying. In that case, we warn and
    ignore
    + * the contents of the control segment. This may end up leaving
    behind
    + * stray shared memory segments, but there's not much we can do
    about
    + * that if the metadata is gone.

    Similar concern... in this case, would it be possible to always write
    updates to an un-used slot and then atomically update a pointer? This would
    be more work than what I suggested above, so maybe just a TODO for now...

    Though... is there anything a dying backend could do that would corrupt the
    control segment to the point that it would screw up segments allocated by
    other backends and not related to the dead backend? Like marking a slot as
    not used when it is still in use and isn't associated with the dead backend?
    Sure. A messed-up backend can clobber the control segment just as it
    can clobber anything else in shared memory. There's really no way
    around that problem. If the control segment has been overwritten by a
    memory stomp, we can't use it to clean up. There's no way around that
    problem except to not use the control segment, which wouldn't be better.
    Should this (in dsm_attach)

    + * If you're hitting this error, you probably want to use attempt to

    be

    + * If you're hitting this error, you probably want to attempt to

    ?
    Good point.
    Should dsm_impl_op sanity check the arguments after op? I didn't notice
    checks in the type-specific code but I also didn't read all of it... are we
    just depending on the OS to sanity-check?
    Sanity-check for what?
    Also, does the GUC code enforce that the GUC must always be something that's
    supported? If not then the error in dsm_impl_op should be more
    user-friendly.
    Yes.
    I basically stopped reading after dsm_impl_op... the rest of the stuff was
    rather over my head.
    :-)

    Thanks for your interest!

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
