On Tue, Aug 27, 2013 at 10:07 AM, Andres Freund wrote:
[just sending an email which sat in my outbox for two weeks]
Thanks for taking a look.
Nice to see this coming. I think it will actually be interesting for
quite some things outside parallel query, but we'll see.
Yeah, I hope so. The applications may be somewhat limited by the fact
that there are apparently fairly small limits to how many shared
memory segments you can map at the same time. I believe on one system
I looked at (some version of HP-UX?) the limit was 11. So we won't be
able to go nuts with this: using it definitely introduces all kinds of
failure modes that we don't have today. But it will also let us do
some pretty cool things that we CAN'T do today.
To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly.
That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?
The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.
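
A minimal sketch of the control-segment idea described above, in C. The
struct layout, the names (control_segment, control_slot,
collect_stale_handles), the magic value, and the fixed slot count are all
hypothetical and are not taken from the patch. It only illustrates the two
steps: check that a mapping really looks like our control segment, and if
so, list the segments it still references so they can be removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONTROL_MAGIC 0x44534d43u  /* hypothetical marker value */
#define MAX_SLOTS     64           /* hypothetical fixed capacity */

typedef struct {
    uint32_t handle;   /* identity of one dynamic shared memory segment */
    uint32_t refcnt;   /* attached processes; 0 marks an unused slot (assumption) */
} control_slot;

typedef struct {
    uint32_t     magic;   /* must equal CONTROL_MAGIC */
    uint32_t     nslots;  /* number of slots that have ever been used */
    control_slot slots[MAX_SLOTS];
} control_segment;

/* Return true only if the mapped bytes look like a control segment we wrote.
 * Anything else is treated as "not ours", e.g. a reused name after a reboot. */
bool control_segment_sane(const void *addr, size_t mapped_size)
{
    const control_segment *ctl = addr;

    if (addr == NULL || mapped_size < sizeof(control_segment))
        return false;
    return ctl->magic == CONTROL_MAGIC && ctl->nslots <= MAX_SLOTS;
}

/* After a crash or unclean restart: copy the handles of all segments still
 * referenced by the old control segment into out[]. Returns how many were
 * written, or 0 if the mapping fails the sanity check. */
size_t collect_stale_handles(const void *addr, size_t mapped_size,
                             uint32_t *out, size_t out_cap)
{
    const control_segment *ctl = addr;
    size_t n = 0;

    assert(out != NULL);
    if (!control_segment_sane(addr, mapped_size))
        return 0;
    for (uint32_t i = 0; i < ctl->nslots && n < out_cap; i++)
        if (ctl->slots[i].refcnt > 0)
            out[n++] = ctl->slots[i].handle;
    return n;
}
```

The caller would destroy each returned handle and then recreate the control
segment; a failed sanity check simply means there is nothing to clean up.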
Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type.
I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.
That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.
In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
Hm. Not sure what's the advantage over a corefile here.
You can look at it while the server's running.
On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments.
Shouldn't we ourselves know which segments are around?
Sure, that's the point of the control segment. But listing a
directory is a lot easier than figuring out what the current control
segment contents are.
Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
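
For illustration, a file-backed segment of the kind described above can be
sketched with plain POSIX calls. This is a rough sketch under stated
assumptions, not the patch's code: the function names file_segment_create
and file_segment_attach are hypothetical, and error handling is reduced to
returning NULL.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a new file of the given size and map it shared.
 * Returns the mapping, or NULL on failure. (Hypothetical helper.) */
void *file_segment_create(const char *path, size_t size)
{
    void *addr;
    int   fd;

    assert(path != NULL && size > 0);
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        unlink(path);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (addr == MAP_FAILED) {
        unlink(path);
        return NULL;
    }
    return addr;
}

/* Map an existing segment file; its size is taken from the file itself.
 * Returns the mapping, or NULL on failure. (Hypothetical helper.) */
void *file_segment_attach(const char *path, size_t *size_out)
{
    struct stat st;
    void       *addr;
    int         fd;

    assert(path != NULL && size_out != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, (size_t) st.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return NULL;
    *size_out = (size_t) st.st_size;
    return addr;
}
```

Because each segment is an ordinary file, the set of live segments can be
seen by listing the directory that holds them, and their contents can be
read with ordinary tools while the server is running.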
I don't think we should try to work around limits like that.
I do. There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person. There are multiple ways to create shared
memory, and they all have different limits. Normally, System V limits
are small, POSIX limits are large, and the inherited-anonymous-mapping
trick we're now using for the main shared memory segment has no limits
at all. It's very common to run into a system where you can allocate
huge numbers of gigabytes of backend-private memory, but if you try to
allocate 64MB of *shared* memory, you get the axe - or maybe not,
depending on which API you use to create it.

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option. If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.
It wouldn't even work. Several mappings of /dev/zero et al. do *not*
result in the same virtual memory being mapped. Not even when using the
same (passed around) fd.
Believe me, I tried ;)
OK, well that's another reason I didn't do it that way, then. :-)
At this point I am rather unconcerned with this point to be
I think that's appropriate; mostly, I wanted to emphasize that the
wisdom of allocating any given amount of shared memory is outside the
scope of this patch, which only aims to provide mechanism, not policy.
Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.
I share your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer. But even if it isn't reliable, there doesn't
seem to be all that much value in forbidding access to that part of
the OS-provided API. In the world where it's not reliable, it may
still be convenient to map things at the same address when you can, so
that pointers can be used. Of course you'd have to have some
fallback strategy for when you don't get the same mapping, and maybe
that's painful enough that there's no point after all. Or maybe it's
worth having one code path for relativized pointers and another for
non-relativized pointers.
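
As a sketch of what "relativized pointers" could look like in C: store
offsets from the segment base instead of raw addresses, and convert at the
point of use. The names relptr, relptr_from, relptr_to, and the node type
are hypothetical, and treating offset 0 as NULL is an assumption of this
sketch (it means nothing useful may live at the very start of the segment).

```c
#include <assert.h>
#include <stddef.h>

typedef size_t relptr;   /* byte offset from the segment base; 0 means NULL */

/* Convert an address inside the segment to an offset from its base. */
static inline relptr relptr_from(const void *base, const void *p)
{
    if (p == NULL)
        return 0;
    assert((const char *) p > (const char *) base);
    return (size_t) ((const char *) p - (const char *) base);
}

/* Convert an offset back to an address, using this process's mapping. */
static inline void *relptr_to(void *base, relptr off)
{
    return off == 0 ? NULL : (char *) base + off;
}

/* A list node that lives inside the segment and links by offset. */
typedef struct {
    long   value;
    relptr next;
} node;

/* Sum a list starting at head; works wherever the segment is mapped. */
long list_sum(void *base, relptr head)
{
    long sum = 0;

    for (relptr cur = head; cur != 0; ) {
        node *n = relptr_to(base, cur);

        sum += n->value;
        cur = n->next;
    }
    return sum;
}
```

The same offsets stay valid no matter where each backend happens to map the
segment, at the cost of an add on every dereference. A code path that relies
on identical mapping addresses could skip the conversion, but needs a
fallback for when the preferred address is not granted.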

To be honest, I'm not real sure. I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation. And we could
pare the API down to only support that. But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Group: pgsql-hackers
Posted: Aug 14, '13 at 1:09a
Active: Aug 31, '13 at 12:27p


