We've gotten a few inquiries about whether Postgres can use "huge pages"
under Linux. In principle that should be more efficient for large shmem
regions, since fewer TLB entries are needed to support the address
space. I spent a bit of time today looking into what that would take.
My testing was done with current Fedora 13, kernel version
2.6.34.7-61.fc13.x86_64 --- it's possible some of these details vary
across other kernel versions.

You can test this with fairly minimal code changes, as illustrated in
the attached not-production-grade patch. To select huge pages we have
to include SHM_HUGETLB in the flags for shmget(), and we have to be
prepared for failure (due to permissions or lack of allocated
hugepages). I made the code just fall back to a normal shmget on
failure. A bigger problem is that the shmem request size must be a
multiple of the system's hugepage size, which is *not* a constant
even though the test patch just uses 2MB as the assumed value. For a
production-grade patch we'd have to scrounge the active value out of
someplace in the /proc filesystem (ick).
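
To give the flavor of the change (a minimal sketch, not the attached
patch; the helper names are invented, and the "Hugepagesize" line in
/proc/meminfo is the value I mean):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Scrape the active hugepage size from the "Hugepagesize: N kB" line
 * of /proc/meminfo; returns 0 if it can't be found. */
static size_t
get_hugepage_size(void)
{
    FILE   *f = fopen("/proc/meminfo", "r");
    char    line[128];
    size_t  kb = 0;

    if (f != NULL)
    {
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "Hugepagesize: %zu kB", &kb) == 1)
                break;
        fclose(f);
    }
    return kb * 1024;
}

static int
create_shared_segment(key_t key, size_t size)
{
    size_t  hps = get_hugepage_size();
    int     shmid = -1;

#ifdef SHM_HUGETLB                      /* Linux-only */
    if (hps > 0)
    {
        /* the request must be a multiple of the hugepage size */
        size_t  rounded = ((size + hps - 1) / hps) * hps;

        shmid = shmget(key, rounded,
                       IPC_CREAT | IPC_EXCL | SHM_HUGETLB | 0600);
    }
#endif
    /* fall back to a normal segment on any failure */
    if (shmid < 0)
        shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
    return shmid;
}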

In addition to the code changes there are a couple of sysadmin
requirements to make huge pages available to Postgres:

1. You have to configure the Postgres user as a member of the group
that's permitted to allocate hugepage shared memory. I did this:
sudo sh -c "id -g postgres >/proc/sys/vm/hugetlb_shm_group"
For production use you'd need to put this in the PG initscript,
probably, to ensure it gets re-set after every reboot and before PG
is started.

2. You have to manually allocate some huge pages --- there doesn't
seem to be any setting that says "just give them out on demand".
I did this:
sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
which gave me a bit over 1GB of space reserved as huge pages.
Again, this'd have to be done over again at each system boot; one way
to make both steps persist is sketched below.
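
A sketch of doing that (vm.nr_hugepages and vm.hugetlb_shm_group are
the sysctl names behind the two /proc files used above):

# as root: append both settings and apply them immediately
echo "vm.nr_hugepages = 600" >>/etc/sysctl.conf
echo "vm.hugetlb_shm_group = $(id -g postgres)" >>/etc/sysctl.conf
sysctl -p

Applying nr_hugepages early in boot also makes it likelier that the
kernel can find enough contiguous memory to reserve.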

For testing purposes, I figured that what I wanted to stress was
postgres process swapping and shmem access. I built current git HEAD
with --enable-debug and no other options, and tested with these
non-default settings:
shared_buffers         1GB
checkpoint_segments    50
fsync                  off
(fsync intentionally off since I'm not trying to measure disk speed).
The test machine has two dual-core Nehalem CPUs. Test case is pgbench
at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
in each configuration.
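
Spelled out, each configuration ran something like this (the
initialization step is assumed standard pgbench practice; only the
run command is quoted above):

createdb bench
pgbench -i -s 25 bench          # initialize at scale factor 25
for i in 1 2 3; do              # several iterations
    pgbench -c 10 -T 60 bench   # 10 clients, 60 seconds each
done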

And the bottom line is: if there's any performance benefit at all,
it's on the order of 1%. The best result I got was about 3200 TPS
with hugepages, and about 3160 without. The noise in these numbers
is more than 1% though.

This is discouraging; it certainly doesn't make me want to expend the
effort to develop a production patch. However, perhaps someone else
can try to show a greater benefit under some other test conditions.

regards, tom lane

  • Robert Haas at Nov 28, 2010 at 3:22 am

    On Sat, Nov 27, 2010 at 2:27 PM, Tom Lane wrote:
    For testing purposes, I figured that what I wanted to stress was
    postgres process swapping and shmem access.  I built current git HEAD
    with --enable-debug and no other options, and tested with these
    non-default settings:
    shared_buffers         1GB
    checkpoint_segments    50
    fsync                  off
    (fsync intentionally off since I'm not trying to measure disk speed).
    The test machine has two dual-core Nehalem CPUs.  Test case is pgbench
    at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
    in each configuration.

    And the bottom line is: if there's any performance benefit at all,
    it's on the order of 1%.  The best result I got was about 3200 TPS
    with hugepages, and about 3160 without.  The noise in these numbers
    is more than 1% though.

    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch.  However, perhaps someone else
    can try to show a greater benefit under some other test conditions.
    Hmm. Presumably in order to see a large benefit, you would need to
    have shared_buffers set large enough to thrash the TLB. I have no
    idea how big TLBs on modern systems are, but it'd be interesting to
    test this on a big machine with 8GB of shared buffers.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at Nov 28, 2010 at 4:53 pm

    On Sat, 2010-11-27 at 14:27 -0500, Tom Lane wrote:

    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch.
    Perhaps.

    Why do this only for shared memory? Surely the majority of memory
    accesses are to private memory, so being able to allocate private memory
    in a single huge page would be better for avoiding TLB cache misses.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Tom Lane at Nov 28, 2010 at 5:04 pm

    Simon Riggs writes:
    On Sat, 2010-11-27 at 14:27 -0500, Tom Lane wrote:
    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch.
    Perhaps.
    Why do this only for shared memory?
    There's no exposed API for causing a process's regular memory to become
    hugepages.
    Surely the majority of memory
    accesses are to private memory, so being able to allocate private memory
    in a single huge page would be better for avoiding TLB cache misses.
    It's not really about the number of memory accesses, it's about the
    number of TLB entries needed. Private memory is generally a lot smaller
    than shared, in a tuned PG installation.

    regards, tom lane
  • Simon Riggs at Nov 28, 2010 at 7:15 pm

    On Sun, 2010-11-28 at 12:04 -0500, Tom Lane wrote:
    Simon Riggs <simon@2ndQuadrant.com> writes:
    On Sat, 2010-11-27 at 14:27 -0500, Tom Lane wrote:
    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch.
    Perhaps.
    Why do this only for shared memory?
    There's no exposed API for causing a process's regular memory to become
    hugepages.
    We could make all the palloc stuff into shared memory also ("private"
    shared memory that is). We're not likely to run out of 64-bit memory
    addresses any time soon.
    Surely the majority of memory
    accesses are to private memory, so being able to allocate private memory
    in a single huge page would be better for avoiding TLB cache misses.
    It's not really about the number of memory accesses, it's about the
    number of TLB entries needed. Private memory is generally a lot smaller
    than shared, in a tuned PG installation.
    Sure, but 4MB of memory is enough to require 1000 TLB entries, which is
    more than enough to blow the TLB even on a Nehalem. So the size of the
    memory we access is already big enough to blow the cache, even without
    shared buffers. If the majority of accesses are from private memory then
    the TLB cache will already be thrashed by the time we access shared
    buffers again.

    That is at least one possible explanation for the lack of benefit.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Tom Lane at Nov 28, 2010 at 7:32 pm

    Simon Riggs writes:
    On Sun, 2010-11-28 at 12:04 -0500, Tom Lane wrote:
    There's no exposed API for causing a process's regular memory to become
    hugepages.
    We could make all the palloc stuff into shared memory also ("private"
    shared memory that is). We're not likely to run out of 64-bit memory
    addresses any time soon.
    Mph. It's still not going to work well enough to be useful, because the
    kernel design for hugepages assumes a pretty static number of them.
    That maps well to our use of shared memory, not at all well to process
    local memory.
    Sure, but 4MB of memory is enough to require 1000 TLB entries, which is
    more than enough to blow the TLB even on a Nehalem.
    That can't possibly be right. I'm sure the chip designers have heard of
    programs using more than 4MB.

    regards, tom lane
  • Martijn van Oosterhout at Nov 28, 2010 at 7:45 pm

    On Sun, Nov 28, 2010 at 02:32:04PM -0500, Tom Lane wrote:
    Sure, but 4MB of memory is enough to require 1000 TLB entries, which is
    more than enough to blow the TLB even on a Nehalem.
    That can't possibly be right. I'm sure the chip designers have heard of
    programs using more than 4MB.
    According to
    http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=8
    on the Core 2 chip there wasn't even enough TLB to cover the entire
    onboard cache. With Nehalem there are 2304 TLB entries on the chip,
    which cover at least the whole onboard cache, but only just.

    Memory access is expensive. I think if you got good statistics on how
    much time your CPU is waiting for memory it'd be pretty depressing.

    Have a nice day,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    Patriotism is when love of your own people comes first; nationalism,
    when hate for people other than your own comes first.
    - Charles de Gaulle
  • Kenneth Marshall at Nov 28, 2010 at 10:30 pm

    On Sat, Nov 27, 2010 at 02:27:12PM -0500, Tom Lane wrote:
    We've gotten a few inquiries about whether Postgres can use "huge pages"
    under Linux. In principle that should be more efficient for large shmem
    regions, since fewer TLB entries are needed to support the address
    space. I spent a bit of time today looking into what that would take.
    My testing was done with current Fedora 13, kernel version
    2.6.34.7-61.fc13.x86_64 --- it's possible some of these details vary
    across other kernel versions.

    You can test this with fairly minimal code changes, as illustrated in
    the attached not-production-grade patch. To select huge pages we have
    to include SHM_HUGETLB in the flags for shmget(), and we have to be
    prepared for failure (due to permissions or lack of allocated
    hugepages). I made the code just fall back to a normal shmget on
    failure. A bigger problem is that the shmem request size must be a
    multiple of the system's hugepage size, which is *not* a constant
    even though the test patch just uses 2MB as the assumed value. For a
    production-grade patch we'd have to scrounge the active value out of
    someplace in the /proc filesystem (ick).
    I would expect that you can just iterate through the size possibilities
    pretty quickly and just use the first one that works -- no /proc
    groveling.
    In addition to the code changes there are a couple of sysadmin
    requirements to make huge pages available to Postgres:

    1. You have to configure the Postgres user as a member of the group
    that's permitted to allocate hugepage shared memory. I did this:
    sudo sh -c "id -g postgres >/proc/sys/vm/hugetlb_shm_group"
    For production use you'd need to put this in the PG initscript,
    probably, to ensure it gets re-set after every reboot and before PG
    is started.
    Since it would take advantage of them automatically, this would be
    just a normal DBA/admin task.
    2. You have to manually allocate some huge pages --- there doesn't
    seem to be any setting that says "just give them out on demand".
    I did this:
    sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
    which gave me a bit over 1GB of space reserved as huge pages.
    Again, this'd have to be done over again at each system boot.
    Same.
    For testing purposes, I figured that what I wanted to stress was
    postgres process swapping and shmem access. I built current git HEAD
    with --enable-debug and no other options, and tested with these
    non-default settings:
    shared_buffers         1GB
    checkpoint_segments    50
    fsync                  off
    (fsync intentionally off since I'm not trying to measure disk speed).
    The test machine has two dual-core Nehalem CPUs. Test case is pgbench
    at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
    in each configuration.

    And the bottom line is: if there's any performance benefit at all,
    it's on the order of 1%. The best result I got was about 3200 TPS
    with hugepages, and about 3160 without. The noise in these numbers
    is more than 1% though.

    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch. However, perhaps someone else
    can try to show a greater benefit under some other test conditions.

    regards, tom lane
    I would not really expect to see much benefit in the region that the
    normal TLB page size would cover with the typical number of TLB entries.
    1GB of shared buffers would not be enough to cause TLB thrashing with
    most processors. Bump it to 8-32GB or more and if the queries use up
    TLB entries with local work_mem you should see some more value in the
    patch.

    Regards,
    Ken
  • Tom Lane at Nov 29, 2010 at 12:13 am

    Kenneth Marshall writes:
    On Sat, Nov 27, 2010 at 02:27:12PM -0500, Tom Lane wrote:
    ... A bigger problem is that the shmem request size must be a
    multiple of the system's hugepage size, which is *not* a constant
    even though the test patch just uses 2MB as the assumed value. For a
    production-grade patch we'd have to scrounge the active value out of
    someplace in the /proc filesystem (ick).
    I would expect that you can just iterate through the size possibilities
    pretty quickly and just use the first one that works -- no /proc
    groveling.
    It's not really that easy, because (at least on the kernel version I
    tested) it's not the shmget that fails, it's the later shmat. Releasing
    and reacquiring the shm segment would require significant code
    restructuring, and at least on some platforms could produce weird
    failure cases --- I seem to recall having heard of kernels where the
    release isn't instantaneous, so that you could run up against SHMMAX
    for no apparent reason. Really you do want to scrape the value.
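
    To illustrate the shape of the problem (a sketch with an invented
    function name, not code from the patch; Linux-only):

        #include <sys/ipc.h>
        #include <sys/shm.h>

        /* On the tested kernel, shmget() with SHM_HUGETLB can succeed
         * for a size that isn't a hugepage multiple; the error only
         * shows up at shmat().  Probing sizes therefore means creating,
         * attaching, and tearing down whole segments. */
        static int
        probe_hugepage_segment(key_t key, size_t size)
        {
            int     shmid = shmget(key, size,
                                   IPC_CREAT | IPC_EXCL |
                                   SHM_HUGETLB | 0600);
            void   *addr;

            if (shmid < 0)
                return -1;
            addr = shmat(shmid, NULL, 0);
            if (addr == (void *) -1)
            {
                /* release may not be instantaneous on all kernels */
                shmctl(shmid, IPC_RMID, NULL);
                return -1;
            }
            shmdt(addr);
            return shmid;
        }
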
    2. You have to manually allocate some huge pages --- there doesn't
    seem to be any setting that says "just give them out on demand".
    I did this:
    sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
    which gave me a bit over 1GB of space reserved as huge pages.
    Again, this'd have to be done over again at each system boot.
    Same.
    The fact that hugepages have to be manually managed, and that any
    unaccounted-for pages represent completely wasted RAM, seems like a pretty
    large PITA to me. I don't see anybody buying into that for gains
    measured in single-digit percentages.
    1GB of shared buffers would not be enough to cause TLB thrashing with
    most processors.
    Well, bigger cases would be useful to try, although Simon was claiming
    that the TLB starts to fall over at 4MB of working set. I don't have a
    large enough machine to try the sort of test you're suggesting, so if
    anyone thinks this is worth pursuing, there's the patch ... go test it.

    regards, tom lane
  • Greg Stark at Nov 29, 2010 at 12:43 am

    On Mon, Nov 29, 2010 at 12:12 AM, Tom Lane wrote:
    I would expect that you can just iterate through the size possibilities
    pretty quickly and just use the first one that works -- no /proc
    groveling.
    It's not really that easy, because (at least on the kernel version I
    tested) it's not the shmget that fails, it's the later shmat.  Releasing
    and reacquiring the shm segment would require significant code
    restructuring, and at least on some platforms could produce weird
    failure cases --- I seem to recall having heard of kernels where the
    release isn't instantaneous, so that you could run up against SHMMAX
    for no apparent reason.  Really you do want to scrape the value.
    Couldn't we just round the shared memory allocation down to a multiple
    of 4MB? That would handle all older architectures where the size is
    2MB or 4MB.

    I see online that IA64 supports larger page sizes up to 256MB but then
    could we make it the user's problem if they change their hugepagesize
    to a larger value to pick a value of shared_buffers that will fit
    cleanly? For that to work, though, we might need to rejigger things
    so that the shared memory segment is exactly the size of
    shared_buffers, with any other shared data structures in a separate
    segment.

    --
    greg
  • Tom Lane at Nov 29, 2010 at 12:48 am

    Greg Stark writes:
    On Mon, Nov 29, 2010 at 12:12 AM, Tom Lane wrote:
    Really you do want to scrape the value.
    Couldn't we just round the shared memory allocation down to a multiple
    of 4MB? That would handle all older architectures where the size is
    2MB or 4MB.
    Rounding *down* will not work, at least not without extremely invasive
    changes to the shmem allocation code. Rounding up is okay, as long as
    you don't mind some possibly-wasted space.
    I see online that IA64 supports larger page sizes up to 256MB but then
    could we make it the user's problem if they change their hugepagesize
    to a larger value to pick a value of shared_buffers that will fit
    cleanly? For that to work, though, we might need to rejigger things
    so that the shared memory segment is exactly the size of
    shared_buffers, with any other shared data structures in a separate
    segment.
    Two shmem segments would be a pretty serious PITA too, certainly a lot
    more so than a few lines to read a magic number from /proc.

    But this is all premature pending a demonstration that there's enough
    potential gain here to be worth taking any trouble at all. The one
    set of numbers we have says otherwise.

    regards, tom lane
  • Jonathan Corbet at Nov 29, 2010 at 3:31 pm

    On Sat, 27 Nov 2010 14:27:12 -0500 Tom Lane wrote:

    And the bottom line is: if there's any performance benefit at all,
    it's on the order of 1%. The best result I got was about 3200 TPS
    with hugepages, and about 3160 without. The noise in these numbers
    is more than 1% though.

    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch. However, perhaps someone else
    can try to show a greater benefit under some other test conditions.
    Just a quick note: I can't hazard a guess as to why you're not getting
    better results than you are, but I *can* say that putting together a
    production-quality patch may not be worth your effort regardless. There
    is a nice "transparent hugepages" patch set out there which makes
    hugepages "just happen" when it seems to make sense and the system can
    support it. It eliminates the need for all administrative fiddling and
    for any support at the application level.

    This patch is invasive and has proved to be hard to merge. RHEL6 has it,
    though, and I believe it will get in eventually. I can point you at the
    developer involved if you'd like to experiment with this feature and see
    what it can do for you.
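
    Where the feature is present, the policy knob is exposed under
    /sys (assuming the mainline path; RHEL6's backport used a
    distribution-prefixed variant):

        # shows the active policy: always / madvise / never
        cat /sys/kernel/mm/transparent_hugepage/enabled
        # as root, to enable it system-wide:
        echo always >/sys/kernel/mm/transparent_hugepage/enabled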

    jon

    Jonathan Corbet / LWN.net / corbet@lwn.net
  • Tom Lane at Nov 29, 2010 at 3:34 pm

    Jonathan Corbet writes:
    Just a quick note: I can't hazard a guess as to why you're not getting
    better results than you are, but I *can* say that putting together a
    production-quality patch may not be worth your effort regardless. There
    is a nice "transparent hugepages" patch set out there which makes
    hugepages "just happen" when it seems to make sense and the system can
    support it. It eliminates the need for all administrative fiddling and
    for any support at the application level.
    That would be cool, because the current kernel feature is about as
    unfriendly to use as it could possibly be ...

    regards, tom lane
  • Robert Haas at Nov 29, 2010 at 3:53 pm

    On Mon, Nov 29, 2010 at 10:30 AM, Jonathan Corbet wrote:
    On Sat, 27 Nov 2010 14:27:12 -0500
    Tom Lane wrote:
    And the bottom line is: if there's any performance benefit at all,
    it's on the order of 1%.  The best result I got was about 3200 TPS
    with hugepages, and about 3160 without.  The noise in these numbers
    is more than 1% though.

    This is discouraging; it certainly doesn't make me want to expend the
    effort to develop a production patch.  However, perhaps someone else
    can try to show a greater benefit under some other test conditions.
    Just a quick note: I can't hazard a guess as to why you're not getting
    better results than you are, but I *can* say that putting together a
    production-quality patch may not be worth your effort regardless.  There
    is a nice "transparent hugepages" patch set out there which makes
    hugepages "just happen" when it seems to make sense and the system can
    support it.  It eliminates the need for all administrative fiddling and
    for any support at the application level.
    Neat!

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
