We've gotten a few inquiries about whether Postgres can use "huge pages"
under Linux. In principle that should be more efficient for large shmem
regions, since fewer TLB entries are needed to support the address
space. I spent a bit of time today looking into what that would take.
My testing was done with current Fedora 13, kernel version
220.127.116.11-61.fc13.x86_64 --- it's possible some of these details vary
across other kernel versions.

You can test this with fairly minimal code changes, as illustrated in
the attached not-production-grade patch. To select huge pages we have
to include SHM_HUGETLB in the flags for shmget(), and we have to be
prepared for failure (due to permissions or lack of allocated
hugepages). I made the code just fall back to a normal shmget on
failure. A bigger problem is that the shmem request size must be a
multiple of the system's hugepage size, which is *not* a constant
even though the test patch just uses 2MB as the assumed value. For a
production-grade patch we'd have to scrounge the active value out of
someplace in the /proc filesystem (ick).
In addition to the code changes there are a couple of sysadmin
requirements to make huge pages available to Postgres:
1. You have to configure the Postgres user as a member of the group
that's permitted to allocate hugepage shared memory. I did this:
sudo sh -c "id -g postgres >/proc/sys/vm/hugetlb_shm_group"
For production use you'd probably need to put this in the PG initscript,
to ensure it gets re-set after every reboot and before PG starts.

2. You have to manually allocate some huge pages --- there doesn't
seem to be any setting that says "just give them out on demand".
I did this:
sudo sh -c "echo 600 >/proc/sys/vm/nr_hugepages"
which gave me a bit over 1GB of space reserved as huge pages.
Again, this'd have to be done over again at each system boot.

For testing purposes, I figured that what I wanted to stress was
postgres process swapping and shmem access. I built current git HEAD
with --enable-debug and no other options, and tested with a few
non-default settings (fsync intentionally off, since I'm not trying
to measure disk speed).
The test machine has two dual-core Nehalem CPUs. Test case is pgbench
at -s 25; I ran several iterations of "pgbench -c 10 -T 60 bench"
in each configuration.

And the bottom line is: if there's any performance benefit at all,
it's on the order of 1%. The best result I got was about 3200 TPS
with hugepages, and about 3160 without. The noise in these numbers
is more than 1% though.

This is discouraging; it certainly doesn't make me want to expend the
effort to develop a production patch. However, perhaps someone else
can try to show a greater benefit under some other test conditions.

regards, tom lane