On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
I sent an email containing benchmarks from Robert Haas regarding the Subject.
Looking at lkml.org I can't see it right now; will recheck when I am at home.

He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times the
lseek performance.
The workload was 64 clients hammering postgres with a simple readonly workload
(pgbench -S).
Yay! Data!
For reference see the thread in the postgres archives which also links to
performance data: http://archives.postgresql.org/message-id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=VkJpKnRxK2kTDR+HsanWA@mail.gmail.com
So both fstat and lseek do more work than postgres wants. lseek modifies
the file pointer while fstat copies all kinds of unnecessary information
into userspace. I imagine this is the source of the slowdown seen in
the 1-client case.

There have been various proposals to make the amount of information returned
by fstat limited to the 'cheap' (for various definitions of 'cheap') fields.

I'd like to dig into the requirement for knowing the file size a little
better. According to the blog entry it's used for "the query planner".
Does the query planner need to know the exact number of bytes in the file,
or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."


  • Andres Freund at Sep 16, 2011 at 5:27 pm
    Hi,
    On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:
    On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
    I sent an email containing benchmarks from Robert Haas regarding the
    Subject. Looking at lkml.org I can't see it right now; will recheck when
    I am at home.

    He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times
    the lseek performance.
    The workload was 64 clients hammering postgres with a simple readonly
    workload (pgbench -S).
    Yay! Data!
    For reference see the thread in the postgres archives which also links to
    performance data: http://archives.postgresql.org/message-id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=VkJpKnRxK2kTDR+HsanWA@mail.gmail.com
    So both fstat and lseek do more work than postgres wants. lseek modifies
    the file pointer while fstat copies all kinds of unnecessary information
    into userspace. I imagine this is the source of the slowdown seen in
    the 1-client case.
    Yes, that was my theory as well.
    I'd like to dig into the requirement for knowing the file size a little
    better. According to the blog entry it's used for "the query planner".
    It's used for multiple things - one of which is the query planner.
    The query planner needs to know how many tuples a table has to produce a
    sensible plan. For that it has stats which tell 1. how big the table is 2. how
    many tuples the table has. Those statistics are only updated every now
    and then though.
    So it uses those old stats to check how many tuples are normally stored on a
    page and then uses that to extrapolate the number of tuples from the current
    number of pages (which is computed by lseek(SEEK_END) over the 1GB segments of a
    table).
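    The extrapolation described above can be sketched roughly like this (a
    simplified illustration, not the actual PostgreSQL code; the function and
    variable names are made up):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Rough sketch: take the tuple density observed at the last statistics
     * run and scale it by the current number of pages, which comes from
     * lseek(SEEK_END) on each 1GB segment of the table. */
    static double estimate_tuples(double cur_pages,
                                  double stat_pages, double stat_tuples)
    {
        if (stat_pages <= 0)
            return 0;   /* no stats yet; the real planner has a fallback */
        double density = stat_tuples / stat_pages;  /* tuples per page */
        return cur_pages * density;
    }

    int main(void)
    {
        /* Stats said 10000 tuples on 100 pages; the table has since grown
         * to 250 pages, so we extrapolate to 25000 tuples. */
        printf("%.0f\n", estimate_tuples(250, 100, 10000));
        return 0;
    }
    ```

    This is why the current file size is queried so often: the stats are stale
    by design, and the live size is what keeps the estimate current.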

    I am not sure how interested you are in the relevant postgres internals?
    Does the query planner need to know the exact number of bytes in the file,
    or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?
    It depends on where the information is used. For some of the uses it needs to
    be exact (the assumed size is rechecked after acquiring a lock preventing
    extension); at other places I guess it would be OK if the accuracy got lower
    with bigger files (those files won't ever get bigger than 1GB).
    But I have a hard time seeing an implementation where an approximate size
    would be faster to get than the exact file size?

    Andres
  • Alvaro Herrera at Sep 16, 2011 at 5:40 pm

    Excerpts from Andres Freund's message of Fri Sep 16 14:27:33 -0300 2011:
    Hi,
    On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:

    Does the query planner need to know the exact number of bytes in the file,
    or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?
    It depends on where the information is used. For some of the uses it needs to
    be exact (the assumed size is rechecked after acquiring a lock preventing
    extension); at other places I guess it would be OK if the accuracy got lower
    with bigger files (those files won't ever get bigger than 1GB).
    One other thing we're interested in is portability. I mean, even if
    Linux were to introduce a new hypothetical syscall that was able to
    return the file size at a ridiculously low cost, we probably wouldn't
    use it because it'd be Linux-specific. So an improvement of lseek()
    seems to be the best option.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Andi Kleen at Sep 16, 2011 at 6:10 pm

    One other thing we're interested in is portability. I mean, even if
    Linux were to introduce a new hypothetical syscall that was able to
    return the file size at a ridiculously low cost, we probably wouldn't
    use it because it'd be Linux-specific. So an improvement of lseek()
    seems to be the best option.
    Fully agreed. It doesn't make any sense at all to implement special
    syscalls just to workaround a basic system call not scaling.

    -Andi
  • Benjamin LaHaise at Sep 16, 2011 at 8:27 pm

    On Fri, Sep 16, 2011 at 07:27:33PM +0200, Andres Freund wrote:
    many tuples the table has. Those statistics are only updated every now
    and then though.
    So it uses those old stats to check how many tuples are normally stored on a
    page and then uses that to extrapolate the number of tuples from the current
    number of pages (which is computed by lseek(SEEK_END) over the 1GB segments of a
    table).

    I am not sure how interested you are in the relevant postgres internals?
    For such tables, can't Postgres track the size of the file internally? I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored size
    could be updated. Not making a syscall at all would scale far better than
    even a modified lseek() will perform.

    Granted, I've not looked at the Postgres code at all.

    -ben
  • Andres Freund at Sep 16, 2011 at 9:02 pm

    On Friday, September 16, 2011 10:08:17 PM Benjamin LaHaise wrote:
    On Fri, Sep 16, 2011 at 07:27:33PM +0200, Andres Freund wrote:
    many tuples the table has. Those statistics are only updated every
    now and then though.
    So it uses those old stats to check how many tuples are normally stored
    on a page and then uses that to extrapolate the number of tuples from
    the current number of pages (which is computed by lseek(SEEK_END) over the
    1GB segments of a table).

    I am not sure how interested you are in the relevant postgres internals?
    For such tables, can't Postgres track the size of the file internally? I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored
    size could be updated. Not making a syscall at all would scale far better
    than even a modified lseek() will perform.
    Yes, it tracks the fds internally. The problem is that postgres is process
    based, so those tables are not reachable by all processes. It could start
    tracking those in shared memory, but the synchronization overhead for that
    would likely be more expensive than the syscall overhead. Given that the
    fd sets are possibly (and realistically) disjoint between the individual
    backends, you would have to reserve enough shared memory for a fully separate
    set of fds per process... which would complicate efficient lookup.

    Also with fstat() instead of lseek() there was no bottleneck anymore, so I
    don't think the benefits would warrant that.

    Greetings,

    Andres
  • Andres Freund at Sep 16, 2011 at 9:05 pm

    On Friday, September 16, 2011 11:02:38 PM Andres Freund wrote:
    Also with fstat() instead of lseek() there was no bottleneck anymore, so I
    don't think the benefits would warrant that.
    At least that's what I observed on a 4 x 6 machine without the patch applied
    (can't reboot it). That shouldn't be concurrency relevant, so...

    Andres
  • Greg Stark at Sep 16, 2011 at 10:45 pm

    On Fri, Sep 16, 2011 at 9:08 PM, Benjamin LaHaise wrote:
    For such tables, can't Postgres track the size of the file internally?  I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored size
    could be updated.  Not making a syscall at all would scale far better than
    even a modified lseek() will perform.
    There's no hardwired limit on how many tables you can have in a
    database, it's not limited by the number of file descriptors. Postgres
    would have to keep some kind of LRU for recently opened files and
    their sizes or something like that. There would probably still be a
    lot of lseeks/fstats going on.

    Generally keeping a Postgres cached value for the size would then have
    a reliability issue. It's much safer to have a single authoritative
    value -- the actual length of the file -- than have the same value
    stored in two locations and then need to worry about them getting out
    of sync. If a write fails when extending the file due to a filesystem
    running out of space then Postgres might not know how to update its
    internal cached state accurately for example.

    There's no question it could be done but it's not clear it would
    necessarily be much faster than a lock-free lseek/fstat.
    On Fri, Sep 16, 2011 at 6:27 PM, Andres Freund wrote:
    It depends on where the information is used. For some of the uses it needs to
    be exact (the assumed size is rechecked after acquiring a lock preventing
    extension)
    Fwiw this might give the wrong impression. I don't believe scans
    acquire a lock preventing extension -- that is another process can be
    concurrently extending the file at the same time as the scan is
    proceeding. The scan only locks out truncation (vacuum). Any blocks
    added by another process are ignored by the scan because they can only
    contain records invisible to that transaction. This does depend on the
    lseek/fstat being done after the transaction snapshot is taken which
    is possibly "rechecking" the value taken by the query planner but
    they're really two independent things.


    --
    greg
  • Stephen Frost at Sep 19, 2011 at 12:31 pm

    * Benjamin LaHaise (bcrl@kvack.org) wrote:
    For such tables, can't Postgres track the size of the file internally? I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored size
    could be updated. Not making a syscall at all would scale far better than
    even a modified lseek() will perform.
    We'd have to have it in shared memory and have a lock around it, it
    wouldn't be cheap at all.

    Thanks,

    Stephen
  • Matthew Wilcox at Sep 19, 2011 at 1:25 pm

    On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote:
    * Benjamin LaHaise (bcrl@kvack.org) wrote:
    For such tables, can't Postgres track the size of the file internally? I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored size
    could be updated. Not making a syscall at all would scale far better than
    even a modified lseek() will perform.
    We'd have to have it in shared memory and have a lock around it, it
    wouldn't be cheap at all.
    Yep, that makes perfect sense. After all, the kernel does basically the
    same thing to maintain this information; why should we have userspace
    duplicating the same infrastructure?

    I must admit, I'd never heard of this usage of lseek to get the current
    size of a file before; I'd assumed everybody used fstat. Given this
    legitimate reason for a high-frequency calling of lseek, I withdraw my
    earlier objection to the patch series.

    --
    Matthew Wilcox Intel Open Source Technology Centre
    "Bill, look, we understand that you're interested in selling us this
    operating system, but compare it to ours. We can't possibly take such
    a retrograde step."
  • Marco Stornelli at Sep 20, 2011 at 7:19 am

    2011/9/19 Matthew Wilcox <matthew@wil.cx>:
    On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote:
    * Benjamin LaHaise (bcrl@kvack.org) wrote:
    For such tables, can't Postgres track the size of the file internally?  I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored size
    could be updated.  Not making a syscall at all would scale far better than
    even a modified lseek() will perform.
    We'd have to have it in shared memory and have a lock around it, it
    wouldn't be cheap at all.
    Yep, that makes perfect sense.  After all, the kernel does basically the
    same thing to maintain this information; why should we have userspace
    duplicating the same infrastructure?

    I must admit, I'd never heard of this usage of lseek to get the current
    size of a file before; I'd assumed everybody used fstat.  Given this
    legitimate reason for a high-frequency calling of lseek, I withdraw my
    earlier objection to the patch series.

    I really don't understand the approach here. An improvement is an
    improvement; do we need a use case to add an improvement to the
    kernel? We are not talking about adding a new syscall or making an ABI
    change in this case. So these patches have my absolute ack.

    Marco
  • Robert Haas at Sep 19, 2011 at 1:30 pm

    On Mon, Sep 19, 2011 at 8:31 AM, Stephen Frost wrote:
    * Benjamin LaHaise (bcrl@kvack.org) wrote:
    For such tables, can't Postgres track the size of the file internally?  I'm
    assuming it's keeping file descriptors open on the tables it manages, in
    which case when it writes to a file to extend it, the internally stored size
    could be updated.  Not making a syscall at all would scale far better than
    even a modified lseek() will perform.
    We'd have to have it in shared memory and have a lock around it, it
    wouldn't be cheap at all.
    In theory, we could implement a lock-free cache. But I still think it
    would be better to see this fixed on the kernel side. If we had some
    evidence that all of those lseek() calls were a performance problem
    even when the i_mutex is not seriously contended, then that would be a
    good argument for doing this in user-space, but I haven't seen any
    such evidence. On the other hand, the numbers I posted show that when
    i_mutex IS contended, it can cause a throughput regression of up to
    90%. That seems worth fixing. If it turns out that lseek() is too
    expensive even in the uncontended case or with the i_mutex contention
    removed (or if the Linux community is unwilling to accept the proposed
    fix), then we can (and should) look at further optimizing it within
    PostgreSQL. My guess, though, is that an unlocked lseek will be fast
    enough that we won't need to worry about installing our own caching
    infrastructure (or at least, there will be plenty of more significant
    performance problems to hunt down first).
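    In the simplest case, a lock-free cache of the kind mentioned above could be
    a single atomic word per file segment in shared memory, updated by whichever
    backend extends the file. A hypothetical sketch using C11 atomics (not
    PostgreSQL code; the names are made up):

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* One cached size per file segment, living in shared memory.  Readers
     * never take a lock; the backend that extends the file publishes the
     * new size with a release store, and readers use an acquire load. */
    typedef struct {
        _Atomic long size_bytes;
    } cached_size_t;

    static void publish_size(cached_size_t *c, long new_size)
    {
        atomic_store_explicit(&c->size_bytes, new_size, memory_order_release);
    }

    static long read_size(cached_size_t *c)
    {
        return atomic_load_explicit(&c->size_bytes, memory_order_acquire);
    }

    int main(void)
    {
        cached_size_t seg = { .size_bytes = 0 };
        publish_size(&seg, 8192);          /* backend A extends the file */
        printf("%ld\n", read_size(&seg));  /* any backend reads locklessly */
        return 0;
    }
    ```

    The hard part, as noted earlier in the thread, is not the atomic itself but
    keeping the cached value authoritative when an extension partially fails
    (e.g. out of disk space), which is exactly when two copies of the size can
    drift apart.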

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company

Discussion Overview
group: pgsql-hackers @ postgresql
posted: Sep 16, 2011 at 3:46pm
active: Sep 20, 2011 at 7:19am
posts: 12
users: 9
website: postgresql.org...
irc: #postgresql
