I've dug into the problem reported by Igor Neyman:
http://archives.postgresql.org/pgsql-admin/2010-06/msg00148.php
Unlike previous complainants, Igor was kind enough to supply a pg_dump
archive file that triggers the problem. What I find is that his dump
file contains no data offsets, i.e., dataState == K_OFFSET_POS_NOT_SET
for every TABLE DATA item. This causes _PrintTocData to take the same
path taken for a non-seekable input file, i.e., search forward looking for
the desired item. In a parallel restore, all threads will start from
the same file location, right after the last serially-restored item.
Therefore, of course every one of them fails, except for the one told
to process the very first parallel-restore item.
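
The failure mode described above can be sketched in a few lines of C. This is a simplified model, not the actual _PrintTocData code: the ArchiveBlock type, its required flag, and strict_forward_scan are hypothetical names made up for illustration. It assumes every worker scans forward from the same starting position and gives up as soon as it meets a restorable data item other than its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one data block in the archive. */
typedef struct
{
    int  dump_id;   /* ID of the TOC entry this block belongs to */
    bool required;  /* true if the restore wants this block's data */
} ArchiveBlock;

typedef enum
{
    SCAN_FOUND,         /* reached the wanted block */
    SCAN_OUT_OF_ORDER,  /* met another wanted block first: treated as an error */
    SCAN_EOF            /* ran off the end without finding it */
} ScanResult;

/*
 * Search forward from start_pos for wanted_id without using any offsets.
 * A required block other than the wanted one is reported as an error,
 * on the theory that a non-seekable reader could never come back for it.
 */
static ScanResult
strict_forward_scan(const ArchiveBlock *blocks, size_t nblocks,
                    size_t start_pos, int wanted_id)
{
    assert(blocks != NULL || nblocks == 0);

    for (size_t i = start_pos; i < nblocks; i++)
    {
        if (blocks[i].dump_id == wanted_id)
            return SCAN_FOUND;
        if (blocks[i].required)
            return SCAN_OUT_OF_ORDER;
        /* not wanted by this restore: skip over it */
    }
    return SCAN_EOF;
}
```

With three wanted items in a row and every worker starting at the first of them, only the worker assigned the first item gets SCAN_FOUND; the others hit SCAN_OUT_OF_ORDER immediately.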

The reason the dump file contains no offsets is that pg_dump can't write
them unless it thinks the dump file is seekable *at dump time* ---
otherwise it can't rewind to modify the dump's table of contents.
And guess what: pre-8.4 pg_dump on Windows will NEVER believe that the
output file is seekable, because we didn't bother to define HAVE_FSEEKO
in the Windows port until 8.4.
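
To make the dump-time dependency concrete, here is a minimal sketch in C of a writer that reserves an offset slot, writes the data, and goes back to fill in the slot only if the stream can seek. The names stream_is_seekable, write_item, and OFFSET_NOT_SET are hypothetical, and the one-slot-per-item layout is an assumption for illustration; it is not the custom archive format or pg_dump's own seek check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define OFFSET_NOT_SET (-1L)    /* hypothetical "no offset recorded" marker */

/* True if the stream supports repositioning (a pipe, for example, does not). */
static bool
stream_is_seekable(FILE *fp)
{
    return fp != NULL && fseek(fp, 0L, SEEK_CUR) == 0 && ftell(fp) >= 0;
}

/*
 * Write one item as [offset slot][payload].  On a seekable stream the slot
 * is back-patched with the payload's position; otherwise it is left as
 * OFFSET_NOT_SET.  Returns the recorded offset, or OFFSET_NOT_SET.
 */
static long
write_item(FILE *out, const char *payload)
{
    long   placeholder = OFFSET_NOT_SET;
    long   slot_pos, data_pos, end_pos;
    size_t len = strlen(payload);
    bool   seekable;

    assert(out != NULL);
    seekable = stream_is_seekable(out);
    slot_pos = seekable ? ftell(out) : OFFSET_NOT_SET;

    /* Reserve the slot; it stays "not set" unless we can come back to it. */
    if (fwrite(&placeholder, sizeof placeholder, 1, out) != 1)
        return OFFSET_NOT_SET;
    data_pos = seekable ? ftell(out) : OFFSET_NOT_SET;
    if (fwrite(payload, 1, len, out) != len)
        return OFFSET_NOT_SET;

    if (!seekable)
        return OFFSET_NOT_SET;  /* cannot rewind: the offset is never written */

    /* Rewind, record the real offset, then return to the end of the data. */
    end_pos = ftell(out);
    if (fseek(out, slot_pos, SEEK_SET) != 0 ||
        fwrite(&data_pos, sizeof data_pos, 1, out) != 1 ||
        fseek(out, end_pos, SEEK_SET) != 0)
        return OFFSET_NOT_SET;
    return data_pos;
}
```

If the seekability probe always answers "no", as described above for pre-8.4 Windows builds, every item ends up with the not-set marker even when the output is an ordinary file.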

In short, parallel pg_restore is guaranteed to fail on any input file
made with a pre-8.4 pg_dump on Windows. It may be that there's some
other mechanism involved in the reports we've gotten of parallel restore
failing only some of the time, but I'm thinking that the heretofore
unrecognized dependency on pg_dump-time seekability could well explain
those too.

I see several action items here:

1. The error message emitted by _PrintTocData is incredibly misleading.
It needs to be fixed to tell people if the problem is lack of data
offsets rather than lack of seek capability.

2. The reason that _PrintTocData thinks it's an error to hit a
restorable data item other than the one it wants is that, lacking seek
capability, there'd be no way to rewind to get at that data item later.
However, this is only an issue in serial restore. In a parallel restore
worker thread, we're not going to need to seek back on that file pointer
anyway, so we should just allow the code to continue forward. There
seem to be two plausible ways of implementing that:

* Just skip the error test altogether if in a worker child.

* Modify the error test so that the only data item considered
"wanted" is the specific one the current worker wants.

The existing parallel restore logic in pg_backup_archiver.c doesn't
appear to export enough state to allow either of these strategies to be
implemented. In the Unix implementation I'd be inclined to export the
state by creating a suitable static variable, but that's not going to
work in the thread-based Windows code. It looks like we'd need some
thread-local storage which the current code hasn't got any of.

Another possibility is to just remove the inside-the-loop error test
altogether: make it just skip till it finds the desired item, and only
throw an error if it hits EOF without finding it. In the case that
the error test is trying to catch, this would mean significantly more
work done before reporting the error, but do we really care? I'm
leaning to this solution because it would not require exporting state
from the parallel restore control logic.
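
A minimal sketch of that last option, in C, assuming a simplified in-memory model of the archive. DataBlock and skip_until_found are hypothetical names for illustration, not code from pg_backup_custom.c. The loop has no out-of-order test at all: it skips every block that is not the wanted one and reports failure only at end of input. The bytes_skipped counter is there to show the cost of searching without offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one data block: its TOC ID and its size. */
typedef struct
{
    int    dump_id;
    size_t length;
} DataBlock;

/*
 * Skip forward from start_pos until wanted_id is found.
 * Returns the index of the wanted block, or -1 if end of input is reached.
 * *bytes_skipped receives the total size of the blocks read past.
 */
static long
skip_until_found(const DataBlock *blocks, size_t nblocks,
                 size_t start_pos, int wanted_id, size_t *bytes_skipped)
{
    assert(bytes_skipped != NULL);
    *bytes_skipped = 0;

    for (size_t i = start_pos; i < nblocks; i++)
    {
        if (blocks[i].dump_id == wanted_id)
            return (long) i;
        *bytes_skipped += blocks[i].length;   /* no error test: just keep going */
    }
    return -1;  /* the only error case: hit EOF without finding the item */
}
```

No state from the parallel restore control logic is needed here: the function only has to know which item it was asked for.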

3. Perhaps pg_dump ought to emit a warning when it can't seek, instead
of just silently not writing the data offsets. That behavior was okay
before when lack of data offsets didn't really matter that much, but
lack of data offsets is a serious performance handicap for parallel
restore even after we fix the outright failure condition (because each
worker is going to read through a lot of data to find what it needs).

4. Is there any value in back-porting the Windows FSEEKO support into
8.3 and 8.2? Arguably, not writing the data offsets is a performance
bug. However a back-port won't do anything for people who are dumping
with less than the latest minor release of pg_dump, so doing this might
be largely wasted effort.

Comments?

regards, tom lane

  • Greg Stark at Jun 22, 2010 at 10:52 pm

    On Tue, Jun 22, 2010 at 9:07 PM, Tom Lane wrote:
    > 3. Perhaps pg_dump ought to emit a warning when it can't seek, instead
    > of just silently not writing the data offsets.  That behavior was okay
    > before when lack of data offsets didn't really matter that much, but
    > lack of data offsets is a serious performance handicap for parallel
    > restore even after we fix the outright failure condition (because each
    > worker is going to read through a lot of data to find what it needs).

    I'm not terribly familiar with the pg_dump format, but... the usual
    strategy for storing a TOC on a non-seekable output stream is to store
    it at the end of the file. So you just accumulate all the offsets in
    memory as you generate the file and then write the TOC at the end. Of
    course you need a seekable input stream when you load it then but it
    would narrow the slow case to when you have a non-seekable output
    stream when dumping *and* a non-seekable input stream on restore.
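
The trailing-TOC strategy described above can be sketched in C as follows. Everything here is a hypothetical illustration rather than the pg_dump archive format: StreamWriter, TocEntry, the sw_* functions, lookup_offset, and the footer layout (TOC start followed by entry count, in native byte order) are all assumptions. The writer never seeks and tracks offsets by counting the bytes it writes; the reader must be able to seek.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_ITEMS 16

typedef struct { int32_t dump_id; int64_t offset; } TocEntry;

typedef struct
{
    FILE    *out;
    int64_t  written;           /* bytes written so far, counted by hand */
    TocEntry toc[MAX_ITEMS];    /* offsets accumulated in memory */
    int32_t  nitems;
} StreamWriter;

/* Forward-only write that keeps the byte counter in step. */
static int
sw_put(StreamWriter *w, const void *buf, size_t len)
{
    if (fwrite(buf, 1, len, w->out) != len)
        return -1;
    w->written += (int64_t) len;
    return 0;
}

/* Record where this item starts, then write its data. */
static int
sw_add_item(StreamWriter *w, int32_t dump_id, const char *data)
{
    if (w->nitems >= MAX_ITEMS)
        return -1;
    w->toc[w->nitems].dump_id = dump_id;
    w->toc[w->nitems].offset = w->written;
    w->nitems++;
    return sw_put(w, data, strlen(data));
}

/* Append the TOC entries, then a footer: TOC start and entry count. */
static int
sw_finish(StreamWriter *w)
{
    int64_t toc_start = w->written;
    int32_t n = w->nitems;

    for (int32_t i = 0; i < n; i++)
        if (sw_put(w, &w->toc[i].dump_id, sizeof(int32_t)) != 0 ||
            sw_put(w, &w->toc[i].offset, sizeof(int64_t)) != 0)
            return -1;
    if (sw_put(w, &toc_start, sizeof toc_start) != 0)
        return -1;
    return sw_put(w, &n, sizeof n);
}

/* Reader side: needs a seekable input.  Returns the item's offset or -1. */
static int64_t
lookup_offset(FILE *in, int32_t wanted_id)
{
    int64_t toc_start, offset;
    int32_t n, id;
    long    footer_len = (long) (sizeof toc_start + sizeof n);

    assert(in != NULL);
    if (fseek(in, -footer_len, SEEK_END) != 0 ||
        fread(&toc_start, sizeof toc_start, 1, in) != 1 ||
        fread(&n, sizeof n, 1, in) != 1 ||
        fseek(in, (long) toc_start, SEEK_SET) != 0)
        return -1;
    for (int32_t i = 0; i < n; i++)
    {
        if (fread(&id, sizeof id, 1, in) != 1 ||
            fread(&offset, sizeof offset, 1, in) != 1)
            return -1;
        if (id == wanted_id)
            return offset;
    }
    return -1;
}
```

The slow forward-search case then only remains when neither the writer nor the reader can seek, which is the narrowing described above.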

    On the other hand, if we didn't notice this dependency when there was
    only one variable, making it depend on two variables would make it that
    much more obscure when the slow case hits and users wonder why the
    restore is taking so long.

    --
    greg
  • Andrew Dunstan at Jun 23, 2010 at 1:02 am

    Tom Lane wrote:
    > In short, parallel pg_restore is guaranteed to fail on any input file
    > made with a pre-8.4 pg_dump on Windows. It may be that there's some
    > other mechanism involved in the reports we've gotten of parallel restore
    > failing only some of the time, but I'm thinking that the heretofore
    > unrecognized dependency on pg_dump-time seekability could well explain
    > those too.

    IIRC, you can reproduce this on Unix too by sending the output of
    pg_dump into a pipe. So it's not uniquely a Windows problem.

    As Greg suggests, the solution would be to have a second TOC at the end
    of the file with the offsets. But I think that's way beyond what we
    should do on the back branches, and really beyond what we should do for
    9.0. We should document the limitation.

    > I see several action items here:
    >
    > 1. The error message emitted by _PrintTocData is incredibly misleading.
    > It needs to be fixed to tell people if the problem is lack of data
    > offsets rather than lack of seek capability.

    Agreed.

    > Another possibility is to just remove the inside-the-loop error test
    > altogether: make it just skip till it finds the desired item, and only
    > throw an error if it hits EOF without finding it. In the case that
    > the error test is trying to catch, this would mean significantly more
    > work done before reporting the error, but do we really care? I'm
    > leaning to this solution because it would not require exporting state
    > from the parallel restore control logic.

    Would exporting a bit of state be so bad? It seems like it would be a
    bit cleaner, and I'll be surprised if it's terribly difficult. It can be
    set at the top of parallel_restore().

    > 3. Perhaps pg_dump ought to emit a warning when it can't seek, instead
    > of just silently not writing the data offsets. That behavior was okay
    > before when lack of data offsets didn't really matter that much, but
    > lack of data offsets is a serious performance handicap for parallel
    > restore even after we fix the outright failure condition (because each
    > worker is going to read through a lot of data to find what it needs).

    For now, yes. But in 9.1 we should write out a second TOC and teach
    pg_restore to look for it.

    > 4. Is there any value in back-porting the Windows FSEEKO support into
    > 8.3 and 8.2? Arguably, not writing the data offsets is a performance
    > bug. However a back-port won't do anything for people who are dumping
    > with less than the latest minor release of pg_dump, so doing this might
    > be largely wasted effort.

    I doubt it's worth it, but I could be persuaded otherwise.

    cheers

    andrew
  • Tom Lane at Jun 23, 2010 at 1:28 am

    Andrew Dunstan writes:
    > Tom Lane wrote:
    >> In short, parallel pg_restore is guaranteed to fail on any input file
    >> made with a pre-8.4 pg_dump on Windows.
    >
    > IIRC, you can reproduce this on Unix too by sending the output of
    > pg_dump into a pipe. So it's not uniquely a Windows problem.

    Right. We need to be able to cope, albeit with degraded performance.

    > As Greg suggests, the solution would be to have a second TOC at the end
    > of the file with the offsets.

    Uh, that doesn't fix anything: if you can't seek, a TOC at the end of
    the file is useless. And the cases where the writer can't seek are
    likely to be identically the ones where the reader can't seek, viz
    pg_dump piped to pg_restore (perhaps with some other programs between).

    >> Another possibility is to just remove the inside-the-loop error test
    >> altogether: make it just skip till it finds the desired item, and only
    >> throw an error if it hits EOF without finding it. In the case that
    >> the error test is trying to catch, this would mean significantly more
    >> work done before reporting the error, but do we really care? I'm
    >> leaning to this solution because it would not require exporting state
    >> from the parallel restore control logic.
    >
    > Would exporting a bit of state be so bad?

    The threaded case seems a bit messy, and frankly I don't believe that
    we'd be buying anything. The error case never actually occurs in the real
    world, except perhaps on corrupted archive files, so why should we care
    about performance for it?

    > For now, yes. But in 9.1 we should write out a second TOC and teach
    > pg_restore to look for it.

    I don't think this is useful.

    >> 4. Is there any value in back-porting the Windows FSEEKO support into
    >> 8.3 and 8.2? Arguably, not writing the data offsets is a performance
    >> bug. However a back-port won't do anything for people who are dumping
    >> with less than the latest minor release of pg_dump, so doing this might
    >> be largely wasted effort.
    >
    > I doubt it's worth it, but I could be persuaded otherwise.

    I'm leaning in that direction too. Anybody who's doing a version
    upgrade really ought to be using the newer pg_dump version anyway ...

    regards, tom lane
  • Andrew Dunstan at Jun 23, 2010 at 1:47 am

    Tom Lane wrote:
    >>> Another possibility is to just remove the inside-the-loop error test
    >>> altogether: make it just skip till it finds the desired item, and only
    >>> throw an error if it hits EOF without finding it. In the case that
    >>> the error test is trying to catch, this would mean significantly more
    >>> work done before reporting the error, but do we really care? I'm
    >>> leaning to this solution because it would not require exporting state
    >>> from the parallel restore control logic.
    >> Would exporting a bit of state be so bad?
    > The threaded case seems a bit messy, and frankly I don't believe that
    > we'd be buying anything. The error case never actually occurs in the real
    > world, except perhaps on corrupted archive files, so why should we care
    > about performance for it?

    OK, I can buy that.

    cheers

    andrew
  • Magnus Hagander at Jun 23, 2010 at 9:37 am

    On Wed, Jun 23, 2010 at 03:26, Tom Lane wrote:
    > Andrew Dunstan <andrew@dunslane.net> writes:
    >>> 4. Is there any value in back-porting the Windows FSEEKO support into
    >>> 8.3 and 8.2?  Arguably, not writing the data offsets is a performance
    >>> bug.  However a back-port won't do anything for people who are dumping
    >>> with less than the latest minor release of pg_dump, so doing this might
    >>> be largely wasted effort.
    >> I doubt it's worth it, but I could be persuaded otherwise.
    > I'm leaning in that direction too.  Anybody who's doing a version
    > upgrade really ought to be using the newer pg_dump version anyway ...

    +1 on not backpatching that stuff - it's build system related, so it's
    kind of fragile on the windows side :-)
  • Greg Stark at Jun 23, 2010 at 10:20 am

    On Wed, Jun 23, 2010 at 2:26 AM, Tom Lane wrote:
    > Uh, that doesn't fix anything: if you can't seek, a TOC at the end of
    > the file is useless.  And the cases where the writer can't seek are
    > likely to be identically the ones where the reader can't seek, viz
    > pg_dump piped to pg_restore (perhaps with some other programs between).

    That seems like a tenuous leap. A typical reason for the pipe is to
    transfer it to a different machine and that only has to be done once.

    But I'm not convinced it's such a great idea either for the reason I
    described -- It makes the case where pg_restore has to read through
    the whole archive that much harder to explain to users. So I'm not
    really going to argue for it too strongly. It's also a fair amount of
    extra complexity for not much gain. We would still need the fallback
    code anyways.


    --
    greg

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: Jun 22, '10 at 8:08p
active: Jun 23, '10 at 10:20a
posts: 7
users: 4
website: postgresql.org...
irc: #postgresql
