People have been talking about a parallel version of pg_dump a few
times already. I have been working on some proof-of-concept code for
this feature every now and then and I am planning to contribute this
for 9.1.

There are two main issues with a parallel version of pg_dump:

The first one is that it requires a consistent snapshot among multiple
pg_dump clients, and the second is that the output currently goes to a
single file and it is unclear what to do about multiple processes
writing into a single file.

- There are ideas on how to solve the issue with the consistent
snapshot but in the end you can always solve it by stopping your
application(s). I actually assume that whenever people are interested
in a very fast dump, it is because they are doing some maintenance
task (like migrating to a different server) that involves pg_dump. In
these cases, they would stop their system anyway.
Even if we had consistent snapshots in a future version, would we
forbid people to run parallel dumps against old server versions? What
I suggest is to just display a big warning if run against a server
without consistent snapshot support (which currently is every
version).
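
As a minimal sketch of the synchronized-snapshot idea (assuming
server-side support that this thread treats as not yet existing; the
pg_export_snapshot() / SET TRANSACTION SNAPSHOT spelling, the database,
the table names and the snapshot id below are all assumptions), multiple
dump workers could share one snapshot roughly like this:

# Leader connection: export a snapshot and keep its transaction open for
# the whole dump (an exported snapshot is only valid while the exporting
# session lives; the pg_sleep is a crude stand-in for "stay open").
psql -d mydb <<'EOF' &
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- prints an id, e.g. 00000003-0000001B-1
SELECT pg_sleep(3600);
COMMIT;
EOF
leader=$!

# Worker connections: adopt the leader's snapshot before dumping their
# tables, so every worker sees exactly the same database state.
SNAP='00000003-0000001B-1'     # example value; use the id the leader printed
for t in table_a table_b; do
  psql -q -d mydb <<EOF &
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '$SNAP';
\copy $t to '$t.dat'
COMMIT;
EOF
  workers="$workers $!"
done
wait $workers                  # all data files written against one snapshot
kill $leader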

- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory, similar to the
experimental "files" archive type. Also the idea has come up that you
should be able to specify multiple directories in order to make use of
several physical disk drives. Thinking this further, in order to
manage all the mess that you can create with this, every file of the
same backup needs to have a unique identifier and pg_restore should
have a check parameter that tells you if your backup directory is in a
sane and complete state (think about moving a file from one backup
directory to another one or trying to restore from two directories
which are from different backup sets...).
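
As a rough, purely illustrative sketch (nothing about the layout or the
option spellings below was settled; treat all names as assumptions),
such a directory archive might look like this on disk:

mydump/
  toc.dat            -- table of contents: objects, dependencies, backup id
  2615.dat.gz        -- data of one table, named after its dump id
  2616.dat.gz        -- data of another table
  blob_16408.dat.gz  -- one large object

and be written and sanity-checked along the lines of:

pg_dump -Fd -f mydump mydb    # "-Fd" as an assumed spelling for "directory"
pg_restore --check mydump     # hypothetical option: verify that every file
                              # belongs to the same backup set and none is missing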

The second solution to the single-file-problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".

In fact, technically this is more a parallel pg_restore than a
pg_dump, as you need to respect the dependencies between objects. The
good news is that with the parallel pg_restore of the custom archive
format we already have everything in place for this dependency
checking. The addition is a new archive type that dumps (just in time)
whatever the dependency algorithm decides to restore next.
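
For comparison, here is the serial form this parallelizes, next to a
deliberately naive per-table approximation (host, database and schema
names are placeholders). The naive loop has neither a shared snapshot
nor any dependency ordering, which is exactly what the proposed archive
type would add on top of pg_restore's existing scheduler:

# the serial baseline:
pg_dump -h srchost mydb | psql -h dsthost mydb

# a deliberately naive parallel approximation (illustration only):
pg_dump -h srchost --schema-only mydb | psql -h dsthost mydb  # DDL first, serially
for t in $(psql -h srchost -At \
             -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" mydb); do
  psql -h srchost -c "COPY public.\"$t\" TO STDOUT" mydb \
    | psql -h dsthost -c "COPY public.\"$t\" FROM STDIN" mydb &
done
wait
# Unlike the real proposal, this has no consistent snapshot across workers,
# ignores inter-object dependencies, and creates indexes/constraints before
# the data instead of after it, as parallel pg_restore would.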

This is probably the fastest way that we can copy or upgrade a
database when pg_migrator cannot be used (for example when you migrate
to a different hardware architecture).

As mentioned, I have some working code for the features described
(Unix only). If anybody would like to give it a try now, just let me
know; I'd be happy to get some early test reports, and you could see
what speedup to expect. But before I continue, I'd like to have a
discussion about what people actually want and what the best way
forward is.

I am currently not planning to make parallel dumps work with the
custom format even though this would be possible if we changed the
format to a certain degree.

Comments?


Joachim

  • David Fetter at Mar 29, 2010 at 3:08 pm

    On Mon, Mar 29, 2010 at 04:46:48PM +0200, Joachim Wieland wrote:
    People have been talking about a parallel version of pg_dump a few
    times already. I have been working on some proof-of-concept code for
    this feature every now and then and I am planning to contribute this
    for 9.1.

    There are two main issues with a parallel version of pg_dump:

    The first one is that it requires a consistent snapshot among
    multiple pg_dump clients
    Cloning snapshots seems like the way to fix this. I don't know how
    far this project has drifted from the PostgreSQL code, but you might
    want to look here:

    http://pgfoundry.org/projects/snapclone/
    and the second is that currently the output goes to a single file
    and it is unclear what to do about multiple processes writing into a
    single file.
    I don't think that's a good idea. Coming up with a directory
    structure for the new parallel pg_dump seems like a much better idea.

    Andrew, do you have some notes on this?

    Cheers,
    David.
    --
    David Fetter <david@fetter.org> http://fetter.org/
    Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
    Skype: davidfetter XMPP: david.fetter@gmail.com
    iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

    Remember to vote!
    Consider donating to Postgres: http://www.postgresql.org/about/donate
  • Robert Haas at Mar 29, 2010 at 3:45 pm

    On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland wrote:
    - There are ideas on how to solve the issue with the consistent
    snapshot but in the end you can always solve it by stopping your
    application(s). I actually assume that whenever people are interested
    in a very fast dump, it is because they are doing some maintenance
    task (like migrating to a different server) that involves pg_dump. In
    these cases, they would stop their system anyway.
    Even if we had consistent snapshots in a future version, would we
    forbid people to run parallel dumps against old server versions? What
    I suggest is to just display a big warning if run against a server
    without consistent snapshot support (which currently is every
    version).
    Seems reasonable.
    - Regarding the output of pg_dump I am proposing two solutions. The
    first one is to introduce a new archive type "directory" where each
    table and each blob is a file in a directory, similar to the
    experimental "files" archive type. Also the idea has come up that you
    should be able to specify multiple directories in order to make use of
    several physical disk drives. Thinking this further, in order to
    manage all the mess that you can create with this, every file of the
    same backup needs to have a unique identifier and pg_restore should
    have a check parameter that tells you if your backup directory is in a
    sane and complete state (think about moving a file from one backup
    directory to another one or trying to restore from two directories
    which are from different backup sets...).
    I think that specifying several directories is a piece of complexity
    that would be best left alone for a first version of this. But a
    single directory with multiple files sounds pretty reasonable. Of
    course we'll also need to support that format in non-parallel mode,
    and in pg_restore.
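
    As a rough illustration of what that support could look like from the
    user's side (the option spellings are assumptions modeled on the existing
    custom-format tools, not anything settled in this thread):

      pg_restore -l mydump               # list the TOC of a directory archive
      pg_restore -d newdb mydump         # plain, non-parallel restore
      pg_restore -j 8 -d newdb mydump    # reuse the existing parallel restore
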
    The second solution to the single-file-problem is to generate no
    output at all, i.e. whatever you export from your source database you
    import directly into your target database, which in the end turns out
    to be a parallel form of "pg_dump | psql".
    This is a very interesting idea but you might want to get the other
    thing merged first, as it's going to present a different set of
    issues.
    I am currently not planning to make parallel dumps work with the
    custom format even though this would be possible if we changed the
    format to a certain degree.
    I'm thinking we probably don't want to change the existing formats.

    ...Robert
  • Stefan Kaltenbrunner at Mar 29, 2010 at 5:16 pm

    Robert Haas wrote:
    On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland wrote:
    [...]
    - Regarding the output of pg_dump I am proposing two solutions. The
    first one is to introduce a new archive type "directory" where each
    table and each blob is a file in a directory, similar to the
    experimental "files" archive type. Also the idea has come up that you
    should be able to specify multiple directories in order to make use of
    several physical disk drives. Thinking this further, in order to
    manage all the mess that you can create with this, every file of the
    same backup needs to have a unique identifier and pg_restore should
    have a check parameter that tells you if your backup directory is in a
    sane and complete state (think about moving a file from one backup
    directory to another one or trying to restore from two directories
    which are from different backup sets...).
    I think that specifying several directories is a piece of complexity
    that would be best left alone for a first version of this. But a
    single directory with multiple files sounds pretty reasonable. Of
    course we'll also need to support that format in non-parallel mode,
    and in pg_restore.
    The second solution to the single-file-problem is to generate no
    output at all, i.e. whatever you export from your source database you
    import directly into your target database, which in the end turns out
    to be a parallel form of "pg_dump | psql".
    This is a very interesting idea but you might want to get the other
    thing merged first, as it's going to present a different set of
    issues.
    I had some prior discussion with Joachim (and I suspect I had some
    influence in him trying to implement that) on that.
    The reason why this is really needed is that the current pg_restore -j
    is actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that
    are basically "duplicate this database to that location" (or any
    migration really).
    The example at hand is a 240GB production database with around 850
    tables; it takes ~145min to dump that database single-threaded
    (completely CPU bound), simply loading the SQL using psql can restore
    it in ~150min (again CPU bound, both for COPY and index creation), and
    -j8 brings that down to ~55min.
    So if you do the math (and a bit of handwaving):

    * using pg_dump | psql you get greatest(140,150) -> 150min.
    * using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
    * using a theoretical parallel pg_dump and the existing parallel restore
    you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
    * a parallel dump & restore that can pipeline would end up at
    greatest(50,55) -> 55min


    So a parallel dump alone would only give you a 50% speedup in total time
    for doing a migration/upgrade/dump-to-devbox, despite the fact that it
    uses 8x the resources. A pipelined solution would result in a ~3x speedup
    in total time, and you don't even have to think about issues like having
    enough disk space available on the source/destination to hold a full
    temporary dump (if you don't, you might even have to add some transfer
    time as well).



    Stefan
  • Robert Haas at Mar 29, 2010 at 5:37 pm

    On Mon, Mar 29, 2010 at 1:16 PM, Stefan Kaltenbrunner wrote:
    Robert Haas wrote:
    On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland wrote:
    [...]
    - Regarding the output of pg_dump I am proposing two solutions. The
    first one is to introduce a new archive type "directory" where each
    table and each blob is a file in a directory, similar to the
    experimental "files" archive type. Also the idea has come up that you
    should be able to specify multiple directories in order to make use of
    several physical disk drives. Thinking this further, in order to
    manage all the mess that you can create with this, every file of the
    same backup needs to have a unique identifier and pg_restore should
    have a check parameter that tells you if your backup directory is in a
    sane and complete state (think about moving a file from one backup
    directory to another one or trying to restore from two directories
    which are from different backup sets...).
    I think that specifying several directories is a piece of complexity
    that would be best left alone for a first version of this.  But a
    single directory with multiple files sounds pretty reasonable.  Of
    course we'll also need to support that format in non-parallel mode,
    and in pg_restore.
    The second solution to the single-file-problem is to generate no
    output at all, i.e. whatever you export from your source database you
    import directly into your target database, which in the end turns out
    to be a parallel form of "pg_dump | psql".
    This is a very interesting idea but you might want to get the other
    thing merged first, as it's going to present a different set of
    issues.
    I had some prior discussion with Joachim (and I suspect I had some influence
    in him trying to implement that) on that.
    The reason why this is really needed is that the current pg_restore -j is
    actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that are
    basically "duplicate this database to that location" (or any migration
    really).
    The example at hand is a 240GB production database with around 850 tables; it
    takes ~145min to dump that database single-threaded (completely CPU bound),
    simply loading the SQL using psql can restore it in ~150min (again CPU bound
    both for COPY and index creation), -j8 brings that down to ~55min.
    So if you do the math (and a bit of handwaving):

    * using pg_dump | psql you get greatest(140,150) -> 150min.
    * using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
    * using a theoretical parallel pg_dump and the existing parallel restore you
    would get: 50 (just a guess for how fast it might be) + 55 -> 105min
    * a parallel dump & restore that can pipeline would end up at
    greatest(50,55) -> 55min


    So a parallel dump alone would only give you a 50% speedup in total time for
    doing a migration/upgrade/dump-to-devbox, despite the fact that it uses 8x
    the resources. A pipelined solution would result in a ~3x speedup in total
    time, and you don't even have to think about issues like having enough disk
    space available on the source/destination to hold a full temporary dump (if
    you don't, you might even have to add some transfer time as well).
    It's a great idea - but there are two features here. I've seen many
    patches implementing two features during my relatively short time with
    the project and if the rejection rate hasn't been 100% it's certainly
    been close. If Joachim thinks he's got it all working, by all means
    submit both patches. One can apply over the other if they are
    interdependent. But I STRONGLY suggest separating this into two
    pieces - it is MUCH easier to get things applied that way, for good
    and valid reasons.

    ...Robert
  • Josh Berkus at Mar 29, 2010 at 7:46 pm

    On 3/29/10 7:46 AM, Joachim Wieland wrote:
    I actually assume that whenever people are interested
    in a very fast dump, it is because they are doing some maintenance
    task (like migrating to a different server) that involves pg_dump. In
    these cases, they would stop their system anyway.
    Actually, I'd say that there's a broad set of cases of people who want
    to do a parallel pg_dump while their system is active. Parallel pg_dump
    on a stopped system will help some people (for migration, particularly)
    but parallel pg_dump with snapshot cloning will help a lot more people.

    For example, imagine a user who has a 16-core machine on a 14-drive RAID
    10, and a 100-table 1TB database. At 2am, this person might reasonably
    want to allocate a large portion of the machine resources to the dump by
    giving it 4 threads, without cutting access to the application.
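
    As a sketch of that scenario (the -Fd/-j dump options used below are
    assumptions, not options any pg_dump of this thread's time frame offered;
    nice/ionice merely keep the four dump workers from starving the live
    application):

      ionice -c2 -n7 nice -n10 \
        pg_dump -Fd -j 4 -f /backups/nightly-$(date +%F) mydb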

    So: if parallel dump in single-user mode is what you can get done, then
    do it. We can always improve it later, and we have to start somewhere.
    But we will eventually need parallel pg_dump on active systems, and
    that should remain on the TODO list.


    --
    -- Josh Berkus
    PostgreSQL Experts Inc.
    http://www.pgexperts.com
  • Tom Lane at Mar 29, 2010 at 8:11 pm

    Josh Berkus writes:
    On 3/29/10 7:46 AM, Joachim Wieland wrote:
    I actually assume that whenever people are interested
    in a very fast dump, it is because they are doing some maintenance
    task (like migrating to a different server) that involves pg_dump. In
    these cases, they would stop their system anyway.
    Actually, I'd say that there's a broad set of cases of people who want
    to do a parallel pg_dump while their system is active. Parallel pg_dump
    on a stopped system will help some people (for migration, particularly)
    but parallel pg_dump with snapshot cloning will help a lot more people.
    I doubt that. My thought about it is that parallel dump will suck
    enough resources from the source server, both disk and CPU, that you
    would never want to use it on a live production machine. Not even at
    2am. And your proposed use case is hardly a "broad set" in any case.
    Thus, Joachim's approach seems perfectly sane from here. I certainly
    don't see that there's an argument for spending 10x more development
    effort to pick up such use cases.

    Another question that's worth asking is exactly what the use case would
    be for parallel pg_dump against a live server, whether the snapshots are
    synchronized or not. You will not be able to use that dump as a basis
    for PITR, so there is no practical way of incorporating any changes that
    occur after the dump begins. So what are you making it for? If it's a
    routine backup for disaster recovery, fine, but it's not apparent why
    you want max speed and to heck with live performance for that purpose.
    I think migration to a new server version (that's too incompatible for
    PITR or pg_migrate migration) is really the only likely use case.

    regards, tom lane
  • Robert Haas at Mar 29, 2010 at 8:16 pm

    On Mon, Mar 29, 2010 at 4:11 PM, Tom Lane wrote:
    Josh Berkus <josh@agliodbs.com> writes:
    On 3/29/10 7:46 AM, Joachim Wieland wrote:
    I actually assume that whenever people are interested
    in a very fast dump, it is because they are doing some maintenance
    task (like migrating to a different server) that involves pg_dump. In
    these cases, they would stop their system anyway.
    Actually, I'd say that there's a broad set of cases of people who want
    to do a parallel pg_dump while their system is active.  Parallel pg_dump
    on a stopped system will help some people (for migration, particularly)
    but parallel pg_dump with snapshot cloning will help a lot more people.
    I doubt that.  My thought about it is that parallel dump will suck
    enough resources from the source server, both disk and CPU, that you
    would never want to use it on a live production machine.  Not even at
    2am.  And your proposed use case is hardly a "broad set" in any case.
    Thus, Joachim's approach seems perfectly sane from here.  I certainly
    don't see that there's an argument for spending 10x more development
    effort to pick up such use cases.

    Another question that's worth asking is exactly what the use case would
    be for parallel pg_dump against a live server, whether the snapshots are
    synchronized or not.  You will not be able to use that dump as a basis
    for PITR, so there is no practical way of incorporating any changes that
    occur after the dump begins.  So what are you making it for?  If it's a
    routine backup for disaster recovery, fine, but it's not apparent why
    you want max speed and to heck with live performance for that purpose.
    I think migration to a new server version (that's too incompatible for
    PITR or pg_migrate migration) is really the only likely use case.
    It's completely possible that you could want to clone a server for dev
    and have more CPU and I/O bandwidth available than can be efficiently
    used by a non-parallel pg_dump. But certainly what Joachim is talking
    about will be a good start. I think there is merit to the
    synchronized snapshot stuff for pg_dump and perhaps other applications
    as well, but I think Joachim's (well-taken) point is that we don't
    have to treat it as a hard prerequisite.

    ...Robert
  • Andrew Dunstan at Mar 29, 2010 at 8:43 pm

    Robert Haas wrote:
    It's completely possible that you could want to clone a server for dev
    and have more CPU and I/O bandwidth available than can be efficiently
    used by a non-parallel pg_dump. But certainly what Joachim is talking
    about will be a good start. I think there is merit to the
    synchronized snapshot stuff for pg_dump and perhaps other applications
    as well, but I think Joachim's (well-taken) point is that we don't
    have to treat it as a hard prerequisite.

    Possibly. I think the most useful thing that could be done right now is
    probably the least controversial, namely creating a directory type of
    archive, with support for pg_restore, including parallel pg_restore.

    Personally I think that's worth doing in its own right anyway.

    cheers

    andrew
  • Stefan Kaltenbrunner at Mar 30, 2010 at 6:39 am

    Tom Lane wrote:
    Josh Berkus <josh@agliodbs.com> writes:
    On 3/29/10 7:46 AM, Joachim Wieland wrote:
    I actually assume that whenever people are interested
    in a very fast dump, it is because they are doing some maintenance
    task (like migrating to a different server) that involves pg_dump. In
    these cases, they would stop their system anyway.
    Actually, I'd say that there's a broad set of cases of people who want
    to do a parallel pg_dump while their system is active. Parallel pg_dump
    on a stopped system will help some people (for migration, particularly)
    but parallel pg_dump with snapshot cloning will help a lot more people.
    I doubt that. My thought about it is that parallel dump will suck
    enough resources from the source server, both disk and CPU, that you
    would never want to use it on a live production machine. Not even at
    2am. And your proposed use case is hardly a "broad set" in any case.
    Thus, Joachim's approach seems perfectly sane from here. I certainly
    don't see that there's an argument for spending 10x more development
    effort to pick up such use cases.

    Another question that's worth asking is exactly what the use case would
    be for parallel pg_dump against a live server, whether the snapshots are
    synchronized or not. You will not be able to use that dump as a basis
    for PITR, so there is no practical way of incorporating any changes that
    occur after the dump begins. So what are you making it for? If it's a
    routine backup for disaster recovery, fine, but it's not apparent why
    you want max speed and to heck with live performance for that purpose.
    I think migration to a new server version (that's too incompatible for
    PITR or pg_migrate migration) is really the only likely use case.
    I really doubt that - on fast systems pg_dump is completely CPU
    bottlenecked, and the typical 1-2U hardware you get these days has
    8-16 cores, so simply dedicating a few cores to dumping the database
    during quieter times is very realistic.
    Databases are growing larger and larger, and the single-threaded nature
    of pg_dump makes it very hard to even stay within reasonable time
    limits for doing the backup.


    Stefan
  • Peter Eisentraut at Mar 30, 2010 at 11:02 am

    On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
    on fast systems pg_dump is completely CPU bottlenecked
    Might be useful to profile why that is. I don't think pg_dump has
    historically been developed with CPU efficiency in mind.
  • Pierre C at Mar 30, 2010 at 12:06 pm

    On Tue, 30 Mar 2010 13:01:54 +0200, Peter Eisentraut wrote:
    On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
    on fast systems pg_dump is completely CPU bottlenecked
    Might be useful to profile why that is. I don't think pg_dump has
    historically been developed with CPU efficiency in mind.
    Already done that (I had posted some WIP patches to speed up COPY,
    hopefully I'll have time to finish those one day ;)
    Most of the time spent in the postmaster process during "COPY TO" is in
    the datum -> string functions.
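
    For anyone who wants to reproduce that kind of profile (assuming a Linux
    box with perf available; oprofile works similarly), something like this
    while a long COPY runs shows where the CPU goes:

      perf top                                  # system-wide, while the dump runs
      # or attach to the backend serving the dump (PID from pg_stat_activity):
      perf record -g -p $BACKEND_PID -- sleep 60
      perf report                               # the datum-to-text output
                                                # functions show up near the top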
  • Stefan Kaltenbrunner at Mar 30, 2010 at 12:15 pm

    Peter Eisentraut wrote:
    On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
    on fast systems pg_dump is completely CPU bottlenecked
    Might be useful to profile why that is. I don't think pg_dump has
    historically been developed with CPU efficiency in mind.
    It's not pg_dump that is the problem - it is COPY that is the limit. In
    my specific case, also the fact that a lot of the columns are bytea adds
    to the horrible CPU overhead (fixed in 9.0). Still, our bulk load &
    unload performance is way slower on a per-core comparison than a
    lot of other databases :(


    Stefan
  • Jeff at Apr 1, 2010 at 1:35 pm

    On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:

    Peter Eisentraut wrote:
    On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
    on fast systems pg_dump is completely CPU bottlenecked
    Might be useful to profile why that is. I don't think pg_dump has
    historically been developed with CPU efficiency in mind.
    It's not pg_dump that is the problem - it is COPY that is the limit.
    In my specific case, also the fact that a lot of the columns are bytea
    adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk
    load & unload performance is way slower on a per-core
    comparison than a lot of other databases :(
    Don't forget the zlib compression used in -Fc (unless you use -Z0)
    takes a fair amount of CPU too.
    I did some tests and it turned out that -Z0 actually took longer than
    -Z1, simply because there was a lot more data to write out; thus I
    became IO bound, not CPU bound.

    There's a thing called pigz around that is a parallel gzip
    implementation - wonder how much of that could be adapted to pg_dump's
    use, as compression does use a considerable amount of time (even at
    -Z1). The biggest problem I can immediately see is that it uses threads.
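
    As a workaround that needs no changes to pg_dump (assuming pigz is
    installed; -p sets its thread count), compression can simply be moved
    outside the dump:

      pg_dump mydb | pigz -p 8 > mydb.sql.gz    # plain format, parallel gzip
      pigz -dc mydb.sql.gz | psql newdb         # matching restore
      # -Fc compresses inside pg_dump itself, so pigz cannot help there
      # without the kind of threaded integration mentioned above.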
  • Stefan Kaltenbrunner at Apr 1, 2010 at 1:51 pm

    Jeff wrote:
    On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:

    Peter Eisentraut wrote:
    On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
    on fast systems pg_dump is completely CPU bottlenecked
    Might be useful to profile why that is. I don't think pg_dump has
    historically been developed with CPU efficiency in mind.
    It's not pg_dump that is the problem - it is COPY that is the limit.
    In my specific case, also the fact that a lot of the columns are bytea
    adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load
    & unload performance is way slower on a per-core comparison
    than a lot of other databases :(
    Don't forget the zlib compression used in -Fc (unless you use -Z0) takes
    a fair amount of CPU too.
    I did some tests and it turned out that -Z0 actually took longer than
    -Z1, simply because there was a lot more data to write out; thus I became
    IO bound, not CPU bound.

    There's a thing called pigz around that is a parallel gzip
    implementation - wonder how much of that could be adapted to pg_dump's
    use, as compression does use a considerable amount of time (even at
    -Z1). The biggest problem I can immediately see is that it uses threads.
    all my numbers are with -Z0 and it is the backend (COPY and/or index
    creation) that is the limit. If you start using compression you are
    shifting the load to pg_dump.


    Stefan
