Heikki Linnakangas writes:
Log Message:
-----------
Use a latch to make startup process wake up and replay immediately when
new WAL arrives via streaming replication. This reduces the latency, and
also allows us to use a longer polling interval, which is good for energy
efficiency.
We still need to poll to check for the appearance of a trigger file, but
the interval is now 5 seconds (instead of 100ms), like when waiting for
a new WAL segment to appear in WAL archive.
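
As an aside, the latch mechanism here boils down to "sleep with a timeout,
but let another process wake us early". A minimal, self-contained sketch of
that pattern, using a plain pipe and select() rather than PostgreSQL's
actual latch code:

/* Illustration of latch-style waiting, not PostgreSQL's implementation:
 * the waiter sleeps for up to 5 seconds, but wakes immediately if anyone
 * writes a byte into the "latch" pipe. */
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

static int latch_pipe[2];        /* [0] = read end, [1] = write end */

static void init_latch(void) { pipe(latch_pipe); }
static void set_latch(void)  { write(latch_pipe[1], "x", 1); }  /* wake waiter */

/* Returns 1 if woken by set_latch(), 0 if the timeout expired. */
static int wait_latch(int timeout_sec)
{
    fd_set readfds;
    struct timeval tv = { timeout_sec, 0 };
    char buf[16];

    FD_ZERO(&readfds);
    FD_SET(latch_pipe[0], &readfds);
    if (select(latch_pipe[0] + 1, &readfds, NULL, NULL, &tv) > 0)
    {
        read(latch_pipe[0], buf, sizeof(buf));    /* reset the latch */
        return 1;
    }
    return 0;
}

int main(void)
{
    init_latch();
    if (fork() == 0)             /* child: pretend new WAL arrives after 2s */
    {
        sleep(2);
        set_latch();
        _exit(0);
    }
    if (wait_latch(5))
        puts("woken early: new WAL to replay");
    else
        puts("timed out: check for a trigger file instead");
    return 0;
}

The real latch additionally has to cope with signals and with Windows, but
the polling-loop shape is the same.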
This is just speculation at this point, because I haven't taken time
to think through the details, but couldn't we improve on that still
further?

There are always going to be some conditions that we have to poll for,
in particular death of the postmaster (since Unix unaccountably fails
to provide a SIGPARENT signal condition :-(). However, postmaster death
isn't really something that needs an instant response IMO. I would like
to get the wakeup-and-poll interval for our background processes down to
a minute or so; so far as postmaster death goes that doesn't seem like
an unacceptable response time.
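
For what it's worth, the usual substitute for the missing signal is to poll
getppid(): when the parent exits, the child is re-parented and the value
changes. A trivial sketch of that pattern (not PostgreSQL's actual check):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    pid_t original_parent = getppid();

    for (;;)
    {
        sleep(60);            /* the minute-or-so poll interval suggested above */
        if (getppid() != original_parent)
        {
            fprintf(stderr, "parent died, exiting\n");
            return 1;
        }
    }
}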

So I'm wondering if we couldn't eliminate the five-second sleep
requirement here too. It's problematic anyhow, since somebody looking
for energy efficiency will still feel it's too short, while somebody
concerned about fast failover will feel it's too long. Could the
standby triggering protocol be modified so that it involves sending a
signal, not just creating a file? (One issue is that it's not clear
what that'd translate to on Windows.)

regards, tom lane

  • Heikki Linnakangas at Sep 15, 2010 at 2:16 pm

    On 15/09/10 16:55, Tom Lane wrote:
    So I'm wondering if we couldn't eliminate the five-second sleep
    requirement here too. It's problematic anyhow, since somebody looking
    for energy efficiency will still feel it's too short, while somebody
    concerned about fast failover will feel it's too long.
    Yep.
    Could the
    standby triggering protocol be modified so that it involves sending a
    signal, not just creating a file?
    Seems reasonable, at least if we still provide an option for more
    frequent polling with no need to send a signal.
    (One issue is that it's not clear what that'd translate to on Windows.)
    pg_ctl failover ? At the moment, the location of the trigger file is
    configurable, but if we accept a constant location like
    "$PGDATA/failover", pg_ctl could do the whole thing: create the file and
    send the signal. pg_ctl on Windows already knows how to send the "signal" via
    the named pipe signal emulation.
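
    For illustration only, here is roughly what a Unix-side "pg_ctl failover"
    could do under that scheme; the fixed file name, the choice of signal, and
    the way the postmaster pid is obtained are assumptions rather than actual
    pg_ctl code:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        char  path[1024];
        FILE *f;

        if (argc != 3)
        {
            fprintf(stderr, "usage: %s <datadir> <postmaster-pid>\n", argv[0]);
            return 1;
        }

        /* 1. Create the proposed fixed trigger file, $PGDATA/failover. */
        snprintf(path, sizeof(path), "%s/failover", argv[1]);
        if ((f = fopen(path, "w")) == NULL)
        {
            perror("creating trigger file");
            return 1;
        }
        fclose(f);

        /* 2. Wake the server so the trigger is noticed without polling delay;
         *    on Windows this would go through the named pipe signal emulation. */
        if (kill((pid_t) atol(argv[2]), SIGUSR1) != 0)
        {
            perror("sending signal");
            return 1;
        }
        return 0;
    }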

    Fujii-san suggested that we might have a user-defined function for
    triggering failover as well. That's also handy, but it's not a
    replacement because it only works in hot standby mode.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Fujii Masao at Sep 16, 2010 at 4:23 am

    On Wed, Sep 15, 2010 at 11:14 PM, Heikki Linnakangas wrote:
    (One issue is that it's not clear what that'd translate to on Windows.)
    pg_ctl failover ? At the moment, the location of the trigger file is
    configurable, but if we accept a constant location like "$PGDATA/failover"
    pg_ctl could do the whole thing: create the file and send the signal. pg_ctl on
    Windows already knows how to send the "signal" via the named pipe signal
    emulation.
    Right.
    Fujii-san suggested that we might have a user-defined function for
    triggering failover as well.
    The attached patch introduces such a user-defined function. This is
    useful especially when clusterware like pgpool-II is located on a remote
    server, since it can then trigger failover without using something like ssh.
    That's also handy, but it's not a replacement
    because it only works in hot standby mode.
    Yep.

    And should we increase the sleep time in walsender's poll loop (i.e.,
    increase the default value of wal_sender_delay) too? Currently it's
    very small, 200ms.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Fujii Masao at Sep 16, 2010 at 5:05 am

    On Thu, Sep 16, 2010 at 1:23 PM, Fujii Masao wrote:
    Fujii-san suggested that we might have a user-defined function for
    triggering failover as well.
    The attached patch introduces such a user-defined function. This is
    useful especially when clusterware like pgpool-II is located on a remote
    server, since it can then trigger failover without using something like ssh.
    I forgot to check if the caller of that function has superuser permission.
    Here is the updated version.
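
    To illustrate the shape of such a thing (this is not the attached patch;
    the function name and the omitted signalling mechanism are assumptions), a
    superuser-checked, SQL-callable trigger function written in C might look
    roughly like:

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"      /* superuser() */

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(pg_trigger_failover_demo);

    Datum
    pg_trigger_failover_demo(PG_FUNCTION_ARGS)
    {
        /* Refuse to run for ordinary users. */
        if (!superuser())
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                     errmsg("must be superuser to trigger failover")));

        /* A real implementation would set the trigger condition and wake
         * the startup process here; that part is left out of this sketch. */
        ereport(LOG, (errmsg("failover requested")));

        PG_RETURN_BOOL(true);
    }

    It would then be exposed with CREATE FUNCTION ... LANGUAGE C and called
    over a normal connection, e.g. by pgpool-II on a remote host.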

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Simon Riggs at Sep 15, 2010 at 2:36 pm

    On Wed, 2010-09-15 at 20:14 +0900, Fujii Masao wrote:
    On Wed, Sep 15, 2010 at 7:35 PM, Heikki Linnakangas
    wrote:
    Log Message:
    -----------
    Use a latch to make startup process wake up and replay immediately when
    new WAL arrives via streaming replication. This reduces the latency, and
    also allows us to use a longer polling interval, which is good for energy
    efficiency.

    We still need to poll to check for the appearance of a trigger file, but
    the interval is now 5 seconds (instead of 100ms), like when waiting for
    a new WAL segment to appear in WAL archive.
    Good work!
    No, not good work.

    You both know very well that I'm working on this area also and these
    commits are not agreed... yet. They might not be contended but they are
    very likely to break my patch, again.

    Please desist while we resolve which are the good ideas and which are
    not. We won't know that if you keep breaking other people's patches in a
    stream of commits that prevent anybody completing other options.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • David Fetter at Sep 15, 2010 at 3:32 pm

    On Wed, Sep 15, 2010 at 03:35:30PM +0100, Simon Riggs wrote:
    On Wed, 2010-09-15 at 20:14 +0900, Fujii Masao wrote:
    On Wed, Sep 15, 2010 at 7:35 PM, Heikki Linnakangas
    wrote:
    Log Message:
    -----------
    Use a latch to make startup process wake up and replay immediately when
    new WAL arrives via streaming replication. This reduces the latency, and
    also allows us to use a longer polling interval, which is good for energy
    efficiency.

    We still need to poll to check for the appearance of a trigger file, but
    the interval is now 5 seconds (instead of 100ms), like when waiting for
    a new WAL segment to appear in WAL archive.
    Good work!
    No, not good work.

    You both know very well that I'm working on this area also and these
    commits are not agreed... yet. They might not be contended but they are
    very likely to break my patch, again.

    Please desist while we resolve which are the good ideas and which are
    not. We won't know that if you keep breaking other people's patches in a
    stream of commits that prevent anybody completing other options.
    Simon,

    No matter how many times you try, you are not going to get a license
    to stop all work on anything you might chance to think about. It is
    quite simply never going to happen, so you need to back off.

    Cheers,
    David.
    --
    David Fetter <david@fetter.org> http://fetter.org/
    Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
    Skype: davidfetter XMPP: david.fetter@gmail.com
    iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

    Remember to vote!
    Consider donating to Postgres: http://www.postgresql.org/about/donate
  • Simon Riggs at Sep 15, 2010 at 3:40 pm

    On Wed, 2010-09-15 at 07:59 -0700, David Fetter wrote:
    On Wed, Sep 15, 2010 at 03:35:30PM +0100, Simon Riggs wrote:

    Please desist while we resolve which are the good ideas and which are
    not. We won't know that if you keep breaking other people's patches in a
    stream of commits that prevent anybody completing other options.
    No matter how many times you try, you are not going to get a license
    to stop all work on anything you might chance to think about. It is
    quite simply never going to happen, so you need to back off.
    I agree that asking people to stop work is not OK. However, I haven't
    asked for development work to stop, only that commits into that area
    stop until proper debate has taken place. Those might be minor commits,
    but they might not. Had I made those commits, they would have been
    called premature by others also.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Sep 15, 2010 at 4:45 pm

    On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs wrote:
    I agree that asking people to stop work is not OK. However, I haven't
    asked for development work to stop, only that commits into that area
    stop until proper debate has taken place. Those might be minor commits,
    but they might not. Had I made those commits, they would have been
    called premature by others also.
    I do not believe that Heikki has done anything inappropriate. We've
    spent weeks discussing the latch facility and its various
    applications.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Simon Riggs at Sep 15, 2010 at 5:30 pm

    On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote:
    On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs wrote:
    I agree that asking people to stop work is not OK. However, I haven't
    asked for development work to stop, only that commits into that area
    stop until proper debate has taken place. Those might be minor commits,
    but they might not. Had I made those commits, they would have been
    called premature by others also.
    I do not believe that Heikki has done anything inappropriate. We've
    spent weeks discussing the latch facility and its various
    applications.
    Sounds reasonable, but my comments were about this commit, not the one
    that happened on Saturday. This patch was posted about 32 hours ago, and
    the commit need not have taken place yet. If I had posted such a patch
    and committed it knowing other work is happening in that area we both
    know that you would have objected.

    It's not actually a major issue, but at some point I have to ask for no
    more commits, so Fujii and I can finish our patches, compare and
    contrast, so the best ideas can get into Postgres.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Sep 15, 2010 at 5:58 pm

    On Wed, Sep 15, 2010 at 1:30 PM, Simon Riggs wrote:
    On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote:
    On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs wrote:
    I agree that asking people to stop work is not OK. However, I haven't
    asked for development work to stop, only that commits into that area
    stop until proper debate has taken place. Those might be minor commits,
    but they might not. Had I made those commits, they would have been
    called premature by others also.
    I do not believe that Heikki has done anything inappropriate.  We've
    spent weeks discussing the latch facility and its various
    applications.
    Sounds reasonable, but my comments were about this commit, not the one
    that happened on Saturday. This patch was posted about 32 hours ago, and
    the commit need not have taken place yet. If I had posted such a patch
    and committed it knowing other work is happening in that area we both
    know that you would have objected.
    I've often felt that we ought to have a bit more delay between when
    committers post patches and when they commit them. I was told 24
    hours and I've seen cases where people haven't even waited that long.
    On the other hand, if we get too strict about it, it can easily get to
    the point where it just gets in the way of progress, and certainly
    some patches are far more controversial than others. So I don't know
    what the best thing to do is. Still, I have to admit that I feel
    fairly positive about the direction we're going with this particular
    patch. Clearing away these peripheral issues should make it easier
    for us to have a rational discussion about the core issues around how
    this is going to be configured and actually work at the protocol
    level.
    It's not actually a major issue, but at some point I have to ask for no
    more commits, so Fujii and I can finish our patches, compare and
    contrast, so the best ideas can get into Postgres.
    I don't think anyone is prepared to agree to that. I think that
    everyone is prepared to accept a limited amount of further delay in
    pressing forward with the main part of sync rep, but I expect that no
    one will be willing to freeze out incremental improvements in the
    meantime, even if it does induce a certain amount of rebasing. It's
    also worth noting that Fujii Masao's patch has been around for months,
    and yours isn't finished yet. That's not to say that we don't want to
    consider your ideas, because we do: and you've had more than your
    share of good ones. At the same time, it would be unfair and
    unreasonable to expect work on a patch that is done, and has been done
    for some time, to wait on one that isn't.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Heikki Linnakangas at Sep 15, 2010 at 6:34 pm

    On 15/09/10 20:58, Robert Haas wrote:
    On Wed, Sep 15, 2010 at 1:30 PM, Simon Riggswrote:
    On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote:
    On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggswrote:
    I agree that asking people to stop work is not OK. However, I haven't
    asked for development work to stop, only that commits into that area
    stop until proper debate has taken place. Those might be minor commits,
    but they might not. Had I made those commits, they would have been
    called premature by others also.
    I do not believe that Heikki has done anything inappropriate. We've
    spent weeks discussing the latch facility and its various
    applications.
    Sounds reasonable, but my comments were about this commit, not the one
    that happened on Saturday. This patch was posted about 32 hours ago, and
    the commit need not have taken place yet. If I had posted such a patch
    and committed it knowing other work is happening in that area we both
    know that you would have objected.
    I've often felt that we ought to have a bit more delay between when
    committers post patches and when they commit them. I was told 24
    hours and I've seen cases where people haven't even waited that long.
    On the other hand, if we get too strict about it, it can easily get to
    the point where it just gets in the way of progress, and certainly
    some patches are far more controversial than others. So I don't know
    what the best thing to do is.
    With anything non-trivial, I try to "sleep on it" before committing.
    More so with complicated patches, but it's really up to your own comfort
    level with the patch, and whether you think anyone might have different
    opinions on it. I don't mind quick commits if it's something that has
    been discussed in the past and the committer thinks it's
    non-controversial. There's always the option of complaining afterwards.
    If it comes to that, though, it wasn't really ripe for committing yet.
    (That doesn't apply to gripes about typos or something like that,
    because that happens to me way too often ;-) )
    Still, I have to admit that I feel
    fairly positive about the direction we're going with this particular
    patch. Clearing away these peripheral issues should make it easier
    for us to have a rational discussion about the core issues around how
    this is going to be configured and actually work at the protocol
    level.
    Yeah, I don't think anyone has any qualms about the substance of these
    patches.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Sep 15, 2010 at 7:18 pm

    On Wed, 2010-09-15 at 13:58 -0400, Robert Haas wrote:
    It's not actually a major issue, but at some point I have to ask for no
    more commits, so Fujii and I can finish our patches, compare and
    contrast, so the best ideas can get into Postgres.
    I don't think anyone is prepared to agree to that. I think that
    everyone is prepared to accept a limited amount of further delay in
    pressing forward with the main part of sync rep, but I expect that no
    one will be willing to freeze out incremental improvements in the
    meantime, even if it does induce a certain amount of rebasing.
    It's
    also worth noting that Fujii Masao's patch has been around for months,
    and yours isn't finished yet. That's not to say that we don't want to
    consider your ideas, because we do: and you've had more than your
    share of good ones. At the same time, it would be unfair and
    unreasonable to expect work on a patch that is done, and has been done
    for some time, to wait on one that isn't.
    I understand your viewpoint there. I'm sure we all agree sync rep is a
    very important feature that must get into the next release.

    The only reason my patch exists is because debate around my ideas was
    ruled out on various grounds. One of those was it would take so long to
    develop we shouldn't risk not getting sync rep in this release. I am
    amenable to such arguments (and I make the same one on MERGE, btw, where
    I am getting seriously worried) but the reality is that there is
    actually very little code here and we can definitely do this, whatever
    ideas we pick. I've shown this by providing an almost working version in
    about 4 days work. Will finishing it help?

    We definitely have the time, so the question is, what are the best
    ideas? We must discuss the ideas properly, not just plough forwards
    claiming time pressure when it isn't actually an issue at all. We *need*
    to put the tools down and talk in detail about the best way forwards.

    Before, I had no patch. Now mine "isn't finished". At what point will my
    ideas be reviewed without instant dismissal? If we accept your seniority
    argument, then "never" because even if I finish it you'll say "Fujii was
    there first".

    If who mentioned it first was important, then I'd say I've been
    discussing this for literally years (late 2006) and have regularly
    explained the benefits of the master-side approach I've outlined on list
    every time this has come up (every few months). I have also explained
    the implementation details many times as well, and I'm happy to say that
    latches are pretty much exactly what I described earlier. (I called them
    LSN queues, similar to lwlocks, IIRC). But that's not the whole deal.

    If we simply wanted a patch that was "done" we would have gone with
    Zoltan's, wouldn't we, based on the seniority argument you use above?
    Zoltan's patch didn't perform well at all. Fujii's performs much better.
    However, my proposed approach offers even better performance, so
    whatever argument you use to include Fujii's also applies to mine,
    doesn't it? But that's silly and divisive; it's not about whose patch
    "wins", is it?

    Do we have to benchmark multiple patches to prove which is best? If
    that's the criteria I'll finish my patch and demonstrate that.

    But it doesn't make sense to start committing pieces of Fujii's patch,
    so that I can't ever keep up and as a result "Simon never finished his
    patch, but it sounded good".

    Next steps should be: tools down, discuss what to do. Then go forwards.

    We have time, so let's discuss all of the ideas on the table, not just
    some of them.

    For me this is not about the number or names of parameters; it's about
    master-side control of sync rep and having very good performance.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Sep 15, 2010 at 8:01 pm

    On Wed, Sep 15, 2010 at 3:18 PM, Simon Riggs wrote:
    Will finishing it help?
    Yes, I expect that to help a lot.
    Before, I had no patch. Now mine "isn't finished". At what point will my
    ideas be reviewed without instant dismissal? If we accept your seniority
    argument, then "never" because even if I finish it you'll say "Fujii was
    there first".
    I said very clearly in my previous email that "I think that everyone
    is prepared to accept a limited amount of further delay in pressing
    forward with the main part of sync rep". In other words, I think
    everyone is willing to consider your ideas provided that they are
    submitted in a form which everyone can understand and think through
    sometime soon. I am not, nor do I think anyone is, saying that we
    don't wish to consider your ideas. I'm actually really pleased that
    you are only a day or two from having a working patch. It can be much
    easier to conceptualize a patch than to find the time to finish it
    (unfortunately, this problem has overtaken me rather badly in the last
    few weeks, which is why I have no new patches in this CommitFest) and
    if you can finish it up and get it out in front of everyone I expect
    that to be a good thing for this feature and our community.
    Do we have to benchmark multiple patches to prove which is best? If
    that's the criteria I'll finish my patch and demonstrate that.
    I was thinking about that earlier today. I think it's definitely
    possible that we'll need to do some benchmarking, although I expect
    that people will want to read the code first.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Fujii Masao at Sep 17, 2010 at 5:33 am

    On Thu, Sep 16, 2010 at 4:18 AM, Simon Riggs wrote:
    We definitely have the time, so the question is, what are the best
    ideas?
    Before advancing the review of each patch, we must determine what
    should be committed in 9.1, and what's in this CF.

    "Synchronization level on per-transaction" feature is included in Simon's
    patch, but not in mine. This is the most important difference, which would
    have wide-reaching impact on the implementation, e.g., protocol between
    walsender and walreceiver. So, at first we should determine whether we'll
    commit the feature in 9.1. Then we need to determine how far we should
    implement in this CF. Thought?

    Each patch provides "synchronization level on per-standby" feature. In
    Simon's patch, that level is specified in the standby's recovery.conf.
    In mine, it's in the master's standbys.conf. I think that the former is simpler.
    But if we support the capability to register the standbys, the latter would
    be required. Which is the best?

    Simon's patch seems to include a simple quorum commit feature (correct
    me if I'm wrong). That is, when there are multiple synchronous standbys,
    the master waits until ACK has arrived from at least one standby. OTOH,
    in my patch, the master waits until ACK has arrived from all the synchronous
    standbys. Which should we choose? I think that we should commit my
    straightforward approach first, and enable the quorum commit on that.
    Thought?

    Simon proposes to invoke walwriter in the standby. This is not included
    in my patch, but looks like a good idea. ISTM that this is not an essential
    feature for synchronous replication, so how about detaching the walwriter
    part from the patch and reviewing it independently?

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Simon Riggs at Sep 17, 2010 at 7:06 am

    On Fri, 2010-09-17 at 14:33 +0900, Fujii Masao wrote:
    On Thu, Sep 16, 2010 at 4:18 AM, Simon Riggs wrote:
    We definitely have the time, so the question is, what are the best
    ideas?
    Before advancing the review of each patch, we must determine what
    should be committed in 9.1, and what's in this CF.
    Thank you for starting the discussion.
    "Synchronization level on per-transaction" feature is included in Simon's
    patch, but not in mine. This is the most important difference
    Agreed. It's also a very important option for users.
    which would
    have wide-reaching impact on the implementation, e.g., protocol between
    walsender and walreceiver. So, at first we should determine whether we'll
    commit the feature in 9.1. Then we need to determine how far we should
    implement in this CF. Thought?
    Yes, sync rep specified per-transaction changes many things at a low
    level. Basically, we have a choice of two mostly incompatible
    implementations, plus some other options common to both.

    There is no danger that we won't commit in 9.1. We have time for
    discussion and thought. We also have time for performance testing and
    since many of my design proposals are performance related that seems
    essential to properly reviewing the patches.

    I don't think we can determine how far to implement without considering
    both approaches in detail. With regard to your points below, I don't
    think any of those points could be committed first.
    Each patch provides "synchronization level on per-standby" feature. In
    Simon's patch, that level is specified in the standby's recovery.conf.
    In mine, it's in the master's standbys.conf. I think that the former is simpler.
    But if we support the capability to register the standbys, the latter would
    be required. Which is the best?
    Either approach is OK for me. Providing both options is also possible.
    My approach was just less code and less change to existing mechanisms,
    so I did it that way.

    There are some small optimisations possible on standby if the standby
    knows what role it's being asked to play. It doesn't matter to me
    whether we let standby tell master or master tell standby and the code
    is about the same either way.
    Simon's patch seems to include simple quorum commit feature (correct
    me if I'm wrong). That is, when there are multiple synchronous standbys,
    the master waits until ACK has arrived from at least one standby. OTOH,
    in my patch, the master waits until ACK has arrived from all the synchronous
    standbys. Which should we choose? I think that we should commit my
    straightforward approach first, and enable the quorum commit on that.
    Thought?
    Yes, my approach is simple. For those with Oracle knowledge, my approach
    (first-reply-releases-waiter) is equivalent to Oracle's Maximum
    Protection mode (= 'fsync' in my design). Providing even higher levels
    of protection would not be the most common case.

    Your approach of waiting for all replies is much slower and requires
    more complex code, since we need to track intermediate states. It also
    has additional complexities of behaviour, such as how long do we wait
    for second acknowledgement when we already have one, and what happens
    when a second ack is not received? More failure modes == less stable.
    ISTM that it would require more effort to do this also, since every ack
    needs to check all WAL sender data to see if it is the last ack. None of
    that seems straightforward.

    I don't agree we should commit your approach to that aspect.

    In my proposal, such additional features would be possible as a plugin.
    The majority of users would not need this facility, and the plugin leaves the
    way open for high-end users that need this.
    Simon proposes to invoke walwriter in the standby. This is not included
    in my patch, but looks like a good idea. ISTM that this is not an essential
    feature for synchronous replication, so how about detaching the walwriter
    part from the patch and reviewing it independently?
    I regard it as an essential feature for implementing 'recv' mode of sync
    rep, which is the fastest mode. At present WALreceiver does all of
    these: receive, write and fsync. Of those the fsync is the slowest and
    increases response time significantly.

    Of course the 'recv' option doesn't need to be part of the first commit, but
    splitting commits doesn't seem likely to make this go quicker or easier
    in the early stages. In particular, splitting some features out could
    make it much harder to put back in again later. That point is why my
    patch even exists.


    I would like to express my regret that the main feature proposal from me
    necessitates low level changes that cause our two patches to be in
    conflict. Nobody should take this as a sign that there is a personal or
    professional problem between Fujii-san and myself.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Heikki Linnakangas at Sep 17, 2010 at 8:10 am
    (changed subject again.)
    On 17/09/10 10:06, Simon Riggs wrote:
    I don't think we can determine how far to implement without considering
    both approaches in detail. With regard to your points below, I don't
    think any of those points could be committed first.
    Yeah, I think we need to decide on the desired feature set first, before
    we dig deeper into the patches. The design and implementation will
    fall out of that.

    That said, there's a few small things that can be progressed regardless
    of the details of synchronous replication. There's the changes to
    trigger failover with a signal, and it seems that we'll need some libpq
    changes to allow acknowledgments to be sent back to the master
    regardless of the rest of the design. We can discuss those in separate
    threads in parallel.

    So the big question is what the user interface looks like. How does one
    configure synchronous replication, and what options are available.
    Here's a list of features that have been discussed. We don't necessarily
    need all of them in the first phase, but let's avoid painting ourselves
    into a corner.

    * Support multiple standbys with various synchronization levels.

    * What happens if a synchronous standby isn't connected at the moment?
    Return immediately vs. wait forever.

    * Per-transaction control. Some transactions are important, others are not.

    * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all
    servers can be seen as important special cases of this.

    * async, recv, fsync and replay levels of synchronization.

    So what should the user interface be like? Given the 1st and 2nd
    requirement, we need standby registration. If some standbys are
    important and others are not, the master needs to distinguish between
    them to be able to determine that a transaction is safely delivered to
    the important standbys.

    For per-transaction control, ISTM it would be enough to have a simple
    user-settable GUC like synchronous_commit. Let's call it
    "synchronous_replication_commit" for now. For non-critical transactions,
    you can turn it off. That's very simple for developers to understand and
    use. I don't think we need more fine-grained control than that at
    transaction level; in all the use cases I can think of you have a stream
    of important transactions, mixed with non-important ones like log
    messages that you want to finish fast in a best-effort fashion. I'm
    actually tempted to tie that to the existing synchronous_commit GUC, the
    use case seems exactly the same.

    OTOH, if we do want fine-grained per-transaction control, a simple
    boolean or even an enum GUC doesn't really cut it. For truly
    fine-grained control you want to be able to specify exceptions like
    "wait until this is replayed in slave named 'reporting'" or "don't wait
    for acknowledgment from slave named 'uk-server'". With standby
    registration, we can invent a syntax for specifying overriding rules in
    the transaction. Something like SET replication_exceptions =
    'reporting=replay, uk-server=async'.

    For the control between async/recv/fsync/replay, I like to think in
    terms of
    a) asynchronous vs synchronous
    b) if it's synchronous, how synchronous is it? recv, fsync or replay?

    I think it makes most sense to set sync vs. async in the master, and the
    level of synchronicity in the slave. Although I have sympathy for the
    argument that it's simpler if you configure it all from the master side
    as well.

    Putting all of that together, I think Fujii-san's standby.conf is pretty
    close. What it needs is the additional GUC for transaction-level control.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Sep 17, 2010 at 8:15 am

    On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
    That said, there's a few small things that can be progressed
    regardless of the details of synchronous replication. There's the
    changes to trigger failover with a signal, and it seems that we'll
    need some libpq changes to allow acknowledgments to be sent back to
    the master regardless of the rest of the design. We can discuss those
    in separate threads in parallel.
    Agree to both of those points.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Simon Riggs at Sep 17, 2010 at 9:20 am

    On Fri, 2010-09-17 at 09:15 +0100, Simon Riggs wrote:
    On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
    That said, there's a few small things that can be progressed
    regardless of the details of synchronous replication. There's the
    changes to trigger failover with a signal, and it seems that we'll
    need some libpq changes to allow acknowledgments to be sent back to
    the master regardless of the rest of the design. We can discuss those
    in separate threads in parallel.
    Agree to both of those points.
    But I don't agree that those things should be committed just yet.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Dimitri Fontaine at Sep 17, 2010 at 9:10 am

    Heikki Linnakangas writes:
    * Support multiple standbys with various synchronization levels.

    * What happens if a synchronous standby isn't connected at the moment?
    Return immediately vs. wait forever.

    * Per-transaction control. Some transactions are important, others are not.

    * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers
    can be seen as important special cases of this.

    * async, recv, fsync and replay levels of synchronization.

    So what should the user interface be like? Given the 1st and 2nd
    requirement, we need standby registration. If some standbys are important
    and others are not, the master needs to distinguish between them to be able
    to determine that a transaction is safely delivered to the important
    standbys.
    Well the 1st point can be handled in a distributed fashion, where the
    sync level is set up at the slave. Ditto for the second point: you can get
    the exact same behavior control attached to the quorum facility.

    What I think your description is missing is the implicit feature that
    you want to be able to set up the "ignore-or-wait" failure behavior per
    standby. I'm not sure we need that, or more precisely that we need to
    have that level of detail in the master's setup.

    Maybe what we need instead is a more detailed quorum facility, but as
    you're talking about something similar later in the mail, let's follow
    you.
    For per-transaction control, ISTM it would be enough to have a simple
    user-settable GUC like synchronous_commit. Let's call it
    "synchronous_replication_commit" for now. For non-critical transactions, you
    can turn it off. That's very simple for developers to understand and use. I
    don't think we need more fine-grained control than that at transaction
    level, in all the use cases I can think of you have a stream of important
    transactions, mixed with non-important ones like log messages that you want
    to finish fast in a best-effort fashion. I'm actually tempted to tie that to
    the existing synchronous_commit GUC, the use case seems exactly the
    same.
    Well, that would be an oversimplification. In my applications I set the
    "sessions" transactions to synchronous_commit = off, but the business
    transactions to synchronous_commit = on. Now, among the latter, I have
    backoffice editing and money transactions. I'm not willing to be forced
    to endure the same performance penalty for both when I know the
    distributed durability needs aren't the same.
    OTOH, if we do want fine-grained per-transaction control, a simple boolean
    or even an enum GUC doesn't really cut it. For truly fine-grained control
    you want to be able to specify exceptions like "wait until this is replayed
    in slave named 'reporting'" or 'don't wait for acknowledgment from slave
    named 'uk-server'". With standby registration, we can invent a syntax for
    specifying overriding rules in the transaction. Something like SET
    replication_exceptions = 'reporting=replay, uk-server=async'.
    Then you want to be able to have more than one reporting server and need
    only one of them at the "replay" level, but you don't need to know which
    it is. Or on the contrary you have a failover server and you want to be
    sure this one is at the replay level whatever happens.

    Then you want topology flexibility: you need to be able to replace a
    reporting server with another, ditto for the failover one.

    Did I tell you my current thinking on how to tackle that yet? :) Using a
    distributed setup, where each slave has a weight (several votes per
    transaction) and a sync level it offers, would allow that, I think.

    Now something similar to your idea that I can see a need for is being
    able to have a multi-part quorum target: when you currently say that you
    want 2 votes for sync, you would be able to say you want 2 votes for
    recv, 2 for fsync and 1 for replay. Remember that any slave is set up to
    offer only one level of synchronicity but can offer multiple votes.

    How would this look in the setup? Best would be to register the
    different service levels your application needs. Time to bikeshed a
    little?

    sync_rep_services = {critical:  recv=2, fsync=2, replay=1;
                         important: fsync=3;
                         reporting: recv=2, apply=1}

    Well you get the idea, it could maybe get stored on a catalog somewhere
    with nice SQL commands etc. The goal is then to be able to handle a much
    simpler GUC in the application, sync_rep_service = important for
    example. Reserved label would be off, the default value.
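
    To make the accounting concrete, here is a toy sketch of how a master
    could evaluate such a multi-part target; the names and the rule that a
    stronger ack also counts toward the weaker levels are assumptions made
    for illustration, not part of the proposal:

    #include <stdbool.h>
    #include <stdio.h>

    enum sync_level { ASYNC = 0, RECV, FSYNC, REPLAY };

    struct standby
    {
        enum sync_level offers;   /* the one level this slave offers */
        int             votes;    /* how many votes it carries */
        bool            acked;    /* has it acknowledged this commit? */
    };

    /* required[RECV..REPLAY] holds the votes needed at each level. */
    static bool quorum_satisfied(const struct standby *s, int n,
                                 const int required[4])
    {
        int got[4] = { 0, 0, 0, 0 };

        for (int i = 0; i < n; i++)
        {
            if (!s[i].acked)
                continue;
            /* an ack at a stronger level also counts at the weaker ones */
            for (int lvl = RECV; lvl <= s[i].offers; lvl++)
                got[lvl] += s[i].votes;
        }
        for (int lvl = RECV; lvl <= REPLAY; lvl++)
            if (got[lvl] < required[lvl])
                return false;
        return true;
    }

    int main(void)
    {
        struct standby cluster[] = {
            { REPLAY, 1, true  },   /* failover server, already replayed */
            { FSYNC,  2, true  },   /* nearby standby, fsynced           */
            { RECV,   1, false },   /* reporting server, no ack yet      */
        };
        int critical[4] = { 0, 2, 2, 1 };   /* recv=2, fsync=2, replay=1 */

        printf("critical quorum %s\n",
               quorum_satisfied(cluster, 3, critical) ? "reached" : "not yet");
        return 0;
    }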
    For the control between async/recv/fsync/replay, I like to think in terms of
    a) asynchronous vs synchronous
    b) if it's synchronous, how synchronous is it? recv, fsync or replay?
    Same here.
    I think it makes most sense to set sync vs. async in the master, and the
    level of synchronicity in the slave.
    Yeah, exactly.

    If you add a weight to each slave and then a quorum commit, you don't change
    the implementation complexity and you offer a lot of setup flexibility. If
    the slave sync-level and weight are SIGHUP, then it even becomes rather
    easy to switch roles online or to add new servers or to organise a
    maintenance window — the quorum to reach is a per-transaction GUC on the
    master, too, right?

    Regards,
    --
    dim
  • Heikki Linnakangas at Sep 17, 2010 at 9:30 am

    On 17/09/10 12:10, Dimitri Fontaine wrote:
    Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
    * Support multiple standbys with various synchronization levels.

    * What happens if a synchronous standby isn't connected at the moment?
    Return immediately vs. wait forever.

    * Per-transaction control. Some transactions are important, others are not.

    * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers
    can be seen as important special cases of this.

    * async, recv, fsync and replay levels of synchronization.

    So what should the user interface be like? Given the 1st and 2nd
    requirement, we need standby registration. If some standbys are important
    and others are not, the master needs to distinguish between them to be able
    to determine that a transaction is safely delivered to the important
    standbys.
    Well the 1st point can be handled in a distributed fashion, where the
    sync level is setup at the slave.
    If the synchronicity is configured in the standby, how does the master
    know that there's a synchronous slave out there that it should wait for,
    if that slave isn't connected at the moment?
    OTOH, if we do want fine-grained per-transaction control, a simple boolean
    or even an enum GUC doesn't really cut it. For truly fine-grained control
    you want to be able to specify exceptions like "wait until this is replayed
    in slave named 'reporting'" or 'don't wait for acknowledgment from slave
    named 'uk-server'". With standby registration, we can invent a syntax for
    specifying overriding rules in the transaction. Something like SET
    replication_exceptions = 'reporting=replay, uk-server=async'.
    Then you want to be able to have more than one reporting server and need
    only one of them at the "replay" level, but you don't need to know which
    it is. Or on the contrary you have a failover server and you want to be
    sure this one is at the replay level whatever happens.

    Then you want topology flexibility: you need to be able to replace a
    reporting server with another, ditto for the failover one.

    Did I tell you my current thinking on how to tackle that yet? :) Using a
    distributed setup, where each slave has a weight (several votes per
    transaction) and a level offering would allow that I think.
    Yeah, the quorum stuff. That's all good, but doesn't change the way you
    would do per-transaction control. By specifying overrides on a
    per-transaction basis, you can have as fine-grained control as you
    possibly can. Anything you can specify in a configuration file can then
    also be specified per-transaction with overrides. The syntax just needs
    to be flexible enough.

    If we buy into the concept of per-transaction exceptions, we can put
    that issue aside for the moment, and just consider how to configure
    things in a config file. Anything you can express in the config file can
    also be expressed per-transaction with the exceptions GUC.
    Now something similar to your idea that I can see a need for is being
    able to have a multi-part quorum target: when you currently say that you
    want 2 votes for sync, you would be able to say you want 2 votes for
    recv, 2 for fsync and 1 for replay. Remember that any slave is setup to
    offer only one level of synchronicity but can offer multiple votes.

    How this would look like in the setup? Best would be to register the
    different service levels your application need. Time to bikeshed a
    little?

    sync_rep_services = {critical:  recv=2, fsync=2, replay=1;
                         important: fsync=3;
                         reporting: recv=2, apply=1}

    Well you get the idea, it could maybe get stored on a catalog somewhere
    with nice SQL commands etc. The goal is then to be able to handle a much
    simpler GUC in the application, sync_rep_service = important for
    example. Reserved label would be off, the default value
    So ignoring the quorum stuff for a moment, the general idea is that you
    have predefined sets of configurations (or exceptions to the general
    config) specified in a config file, and in the application you just
    choose among those with "sync_rep_service=XXX". Yeah, I like that, it
    allows you to isolate the details of the topology from the application.
    If you add a weight to each slave then a quorum commit, you don't change
    the implementation complexity and you offer lot of setup flexibility. If
    the slave sync-level and weight are SIGHUP, then it even become rather
    easy to switch roles online or to add new servers or to organise a
    maintenance window — the quorum to reach is a per-transaction GUC on the
    master, too, right?
    I haven't bought into the quorum idea yet, but yeah, if we have quorum
    support, then it would be configurable on a per-transaction basis too
    with the above mechanism.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Dimitri Fontaine at Sep 17, 2010 at 10:00 am

    Heikki Linnakangas writes:
    If the synchronicity is configured in the standby, how does the master know
    that there's a synchronous slave out there that it should wait for, if that
    slave isn't connected at the moment?
    That's what quorum is trying to solve. The master knows how many votes
    per sync level the transaction needs. If no slave is acknowledging any
    vote, that's all you need to know to ROLLBACK (after the timeout),
    right? — if setup says so, on the master.
    Yeah, the quorum stuff. That's all good, but doesn't change the way you
    would do per-transaction control.
    That's when I bought in on the feature. It's all dynamic and
    distributed, and it offers per-transaction control.

    Regards,
    --
    Dimitri Fontaine
    PostgreSQL DBA, Architecte
  • Simon Riggs at Sep 17, 2010 at 10:03 am

    On Fri, 2010-09-17 at 12:30 +0300, Heikki Linnakangas wrote:

    If the synchronicity is configured in the standby, how does the master
    know that there's a synchronous slave out there that it should wait for,
    if that slave isn't connected at the moment?
    That isn't a question you need standby registration to answer.

    In my proposal, the user requests a certain level of confirmation and
    will wait until timeout to see if it is received. The standby can crash
    and restart, come back and provide the answer, and it will still work.

    So it is the user request that informs the master that there would
    normally be a synchronous slave out there it should wait for.

    So far, I have added the point that if a user requests a level of
    confirmation that is currently unavailable, then it will use the highest
    level of confirmation available now. That stops us from waiting for
    timeout for every transaction we run if standby goes down hard, which
    just freezes the application for long periods to no real benefit. It
    also prevents applications from requesting durability levels the cluster
    cannot satisfy, in the opinion of the sysadmin, since the sysadmin
    specifies the max level on each standby.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Dimitri Fontaine at Sep 17, 2010 at 11:20 am

    Simon Riggs writes:
    So far, I have added the point that if a user requests a level of
    confirmation that is currently unavailable, then it will use the highest
    level of confirmation available now. That stops us from waiting for
    timeout for every transaction we run if standby goes down hard, which
    just freezes the application for long periods to no real benefit. It
    also prevents applications from requesting durability levels the cluster
    cannot satisfy, in the opinion of the sysadmin, since the sysadmin
    specifies the max level on each standby.
    That sounds like the commit-or-rollback-when-slaves-are-gone question. I
    think this behavior should be user-settable, again per-transaction. I
    agree with you that the general case looks like your proposed default,
    but we already know that some will need "don't ack if not replied before
    the timeout", and they will even go as far as asking for it to be
    reported as a serialisation error of some sort, I guess…

    Regards,
    --
    Dimitri Fontaine
    PostgreSQL DBA, Architecte
  • Simon Riggs at Sep 17, 2010 at 9:49 am

    On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
    (changed subject again.)
    On 17/09/10 10:06, Simon Riggs wrote:
    I don't think we can determine how far to implement without considering
    both approaches in detail. With regard to your points below, I don't
    think any of those points could be committed first.
    Yeah, I think we need to decide on the desired feature set first, before
    we dig deeper into the the patches. The design and implementation will
    fall out of that.
    Well, we've discussed these things many times and talking hasn't got us
    very far on its own. We need measurements and neutral assessments.

    The patches are simple and we have time.

    This isn't just about UI; there are significant and important
    differences between the proposals in terms of the capability and control
    they offer.

    I propose we develop both patches further and performance test them.
    Many of the features I have proposed are performance related and people
    need to be able to see what is important, and what is not. But not
    through mere discussion, we need numbers to show which things matter and
    which things don't. And those need to be derived objectively.
    * Support multiple standbys with various synchronization levels.

    * What happens if a synchronous standby isn't connected at the moment?
    Return immediately vs. wait forever.

    * Per-transaction control. Some transactions are important, others are not.

    * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all
    servers can be seen as important special cases of this.

    * async, recv, fsync and replay levels of synchronization.
    That's a reasonable starting list of points, there may be others.

    So what should the user interface be like? Given the 1st and 2nd
    requirement, we need standby registration. If some standbys are
    important and others are not, the master needs to distinguish between
    them to be able to determine that a transaction is safely delivered to
    the important standbys.
    My patch provides those two requirements without standby registration,
    so we very clearly don't "need" standby registration.

    The question is: do we want standby registration on the master and, if so,
    why?

    For per-transaction control, ISTM it would be enough to have a simple
    user-settable GUC like synchronous_commit. Let's call it
    "synchronous_replication_commit" for now.
    If you wish to change the name of the GUC away from the one I have
    proposed, fine. Please note that aspect isn't important to me and I will
    happily concede all such points to the majority view.
    For non-critical transactions,
    you can turn it off. That's very simple for developers to understand and
    use. I don't think we need more fine-grained control than that at
    transaction level, in all the use cases I can think of you have a stream
    of important transactions, mixed with non-important ones like log
    messages that you want to finish fast in a best-effort fashion.
    Sounds like we're getting somewhere. See below.
    I'm
    actually tempted to tie that to the existing synchronous_commit GUC, the
    use case seems exactly the same.
    http://archives.postgresql.org/pgsql-hackers/2008-07/msg01001.php
    Check the date!

    I think that particular point is going to confuse us. It will draw much
    bike shedding and won't help us decide between patches. It's a nicety
    that can be left to a time after we have the core feature committed.
    OTOH, if we do want fine-grained per-transaction control, a simple
    boolean or even an enum GUC doesn't really cut it. For truly
    fine-grained control you want to be able to specify exceptions like
    "wait until this is replayed in slave named 'reporting'" or 'don't wait
    for acknowledgment from slave named 'uk-server'". With standby
    registration, we can invent a syntax for specifying overriding rules in
    the transaction. Something like SET replication_exceptions =
    'reporting=replay, uk-server=async'.

    For the control between async/recv/fsync/replay, I like to think in
    terms of
    a) asynchronous vs synchronous
    b) if it's synchronous, how synchronous is it? recv, fsync or replay?

    I think it makes most sense to set sync vs. async in the master, and the
    level of synchronicity in the slave. Although I have sympathy for the
    argument that it's simpler if you configure it all from the master side
    as well.
    I have catered for such requests by suggesting a plugin that allows you
    to implement that complexity without overburdening the core code.

    This strikes me as an "ad absurdum" argument. Since the above
    over-complexity would doubtless be seen as insane by Tom et al, it
    attempts to persuade that we don't need recv, fsync and apply either.

    Fujii has long talked about 4 levels of service also. Why change? I had
    thought that part was pretty much agreed between all of us.

    Without performance tests to demonstrate "why", these do sound hard to
    understand. But we should note that DRBD offers recv ("B") and fsync
    ("C") as separate options. And Oracle implements all 3 of recv, fsync
    and apply. Neither of them describes those options as simply and easily
    as the way we are proposing with a 4-valued enum (with async as the
    fourth option).

    If we have only one option for sync_rep = 'on' which of recv | fsync |
    apply would it implement? You don't mention that. Which do you choose?
    For what reason do you make that restriction? The code doesn't get any
    simpler, in my patch at least; from my perspective it would be a
    restriction without benefit.

    I no longer seek to persuade by words alone. The existence of my patch
    means that I think that only measurements and tests will show why I have
    been saying these things. We need performance tests. I'm not ready for
    them today, but will be very soon. I suspect you aren't either since
    from earlier discussions you didn't appear to have much data about overall
    throughput, only about response times for single transactions. I'm happy
    to be proved wrong there.
    Putting all of that together. I think Fujii-san's standby.conf is pretty
    close.
    What it needs is the additional GUC for transaction-level control.
    The difference between the patches is not a simple matter of a GUC.

    My proposal allows a single standby to provide efficient replies to
    multiple requested durability levels all at the same time, with
    efficient use of network resources. ISTM that because the other patch
    cannot provide that you'd like to persuade us that we don't need that,
    ever. You won't sell me on that point, cos I can see lots of uses for
    it.

    Another use case for you:

    * customer orders are important, but we want lots of them, so we use
    recv mode for those.

    * pricing data hardly ever changes, but when it does we need it to be
    applied across the cluster so we don't get read mismatches, so those
    rare transactions use apply mode.

    If you don't want multiple modes at once, you don't need to use that
    feature. But there is no reason to prevent people having the choice,
    when a design exists that can provide it.

    (A separate and later point is that I would one day like to annotate
    specific tables and functions with different modes, so a sysadmin can
    point out which data is important at table level - which is what MySQL
    provides by allowing choice of storage engine for particular tables.
    Nobody cares about the specific engine, they care about the durability
    implications of those choices. This isn't part of the current proposal,
    just a later statement of direction.)

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Heikki Linnakangas at Sep 17, 2010 at 10:41 am

    On 17/09/10 12:49, Simon Riggs wrote:
    This isn't just about UI, there are significant and important
    differences between the proposals in terms of the capability and control
    they offer.
    Sure. The point of focusing on the UI is that the UI demonstrates what
    capability and control a proposal offers.
    So what should the user interface be like? Given the 1st and 2nd
    requirement, we need standby registration. If some standbys are
    important and others are not, the master needs to distinguish between
    them to be able to determine that a transaction is safely delivered to
    the important standbys.
    My patch provides those two requirements without standby registration,
    so we very clearly don't "need" standby registration.
    It's still not clear to me how you would configure things like "wait for
    ack from reporting slave, but not other slaves" or "wait until replayed
    in the server on the west coast" in your proposal. Maybe it's possible,
    but doesn't seem very intuitive, requiring careful configuration in both
    the master and the slaves.

In your proposal, you also need to be careful not to connect e.g. a test
slave with "synchronous_replication_service = apply" to the master, or
it will possibly shadow a real production slave, acknowledging
    transactions that are not yet received by the real slave. It's certainly
    possible to screw up with standby registration too, but you have more
    direct control of the master behavior in the master, instead of
    distributing it across all slaves.
    The question is do we want standby registration on master and if so,
    why?
    Well, aside from how to configure synchronous replication, standby
    registration would help with retaining the right amount of WAL in the
    master. wal_keep_segments doesn't guarantee that enough is retained, and
    OTOH when all standbys are connected you retain much more than might be
    required.

    Giving names to slaves also allows you to view their status in the
    master in a more intuitive format. Something like:

    postgres=# SELECT * FROM pg_slave_status ;
    name    | connected |  received  |   fsyncd   |  applied
------------+-----------+------------+------------+------------
 reporting  | t         | 0/26000020 | 0/26000020 | 0/25550020
 ha-standby | t         | 0/26000020 | 0/26000020 | 0/26000020
 testserver | f         |            | 0/15000020 |
(3 rows)
    For the control between async/recv/fsync/replay, I like to think in
    terms of
    a) asynchronous vs synchronous
    b) if it's synchronous, how synchronous is it? recv, fsync or replay?

    I think it makes most sense to set sync vs. async in the master, and the
    level of synchronicity in the slave. Although I have sympathy for the
    argument that it's simpler if you configure it all from the master side
    as well.
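
A rough sketch of that split, with hypothetical parameter names (nothing
here is settled):

# on the master, as a default or per transaction:
synchronous_replication = on          # a) sync vs. async

# on each slave:
synchronous_replication_level = recv  # b) recv, fsync or apply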
    I have catered for such requests by suggesting a plugin that allows you
    to implement that complexity without overburdening the core code.
    Well, plugins are certainly one possibility, but then we need to design
    the plugin API. I've been thinking along the lines of a proxy, which can
    implement whatever logic you want to decide when to send the
acknowledgment. Either way, if we push any features that people want out
to a proxy or plugin, we need to make sure that the proxy/plugin has all
the necessary information available.
    This strikes me as an "ad absurdum" argument. Since the above
    over-complexity would doubtless be seen as insane by Tom et al, it
    attempts to persuade that we don't need recv, fsync and apply either.

    Fujii has long talked about 4 levels of service also. Why change? I had
    thought that part was pretty much agreed between all of us.
    Now you lost me. I agree that we need 4 levels of service (at least
    ultimately, not necessarily in the first phase).
    Without performance tests to demonstrate "why", these do sound hard to
    understand. But we should note that DRBD offers recv ("B") and fsync
    ("C") as separate options. And Oracle implements all 3 of recv, fsync
and apply. Neither of them describes those options as simply and easily
as what we are proposing: a 4-valued enum (with async as the fourth
option).

    If we have only one option for sync_rep = 'on' which of recv | fsync |
    apply would it implement? You don't mention that. Which do you choose?
    You would choose between recv, fsync and apply in the slave, with a GUC.
    I no longer seek to persuade by words alone. The existence of my patch
    means that I think that only measurements and tests will show why I have
    been saying these things. We need performance tests.
    I don't expect any meaningful differences in terms of performance
    between any of the discussed options. The big question right now is what
    features we provide and how they're configured. Performance will depend
    primarily on the mode you use, and secondarily on the implementation of
    the mode. It would be completely premature to do performance testing yet
    IMHO.
    Putting all of that together. I think Fujii-san's standby.conf is pretty
    close.
    What it needs is the additional GUC for transaction-level control.
    The difference between the patches is not a simple matter of a GUC.

    My proposal allows a single standby to provide efficient replies to
    multiple requested durability levels all at the same time. With
    efficient use of network resources. ISTM that because the other patch
    cannot provide that you'd like to persuade us that we don't need that,
    ever. You won't sell me on that point, cos I can see lots of uses for
    it.
Simon, how the replies are sent is an implementation detail I haven't
given much thought to yet. The reason we delved into that discussion
earlier was that you seemed to contradict yourself with the claims that
you don't need to send more than one reply per transaction, and that the
standby doesn't need to know the synchronization level. Other than the
curiosity about that contradiction, it doesn't seem like a very
interesting detail to me right now. It's not a question that drives the
    rest of the design, but the other way round.

    But FWIW, something like your proposal of sending 3 XLogRecPtrs in each
    reply seems like a good approach. I'm not sure about using walwriter. I
    can see that it helps with getting the 'recv' and 'replay'
    acknowledgments out faster, but I still have the scars from starting
    bgwriter during recovery.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Sep 17, 2010 at 11:31 am

    On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote:
    On 17/09/10 12:49, Simon Riggs wrote:
    This isn't just about UI, there are significant and important
    differences between the proposals in terms of the capability and control
    they offer.
    Sure. The point of focusing on the UI is that the UI demonstrates what
    capability and control a proposal offers.
    My patch does not include server registration. It could be added later
    on top of my patch without any issues.

    The core parts of my patch are the fine grained transaction-level
    control and the ability to mix them dynamically with good performance.

    To me server registration is not a core issue. I'm not actively against
    it, I just don't see the need for it at all. Certainly not committed
    first, especially since its not actually needed by either of our
    patches.

    Standby registration doesn't provide *any* parameter that can't be
    supplied from standby recovery.conf.

    The only thing standby registration allows you to do is know whether
    there was supposed to be a standby there, but yet it isn't there now. I
    don't see that point as being important because it seems strange to me
    to want to wait for a standby that ought to be there, but isn't anymore.
    What happens if it never comes back? Manual intervention required.

    (We agree on how to handle a standby that *is* "connected", yet never
    returns a reply or takes too long to do so).
    So what should the user interface be like? Given the 1st and 2nd
    requirement, we need standby registration. If some standbys are
    important and others are not, the master needs to distinguish between
    them to be able to determine that a transaction is safely delivered to
    the important standbys.
    My patch provides those two requirements without standby registration,
    so we very clearly don't "need" standby registration.
    It's still not clear to me how you would configure things like "wait for
    ack from reporting slave, but not other slaves" or "wait until replayed
    in the server on the west coast" in your proposal. Maybe it's possible,
    but doesn't seem very intuitive, requiring careful configuration in both
    the master and the slaves.
    In the use cases we discussed we had simple 2 or 3 server configs.

    master
    standby1 - preferred sync target - set to recv, fsync or apply
    standby2 - non-preferred sync target, maybe test server - set to async

    So in the two cases you mention we might set

    "wait for ack from reporting slave"
    master: sync_replication = 'recv' #as default, can be changed
    reporting-slave: sync_replication_service = 'recv' #gives max level

    "wait until replayed in the server on the west coast"
    master: sync_replication = 'recv' #as default, can be changed
    west-coast: sync_replication_service = 'apply' #gives max level


    The absence of registration in my patch makes some things easier and
    some things harder. For example, you can add a new standby without
    editing the config on the master.

    If you had 2 standbys, both offering the same level of protection, my
proposal would *not* allow you to specify that you preferred one standby
over another. But we could add a priority parameter as well if that's an
    issue.
In your proposal, you also need to be careful not to connect e.g. a test
slave with "synchronous_replication_service = apply" to the master, or
it will possibly shadow a real production slave, acknowledging
    transactions that are not yet received by the real slave. It's certainly
    possible to screw up with standby registration too, but you have more
    direct control of the master behavior in the master, instead of
    distributing it across all slaves.
    The question is do we want standby registration on master and if so,
    why?
    Well, aside from how to configure synchronous replication, standby
    registration would help with retaining the right amount of WAL in the
    master. wal_keep_segments doesn't guarantee that enough is retained, and
    OTOH when all standbys are connected you retain much more than might be
    required.

    Giving names to slaves also allows you to view their status in the
    master in a more intuitive format. Something like:
    We can give servers a name without registration. It actually makes more
    sense to set the name in the standby and it can be passed through from
    standby when we connect.

    I very much like the idea of server names and think this next SRF looks
    really cool.
    postgres=# SELECT * FROM pg_slave_status ;
    name    | connected |  received  |   fsyncd   |  applied
------------+-----------+------------+------------+------------
 reporting  | t         | 0/26000020 | 0/26000020 | 0/25550020
 ha-standby | t         | 0/26000020 | 0/26000020 | 0/26000020
 testserver | f         |            | 0/15000020 |
(3 rows)
    That could be added on top of my patch also.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Sep 17, 2010 at 11:44 am

    On Fri, Sep 17, 2010 at 7:31 AM, Simon Riggs wrote:
    The only thing standby registration allows you to do is know whether
    there was supposed to be a standby there, but yet it isn't there now. I
    don't see that point as being important because it seems strange to me
    to want to wait for a standby that ought to be there, but isn't anymore.
    What happens if it never comes back? Manual intervention required.

    (We agree on how to handle a standby that *is* "connected", yet never
    returns a reply or takes too long to do so).
    Doesn't Oracle provide a mode where it shuts down if this occurs?
    The absence of registration in my patch makes some things easier and
    some things harder. For example, you can add a new standby without
    editing the config on the master.
    That's actually one of the reasons why I like the idea of
    registration. It seems rather scary to add a new standby without
    editing the config on the master. Actually, adding a new fully-async
    slave without touching the master seems reasonable, but adding a new
    sync slave without touching the master gives me the willies. The
    behavior of the system could change quite sharply when you do this,
    and it might not be obvious what has happened. (Imagine DBA #1 makes
    the change and DBA #2 is then trying to figure out what's happened -
    he checks the configs of all the machines he knows about and finds
    them all unchanged... head-scratching ensues.)

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Aidan Van Dyk at Sep 17, 2010 at 1:36 pm

    * Robert Haas [100917 07:44]:
    On Fri, Sep 17, 2010 at 7:31 AM, Simon Riggs wrote:
    The only thing standby registration allows you to do is know whether
    there was supposed to be a standby there, but yet it isn't there now. I
    don't see that point as being important because it seems strange to me
    to want to wait for a standby that ought to be there, but isn't anymore.
    What happens if it never comes back? Manual intervention required.
    The absence of registration in my patch makes some things easier and
    some things harder. For example, you can add a new standby without
    editing the config on the master.
    That's actually one of the reasons why I like the idea of
    registration. It seems rather scary to add a new standby without
    editing the config on the master. Actually, adding a new fully-async
    slave without touching the master seems reasonable, but adding a new
    sync slave without touching the master gives me the willies. The
    behavior of the system could change quite sharply when you do this,
    and it might not be obvious what has happened. (Imagine DBA #1 makes
    the change and DBA #2 is then trying to figure out what's happened -
    he checks the configs of all the machines he knows about and finds
    them all unchanged... head-scratching ensues.)
    So, those both give me the willies too...

    I've had a rack loose all power. Now, let's say I've got two servers
    (plus trays of disks for each) in the same rack. Ya, I know, I should
    move them to separate racks, preferably in separate buildings on the
    same campus, but realistically...

    I want to have them configured in a fsync WAL/style sync rep, I want to
    make sure that if the master comes up first after I get power back, it's
    not going to be claiming transactions are committed while the slave
    (which happens to have 4x the disks because it keeps PITR backups for a
period too) is still chugging away on SCSI probes and hasn't gotten to
    having PostgreSQL up yet...

And I want to make sure that the dev box another slave setup was being
tested on, which is running in some test area by some other DBA, but not
in the same rack, *can't* through some mis-configuration make my master
think that its production slave has properly fsync'ed the replicated
WAL.

    </hopes & dreams>

    --
    Aidan Van Dyk Create like a god,
    aidan@highrise.ca command like a king,
    http://www.highrise.ca/ work like a slave.
  • Simon Riggs at Sep 17, 2010 at 3:22 pm

    On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote:

    I want to have them configured in a fsync WAL/style sync rep, I want to
    make sure that if the master comes up first after I get power back, it's
    not going to be claiming transactions are committed while the slave
    (which happens to have 4x the disks because it keeps PITR backups for a
period too) is still chugging away on SCSI probes and hasn't gotten to
    having PostgreSQL up yet...
    Nobody has mentioned the ability to persist the not-committed state
    across a crash before, and I think it's an important discussion point.

    We already have it: its called "two phase commit". (2PC)

    If you run 2PC on 3 servers and one goes down, you can just commit the
    in-flight transactions and continue. But it doesn't work on hot standby.

    It could: If we want that we could prepare the transaction on the master
    and don't allow commit until we get positive confirmation from standby.
    All of the machinery is there.
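
For reference, the existing machinery being referred to is the two-phase
commit commands; the "wait for the standby" step in the middle is the
hypothetical new part (the table is made up for illustration):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'syncrep_demo'; -- durable on the master, not committed yet
-- (the master would wait here for positive confirmation from the standby)
COMMIT PREPARED 'syncrep_demo';     -- or ROLLBACK PREPARED if we give up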

I'm not sure if that's a 5th sync rep mode, or whether that idea is
actually good enough to replace all the ideas we've had up until now. I
would say
    probably not, but we should think about this.

A slightly modified idea would be to avoid writing the transaction
prepare file as a separate file and just write the WAL for the prepare.
We then
    remember the LSN of the prepare so we can re-access the WAL copy of it
    by re-reading the WAL files on master. Make sure we don't get rid of WAL
    that refers to waiting transactions. That would then give us the option
    to commit or abort depending upon whether we receive a reply within
    timeout.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Sep 17, 2010 at 3:24 pm

    On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs wrote:
    On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote:

    I want to have them configured in a fsync WAL/style sync rep, I want to
    make sure that if the master comes up first after I get power back, it's
    not going to be claiming transactions are committed while the slave
    (which happens to have 4x the disks because it keeps PITR backups for a
period too) is still chugging away on SCSI probes and hasn't gotten to
    having PostgreSQL up yet...
    Nobody has mentioned the ability to persist the not-committed state
    across a crash before, and I think it's an important discussion point.
    Eh? I think all Aidan is asking for is the ability to have a mode
    where sync rep is really always sync, or nothing commits. Rather than
    timing out and continuing merrily on its way...

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Aidan Van Dyk at Sep 17, 2010 at 3:30 pm

    * Robert Haas [100917 11:24]:
    On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs wrote:
    On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote:

    I want to have them configured in a fsync WAL/style sync rep, I want to
    make sure that if the master comes up first after I get power back, it's
    not going to be claiming transactions are committed while the slave
    (which happens to have 4x the disks because it keeps PITR backups for a
period too) is still chugging away on SCSI probes and hasn't gotten to
    having PostgreSQL up yet...
    Nobody has mentioned the ability to persist the not-committed state
    across a crash before, and I think it's an important discussion point.
    Eh? I think all Aidan is asking for is the ability to have a mode
    where sync rep is really always sync, or nothing commits. Rather than
    timing out and continuing merrily on its way...
Right, I'm not asking for a "new" mode. I'm just hoping that there will
be a way to guarantee my "sync rep" is actually replicating. Having it
"not replicate" simply because no slave has (yet) connected means I have
to dance jigs around pg_hba.conf so that it won't allow non-replication
connections until I've manually verified that the replication slave
is connected...

    a.
    --
    Aidan Van Dyk Create like a god,
    aidan@highrise.ca command like a king,
    http://www.highrise.ca/ work like a slave.
  • Simon Riggs at Sep 17, 2010 at 3:50 pm

    On Fri, 2010-09-17 at 11:30 -0400, Aidan Van Dyk wrote:
    * Robert Haas [100917 11:24]:
    On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs wrote:
    On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote:

    I want to have them configured in a fsync WAL/style sync rep, I want to
    make sure that if the master comes up first after I get power back, it's
    not going to be claiming transactions are committed while the slave
    (which happens to have 4x the disks because it keeps PITR backups for a
period too) is still chugging away on SCSI probes and hasn't gotten to
    having PostgreSQL up yet...
    Nobody has mentioned the ability to persist the not-committed state
    across a crash before, and I think it's an important discussion point.
    Eh? I think all Aidan is asking for is the ability to have a mode
    where sync rep is really always sync, or nothing commits. Rather than
    timing out and continuing merrily on its way...
Right, I'm not asking for a "new" mode. I'm just hoping that there will
be a way to guarantee my "sync rep" is actually replicating. Having it
"not replicate" simply because no slave has (yet) connected means I have
to dance jigs around pg_hba.conf so that it won't allow non-replication
connections until I've manually verified that the replication slave
is connected...
    I agree that aspect is a problem.

    One solution, to me, would be to have a directive included in the
    pg_hba.conf that says entries below it are only allowed if it passes the
    test. So your hba file looks like this

    local postgres postgres
    host replication ...
    need replication
    host any any

    So the "need" test is an extra option in the first column. We might want
    additional "need" tests before we allow other rules also. Text following
    the "need" verb will be additional info for that test, sufficient to
    allow some kind of execution on the backend.
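
Fleshed out a little (the users, addresses and auth methods here are
illustrative only; the new part is the "need" line, which gates every
entry below it):

local   postgres      postgres                       trust
host    replication   replicator   192.168.1.0/24    md5
need    replication
host    all           all          192.168.1.0/24    md5   # gated by "need" above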

    I definitely don't like the idea that anyone that commits will just sit
    there waiting until the standby comes up. That just sounds an insane way
    of doing it.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Fujii Masao at Sep 17, 2010 at 12:20 pm

    On Fri, Sep 17, 2010 at 8:31 PM, Simon Riggs wrote:
    The only thing standby registration allows you to do is know whether
    there was supposed to be a standby there, but yet it isn't there now. I
    don't see that point as being important because it seems strange to me
    to want to wait for a standby that ought to be there, but isn't anymore.
    According to what I heard, some people want to guarantee that all the
    transactions are *always* written in *all* the synchronous standbys.
    IOW, they want to keep the transaction waiting until it has been written
    in all the synchronous standbys. Standby registration is required to
    support such a use case. Without the registration, the master cannot
    determine whether the transaction has been written in all the synchronous
    standbys.
What happens if it never comes back? Manual intervention required.
Yep.
    In the use cases we discussed we had simple 2 or 3 server configs.

    master
    standby1 - preferred sync target - set to recv, fsync or apply
    standby2 - non-preferred sync target, maybe test server - set to async

    So in the two cases you mention we might set

    "wait for ack from reporting slave"
    master: sync_replication = 'recv'   #as default, can be changed
    reporting-slave: sync_replication_service = 'recv' #gives max level

    "wait until replayed in the server on the west coast"
    master: sync_replication = 'recv'   #as default, can be changed
    west-coast: sync_replication_service = 'apply' #gives max level
    What synchronization level does each combination of sync_replication
    and sync_replication_service lead to? I'd like to see something like
    the following table.

 sync_replication | sync_replication_service | result
------------------+---------------------------+--------
 async            | async                     | ???
 async            | recv                      | ???
 async            | fsync                     | ???
 async            | apply                     | ???
 recv             | async                     | ???
 ...

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Simon Riggs at Sep 17, 2010 at 12:41 pm

    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:

    What synchronization level does each combination of sync_replication
    and sync_replication_service lead to? I'd like to see something like
    the following table.

 sync_replication | sync_replication_service | result
------------------+---------------------------+--------
 async            | async                     | ???
 async            | recv                      | ???
 async            | fsync                     | ???
 async            | apply                     | ???
 recv             | async                     | ???
 ...
    Good question.

    There are only 4 possible outcomes. There is no combination, so we don't
    need a table like that above.

    The "service" specifies the highest request type available from that
    specific standby. If someone requests a higher service than is currently
    offered by this standby, they will either
    a) get that service from another standby that does offer that level
    b) automatically downgrade the sync rep mode to the highest available.

    For example, if you request recv but there is only one standby and it
    only offers async, then you get downgraded to async.

In all cases, if you request async then we act the same as 9.0.
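
To spell that out for the case of a single connected standby (the
effective level ends up being the lower of what the transaction requests
and what that standby offers):

 requested (master) | offered (standby) | effective
--------------------+-------------------+----------------------
 apply              | apply             | apply
 apply              | recv              | recv   (downgraded)
 recv               | async             | async  (downgraded)
 async              | apply             | async  (same as 9.0)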

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Dimitri Fontaine at Sep 17, 2010 at 7:36 pm

    Simon Riggs writes:
    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:
    What synchronization level does each combination of sync_replication
    and sync_replication_service lead to?
    There are only 4 possible outcomes. There is no combination, so we don't
    need a table like that above.

    The "service" specifies the highest request type available from that
    specific standby. If someone requests a higher service than is currently
    offered by this standby, they will either
    a) get that service from another standby that does offer that level
    b) automatically downgrade the sync rep mode to the highest available.
    I like the a) part, I can't say the same about the b) part. There's no
    reason to accept to COMMIT a transaction when the requested durability
    is known not to have been reached, unless the user said so.
    For example, if you request recv but there is only one standby and it
    only offers async, then you get downgraded to async.
If you so choose, but with a net slowdown, as you're now reaching the
timeout for each transaction with what I have in mind, and I don't see
how you can avoid that. Even if you set up the replication from the
master, you can still mess it up the same way, right?

    Regards,
    --
    dim
  • Fujii Masao at Sep 21, 2010 at 7:58 am

    On Sat, Sep 18, 2010 at 4:36 AM, Dimitri Fontaine wrote:
    Simon Riggs <simon@2ndQuadrant.com> writes:
    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:
    What synchronization level does each combination of sync_replication
    and sync_replication_service lead to?
    There are only 4 possible outcomes. There is no combination, so we don't
    need a table like that above.

    The "service" specifies the highest request type available from that
    specific standby. If someone requests a higher service than is currently
    offered by this standby, they will either
    a) get that service from another standby that does offer that level
    b) automatically downgrade the sync rep mode to the highest available.
    I like the a) part, I can't say the same about the b) part. There's no
    reason to accept to COMMIT a transaction when the requested durability
    is known not to have been reached, unless the user said so.
    Yep, I can imagine that some people want to ensure that *all* the
    transactions are synchronously replicated to the synchronous standby,
    without regard to sync_replication. So I'm not sure if automatic
downgrade/upgrade of the mode makes sense. Should we introduce a new
parameter specifying whether to allow automatic downgrade/upgrade or not?
    It seems complicated though.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Simon Riggs at Sep 21, 2010 at 6:05 pm

    On Tue, 2010-09-21 at 16:58 +0900, Fujii Masao wrote:
    On Sat, Sep 18, 2010 at 4:36 AM, Dimitri Fontaine
    wrote:
    Simon Riggs <simon@2ndQuadrant.com> writes:
    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:
    What synchronization level does each combination of sync_replication
    and sync_replication_service lead to?
    There are only 4 possible outcomes. There is no combination, so we don't
    need a table like that above.

    The "service" specifies the highest request type available from that
    specific standby. If someone requests a higher service than is currently
    offered by this standby, they will either
    a) get that service from another standby that does offer that level
    b) automatically downgrade the sync rep mode to the highest available.
    I like the a) part, I can't say the same about the b) part. There's no
    reason to accept to COMMIT a transaction when the requested durability
    is known not to have been reached, unless the user said so.
    Hmm, no reason? The reason is that the alternative is that the session
    would hang until a standby arrived that offered that level of service.
    Why would you want that behaviour? Would you really request that option?
    Yep, I can imagine that some people want to ensure that *all* the
    transactions are synchronously replicated to the synchronous standby,
    without regard to sync_replication. So I'm not sure if automatic
downgrade/upgrade of the mode makes sense. Should we introduce a new
parameter specifying whether to allow automatic downgrade/upgrade or not?
    It seems complicated though.
    I agree, but I'm not against any additional parameter if people say they
    really want them *after* the consequences of those choices have been
    highlighted.

    IMHO we should focus on the parameters that deliver key use cases.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Markus Wanner at Sep 22, 2010 at 8:22 am
    Hi,
    On 09/21/2010 08:05 PM, Simon Riggs wrote:
    Hmm, no reason? The reason is that the alternative is that the session
    would hang until a standby arrived that offered that level of service.
    Why would you want that behaviour? Would you really request that option?
    I think I now agree with Simon on that point. It's only an issue in
    multi-master replication, where continued operation would lead to a
    split-brain situation.

    With master-slave, you only need to make sure your master stays the
    master even if the standby crash(es) are followed by a master crash. If
    your cluster-ware is too clever and tries a fail-over on a slave that's
    quicker to come up, you get the same split-brain situation.

    Put another way: if you let your master continue, don't ever try a
    fail-over after a full-cluster crash.

    Regards

    Markus Wanner
  • Simon Riggs at Sep 17, 2010 at 12:56 pm

    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:
    On Fri, Sep 17, 2010 at 8:31 PM, Simon Riggs wrote:
    The only thing standby registration allows you to do is know whether
    there was supposed to be a standby there, but yet it isn't there now. I
    don't see that point as being important because it seems strange to me
    to want to wait for a standby that ought to be there, but isn't anymore.
    According to what I heard, some people want to guarantee that all the
    transactions are *always* written in *all* the synchronous standbys.
    IOW, they want to keep the transaction waiting until it has been written
    in all the synchronous standbys. Standby registration is required to
    support such a use case. Without the registration, the master cannot
    determine whether the transaction has been written in all the synchronous
    standbys.
    You don't need standby registration at all. You can do that with a
    single parameter, already proposed:

    quorum_commit = N.

    But most people said they didn't want it. If they do we can put it back
    later.

    I don't think we're getting anywhere here. I just don't see any *need*
    to have it. Some people might *want* to set things up that way, and if
    that's true, that's enough for me to agree with them. The trouble is, I
    know some people have said they *want* to set it in the standby and we
    definitely *need* to set it somewhere. After this discussion, I think
    "both" is easily done and quite cool.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Dimitri Fontaine at Sep 17, 2010 at 7:32 pm

    Simon Riggs writes:
    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:
    According to what I heard, some people want to guarantee that all the
    transactions are *always* written in *all* the synchronous standbys.
    You don't need standby registration at all. You can do that with a
    single parameter, already proposed:

    quorum_commit = N.
    I think you also need another parameter to control the behavior upon
timeout. You received less than N votes, now what? Your current idea
    seems to be COMMIT, Aidan says ROLLBACK, and I say that's to be a GUC
    set at the transaction level.

As far as registration goes, I see no harm in having the master maintain
a list of known standby systems, of course; it's just maintaining that
list from the master that I don't understand the use case for.

    Regards,
    --
    dim
  • Simon Riggs at Sep 18, 2010 at 8:51 am

    On Fri, 2010-09-17 at 21:32 +0200, Dimitri Fontaine wrote:
    Simon Riggs <simon@2ndQuadrant.com> writes:
    On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote:
    According to what I heard, some people want to guarantee that all the
    transactions are *always* written in *all* the synchronous standbys.
    You don't need standby registration at all. You can do that with a
    single parameter, already proposed:

    quorum_commit = N.
    I think you also need another parameter to control the behavior upon
timeout. You received less than N votes, now what? Your current idea
    seems to be COMMIT, Aidan says ROLLBACK, and I say that's to be a GUC
    set at the transaction level.
    I've said COMMIT with no option because I believe that we have only two
    choices: commit or wait (perhaps forever), and IMHO waiting is not good.

    We can't ABORT, because we sent a commit to the standby. If we abort,
    then we're saying the standby can't ever come back because it will have
    received and potentially replayed a different transaction history. I had
    some further thoughts around that but you end up with the byzantine
    generals problem always.

    Waiting might sound attractive. In practice, waiting will make all of
    your connections lock up and it will look to users as if their master
    has stopped working as well. (It has!). I can't imagine why anyone would
    ever want an option to select that; its the opposite of high
    availability. Just sounds like a serious footgun.

    Having said that Oracle offers Maximum Protection mode, which literally
    shuts down the master when you lose a standby. I can't say anything
    apart from "LOL".
As far as registration goes, I see no harm in having the master maintain
a list of known standby systems, of course; it's just maintaining that
list from the master that I don't understand the use case for.
    Yes, the master needs to know about all currently connected standbys.
    The only debate is what happens about ones that "ought" to be there.

    Given my comments above, I don't see the need.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Dimitri Fontaine at Sep 18, 2010 at 11:57 am

    Simon Riggs writes:
    I've said COMMIT with no option because I believe that we have only two
    choices: commit or wait (perhaps forever), and IMHO waiting is not good.

    We can't ABORT, because we sent a commit to the standby.
    Ah yes, I keep forgetting Sync Rep is not about 2PC. Sorry about that.
    Waiting might sound attractive. In practice, waiting will make all of
    your connections lock up and it will look to users as if their master
    has stopped working as well. (It has!). I can't imagine why anyone would
    ever want an option to select that; its the opposite of high
    availability. Just sounds like a serious footgun.
    I guess that if there's a timeout GUC it can still be set to infinite
    somehow. Unclear as the use case might be.

    Regards,
    --
    dim
  • Robert Haas at Sep 18, 2010 at 7:59 pm

    On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs wrote:
    Waiting might sound attractive. In practice, waiting will make all of
    your connections lock up and it will look to users as if their master
    has stopped working as well. (It has!). I can't imagine why anyone would
    ever want an option to select that; its the opposite of high
    availability. Just sounds like a serious footgun.
    Nevertheless, it seems that some people do want exactly that behavior,
    no matter how crazy it may seem to you. I'm not exactly sure what
    we're in disagreement about, TBH. You've previously said that you
    don't think standby registration is necessary, but that you don't
    object to it if others want it. So it seems like this might be mostly
    academic.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Heikki Linnakangas at Sep 20, 2010 at 6:27 am

    On 18/09/10 22:59, Robert Haas wrote:
On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs wrote:
    Waiting might sound attractive. In practice, waiting will make all of
    your connections lock up and it will look to users as if their master
    has stopped working as well. (It has!). I can't imagine why anyone would
    ever want an option to select that; its the opposite of high
    availability. Just sounds like a serious footgun.
    Nevertheless, it seems that some people do want exactly that behavior,
    no matter how crazy it may seem to you.
Yeah, I agree with both of you. I have a hard time imagining a situation
    where you would actually want that. It's not high availability, it's
    high durability. When a transaction is acknowledged as committed, you
    know it's never ever going to disappear even if a meteor strikes the
    current master server within the next 10 milliseconds. In practice,
    people want high availability instead.

    That said, the timeout option also feels a bit wishy-washy to me. With a
    timeout, acknowledgment of a commit means "your transaction is safely
    committed in the master and slave. Or not, if there was some glitch with
    the slave". That doesn't seem like a very useful guarantee; if you're
    happy with that why not just use async replication?

    However, the "wait forever" behavior becomes useful if you have a
    monitoring application outside the DB that decides when enough is enough
    and tells the DB that the slave can be considered dead. So "wait
    forever" actually means "wait until I tell you that you can give up".
    The monitoring application can STONITH to ensure that the slave stays
    down, before letting the master proceed with the commit.

    With that in mind, we have to make sure that a transaction that's
    waiting for acknowledgment of the commit from a slave is woken up if the
    configuration changes.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Sep 20, 2010 at 9:17 am

    On Mon, 2010-09-20 at 09:27 +0300, Heikki Linnakangas wrote:
    On 18/09/10 22:59, Robert Haas wrote:
On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs wrote:
    Waiting might sound attractive. In practice, waiting will make all of
    your connections lock up and it will look to users as if their master
    has stopped working as well. (It has!). I can't imagine why anyone would
    ever want an option to select that; its the opposite of high
    availability. Just sounds like a serious footgun.
    Nevertheless, it seems that some people do want exactly that behavior,
    no matter how crazy it may seem to you.
Yeah, I agree with both of you. I have a hard time imagining a situation
    where you would actually want that. It's not high availability, it's
    high durability. When a transaction is acknowledged as committed, you
    know it's never ever going to disappear even if a meteor strikes the
    current master server within the next 10 milliseconds. In practice,
    people want high availability instead.

    That said, the timeout option also feels a bit wishy-washy to me. With a
    timeout, acknowledgment of a commit means "your transaction is safely
    committed in the master and slave. Or not, if there was some glitch with
    the slave". That doesn't seem like a very useful guarantee; if you're
    happy with that why not just use async replication?

    However, the "wait forever" behavior becomes useful if you have a
    monitoring application outside the DB that decides when enough is enough
    and tells the DB that the slave can be considered dead. So "wait
    forever" actually means "wait until I tell you that you can give up".
    The monitoring application can STONITH to ensure that the slave stays
    down, before letting the master proceed with the commit.
    err... what is the difference between a timeout and stonith? None. We
    still proceed without the slave in both cases after the decision point.

    In all cases, we would clearly have a user accessible function to stop
    particular sessions, or all sessions, from waiting for standby to
    return.

    You would have 3 choices:
    * set automatic timeout
    * set wait forever and then wait for manual resolution
    * set wait forever and then trust to external clusterware

    Many people have asked for timeouts and I agree it's probably the
    easiest thing to do if you just have 1 standby.
    With that in mind, we have to make sure that a transaction that's
    waiting for acknowledgment of the commit from a slave is woken up if the
    configuration changes.
    There's a misunderstanding here of what I've said and its a subtle one.

    My patch supports a timeout of 0, i.e. wait forever. Which means I agree
    that functionality is desired and should be included. This operates by
    saying that if a currently-connected-standby goes down we will wait
    until the timeout. So I agree all 3 choices should be available to
    users.

    Discussion has been about what happens to ought-to-have-been-connected
    standbys. Heikki had argued we need standby registration because if a
    server *ought* to have been there, yet isn't currently there when we
    wait for sync rep, we would still wait forever for it to return. To do
    this you require standby registration.

    But there is a hidden issue there: If you care about high availability
    AND sync rep you have two standbys. If one goes down, the other is still
    there. In general, if you want high availability on N servers then you
    have N+1 standbys. If one goes down, the other standbys provide the
    required level of durability and we do not wait.

    So the only case where standby registration is required is where you
deliberately choose to *not* have N+1 redundancy yet still require all N
standbys to acknowledge. That is a suicidal config and
    nobody would sanely choose that. It's not a large or useful use case for
    standby reg. (But it does raise the question again of whether we need
    quorum commit).

My take is that if the above use case occurs it is because one standby
has just gone down and the cluster is, for a hopefully short period, in
a degraded state, and the service responds to that. So in my
    proposal, if a standby is not there *now* we don't wait for it.

    Which cuts out a huge bag of code, specification and such like that
    isn't required to support sane use cases. More stuff to get wrong and
    regret in later releases. The KISS principle, just like we apply in all
    other cases.

    If we did have standby registration, then I would implement it in a
    table, not in an external config file. That way when we performed a
    failover the data would be accessible on the new master. But I don't
    suggest we have CREATE/ALTER STANDBY syntax. We already have
    CREATE/ALTER SERVER if we wanted to do it in SQL. If we did that, ISTM
    we should choose functions.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Heikki Linnakangas at Sep 20, 2010 at 12:16 pm

    On 20/09/10 12:17, Simon Riggs wrote:
    err... what is the difference between a timeout and stonith?
    STONITH ("Shoot The Other Node In The Head") means that the other node
    is somehow disabled so that it won't unexpectedly come back alive. A
    timeout means that the slave hasn't been seen for a while, but it might
    reconnect just after the timeout has expired.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Sep 20, 2010 at 12:51 pm

    On Mon, 2010-09-20 at 15:16 +0300, Heikki Linnakangas wrote:
    On 20/09/10 12:17, Simon Riggs wrote:
    err... what is the difference between a timeout and stonith?
    STONITH ("Shoot The Other Node In The Head") means that the other node
    is somehow disabled so that it won't unexpectedly come back alive. A
    timeout means that the slave hasn't been seen for a while, but it might
    reconnect just after the timeout has expired.
    You've edited my reply to change the meaning of what was a rhetorical
    question, as well as completely ignoring the main point of my reply.

    Please respond to the main point: Following some thought and analysis,
    AFAICS there is no sensible use case that requires standby registration.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Development, 24x7 Support, Training and Services
  • Heikki Linnakangas at Sep 20, 2010 at 1:26 pm

    On 20/09/10 15:50, Simon Riggs wrote:
    On Mon, 2010-09-20 at 15:16 +0300, Heikki Linnakangas wrote:
    On 20/09/10 12:17, Simon Riggs wrote:
    err... what is the difference between a timeout and stonith?
    STONITH ("Shoot The Other Node In The Head") means that the other node
    is somehow disabled so that it won't unexpectedly come back alive. A
    timeout means that the slave hasn't been seen for a while, but it might
    reconnect just after the timeout has expired.
    You've edited my reply to change the meaning of what was a rhetorical
    question, as well as completely ignoring the main point of my reply.

    Please respond to the main point: Following some thought and analysis,
    AFAICS there is no sensible use case that requires standby registration.
    Ok, I had completely missed your point then.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Robert Haas at Sep 20, 2010 at 1:28 pm

    On Mon, Sep 20, 2010 at 8:50 AM, Simon Riggs wrote:
    Please respond to the main point: Following some thought and analysis,
    AFAICS there is no sensible use case that requires standby registration.
    I disagree. You keep analyzing away the cases that require standby
    registration, but I don't believe that they're not real. Aidan Van
    Dyk's case upthread of wanting to make sure that the standby is up and
    replicating synchronously before the master starts processing
    transactions seems perfectly legitimate to me. Sure, it's paranoid,
    but so what? We're all about paranoia, at least as far as data loss
    is concerned. So the "wait forever" case is, in my opinion,
    sufficient to demonstrate that we need it, but it's not even my
    primary reason for wanting to have it.

    The most important reason why I think we should have standby
    registration is for simplicity of configuration. Yes, it adds another
    configuration file, but that configuration file contains ALL of the
    information about which standbys are synchronous. Without standby
    registration, this information will inevitably be split between the
    master config and the various slave configs and you'll have to look at
    all the configurations to be certain you understand how it's going to
    end up working. As a particular manifestation of this, and as
    previously argued and +1'd upthread, the ability to change the set of
    standbys to which the master is replicating synchronously without
    changing the configuration on the master or any of the existing slaves
seems dangerous.

    Another reason why I think we should have standby registration is to
eventually allow the "streaming WAL backwards" configuration
    which has previously been discussed. IOW, you could stream the WAL to
    the slave in advance of fsync-ing it on the master. After a power
    failure, the machines in the cluster can talk to each other and figure
    out which one has the furthest-advanced WAL pointer and stream from
    that machine to all the others. This is an appealing configuration
    for people using sync rep because it would allow the fsyncs to be done
    in parallel rather than sequentially as is currently necessary - but
    if you're using it, you're certainly not going to want the master to
    enter normal running without waiting to hear from the slave.
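
As a rough sketch of the comparison step, using functions that already
exist in 9.0 (the orchestration around them is the hypothetical part):

-- on each node that was a standby, report how far its WAL has advanced:
SELECT pg_last_xlog_receive_location();
-- on the node that was the master:
SELECT pg_current_xlog_location();
-- whichever node reports the furthest-advanced position becomes the
-- streaming source for the others.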

    Just to be clear, that is a list of three independent reasons any one
    of which I think is sufficient for wanting standby registration.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Dimitri Fontaine at Sep 20, 2010 at 8:10 pm
    Hi,

    I'm somewhat sorry to have to play this game, as I sure don't feel
    smarter by composing this email. Quite the contrary.

    Robert Haas <robertmhaas@gmail.com> writes:
    So the "wait forever" case is, in my opinion,
    sufficient to demonstrate that we need it, but it's not even my
    primary reason for wanting to have it.
    You're talking about standby registration on the master. You can solve
    this case without it, because when a slave is not connected it's not
    giving any feedback (vote, weight, ack) to the master. All you have to
    do is have the quorum setup in a way that disconnecting your slave means
    you can't reach the quorum any more. Have it SIGHUP and you can even
    choose to fix the setup, rather than fix the standby.

    So no need for registration here, it's just another way to solve the
    problem. Not saying it's better or worse, just another.

    Now we could have a summary function on the master showing all the known
    slaves, their last time of activity, their known current setup, etc, all
    from the master, but read-only. Would that be useful enough?
    The most important reason why I think we should have standby
    registration is for simplicity of configuration. Yes, it adds another
    configuration file, but that configuration file contains ALL of the
    information about which standbys are synchronous. Without standby
    registration, this information will inevitably be split between the
    master config and the various slave configs and you'll have to look at
    all the configurations to be certain you understand how it's going to
    end up working.
    So, here, we have two quite different things to be concerned
    about. First is the configuration, and I say that managing a distributed
    setup will be easier for the DBA.

    Then there's how to obtain a nice view about the distributed system,
    which again we can achieve from the master without manually registering
    the standbys. After all, the information you want needs to be there.
    As a particular manifestation of this, and as
    previously argued and +1'd upthread, the ability to change the set of
    standbys to which the master is replicating synchronously without
    changing the configuration on the master or any of the existing slaves
seems dangerous.
    Well, you still need to open the HBA for the new standby to be able to
    connect, and to somehow take a base backup, right? We're not exactly
    transparent there, yet, are we?
    Another reason why I think we should have standby registration is to
eventually allow the "streaming WAL backwards" configuration
    which has previously been discussed. IOW, you could stream the WAL to
    the slave in advance of fsync-ing it on the master. After a power
    failure, the machines in the cluster can talk to each other and figure
    out which one has the furthest-advanced WAL pointer and stream from
    that machine to all the others. This is an appealing configuration
    for people using sync rep because it would allow the fsyncs to be done
    in parallel rather than sequentially as is currently necessary - but
    if you're using it, you're certainly not going to want the master to
    enter normal running without waiting to hear from the slave.
    I love the idea.

    Now it seems to me that all you need here is the master sending one more
piece of information with each WAL "segment": the currently fsync'ed position,
    which pre-9.1 is implied as being the current LSN from the stream,
    right?

Here I'm not sure I follow you in detail, but it seems to me registering
the standbys is just another way of achieving the same thing. To be
honest, I don't understand at all how it helps implement your idea.

    Regards,
    --
    Dimitri Fontaine
    PostgreSQL DBA, Architecte
  • Robert Haas at Sep 20, 2010 at 9:16 pm

    On Mon, Sep 20, 2010 at 4:10 PM, Dimitri Fontaine wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    So the "wait forever" case is, in my opinion,
    sufficient to demonstrate that we need it, but it's not even my
    primary reason for wanting to have it.
    You're talking about standby registration on the master. You can solve
    this case without it, because when a slave is not connected it's not
    giving any feedback (vote, weight, ack) to the master. All you have to
    do is have the quorum setup in a way that disconnecting your slave means
    you can't reach the quorum any more. Have it SIGHUP and you can even
    choose to fix the setup, rather than fix the standby.
    I suppose that could work.
    The most important reason why I think we should have standby
    registration is for simplicity of configuration.  Yes, it adds another
    configuration file, but that configuration file contains ALL of the
    information about which standbys are synchronous.  Without standby
    registration, this information will inevitably be split between the
    master config and the various slave configs and you'll have to look at
    all the configurations to be certain you understand how it's going to
    end up working.
    So, here, we have two quite different things to be concerned
    about. First is the configuration, and I say that managing a distributed
    setup will be easier for the DBA.
    Yeah, I disagree with that, but I suppose it's a question of opinion.
    Then there's how to obtain a nice view about the distributed system,
    which again we can achieve from the master without manually registering
    the standbys. After all, the information you want needs to be there.
    I think that without standby registration it will be tricky to display
    information like "the last time that standby foo was connected".
    Yeah, you could set a standby name on the standby server and just have
    the master remember details for every standby name it's ever seen, but
    then how do you prune the list?

    Heikki mentioned another application for having a list of the current
    standbys only (rather than "every standby that has ever existed")
    upthread: you can compute the exact amount of WAL you need to keep
    around.
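
For instance, building on the pg_slave_status sketch from earlier in the
thread (hypothetical view and column names), the master could compute
the oldest WAL position any registered standby still needs as simply:

SELECT min(received) FROM pg_slave_status;

and recycle everything older than that.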
    As a particular manifestation of this, and as
    previously argued and +1'd upthread, the ability to change the set of
    standbys to which the master is replicating synchronously without
    changing the configuration on the master or any of the existing slaves
seems dangerous.
    Well, you still need to open the HBA for the new standby to be able to
    connect, and to somehow take a base backup, right? We're not exactly
    transparent there, yet, are we?
    Sure, but you might have that set relatively open on a trusted network.
    Another reason why I think we should have standby registration is to
eventually allow the "streaming WAL backwards" configuration
    which has previously been discussed.  IOW, you could stream the WAL to
    the slave in advance of fsync-ing it on the master.  After a power
    failure, the machines in the cluster can talk to each other and figure
    out which one has the furthest-advanced WAL pointer and stream from
    that machine to all the others.  This is an appealing configuration
    for people using sync rep because it would allow the fsyncs to be done
    in parallel rather than sequentially as is currently necessary - but
    if you're using it, you're certainly not going to want the master to
    enter normal running without waiting to hear from the slave.
    I love the idea.

    Now it seems to me that all you need here is the master sending one more
piece of information with each WAL "segment": the currently fsync'ed position,
    which pre-9.1 is implied as being the current LSN from the stream,
    right?
    I don't see how that would help you.
Here I'm not sure I follow you in detail, but it seems to me registering
the standbys is just another way of achieving the same thing. To be
honest, I don't understand at all how it helps implement your idea.
Well, if you need to talk to "all the other standbys" and see who has
the furthest-advanced xlog pointer, it seems like you have to have a
list somewhere of who they all are. Maybe there's some way to get
    this to work without standby registration, but I don't really
    understand the resistance to the idea, and I fear it's going to do
    nothing good for our reputation for ease of use (or lack thereof).
    The idea of making this all work without standby registration strikes
    me as akin to the notion of having someone decide whether they're
    running a three-legged race by checking whether their leg is currently
    tied to someone else's leg. You can probably make that work by
patching around the various failure cases, but why isn't it simpler to
    just tell the poor guy "Hi, Joe. You're running a three-legged race
    with Jane today. Hans and Juanita will be following you across the
    field, too, but don't worry about whether they're keeping up."?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
