On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
On 03/06/2011 05:51 PM, Simon Riggs wrote:
Efficient transaction-controlled synchronous replication.
I'm glad this is in, but I thought we agreed NOT to call it "synchronous
replication".
The discussion on the thread was that it's not sync rep unless we have
the strictest guarantees. We have the strictest guarantees, so it
qualifies as sync rep.

Relaxations are possible and, to some people, desirable.

Perhaps there is a more marketable term, and if so, we can rebrand. It
wouldn't be the first time things got renamed in beta.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

  • Heikki Linnakangas at Mar 7, 2011 at 7:30 am

    On 07.03.2011 01:28, Simon Riggs wrote:
    On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
    On 03/06/2011 05:51 PM, Simon Riggs wrote:
    Efficient transaction-controlled synchronous replication.
    I'm glad this is in, but I thought we agreed NOT to call it "synchronous
    replication".
    The discussion on the thread was that it's not sync rep unless we have
    the strictest guarantees. We have the strictest guarantees, so it
    qualifies as sync rep.
    What do you mean by "strictest guarantees"?

    I don't see allow_synchronous_standby setting in the committed patch. I
    presume you didn't make allow_synchronous_standby=off the default
    behavior. Also, the documentation that describes this as two-safe
    replication and claims that "the only possibility that data can be lost
    is if both the primary and the standby suffer crashes at the same time"
    needs big fat caveats to clarify that this doesn't actually achieve
    those guarantees.

    Please change the name.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Mar 7, 2011 at 7:48 am

    On Mon, 2011-03-07 at 09:29 +0200, Heikki Linnakangas wrote:

    I presume you didn't make allow_synchronous_standby=off the default
    behavior.
    You presume incorrectly.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Heikki Linnakangas at Mar 7, 2011 at 7:54 am

    On 07.03.2011 09:48, Simon Riggs wrote:
    On Mon, 2011-03-07 at 09:29 +0200, Heikki Linnakangas wrote:

    I presume you didn't make allow_synchronous_standby=off the default
    behavior.
    Sorry, s/allow_synchronous_standby/allow_standalone_master
    You presume incorrectly.
    Ok, ok then. Thank you! Looks like I need to git pull and get myself
    up-to-speed with these latest developments :-).

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Andrew Dunstan at Mar 7, 2011 at 1:30 pm

    On 03/07/2011 02:29 AM, Heikki Linnakangas wrote:
    On 07.03.2011 01:28, Simon Riggs wrote:
    On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
    On 03/06/2011 05:51 PM, Simon Riggs wrote:
    Efficient transaction-controlled synchronous replication.
    I'm glad this is in, but I thought we agreed NOT to call it
    "synchronous
    replication".
    The discussion on the thread was that it's not sync rep unless we have
    the strictest guarantees. We have the strictest guarantees, so it
    qualifies as sync rep.
    What do you mean by "strictest guarantees"?

    I don't see allow_synchronous_standby setting in the committed patch.
    I presume you didn't make allow_synchronous_standby=off the default
    behavior. Also, the documentation that describes this as two-safe
    replication and claims that "the only possibility that data can be
    lost is if both the primary and the standby suffer crashes at the same
    time" needs big fat caveats to clarify that this doesn't actually
    achieve those guarantees.

    Please change the name.
    Previously, Simon said:
    Truly "synchronous" requires two-phase commit, which this never was.
    So I too am confused about how it's now become "truly synchronous". Are
    we saying this gives the same or better guarantees than a 2PC setup?

    cheers

    andrew
  • Heikki Linnakangas at Mar 7, 2011 at 2:03 pm

    On 07.03.2011 15:30, Andrew Dunstan wrote:
    Previously, Simon said:
    Truly "synchronous" requires two-phase commit, which this never was.
    So I too am confused about how it's now become "truly synchronous". Are
    we saying this gives the same or better guarantees than a 2PC setup?
    The guarantee we have now with synchronous_replication=on is that when
    the server acknowledges a commit to the client (ie. when COMMIT command
    returns), the transaction is safely flushed to disk on the master and at
    least one synchronous standby server.

    What you don't get is a guarantee on what happens to transactions that
    were not acknowledged to the client. For example, if you pull the power
    plug, the transaction that was just being committed might be committed
    on the master, but not yet on the standby.

    For me, that's enough to call it "synchronous replication". It provides
    a useful guarantee to the client. But you could argue for an even
    stricter definition, requiring atomicity so that if a transaction is not
    successfully replicated for any reason, including crash, it is rolled
    back in the master too. That would require 2PC.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
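
The guarantee described in the message above can be illustrated with a small simulation. This is only a sketch, not PostgreSQL code: `Node`, `commit`, and the `standby_reachable` flag are hypothetical names, and a node's "disk" is modelled as a set of transaction ids.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A server whose 'disk' is the set of transaction ids flushed to it."""
    flushed: set = field(default_factory=set)

    def flush(self, xid: int) -> None:
        self.flushed.add(xid)


def commit(xid: int, master: Node, standby: Node, standby_reachable: bool) -> bool:
    """Return True only once the commit is flushed on both nodes.

    False models a commit that was never acknowledged to the client,
    for example because the master lost power while waiting.
    """
    master.flush(xid)          # commit record is durable on the master first
    if not standby_reachable:
        return False           # no acknowledgement ever reaches the client
    standby.flush(xid)         # standby flushes, then sends its ack
    return True                # only now does COMMIT return


master, standby = Node(), Node()
acked = commit(1, master, standby, standby_reachable=True)
unacked = commit(2, master, standby, standby_reachable=False)

print(acked, 1 in standby.flushed)                          # True True
print(unacked, 2 in master.flushed, 2 in standby.flushed)   # False True False
```

Transaction 2 shows the case the message calls out: it was never acknowledged, yet it exists on the master and not on the standby.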
  • Andrew Dunstan at Mar 7, 2011 at 2:21 pm

    On 03/07/2011 09:02 AM, Heikki Linnakangas wrote:
    On 07.03.2011 15:30, Andrew Dunstan wrote:
    Previously, Simon said:
    Truly "synchronous" requires two-phase commit, which this never was.
    So I too am confused about how it's now become "truly synchronous". Are
    we saying this gives the same or better guarantees than a 2PC setup?
    The guarantee we have now with synchronous_replication=on is that when
    the server acknowledges a commit to the client (ie. when COMMIT
    command returns), the transaction is safely flushed to disk on the
    master and at least one synchronous standby server.

    What you don't get is a guarantee on what happens to transactions that
    were not acknowledged to the client. For example, if you pull the
    power plug, the transaction that was just being committed might be
    committed on the master, but not yet on the standby.

    For me, that's enough to call it "synchronous replication". It
    provides a useful guarantee to the client. But you could argue for an
    even stricter definition, requiring atomicity so that if a transaction
    is not successfully replicated for any reason, including crash, it is
    rolled back in the master too. That would require 2PC.
    My worry is that the stricter definition is what many people will
    expect, without reading the fine print.

    cheers

    andrew
  • Aidan Van Dyk at Mar 7, 2011 at 2:34 pm

    On Mon, Mar 7, 2011 at 2:21 PM, Andrew Dunstan wrote:

    For me, that's enough to call it "synchronous replication". It provides a
    useful guarantee to the client. But you could argue for an even stricter
    definition, requiring atomicity so that if a transaction is not successfully
    replicated for any reason, including crash, it is rolled back in the master
    too. That would require 2PC.
    My worry is that the stricter definition is what many people will expect,
    without reading the fine print.
    Then they are either already hosed or already using 2PC.

    a.
    --
    Aidan Van Dyk                                             Create like a god,
    aidan@highrise.ca                                       command like a king,
    http://www.highrise.ca/                                   work like a slave.
  • Andrew Dunstan at Mar 7, 2011 at 3:03 pm

    On 03/07/2011 09:29 AM, Aidan Van Dyk wrote:
    On Mon, Mar 7, 2011 at 2:21 PM, Andrew Dunstan wrote:
    For me, that's enough to call it "synchronous replication". It provides a
    useful guarantee to the client. But you could argue for an even stricter
    definition, requiring atomicity so that if a transaction is not successfully
    replicated for any reason, including crash, it is rolled back in the master
    too. That would require 2PC.
    My worry is that the stricter definition is what many people will expect,
    without reading the fine print.
    Then they are either already hosed or already using 2PC.

    This is about expectations. The thing that worries me is that the use of
    this term might cause some people NOT to use 2PC because they think they
    are getting an equivalent guarantee, when in fact they are not. And
    that's hardly unreasonable. Here for example is what wikipedia says
    <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:

    Synchronous replication - guarantees "zero data loss" by the means
    of atomic write operation, i.e. write either completes on both sides
    or not at all. Write is not considered complete until
    acknowledgement by both local and remote storage.


    cheers

    andrew
  • Kevin Grittner at Mar 7, 2011 at 3:14 pm

    Andrew Dunstan wrote:

    Synchronous replication - guarantees "zero data loss" by the
    means of atomic write operation, i.e. write either completes on
    both sides or not at all.
    So far, so good.
    Write is not considered complete until acknowledgement by both
    local and remote storage.
    OK, *if* we want to live up to this definition, we don't seem to
    have that part covered. Of course, since the connection is broken
    during the hypothetical crash, it seems hard to acknowledge it on
    recovery, and short of 2PC I don't see how we roll it back. About
    the best we could do is somehow have explicit logging of the
    disposition of unacknowledged commit requests upon recovery, and
    consider logging of success to be "acknowledgement". Is this
    logging provided by other databases with "synchronous replication"
    features?

    -Kevin
  • Heikki Linnakangas at Mar 7, 2011 at 3:47 pm

    On 07.03.2011 17:03, Andrew Dunstan wrote:
    This is about expectations. The thing that worries me is that the use of
    this term might cause some people NOT to use 2PC because they think they
    are getting an equivalent guarantee, when in fact they are not. And
    that's hardly unreasonable. Here for example is what wikipedia says
    <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:

    Synchronous replication - guarantees "zero data loss" by the means
    of atomic write operation, i.e. write either completes on both sides
    or not at all. Write is not considered complete until
    acknowledgement by both local and remote storage.
    Hmm, I've read that wikipedia definition before, but the "atomic" part
    never caught my eye. You do get zero data loss with what we have; if a
    meteor strikes the master, no acknowledged transaction is lost. I find
    that definition a bit confusing.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Andrew Dunstan at Mar 7, 2011 at 3:51 pm

    On 03/07/2011 10:46 AM, Heikki Linnakangas wrote:
    On 07.03.2011 17:03, Andrew Dunstan wrote:
    This is about expectations. The thing that worries me is that the use of
    this term might cause some people NOT to use 2PC because they think they
    are getting an equivalent guarantee, when in fact they are not. And
    that's hardly unreasonable. Here for example is what wikipedia says
    <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:

    Synchronous replication - guarantees "zero data loss" by the means
    of atomic write operation, i.e. write either completes on both sides
    or not at all. Write is not considered complete until
    acknowledgement by both local and remote storage.
    Hmm, I've read that wikipedia definition before, but the "atomic" part
    never caught my eye. You do get zero data loss with what we have; if a
    meteor strikes the master, no acknowledged transaction is lost. I find
    that definition a bit confusing.
    Maybe it is - I agree the difference might be small. I'm just trying to
    make sure we don't use a term that could mislead reasonable people about
    what we're providing. If we're satisfied that we aren't, then keep it.

    cheers

    andrew
  • Alvaro Herrera at Mar 7, 2011 at 4:10 pm

    Excerpts from Andrew Dunstan's message of lun mar 07 12:51:49 -0300 2011:
    On 03/07/2011 10:46 AM, Heikki Linnakangas wrote:

    Hmm, I've read that wikipedia definition before, but the "atomic" part
    never caught my eye. You do get zero data loss with what we have; if a
    meteor strikes the master, no acknowledged transaction is lost. I find
    that definition a bit confusing.
    Maybe it is - I agree the difference might be small. I'm just trying to
    make sure we don't use a term that could mislead reasonable people about
    what we're providing. If we're satisfied that we aren't, then keep it.
    I think these terms are used inconsistently enough across the industry
    that what would make the most sense would be to use the common term and
    document accurately what we mean by it, rather than relying on some
    external entity's definition, which could change (like wikipedia's).

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Markus Wanner at Mar 18, 2011 at 9:27 am
    Hi,

    sorry for being late to join that bike-shedding discussion.
    On 03/07/2011 05:09 PM, Alvaro Herrera wrote:
    I think these terms are used inconsistently enough across the industry
    that what would make the most sense would be to use the common term and
    document accurately what we mean by it, rather than relying on some
    external entity's definition, which could change (like wikipedia's).
    I absolutely agree with Alvaro here.

    The Wikipedia definition seems to only speak about one local and one
    remote node. Requiring an ack from "at least one" remote node seems to
    cover that.

    Not even Wikipedia goes further in their definition and tries to explain
    what 'synchronous replication' could mean in case we have more than two
    nodes. A somewhat common expectation is, that all nodes would have to
    ack. However, with such a requirement a single node failure brings your
    cluster to a full stop. So this isn't a practical option.

    Google invented the term "semi-synchronous" for something that's
    essentially the same that we have, now, I think. However, I
    wholeheartedly hate that term (based on the reasoning that there's no
    semi-pregnant, either).

    Others (like me) use "synchronous" or (lately rather) "eager" to mean
    that only a majority of nodes need to send an ACK. I have to explain
    what I mean every time.

    In the end, I don't have a strong opinion either way, anymore. I'm
    happy to think of the replication between the master and the one standby
    that's sending an ACK first as "synchronous". (Even if those may well
    be different standbys for different transactions).

    Hope to have brought some light into this discussion.

    Regards

    Markus Wanner
  • MARK CALLAGHAN at Mar 18, 2011 at 1:18 pm

    On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner wrote:
    Google invented the term "semi-synchronous" for something that's
    essentially the same that we have, now, I think.  However, I
    wholeheartedly hate that term (based on the reasoning that there's no
    semi-pregnant, either).
    We didn't invent the term, we just implemented something that Heikki
    Tuuri briefly described, for example:
    http://bugs.mysql.com/bug.php?id=7440

    In the Google patch and official MySQL version, the sequence is:
    1) commit on master
    2) wait for slave to ack
    3) return to user

    After step 1 another user on the master can observe the commit and the
    following is possible:
    1) commit on master
    2) other user observes that commit on master
    3) master blows up and a user observed a commit that never made it to a slave

    I do not think this sequence should be possible in a sync replication
    system. But it is possible in what has been implemented for MySQL.
    Thus it was named semi-sync rather than sync.

    --
    Mark Callaghan
    mdcallag@gmail.com
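
The window described in the message above comes down to the order of two events: when the commit becomes visible to other sessions, and when the slave has it. The sketch below replays both orderings and stops at a simulated master crash. It is a toy model; `run`, `SEMI_SYNC`, and `ACK_FIRST` are hypothetical names, and "commit on master" is reduced to the single event `make_visible`.

```python
def run(order, crash_after):
    """Replay commit steps in 'order' and stop after step index 'crash_after'.

    Returns (visible_on_master, on_slave): whether other sessions could have
    seen the commit, and whether the slave had it, when the master died.
    """
    visible = on_slave = False
    for i, step in enumerate(order):
        if step == "make_visible":
            visible = True
        elif step == "slave_ack":
            on_slave = True
        if i == crash_after:
            break
    return visible, on_slave


# Ordering described for MySQL semi-sync: commit, wait for ack, return.
SEMI_SYNC = ["make_visible", "slave_ack", "return_to_user"]
# Alternative ordering: wait for the ack before exposing the commit.
ACK_FIRST = ["slave_ack", "make_visible", "return_to_user"]

# The master dies right after the first step of each ordering.
print(run(SEMI_SYNC, 0))   # (True, False): observed, but never reached the slave
print(run(ACK_FIRST, 0))   # (False, True): never observed before the slave had it
```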
  • Robert Haas at Mar 18, 2011 at 1:31 pm

    On Fri, Mar 18, 2011 at 9:16 AM, MARK CALLAGHAN wrote:
    On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner wrote:
    Google invented the term "semi-synchronous" for something that's
    essentially the same that we have, now, I think.  However, I
    wholeheartedly hate that term (based on the reasoning that there's no
    semi-pregnant, either).
    We didn't invent the term, we just implemented something that Heikki
    Tuuri briefly described, for example:
    http://bugs.mysql.com/bug.php?id=7440

    In the Google patch and official MySQL version, the sequence is:
    1) commit on master
    2) wait for slave to ack
    3) return to user

    After step 1 another user on the master can observe the commit and the
    following is possible:
    1) commit on master
    2) other user observes that commit on master
    3) master blows up and a user observed a commit that never made it to a slave

    I do not think this sequence should be possible in a sync replication
    system. But it is possible in what has been implemented for MySQL.
    Thus it was named semi-sync rather than sync.
    Thanks for the insight. That can't happen with our implementation, I believe.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Mar 18, 2011 at 1:40 pm

    MARK CALLAGHAN wrote:
    Markus Wanner wrote:
    Google invented the term "semi-synchronous" for something that's
    essentially the same that we have, now, I think. However, I
    wholeheartedly hate that term (based on the reasoning that there's no
    semi-pregnant, either).
    To be fair, what we're considering calling semi-synchronous is
    something which tries to stay in synchronous mode but switches out
    of it when necessary to meet availability targets. Your analogy
    doesn't match up at all well -- at least without getting really
    ugly.
    We didn't invent the term, we just implemented something that
    Heikki Tuuri briefly described, for example:
    http://bugs.mysql.com/bug.php?id=7440

    In the Google patch and official MySQL version, the sequence is:
    1) commit on master
    2) wait for slave to ack
    3) return to user

    After step 1 another user on the master can observe the commit and
    the following is possible:
    1) commit on master
    2) other user observes that commit on master
    3) master blows up and a user observed a commit that never made it
    to a slave

    I do not think this sequence should be possible in a sync
    replication system.
    Then the only thing you would consider sync replication, as far as I
    can see, is two phase commit, which we already have. So your use
    case seems to be covered already, and we're trying to address other
    people's needs. The guarantee that some people are looking for is
    that a successful commit means that the data has been persisted on
    two separate servers. Others want to try for that, but are willing
    to compromise it for HA; in general I think they want to know when
    the guarantee is not there so they can take action to get back to a
    safer condition.

    -Kevin
  • Markus Wanner at Mar 18, 2011 at 2:37 pm
    Hi,
    On 03/18/2011 02:40 PM, Kevin Grittner wrote:
    Then the only thing you would consider sync replication, as far as I
    can see, is two phase commit
    I think waiting for the ACK before actually making the changes from the
    transaction visible (COMMIT) would suffice for disallowing such an
    inconsistency to manifest. But obviously, MySQL decided it's not worth
    doing that, as it's such a rare event and a short period of time that
    may show inconsistencies...
    people's needs. The guarantee that some people are looking for is
    that a successful commit means that the data has been persisted on
    two separate servers.
    Well, MySQL's semi-sync also seems to guarantee that WRT the client
    confirmation. And transactions always appear committed *before* the
    client receives the COMMIT acknowledgement, due to the time it takes for
    the ACK to arrive at the client.

    It's just the commit *before* receiving the slave's ACK, which might
    make a transaction visible that's not durable, yet. But I guess that
    simplified implementation for them...

    Regards

    Markus Wanner
  • MARK CALLAGHAN at Mar 18, 2011 at 4:03 pm

    On Fri, Mar 18, 2011 at 2:37 PM, Markus Wanner wrote:
    Hi,
    On 03/18/2011 02:40 PM, Kevin Grittner wrote:
    Then the only thing you would consider sync replication, as far as I
    can see, is two phase commit
    I think waiting for the ACK before actually making the changes from the
    transaction visible (COMMIT) would suffice for disallowing such an
    inconsistency to manifest.  But obviously, MySQL decided it's not worth
    doing that, as it's such a rare event and a short period of time that
    may show inconsistencies...
    There are fewer options for implementing this in MySQL because
    replication requires a binlog on the master and that requires the
    internal use of XA to keep the binlog and InnoDB in sync as they are
    separate resource managers. In theory, this can be changed so that
    commit is only forced for the binlog and then on a crash missing
    transactions could be copied from the binlog to InnoDB but I don't
    think this will ever change.

    By "fewer options" I mean that commit in MySQL with InnoDB and the
    binlog requires:
    1) prepare to InnoDB (force transaction log to disk for changes from
    this transaction)
    2) write binlog events from this transaction to the binlog
    3) write XID event to the binlog (at this point transaction commit is
    official, will survive a crash)
    4) force binlog to disk
    5) release row locks held by transaction in innodb
    6) write commit record to innodb transaction log
    7) force write of commit record to disk

    Group commit is done for the fsyncs from steps 1 and 7. It is not done
    for the fsync done in step 4.

    Regardless, the processing above is complicated even without
    semi-sync. AFAIK, semi-sync code occurs after step 7 but I have not
    looked at the official version of semi-sync code in MySQL and my
    memory of the work we did at Google is vague.

    It is great if Postgres doesn't have this issue. It wasn't clear to me
    from lurking on this list. I hope your docs highlight the behavior as
    not having the issue is a big deal.

    --
    Mark Callaghan
    mdcallag@gmail.com
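
The seven commit steps listed in the message above can be written down as data, which makes it easy to see which steps force a write to disk and which of those benefit from group commit. This is just a restatement of the list, not MySQL code; the `Step` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    forces_disk: bool = False    # step ends with an fsync
    group_commit: bool = False   # that fsync can be shared across transactions


COMMIT_STEPS = [
    Step(1, "prepare in InnoDB", forces_disk=True, group_commit=True),
    Step(2, "write transaction events to binlog"),
    Step(3, "write XID event to binlog (commit is now official)"),
    Step(4, "force binlog to disk", forces_disk=True),
    Step(5, "release InnoDB row locks"),
    Step(6, "write commit record to InnoDB log"),
    Step(7, "force commit record to disk", forces_disk=True, group_commit=True),
]

fsync_steps = [s.number for s in COMMIT_STEPS if s.forces_disk]
ungrouped = [s.number for s in COMMIT_STEPS if s.forces_disk and not s.group_commit]

print(fsync_steps)   # [1, 4, 7]
print(ungrouped)     # [4]
```

Per the message, the semi-sync wait would come after step 7, which is after the row locks have been released in step 5.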
  • Simon Riggs at Mar 18, 2011 at 2:19 pm

    On Fri, 2011-03-18 at 13:16 +0000, MARK CALLAGHAN wrote:
    On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner wrote:
    Google invented the term "semi-synchronous" for something that's
    essentially the same that we have, now, I think. However, I
    wholeheartedly hate that term (based on the reasoning that there's no
    semi-pregnant, either).
    We didn't invent the term, we just implemented something that Heikki
    Tuuri briefly described, for example:
    http://bugs.mysql.com/bug.php?id=7440

    In the Google patch and official MySQL version, the sequence is:
    1) commit on master
    2) wait for slave to ack
    3) return to user

    After step 1 another user on the master can observe the commit and the
    following is possible:
    1) commit on master
    2) other user observes that commit on master
    3) master blows up and a user observed a commit that never made it to a slave

    I do not think this sequence should be possible in a sync replication
    system. But it is possible in what has been implemented for MySQL.
    Thus it was named semi-sync rather than sync.
    Thanks for clearing it up Mark.

    We should definitely not be calling what we have "semi-sync". The
    semantics are very different.

    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Kevin Grittner at Mar 18, 2011 at 2:52 pm

    Simon Riggs wrote:

    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    Really? I hadn't picked up on that. That makes for a lot of
    complication on crash-and-recovery of a master, but if we can pull
    it off, that's really cool. If we do that and MySQL doesn't, we
    definitely don't want to use the same terminology they do, which
    would imply the same behavior.

    Apologies for not picking up on that aspect of the implementation.

    -Kevin
  • Markus Wanner at Mar 18, 2011 at 3:44 pm

    On 03/18/2011 03:52 PM, Kevin Grittner wrote:
    Really? I hadn't picked up on that. That makes for a lot of
    complication on crash-and-recovery of a master
    What complication do you have in mind here?

    I think of it the opposite way (at least for Postgres, that is):
    committing a transaction that's not acknowledged means having to revert
    a (locally only) committed transaction if you want to use the current
    data to recover to some cluster-agreed state. (Of course, you can
    always simply transfer the whole

    If you don't commit the transaction before the ACK in the first place,
    you don't have anything special to do upon recovery.

    Regards

    Markus Wanner
  • Heikki Linnakangas at Mar 18, 2011 at 3:48 pm

    On 18.03.2011 16:52, Kevin Grittner wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    Really? I hadn't picked up on that. That makes for a lot of
    complication on crash-and-recovery of a master, but if we can pull
    it off, that's really cool. If we do that and MySQL doesn't, we
    definitely don't want to use the same terminology they do, which
    would imply the same behavior.
    To be clear: other users cannot observe the commit until standby
    acknowledges it - unless the master crashes while waiting for the
    acknowledgment. If that happens, the commit will be visible to everyone
    after recovery.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Mar 18, 2011 at 4:19 pm

    On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
    On 18.03.2011 16:52, Kevin Grittner wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    Really? I hadn't picked up on that. That makes for a lot of
    complication on crash-and-recovery of a master, but if we can pull
    it off, that's really cool. If we do that and MySQL doesn't, we
    definitely don't want to use the same terminology they do, which
    would imply the same behavior.
    To be clear: other users cannot observe the commit until standby
    acknowledges it - unless the master crashes while waiting for the
    acknowledgment. If that happens, the commit will be visible to everyone
    after recovery.
    No, only in the case where you choose not to failover to the standby
    when you crash, which would be a fairly strange choice after the effort
    to set up the standby. In a correctly configured and operated cluster
    what I say above is fully correct and needs no addendum.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Mar 18, 2011 at 4:33 pm

    On Fri, Mar 18, 2011 at 12:19 PM, Simon Riggs wrote:
    On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
    On 18.03.2011 16:52, Kevin Grittner wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    Really?  I hadn't picked up on that.  That makes for a lot of
    complication on crash-and-recovery of a master, but if we can pull
    it off, that's really cool.  If we do that and MySQL doesn't, we
    definitely don't want to use the same terminology they do, which
    would imply the same behavior.
    To be clear: other users cannot observe the commit until standby
    acknowledges it - unless the master crashes while waiting for the
    acknowledgment. If that happens, the commit will be visible to everyone
    after recovery.
    No, only in the case where you choose not to failover to the standby
    when you crash, which would be a fairly strange choice after the effort
    to set up the standby. In a correctly configured and operated cluster
    what I say above is fully correct and needs no addendum.
    Except it doesn't work that way. If, say, a backend on the master
    core dumps, the system will perform a crash and restart cycle, and the
    transaction will become visible whether it's yet been replicated or
    not. Since we now have a GUC to suppress restart after a backend
    crash, it's theoretically possible to set up the system so that this
    doesn't occur, but it'd take quite a bit of work to make it robust and
    automatic, and it's certainly not the default out of the box.

    The fundamental problem here is that once you update CLOG and flush
    the corresponding WAL record, there is no going backward. You can
    hold the system in some intermediate state where the transaction still
    holds locks and is excluded from MVCC snapshots, but there's no way to
    back up. So there are bound to be corner cases where the
    wait doesn't last as long as you want, and stuff leaks out around the
    edges. It's fundamentally impossible to guarantee that you'll remain
    in that intermediate state forever - what do you do if a meteor hits
    the synchronous standby and at the same time you lose power to the
    master? No amount of configuration will save you from coming back on
    line with a visible-but-unreplicated transaction. I'm not knocking
    the system; I think what we have is impressively good. But pretending
    that corner cases can't happen gets us nowhere.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
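
The corner case in the message above can be sketched as a toy model, assuming that the "waiting for ack" state lives only in memory while the commit record is already durable. `Master` and its attributes are hypothetical and greatly simplify what CLOG, WAL, and MVCC snapshots actually do.

```python
class Master:
    def __init__(self):
        self.wal = []          # durable: survives a crash
        self.waiting = set()   # in memory only: commits held back for an ack
        self.visible = set()   # commits other sessions can see

    def commit(self, xid):
        self.wal.append(xid)   # commit record flushed; no going backward
        self.waiting.add(xid)  # hidden from snapshots until the standby acks

    def standby_ack(self, xid):
        self.waiting.discard(xid)
        self.visible.add(xid)

    def crash_and_recover(self):
        self.waiting.clear()           # the wait state is lost with the process
        self.visible = set(self.wal)   # replay: every flushed commit is committed


m = Master()
m.commit(42)
visible_before = 42 in m.visible   # still waiting for the standby
m.crash_and_recover()
visible_after = 42 in m.visible    # visible although no ack ever arrived

print(visible_before, visible_after)   # False True
```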
  • Kevin Grittner at Mar 18, 2011 at 4:49 pm

    Robert Haas wrote:
    Simon Riggs wrote:
    No, only in the case where you choose not to failover to the
    standby when you crash, which would be a fairly strange choice
    after the effort to set up the standby. In a correctly configured
    and operated cluster what I say above is fully correct and needs
    no addendum.
    what do you do if a meteor hits the synchronous standby and at the
    same time you lose power to the master? No amount of
    configuration will save you from coming back on line with a
    visible-but-unreplicated transaction.
    You don't even need to postulate an extreme condition like that; we
    prefer to have a DBA pull the trigger on a failover, rather than
    trust the STONITH call to software. This is particularly true when
    the master is local to its primary users and the replica is remote
    to them.

    -Kevin
  • Greg Stark at Mar 18, 2011 at 5:36 pm

    On Fri, Mar 18, 2011 at 4:33 PM, Robert Haas wrote:
    The fundamental problem here is that once you update CLOG and flush
    the corresponding WAL record, there is no going backward.  You can
    hold the system in some intermediate state where the transaction still
    holds locks and is excluded from MVCC snapshots, but there's no way to
    back up.  So there are bound to be corner cases where the
    wait doesn't last as long as you want, and stuff leaks out around the
    edges.

    I'm finding this whole idea of hiding the committed transaction until
    the slave acks it kind of strange. It means there are times when the
    slave is actually *ahead* of the master which would actually be kind
    of hard to code against if you're trying to use the slave as a
    possibly-not-up-to-date mirror.

    I think promising that the COMMIT doesn't return until the transaction
    and all previous transactions are replicated is enough. We don't have
    to promise that nobody else will see it either. Those same
    transactions eventually have to commit as well and if they want that
    level of protection they can block waiting until they're replicated as
    well which will imply that anything they depended on will be
    replicated.

    This is akin to the synchronous_commit=off case where other
    transactions can see your data as soon as you commit even before the
    xlog is fsynced. If you have synchronous_commit mode enabled then
    you'll block until your xlog is fsynced and that will implicitly mean
    the other transactions you saw were also fsynced.

    --
    greg
  • Markus Wanner at Mar 18, 2011 at 7:18 pm

    On 03/18/2011 06:35 PM, Greg Stark wrote:
    I think promising that the COMMIT doesn't return until the transaction
    and all previous transactions are replicated is enough. We don't have
    to promise that nobody else will see it either. Those same
    transactions eventually have to commit as well
    No, they don't have to. They can ROLLBACK, get aborted, lose connection
    to the master, etc. The issue here is that, given the MySQL scheme,
    these transactions see a snapshot that's not durable, because at that
    point in time no standby has yet guaranteed that it stored the
    transaction being committed. So in case of a failover, you'd suddenly see a
    different snapshot (and lose the changes of that transaction).
    This is akin to the synchronous_commit=off case where other
    transactions can see your data as soon as you commit even before the
    xlog is fsynced. If you have synchronous_commit mode enabled then
    you'll block until your xlog is fsynced and that will implicitly mean
    the other transactions you saw were also fsynced.
    Somewhat, yes. And for exactly that reason, most users run with
    synchronous_commit enabled. They don't want to lose committed transactions.

    Regards

    Markus Wanner
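
The hazard described above can be sketched as a small simulation. This is only an illustration under simplifying assumptions, not how either PostgreSQL or MySQL is implemented: `Node`, `visible_on_master` and the `wait_for_ack` flag are hypothetical names, and a "transaction" is just an integer id.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a server: the transaction ids it has durably stored."""
    durable: set = field(default_factory=set)


def visible_on_master(master: Node, acked: set, wait_for_ack: bool) -> set:
    """Transactions that other sessions may see on the master.

    wait_for_ack=True  -> visibility is withheld until the standby has acknowledged.
    wait_for_ack=False -> visible as soon as the master has committed locally.
    """
    return master.durable & acked if wait_for_ack else set(master.durable)


master, standby = Node(), Node()

# Transaction 1 is committed on the master and acknowledged by the standby.
master.durable.add(1)
standby.durable.add(1)
acked = {1}

# Transaction 2 is committed on the master, but the standby has not stored it yet.
master.durable.add(2)

seen_early = visible_on_master(master, acked, wait_for_ack=False)
seen_late = visible_on_master(master, acked, wait_for_ack=True)

# The master fails and the standby is promoted: only what the standby stored survives.
after_failover = set(standby.durable)

lost_early = seen_early - after_failover  # seen by readers, then gone
lost_late = seen_late - after_failover    # nothing that was seen is lost
print(lost_early, lost_late)
```

With visibility granted before the acknowledgement, a reader can observe transaction 2 and then find it missing after failover; when visibility waits for the acknowledgement, nothing a reader saw disappears.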
  • Markus Wanner at Mar 18, 2011 at 7:19 pm
    Simon,
    On 03/18/2011 05:19 PM, Simon Riggs wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    On other nodes as well? To me that means the standby needs to hold back
    COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
    that it committed the transaction there. How else could the slave know
    when to commit its ACKed transactions?
    No, only in the case where you choose not to failover to the standby
    when you crash, which would be a fairly strange choice after the effort
    to set up the standby. In a correctly configured and operated cluster
    what I say above is fully correct and needs no addendum.
    If you don't failover, how can the standby be ahead of the master, given
    it takes measures not to be during normal operation?

    Eager to understand... ;-)

    Regards

    Markus
  • Simon Riggs at Mar 18, 2011 at 7:30 pm

    On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
    Simon,
    On 03/18/2011 05:19 PM, Simon Riggs wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    On other nodes as well? To me that means the standby needs to hold back
    COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
    that it committed the transaction there. How else could the slave know
    when to commit its ACKed transactions?
    We could do that easily enough, actually, if we wished.

    Do we wish?
    No, only in the case where you choose not to failover to the standby
    when you crash, which would be a fairly strange choice after the effort
    to set up the standby. In a correctly configured and operated cluster
    what I say above is fully correct and needs no addendum.
    If you don't failover, how can the standby be ahead of the master, given
    it takes measures not to be during normal operation?

    Eager to understand... ;-)

    Regards

    Markus
    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Kevin Grittner at Mar 18, 2011 at 7:34 pm

    Simon Riggs wrote:
    On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:

    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    On other nodes as well? To me that means the standby needs to
    hold back COMMIT of an ACKed transaction, until it receives a re-ACK
    from the master, that it committed the transaction there. How
    else could the slave know when to commit its ACKed transactions?
    We could do that easily enough, actually, if we wished.

    Do we wish?
    +1

    If we're going out of our way to suppress it on the master until the
    COMMIT returns, it shouldn't be showing on the replicas before that.

    -Kevin
  • Markus Wanner at Mar 18, 2011 at 7:41 pm

    On 03/18/2011 08:29 PM, Simon Riggs wrote:
    We could do that easily enough, actually, if we wished.

    Do we wish?
    I personally don't see any problem letting a standby show a snapshot
    before the master. I'd consider it unneeded network traffic. But then
    again, I'm completely biased.

    Regards

    Markus Wanner
  • Aidan Van Dyk at Mar 22, 2011 at 8:07 pm

    On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner wrote:
    On 03/18/2011 08:29 PM, Simon Riggs wrote:
    We could do that easily enough, actually, if we wished.

    Do we wish?
    I personally don't see any problem letting a standby show a snapshot
    before the master.  I'd consider it unneeded network traffic.  But then
    again, I'm completely biased.
    In fact, we *need* to have standbys show a snapshot before the master.

    By the time the master acks the commit to the client, the snapshot
    must be visible to all clients connected to both the master and the
    synchronous slave.

    Even with just a single-server PostgreSQL cluster, other
    clients (backends) can see the commit before the committing client
    receives the ACK. It's just that on a single server, the time period for
    that is small.

    Sync rep increases that time period by the length of time from when
    the slave reaches the commit point in the WAL stream to when its ack
    of that point gets back to the WAL sender. Ideally, that ACK time is
    small.

    Adding another round trip in there just for a "go almost to $COMIT,
    ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
    performance, and still not improve the *guarantees* it can make.

    It can only slightly reduce, but not eliminate, that window where the
    master has WAL that the slave doesn't, and without a complete
    elimination (where you just switch the problem to be the slave has the
    data that the master doesn't), you haven't changed any of the
    guarantees sync rep can make (or not).

    a.

    --
    Aidan Van Dyk                                             Create like a god,
    aidan@highrise.ca                                       command like a king,
    http://www.highrise.ca/                                   work like a slave.
  • Simon Riggs at Mar 18, 2011 at 9:26 pm

    On Fri, 2011-03-18 at 17:08 -0400, Aidan Van Dyk wrote:
    On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner wrote:
    On 03/18/2011 08:29 PM, Simon Riggs wrote:
    We could do that easily enough, actually, if we wished.

    Do we wish?
    I personally don't see any problem letting a standby show a snapshot
    before the master. I'd consider it unneeded network traffic. But then
    again, I'm completely biased.
    In fact, we *need* to have standbys show a snapshot before the master.

    By the time the master acks the commit to the client, the snapshot
    must be visible to all clients connected to both the master and the
    synchronous slave.

    Even with just a single-server PostgreSQL cluster, other
    clients (backends) can see the commit before the committing client
    receives the ACK. It's just that on a single server, the time period for
    that is small.

    Sync rep increases that time period by the length of time from when
    the slave reaches the commit point in the WAL stream to when its ack
    of that point gets back to the WAL sender. Ideally, that ACK time is
    small.

    Adding another round trip in there just for a "go almost to $COMIT,
    ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
    performance, and still not improve the *guarantees* it can make.

    It can only slightly reduce, but not eliminate, that window where the
    master has WAL that the slave doesn't, and without a complete
    elimination (where you just switch the problem to be the slave has the
    data that the master doesn't), you haven't changed any of the
    guarantees sync rep can make (or not).
    Well explained observation. Agreed.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Mar 22, 2011 at 8:38 pm

    On Fri, Mar 18, 2011 at 5:08 PM, Aidan Van Dyk wrote:
    On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner wrote:
    On 03/18/2011 08:29 PM, Simon Riggs wrote:
    We could do that easily enough, actually, if we wished.

    Do we wish?
    I personally don't see any problem letting a standby show a snapshot
    before the master.  I'd consider it unneeded network traffic.  But then
    again, I'm completely biased.
    In fact, we *need* to have standbys show a snapshot before the master.

    By the time the master acks the commit to the client, the snapshot
    must be visible to all clients connected to both the master and the
    synchronous slave.
    We might have a version of synchronous replication that works this way
    some day, but it's not the version we're shipping with 9.1. The slave
    acknowledges the WAL records when they hit the disk (i.e. fsync) not
    when they are applied; WAL apply can lag arbitrarily. The point is to
    guarantee clients that the WAL is on disk somewhere and that it will
    be replayed in the event of a failover. Despite the fact that this
    doesn't work as you're describing, it's a useful feature in its own
    right.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
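
The distinction drawn above between acknowledging on flush and acknowledging on apply can be sketched with a toy model. It is a minimal sketch under stated assumptions: the `Standby` class and its methods are hypothetical, and WAL positions are plain integers rather than real LSNs.

```python
class Standby:
    """Toy standby tracking two positions in the WAL stream (plain integers here)."""

    def __init__(self):
        self.flushed = 0   # highest WAL position written and fsynced
        self.applied = 0   # highest WAL position replayed, i.e. visible to queries

    def receive(self, position: int) -> int:
        """Flush WAL up to `position` and return the acknowledged position."""
        self.flushed = max(self.flushed, position)
        return self.flushed  # the ACK reflects flush, not apply

    def replay(self, upto: int) -> None:
        """Apply WAL, never beyond what has been flushed."""
        self.applied = max(self.applied, min(upto, self.flushed))

    def can_see(self, commit_position: int) -> bool:
        """True once the commit record at this position has been replayed."""
        return self.applied >= commit_position

    def promote(self) -> None:
        """On failover, all flushed WAL is replayed before the node takes over."""
        self.applied = self.flushed


standby = Standby()
ack = standby.receive(100)               # commit record at position 100 is on disk
visible_before = standby.can_see(100)    # replay has not caught up yet
standby.promote()
visible_after = standby.can_see(100)     # replayed as part of failover
```

In this model the acknowledgement says nothing about what queries on the standby can see; it only says the commit will be there after a failover.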
  • Markus Wanner at Mar 23, 2011 at 7:27 am

    On 03/22/2011 09:33 PM, Robert Haas wrote:
    We might have a version of synchronous replication that works this way
    some day, but it's not the version we're shipping with 9.1. The slave
    acknowledges the WAL records when they hit the disk (i.e. fsync) not
    when they are applied; WAL apply can lag arbitrarily. The point is to
    guarantee clients that the WAL is on disk somewhere and that it will
    be replayed in the event of a failover. Despite the fact that this
    doesn't work as you're describing, it's a useful feature in its own
    right.
    In that sense, our approach may be more synchronous than most others,
    because after the ACK is sent from the slave, the slave still needs to
    apply the transaction data from WAL before it gets visible, while the
    master needs to wait for the ACK to arrive at its side, before making it
    visible there.

    Ideally, these two latencies (disk seek and network induced) are just
    about equal. But of course, there's no such guarantee. So whenever one
    of the two is off by an order of magnitude or two (by use case or due to
    a temporary overload), either the master or the slave may lag behind the
    other machine.

    What pleases me is that the guarantee from the slave is somewhat similar
    to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
    transaction *is* applied locally, it just guarantees that it *will* be
    able to do so sometime in the future. Kind of a mind twister, though...

    Regards

    Markus
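
The race between the two latencies described above can be written down as a small comparison. This is an illustrative sketch only; `first_to_show` and its parameters are hypothetical names, and both delays are measured from the moment the standby has flushed the commit record.

```python
def first_to_show(ack_latency_ms: float, apply_latency_ms: float) -> str:
    """Return which node exposes a commit first, measured from the standby's flush.

    ack_latency_ms:   time for the ACK to travel back to the master (network).
    apply_latency_ms: time for the standby to replay the commit record.
    """
    if apply_latency_ms < ack_latency_ms:
        return "standby"
    if apply_latency_ms > ack_latency_ms:
        return "master"
    return "tie"


# A slow network link: the standby replays before the ACK reaches the master.
slow_network = first_to_show(ack_latency_ms=5.0, apply_latency_ms=0.5)

# An overloaded standby: the master exposes the commit long before replay.
slow_apply = first_to_show(ack_latency_ms=0.2, apply_latency_ms=20.0)
```

Either outcome is possible depending on load and topology, which is why neither node can be assumed to be "ahead" at any given moment.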
  • Robert Haas at Mar 23, 2011 at 11:52 am

    On Wed, Mar 23, 2011 at 3:27 AM, Markus Wanner wrote:
    On 03/22/2011 09:33 PM, Robert Haas wrote:
    We might have a version of synchronous replication that works this way
    some day, but it's not the version we're shipping with 9.1.  The slave
    acknowledges the WAL records when they hit the disk (i.e. fsync) not
    when they are applied; WAL apply can lag arbitrarily.  The point is to
    guarantee clients that the WAL is on disk somewhere and that it will
    be replayed in the event of a failover.  Despite the fact that this
    doesn't work as you're describing, it's a useful feature in its own
    right.
    In that sense, our approach may be more synchronous than most others,
    because after the ACK is sent from the slave, the slave still needs to
    apply the transaction data from WAL before it gets visible, while the
    master needs to wait for the ACK to arrive at its side, before making it
    visible there.

    Ideally, these two latencies (disk seek and network induced) are just
    about equal.  But of course, there's no such guarantee.  So whenever one
    of the two is off by an order of magnitude or two (by use case or due to
    a temporary overload), either the master or the slave may lag behind the
    other machine.

    What pleases me is that the guarantee from the slave is somewhat similar
    to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
    transaction *is* applied locally, it just guarantees that it *will* be
    able to do so sometime in the future.  Kind of a mind twister, though...
    Yes. What this won't do is let you build a big load-balancing network
    (at least not without great caution about what you assume). What it
    will do is make it really, really hard to lose committed transactions.
    Both good things, but different.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Markus Wanner at Mar 23, 2011 at 12:16 pm

    On 03/23/2011 12:52 PM, Robert Haas wrote:
    Yes. What this won't do is let you build a big load-balancing network
    (at least not without great caution about what you assume).
    This sounds too strong to me. Session-aware load balancing is pretty
    common these days. It's the default mode of PgBouncer, for example.
    Not much caution required there, IMO. Or what pitfalls did you have in
    mind?
    What it
    will do is make it really, really hard to lose committed transactions.
    Both good things, but different.
    ..you can still get both at the same time. At least as long as you are
    happy with session-aware load balancing. And who really needs finer
    grained balancing?

    (Note that no matter how fine-grained you balance, you are still bound
    to a (single core of a) single node. That changes with distributed
    querying, and things really start to get interesting there... but we are
    far from that, yet).

    Regards

    Markus
  • Robert Haas at Mar 23, 2011 at 3:24 pm

    On Wed, Mar 23, 2011 at 8:16 AM, Markus Wanner wrote:
    On 03/23/2011 12:52 PM, Robert Haas wrote:
    Yes.  What this won't do is let you build a big load-balancing network
    (at least not without great caution about what you assume).
    This sounds too strong to me.  Session-aware load balancing is pretty
    common these days.  It's the default mode of PgBouncer, for example.
    Not much caution required there, IMO.  Or what pitfalls did you have in
    mind?
    Well, just the one we were talking about: a COMMIT on one node doesn't
    guarantee that the transaction is visible on the other node, just
    that it will become visible there eventually, even if a crash happens.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Mar 18, 2011 at 9:18 pm

    On Fri, Mar 18, 2011 at 3:29 PM, Simon Riggs wrote:
    On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
    Simon,
    On 03/18/2011 05:19 PM, Simon Riggs wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    On other nodes as well?  To me that means the standby needs to hold back
    COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
    that it committed the transaction there.  How else could the slave know
    when to commit its ACKed transactions?
    We could do that easily enough, actually, if we wished.

    Do we wish?
    Seems like it would be nice, but isn't it dreadfully expensive?
    Wouldn't you need to prevent the slave from applying the WAL until the
    master has released the sync rep waiters? You'd need a whole new
    series of messages back and forth.

    Since the current solution is intended to support data-loss-free
    failover, but NOT to guarantee a consistent view of the world from a
    SQL level, I doubt it's worth paying any price for this. Certainly in
    the hot_standby=off case it's a nonissue. We might need to think
    harder about it when and if someone implements an 'apply' level though,
    because this would seem more of a concern in that case (though I
    haven't thought through all the details).

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Mar 18, 2011 at 9:24 pm

    Robert Haas wrote:

    Since the current solution is intended to support data-loss-free
    failover, but NOT to guarantee a consistent view of the world from
    a SQL level, I doubt it's worth paying any price for this.
    Well, that brings us back to the question of why we would want to
    suppress the view of the data on the master until the replica
    acknowledges the commit. It *is* committed on the master, we're
    just holding off on telling the committer about it until we can
    honor the guarantee of replication. If it can be seen on the
    replica before the committer gets such acknowledgment, why not on the
    master?

    -Kevin
  • Simon Riggs at Mar 18, 2011 at 9:30 pm

    On Fri, 2011-03-18 at 16:24 -0500, Kevin Grittner wrote:
    Robert Haas wrote:
    Since the current solution is intended to support data-loss-free
    failover, but NOT to guarantee a consistent view of the world from
    a SQL level, I doubt it's worth paying any price for this.
    Well, that brings us back to the question of why we would want to
    suppress the view of the data on the master until the replica
    acknowledges the commit. It *is* committed on the master, we're
    just holding off on telling the committer about it until we can
    honor the guarantee of replication. If it can be seen on the
    replica before the committer gets such acknowledgment, why not on the
    master?
    I think the issue is explicit acknowledgement, not visibility.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
  • Robert Haas at Mar 18, 2011 at 9:43 pm

    On Fri, Mar 18, 2011 at 5:24 PM, Kevin Grittner wrote:
    Robert Haas wrote:
    Since the current solution is intended to support data-loss-free
    failover, but NOT to guarantee a consistent view of the world from
    a SQL level, I doubt it's worth paying any price for this.
    Well, that brings us back to the question of why we would want to
    suppress the view of the data on the master until the replica
    acknowledges the commit.  It *is* committed on the master, we're
    just holding off on telling the committer about it until we can
    honor the guarantee of replication.  If it can be seen on the
    replica before the committer gets such acknowledgment, why not on the
    master?
    Well, the idea is that we don't want to let people depend on the value
    until it's guaranteed to be durably committed.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Mar 18, 2011 at 9:48 pm

    Robert Haas wrote:

    Well, the idea is that we don't want to let people depend on the
    value until it's guaranteed to be durably committed.
    OK, so if you see it on the replica, you know it is in at least two
    places. I guess that makes sense. It kinda "feels" wrong to see a
    view of the replica which is ahead of the master, but I guess it's
    the least of the evils. I guess we should document it, though, so
    nobody has a false expectation that seeing something on the replica
    means that a connection looking at the master will see something
    that current.

    -Kevin
  • Robert Haas at Mar 18, 2011 at 10:48 pm

    On Fri, Mar 18, 2011 at 5:48 PM, Kevin Grittner wrote:
    Robert Haas wrote:
    Well, the idea is that we don't want to let people depend on the
    value until it's guaranteed to be durably committed.
    OK, so if you see it on the replica, you know it is in at least two
    places.  I guess that makes sense.  It kinda "feels" wrong to see a
    view of the replica which is ahead of the master, but I guess it's
    the least of the evils.  I guess we should document it, though, so
    nobody has a false expectation that seeing something on the replica
    means that a connection looking at the master will see something
    that current.
    Yeah, it can go both ways: a snapshot taken on the standby can be
    either earlier or later in the commit ordering than the master.
    That's counterintuitive, but I see no reason to stress about it. It's
    perfectly reasonable to set up a server with synchronous replication
    for enhanced durability and also enable hot standby just for
    convenience, but without actually relying on it all that heavily, or
    only for non-critical reporting purposes. Synchronous replication,
    like asynchronous replication, is basically a high-availability tool.
    As long as it does that well, I'm not going to get worked up about the
    fact that it doesn't address every other use case someone might want.
    We can always add more frammishes in future releases.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Markus Wanner at Mar 19, 2011 at 7:27 pm

    On 03/18/2011 10:48 PM, Kevin Grittner wrote:
    the least of the evils. I guess we should document it, though, so
    nobody has a false expectation that seeing something on the replica
    means that a connection looking at the master will see something
    that current.
    Agreed. Note, however, that even if there's no such guarantee, it's
    highly unlikely for a user (or application) to ever notice this during
    normal operation.

    Regards

    Markus Wanner
  • Fujii Masao at Mar 25, 2011 at 12:12 pm

    On Sat, Mar 19, 2011 at 4:29 AM, Simon Riggs wrote:
    On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
    Simon,
    On 03/18/2011 05:19 PM, Simon Riggs wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    On other nodes as well?  To me that means the standby needs to hold back
    COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
    that it committed the transaction there.  How else could the slave know
    when to commit its ACKed transactions?
    We could do that easily enough, actually, if we wished.

    Do we wish?
    No.

    I'm not sure what the problem is with seeing, from the standby, data which is
    not yet visible on the master. And I'm really not sure whether that problem can
    be solved by making the data visible on the master before the standby. If we
    really want to see consistent data from each node, we should implement
    and use a cluster-wide snapshot, as Postgres-XC does.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Kevin Grittner at Mar 18, 2011 at 4:28 pm

    On 18.03.2011 16:52, Kevin Grittner wrote:
    Simon Riggs wrote:
    In PostgreSQL other users cannot observe the commit until an
    acknowledgement has been received.
    Really? I hadn't picked up on that. That makes for a lot of
    complication on crash-and-recovery of a master, but if we can
    pull it off, that's really cool.
    Markus Wanner wrote:
    What complication do you have in mind here?
    Basically, what Heikki addresses. It has to be committed after
    crash and recovery, and deal with replicas which may or may not have
    been notified and may or may not have applied the transaction.

    Heikki Linnakangas wrote:
    To be clear: other users cannot observe the commit until standby
    acknowledges it - unless the master crashes while waiting for the
    acknowledgment. If that happens, the commit will be visible to
    everyone after recovery.
    Right. If other transactions cannot see the transaction before the
    COMMIT returns, I was kinda assuming that this was the behavior,
    because otherwise one or more replicas could be ahead of the master
    after recovery, which would be horribly broken. I agree that the
    behavior which you describe is much better than allowing other
    transactions to see the work of the pending COMMIT.

    In fact, on further reflection, allowing other transactions to see
    work before the committing transaction returns could lead to broken
    behavior if that viewing transaction took some action based on
    that, the master crashed, recovery was done using a standby, and
    that standby hadn't persisted the transaction. So this behavior is
    necessary for good behavior. Even though that "perfect storm" of
    events might be fairly rare, the difference in the level of
    confidence in correctness is significant, and certainly something to
    brag about.

    -Kevin
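
The rule Heikki states above, including its crash exception, can be sketched as a toy state machine. This is an assumption-laden illustration rather than PostgreSQL's actual code: the `Master` class and its method names are hypothetical, and transactions are plain integer ids.

```python
class Master:
    """Toy model of the commit-visibility rule; every name here is hypothetical."""

    def __init__(self):
        self.flushed = []     # xids whose commit record is flushed to local WAL
        self.visible = set()  # xids that other sessions are allowed to see

    def commit(self, xid: int) -> None:
        # The commit record is durable locally, but the transaction stays
        # hidden while the backend waits for the standby's acknowledgement.
        self.flushed.append(xid)

    def standby_ack(self, xid: int) -> None:
        # The acknowledgement releases the waiter and exposes the commit.
        if xid in self.flushed:
            self.visible.add(xid)

    def crash_and_recover(self) -> None:
        # Recovery replays local WAL, where the commit is already recorded,
        # so every flushed commit becomes visible.
        self.visible = set(self.flushed)


master = Master()
master.commit(1)
master.standby_ack(1)
master.commit(2)                      # still waiting for the standby
before_crash = set(master.visible)    # only transaction 1
master.crash_and_recover()
after_recovery = set(master.visible)  # transactions 1 and 2
```

Transaction 2 was never acknowledged, yet it is visible after recovery, which is the corner case the thread keeps returning to.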
  • Markus Wanner at Mar 18, 2011 at 7:22 pm

    On 03/18/2011 05:27 PM, Kevin Grittner wrote:
    Basically, what Heikki addresses. It has to be committed after
    crash and recovery, and deal with replicas which may or may not have
    been notified and may or may not have applied the transaction.
    Huh? I'm not quite following here. Committing additional transactions
    isn't a problem, reverting committed transactions is.

    And yes, given that we only wait for ACK from a single standby, you'd
    have to failover to exactly *that* standby to guarantee consistency.
    In fact, on further reflection, allowing other transactions to see
    work before the committing transaction returns could lead to broken
    behavior if that viewing transaction took some action based on
    that, the master crashed, recovery was done using a standby, and
    that standby hadn't persisted the transaction. So this behavior is
    necessary for good behavior.
    I fully agree to that.

    Regards

    Markus
  • Markus Wanner at Mar 18, 2011 at 2:20 pm
    Mark,
    On 03/18/2011 02:16 PM, MARK CALLAGHAN wrote:
    We didn't invent the term, we just implemented something that Heikki
    Tuuri briefly described, for example:
    http://bugs.mysql.com/bug.php?id=7440
    Oh, okay, good to know who to blame ;-) However, I didn't mean to
    offend anybody.
    I do not think this sequence should be possible in a sync replication
    system. But it is possible in what has been implemented for MySQL.
    Thus it was named semi-sync rather than sync.
    Sure?

    Their documentation [1] isn't entirely clear on that first: "the master
    blocks after the commit is done and waits until at least one
    semisynchronous slave acknowledges that it has received all events for
    the transaction" and the "slave acknowledges receipt of a transaction's
    events only after the events have been written to its relay log and
    flushed to disk".

    But then continues to say that "[the master is] waiting for
    acknowledgment from a slave after having performed a commit", so this
    indeed sounds like the transaction is visible to other sessions before
    the slave ACKs.

    So, semi-sync may show temporary inconsistencies in case of a master
    failure. Wow!

    Regards

    Markus Wanner


    [1] MySQL 5.5 reference manual, 17.3.8. Semisynchronous Replication:
    http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html
  • MARK CALLAGHAN at Mar 18, 2011 at 3:52 pm

    On Fri, Mar 18, 2011 at 2:19 PM, Markus Wanner wrote:

    Their documentation [1] isn't entirely clear on that first: "the master
    blocks after the commit is done and waits until at least one
    semisynchronous slave acknowledges that it has received all events for
    the transaction" and the "slave acknowledges receipt of a transaction's
    events only after the events have been written to its relay log and
    flushed to disk".

    But then continues to say that "[the master is] waiting for
    acknowledgment from a slave after having performed a commit", so this
    indeed sounds like the transaction is visible to other sessions before
    the slave ACKs.
    Yes, their docs are not clear on this.

    --
    Mark Callaghan
    mdcallag@gmail.com

Discussion Overview
group: pgsql-hackers @
categories: postgresql
posted: Mar 6, '11 at 11:28p
active: Mar 25, '11 at 12:12p
posts: 76
users: 12
website: postgresql.org...
irc: #postgresql
