On Wednesday, September 12, 2012 10:15 PM Fujii Masao wrote:
On Wed, Sep 12, 2012 at 8:54 PM, wrote:
The following bug has been logged on the website:

Bug reference: 7534
Logged by: Amit Kapila
Email address: amit.kapila@huawei.com
PostgreSQL version: 9.2.0
Operating system: Suse 10
Description:
1. The master and standby machines are connected normally.
2. Run the command ifconfig ip down to take down the network cards of the
master and standby.
Observation
The master detects the broken connection, but the standby does not; it
keeps showing the connection as established for a long time.
What about setting keepalives_xxx libpq parameters?
http://www.postgresql.org/docs/devel/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
Keepalives are not a perfect solution for the termination of connection, but
it would help to a certain extent.
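
As a rough illustration of the keepalive suggestion, here is a minimal sketch that builds a connection string with the libpq keepalive parameters (keepalives, keepalives_idle, keepalives_interval, keepalives_count) and estimates how long an idle dead connection could go unnoticed. The helper names, the host name, and the values are assumptions made for this example, not anything from the bug report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write a libpq connection string with TCP keepalive
 * settings into buf. Returns the value of snprintf().
 */
int build_keepalive_conninfo(char *buf, size_t buflen, const char *host,
                             int idle_s, int interval_s, int count)
{
    assert(buf != NULL && host != NULL);
    return snprintf(buf, buflen,
                    "host=%s keepalives=1 keepalives_idle=%d "
                    "keepalives_interval=%d keepalives_count=%d",
                    host, idle_s, interval_s, count);
}

/*
 * Rough upper bound, in seconds, on the time needed to notice a dead peer
 * on an otherwise idle connection: the idle period, then count unanswered
 * probes spaced interval_s apart.
 */
int keepalive_detection_seconds(int idle_s, int interval_s, int count)
{
    return idle_s + interval_s * count;
}
```

With idle=30, interval=5 and count=3, the estimate is 45 seconds. The bound only applies while the connection is idle; once unacknowledged data is queued, TCP retransmission timers govern instead, which may be why keepalives alone do not always help.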
We tried enabling keepalives, but it did not work, maybe because
walreceiver is trying to send the receiver status.
That send fails only after many attempts.
If you need something like walreceiver-version of replication_timeout,
such feature has not been implemented yet.
Please feel free to implement that!
I would like to implement such a feature for walreceiver, but I am unsure
whether to use the same configuration parameter (replication_timeout) for
walreceiver as for the master, or to introduce a new configuration
parameter (receiver_replication_timeout).

The only reason to have different timeout parameters for walsender and
walreceiver is the case of a standby that runs both a walsender and a
walreceiver in order to send logs to a cascaded standby; in that case
somebody might want different timeout values for the two.
OTOH, too many parameters will create confusion. My opinion is to have one
timeout parameter for both walsender and walreceiver.

Let me know your suggestions or opinions on this.

Note: I am cc'ing pgsql-hackers, as this will be a feature request.

With Regards,
Amit Kapila.


  • Fujii Masao at Sep 13, 2012 at 5:27 pm

    On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila wrote:
    I would like to implement such a feature for walreceiver, but I am unsure
    whether to use the same configuration parameter (replication_timeout) for
    walreceiver as for the master, or to introduce a new configuration
    parameter (receiver_replication_timeout).
    I like the latter. I believe some users want to set different timeout
    values, for example, when the master and standby servers are in the same
    room but the cascaded standby is on another continent.

    Regards,

    --
    Fujii Masao
  • Amit kapila at Sep 14, 2012 at 1:01 pm

    On Thursday, September 13, 2012 10:57 PM Fujii Masao wrote:
    On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila wrote:
    I like the latter. I believe some users want to set different timeout
    values, for example, when the master and standby servers are in the same
    room but the cascaded standby is on another continent.
    Thank you for your suggestion. I have implemented it as you suggested, with a separate timeout parameter for walreceiver.
    The main changes are:
    1. Introduce a new configuration parameter, wal_receiver_replication_timeout, for walreceiver.
    2. In WalReceiverMain(), if there has been no communication for wal_receiver_replication_timeout, exit the walreceiver.
    This is the same as the walsender functionality.
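
The receiver-side check described in the two changes above could look roughly like the following. This is only a sketch of the idea, not the attached patch: the function name and the millisecond timestamps are assumptions, and treating a timeout of zero as "disabled" is assumed to mirror how replication_timeout behaves on the sender.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check for the walreceiver main loop: returns true when
 * nothing has been received from the sender for at least timeout_ms.
 * A timeout of zero or less disables the check.
 */
bool receiver_timed_out(int64_t last_recv_ms, int64_t now_ms,
                        int64_t timeout_ms)
{
    assert(now_ms >= last_recv_ms);

    if (timeout_ms <= 0)
        return false;           /* timeout mechanism disabled */

    return (now_ms - last_recv_ms) >= timeout_ms;
}
```

The caller would update last_recv_ms whenever any message arrives from the sender, and exit the walreceiver when the function returns true.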

    As this is a feature, I am submitting the attached patch to the upcoming CommitFest.

    Suggestions/Comments?

    With Regards,
    Amit Kapila.
  • Fujii Masao at Sep 15, 2012 at 5:57 am

    On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila wrote:
    Suggestions/Comments?
    You also need to change walsender so that it periodically sends a heartbeat
    message, as walreceiver does every wal_receiver_status_interval. Otherwise,
    walreceiver will wrongly detect a timeout whenever there is no traffic on
    the master.

    Regards,

    --
    Fujii Masao
  • Amit kapila at Sep 15, 2012 at 7:27 am

    On Saturday, September 15, 2012 11:27 AM Fujii Masao wrote: On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila wrote:

    You also need to change walsender so that it periodically sends a heartbeat
    message, as walreceiver does every wal_receiver_status_interval. Otherwise,
    walreceiver will wrongly detect a timeout whenever there is no traffic on
    the master.
    Won't the current keepalive message from walsender suffice for that?

    With Regards,
    Amit Kapila.
  • Fujii Masao at Sep 15, 2012 at 6:44 pm

    On Sat, Sep 15, 2012 at 4:26 PM, Amit kapila wrote:
    Won't the current keepalive message from walsender suffice for that?
    No. Though the keepalive interval should be smaller than the timeout,
    IIRC there is no way to specify the keepalive interval now.

    Regards,

    --
    Fujii Masao
  • Amit kapila at Sep 16, 2012 at 6:11 am

    On Sunday, September 16, 2012 12:14 AM Fujii Masao wrote: On Sat, Sep 15, 2012 at 4:26 PM, Amit kapila wrote:
    Won't the current keepalive message from walsender suffice for that?
    No. Though the keepalive interval should be smaller than the timeout,
    IIRC there is no way to specify the keepalive interval now.
    Currently, AFAICS in the code, on an idle system it should send a keepalive after 10s, which is the hardcoded value of sleeptime.
    You are right that if it is not configurable and somebody sets replication_timeout lower than 10s, the logic will fail.

    So would it be okay to add a new config parameter, similar to wal_receiver_status_interval, and map it directly to sleeptime in the current code?
    There would be no need for any new heartbeat message; the existing keepalive will suffice for that purpose.
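
To make the constraint concrete, here is a small sketch of the condition under discussion: an idle link stays up only if keepalives arrive more often than the receiver's timeout fires. The function name and the constant are assumptions for illustration, and network latency is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* The hardcoded 10 s sleeptime mentioned above, in milliseconds. */
#define DEFAULT_KEEPALIVE_MS 10000L

/*
 * Hypothetical check: on an idle system the sender emits one keepalive
 * every keepalive_interval_ms. The receiver avoids a false timeout only
 * if that interval is shorter than its own timeout. A timeout of zero or
 * less is treated as "disabled".
 */
bool idle_link_survives(long keepalive_interval_ms, long receiver_timeout_ms)
{
    assert(keepalive_interval_ms > 0);

    if (receiver_timeout_ms <= 0)
        return true;            /* no timeout configured */

    return keepalive_interval_ms < receiver_timeout_ms;
}
```

With the fixed 10 s interval, a 60 s timeout is safe but a 5 s timeout is not; making the interval configurable (say 1 s) would allow the 5 s timeout to work.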

    With Regards,
    Amit Kapila.
  • Amit Kapila at Sep 17, 2012 at 7:04 am

    On Sunday, September 16, 2012 12:14 AM Fujii Masao wrote: On Sat, Sep 15, 2012 at 4:26 PM, Amit kapila wrote:
    Won't the current keepalive message from walsender suffice for that?
    No. Though the keepalive interval should be smaller than the timeout,
    IIRC there is no way to specify the keepalive interval now.
    To define the behavior correctly, I see two options:

    Approach-1 :
    Document that both timeout parameters (sender and receiver) should be
    greater than wal_receiver_status_interval.
    If both are greater, then I think it should never time out due to idleness.

    Approach-2 :
    Provide a variable wal_send_status_interval such that, if it is 0, the
    current behavior prevails, and if it is non-zero, a KeepAlive message is
    sent after at most that much time.
    The modified code of WALSendLoop will be as follows:

    TimestampTz timeout = 0;
    long        sleeptime = 10000;  /* 10 s */
    int         wakeEvents;

    /*
     * sleeptime should be equal to wal_send_status_interval if it is
     * non-zero; otherwise it defaults to 10 s.
     */
    if (wal_send_status_interval > 0)
    {
        sleeptime = wal_send_status_interval;
    }

    wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
        WL_SOCKET_READABLE | WL_TIMEOUT;

    if (pq_is_send_pending())
        wakeEvents |= WL_SOCKET_WRITEABLE;
    else if (wal_send_status_interval > 0)
    {
        WalSndKeepalive(output_message);
        /* Try to flush pending output to the client */
        if (pq_flush_if_writable() != 0)
            break;
    }

    /* Determine time until replication timeout */
    if (replication_timeout > 0)
    {
        timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                              replication_timeout);

        if (wal_send_status_interval <= 0)
        {
            sleeptime = 1 + (replication_timeout / 10);
        }
    }

    /* Sleep until something happens or replication timeout */
    WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
                      MyProcPort->sock, sleeptime);

    /*
     * Check for replication timeout. Note we ignore the corner case
     * possibility that the client replied just as we reached the
     * timeout ... he's supposed to reply *before* that.
     */
    if (replication_timeout > 0 &&
        GetCurrentTimestamp() >= timeout)
    {
        /*
         * Since typically expiration of replication timeout means
         * communication problem, we don't send the error message to
         * the standby.
         */
        ereport(COMMERROR,
                (errmsg("terminating walsender process due to replication timeout")));
        break;
    }
    }
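
The sleeptime selection in the snippet above can be isolated as a pure function, which makes the three cases easier to see. This is a sketch only: the function name is made up for illustration, and it assumes, as the snippet does, that both settings are expressed in milliseconds.

```c
#include <assert.h>

/*
 * Hypothetical extraction of the sleeptime logic from the proposed loop:
 *   - wal_send_status_interval > 0: sleep for exactly that interval
 *   - otherwise, replication_timeout > 0: sleep for 1 + timeout/10
 *   - otherwise: fall back to the hardcoded 10 s
 * All values are assumed to be in milliseconds.
 */
long choose_sleeptime_ms(long wal_send_status_interval,
                         long replication_timeout)
{
    long sleeptime = 10000;     /* 10 s default */

    assert(wal_send_status_interval >= 0 && replication_timeout >= 0);

    if (wal_send_status_interval > 0)
        sleeptime = wal_send_status_interval;
    else if (replication_timeout > 0)
        sleeptime = 1 + (replication_timeout / 10);

    return sleeptime;
}
```

For example, with wal_send_status_interval = 0 and replication_timeout = 60000 the loop sleeps 6001 ms, as it does today; setting wal_send_status_interval = 2000 overrides that with 2000 ms.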

    Which approach do you think is better, or do you have another idea for handling this?

    With Regards,
    Amit Kapila.
  • Fujii Masao at Sep 18, 2012 at 12:32 pm

    On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila wrote:
    To define the behavior correctly, I see two options:

    Approach-1 :
    Document that both timeout parameters (sender and receiver) should be
    greater than wal_receiver_status_interval.
    If both are greater, then I think it should never time out due to idleness.
    In this approach, are keepalive messages sent every wal_receiver_status_interval?
    Approach-2 :
    Provide a variable wal_send_status_interval such that, if it is 0, the
    current behavior prevails, and if it is non-zero, a KeepAlive message is
    sent after at most that much time.
    The modified code of WALSendLoop will be as follows: <snip>
    Which approach do you think is better, or do you have another idea for handling this?
    I think #2 is better because it's more intuitive to a user.

    Regards,

    --
    Fujii Masao
  • Amit Kapila at Sep 18, 2012 at 12:51 pm

    On Tuesday, September 18, 2012 6:03 PM Fujii Masao wrote: On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila wrote:
    To define the behavior correctly, I see two options:
    Approach-1 :
    Document that both timeout parameters (sender and receiver) should be
    greater than wal_receiver_status_interval.
    If both are greater, then I think it should never time out due to idleness.
    In this approach, are keepalive messages sent every
    wal_receiver_status_interval?
    Every wal_receiver_status_interval or sleeptime, whichever is smaller.
    Approach-2 :
    Provide a variable wal_send_status_interval such that, if it is 0, the
    current behavior prevails, and if it is non-zero, a KeepAlive message is
    sent after at most that much time.
    The modified code of WALSendLoop will be as follows:
    <snip>
    Which approach do you think is better, or do you have another idea for handling this?
    I think #2 is better because it's more intuitive to a user.
    I shall update the patch as per Approach-2 and upload it.

    With Regards,
    Amit Kapila.
  • Amit kapila at Sep 21, 2012 at 11:18 am

    On Tuesday, September 18, 2012 6:02 PM Fujii Masao wrote: On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila wrote:
    I think #2 is better because it's more intuitive to a user.
    Please find a patch attached for implementation of Approach-2.


    With Regards,
    Amit Kapila.
