Why is SQL AG sync commit mode classified as no data-loss HA solution even when the session timeout can make the mode async?



  • Assume the SQL AG is in sync commit mode. If the secondary becomes unresponsive or has network issues, then upon the session timeout expiry the AG will automatically get converted into async mode.

    So, why is SQL AG sync commit mode classified as a no data-loss HA solution even when the session timeout can make the mode async?



  • upon the session timeout expiry the AG will automatically get converted into async mode

    It's been a while since I've worked with AlwaysOn Availability Groups, so I'm not sure that statement is entirely accurate. The secondary replica in your example will, however, leave the Healthy state and enter a Disconnected state, thereby allowing the primary replica to continue to transact without waiting to synchronize to it (similar to how an asynchronous replica works).

    I think it's still fair to call synchronous AlwaysOn Availability Groups a valid HA/DR solution with minimal chance of data loss (no solution in the world offers 100% guarantee of 0 data loss) despite its session timeout functionality. The situation where a replica is unresponsive is an error situation you would want to be aware of, and therefore it's fair for it to be marked as Disconnected.

    It would be worse if everything carried on as normal, and a day's worth of transactions just continued to build up on the primary replica, waiting to be committed but unable to because the secondary is unresponsive. There's a lot of risk of data loss in such a situation, and we'd be none the wiser without any indication that the secondary is in a bad state. An even scarier situation would be if the primary became unresponsive. The secondaries should have a way to recognize that some sort of failure is occurring on the primary, and the session timeout is one way of doing so.

    The session timeout is 10 seconds by default, so that one can be alerted and respond to an unresponsive replica rather quickly. It can be adjusted to the needs of your use case, and increased if a longer timeout is necessary (for example, if brief delays between replicas are normal in your environment).
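
    To make the reasoning above concrete, here is a minimal toy model in Python. It is not SQL Server's implementation: the class `SyncCommitModel` and all of its attribute and method names are made up for illustration, and time is advanced by hand. It only sketches the behaviour described above: commits wait while a connected secondary is silent, the secondary is marked disconnected once the session timeout expires, and from then on the primary commits alone.

```python
class SyncCommitModel:
    """Toy model of one primary and one synchronous-commit secondary.

    Hypothetical names throughout; this is a sketch, not SQL Server code.
    """

    def __init__(self, session_timeout_s=10):
        self.session_timeout_s = session_timeout_s  # 10 s default, as quoted above
        self.secondary_responsive = True   # is the secondary answering at all?
        self.secondary_connected = True    # the primary's view of the secondary
        self.silent_for_s = 0              # seconds since the secondary last answered
        self.primary_log = []              # records committed on the primary
        self.secondary_log = []            # records hardened on the secondary

    def advance(self, seconds):
        """Move the clock forward and apply the session-timeout rule."""
        if self.secondary_responsive:
            self.silent_for_s = 0
            return
        self.silent_for_s += seconds
        if self.silent_for_s >= self.session_timeout_s:
            # Timeout expired: stop treating the secondary as connected.
            self.secondary_connected = False

    def commit(self, record):
        """Return True if the commit is acknowledged to the client."""
        if self.secondary_connected and not self.secondary_responsive:
            # Still waiting for the secondary to harden the record.
            return False
        self.primary_log.append(record)
        if self.secondary_connected:
            self.secondary_log.append(record)
        return True

    def failover_would_lose_data(self):
        """True if the secondary is missing records the primary committed."""
        return self.secondary_log != self.primary_log


ag = SyncCommitModel()
print(ag.commit("T1"))                # True: hardened on both replicas
ag.secondary_responsive = False       # secondary goes silent
ag.advance(5)
print(ag.commit("T2"))                # False: timeout not reached, commit waits
ag.advance(5)                         # 10 s of silence, timeout expires
print(ag.commit("T2"))                # True: committed on the primary only
print(ag.failover_would_lose_data())  # True: the secondary never received T2
```

    In this sketch the "no data loss" property holds exactly as long as the secondary stays connected. Once it is marked disconnected, the model reports that a failover to it would lose data, which is why that state is worth alerting on.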



