Failover and DNS Propagation Delays – What to Expect

dns failover replication

When a DBMS fails over to a server in a different, unrelated data center, the IP address behind the database's subdomain name has to be changed in that name's DNS server. Because of DNS propagation delays, it could be days before all the DNS servers of the world have the IP address of the new master, and during that time some clients would still be trying to reach the old master. What can be done to deal with this, other than changing the client software to fetch the master's IP address from somewhere else? Or is that the only option? Thanks.

Best Answer

In SQL Server, the default behavior of an Availability Group (AG) listener is to register ALL of its IPs, one per subnet, with DNS. While DNS has all the IPs attached to the listener's A record, only the IP in the subnet of the current primary will be online. The other IPs will still be returned by DNS for the network name, but no host will actually be answering on them.

With a default configuration, multi-subnet AGs require that the clients connecting to them include MultiSubnetFailover=true as a connection string attribute. This attribute tells the driver to expect DNS to return multiple IP addresses for the listener name, and to try all of them in parallel to find the one that is actually online. Clients that do not specify this attribute will still get multiple IPs but will not know how to handle them properly. Most drivers will pick one of the returned IPs at random (or maybe just seemingly at random) and try to connect to that, which can result in random (or seemingly random) connection failures whenever the driver picks the wrong IP.
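For what it's worth, here is roughly what that looks like when building a connection string in Python. This is only a sketch: the listener name, the database name, and the helper build_conn_str are made up for illustration, and I'm assuming the Microsoft ODBC driver, where the keyword is spelled MultiSubnetFailover=Yes (the .NET SqlClient spelling is MultiSubnetFailover=True). Check your own driver's documentation for the exact keyword.

```python
def build_conn_str(listener, database, multi_subnet_failover=True):
    """Build an ODBC-style connection string for an AG listener.

    All names passed in are placeholders; the driver version below is an
    assumption, so substitute whatever is installed on your client.
    """
    parts = {
        "Driver": "{ODBC Driver 18 for SQL Server}",  # assumed driver version
        "Server": f"tcp:{listener},1433",             # listener name, not a node name
        "Database": database,
        "Trusted_Connection": "yes",
        # The attribute discussed above: tells the driver to expect several IPs.
        "MultiSubnetFailover": "Yes" if multi_subnet_failover else "No",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items()) + ";"


# Hypothetical listener and database names, for illustration only.
conn_str = build_conn_str("ag-listener.example.com", "SalesDb")
```

The resulting string could then be handed to something like pyodbc.connect(conn_str), assuming pyodbc and the driver are installed.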

When failover happens, the DNS records do not change. Instead, the IP that is online and responding to requests changes. The former primary's IP goes offline and stops responding, and a different IP comes online in the new subnet and starts handling traffic. This is why the client-side MultiSubnetFailover=true connection string attribute is necessary: the client's driver needs to know how to find the one IP that is live.
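To make the idea concrete, here is a minimal Python sketch of "try every IP DNS gave you and keep the one that answers". It is not how any real SQL Server driver is implemented; the function names resolve_all, try_connect, and first_reachable are my own, and it only checks that a TCP connection can be opened.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def resolve_all(hostname, port):
    """Return every IPv4 (ip, port) pair that DNS holds for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4] for info in infos})


def try_connect(address, timeout):
    """Return address if a TCP connection to it succeeds, otherwise None."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return address
    except OSError:
        return None


def first_reachable(addresses, timeout=1.0):
    """Attempt all candidate addresses in parallel; return the first that answers.

    The thread pool still waits for the slower attempts when it exits, so the
    call takes at most roughly `timeout` seconds.
    """
    if not addresses:
        return None
    with ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        futures = [pool.submit(try_connect, addr, timeout) for addr in addresses]
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                return result
    return None
```

With a hypothetical listener name, usage would look like first_reachable(resolve_all("ag-listener.example.com", 1433)). After a failover the DNS answer is the same list, but a different address in it is the one that responds.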

You can turn off this behavior (the RegisterAllProvidersIP setting on the listener's cluster resource) so that DNS only registers a single IP address at a time, or even create two listeners so that one uses the default behavior and the other registers a single IP at a time. When using the single-IP listener, the DNS record does change on failover, so clients cannot reconnect until the cached record's TTL has expired and the update has reached their resolvers. In this situation, you would want to set the TTL (HostRecordTTL on the same cluster resource) low enough that the resulting downtime is acceptable. Adjusting the TTL is ultimately the lever you can pull to keep DNS propagation from taking days.
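As a rough way to reason about "sufficiently low", here is a back-of-the-envelope sketch in Python. It assumes the simplest possible model: a client that cached the record just before failover keeps the stale IP for one full TTL, and every resolver honors the TTL. Real resolvers and application-level caches may behave worse, and both function names are my own.

```python
def worst_case_stale_seconds(ttl_seconds, failover_seconds=0):
    """Upper bound on how long a client may keep using the old IP.

    Assumes the DNS record is updated when failover completes and that a
    client cached the old record immediately beforehand.
    """
    return failover_seconds + ttl_seconds


def max_ttl_for_budget(downtime_budget_seconds, failover_seconds):
    """Largest TTL that still fits inside a downtime budget under the same model."""
    return max(0, downtime_budget_seconds - failover_seconds)


# Example inputs are illustrative: a 1200 second TTL (which, as far as I know,
# is the cluster default for the host record) and a 30 second failover.
stale = worst_case_stale_seconds(1200, failover_seconds=30)
ttl_limit = max_ttl_for_budget(downtime_budget_seconds=120, failover_seconds=30)
```

Keep in mind that a lower TTL also means more lookups against your DNS servers, so there is a trade-off in how far you push it.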
