What to check before attempting a read-scale replica log backup

availability-groups backup

We have a read-scale replica that we take our backups from. The network connection is not the best and sometimes we lose sync. It comes back in a few minutes, and for our purposes we can live with it. However, during that window the log backups sometimes fail. I read this post, which finally explained why that might be happening:

http://www.centinosystems.com/blog/sql/designing-for-offloaded-backups-in-alwayson-availability-groups/

Basically, it sounds as if the LSNs get too far apart and then the backups fail. So I think it would be a good idea to check those LSNs with sys.dm_hadr_database_replica_states and take appropriate action. My question is: which LSNs should I compare in order to delay a backup until it will not fail? There are 9 of them, and I have trouble going from the blog post to the DMV.

My goal is to delay the backup until it will not fail, so if this is not the best approach I am open to something better.

Thanks.

Best Answer

Log backups are important. If they are scheduled to run every 15 minutes, for example, they need to run every 15 minutes in order to meet your RPO. Although it's nice to have the secondary running log backups, it is not nice if the log backups fail and you then have a catastrophe at the primary site without recent log backups. In that case you will not meet your RTO.

I would recommend that you run the log backups on the primary if possible, and have them copied to another location at the primary site, and also to the secondary's site. This way you have the log backups available at the primary site, and this covers you in case you have a problem with the primary server, but the site survives. If the entire site goes down, you have the latest log backups possible, given the network instability, at the secondary site.
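As a rough illustration of the copy step only, here is a minimal Python sketch, assuming the log backup has already been written to a file on the primary. The function name `distribute_log_backup`, the file name, and the destination paths are hypothetical and not part of any SQL Server tooling. The point it shows is that a failed copy to one location (for example, the secondary site while the link is down) should not prevent the copy to the other.

```python
import shutil
from pathlib import Path


def distribute_log_backup(backup_file, destinations):
    """Copy one log backup file to every destination directory.

    Hypothetical helper: returns (succeeded, failed), two lists of
    destination paths. A failure at one destination does not stop
    the copies to the remaining ones.
    """
    source = Path(backup_file)
    succeeded, failed = [], []
    for dest in destinations:
        dest_dir = Path(dest)
        try:
            # Create the target directory if it does not exist yet.
            dest_dir.mkdir(parents=True, exist_ok=True)
            # copy2 copies the file contents and tries to keep timestamps.
            shutil.copy2(source, dest_dir / source.name)
            succeeded.append(str(dest_dir))
        except OSError:
            # Unreachable share, missing permissions, full disk, etc.
            failed.append(str(dest_dir))
    return succeeded, failed
```

Something like `distribute_log_backup("db_log.trn", [local_share, remote_share])` could run after each log backup job, with anything in the `failed` list retried later or raised as an alert so you know how far behind the secondary site's copies are.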

I would inform management that due to the network instability, you cannot guarantee the RPO at the secondary site, and have them approve whatever plan you implement.