Sql-server – Recovery time effects of Availability Groups

availability-groupshigh-availabilitysql serversql-server-2012

We did different kinds of basic Tests and AlwaysOn passed in many tests. We finally did a heavy write test for AlwaysOn and it gave surprising results.

The actual Test details are here, the goal is to see if AlwaysOn availability group can accommodate a high write load.

  1. I have two VMs each running on 8 cores and 17GB of RAM allocated to SQL Server.

  2. We wrote a script to generate reasonably good write I/O (in 20 threads).

  3. Each thread basically inserts 24 MB of data into a table and deletes in an endless loop.

Within 15 minutes of the test run, the estimate for recovery time on automatic failover reached 12 minutes which is pretty bad. We tried a failover to confirm if it really takes 12 minutes, it took around 5 minutes which is still too high. Also if we continue the test for three hours recovery ETA is almost 3 hours and it is taking hours to recover on failover (clearly this should not have been the case if it was a cluster failover, because all transactions are committed transactions).

So couple of things..

It is very clear that the synchronous secondary replica is not able to keep-up with the load primary is generating (even though both machines are of same configuration). And the side effect of this is the log on primary will keep on growing (even we take log backups it can't truncate the log).

We know that the secondary uses one thread per every 4 CPU-cores for doing redo, which looks like a clear limitation. If the primary is running 100 threads to generate load, the secondary can't use that many threads anyway.

Additionally, the primary does all its transactions in-memory and leaves the actual data file writes to checkpoints. However, it seems that secondary has to read all transactions from the physical log drive and redo. The log pool on secondary which is supposed to make this process faster? But it is not doing a good job in this scenario.

Finally questions to AlwaysOn experts:

  1. Does anyone know how the redo process exactly happens?
    • is it cached ?
    • is the buffer pool involved at all ?
  2. Does the secondary use log pool to cache the log entries for redo?

  3. What is the size of log pool? Can it grow up to the max memory available?

  4. When redo happens, the redo thread reads the pages to buffer pool and maintains them as if it is a normal transaction?

  5. If secondary is not able to keep-up how come AlwaysOn articles say recovery time is a few seconds?

This makes the high availability part of Availability Groups questionable, since these recovery times are unsustainable.


[Edit by asker] Clarifications, since people seem to think this is answered, the transactions at primary are indeed acknowledged (i.e. the log is hardened), since the state of the secondary is always "synchronized". So it is not a problem with the hardening of the log. So it is the redo process that is taking forever upon failover. This means that AlwaysOn will always take longer to recover than without it for any load that generates log > redo threads capacity.

Best Answer

It is very clear that the synchronous secondary replica is not able to keep-up with the load primary is generating (even though both machines are of same configuration). And the side effect of this is the log on primary will keep on growing (even we take log backups it can't truncate the log)

In synchronous mirroring/alwayson the secondary must acknowledge that it hardened (written to disk) the log before the primary's commit is allowed to continue. The primary is then free to truncate/reuse it's own log as it needs. If you cannot truncate the primary it means the secondary is not synchronized. This would point toward a problem with the ability to ship the log to the secondary and write it to disk. The two obvious bottlenecks would be the network speed and the secondary's log file storage. Both are easy to measure and diagnose, as they're straight forward USE (utilization, saturation, errors) OS level metrics.

Note that I never mentioned recovery (secondary's redo). If the problem is indeed that the secondary is not able to synchronize then redo is not playing any real role here.