SQL Server AlwaysOn – Issues Seeding Large Availability Database

I set up a 2-node SQL Server 2016 AlwaysOn Availability Group cluster (1 primary, 1 secondary). The underlying hardware is fairly powerful; 4 CPUs, 16 GB of RAM and local-attached SSD disk. Then I restored 10 databases on the primary. One of these databases was large, about 38 GB in size, while the others were relatively compact being in the 0.5 – 2 GB range in size. The restore happened quickly – then I added these 10 databases, all at the same time, as Availability Databases and chose Seeding as the method to initialize the replicas.

This is when the strange behavior happened. The first 6 databases were copied across as expected and appear as Synchronized in the list of databases on the secondary replica – including the 38 GB database. The CPU utilization on the secondary replica went from 0% to about 20-50%. Then the remaining 4 databases never appeared. So I ran this query:

SELECT
 r.session_id, r.status, r.command, r.wait_type
 , r.percent_complete, r.estimated_completion_time
FROM sys.dm_exec_requests r JOIN sys.dm_exec_sessions s
 ON r.session_id = s.session_id
WHERE r.session_id <> @@SPID
AND s.is_user_process = 0
AND r.command like 'VDI%'

And here is the output of the above query:

So it seems that the other databases are not getting seeded because they are sitting in the queue with status 'background'. So I decided to wait 16 hours. 16 hours later, the CPU usage on the secondary replica is still in the 20-50% range. There are no queries being issued on the cluster. The Primary Replica is in Synchronized state for all of the databases and is at 0% CPU utilization (and is running on the same type of hardware). However, the 4 databases are showing as Not Synchronized on the secondary replica. The cluster is using a Cloud Witness for Quorum mode.

So my questions are:

Why are the 4 small databases still not seeded onto the secondary replica?
Why does the secondary replica seem to be stuck doing some kind of large CPU or I/O bound task?
What should I do now? Should I keep waiting (it's already been 12 hours) to see if it finishes?

Any assistance here would be greatly appreciated.

Best Answer

I faced similar issue about month ago. I decided to connect new node to AG using direct seeding - just for a try (about 15 databases in the group and about 30 GB data in total) and it resulted in the situation that some of them were seeded successfully but five of them not (not synchronized state on the new replica).

I checked the sys.dm_hadr_automatic_seeding and sys.dm_hadr_physical_seeding_stats DMVs and none of them indicated that the seeding is still running. However, I noticed that there is large network traffic (30Mbit/s) between primary and new replica mirroring endpoints which should not be there. I dropped the databases on the secondary and joined them again using the well-proven "full backup approach". It helped and AG was fine now ...

However, the 30Mbit traffic was still there. When I stopped the sql service on the primary (without failover) it stopped and started again after start, then the same with the secondary. I kept it running whole night and it was still there. I had to restart the whole primary replica server to get rid off this traffic.

I seems that the VDI get stuck somehow.

Best Answer

Related Solutions

SQL Server – How AlwaysOn Availability Group Secondary Replica Catches Up After Downtime

SQL Server AlwaysOn – Fixing Not Synchronizing/In Recovery Mode After Upgrade

Related Question