SQL Server Performance – Stuck REDO Thread in Availability Group

Tags: availability-groups, performance, sql-server, sql-server-2017

We are seeing the redo queue fluctuate on a readable secondary replica in our availability group.

As I understand it, newer versions of SQL Server run redo in parallel. We are on SQL Server 2017, running on a VM with 64 cores.

I see problems when SELECT queries on the secondary run on the same CPU/scheduler as the AG redo threads. My assumption is that the AG thread yields its scheduler and waits behind the other queries. I believe this is the cause because incoming queries seem to land on the same cpu_id where the AG threads are doing redo work.

Example: spid 91 is a SELECT query on cpu_id 6, and at the same time spids 123 and 144 are running AG threads, as found in the sys.dm_exec_requests DMV. The redo queue started building up when spid 91 appeared.
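A minimal sketch of how that kind of overlap could be checked, assuming it is run on the readable secondary. The command names in the filter are the ones parallel redo normally reports in sys.dm_exec_requests; treat them as an assumption and adjust the filter if your instance shows different values.

```sql
-- Sketch: list redo workers and user requests with the scheduler/CPU they are on.
-- Assumption: redo shows up as 'PARALLEL REDO ...' tasks plus 'DB STARTUP'.
SELECT  r.session_id,
        r.command,
        r.status,
        r.wait_type,
        r.wait_time,
        r.scheduler_id,
        s.cpu_id,
        s.parent_node_id,
        s.current_tasks_count,
        s.runnable_tasks_count
FROM    sys.dm_exec_requests AS r
JOIN    sys.dm_os_schedulers AS s
        ON s.scheduler_id = r.scheduler_id
WHERE   r.command LIKE N'PARALLEL REDO%'
   OR   r.command = N'DB STARTUP'
   OR   EXISTS (SELECT 1
                FROM   sys.dm_exec_sessions AS es
                WHERE  es.session_id = r.session_id
                  AND  es.is_user_process = 1)
ORDER BY s.parent_node_id, s.cpu_id, r.session_id;
```

Rows that share a cpu_id and show a high runnable_tasks_count are the ones worth comparing against the moments when the redo queue grows.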

I am not aware of any mechanism that steers queries to cores other than the ones where the AG threads are running.

Can we control this so that queries go to other schedulers rather than the one where the AG thread is working?
I see lots of PARALLEL_REDO_TRAN_TURN waits when queries are running on the secondary while the AG threads are suspended. Can trace flag 3459 help?
There may be many other schedulers available, so I don't understand why a process stays stuck on one slot of a NUMA node. I believe newer versions enable automatic soft-NUMA by default, which here gives 8 soft-NUMA nodes of 8 schedulers each for the 64 cores. Is disabling soft-NUMA a good idea here?
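The soft-NUMA layout described above can be confirmed rather than guessed. This is a rough sketch, assuming SQL Server 2016 or later (where sys.dm_os_sys_info exposes softnuma_configuration_desc); it only reads DMVs and changes nothing.

```sql
-- Sketch: is soft-NUMA on, and how are the online schedulers spread across nodes?
SELECT  cpu_count,
        scheduler_count,
        softnuma_configuration_desc
FROM    sys.dm_os_sys_info;

SELECT  parent_node_id,
        COUNT(*)                  AS online_schedulers,
        SUM(current_tasks_count)  AS current_tasks,
        SUM(runnable_tasks_count) AS runnable_tasks
FROM    sys.dm_os_schedulers
WHERE   status = N'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;
```

If the second query shows one node with consistently more runnable tasks than the others, that supports the theory that work is piling up on a single node.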

The database uses RCSI, but I don't think that matters here, because readable secondaries internally map all queries to snapshot isolation by design.

Best Answer

While this might not be the answer you are looking for, as it ultimately requires a migration, you could try running your workload on SQL Server 2019 to take advantage of "SQL Server 2019 Intelligent Performance - Worker Migration", also commonly called worker stealing.

Worker migration (AKA “worker stealing”) allows an idle SOS scheduler to migrate a worker from the runnable queue of another scheduler on the same NUMA node and immediately resume the task of the migrated worker. This enhancement provides more balanced CPU usage and reduces the amount of time long-running tasks spend in the runnable queue.

A long-running task that is enabled for worker migration is no longer bound to a fixed scheduler. Instead, it will frequently move across schedulers within the same NUMA node which naturally results in less loaded schedulers. Together with the existing load factor mechanism, worker migration provides SQL Server with an enriched solution for balanced CPU usage.

Source

Parallel Redo is eligible for worker migration as noted in the same source:

In SQL Server 2019, workers associated with availability group parallel redo tasks are enabled for worker migration to address a commonly observed scheduler contention issue among redo tasks on secondary replicas.

When migrating to SQL Server 2019 this feature is enabled by default.

Addition

I see lots of PARALLEL_REDO_TRAN_TURN waits when queries are running on the secondary while the AG threads are suspended. Can trace flag 3459 help?

That trace flag can help if the queries running on the secondary cannot be adapted (for example, by shortening long-running queries on the secondary). But there could be other causes, such as frequent page splits.

One way to judge whether serial redo would benefit your setup is to look at the PARALLEL_REDO_TRAN_TURN wait stats, but you should always compare these with other information, such as whether the redo queue size on the secondary is increasing or decreasing.

Be careful about enabling this trace flag, as redo performance can suffer from going serial instead of parallel. Reverting to parallel redo also requires a restart with the trace flag disabled, so test this beforehand.
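As a sketch of that comparison, the two queries below could be polled together on the secondary. Note that sys.dm_os_wait_stats is cumulative since the last restart (or since the stats were cleared), so the useful figure is the difference between two samples, not the absolute value.

```sql
-- Sketch: cumulative parallel redo waits on this instance.
SELECT  wait_type,
        waiting_tasks_count,
        wait_time_ms,
        signal_wait_time_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type LIKE N'PARALLEL[_]REDO[_]%'
ORDER BY wait_time_ms DESC;

-- Sketch: redo backlog (KB) and redo rate (KB/sec) for the local database replicas.
SELECT  GETDATE()                   AS poll_time,
        DB_NAME(drs.database_id)    AS database_name,
        drs.synchronization_state_desc,
        drs.redo_queue_size,
        drs.redo_rate
FROM    sys.dm_hadr_database_replica_states AS drs
WHERE   drs.is_local = 1;
```

If PARALLEL_REDO_TRAN_TURN waits climb while redo_queue_size stays flat or shrinks, the waits alone are probably not a reason to switch to serial redo.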
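For reference, a hedged sketch of how the flag is typically turned on and checked. Try it on a test replica first; depending on the build, the switch to serial redo may only take effect once redo for the database restarts, which is an assumption to verify in your environment.

```sql
-- Sketch: enable trace flag 3459 globally (disables parallel redo) and confirm it.
DBCC TRACEON (3459, -1);
DBCC TRACESTATUS (3459, -1);

-- Turning the flag off is not enough to get parallel redo back:
-- as noted above, the instance must be restarted with the flag disabled.
-- DBCC TRACEOFF (3459, -1);
```

To keep the flag across restarts, it would have to be added as a -T3459 startup parameter instead of being set with DBCC TRACEON.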

Related Solutions

Sql-server – SQL Server – Anyone use SUMA, trace flag 8048, or trace flag 8015

This is an awesome post.

To answer your final question, I'd speculate that your answer is "yes".

That said, I probably would have pursued soft-NUMA before resorting to the trace flags. I think you are right about the NUMA node allocation, and that could be at the root of your problem. With soft-NUMA, you could scale out the requests according to your count of NUMA nodes (4?) and then assign each host, via IP address, to a specific NUMA node. In addition, I'd disable hyper-threading. Combined, these should reduce the issue, though at the cost of fewer schedulers.

On a separate note, I'd look at forced parameterization. The fact that your load is driving your CPU so high is very interesting, and it may be worth looking into.

Lastly, on multi-numa node systems, I typically have the output of the following queries dumping to a table every N seconds. Makes for some interesting analysis when workload changes or trace flags are implemented:

SELECT getdate() as poll_time, node_id, node_state_desc, memory_node_id, online_scheduler_count, active_worker_count, avg_load_balance, idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK) 
WHERE node_state_desc <> N'ONLINE DAC'

and

SELECT TOP 10 getdate() AS sample_poll, wait_type, COUNT(*) AS waiting_tasks
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] NOT IN
('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK','SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH','WAITFOR', 'BROKER_TASK_STOP',
'BROKER_RECEIVE_WAITFOR', 'OLEDB','CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT' )
GROUP BY wait_type
ORDER BY COUNT(*) DESC

Sql-server – Recovery time effects of Availability Groups

It is very clear that the synchronous secondary replica is not able to keep up with the load the primary is generating (even though both machines have the same configuration). The side effect is that the log on the primary keeps growing (even if we take log backups, the log can't be truncated).

In synchronous mirroring/AlwaysOn, the secondary must acknowledge that it has hardened (written to disk) the log before the primary's commit is allowed to continue. The primary is then free to truncate/reuse its own log as needed. If you cannot truncate the primary's log, it means the secondary is not synchronized. This points toward a problem with shipping the log to the secondary and writing it to disk. The two obvious bottlenecks would be the network speed and the secondary's log file storage. Both are easy to measure and diagnose, as they're straightforward USE (utilization, saturation, errors) OS-level metrics.

Note that I never mentioned recovery (secondary's redo). If the problem is indeed that the secondary is not able to synchronize then redo is not playing any real role here.
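Alongside the OS-level metrics, the distinction between log shipping and redo can be seen from the primary. This is a rough sketch, assuming it is run on the primary replica; it separates the log send queue (log not yet sent to the secondary) from the redo queue (log hardened but not yet replayed) and shows what is holding up log truncation.

```sql
-- Sketch: per-replica send and redo backlog (KB) as seen from the primary.
SELECT  ar.replica_server_name,
        DB_NAME(drs.database_id)    AS database_name,
        drs.synchronization_state_desc,
        drs.log_send_queue_size,
        drs.log_send_rate,
        drs.redo_queue_size,
        drs.redo_rate
FROM    sys.dm_hadr_database_replica_states AS drs
JOIN    sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id
ORDER BY database_name, ar.replica_server_name;

-- Sketch: why the log of each AG database cannot currently be truncated.
SELECT  name,
        log_reuse_wait_desc
FROM    sys.databases
WHERE   replica_id IS NOT NULL;
```

A growing log_send_queue_size points at the network or the secondary's log disk, which is the case described above; a growing redo_queue_size with an empty send queue would instead point at redo.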
