SQL Server Performance – Stuck REDO Thread in Availability Group

availability-groupsperformancesql serversql-server-2017

We are facing issues with REDO queue fluctuating as we have readable secondary configuration.

Based on my understanding in newer versions of SQL Server, redo threads have been made parallel. Current version is SQL Server 2017 running on VM with 64 cores.

I am seeing problems when select queries on the secondary run on the same cpu/scheduler where AG threads are running. I am assuming AG thread yields its scheduler and takes a back seat to let other queries run. And I believe this is the problem because queries coming in seem to be hitting same cpu_id where AG threads are doing redo work.

Example below: spid 91 is select query on cpu_id 6 and at same time spid 123 and 144 running threads for AG's as found from sys.dm_exec_requests dmv. REDO queue starts building up when 91 came up.

enter image description here

I am not aware if there is any process which makes your queries go to different cores rather than one where AGs thread might be running.

  1. Can we control this to make queries hit other schedulers and not one
    where AG working?

  2. I see lots of waits on PARALLEL_REDO_TRAN_TURN when queries running on secondary while AG threads are suspended. Can trace flag help as mentioned here

  3. There might be some many other schedulers available. I am not able to understand why certain process is stuck in slot of Numa. I guess newer versions have all soft numa enabled so that makes bucket of 8 soft numas. 8*8 schedulers for 64 available. Is disabling soft numa a good idea here?

Database isolation is RCSI but I don't think that matters here because internally it's changed to snapshot based on design of readable secondaries.

Best Answer

While this might not be the answer you are looking for as it requires a migration in the end, you could try running your workload on SQL Server 2019 to be able to use "SQL Server 2019 Intelligent Performance -Worker Migration" or also commonly called Worker Stealing.

Worker migration (AKA “worker stealing”) allows an idle SOS scheduler to migrate a worker from the runnable queue of another scheduler on the same NUMA node and immediately resume the task of the migrated worker. This enhancement provides more balanced CPU usage and reduces the amount of time long-running tasks spend in the runnable queue.

A long-running task that is enabled for worker migration is no longer bound to a fixed scheduler. Instead, it will frequently move across schedulers within the same NUMA node which naturally results in less loaded schedulers. Together with the existing load factor mechanism, worker migration provides SQL Server with an enriched solution for balanced CPU usage.

Source

Parallel Redo is eligible for worker migration as noted in the same source:

In SQL Server 2019, workers associated with availability group parallel redo tasks are enabled for worker migration to address a commonly observed scheduler contention issue among redo tasks on secondary replicas.

When migrating to SQL Server 2019 this feature is enabled by default.

Addition

I see lots of waits on PARALLEL_REDO_TRAN_TURN when queries running on secondary while AG threads are suspended. Can trace flag 3459 help?

That trace flag can provide benefit if the queries running on the secondary cannot be adapted ( for example long running queries on the secondary reduced in time). But there could be other reasons such as frequent page splits.

One way to see if running the redo thread serial is beneficial for your set up is looking into the PARALLEL_REDO_TRAN_TURN wait stats but you should always compare these to other information such as redo queue size increase/decrease on the secondary .

Be mindful about enabling this traceflag as redo performance can suffer from going serial instead of parallel. Reverting back to parallel redo also requires a restart with the traceflag disabled, test this beforehand.