SQL Server – Always On DDL Operations Explained

availability-groupsddlsql serversql-server-2016

Always On availability group with two nodes, synchronous commit.

Redo thread contention on the secondary replica regularly creates a very large redo queue. I have confirmed the wait types are similar to the following:

https://blogs.msdn.microsoft.com/alwaysonpro/2015/01/06/troubleshooting-redo-queue-build-up-data-latency-issues-on-alwayson-readable-secondary-replicas-using-the-wait_info-extended-event/

In my case, this is one extended events session(captured when the redo queue was large) output grouped and aggregated like in the above link:

enter image description here

My question:

How can I find out the exact source of the DDL operation which is causing the LCK_M_SCH_M wait?

Best Answer

As Brent Ozar mentioned in the comment section that this is not a simple task to find wait type (and what is causing the wait) between primary and secondary with correlation to time. I am answering your question about finding the source. I modified extended event trace definition given in the blog post you mentioned. Removed the where clause so you can capture all the sessions that is causing wait.

Added few more actions to capture more information. For example:

  • sqlserver.client_hostname
  • sqlserver.plan_handle
  • sqlserver.session_nt_username
  • sqlserver.sql_text

Here is the full definition.

CREATE event session [redo_wait_info] ON server ADD event sqlos.wait_info( action(package0.event_sequence,sqlos.scheduler_id,sqlserver.client_hostname,sqlserver.database_id,sqlserver.plan_handle,sqlserver.session_id,sqlserver.session_nt_username,sqlserver.sql_text) ) ADD target package0.event_file(SET filename=N'C:\Redo_Wait_Info.xel',
  max_file_size=(50), 
  max_rollover_files=(100)) WITH (max_memory=4096 kb, 
event_retention_mode=allow_multiple_event_loss, 
max_dispatch_latency=120 seconds, 
max_event_size=0 kb, 
memory_partition_mode=none, 
track_causality=OFF, 
startup_state=ON)
GO