Sql-server – SQL Server – Log file not shrinking in AlwaysOn group

availability-groupsreplicationshrinksql servertransaction-log

We have three-node AO setup in SQL Server. Primary node, one secondary node in Sync mode and another secondary node in Async mode.
Main Database on Primary node, log file-size is 434GB, Used 99.91%. Log backups is happening every hour only on Primary node.

Ran following query on Primary and Secondary servers.

SELECT log_reuse_wait_desc
FROM sys.databases
WHERE name = 'Main DB Name'

Result:

node        log_reuse_wait_desc
------------------------------------------------------------
primary     AVAILABILITY_REPLICA
Secondary-1 LOG_BACKUP
Secondary-2 AVAILABILITY_REPLICA

While other DBs (much smaller than main DB) in AO groups log usage is very minimum (1 to 10%) and can be shrunk, main DB log usage continuously showing more than 99% and not shrinking.
I am guessing since log_reuse_wait_desc is showing "AVAILABILITY_REPLICA" on Primary Node, means not all the log is transferred to Secondary replica, and so its still full.

I checked Redo Queue Size in Secondary-1 Node showing 27GB, Secondary-2 is showing nothing. So most likely Secondary 1needs to get all logs from Primary. The problem is Redo Queue Size in Secondary-1 Node is increasing and not decreasing as what I through it should. Also log-file on Primary node is slowly increasing which is going to fill-up whole disk-space.

Now Secondary-1 Node showing LOG-BACKUP. How do I solve this issue? Should I take Log backup on Secondary-1 Node and check?

Best Answer

The issue is resolved. Apparently one of programs connecting to same large DB had been running for more than 24-hours, seemed it was in "hung" state, had several open connections to secondary replica-DB caused blocking. I guess that was causing not completing log replication from primary to secondary, because when I ended those programs, immediately blocking were gone and Redo Queue Size started decreasing instead of previously increasing. After about an hour or so, log file in primary node showed more than 80% available free space.

dbcc opentran and sp_who2 didn't give that much details to track this issue. What helped was

select * from sys.sysprocesses where blocked<>0
dbcc inputbuffer(spid)

The first statement was to find out what was causing blocks, and second one to kill. I am just sharing if it helps others in similar situation.

TLog backup is being taken for all user-Databases in every hour only on primary node as mentioned in my post. So far TLog physical file-size was small, wouldn't increase and most of the time had more than 90% available free space - which is what its supposed to be. So after the issue is resolved, only difference is file-size is very big which I understand because it kept on growing. Now I need to do a TLog shrink (we did before) and it should go back to initial size.

TLog back doesn't need to be taken on all replicas. Technically you can, but if you are taking on all replicas, in case of Database restoration you have to gather all TLog backups from all Nodes and restore them which may be complicated.