Sql-server – Database with replication and mirroring stuck in LOG_BACKUP after log file shrink

mirroringreplicationsql serversql-server-2008-r2transaction-log

Need an advice from SQL Server Experts. I'm just a developer, not a DBA, so pardon my ignorance.

My environment consists of 2 equal instances (Node1 and Node2) of SQL Server 2008 R2 x64 (10.50.4339.0). I have 2 databases: DB1 and DB2 with Full recovery model. Both databases participate in replication between each other within single SQL instance and also both of them get mirrored from Node1 to Node2. These are not production databases and I did take full DB backups before doing any changes.

However, I have ended up in trouble: the log file is not growing for both DB1 and DB2 at Principal and the execution of:

select log_reuse_wait_desc, name, * from sys.databases where name='DB1' or name='DB2'

shows that both DB1 and DB2 have `log_reuse_wait_desc='LOG_BACKUP' even though I did run log backup manually.

Here is the sequence of events which caused the current state:

  1. I needed to update DB1 and DB2 with year-worth set of changes which involved running many structure and data change scripts. Some of the data scripts were quite heavy, truncating tables and containing 70K+ INSERT calls. So, I've started applying changes at Principal (Node1) one by one in little chunks.
  2. Node1 already had limited disk space available at partition with DB1 and DB2 data and log files.
  3. When applying changes several times I've received log file full... message. So I've stopped and shrunk the log file of both DB1 and DB2 to 0Mb to free up some disk space to be able to finish the data/structure update task.
  4. The log file of both DBs did get shrunk to minimal allowed however after I have continued with data changes I've got the following message:

Msg 9002, Level 17, State 2, Line 1 The transaction log for database
'DB2' is full. To find out why space in the log cannot be reused, see
the log_reuse_wait_desc column in sys.databases

Sometime during running of data/structure change scripts the free disk space at relevant partition dropped to a few KBs and SQL Server sent out warning emails.

I did try to research how to get log_reuse_wait_desc column to show NOTHING. However, the suggested solutions prompted putting DB1 and DB2 into simple recovery mode or detach/re-attach them, something I cannot do because of the replication and mirroring.

I have also followed the advice from Paul S. Randal and have run log backup manually:

 BACKUP LOG DB2 TO DISK = 'z:\DB2.TRN'

which completed successfully, however the log_reuse_wait_desc still shows LOG_BACKUP and the log file is not growing.

DBCC LOGINFO Output:

DB1: 421 records like:

dbcc loginfo for DB1

DB2: 381 records like:

dbcc loginfo for DB2

Initial loginfo records for DB2: https://i.stack.imgur.com/TTsVu.png

Note, the Field column for DB1 contains values 2 and 4, while for DB2 Field column contains only value 2. The file size and offset fields looks scary: 54GB+? I may not have so much free space. Currently there is 26GB of free space on the drive and it's not decreasing. So seems SQL Server has hung.

Is there any way to fix the current DB1 and DB2 without disabling replication/mirroring followed by a restore from backup?

Any help is greatly appreciated.

Best Answer

I wonder if your operations generated some gaps in the transaction log which may be preventing the virtual log file from looping back around to the front of the physical log file. Conceptually this is what should be happening within your database when the end of the physical file is reached:

enter image description here

If DBCC LOGINFO returns records indicating the VLFs at the beginning of the physical log file have a status of 2, these VLFs need to be cleared, which in Full Recovery mode, means a TLog backup must be taken. I believe this situation can also cause the log_reuse_wait_desc column to show LOG BACKUP as well. To fix this, you must force the logical header back toward the beginning of the physical file to clear out these status 2 VLFs (and hopefully get your log_reuse_wait_desc column back to showing NOTHING). To do this, perform the following steps (which may need to done on both DB1 and DB2 databases):

  1. Perform a Log Backup (e.g. BACKUP LOG [DBNAME]...) against the database
  2. Perform a DBCC SHRINKFILE operation on the transaction log file specifying 1 MB as the target size
  3. Perform another Log Backup against the database
  4. Perform another DBCC SHRINKFILE operation on the transaction log file specifying 1 MB as the target size
  5. Manually increase the size of the log file, incrementally per Kimberly Tripp's article, to grow the TLog back to an appropriate size.

So what's happening here?

  1. The first Log Backup simply generates a backup of the completed transactions that have been flushed to disk and deletes inactive virtual log files from the logical transaction log, freeing space for reuse up to the logical log header.
  2. The first DBCC Shrinkfile operation shrinks the physical TLOG down to where the current logical log header is located, freeing as much space as possible.
  3. The second Log Backup operation is now able to loop back around within the logical log and backup/clear any residual inactive VLFs located before the logical log header. These were probably never flushed in the past for whatever reason, but should be cleared at this point. This step is critical to fixing your issue and sometimes this backup can take a bit of time depending on how much data needs to be written down to disk.
  4. The second DBCC SHRINKFILE operation moves the logical log header to the front of the physical log file, eliminates excessive VLFs that are likely there now, and reduces the TLOG file back down to a very small size, in this case 1MB. It is imperative to manually grow the file in step 5 so you don't run into VLF issues again down the road
  5. Manually increasing the size of the tlog will allow you to control the size and number of VLFs within your physical log file. This is an opportune time to configure this properly, so take advantage.

One final note on the process. If your tlog is highly active, you may need to perform this a few times or during a window of reduced activity. High activity may result in another errant auto-growth event occurring before you execute the second log backup. This basically turns what you thought was step 3 back into step 1. Generally Steps 1 and 2 go by quickly, and step 3 takes the longest to complete. Make sure to follow-through with steps 4 and 5.