Sql-server – At which point in a log backup does SQL Server truncate the log file

sql server

I am investigating an issue where on a somewhat predictable schedule where our third party backup software begins a full backup of some of our databases during the workday. It starts a log backup, detects a broken chain in the LSN's and then converts it to a full backup.

I checked and there are no other log backups being run against the SQL servers we are experiencing this with.

What happens is that a log backup begins on schedule, but on some days the connection with the backup server is lost during the backup. The job sits in limbo and when it restarts with the same job id there is a broken chain detected and it begins a full backup of affected databases.

Scouring the logs it I can see the backup failures and the backup restarting with the same LSN which is no longer there.

So my question is, when there is a log backup with the truncate option does SQL Server truncate the log as it is backed up, or at the successful completion of the job? In my case it seems like a range of LSN's is being backed up, marked as backed up and when the job fails it looks for those same LSN's again when the job restarts.

Best Answer

Technically, the virtual log files (VLFs) are only attempted to be marked as inactive (can be reused) when they have been successfully backed up and are no longer needed by internal (and sometimes external) processes - such as replication or availability groups.

What happens is that a log backup begins on schedule, but on some days the connection with the backup server is lost during the backup.

This shouldn't cause any issues with SQL Server. SQL Server will see that the connection was broken and kill the session plus whatever was executing. This means that backup wasn't successful so the next log backup should start at the same place because it hasn't yet finished successfully.

The job sits in limbo and when it restarts with the same job id there is a broken chain detected and it begins a full backup of affected databases.

Sounds like the application thinks it ran a backup successfully even though it didn't and "detects" there is a problem... but there isn't. It really sounds like the application logic in the backup program is either flawed or running into a config issue... or it might be their standard logic (by design).

Now if it takes a full backup and walks away... that's not going to affect the log reuse.

But on a certain day of the week, at a certain time of day half of the jobs hang because the backup server goes down for a few minutes and when they restart again they begin a full backup on some databases. And from what I can tell it looks for the LSN, runs a backup, job fails, restarts looking for the same LSN.

Sounds like you have other infrastructure issues that may be contributing. Assuming they are not, though, it sounds like the backup application is confused. If the job fails to backup the log successfully it should start at the same place because it hasn't yet been successfully backed up. If the backup application chooses to see that as an issue and take a full backup to reset itself (the backup application) internally, that's on the application vendor to fix/decide/working as intended/whatever after you tell them about it.

However, since this only happens when the backup application server goes down... you may want to also have a stern talking to your infrastructure team and get that issue situated - or at the very least, if it's downtime because of patching or something, that all jobs be held until the patching (or whatever) is completed.