Sql-server – Why are the database backups occasionally failing?


As part of our backup routine on our main production server, I take a nightly full backup of all of the databases. These backups are restored on several dev servers, where the data is obfuscated. They are used by developers to work and test on, and by our staging/QA environment. The instances are mostly SQL Server 2016 (a couple of the dev environments are 2019 as we prepare to make that switch). The code that does these backups and restores has not changed in months (close to a year). The instances are spread over three different servers.

About two weeks ago I started getting errors saying that the lower environments were failing to restore some of the backup files. I have scoured the logs and found nothing. The restore is done in an agent job in a try-catch block: the try is “restore”, the catch is “send email to DBA”. That email is the only trace I could find. However, when I looked at those databases, they indeed had not been restored.
This is the error message I get when trying to restore them manually:
Msg 3013, Level 16, State 1, Line 21
RESTORE DATABASE is terminating abnormally.

When the backup of a database was bad, the restore failed on every instance it was meant to be restored to, so it is obviously a bad backup. I scoured the log files on the production server and found nothing.

I could take a new backup of the offending databases and restore them in the lower environments with no issue.

Only about 10 databases (A B C D E F G H I J) are affected. Each day, between 1 and 4 of these databases fail to back up properly. How many and which ones is random.
Day 1 could be A G J, day 2 E, day 3 B C G H, day 4 B E C I, etc. There are about 50 databases on the server. The ten that take turns failing are ten of the smallest of the 50. Two of the ten are relatively new (maybe a year old), while the others are ancient.

As part of the backup process, I have now changed it so that after each database backs up I run a RESTORE VERIFYONLY WITH CHECKSUM on the backup. If the verify fails, it sends me an email and redoes the backup. This has taken care of the restore problem and everything is working smoothly. However, I am still getting the emails saying that the verify of a backup has failed. I only get the email once per database, and I would get one for each failed verify, so the second attempt is always succeeding.
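The backup, verify, and retry routine described above could be sketched roughly as follows. This is not the poster's actual job code. It is a minimal Python sketch in which `run_sql` and `notify` are hypothetical callables (one executes T-SQL and raises on error, the other emails the DBA), and the database name and path are assumed not to contain `]` or `'`.

```python
from typing import Callable


def backup_sql(db: str, path: str) -> str:
    """Build a full-backup statement that writes checksums into the backup."""
    return f"BACKUP DATABASE [{db}] TO DISK = N'{path}' WITH CHECKSUM, INIT;"


def verify_sql(path: str) -> str:
    """Build a verify statement that re-checks the backup without restoring it."""
    return f"RESTORE VERIFYONLY FROM DISK = N'{path}' WITH CHECKSUM;"


def backup_with_verify(
    db: str,
    path: str,
    run_sql: Callable[[str], None],   # hypothetical: executes T-SQL, raises on error
    notify: Callable[[str], None],    # hypothetical: sends an email to the DBA
    max_attempts: int = 2,
) -> int:
    """Back up, verify, and redo the backup if the verify fails.

    Returns the number of the attempt that produced a verified backup.
    """
    for attempt in range(1, max_attempts + 1):
        run_sql(backup_sql(db, path))
        try:
            run_sql(verify_sql(path))
            return attempt
        except Exception as exc:
            # One notification per failed verify, as in the routine described.
            notify(f"Verify failed for {db} (attempt {attempt}): {exc}")
    raise RuntimeError(f"{db}: no verifiable backup after {max_attempts} attempts")
```

Logging the attempt number that finally succeeded, per database and per night, keeps a record of how often the first backup is bad rather than letting the retry hide it.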

While checking the backups has solved the restore problem, I have not solved the underlying problem: bad backups are still being taken. I am worried that something is wrong with our prod server in general or with those databases in particular. I am afraid that my workaround will only hide the problem and that we have a train wreck coming down the line.

I run DBCC CHECKDB on the entire server daily, and I have run it on those specific databases and results say that all is normal. There is plenty of space on all of the lower level instances so space contention is not an issue.

Best Answer

I have not solved the underlying problem: bad backups are still being taken. I am worried that something is wrong with our prod server in general or with those databases in particular. I am afraid that my workaround will only hide the problem and that we have a train wreck coming down the line.

Yes. You should be afraid, and proceed on the assumption that your storage solution is not reliable. So you should immediately ensure that you are landing good backups on a reliable storage device, and proceed with troubleshooting/support of your production database's main storage and/or backup storage solution.

Related Solutions

Sql-server – FULL recovery and differential backups

There are multiple ways to get to the intended target recovery point with your setup.

A few things:

You cannot eliminate log backups, most importantly because they are what allows the transaction log to be reused.
It's not possible to know when an issue will occur, so having multiple ways to get to the RPO is useful and sometimes necessary.
Differentials aren't a substitute for log backups. They lower the RTO because they are much faster to apply than a long run of log backups, but ultimately they aren't worth much on their own (although spanning broken LSN points, etc., is useful).
How are you going to know which ones you need and don't need ahead of time?

I would not stop taking log backups but you could put into place an aging mechanism in your backup software or process to eliminate files that are no longer needed.

Using your example, you have many paths but the basis is as such:

Always take a tail-of-the-log backup first.

Restore the full backup, the latest differential, then all log backups taken after that differential.
Restore the full backup, then all transaction log backups.

This is more flexible, as a corrupt backup file (say a log file) could be spanned by a differential, or a corrupt differential could be spanned by the log backups, etc.
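The two restore paths above can be sketched as a small planning function. This is only an illustration in Python. The `Backup` record and its `taken` ordering key are hypothetical, and real tooling would order backups by LSN (for example from the backup history in msdb) rather than by a simple sequence number.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str   # "full", "diff", or "log"
    taken: int  # hypothetical ordering key (sequence number or timestamp)
    name: str


def restore_paths(backups: List[Backup]) -> List[List[str]]:
    """Return the alternative restore sequences based on the latest full backup."""
    ordered = sorted(backups, key=lambda b: b.taken)
    fulls = [b for b in ordered if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup to start from")
    full = fulls[-1]
    diffs = [b for b in ordered if b.kind == "diff" and b.taken > full.taken]
    logs = [b for b in ordered if b.kind == "log" and b.taken > full.taken]

    paths: List[List[str]] = []
    if diffs:
        diff = diffs[-1]
        # Path 1: full, latest differential, then only the logs taken after it.
        later_logs = [b.name for b in logs if b.taken > diff.taken]
        paths.append([full.name, diff.name] + later_logs)
    # Path 2: full, then every log backup taken since the full.
    paths.append([full.name] + [b.name for b in logs])
    return paths
```

With a full, one log, a differential, and two more logs (the last being the tail-log backup), the function returns both sequences, so a damaged file on one path can be avoided by taking the other.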

I would not change your current strategy, but I would make sure it meets your RPO and RTO requirements.

Sql-server – Restore Verify Only w/ Transaction Log Backups

But this brings up a question about how to proceed when RESTORE VERIFY ONLY fails. Let's say a trx log backup completes successfully, but RESTORE VERIFY ONLY fails

RESTORE VERIFYONLY (as per BOL) verifies the backup but does not restore it, and checks that the backup set is complete and the entire backup is readable. However, RESTORE VERIFYONLY does not attempt to verify the structure of the data contained in the backup volumes. So even if VERIFYONLY reports a backup set as clean, it is not 100% guaranteed that the backup set is consistent. Only a successful restore can guarantee that a backup set is valid.

I have not seen a scenario where VERIFYONLY fails but the restore succeeds (unless you get it to succeed using CONTINUE_AFTER_ERROR). Otherwise it can only happen when VERIFYONLY failed because there is not enough space, since VERIFYONLY also checks that there is enough space to restore the database.

Checks performed by RESTORE VERIFYONLY include:

• That the backup set is complete and all volumes are readable.

• Some header fields of database pages, such as the page ID (as if it were about to write the data).

• Checksum (if present on the media).

• Checking for sufficient space on destination devices.

Please show me the message you got when VERIFYONLY failed.

After Verifyonly fails

I would run CHECKDB on the database to check for any inconsistency. One point to note is that CHECKDB does not run a consistency check on the log file. The log is only used when recovery is run on the snapshot during CHECKDB, so we 'can' say that it checks the log file as well, but not as completely as it checks the data files. If CHECKDB comes out clean, I would say the database is consistent.

The backup spans LSNs X to Y. I can't simply take another trx log backup to "fix" the situation: the subsequent trx log backup would span a different range of LSNs (Y+1 to Z). At this point, the log chain is effectively broken, right?

Yes, you are correct: taking more log backups does not fix corruption in an earlier log backup. However, taking multiple log backups does not break the log chain. Log backups are linked internally by LSN, and the chain holds as long as the log backups are restored in sequence.
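To illustrate how log backups are linked by LSN, here is a minimal Python sketch that looks for gaps in a set of usable log backups. It assumes each backup is described by a hypothetical `(name, first_lsn, last_lsn)` tuple, and that each log backup starts at the LSN where the previous one ended. The LSN values in the example are made up.

```python
from typing import List, Tuple

# (name, first_lsn, last_lsn); a hypothetical shape chosen for this sketch
LogBackup = Tuple[str, int, int]


def find_chain_gaps(logs: List[LogBackup]) -> List[Tuple[str, str]]:
    """Return pairs of adjacent log backups whose LSN ranges do not join up."""
    ordered = sorted(logs, key=lambda b: b[1])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        # Assumption: the next log backup begins where the previous one ended.
        if nxt[1] != prev[2]:
            gaps.append((prev[0], nxt[0]))
    return gaps


all_logs = [("log6", 100, 200), ("log7", 200, 300), ("log8", 300, 400)]
print(find_chain_gaps(all_logs))                                   # no gaps
print(find_chain_gaps([b for b in all_logs if b[0] != "log7"]))    # gap between log6 and log8
```

Taking log8 after log7 leaves the chain itself intact. The gap only appears in what can be restored once log7 is excluded as unreadable.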

My first thought is to immediately take a DIFFERENTIAL backup. What would you do?

It won't solve the corruption issue in any way. Differential backups are a means to lower the RTO and speed up recovery of the database, that's it.

EDIT AFTER USER UPDATED HIS QUESTION:

the best I could do in a DR situation is to restore the FULL backup and six subsequent trx log backups

Yes you are correct.

If I had taken a DIFFERENTIAL backup soon after #7 failed, in a DR situation I could:

Restore the FULL backup
Restore the DIFFERENTIAL backup
Restore transaction log backups 8, 9, and 10

Yes, you can restore the differential backup. Now that you have clarified the question with details, I have updated the answer. You can restore the differential backup and then restore log backups 8, 9, and 10, as they were taken after the differential backup.
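The scenario with ten log backups and a bad #7 can be worked through with a short Python sketch. The function name and its parameters are hypothetical. Logs are numbered 1 to `log_count` after the full backup, and `diff_after_log=7` means a readable differential was taken after log 7 and before log 8.

```python
from typing import List, Optional, Set


def restore_steps(
    log_count: int,
    bad_logs: Set[int],
    diff_after_log: Optional[int] = None,
) -> List[str]:
    """List the restore steps, stopping at the first unreadable log backup."""
    steps = ["FULL"]
    start = 1
    if diff_after_log is not None:
        # The differential replaces every log backup taken before it.
        steps.append("DIFF")
        start = diff_after_log + 1
    for n in range(start, log_count + 1):
        if n in bad_logs:
            break  # the restore sequence cannot continue past a bad log
        steps.append(f"LOG {n}")
    return steps


print(restore_steps(10, {7}))
print(restore_steps(10, {7}, diff_after_log=7))
```

Without the differential, the sequence stops after log 6. With it, the sequence is FULL, DIFF, then logs 8, 9, and 10, which matches the two outcomes discussed above.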
