SQL Server – why are the database backups occasionally failing?

backup sql-server

As part of our backup routine on our main production server, I take a nightly full backup of every database. These backups are restored on several dev servers, where the data is obfuscated, for developers to work and test on and for our staging/QA environment. Most of the instances are SQL Server 2016 (a couple of the dev environments are 2019 as we prepare to make that switch). The code that does these backups and restores has not changed in close to a year. The instances are spread over three different servers.

About two weeks ago I started getting errors saying that the lower environments could not restore some of the backup files. I have scoured the logs and found nothing. The restore runs in an Agent job inside a try-catch block: the try is "restore", the catch is "send email to DBA". That email is the only trace I could find. However, when I looked at those databases, they had indeed not been restored.
This is the error message I get when trying to restore them manually:
Msg 3013, Level 16, State 1, Line 21
RESTORE DATABASE is terminating abnormally.

When the backup of a database failed, it failed on every instance where it was going to be restored, so it is obviously a bad backup. I scoured the log files on the production server and found nothing.

I could take a new backup of the offending databases and restore them in the lower environments with no issue.

Only about 10 databases (A B C D E F G H I J) are affected. Each day between 1 and 4 of them produce a bad backup. Which ones and how many appears to be random.
Day 1 could be A G J, day 2 E, day 3 B C G H, day 4 B E C I, etc. There are about 50 databases on the server, and the ten that take turns failing are ten of the smallest. Two of the ten are relatively new (maybe a year old) while the others are ancient.

I have now changed the backup process so that after each database is backed up, I run RESTORE VERIFYONLY WITH CHECKSUM against the new backup file. If the verify fails, the process sends me an email and redoes the backup. This has taken care of the restore problem and everything is working smoothly. However, I am still getting emails saying that the verify of a backup has failed. I only get the email once per database, and I would get one for each failed verify, so the second attempt is always working.
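The backup, verify, and retry loop described above can be sketched roughly as follows. This is not the poster's actual job code. It is a minimal Python illustration that only builds the T-SQL text and leaves execution to a caller-supplied function. The names `backup_with_verify`, `execute`, `notify`, and `max_attempts` are hypothetical, and the `INIT` option (overwrite the file on each attempt) is an assumption about how the backup file is reused.

```python
from typing import Callable


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text: str) -> str:
    """Quote a Unicode string literal, doubling any single quote."""
    return "N'" + text.replace("'", "''") + "'"


def backup_sql(db: str, path: str) -> str:
    """T-SQL for a full backup that validates page checksums and writes a backup checksum."""
    # INIT overwrites existing backup sets in the file (assumption, see lead-in).
    return f"BACKUP DATABASE {quote_name(db)} TO DISK = {quote_literal(path)} WITH CHECKSUM, INIT;"


def verify_sql(path: str) -> str:
    """T-SQL that checks the backup file, including its checksums, without restoring it."""
    return f"RESTORE VERIFYONLY FROM DISK = {quote_literal(path)} WITH CHECKSUM;"


def backup_with_verify(
    db: str,
    path: str,
    execute: Callable[[str], None],  # hypothetical: runs one T-SQL batch, raises on error
    notify: Callable[[str], None],   # hypothetical: e.g. sends the email to the DBA
    max_attempts: int = 3,
) -> int:
    """Back up, verify, and retry on a failed verify. Returns the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        execute(backup_sql(db, path))
        try:
            execute(verify_sql(path))
            return attempt
        except Exception as exc:
            # One notification per failed verify, so the count of messages
            # shows how many attempts were bad.
            notify(f"{db}: verify failed on attempt {attempt}: {exc}")
    raise RuntimeError(f"{db}: no verifiable backup after {max_attempts} attempts")
```

In a real job, `execute` would wrap whatever database driver is in use. Writing each attempt to a different file instead of overwriting would keep the failed backup around for later inspection.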

Although checking the backups has solved the restore problem, I have not solved the underlying problem. The problem remains that bad backups are being taken. I am worried that something is wrong with our prod server in general or those databases in particular. I am afraid that my workaround will only hide the problem and that we have a train wreck coming down the line.

I run DBCC CHECKDB on the entire server daily, and I have run it on those specific databases and results say that all is normal. There is plenty of space on all of the lower level instances so space contention is not an issue.

Best Answer

I have not solved the problem. The problem remains that bad backups are being taken. I am worried that something is wrong with our prod server in general or those databases in particular. I am afraid that my workaround will only hide the problem and that we have a train wreck coming down the line.

Yes. You should be afraid, and proceed on the assumption that your storage solution is not reliable. So you should immediately ensure that you are landing good backups on a reliable storage device, and proceed with troubleshooting/support of your production database's main storage and/or backup storage solution.
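One way to start testing the storage assumption is to hash a backup file where it was written and again after copying it to a second device, then compare the two. The sketch below is only an illustration of that idea and is not part of the answer. The function names and the idea of copying to a separate destination are assumptions, and a matching hash does not prove the storage is healthy, since reads may be served from cache.

```python
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large backup files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches_source(source: str, destination: str) -> bool:
    """Copy a backup file to another location and report whether the bytes still match."""
    source_hash = sha256_of(source)
    shutil.copyfile(source, destination)
    return source_hash == sha256_of(destination)
```

Recording the hash right after each backup and comparing it again before the restore would show whether a file changed while it sat on the backup storage.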