Sql-server – CheckDB Operation failing and returning error

We have a database that in the past couple of months has started failing when we run CheckDB on it during the maintenance window. When we re-run it after the failure it will complete fine each time.

A little information about the environment before I detail what is happening and what we have done so far.

We are running on SQL Server 2008 R2 SP3 Enterprise / Windows Server 2008 R2 Enterprise. This Instance contains our legacy DW (Relational Design with views making up the Dimensional Design). Each time CheckDB fails it fails on the same database and on the same filegroup. The database size is 7.5 TB and consists of 218 Filegroups. There is one filegroup that is the issue each time this fails and it is 641GB. The Filegroup sits on the I: Drive which is currently a 2.7 TB Drive.

Here are some of the errors that we are getting –

The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x0000002ece0000 in file 'I:\MSSQL\SQLDATA\FG_A.ndf:MSSQL_DBCC13'.
Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB).
This error can be caused by many factors; for more information, see SQL Server Books Online.

I:\MSSQL\SQLDATA\FG_A.ndf:MSSQL_DBCC13: Operating system error 665(The requested operation could not be completed due to a file system limitation) encountered.
Command: DBCC CHECKDB ([DatabaseA]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
Msg 5269, Level 16, State 1, Server ServerA, Line 1
Check terminated. The transient database snapshot for database 'DatabaseA' (database ID 6) has been marked suspect due to an IO operation failure. Refer to the SQL Server error log for details.

Sqlcmd: Error: Microsoft SQL Native Client : Unspecified error.
Internal database snapshot has split point LSN = 003f2ecc:00003498:0001 and first LSN = 003eb426:00002f58:0001.

First thing that I want to note is that anytime we have re-ran this manually after the failure (and there have been many) this never fails. It obviously takes a really long time, but it will complete with no errors.

Now the error to me indicates that this failed due to a file system limitation. When CheckDB runs it requires a consistent view of the database to check. To do this, it creates a hidden database snapshot. The message:

Internal database snapshot has split point LSN = 003f2ecc:00003498:0001 and first LSN = 003eb426:00002f58:0001"

..indicates the points in which the snapshot was created.

This error occurred at the exact same time as our failure so looks like we failed due to there not being enough space on the drive to create the temp snapshot. Looking at our monitoring we did not see an increase of space used during the failure, however, if it first checks to see how much space is available we may not ever see the space increase. Although when we run CheckDB manually, we are not seeing huge spikes in space utilization then either.

The filegroup sits on the I: Drive and we had been running a little low on space on the drive so we had it expanded to create a comfortable gap. Currently, there is 515GB of free space on a 2.7 TB drive. After we did that, we still had issues.

We looked at creating smaller drives and spreading the filegroups out more then what they are, but we have a TON of SQL Transactional Replication that goes to this database. So if we detach, move files and reattach I believe we would break replication so have not done that. However, not sure if this would solve the issue.

I also ran DBCC CheckDB WITH ESTIMATEONLY and it returned 1.5GB as the Estimated TempDB space needed for CHECKALLOC and 1 KB as the estimated TempDB space needed for CHECKTABLES. I believe that in older versions of SQL this is not very accurate though, but went ahead and checked.

We also use SAN Storage for this server. We started having the issue on our older storage device and we moved it to a new one that we are currently moving all SQL Servers and the issue went away for a couple of weeks and then started again. So we have had this on 2 completely separate storage devices and the issues exist on both. Additionally, we have several things running on these storage devices and not seeing this issue anywhere else.

After doing some research we did a hotfix for the product that seemed like it may help with the issue:

A heavily fragmented file in an NTFS volume may not grow beyond a certain size

We went through the process of getting this applied, however, it did not resolve the issue.

The errors seem to indicate that the failure is due to a file system limitation, however, we are unable to determine what that limitation is. OR if like some alerts, we are just getting a generic error that is not really indicative of the actual issue.

We checked with the Sysadmins and Storage Admins to make sure there was not a heavy operation occurring in the background that could be affecting this and so far nothing has been determined.

We have just adjusted the schedule on the weekend as much as we can to see if that makes a difference, but not sure it will. Wanted to check with you guys to see if there is somewhere else we should be looking or see if there is something that we may be missing. Thanks

I checked the history of the errors and found that the CheckDB fails at around the same time on the same day of the week with the same error. Going to check with the Admins again to see if anything is going on at this time.

Best Answer

I didn't want to duplicate @Nic's information, so this will be supplementary to his suggestions.

The operating system returned error 665 ...

Common issue with sparse files on NTFS. Just wait, this answer gets much better.

After doing some research we did a hotfix for the product that seemed like it may help with the issue... We went through the process of getting this applied, however, it did not resolve the issue.

That's what I would expect. This is because the hotfix adds support for the /L option (large attribute descriptor) for the NTFS filesystem. Here's the awesome part, you can only enable it by formatting the volume with this option. So, of course it didn't help as the volume isn't formatted with this option.

You could create a new volume, format with /L, copy the files over or you could attempt to use 3rd party applications to try and compact the files (which should compact the metadata as well) and see if that works. Additionally, a larger NTFS allocation unit size will result (for SQL Server) much less NTFS metadata fragmentation.

Best Answer

Related Solutions

Sql-server – The operating system returned error 665(The requested operation could not be completed due to a file system limitation) on SQL Server 2014

SQL Server – Understanding DBCC CHECKDB, Snapshots, and Standard Edition

Related Question