Sql-server – 823 errors on tempdb

sql serversql-server-2005tempdb

Have a couple Fiber channel SAN-attached cluster servers which have been intermittently reporting errors like:

2012-05-09 16:31:54.13 spid74      Error: 823, Severity: 24, State: 2.
2012-05-09 16:31:54.13 spid74      The operating system returned error 38(Reached the end of the file.) to SQL Server during a read at offset 0x000000e11b0000 in file 'T:\tempdb.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2012-05-09 16:31:54.13 spid74      Error while allocating extent for a worktable.  Extent (1:461024) in TEMPDB may have been lost.

SQL Server re-start, which re-creates tempdb, has so far always fixed the issue.

Per a Microsoft tech recommendation, I've enabled trace flag 818 ( http://support.microsoft.com/kb/826433 ), but this flag has not yet yielded any fruit [addition ERRORLOG details]; I'm supposing what's happening must not be meeting the criteria for this flag's feature to log extra data.

Have a Microsoft case open, and sysadmins and storage vendor looking at things. Also working through http://support.microsoft.com/kb/2015755 , including running sqliosim at one site for ~2.5 hrs. Though there must be something going wrong in the I/O path, nothing has bubbled to the top to pinpoint the cause.

  • What have you experienced?
  • What are your thoughts about where else to look?

Best Answer

First, if you're getting corruption issues in TempDB, I'd start by making sure I've got DBCC CHECKDB reporting clean results across my user databases. I'd also restore those backups on another server and run DBCC CHECKDB there just to make sure the problems are only confined to TempDB.

Then I'd remove any filter drivers on the servers - antivirus, disk defragmenting, storage mirroring, etc - and try to run multiple load-intensive queries against TempDB to see if I can reproduce the issue. Do simple select-intos, dumping Cartesian joins of your user databases into TempDB.

Finally, if I can get an outage window, I'd try SQLIO - not SQLIOSIM - hammering the TempDB drives. SQLIOSIM is interesting if you want to reproduce a vague similarity of a benchmark, but SQLIO is better if you just want to repeatedly punch your storage in the junk. Plus, it's really good about bubbling failures up quickly.

You also asked:

how much more likely is drive corruption than, say, I/O path driver issues (storport, HBA driver or firmware, filter drivers).

In my experience, the most common corruption issues are connectivity problems, followed by storage processor/controller firmware bugs, followed by filter drivers (I see less only because they were so unreliable that most folks have uninstalled them), finally followed by drive corruption. After all, in modern storage, the RAID controllers are pretty good about catching problems with a single failing drive.