SQL Server 2008 R2 Lockup for 10 minutes

sql server, sql-server-2008-r2

Last Saturday night, our site went down for about 10 minutes. Analysis of the logs showed the following errors on the principal SQL Server (in a mirrored pair). They appear only during the outage window:

  • 06/06/2015 23:14:41,spid[various],Unknown,Timeout occurred while waiting for latch: class 'ACCESS_METHODS_DATASET_PARENT' id [various] type 4 Task [various] : [various] waittime 300 flags 0x1a owning task [various]. Continuing to wait.
  • 06/06/2015 23:06:54,spid19s,Unknown,Time-out occurred while waiting for buffer latch type 3 for page (5:159076157) database ID 6.
  • 06/06/2015 23:06:54,spid19s,Unknown,Error: 845 Severity: 17 State: 1.
  • 06/06/2015 23:06:54,spid19s,Unknown,A time-out occurred while waiting for buffer latch -- type 3 bp 0000000518FA1200 page 5:159076157 stat 0xc0000b database id: 6 allocation unit Id: 72057793340899328 task 0x0000000008EDA748 : 0 waittime 300 flags 0x100000001a owning task 0x0000000004472BC8. Not continuing to wait.

There were over a hundred occurrences of the first error, mostly before but also after the others, each of which occurred only once. The errors started about 2 hours after we failed over, applied OS updates, and failed back to the original server. We've been running on these servers for about 2 years now and have never seen this issue before. The software calling into the servers was most recently updated on Thursday afternoon (about 55 hours before the outage).
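For anyone checking a similar "over a hundred of one error, one each of the others" pattern against an exported error log, here is a minimal Python sketch that tallies latch time-outs by kind. The sample lines are copied from the excerpt above. The names `classify` and `tally` and both regular expressions are illustrative assumptions, not part of any SQL Server tooling, and the patterns only cover the message shapes shown in this post.

```python
import re
from collections import Counter

# Sample lines copied from the error log excerpt in this post.
# A real run would read lines from an exported log file instead.
SAMPLE = [
    "06/06/2015 23:14:41,spid[various],Unknown,Timeout occurred while waiting for latch: "
    "class 'ACCESS_METHODS_DATASET_PARENT' id [various] type 4 Task [various] : [various] "
    "waittime 300 flags 0x1a owning task [various]. Continuing to wait.",
    "06/06/2015 23:06:54,spid19s,Unknown,Time-out occurred while waiting for buffer latch "
    "type 3 for page (5:159076157) database ID 6.",
    "06/06/2015 23:06:54,spid19s,Unknown,Error: 845 Severity: 17 State: 1.",
]

# Assumed patterns, written only against the messages quoted above.
LATCH_CLASS = re.compile(r"waiting for latch: class '([A-Z_]+)'")
BUFFER_LATCH = re.compile(
    r"buffer latch type (\d+) for page \((\d+):(\d+)\) database ID (\d+)"
)


def classify(line):
    """Return a short key for a latch time-out line, or None for other lines."""
    # Fields are date, spid, source, message; the message may itself contain commas.
    parts = line.split(",", 3)
    if len(parts) < 4:
        return None
    message = parts[3]
    match = LATCH_CLASS.search(message)
    if match:
        return "latch:" + match.group(1)
    match = BUFFER_LATCH.search(message)
    if match:
        _latch_type, file_id, page_number, database_id = match.groups()
        return "buffer_latch:db%s:page%s:%s" % (database_id, file_id, page_number)
    return None


def tally(lines):
    """Count latch time-out messages by key."""
    counts = Counter()
    for line in lines:
        key = classify(line)
        if key is not None:
            counts[key] += 1
    return counts


counts = tally(SAMPLE)
for key, count in counts.most_common():
    print(count, key)
```

Grouping by latch class and by page makes it easy to see whether the time-outs cluster on one page or are spread across many.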

I'm finding very little information here or through Google about this timeout. The closest thing I've found is the second answer to this question: https://stackoverflow.com/questions/3149310/time-out-occurred-while-waiting-for-buffer-latch-type-2-error-in-sql-server, which describes a type 4 latch error as tempdb-related and caused by a bug in SQL Server 2008. That bug was resolved in 2009, though, before 2008 R2 was released. The exact version reports as:

Microsoft SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64)    
Standard Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1) 

The servers both use mirrored Intel Enterprise SSDs for everything (dual RAID 10 arrays for the data), and none of the drives are reporting issues. The tempdb volume has 200GB free. There are 20 tempdb files (it's a BIG server). SQL traffic is down significantly from our peak over a year ago due to optimizations in the schema, stored procedures, and the software calling SQL, so it's probably not a load-related issue.

Is type 4 definitely related to tempdb, as many posts seem to indicate? (Database 6, referenced in the error message, is our main database, not tempdb.) What can I do to prevent this issue from happening again?
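As a quick sanity check on the identifiers in the buffer latch message, the sketch below decodes the page reference and database ID. It relies on three general SQL Server facts: page references are written as file_id:page_number, data pages are 8 KB, and tempdb always has database ID 2. The function name `describe_page` and the dictionary layout are hypothetical, chosen only for illustration.

```python
PAGE_SIZE_BYTES = 8192  # SQL Server data pages are 8 KB

# Fixed IDs of the system databases; any other ID belongs to a user database.
SYSTEM_DATABASE_IDS = {1: "master", 2: "tempdb", 3: "model", 4: "msdb"}


def describe_page(page_ref, database_id):
    """Split a 'file_id:page_number' reference and locate it within its data file."""
    file_id_text, page_number_text = page_ref.split(":")
    file_id = int(file_id_text)
    page_number = int(page_number_text)
    return {
        "database": SYSTEM_DATABASE_IDS.get(database_id, "user database"),
        "is_tempdb": database_id == 2,
        "file_id": file_id,
        "page_number": page_number,
        # Offset of the page from the start of the data file, in bytes.
        "byte_offset_in_file": page_number * PAGE_SIZE_BYTES,
    }


# Values taken from the error message quoted in this post.
info = describe_page("5:159076157", 6)
print(info)
```

For the values in the log, this confirms that the page in the buffer latch time-out sits in file 5 of a user database rather than in tempdb.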

Best Answer

We had another mysterious SQL crash yesterday afternoon (no logs for that one either, just a SQL error message). We failed over to the mirror and ran some burn-in tests on this system, and it turns out that the server had a memory module that must have recently gone bad. I believe this is the likely cause of these issues. Thanks to all who looked into this!