SQL Server Error Log – Time-out Waiting for Buffer Latch on Non-existent Database

dbcc-checkdblatchsnapshotsql serversql-server-2008-r2

Background:
I've recently started at a new shop with lots of databases in various states. I arrived this morning (having just been added to the SQL DBA mailing group) to find an email about the CHECKDB job failing last night on one of the servers (one I've been warned has been problematic).

I won't dwell on the details of this here, it appears to be down to a number of time-outs waiting for buffer latches – by itself it's hard to tell why this is, as there should've been (and appears to have been) very little system usage last night when this was run.

There are ongoing problems with I/O requests longer than 15 seconds in the logs, and the vmware guys say that there have been communication problems with storage (on-going network issues), so this goes someway to explain the problem I think.

Important Note: I reran CHECKDB this morning and it completely quickly, finding 0 errors and as far as I'm aware nothing has happened since the failure last night, so (correct me if I'm wrong), I'm pretty confident there is no database problems, and last night was an anomaly, possibly triggered by network or storage problems (I'll know more about this when the network guys get back to me… if they do).

The actual point of this:
In the logs, right at the point where last night's CHECKDB job failed, there is a time-out waiting for a buffer latch – the strange thing to me is that the database id of the page it's supposedly trying to latch is 11, and there is no database with that id on the instance – ids only go up to 10.

This is followed up by the CHECKDB line with 1 errors found, 0 repaired, pointing to an Internal Database snapshot having a split point (see below for error log)

Specific Questions:

  1. Why would there be a reference to a database ID that doesn't exist (do Internal database snapshots get assigned a database ID at some point and that's what it was referring to)?
  2. Is there anyway of finding out what the database snapshot in question is (is it something to do with the way CHECKDB works, or is it something else)?
  3. Is there a way of finding what happened to the snapshot in question (as it obviously wasn't a problem when I ran CHECKDB again this morning).

Error log:

01/04/2017 23:50 spid101
Unknown DBCC CHECKDB (zenworks_UAL) WITH no_infomsgs executed by ARTSLOCAL\svc_zcmsql found 1 errors and repaired 0 errors. Elapsed time: 0 hours 28 minutes 51 seconds. Internal database snapshot has split point LSN = 00adf51f:00019bf4:0001 and first LSN = 00adf51f:00019bf3:0001.

01/04/2017 23:50 spid101
Unknown Unable to read and latch page (1:2067184) with latch type SH. Latch failed.

01/04/2017 23:50 spid101
Unknown Error: 8966 Severity: 16 State: 2.

01/04/2017 23:50 spid101
Unknown A time-out occurred while waiting for buffer latch — type 2 bp 00000004FFFF4A80 page 1:2067184 stat 0xc2040d database id: 11 allocation unit Id: 0/281474976710656 task 0x000000015BA6A088 : 0 waittime 300 flags 0x100000001a owning task 0x000000015BA6A088. Not continuing to wait.

enter image description here

Note:
As far as a know, no one else was working the database or running anything significant between the failure last night and me running CHECKDB this morning. The system was in use, but not significantly. Nothing is showing in the logs.

To confirm, there is no DB of ID 11, and I don't think there ever was (and as certain as I can be not between last night and this morning

Query:

SELECT db_name(11)

Results:

NULL

Query:

SELECT * FROM sys.databases

Results:

╔════════════════════╦═════════════╗
║        name        ║ database_id ║
╠════════════════════╬═════════════╣
║ master             ║           1 ║
║ tempdb             ║           2 ║
║ model              ║           3 ║
║ msdb               ║           4 ║
║ ReportServer       ║           5 ║
║ ReportServerTempDB ║           6 ║
║ z_xxxxxx_L         ║           7 ║
║ z_xxxxxx_p         ║           8 ║
║ S_xxxxxx_t         ║           9 ║
║ z_xxxxxx_t         ║          10 ║
╚════════════════════╩═════════════╝

Best Answer

This issue is a characteristic symptom of heavy I/O on the host. Perhaps there are multiple backups, DBCCs, and other "maintenance" activities happening on the SAN at that time.

Work with your VMware guy to ensure the SQL Server VMs are using the PVSCSI adapter for the LUNs present in the VM. VMware's SQL Server Best Practices documentation says this about SCSI adapters:

Utilize the VMware Paravirtualized SCSI (PVSCSI) Controller as the virtual SCSI Controller for data and log VMDKs.

The PVSCSI Controller is the optimal SCSI controller for an I/O-intensive application on vSphere. This controller has a queue depth of 64 (Per Device) and 254 (Per Controller) by default (double the size of an LSI Logic SAS controller). The PVSCSI controller’s per-device and per-controller queue depths can also be increased to 254 and 1024 respectively, providing even more increased I/O bandwidth for the virtualized workload. See Configuring disks to use VMware Paravirtual SCSI (PVSCSI) adapters (1010298).

Also, see how to increase queue depths in a VMware vSphere environment.

Note:
While increasing the default queue depth of a virtual SCSI controller can be beneficial to an SQL Server-based VM, the configuration can also introduce unintended adverse effects in overall performance if not done properly.

VMware highly recommends that customers consult and work with the appropriate storage vendor’s support personnel to evaluate the impact of such changes and obtain recommendations or other adjustments that may be required to support the increase in queue depth of a virtual SCSI controller.

Use multiple vSCSI adapters. Placing OS, data, and transaction logs onto a separate vSCSI adapter optimizes I/O by distributing load across multiple target devices and allowing for more queues on the operating system level. Spread the I/O load across all PVSCSI adapters to help optimize the I/O from the guest. In cases where there is a requirement for many data and transaction log disks, it will be beneficial to set the OS boot disk to use PVSCSI as well.

I would check the SQL Server default trace to see if DatabaseID 11 shows up.

This reads the default trace into a temporary table:

DECLARE @trcfilename VARCHAR(1000);

SELECT @trcfilename = path 
FROM sys.traces 
WHERE is_default = 1;

IF OBJECT_ID(N'tempdb..#trctemp', N'U') IS NOT NULL
BEGIN
    DROP TABLE #trctemp;
END

SELECT *
INTO #trctemp
FROM sys.fn_trace_gettable(@trcfilename, default) tt;

This will show any trace events where DatabaseID = 11.

SELECT tt.DatabaseID
    , tt.DatabaseName
    , tt.StartTime
    , tt.HostName
    , tt.LoginName
    , tt.ApplicationName
    , te.name
FROM #trctemp tt
    LEFT JOIN sys.trace_events te ON tt.EventClass = te.trace_event_id
WHERE tt.DatabaseID = 11 --missing database ID
ORDER BY tt.StartTime DESC;

DBCC CHECKDB (db_name) will create an internal snapshot of the database. You can see the snapshot files as an NTFS alternate data stream attached to the actual database files. So, if you run dir C:\some\path /r where C:\some\path is the location of your database files, you'll see the internal database snapshot files listed like this:

2018-03-29  03:06 PM       402,653,184 Test_DB.mdf
                           402,653,184 Test_DB.mdf:MSSQL_DBCC25:$DATA

In the sample above, you can see I have a database with a file named Test_DB.mdf. There is also an alternate data stream named Test_DB.mdf:MSSQL_DBCC25:$DATA. By correlating the existing database IDs in sys.databases, I am able to deduce that 25 in the DBCC25 part of the name is the database id for the snapshot.