Background:
I've recently started at a new shop with lots of databases in various states. I arrived this morning (having just been added to the SQL DBA mailing group) to find an email about the CHECKDB job failing last night on one of the servers (one I've been warned has been problematic).
I won't dwell on the details of this here, it appears to be down to a number of time-outs waiting for buffer latches – by itself it's hard to tell why this is, as there should've been (and appears to have been) very little system usage last night when this was run.
There are ongoing problems with I/O requests longer than 15 seconds in the logs, and the vmware guys say that there have been communication problems with storage (on-going network issues), so this goes someway to explain the problem I think.
Important Note: I reran CHECKDB this morning and it completely quickly, finding 0 errors and as far as I'm aware nothing has happened since the failure last night, so (correct me if I'm wrong), I'm pretty confident there is no database problems, and last night was an anomaly, possibly triggered by network or storage problems (I'll know more about this when the network guys get back to me… if they do).
The actual point of this:
In the logs, right at the point where last night's CHECKDB job failed, there is a time-out waiting for a buffer latch – the strange thing to me is that the database id of the page it's supposedly trying to latch is 11, and there is no database with that id on the instance – ids only go up to 10.
This is followed up by the CHECKDB line with 1 errors found, 0 repaired, pointing to an Internal Database snapshot having a split point (see below for error log)
Specific Questions:
- Why would there be a reference to a database ID that doesn't exist (do Internal database snapshots get assigned a database ID at some point and that's what it was referring to)?
- Is there anyway of finding out what the database snapshot in question is (is it something to do with the way CHECKDB works, or is it something else)?
- Is there a way of finding what happened to the snapshot in question (as it obviously wasn't a problem when I ran CHECKDB again this morning).
Error log:
01/04/2017 23:50 spid101
Unknown DBCC CHECKDB (zenworks_UAL) WITH no_infomsgs executed by ARTSLOCAL\svc_zcmsql found 1 errors and repaired 0 errors. Elapsed time: 0 hours 28 minutes 51 seconds. Internal database snapshot has split point LSN = 00adf51f:00019bf4:0001 and first LSN = 00adf51f:00019bf3:0001.
01/04/2017 23:50 spid101
Unknown Unable to read and latch page (1:2067184) with latch type SH. Latch failed.
01/04/2017 23:50 spid101
Unknown Error: 8966 Severity: 16 State: 2.
01/04/2017 23:50 spid101
Unknown A time-out occurred while waiting for buffer latch — type 2 bp 00000004FFFF4A80 page 1:2067184 stat 0xc2040d database id: 11 allocation unit Id: 0/281474976710656 task 0x000000015BA6A088 : 0 waittime 300 flags 0x100000001a owning task 0x000000015BA6A088. Not continuing to wait.
Note:
As far as a know, no one else was working the database or running anything significant between the failure last night and me running CHECKDB this morning. The system was in use, but not significantly. Nothing is showing in the logs.
To confirm, there is no DB of ID 11, and I don't think there ever was (and as certain as I can be not between last night and this morning
Query:
SELECT db_name(11)
Results:
NULL
Query:
SELECT * FROM sys.databases
Results:
╔════════════════════╦═════════════╗
║ name ║ database_id ║
╠════════════════════╬═════════════╣
║ master ║ 1 ║
║ tempdb ║ 2 ║
║ model ║ 3 ║
║ msdb ║ 4 ║
║ ReportServer ║ 5 ║
║ ReportServerTempDB ║ 6 ║
║ z_xxxxxx_L ║ 7 ║
║ z_xxxxxx_p ║ 8 ║
║ S_xxxxxx_t ║ 9 ║
║ z_xxxxxx_t ║ 10 ║
╚════════════════════╩═════════════╝
Best Answer
This issue is a characteristic symptom of heavy I/O on the host. Perhaps there are multiple backups, DBCCs, and other "maintenance" activities happening on the SAN at that time.
Work with your VMware guy to ensure the SQL Server VMs are using the PVSCSI adapter for the LUNs present in the VM. VMware's SQL Server Best Practices documentation says this about SCSI adapters:
Use multiple vSCSI adapters. Placing OS, data, and transaction logs onto a separate vSCSI adapter optimizes I/O by distributing load across multiple target devices and allowing for more queues on the operating system level. Spread the I/O load across all PVSCSI adapters to help optimize the I/O from the guest. In cases where there is a requirement for many data and transaction log disks, it will be beneficial to set the OS boot disk to use PVSCSI as well.
I would check the SQL Server default trace to see if DatabaseID 11 shows up.
This reads the default trace into a temporary table:
This will show any trace events where
DatabaseID = 11
.DBCC CHECKDB (db_name)
will create an internal snapshot of the database. You can see the snapshot files as an NTFS alternate data stream attached to the actual database files. So, if you rundir C:\some\path /r
whereC:\some\path
is the location of your database files, you'll see the internal database snapshot files listed like this:In the sample above, you can see I have a database with a file named
Test_DB.mdf
. There is also an alternate data stream namedTest_DB.mdf:MSSQL_DBCC25:$DATA
. By correlating the existing database IDs insys.databases
, I am able to deduce that25
in theDBCC25
part of the name is the database id for the snapshot.