We recently separated out our tempdb files to a new SSD and have started seeing:
5348 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [T:\tempdb\tempdb4.ndf].
We have multiple occurrences of this error, and we did not see it when tempdb was on its original RAID 5 home. I followed a SQLIO tutorial, and the results suggest the SSD should be much faster than the previous RAID 5 disks for 8 KB random reads/writes. So why are we seeing these errors?
Also, by way of more proof that not all is well, the batch file we run overnight (which is when these errors occur) is taking 7 hours. It took 6.25 hours on the old disks.
The disks sit in a directly attached array: RAID 5 for data, RAID 10 for logs, and a spare slot that we used for the SSD. The RAID 5 and SSD volumes are formatted with a 64 KB allocation unit size. The log volume is incorrectly set to 4 KB (I know – will fix when I get a chance).
These are the results of SQLIO:
T drive (SSD):
8 KB random write: 31,847.48 IOs/sec, 248.8 MB/sec
8 KB random read: 76,391.66 IOs/sec, 596.8 MB/sec
S drive (RAID 5):
8 KB random write: 2,601.3 IOs/sec, 20.32 MB/sec
8 KB random read: 3,138.45 IOs/sec, 24.51 MB/sec
For 64 KB sequential reads/writes, the two drives were around the same.
Tempdb is split into four 1.5 GB files (the same before and after the move).
SQL Server 2012 is patched to SP3.
Do you have any idea what could be causing all these I/O errors reported by SQL Server?
Is it possibly an array or HBA driver issue? Does a single disk added to a spare slot on a directly attached array need careful configuration in terms of cache?
Best Answer
I would highly recommend that you test your new T:\ drive using CrystalDiskMark. Check out Brent Ozar's guide:
How to Test Your Storage with CrystalDiskMark
Compare the results from the T:\ drive with those from the S:\ (RAID 5) drive and the RAID 10 log drive.
If the SSD is slower than those other two devices, and nothing else has changed* in your setup, it's likely there is a problem with the disk itself, the driver being used, the controller for the array the disk sits in, etc.
*Things that might have changed since you moved tempdb: the controller/HBA driver or firmware, and the cache settings for the new slot the SSD occupies.
Next Steps
Since it seems like the disk is reasonably fast (per the benchmarks you shared), I think it would be a good idea to log the contents of
sys.dm_io_virtual_file_stats
before and after the nightly batch job you mentioned. This will tell you how much I/O is happening on tempdb during that process. That's important, because maybe there really is more I/O than the disk can handle. So here's what you do: run this query right before your nightly batch job is scheduled to run, and again right after it finishes:
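A snapshot query along these lines should work (a sketch, not the answerer's exact query; the columns come from the documented `sys.dm_io_virtual_file_stats` DMV, and the join to `sys.master_files` just resolves friendly file names):

```sql
-- Snapshot the cumulative I/O counters for the tempdb data/log files.
-- These counters accumulate from the last SQL Server restart, which is
-- why two snapshots are needed to isolate the batch-job window.
SELECT  GETDATE()                AS capture_time,
        DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_bytes_read,
        vfs.io_stall_read_ms,
        vfs.num_of_writes,
        vfs.num_of_bytes_written,
        vfs.io_stall_write_ms,
        vfs.io_stall
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id;
```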
Save the results somewhere (like Excel or something - probably not in tempdb :P)
We can then take the difference of the two snapshots and determine how many bytes were read / written during the job. You can also use those numbers to calculate overall latency during that period.
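Assuming the two snapshots were saved into tables called `io_before` and `io_after` (hypothetical names, same columns as the snapshot query), the diff could look something like this:

```sql
-- Work done during the batch job = after-snapshot minus before-snapshot.
-- io_before / io_after are hypothetical tables, one row per tempdb file.
SELECT  a.physical_name,
        a.num_of_reads         - b.num_of_reads         AS reads,
        a.num_of_bytes_read    - b.num_of_bytes_read    AS bytes_read,
        a.num_of_writes        - b.num_of_writes        AS writes,
        a.num_of_bytes_written - b.num_of_bytes_written AS bytes_written,
        -- average stall per I/O during the window, in milliseconds
        (a.io_stall - b.io_stall) * 1.0
          / NULLIF((a.num_of_reads + a.num_of_writes)
                 - (b.num_of_reads + b.num_of_writes), 0) AS avg_latency_ms
FROM io_after  AS a
JOIN io_before AS b
      ON b.physical_name = a.physical_name;
```

If `avg_latency_ms` for the tempdb files is routinely in the hundreds or thousands during the job, the 15-second warnings are consistent with the disk (or its controller path) not keeping up.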
Note: a more granular approach would be to log the results of that query to a table every 5 minutes (or less if you want)
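One way to sketch that granular approach is a SQL Agent job on a 5-minute schedule that appends each snapshot to a logging table (`io_stats_log` is a hypothetical table with the same columns as the snapshot query, plus `capture_time`):

```sql
-- Scheduled every 5 minutes via a SQL Agent job step.
-- io_stats_log is a hypothetical logging table; each run appends one
-- timestamped row per tempdb file, so latency can be charted over time.
INSERT INTO io_stats_log
        (capture_time, database_name, physical_name,
         num_of_reads, num_of_bytes_read, io_stall_read_ms,
         num_of_writes, num_of_bytes_written, io_stall_write_ms, io_stall)
SELECT  GETDATE(),
        DB_NAME(vfs.database_id),
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_bytes_read,
        vfs.io_stall_read_ms,
        vfs.num_of_writes,
        vfs.num_of_bytes_written,
        vfs.io_stall_write_ms,
        vfs.io_stall
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id;
```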