I found these articles helpful:
Further reading suggests that the benefit of using RAID1 for log files (assuming they are isolated from data files) is lost when more than one log file exists on that RAID1 array. This is due to the sequential nature of transaction log writes to disk. The sequential-write benefit is lost when multiple log files are accessed on the same RAID1 array, because the access pattern on a spinning disk becomes random. This would suggest that RAID10 is the better choice in a multiple-DB environment unless you have the disks to isolate each log file.
The stats below sold me on proposal 3: isolating tempdb on RAID1 by stealing 2 disks from the LOG array and moving the log array from RAID10 to RAID1. I'm basing much of this on RAID1's ability to maintain good WRITE speed.
TEMPDB is clearly under more stress than I had realised.
These table rankings ring true for snapshot values during normal operation (not just the accumulated totals), as we do have intensive out-of-hours routines.
TOTAL IO:
db.tempdb.mdf = 144,747,290,352
db.2.mdf = 100,482,243,080
db.2.ldf = 2,571,065,773
db.s.mdf = 1,702,508,040
db.s.ldf = 223,032,162
TOTAL READS:
db.2.mdf = 84,851,614,280
db.tempdb.mdf = 72,271,813,552
db.s.mdf = 1,691,504,864
db.2.ldf = 93,822,304
TOTAL WRITES:
db.tempdb.mdf = 72,475,476,800
db.2.mdf = 15,630,628,800
db.2.ldf = 2,477,243,469
db.tempdb.ldf = 222,946,079
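For reference, totals like these can be pulled straight from sys.dm_io_virtual_file_stats; a minimal sketch, assuming the figures above are the cumulative num_of_bytes_read / num_of_bytes_written counters since the last restart:

    -- Cumulative per-file I/O since the last SQL Server restart.
    SELECT
        DB_NAME(vfs.database_id)                          AS database_name,
        mf.physical_name,
        vfs.num_of_bytes_read                             AS total_reads,
        vfs.num_of_bytes_written                          AS total_writes,
        vfs.num_of_bytes_read + vfs.num_of_bytes_written  AS total_io
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
    ORDER BY total_io DESC;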
One possible concern may be the addition of the tempdb ldf and mdf on the same RAID1 array, but if this turns out to be a problem, tempdb.ldf can be moved to the log array.
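If it does become a problem, relocating tempdb.ldf is a one-line metadata change that takes effect at the next service restart; a minimal sketch, assuming the default logical name templog and a placeholder path on the log array:

    -- Point the tempdb log file at the log array; the new path is used after restart.
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = 'L:\SQLLogs\templog.ldf');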
These are helpful links that explain TEMPDB usage, what it does and how it may affect my apps:
Your question can basically be rephrased as 'How does the query memory grant work?'. A good read on the subject is Understanding SQL server memory grant.

Before a query is launched into execution it may require a memory grant for sorts, hashes and other memory-hungry operations. This memory grant is an estimate. Based on the current system state (number of requests running and pending, memory available, etc.) the system grants the query a memory grant up to the required amount. Once the memory is granted, the query starts execution (it may have to wait in the dreaded 'resource semaphore' queue before it gets the grant). At execution its memory grant is guaranteed by the system. This amount of memory can be shared with data pages (since they can always flush to disk) but never with other memory usage (i.e. it cannot be subject to 'steal'). So when the query starts asking for committed memory from its grant, the engine will deploy what you call 'strategy #1': data pages may be evicted (flushed if dirty) in order to give the query the memory it was promised.

Now if the estimate was correct and the grant was 100% of the requested memory, the query should not 'spill'. But if the estimate was incorrect (it boils down to cardinality estimates, and is therefore subject to stale stats), or if the query did not get the entire grant it asked for, the query will 'spill'. This is when tempdb comes into the picture and performance usually tanks.
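A quick way to see this on a live system is sys.dm_exec_query_memory_grants, which shows, per executing request, what was requested versus what was actually granted; a minimal sketch:

    -- Memory grants for currently executing requests: requested vs. granted,
    -- plus how long a request has been queued on the resource semaphore.
    SELECT
        session_id,
        requested_memory_kb,
        granted_memory_kb,
        used_memory_kb,
        max_used_memory_kb,
        wait_time_ms,     -- time spent waiting for the grant
        grant_time        -- NULL while the request is still queued
    FROM sys.dm_exec_query_memory_grants
    ORDER BY requested_memory_kb DESC;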
The only knob you have at your disposal that controls something in this process is the Resource Governor. Since the RG can be used to specify a MIN setting for a pool, it can be used to reserve memory for a certain workload so that it actually gets the memory grant it requests. Of course, only after you have done the proper investigation showing that reduced memory grants are the culprit, and after the impact on other workloads has been evaluated. And tested, of course.
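For illustration only, a sketch of such a Resource Governor setup with placeholder pool, group and login names (the classifier function must be created in master):

    -- Reserve memory for a specific workload so its grants are not starved.
    CREATE RESOURCE POOL ReportingPool
        WITH (MIN_MEMORY_PERCENT = 30, MAX_MEMORY_PERCENT = 50);
    GO
    CREATE WORKLOAD GROUP ReportingGroup
        USING ReportingPool;
    GO
    -- Classifier function routing a particular login into that group.
    CREATE FUNCTION dbo.fn_rg_classifier()
    RETURNS sysname WITH SCHEMABINDING
    AS
    BEGIN
        RETURN CASE WHEN SUSER_SNAME() = N'ReportingUser'
                    THEN N'ReportingGroup'
                    ELSE N'default' END;
    END;
    GO
    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_rg_classifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;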
Now let's go back to your original question. If your investigation is correct (a very big if) I would like to point out two problems:
- you run queries that require memory grants in production, for a web site. This is a big no-no. Memory grants are indicative of analytical queries, which have no place in serving HTTP requests.
- your queries are probably not even getting the memory grant they request. Again, even more of a no-no for a latency-critical workload such as a web site.
So what that tells me is that you have a fundamental design and architecture problem. Web sites are latency driven and should create an OLTP-like workload, with no memory grants and no memory pressure on queries. Not to mention no spills. Analytical queries should be run in offline jobs that store the pre-processed results for quick availability when HTTP requests need them.
Best Answer
Before you commit to a particular course of action, you first need to understand what is causing the poor performance.
Look at wait stats while the calculation process is underway. Check this article by Paul Randal for an excellent starting point on how to gather and analyze them.
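Paul Randal's article includes a far more complete script; as a minimal sketch, something along these lines surfaces the top waits (the excluded wait types are just a small, illustrative subset of the usual benign ones):

    -- Top waits since the last restart (or since the wait stats were cleared).
    SELECT TOP (10)
        wait_type,
        waiting_tasks_count,
        wait_time_ms / 1000.0                           AS wait_time_s,
        (wait_time_ms - signal_wait_time_ms) / 1000.0   AS resource_wait_s
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'WAITFOR',
                            N'LOGMGR_QUEUE', N'CHECKPOINT_QUEUE',
                            N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT',
                            N'SQLTRACE_BUFFER_FLUSH', N'DIRTY_PAGE_POLL')
    ORDER BY wait_time_ms DESC;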
Use the system DMV sys.dm_io_virtual_file_stats to understand the I/O requirements of your process. I use code which samples it twice and shows the activity over a 10 minute period (see the sketch below).

You'll want to get an excellent understanding of the performance profile of the process that you can use as a baseline. Once you do make changes, you can compare the performance profile after the changes to the baseline performance to understand what impact your change had.
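A minimal sketch of that sampling approach, using an arbitrary temp table name and a fixed 10-minute WAITFOR delay:

    -- Snapshot the counters, wait 10 minutes, then diff against the snapshot.
    SELECT database_id, file_id,
           num_of_bytes_read, num_of_bytes_written, io_stall
    INTO #io_baseline
    FROM sys.dm_io_virtual_file_stats(NULL, NULL);

    WAITFOR DELAY '00:10:00';

    SELECT DB_NAME(cur.database_id)                             AS database_name,
           cur.file_id,
           cur.num_of_bytes_read    - base.num_of_bytes_read    AS bytes_read_10min,
           cur.num_of_bytes_written - base.num_of_bytes_written AS bytes_written_10min,
           cur.io_stall             - base.io_stall             AS io_stall_ms_10min
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS cur
    JOIN #io_baseline AS base
        ON  base.database_id = cur.database_id
        AND base.file_id     = cur.file_id
    ORDER BY (cur.num_of_bytes_read + cur.num_of_bytes_written)
           - (base.num_of_bytes_read + base.num_of_bytes_written) DESC;

    DROP TABLE #io_baseline;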
Without understanding the performance profile you will be shooting in the dark by throwing hardware at the problem. However, this can be a good strategy if you know you have underperforming hardware, such as data files on old 7200 rpm spinning rust, or old Xeon processors. As a total off-the-cuff recommendation, you could try putting all the databases (tempdb and the data database) onto a nice fast SSD, preferably a PCIe SSD, and making sure you have enough processor speed and memory bandwidth.