I found these articles helpful:
Further reading suggests that the benefit of using RAID1 for log files (assuming they are isolated from data files) is lost when more than one log file exists on that RAID1 array. This is due to the sequential nature of transaction log writes to disk. The sequential-write benefit is lost when multiple log files are accessed on the same RAID1 array, because the access pattern on a spinning disk becomes random. This would suggest that RAID10 is the better choice in a multiple-DB environment unless you have the disks to isolate each log file.
The stats below sold me on proposal 3: isolating tempdb on RAID1 by stealing 2 disks from the LOG array and moving the log array from RAID10 to RAID1. I'm basing much of this on RAID1's ability to maintain good WRITE speed.
TEMPDB is clearly under more stress than I had realised.
These table rankings ring true for snapshot values during normal operation (not just the accumulated totals), as we do have intensive out-of-hours routines.
TOTAL IO:
db.tempdb.mdf = 144,747,290,352
db.2.mdf = 100,482,243,080
db.2.ldf = 2,571,065,773
db.s.mdf = 1,702,508,040
db.s.ldf = 223,032,162
TOTAL READS:
db.2.mdf = 84,851,614,280
db.tempdb.mdf = 72,271,813,552
db.s.mdf = 1,691,504,864
db.2.ldf = 93,822,304
TOTAL WRITES:
db.tempdb.mdf = 72,475,476,800
db.2.mdf = 15,630,628,800
db.2.ldf = 2,477,243,469
db.tempdb.ldf = 222,946,079
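For reference, totals like these can be pulled straight from sys.dm_io_virtual_file_stats; a minimal sketch, assuming the figures above are the cumulative num_of_bytes_read / num_of_bytes_written counters since the last restart:

    -- Cumulative per-file I/O since the last SQL Server restart.
    SELECT
        DB_NAME(vfs.database_id)                          AS database_name,
        mf.physical_name,
        vfs.num_of_bytes_read                             AS total_reads,
        vfs.num_of_bytes_written                          AS total_writes,
        vfs.num_of_bytes_read + vfs.num_of_bytes_written  AS total_io
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
    ORDER BY total_io DESC;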
One possible concern may be the addition of the tempdb ldf and mdf on the same RAID1 array, but if this turns out to be a problem, tempdb.ldf can be moved to the log array.
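If it does become a problem, relocating tempdb.ldf is a one-line metadata change that takes effect at the next service restart; a minimal sketch, assuming the default logical name templog and a placeholder path on the log array:

    -- Point the tempdb log file at the log array; the new path is used after restart.
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = 'L:\SQLLogs\templog.ldf');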
These are helpful links that explain TEMPDB usage, what it does and how it may affect my apps:
Your question can basically be rephrased as 'How does the query memory grant work?'. A good read on the subject is Understanding SQL server memory grant.

Before a query is launched into execution it may require a memory grant for sorts, hashes and other memory-hungry operations. This memory grant is an estimate. Based on the current system state (number of requests running and pending, memory available, etc.) the system grants the query a memory grant up to the required amount. Once the memory is granted, the query starts execution (it may have to wait in the dreaded 'resource semaphore' queue before it gets the grant). At execution its memory grant is guaranteed by the system. This amount of memory can be shared with data pages (since they can always flush to disk) but never with other memory usage (i.e. it cannot be subject to 'steal'). So when the query starts asking for committed memory from its grant, the engine will deploy what you call 'strategy #1': data pages may be evicted (flushed if dirty) in order to give the query the memory it was promised.

Now if the estimate was correct and the grant was 100% of the requested memory, the query should not 'spill'. But if the estimate was incorrect (it boils down to cardinality estimates, and is therefore subject to stale stats), or if the query did not get the entire grant it asked for, the query will 'spill'. This is when tempdb comes into the picture and performance usually tanks.
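A quick way to see this on a live system is sys.dm_exec_query_memory_grants, which shows, per executing request, what was requested versus what was actually granted; a minimal sketch:

    -- Memory grants for currently executing requests: requested vs. granted,
    -- plus how long a request has been queued on the resource semaphore.
    SELECT
        session_id,
        requested_memory_kb,
        granted_memory_kb,
        used_memory_kb,
        max_used_memory_kb,
        wait_time_ms,     -- time spent waiting for the grant
        grant_time        -- NULL while the request is still queued
    FROM sys.dm_exec_query_memory_grants
    ORDER BY requested_memory_kb DESC;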
The only knob you have at your disposal that controls something in this process is the Resource Governor. Since the RG can be used to specify a MIN setting for a pool, it can be used to reserve memory for a certain workload so that it actually gets the memory grant it requests. Of course, only after you have done the proper investigation showing that reduced memory grants are the culprit, and after the impact on other workloads has been evaluated. And tested, of course.
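For illustration only, a sketch of such a Resource Governor setup with placeholder pool, group and login names (the classifier function must be created in master):

    -- Reserve memory for a specific workload so its grants are not starved.
    CREATE RESOURCE POOL ReportingPool
        WITH (MIN_MEMORY_PERCENT = 30, MAX_MEMORY_PERCENT = 50);
    GO
    CREATE WORKLOAD GROUP ReportingGroup
        USING ReportingPool;
    GO
    -- Classifier function routing a particular login into that group.
    CREATE FUNCTION dbo.fn_rg_classifier()
    RETURNS sysname WITH SCHEMABINDING
    AS
    BEGIN
        RETURN CASE WHEN SUSER_SNAME() = N'ReportingUser'
                    THEN N'ReportingGroup'
                    ELSE N'default' END;
    END;
    GO
    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_rg_classifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;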
Now let's go back to your original question. If your investigation is correct (a very big if) I would like to point out two problems:
- you run queries that require memory grants in production, for a web site. This is a big no-no. Memory grants are indicative of analytical queries, which have no place in serving HTTP requests.
- your queries are probably not even getting the memory grant they request. Again, even more of a no-no for a latency-critical workload such as a web site.
So what that tells me is that you have a fundamental design and architecture problem. Web sites are latency driven and should create an OLTP-like workload, with no memory grants and no memory pressure on queries. Not to mention no spills. Analytical queries should be run in offline jobs that store the pre-processed results for quick availability when HTTP requests need them.
Best Answer
Before you commit to a particular course of action, you first need to understand what is causing the poor performance.
Look at wait stats while the calculation process is underway. Check this article by Paul Randal for an excellent starting point on how to gather and analyze them.
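Paul Randal's article includes a far more complete script; as a minimal sketch, something along these lines surfaces the top waits (the excluded wait types are just a small, illustrative subset of the usual benign ones):

    -- Top waits since the last restart (or since the wait stats were cleared).
    SELECT TOP (10)
        wait_type,
        waiting_tasks_count,
        wait_time_ms / 1000.0                           AS wait_time_s,
        (wait_time_ms - signal_wait_time_ms) / 1000.0   AS resource_wait_s
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'WAITFOR',
                            N'LOGMGR_QUEUE', N'CHECKPOINT_QUEUE',
                            N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT',
                            N'SQLTRACE_BUFFER_FLUSH', N'DIRTY_PAGE_POLL')
    ORDER BY wait_time_ms DESC;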
Use the system DMV sys.dm_io_virtual_file_stats to understand the I/O requirements of your process. I use code which samples it twice and shows the activity over a 10 minute period (see the sketch below).

You'll want to get an excellent understanding of the performance profile of the process that you can use as a baseline. Once you do make changes, you can compare the performance profile after the changes to the baseline performance to understand what impact your change had.
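A minimal sketch of that sampling approach, using an arbitrary temp table name and a fixed 10-minute WAITFOR delay:

    -- Snapshot the counters, wait 10 minutes, then diff against the snapshot.
    SELECT database_id, file_id,
           num_of_bytes_read, num_of_bytes_written, io_stall
    INTO #io_baseline
    FROM sys.dm_io_virtual_file_stats(NULL, NULL);

    WAITFOR DELAY '00:10:00';

    SELECT DB_NAME(cur.database_id)                             AS database_name,
           cur.file_id,
           cur.num_of_bytes_read    - base.num_of_bytes_read    AS bytes_read_10min,
           cur.num_of_bytes_written - base.num_of_bytes_written AS bytes_written_10min,
           cur.io_stall             - base.io_stall             AS io_stall_ms_10min
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS cur
    JOIN #io_baseline AS base
        ON  base.database_id = cur.database_id
        AND base.file_id     = cur.file_id
    ORDER BY (cur.num_of_bytes_read + cur.num_of_bytes_written)
           - (base.num_of_bytes_read + base.num_of_bytes_written) DESC;

    DROP TABLE #io_baseline;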
Without understanding the performance profile you will be shooting in the dark by throwing hardware at the problem. However, this can be a good strategy if you know you have underperforming hardware, such as data files on old 7200 rpm spinning rust, or old Xeon processors. As a total off-the-cuff recommendation, you could try putting all the databases (tempdb and the data database) onto a nice fast SSD, preferably a PCIe SSD, and making sure you have enough processor speed and memory bandwidth.