Sql-server – Switching to RCSI

isolation-levelmetricssql server

The company I work for currently uses SQL Server databases (latest Enterprise version usually) for a product we develop.

I would describe it as an OLTP database that is somewhat equally write and read intensive with a lot of time critical apps. In addition to this, a lot of reporting and graphical data is displayed from information in this same OLTP database (separate problem) against a lot of the same tables that are being read from and written to at a frequent rate.

We commonly experience problems where blocking occurs and it usually ends up slowing down time critical apps or even causing issues because of deadlocks in these apps. The common solution to this problem often seems to be to throw nolock hints at the problematic queries. I honestly hate this solution and I've felt for a long time that this is the wrong way to try and tackle this problem and from everything I've read I come to the same conclusion.

I've been trying to convince my team for a while that RCSI is something that we could definitely benefit from especially given our type of database. They seem to think this is a big risk and often put it off because of the risk factor but we continue to run into performance problems where we just throw nolock hints at it.

How can I help prove that our database can greatly benefit from using RCSI?
Are there performance tests I can run based on an actual production database that we convert to RCSI in a test environment?

I'm looking for a good way to show concrete metrics to our team to finally convince them we should potentially move to this methodology.

Best Answer

As I'm sure you know, just because a lot of people use NOLOCK does not make it a good idea - after all, if you go back far enough, a lot of people thought slavery, pesticides, asbestos, leaded paint and gasoline, etc. were all great ideas, too.

NOLOCK has a performance benefit but not because it doesn't take locks - it's simply because it allows the reader to ignore locks taken by other readers or writers. A query with NOLOCK can still be blocked depending on who is doing what to the underlying tables - at the very least, Sch-S locks are still needed.

But the downsides are numerous, and usually ignored until they happen - people are blissfully happy that their queries are "fast" - even if they produce incorrect or inconsistent data, especially if they don't notice. With NOLOCK / READ UNCOMMITTED, you can:

read the same row twice (row you've read moves ahead of the allocation order scan)
skip a row altogether (row you haven't read yet moves behind the scan)
read an uncommitted row that might never exist
get errors due to too much movement during scan
read different columns in a row in different state
- even something as granular a single value in a half-committed state (excellent example from Paul White.)

What to do

You can avoid these risks. The default isolation level (READ COMMITTED) certainly isn't without data consistency risks either, it's just that NOLOCK adds a lot of them.

I see "snapshot" getting thrown about quite a bit - while they may sound similar, read committed snapshot isolation is quite different from snapshot isolation. Kendra Little has a thorough post here that is worth a read.

Someone said "no one ever got fired for using Snapshot Isolation." I can actually imagine scenarios where that very thing is quite possible. Snapshot Isolation has significantly different semantics and requires code changes to be properly implemented. More change = more risk.

Read Committed Snapshot Isolation can be implemented with little to no code change, and is usually the one that people mean when they suggest you switch from NOLOCK / READ UNCOMMITTED. However it is also possible to introduce behavior changes due to simply enabling RCSI at the database level (see item 3 in Kendra's post for an example).

While I personally think that RCSI is way better than NOLOCK, you need to keep in mind that performance is not guaranteed to be any better. RCSI works by creating versions of rows in tempdb, so that any given session has its own copy of the row if it needs it. If tempdb is a bottleneck on your system, this might not go so well. In addition, enabling read committed snapshot at the database level means that all data changes going forward will add 14 bytes per row.

This means that, in order to show that the switch is worth it, you should adequately equip tempdb to support additional load if it is currently sub-optimal, and you need to be prepared for your existing tables to require more space on disk and, ultimately, in memory. If tempdb is already saturated, or you are near your disk capacity already, or memory is already exhausted, RCSI might not help.

(There are other ways to improve performance in general, of course, if NOLOCK isn't worth the risk, and you don't have the overhead for RCSI. Compression, for example, can be a good choice to reduce I/O when you have a lot of CPU overhead. Columnstore indexes can be useful for particular workloads. And obviously granular index and query tuning can be beneficial.)

For the question at hand:

I would suggest that you focus on ensuring that your queries return correct results, without a noticeable performance hit - compared to the default isolation level. Comparing to NOLOCK is not really fair, unless you are absolutely comfortable with all the risks mentioned above. Kendra's post goes through some details on measuring the impact, but basically you want to do before-and-after measurements of the same workload with the same external pressure and concurrency:

query times for these queries that have NOLOCK today (sys.dm_exec_query_stats)
overall performance metrics - waits, I/O latency, tempdb usage, memory usage, even page life expectancy

And you would do this using whatever methodology you currently use to measure performance - if you don't have a methodology, I might suggest that a monitoring tool might be worthwhile, even if only a trial. (Disclaimer: I used to work for one.)

There isn't a magic "tell me if my workload is better with RCSI than without" - this will depend on what aspects of performance are important to you, which are already close to being bottlenecks, and which will demonstrate - in your specific case - an overall gain or loss (again, assuming everything else remains the same).

And it might be the case that you can convince them with qualitative arguments rather than (or in addition to) just "metrics."

Further reading (some links repeated from above):

Best Answer

Related Solutions

Sql-server – Is using CRUD stored procedures against a view with NOLOCK bad

SQL Server – Traditional Failover Clustering vs. Availability Groups

Related Question