Others have already pointed out the culprit: SQL Server accumulates updates in memory (in the buffer pool) and only flushes them to disk periodically, at checkpoints. The two options suggested (the -k startup parameter and the checkpoint/recovery interval) are complementary.
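As a sketch of how to turn those two knobs (exact values are illustrative, and -k is set as a startup parameter, not in T-SQL; check the documentation for your version):

```sql
-- -k100 as a SQL Server startup parameter (Configuration Manager ->
-- startup options) caps checkpoint I/O at roughly 100 MB/s.

-- Shorten the target recovery interval so checkpoints run more often,
-- each with a smaller dirty-page backlog to flush:
EXEC sp_configure 'recovery interval', 1;  -- minutes; 0 = default
RECONFIGURE;
```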
But I did not respond only to regurgitate the fine comments you received so far :)
What you're seeing is, unfortunately, very typical behavior for queued processing. Whether you use Service Broker queues or the tables-as-queues approach, the system is very prone to this kind of behavior. Queue-based processing is write heavy, even more write heavy than OLTP processing: both the enqueue and dequeue primitives are write operations, and there are almost no read operations. Simply put, queue processing generates the most writes (= the most dirty pages and the most log) of any workload, even OLTP (i.e. TPC-C-like workloads).
Very importantly, the writes of a queue workload follow an insert/delete pattern: every row inserted is very quickly deleted. This is important to distinguish from the append-only pattern of an insert-heavy (ETL) workload. You are basically feeding the ghost cleanup task a full meal, and you can easily outrun it. Think about what that means:
- enqueue is an insert, it will create a dirty page
- dequeue is a delete; it will dirty the same page again (it may be lucky and catch the page before a checkpoint, thus avoiding a double flush, but only if it's lucky)
- ghost cleanup will clean up the page, making it dirty again
Yes, it really means that you may end up writing a page three times to disk, in three different IO requests, for each message you process (worst case). It also means that the random IO of checkpoints will be truly random, as the same page's write location will be visited by those moving disk heads again between two checkpoints (many OLTP workloads, by comparison, tend to group their writes on a few 'hot spots'; not queues...).
So you have these three write points racing to mark the same page dirty again and again. And that is before we consider any page splits, which queue processing may also be prone to because of the insert key order. By comparison, 'typical' OLTP workloads have a much more balanced read/write ratio, and the OLTP writes are distributed across inserts/updates/deletes, often with updates ('status' changes) and inserts taking the lion's share. Queue processing writes are exclusively inserts/deletes, by definition a 50/50 split.
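A toy model may make the write amplification concrete. This is plain Python, not SQL Server internals: a page is flushed whenever a checkpoint finds it dirty, and in the worst case a checkpoint lands between every pair of the three write points described above.

```python
# Toy model of page flushes for one queued message (not real SQL Server
# internals): a checkpoint flushes the page only if it is dirty.

class Page:
    def __init__(self):
        self.dirty = False
        self.flushes = 0

    def mark_dirty(self):
        self.dirty = True

    def checkpoint(self):
        if self.dirty:
            self.flushes += 1
            self.dirty = False

# Worst case: a checkpoint lands between every pair of write points.
page = Page()
page.mark_dirty()   # enqueue: INSERT dirties the page
page.checkpoint()   # flush 1
page.mark_dirty()   # dequeue: DELETE ghosts the row, dirties the page again
page.checkpoint()   # flush 2
page.mark_dirty()   # ghost cleanup physically removes the row
page.checkpoint()   # flush 3
print(page.flushes)  # 3 disk writes for a single message

# Lucky case: the dequeue catches the page before any checkpoint,
# so insert + delete collapse into a single flush.
lucky = Page()
lucky.mark_dirty()  # enqueue
lucky.mark_dirty()  # dequeue within the same dirty interval
lucky.checkpoint()  # flush 1
lucky.mark_dirty()  # ghost cleanup
lucky.checkpoint()  # flush 2
print(lucky.flushes)  # 2 disk writes
```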
Some consequences follow:
- Checkpoint becomes a very hot issue (no longer a surprise for you)
- You'll see heavy fragmentation (the fragmentation per se won't matter much since you are not going to do range scans, but your IO efficiency suffers and ghost cleanup has more work to do, slowing it down even more)
- Your MDF storage random IO throughput is going to be your bottleneck
My recommendation comes in 3 letters: S, S and D. Move your MDF to storage that can handle fast random IO. SSD. Fusion-IO if you have the moneys. Unfortunately, this is one of those symptoms that cannot be resolved with more cheap RAM...
Edit:
As Mark points out, you have two logical disks backed by one physical disk. Perhaps you tried to follow the best practice of splitting the log onto D: and data onto C:, but alas, it is to no avail: C and D are the same disk. Between checkpoints you achieve sequential throughput, but as soon as a checkpoint starts, the disk heads start to move and your log throughput collapses, taking the entire app throughput down with it. Make sure you separate the DB log so that it is not affected by data IO (a separate physical disk).
I found the problem.
SQL Native Client and the .NET SQL Data Provider should seamlessly connect to a failover partner even if the primary is down; however, the default configuration of both providers is to try both TCP/IP and Named Pipes.
This configuration is fine, and works as intended, when both servers are still reachable but the primary has failed over to the backup: the network-layer connection succeeds, the application-level connection is promptly refused, and the client moves on to the mirror. If the primary machine is unresponsive, however, the client wastes the entire default 20-second timeout period just waiting for the primary server to respond, first over TCP/IP and then over Named Pipes (which by itself has a default 20- to 30-second timeout at the network level), so in this situation it returns an error without ever having tried the backup mirror.
The solution, according to this TechNet article, is to force the client, one way or another, to connect using only TCP/IP, which fails fast enough that the client has ample time to try both partners. You can manually configure the data providers to do this, or you can override their default settings case by case by specifying the protocol in the connection string; the parameter that forces TCP/IP is Network=dbmssocn;
On top of that, the SQL client takes a "three strikes" approach before it gives up and tries the failover mirror, so the Connection Timeout should be set long enough that it gets through this process and tries the failover partner at least once. A timeout of 30 seconds should be adequate.
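Put together, an ADO.NET connection string can look like the following (the server and database names are placeholders; Network=dbmssocn forces TCP/IP and Connection Timeout=30 leaves room for the retries):

```
Data Source=PrimaryServer;Failover Partner=MirrorServer;
Initial Catalog=MyDatabase;Integrated Security=True;
Network=dbmssocn;Connection Timeout=30;
```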
We tried this fix when our primary DB server was powered down again, and it does work. Now we're in the process of updating the config files of all our in-house apps (we have at least a half-dozen desktop apps and intranet sites that target this failover pair; it's a well-loved stack).
Best Answer
SQL Server does indeed write at commit; what it writes, however, is the transaction log. The data (dirty pages) can be written at a later time (for instance, at checkpoint). Because the log write is forced at commit, SQL Server can reconstruct a consistent view of the database at startup, in which everything that should be committed is persisted.
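A minimal write-ahead-logging sketch (toy Python, not SQL Server's actual implementation) shows why this is safe: the log record is made durable at commit, data pages are flushed lazily, and recovery replays the log over whatever the data file holds.

```python
# Toy write-ahead logging: commit durably appends to the log; data pages
# are flushed lazily; recovery rebuilds committed state from the log.

log = []            # durable log (forced at commit)
data_pages = {}     # durable data file (flushed lazily, may lag)
buffer_pool = {}    # in-memory dirty pages (lost on crash)

def commit(txn_id, key, value):
    buffer_pool[key] = value                    # dirty page in memory
    log.append(("commit", txn_id, key, value))  # forced log write at commit

def checkpoint():
    data_pages.update(buffer_pool)              # flush dirty pages to disk
    buffer_pool.clear()

def recover():
    # After a crash the buffer pool is gone; replay the log over the data file.
    state = dict(data_pages)
    for record_type, _txn, key, value in log:
        if record_type == "commit":
            state[key] = value
    return state

commit(1, "x", 10)
checkpoint()          # x reaches the data file
commit(2, "y", 20)    # y exists only in the log and the buffer pool
buffer_pool.clear()   # simulated crash: all in-memory state lost
print(recover())      # {'x': 10, 'y': 20} -- both commits survive
```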