Sql-server – When updated records get written to the real physical pages in mdf file

sql servertransactiontransaction-log

I'm a beginner in SQL Server, just some questions on transaction. My textbook says:

A checkpoint is a periodic operation
that forces all dirty pages for the database currently in use to be written to disk

I'm confused, does a checkpoint make all dirty pages written to log file .ldf?

Let's say we write some code:

BEGIN TRAN T1;  
UPDATE table1 
SET postcode = '9000'
WHERE CustomerID = 1;

...
...
COMMIT TRAN T1; 

so by the time the 'commit' is executed, does the updated record in table1 still in cache and waiting for a checkpoint to make the updated record written in the log file?
After the updated record is written to the disk (log file), then how the updated record get written to the real physical pages in mdf file?

Best Answer

We can think of a data store as being a list of all the events that have ever happened - insert row 1; insert row 2; update row 2; delete row 1. To find a current value we simply replay all events over each other and where we end up is the answer. In current parlance this is known as event sourcing.

There are some problems with this, however. Performance is the obvious one as every time a value is required the whole log must be reprocessed.

Since history does not change once recorded an obvious optimization would be to take a snapshot at a point in time and summarize all events up to that point. Now queries can be answered from this summarization plus the much smaller set of events that have happened since the snapshot was calculated. We could even persist these summaries to avoid having to read the whole event stream on start-up.

Further, every event as it arrives can be written into this summary so no query need ever read the event stream. This, effectively, is how a database server works.

On a controlled shut-down these summaries can be persisted cleanly, ready to be read on start-up, and all values are consistent. But what of uncontrolled shutdowns, such as system crashes or loss of power? How can we handle transactions that were not yet commited when the outage occured? Some hardware is too small to hold all summaries in memory so some parts must be shuffled on and off disk. How to keep all parts consistent irrespective of their current location and transactional status?

There are several solutions to these problems. One is the Aries protocol which SQL Server follows. This separates the events from the summaries. The events go to a log file (ldf in SQL Server) and the summaries (i.e. tables) into a data file (mdf). Log records are written synchronously as SQL statements are processed. Tables are written occassionally and asynchronously as the system decides it needs to. Persisting tables to disk is called checkpointing. There is no direct relation between when a SQL statement is processed or a transaction committed and when a checkpoint is processed. Checkpoints are written by a background system task. Often we decide checkpoint frequency to meet the system's recovery time objective (RTO).

Importantly a table can be checkpointed even if it contains values written by a transaction which has not committed. Aries stores enough in the log to correctly identify and recover from this situation.

Under Aries the logical and physical states of the data are separate. Locking a row is a logical operation; the page holding that row may or may not be in memory. Committing a transaction is a logical operation; the pages touched by that transaction need not be persisted immediately, or may have been persisted already.

When data is changed the old and new values plus some metadata are written to the log file and flushed to disk. This write is synchronous. Once this is acknowledged there is a permanent record of the change which can be examined after a crash and used to bring the system back to a consistent state. Only after the log is acknowledged will the change be applied to the data page. This is why it is known as write-ahead logging.

The start and end of each transaction is also recorded in the log. Transactions can start because the client explicitly requested one (BEGIN TRANSACTION). If no explicit transaction exists the system will start its own to ensure the atomicity (the "A" in ACID) of each SQL statement. When a transaction commits only the log record recording this fact is guaranteed to be persisted. The data page holding the newly written value may either be in memory only or may have been checkpointed. It doesn't matter to Aries which is the case.

On recovery the system can read the log, find the last checkpoint, remove changes made by un-committed transactions and apply changes made by committed transactions.

For your specific example of updating a customer's postcode let's suppose there are two matching rows. The log file will end up looking a little like this (in my made-up pseudo-code).

lsn:12345; begin transaction 1;
lsn:24680; transaction 1; object: table1; row:654: column: postcode; old: '3000'; new: '9000'
lsn:35791; transaction 1; object: table1; row:357: column: postcode; old: '3002'; new: '9000'
lsn:45678; commit transaction 1;

Of course if you run a hex editor over an ldf it will look nothing like this, but it illustrates the point. "LSN" is the Log Sequence Number and is used by Aries to relate data pages to log events. Other users' log records will be interleaved with yours. I've indicated this by leaving large gaps between LSNs.

Table checkpoint log records can appear before, between or after what I've shown, or not at all. It doesn't matter to the protocol.

Aries protocol is simple in concept but has to accommodate many corner cases which makes it complex to implement. There are many on-line resources available explaining it. I like this one (YouTube) by a professor at Carnegie Mellon University and this write-up.

Recently SQL Server has introduced a new protocol called Accelerated Database Recovery. It can reduce recovery times from minutes to milliseconds.

Kudos to @DanGuzman for his work on this.