No, your theory is wrong.
Dirty pages can be written to disk even if the transaction has not yet committed. However, it is guaranteed that they cannot be written before the last transaction log record that modified the page has been written to disk.
The transaction log records contain sufficient information for both redo and undo (except in tempdb, where only undo is necessary). If you decide to roll back the transaction, nothing is deleted from the log. Instead, compensation log records are written to the log to indicate this.
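The mechanism in that last sentence can be sketched as follows. This is a toy model, not SQL Server internals; the record fields and function names are invented for illustration:

```python
# Rollback via compensation log records (CLRs): undoing a change appends
# a new record with old and new values swapped; nothing is ever deleted
# from the log.
log = []

def log_update(txn, page, old, new):
    log.append({"txn": txn, "page": page, "old": old, "new": new, "clr": False})

def rollback(txn):
    # Walk this transaction's records newest-first and append a CLR for each.
    for rec in [r for r in reversed(log) if r["txn"] == txn and not r["clr"]]:
        log.append({"txn": txn, "page": rec["page"],
                    "old": rec["new"], "new": rec["old"], "clr": True})

log_update("T1", 210, old="AB1", new="XY9")
rollback("T1")
# The log now holds the original update plus one CLR restoring the old value.
```

Because the undo itself is logged, a crash in the middle of a rollback can be recovered the same way as a crash in the middle of an ordinary transaction.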
Regarding the specific part of the question you asked Sean:
'thus, when the page is "read" by another transaction, the updated
values are there.': How does SQL Server know that data on the disk is
out of date and therefore should read (upon a select query) from the
memory?
Suppose another request comes in and asks to read the same page. The SQL Server Buffer Manager will check the data cache in the buffer pool to see if it already has the page cached in memory. If the page is already cached, the results are passed back to the Access Methods (the set of code that handles I/O requests for rows, indexes, pages, allocations and row versions) to continue with the operation as requested. If the page isn't already in cache, the Buffer Manager will get the page from the database on disk, put it in the data cache, and pass the results to the Access Methods.
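The key point is that every read goes through the cache, so a reader always sees the current in-memory copy; there is no separate "stale" on-disk read to reconcile. A toy version of that check (names invented, not SQL Server code):

```python
# Every page read passes through the cache; the disk copy is only
# consulted when the page is not already in memory.
disk = {7: "page 7 image on disk"}
cache = {}

def read_page(page_id):
    if page_id in cache:                  # already cached: serve from memory
        return cache[page_id]
    cache[page_id] = disk[page_id]        # not cached: fetch from disk first
    return cache[page_id]

first = read_page(7)    # causes a disk read
second = read_page(7)   # served straight from cache
```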
Is there an information on the page header that the data in the page
on the disk is stale? Isn't there a performance penalty for this as it
reads from the disk first and goes to memory next.
Yes, there is. The header of each page in cache stores details about the last two times it was accessed, and a periodic scan through the cache examines these values. A counter is maintained that is decremented if the page hasn't been accessed for a while, and when SQL Server needs to free up some cache, the pages with the lowest counter are flushed first. The process of "aging out" pages from cache and maintaining an available amount of free cache pages for subsequent use can be done by any worker thread after scheduling its own I/O, or by the lazywriter process.
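The counter-based ageing can be sketched like this (heavily simplified; the real algorithm is LRU-K-like and tracks the last two access times, as described above):

```python
# Accesses bump a per-page counter, a periodic scan decrements it, and
# eviction removes the lowest-counter pages first.
cache = {"p1": 0, "p2": 0, "p3": 0}   # page id -> usage counter

def touch(page):
    cache[page] += 1

def age_scan():
    for page in cache:
        cache[page] = max(0, cache[page] - 1)

def evict(n):
    victims = sorted(cache, key=lambda p: cache[p])[:n]
    for page in victims:
        del cache[page]
    return victims

touch("p1"); touch("p1"); touch("p2"); touch("p2")
age_scan()          # p1 -> 1, p2 -> 1, p3 stays at 0
evicted = evict(1)  # p3 has the lowest counter, so it goes first
```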
Having said that, I would recommend you read How It Works: Bob Dorr's SQL Server I/O Presentation.
Take the following as an example: update tblxxx set col1 = GetDate(), where the table has 1 billion rows. The process, simplified for discussion:
Begin transaction--Transaction starts here
Fetch Page--Page is fetched from Disk to memory
Acquire Lock
Acquire Latch
Generate log record--Change to the page generates Log records
Update page LSN to match that of the log record--Each log record is identified by an LSN, and the LSN of the log record is stamped in the header of the page whose change generated it
Change Data (dirty the page)--The in-memory copy now differs from the copy on disk, so the page is dirty
Release Latch
Commit Transaction Starting--Commit issued for transaction
FlushToLSN of Commit--All log records up to and including the commit LSN are flushed to the log disk. This is done before the dirty data page is flushed to disk
Release locks
Commit Transaction Complete
The entire table won't fit into memory, so the lazy writer will write dirty pages to disk. In doing so, the lazy writer will first call the routine FlushToLSN to flush all log records up to and including the LSN stored in the page header of the dirty page. Only then will the lazy writer issue the write of the data page. If a power outage occurs at this point, the log records can be used to reconstruct the page (roll back the changes).
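The ordering in the steps above boils down to a single invariant, sketched here as a toy model (function and field names are invented): a dirty page may reach the data disk only after every log record up to the LSN in its header has reached the log disk.

```python
# Write-ahead rule: FlushToLSN(page LSN) must complete before the page write.
flushed_lsn = 0        # highest LSN durable on the log disk
data_disk = {}

def flush_to_lsn(lsn):
    global flushed_lsn
    flushed_lsn = max(flushed_lsn, lsn)   # flush log up to and including lsn

def lazy_write(page_id, page):
    flush_to_lsn(page["page_lsn"])        # 1. log first (FlushToLSN) ...
    data_disk[page_id] = dict(page)       # 2. ... only then the data page
    page["dirty"] = False

page = {"col1": "GetDate() result", "page_lsn": 101, "dirty": True}
lazy_write(1, page)
# Invariant after the write: flushed_lsn >= data_disk[1]["page_lsn"]
```

Because the invariant always holds, a crash between the two writes leaves the log with everything needed to redo or undo the page.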
Notice that the latch protects the physical access to the in-memory data page only for the small amount of time the physical change is being made. The lock protects the associated data until the commit or rollback takes place.
This allows the lazy writer to remove a dirty page while the locking structures continue to maintain transactional integrity.
A question which can also arise: if a rollback is issued instead of a commit, would SQL Server fetch pages back in to undo the update? The answer is yes. The page could have been removed from the buffer pool/page cache by the lazy writer after the FlushToLSN took place. When the rollback occurs, the pages changed by the transaction may need to be fetched back into the buffer pool to undo the change.
PS: Feel free to ask if any further clarification is required.
Best Answer
We can think of a data store as being a list of all the events that have ever happened: insert row 1; insert row 2; update row 2; delete row 1. To find a current value we simply replay all events over each other, and where we end up is the answer. In current parlance this is known as event sourcing. There are some problems with this, however. Performance is the obvious one, as every time a value is required the whole log must be reprocessed.
Since history does not change once recorded an obvious optimization would be to take a snapshot at a point in time and summarize all events up to that point. Now queries can be answered from this summarization plus the much smaller set of events that have happened since the snapshot was calculated. We could even persist these summaries to avoid having to read the whole event stream on start-up.
Further, every event as it arrives can be written into this summary so no query need ever read the event stream. This, effectively, is how a database server works.
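The progression above can be illustrated with a toy replay function: derive the current state from the whole event list, or from a snapshot plus only the events that arrived after it.

```python
# Event sourcing in miniature: state is a fold over the event list, and a
# snapshot lets us skip replaying the history before it.
events = [("insert", 1, "a"), ("insert", 2, "b"),
          ("update", 2, "B"), ("delete", 1, None)]

def replay(events, start=None):
    state = dict(start or {})
    for op, key, value in events:
        if op == "delete":
            state.pop(key, None)
        else:
            state[key] = value
    return state

full = replay(events)                 # reprocess the whole log
snapshot = replay(events[:2])         # summary of history up to a point
fast = replay(events[2:], snapshot)   # snapshot + tail: same answer, less work
```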
On a controlled shut-down these summaries can be persisted cleanly, ready to be read on start-up, and all values are consistent. But what of uncontrolled shutdowns, such as system crashes or loss of power? How can we handle transactions that were not yet committed when the outage occurred? Some hardware is too small to hold all summaries in memory, so some parts must be shuffled on and off disk. How do we keep all parts consistent irrespective of their current location and transactional status?
There are several solutions to these problems. One is the Aries protocol, which SQL Server follows. This separates the events from the summaries. The events go to a log file (ldf in SQL Server) and the summaries (i.e. tables) into a data file (mdf). Log records are written synchronously as SQL statements are processed. Tables are written occasionally and asynchronously as the system decides it needs to. Persisting tables to disk is called checkpointing. There is no direct relation between when a SQL statement is processed or a transaction committed and when a checkpoint is processed. Checkpoints are written by a background system task. Often we decide checkpoint frequency to meet the system's recovery time objective (RTO).
Importantly a table can be checkpointed even if it contains values written by a transaction which has not committed. Aries stores enough in the log to correctly identify and recover from this situation.
Under Aries the logical and physical states of the data are separate. Locking a row is a logical operation; the page holding that row may or may not be in memory. Committing a transaction is a logical operation; the pages touched by that transaction need not be persisted immediately, or may have been persisted already.
When data is changed the old and new values plus some metadata are written to the log file and flushed to disk. This write is synchronous. Once this is acknowledged there is a permanent record of the change which can be examined after a crash and used to bring the system back to a consistent state. Only after the log is acknowledged will the change be applied to the data page. This is why it is known as write-ahead logging.
The start and end of each transaction is also recorded in the log. Transactions can start because the client explicitly requested one (BEGIN TRANSACTION). If no explicit transaction exists the system will start its own to ensure the atomicity (the "A" in ACID) of each SQL statement. When a transaction commits only the log record recording this fact is guaranteed to be persisted. The data page holding the newly written value may either be in memory only or may have been checkpointed. It doesn't matter to Aries which is the case.
On recovery the system can read the log, find the last checkpoint, remove changes made by un-committed transactions and apply changes made by committed transactions.
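A toy recovery pass in that spirit (the real Aries protocol runs three phases, analysis, redo and undo, over a far richer log format; this sketch only shows the core idea): redo every logged change, then undo those whose transaction never wrote a commit record.

```python
log = [
    {"txn": "T1", "op": "begin"},
    {"txn": "T1", "key": "row1", "old": None, "new": "v1"},
    {"txn": "T2", "op": "begin"},
    {"txn": "T2", "key": "row2", "old": None, "new": "v2"},
    {"txn": "T1", "op": "commit"},
    # crash here: T2 never committed
]

def recover(log):
    committed = {r["txn"] for r in log if r.get("op") == "commit"}
    state = {}
    for r in log:                        # redo phase: reapply every change
        if "key" in r:
            state[r["key"]] = r["new"]
    for r in reversed(log):              # undo phase: roll back the rest
        if "key" in r and r["txn"] not in committed:
            if r["old"] is None:
                state.pop(r["key"], None)
            else:
                state[r["key"]] = r["old"]
    return state

recovered = recover(log)   # only T1's committed change survives
```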
For your specific example of updating a customer's postcode let's suppose there are two matching rows. The log file will end up looking a little like this (in my made-up pseudo-code).
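A sketch in that spirit (every LSN, transaction id, page number and value below is invented for illustration):

```
LSN  1000: BEGIN TRANSACTION   txn=T497
LSN  1040: UPDATE  txn=T497  page=210  row=3  old="AB1 2CD"  new="XY9 8ZW"
LSN  1470: UPDATE  txn=T497  page=863  row=7  old="EF3 4GH"  new="XY9 8ZW"
LSN  1810: COMMIT TRANSACTION  txn=T497
```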
Of course if you run a hex editor over an ldf it will look nothing like this, but it illustrates the point. "LSN" is the Log Sequence Number and is used by Aries to relate data pages to log events. Other users' log records will be interleaved with yours. I've indicated this by leaving large gaps between LSNs.
Table checkpoint log records can appear before, between or after what I've shown, or not at all. It doesn't matter to the protocol.
The Aries protocol is simple in concept but has to accommodate many corner cases, which makes it complex to implement. There are many online resources available explaining it. I like this one (YouTube) by a professor at Carnegie Mellon University, and this write-up.
Recently SQL Server has introduced a new protocol called Accelerated Database Recovery. It can reduce recovery times from minutes to milliseconds.
Kudos to @DanGuzman for his work on this.