MySQL – In InnoDB, how does fuzzy checkpointing's recovery consistency work?


I've looked through the documentation quite thoroughly but still cannot figure this out, so I'm asking here.

In InnoDB, I understand that any updates against the buffer pool are tracked in the redo log, which is persisted to disk, and the redo log is used for recovery in the case of a crash.

Occasionally, the MySQL server also flushes dirty pages from the buffer pool. Under normal operation, the server flushes only some of the dirty pages at a time, which is called "fuzzy checkpointing". Under this procedure, the current buffer pool is claimed to be recoverable by reading the contents of the pages on disk and then applying all the redo log records whose LSN is greater than the last checkpoint.
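The recovery procedure described above can be illustrated with a toy model (this is a sketch of the idea, not actual InnoDB code; the dictionary layout and names are invented). The key point is that each on-disk page records the LSN of the last change it contains, so replay is idempotent: a redo record is skipped when the page was already flushed with that change applied.

```python
# Toy model of fuzzy-checkpoint recovery (illustrative, not InnoDB code).
# Each on-disk page carries the LSN of the last change flushed into it;
# recovery replays redo records newer than the checkpoint LSN, skipping
# any record whose change the page already contains.

def recover(pages, redo_log, checkpoint_lsn):
    """pages: {page_id: {"lsn": int, "data": str}}
       redo_log: list of (lsn, page_id, new_data), ordered by LSN."""
    for lsn, page_id, new_data in redo_log:
        if lsn <= checkpoint_lsn:
            continue                      # durable before the checkpoint
        page = pages[page_id]
        if lsn <= page["lsn"]:
            continue                      # page flushed with this change applied
        page["data"] = new_data           # redo the change
        page["lsn"] = lsn
    return pages

pages = {
    1: {"lsn": 90,  "data": "old"},    # flushed before later changes
    2: {"lsn": 140, "data": "newer"},  # flushed *after* the checkpoint
}
redo_log = [(120, 1, "a"), (140, 2, "newer"), (160, 1, "b")]
recovered = recover(pages, redo_log, checkpoint_lsn=100)
# page 1 is rolled forward to LSN 160; the record for page 2 is skipped
# because the page on disk already reflects it.
```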

My question is: how does the MySQL server choose which dirty pages to flush while still supporting crash recovery?

From some googling, I understood that by using each dirty page's first-modification LSN (log sequence number), one can determine which pages must be flushed so that the checkpoint LSN can advance.

But the dirty page holding the earliest uncheckpointed change may also have been modified by a later transaction, and thus contain content "from the future" relative to the earliest uncheckpointed redo record. I assume it would be very difficult (if possible at all) to redo from the on-disk buffer pool pages if they include such future content.

So question:

  • How does the MySQL server choose which pages to flush while still supporting crash recovery?

Best Answer

(I'm pretty sure of the following.)

InnoDB does not depend on "dirty" pages for recovery. Recovery is guaranteed by what is stored in the redo log files (ib_logfile*) and the doublewrite buffer.

The presumption is that the information about a transaction can be more compactly stored, and more rapidly written to disk, in the redo log (versus the actual table).

The log files are overwritten, but not until the checkpoint LSN says it is safe. So, the optimal dirty page to flush is either the "least recently used" one or the one with the oldest position in the log. I don't know what algorithm it uses to decide between these conflicting goals.
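The "oldest position in the log" criterion can be sketched as follows (a simplified model with invented names, not InnoDB's actual flush-list code): flushing dirty pages in order of their first-modification LSN raises the minimum over the remaining dirty pages, and that minimum is how far the checkpoint LSN may advance, which in turn frees log space for reuse.

```python
# Sketch: flushing oldest-first-modified pages lets the checkpoint advance.
# (Simplified model; names like "flush_list" are illustrative.)
import heapq

def advance_checkpoint(flush_list, n_to_flush):
    """flush_list: list of (oldest_modification_lsn, page_id) dirty pages.
       Flush the n oldest-modified pages, then return the new checkpoint
       LSN: the oldest first-modification LSN still unflushed."""
    heapq.heapify(flush_list)
    for _ in range(min(n_to_flush, len(flush_list))):
        heapq.heappop(flush_list)          # write this dirty page to disk
    return flush_list[0][0] if flush_list else None

dirty = [(105, "p3"), (100, "p1"), (130, "p2"), (150, "p4")]
new_ckpt = advance_checkpoint(dirty, 2)    # flushes p1 (100) and p3 (105)
# the checkpoint can now advance to LSN 130
```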

If there is a lot of activity causing the percentage of the buffer_pool to be 'too close' to 100% dirty, InnoDB shifts gears and becomes more aggressive at flushing dirty pages. This, also, is a tradeoff.

Note also that writes to non-unique secondary indexes are also "delayed". They go into the "change buffer", which (by default) may occupy up to 25% of the buffer pool. The hope is that the updates can be somewhat sorted and written to disk with fewer read-modify-write cycles. Again, recovery does not depend on this having been completely flushed to disk; the redo log is the critical part.

The doublewrite buffer protects against "torn pages". This is a potentially disastrous situation where the disk subsystem cannot write the full 16KB page in one atomic operation. A few newer disks guarantee atomicity, in which case the setting (innodb_doublewrite) can be turned off.
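The torn-page protection can be illustrated with a toy sketch (the checksum scheme and storage layout here are simplified stand-ins, not InnoDB's actual on-disk format): every page is first written whole to the doublewrite area and then in place; on read, a checksum mismatch reveals a torn in-place copy, which is repaired from the doublewrite copy.

```python
# Toy illustration of torn-page detection and doublewrite recovery.
# (Simplified: real InnoDB uses its own per-page checksum format.)
import zlib

def write_page(storage, doublewrite, page_id, data):
    record = (zlib.crc32(data), data)
    doublewrite[page_id] = record   # 1) write the full page to the doublewrite area
    storage[page_id] = record       # 2) then write it in place in the tablespace

def read_page(storage, doublewrite, page_id):
    checksum, data = storage[page_id]
    if zlib.crc32(data) != checksum:         # torn page: in-place copy is partial
        checksum, data = doublewrite[page_id]
        assert zlib.crc32(data) == checksum  # doublewrite copy must be intact
        storage[page_id] = (checksum, data)  # repair the tablespace copy
    return data

storage, dwb = {}, {}
write_page(storage, dwb, 7, b"full 16KB page contents")
# Simulate a crash mid-write: the in-place copy is truncated ("torn").
good_crc = storage[7][0]
storage[7] = (good_crc, b"full 16KB pa")
restored = read_page(storage, dwb, 7)       # recovered from the doublewrite copy
```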

InnoDB is crash-safe. But it is also fast, because it delays I/O and uses these various techniques that remain efficient under high load.