MySQL – In InnoDB, how does fuzzy checkpointing's recovery consistency work?


I've looked through the documentation quite thoroughly but still cannot figure this out, so I'm asking here.

In InnoDB, I understand that any updates against the buffer pool are tracked in the redo log, which is persisted to disk, and the redo log is used for recovery in the case of a crash.

Occasionally, the MySQL server also flushes dirty pages from the buffer pool. Under normal operation, the server flushes only some of the dirty pages at a time, which is called "fuzzy checkpointing". Under this procedure, the current buffer pool is claimed to be recoverable by reading the contents of the pages on disk and then applying all the redo log records whose LSN is greater than the last checkpoint.
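The recovery procedure described above can be illustrated with a toy model (this is a sketch of the idea, not actual InnoDB code; the dictionary layout and names are invented). The key point is that each on-disk page records the LSN of the last change it contains, so replay is idempotent: a redo record is skipped when the page was already flushed with that change applied.

```python
# Toy model of fuzzy-checkpoint recovery (illustrative, not InnoDB code).
# Each on-disk page carries the LSN of the last change flushed into it;
# recovery replays redo records newer than the checkpoint LSN, skipping
# any record whose change the page already contains.

def recover(pages, redo_log, checkpoint_lsn):
    """pages: {page_id: {"lsn": int, "data": str}}
       redo_log: list of (lsn, page_id, new_data), ordered by LSN."""
    for lsn, page_id, new_data in redo_log:
        if lsn <= checkpoint_lsn:
            continue                      # durable before the checkpoint
        page = pages[page_id]
        if lsn <= page["lsn"]:
            continue                      # page flushed with this change applied
        page["data"] = new_data           # redo the change
        page["lsn"] = lsn
    return pages

pages = {
    1: {"lsn": 90,  "data": "old"},    # flushed before later changes
    2: {"lsn": 140, "data": "newer"},  # flushed *after* the checkpoint
}
redo_log = [(120, 1, "a"), (140, 2, "newer"), (160, 1, "b")]
recovered = recover(pages, redo_log, checkpoint_lsn=100)
# page 1 is rolled forward to LSN 160; the record for page 2 is skipped
# because the page on disk already reflects it.
```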

My question is: how does the MySQL server choose which dirty pages to flush while still supporting crash recovery?

From some googling, I understood that by using each dirty page's first-modification LSN (log sequence number), one can determine which pages must be flushed so that the checkpoint LSN can advance.

But the dirty page holding the earliest uncheckpointed change may also have been modified by a later transaction, and thus contain content "from the future" relative to the earliest uncheckpointed redo record. I assume it would be very difficult (if possible at all) to redo from the on-disk buffer pool pages if they include such future content.

So question:

  • How does the MySQL server choose which pages to flush while still supporting crash recovery?

Best Answer

(I'm pretty sure of the following.)

InnoDB does not depend on "dirty" pages for recovery. Recovery is guaranteed by what is stored in the redo log files (ib_logfile*) and the doublewrite buffer.

The presumption is that the information about a transaction can be more compactly stored, and more rapidly written to disk, in the redo log (versus the actual table).

The log files are overwritten, but not until the checkpoint LSN says it is safe. So, the optimal dirty page to flush is either the "least recently used" one or the one with the oldest position in the log. I don't know what algorithm it uses to decide between these conflicting goals.
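The "oldest position in the log" criterion can be sketched as follows (a simplified model with invented names, not InnoDB's actual flush-list code): flushing dirty pages in order of their first-modification LSN raises the minimum over the remaining dirty pages, and that minimum is how far the checkpoint LSN may advance, which in turn frees log space for reuse.

```python
# Sketch: flushing oldest-first-modified pages lets the checkpoint advance.
# (Simplified model; names like "flush_list" are illustrative.)
import heapq

def advance_checkpoint(flush_list, n_to_flush):
    """flush_list: list of (oldest_modification_lsn, page_id) dirty pages.
       Flush the n oldest-modified pages, then return the new checkpoint
       LSN: the oldest first-modification LSN still unflushed."""
    heapq.heapify(flush_list)
    for _ in range(min(n_to_flush, len(flush_list))):
        heapq.heappop(flush_list)          # write this dirty page to disk
    return flush_list[0][0] if flush_list else None

dirty = [(105, "p3"), (100, "p1"), (130, "p2"), (150, "p4")]
new_ckpt = advance_checkpoint(dirty, 2)    # flushes p1 (100) and p3 (105)
# the checkpoint can now advance to LSN 130
```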

If there is a lot of activity causing the percentage of the buffer_pool to be 'too close' to 100% dirty, InnoDB shifts gears and becomes more aggressive at flushing dirty pages. This, also, is a tradeoff.

Note also that writes to non-unique secondary indexes are also "delayed". They go into the "change buffer", which (by default) may occupy up to 25% of the buffer pool. The hope is that the updates can be somewhat sorted and written to disk with fewer read-modify-write cycles. Again, recovery does not depend on this having been completely flushed to disk; the redo log is the critical part.

The doublewrite buffer protects against "torn pages". This is a potentially disastrous situation where the disk subsystem cannot write the full 16KB page in one atomic operation. A few newer disks guarantee atomicity, in which case the setting (innodb_doublewrite) can be turned off.
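The torn-page protection can be illustrated with a toy sketch (the checksum scheme and storage layout here are simplified stand-ins, not InnoDB's actual on-disk format): every page is first written whole to the doublewrite area and then in place; on read, a checksum mismatch reveals a torn in-place copy, which is repaired from the doublewrite copy.

```python
# Toy illustration of torn-page detection and doublewrite recovery.
# (Simplified: real InnoDB uses its own per-page checksum format.)
import zlib

def write_page(storage, doublewrite, page_id, data):
    record = (zlib.crc32(data), data)
    doublewrite[page_id] = record   # 1) write the full page to the doublewrite area
    storage[page_id] = record       # 2) then write it in place in the tablespace

def read_page(storage, doublewrite, page_id):
    checksum, data = storage[page_id]
    if zlib.crc32(data) != checksum:         # torn page: in-place copy is partial
        checksum, data = doublewrite[page_id]
        assert zlib.crc32(data) == checksum  # doublewrite copy must be intact
        storage[page_id] = (checksum, data)  # repair the tablespace copy
    return data

storage, dwb = {}, {}
write_page(storage, dwb, 7, b"full 16KB page contents")
# Simulate a crash mid-write: the in-place copy is truncated ("torn").
good_crc = storage[7][0]
storage[7] = (good_crc, b"full 16KB pa")
restored = read_page(storage, dwb, 7)       # recovered from the doublewrite copy
```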

InnoDB is crash-safe. But it is also fast, because it delays I/O and uses these various techniques that remain efficient under high load.