Sql-server – SQL Errorlog Constant FlushCache Messages

availability-groupssql serversql-server-2017

A server that is part of a SQL 2017 Always on availability group (BAG) I have noticed an increase of these messages. I understand these are messages that appear in the log as standard since 2012 (trace flag prior to 2012) but over the last 2 days they are appearing every few minutes, there are no backups running or maintenance jobs at these times

The server is acting as the secondary non readable fail over partner and the primary server is not exhibiting the same behavior.

09/04/2020 10:52:24,spid82s,Unknown,FlushCache: cleaned up 2303 bufs with 1881 writes in 81090 ms (avoided 11 new dirty bufs) for db 7:0
09/04/2020 10:52:07,spid52s,Unknown,last target outstanding: 2 avgWriteLatency 86
09/04/2020 10:52:07,spid52s,Unknown,average writes per second: 17.70 writes/sec average throughput: 0.19 MB/sec I/O saturation: 13073 context switches 15434

There is also a warning in the system error log about disk issue, seconds after this the availability group failed over ever since the FlushCache messages have increased.

enter image description here

Has anyone experienced anything similar or have any advice, my sysadmin is also looking at the SAN and VMware estate.

Had a sudden thought could these be caused by autogrowth activities?

Best Answer

Switch to indirect checkpoints (generally, set recovery interval time to 60 seconds, at least for user databases).

I wrote about this issue here and here, and you should read those posts in full before making the change, though it’s pretty much a no-brainer IMHO.

The impact in our environment was drastic and immediate, and it was simple to prove the change was responsible. Every database with the symptom went from 30-60 second checkpoints, and errorlogs full of these messages, to sub-second checkpoints and no more errorlog entries.

There are probably mitigation techniques for your workload, too, for example see this post by Itzik Ben-Gan but making checkpoints more efficient is a quicker and easier path. Your storage and VM folks aren’t off the hook, though; there is definitely a disk issue they need to address (but I disagree that these FlushCache messages were the root cause of any failover; more likely they are just a different victim/symptom).