Sql-server – FlushCache messages appearing in log at specific times

sql serversql-server-2012

We've been having a lot of database performance issues lately, and I've been trying to see if I can figure out why. We don't have a DBA (I'm a software developer) so I'm sort of just winging it, and much of what I find online reads like a foreign language to me.

We've been restarted SQL Server every morning because that's the only way it's operational during the working day. I noticed that every morning around 5 am we are started to get this message every two minutes in the log:

FlushCache: cleaned up 11848 bufs with 7432 writes in 97168 ms
(avoided 8139 new dirty bufs) for db 9:0

last target outstanding: 4, avgWriteLatency 32

average throughput: 0.72 MB/sec, I/O saturation:
11635, context switches 18849

The numbers differ each time of course, but it's the same message over and over in that pattern until I restart the server. I'm not sure how to interpret this, I've been trying to Google about it and all I've gathered is that it means there might be something wrong with I/O and that something is taking longer than it's supposed to. We recently switched to using SSD so I didn't think it should be a write problem.

Could anyone shed some light on this?

Best Answer

The FlushCache message in the error log is caused by checkpoint logging, and in this case by a long checkpoint (which is defined as a checkpoint that is taking longer than the recovery interval). Whether it's logged or not, the behavior is different in pre-2012 and 2012+. Before SQL Server 2012, to get checkpoint logging you'd have to turn on a trace flag (T3504). But starting in SQL Server 2012 that message is logged by default when a long checkpoint is encountered.

Now as for the question of "is this actually bad?", you really need to start looking at these numbers given their context. It took you 97+ seconds to flush only about 93 MB of dirty buffers. This looks like this could potentially be a mixture of a lot of data churn (during the actual checkpoint itself, about 64 MB worth of buffers were also dirtied) and potentially storage that isn't keeping up with the data modification and/or the rest of the I/O workload.

What I would do is verify the health of your storage subsystem, look at waits, and just get an overall performance picture of the instance. Take a look at logical disk perfmon counters and see what the overall I/O churn is with throughput, latency, and IOps. It will help you paint a more vivid picture of how the disks are performing. If you have the ability to benchmark your storage, if you haven't already baselined it, you should see what these volumes in question are capable of (SQLIO is a great utility for that) and what they are doing right now (it's nice to have a benchmark baseline when the volumes were stood up to compare to a current benchmark).

Here is a great article explaining this message - How It Works: When is the FlushCache message added to SQL Server Error Log?

EDIT: Re-reading your question, I must've missed this comment:

I noticed that every morning around 5 am we are started to get this message

See what is happening on your storage at this time as per the guidance above. That sounds like textbook scheduled operation that is taking a toll on the storage causing the checkpoint performance to suffer and to be "long".