Sql-server – Page life expectancy very low on certain time

buffer-poolmemorypage-life-expectancysql-server-2008

Everyday we receive an alert on a specific period (03:00 AM – 04:00 AM) indicating that Page life expectancy is too low :

SCOM :Alert: SQL DB 2008 Engine Page Life Expectancy is too low

We have reviewed if any Memory intensive operation is running on this period but nothing was found.

We assume that the cause is a query run through a windows service that cause pages life in the buffer cache to reach a very low value (8 – 22 seconds)

what is a good way to troubleshoot this issue ?

Thanks
Osama Waly

Best Answer

It's almost certainly going to be full backups or indexing. The queries won't show up as a bunch of memory allocated but rather as intensive disk IO. If you are tracking those counters then have a look. Some other easy ways to confirm that these are the cause of that disk IO:

  • Looking at agent job schedules. If it wasn't checkdb, or indexing, then apps often do their own internal database clean-ups (e.g. deleting records) late at night also.
  • When checkdb runs it will be written to the ERRORLOG, so run sp_readerrorlog and look for anything at the PLE drop time. This is also useful because some apps like OpsMgr will run their own indexing and checkdb routines at random times without using Agent jobs.
  • There are also other queries you can run to identify if indexes or stats were recently updated (you'd run it in each database and cast an eye over the results). They aren't 100% accurate but if you saw lots of indexes or stats updated at the PLE drop time it would be an indicator.

Otherwise just stay up late one night and run sp_WhoIsActive interactively to see what's happening. You can also make sp_WhoIsActive output to a table, which means you can schedule an Agent job to run it to a table a few times during your PLE drop window, and then check the output in the morning.

If it turns out to be checkdb or indexing this PLE behaviour is fairly normal and doesn't cause a problem unless you have a very high throughput or time sensitive system that's very busy that early in the morning; if nobody is complaining (and you don't see an increase in query timeouts at that time if you're even tracking them) then likely it's just smoke and no fire.

What if it's not checkdb or indexes? You could have a bad query compilation combined with special parameters suddenly chewing up disk IO and wrecking the buffer. If you're lucky it'll be obvious when you run sp_WhoIsActive, but if you're unlucky then you'd attempt to go through the plan cache using DMVs to find it.

That's a very tough thing to do and there are entire books on it; even if you have the scripts they often don't really aggregate how a plan changes for a query over time along with execution times. You can waste days and weeks trying to work it out while setting up extended events only to find they didn't capture the plan handle for some reason or it's a cursor and you don't know what it's really doing.

If you reach that point then tell your boss you need a tool to assist and buy a license of a proper monitor like SQL Sentry to deploy just during the real tricky problems like this. That's what I did. You can also use an evaluation or use another similar tool to drill down.

Anyway if it does turn out to be a false alarm, surprise, that's a common downside of OpsMgr. It may be amazing for some tasks but for SQL Server it's never been a good fit no matter how they try to shoehorn it. I haven't been able to tune its alerting to a functionally acceptable level so can't comment on improving that.