Sql-server – Page life expectancy very low on certain time

buffer-poolmemorypage-life-expectancysql-server-2008

Everyday we receive an alert on a specific period (03:00 AM – 04:00 AM) indicating that Page life expectancy is too low :

SCOM :Alert: SQL DB 2008 Engine Page Life Expectancy is too low

We have reviewed if any Memory intensive operation is running on this period but nothing was found.

We assume that the cause is a query run through a windows service that cause pages life in the buffer cache to reach a very low value (8 – 22 seconds)

what is a good way to troubleshoot this issue ?

Thanks
Osama Waly

Best Answer

It's almost certainly going to be full backups or indexing. The queries won't show up as a bunch of memory allocated but rather as intensive disk IO. If you are tracking those counters then have a look. Some other easy ways to confirm that these are the cause of that disk IO:

Looking at agent job schedules. If it wasn't checkdb, or indexing, then apps often do their own internal database clean-ups (e.g. deleting records) late at night also.
When checkdb runs it will be written to the ERRORLOG, so run sp_readerrorlog and look for anything at the PLE drop time. This is also useful because some apps like OpsMgr will run their own indexing and checkdb routines at random times without using Agent jobs.
There are also other queries you can run to identify if indexes or stats were recently updated (you'd run it in each database and cast an eye over the results). They aren't 100% accurate but if you saw lots of indexes or stats updated at the PLE drop time it would be an indicator.

Otherwise just stay up late one night and run sp_WhoIsActive interactively to see what's happening. You can also make sp_WhoIsActive output to a table, which means you can schedule an Agent job to run it to a table a few times during your PLE drop window, and then check the output in the morning.

If it turns out to be checkdb or indexing this PLE behaviour is fairly normal and doesn't cause a problem unless you have a very high throughput or time sensitive system that's very busy that early in the morning; if nobody is complaining (and you don't see an increase in query timeouts at that time if you're even tracking them) then likely it's just smoke and no fire.

What if it's not checkdb or indexes? You could have a bad query compilation combined with special parameters suddenly chewing up disk IO and wrecking the buffer. If you're lucky it'll be obvious when you run sp_WhoIsActive, but if you're unlucky then you'd attempt to go through the plan cache using DMVs to find it.

That's a very tough thing to do and there are entire books on it; even if you have the scripts they often don't really aggregate how a plan changes for a query over time along with execution times. You can waste days and weeks trying to work it out while setting up extended events only to find they didn't capture the plan handle for some reason or it's a cursor and you don't know what it's really doing.

If you reach that point then tell your boss you need a tool to assist and buy a license of a proper monitor like SQL Sentry to deploy just during the real tricky problems like this. That's what I did. You can also use an evaluation or use another similar tool to drill down.

Anyway if it does turn out to be a false alarm, surprise, that's a common downside of OpsMgr. It may be amazing for some tasks but for SQL Server it's never been a good fit no matter how they try to shoehorn it. I haven't been able to tune its alerting to a functionally acceptable level so can't comment on improving that.

Related Solutions

Sql-server – What units is Page Life Expectancy in

I believe that it's in seconds. I've got a script running a loop that's capturing the value once every minute and it's going up steadily by ~60 every 1 minute interval.

Sql-server – Can SQL Server’s Page Life Expectancy be too high

No, I can't think of any situation or theory in my head that could illustrate a negative side effect for an astronomically high PLE (unless you enjoy the humming sound of the platters in a hard disk drive?).

With memory getting cheaper and cheaper, and buyers-of-hardware getting a little more generous, these bigger memory boxes we see it is quite common to have PLE in the thousands upon thousands range. It's simply one indication that your instance may not be under memory pressure. Take it for what it is, and only that.

Page Life Expectancy is just an estimation SQL Server makes on how long it thinks a page will last in a buffer. This is SQL Server's way of saying it thinks it'll be "a while".

Another thing of note, take Paul Randal's advice and don't concentrate on the PLE reported by the buffer manager. Like all averages, the buffer manager can hide differing PLEs. You want to look at all of the individual NUMA nodes' PLEs with the Buffer Node counter:

Get-Counter -ListSet "*" | 
    Select-Object -ExpandProperty Counter | 
    Where-Object {$_ -like "*buffer node*page life expectancy*"} | 
    Get-Counter

Page Life Expectancy is simply one color in the beautiful painting that is SQL Server memory. Look at it along with the other "colors" (target/total server memory, available mbytes, etc.). Too often people look at one simple metric and fear either their server is about to explode, or conversely that it'll live for centuries unmanaged.

Best Answer

Related Solutions

Sql-server – What units is Page Life Expectancy in

Sql-server – Can SQL Server’s Page Life Expectancy be too high

Related Question