Let's imagine: I have several instances of an application, spread across several servers; each instance has a database, and data from each database is regularly extracted and stored in a central data warehouse. Sounds like classic stuff.
This DW is ~1 TB in size; all the tables are partitioned (monthly) on the creation date of the records.
The DB server has 32 GB of RAM allocated to SQL Server (2014, Enterprise).
The thing is that the PLE of this DB server is always low, meaning less than 120 seconds.
I know that evaluating a "good" PLE value is a complex topic (here or there) and should be correlated with other indicators, but I have always been told that the higher the PLE, the better (for performance, stability, etc.).
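For reference, the PLE value Nagios is alerting on can be read directly from SQL Server's performance counter DMV; a minimal sketch (the `LIKE '%Buffer Manager%'` filter accounts for named instances, where the object name is prefixed):

```sql
-- Current Page Life Expectancy, in seconds, from the Buffer Manager counters
SELECT [object_name],
       cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND [object_name] LIKE '%Buffer Manager%';
```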
The process that extracts data from production instances to the DW runs every 5 minutes.
This process can affect thousands of records, not necessarily the most recent ones (for instance, customer A is created, customer B, who was created 6 months ago, is updated, etc.). So each time, SQL Server has to access different pages, and therefore the PLE drops.
So, as the process is designed currently, the low PLE seems normal to me.
And actually it may not be such a big deal, because all the other processes using this DW (reporting, aggregation and so on) are stable and performant.
So, questions:
- is my understanding correct, and is a low PLE normal in this context?
- should I stop worrying and tell the supervision guy not to shout every time Nagios says the PLE is too low?
- or should I stop asking dumb questions and ask for more CPU / RAM?
Best Answer
It can be normal for Page Life Expectancy (PLE) on SQL Servers in a Data Warehouse setup with frequent reloads to jump up and down.
Why?
Well, which data will users normally query? The data that has been most recently inserted, because that is the data that is interesting, unless you are performing analysis over long periods of time and conducting forecasts based on last year's data.
Page Handling in Memory (Buffer Cache)
Microsoft SQL Server keeps an internal counter in each page that is stored in memory. This counter starts at four and is decremented by one each time the database engine scans the page and finds that no reads have occurred since the last scan.
(Reference: An in-depth look at SQL Server Memory–Part 2)
In a Data Warehouse setup, each time you load new data, and each time users request this new data, it has to be read from disk into memory (the buffer cache). This causes all pages whose internal counter has reached zero to be removed from the buffer cache.
Page Life Expectancy
Removing data pages from the buffer cache and moving new data from disk into the buffer cache causes the Page Life Expectancy value to fall below the recommended threshold of 300, because the new pages initially have a PLE of zero (0). The average of all new pages in memory (PLE = 0) and all old pages in memory (PLE > 300) can fall below 300. (A PLE of 300 is not much of a recommendation nowadays; reference: Knee-Jerk PerfMon Counters: Page Life Expectancy.)
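On a multi-NUMA-node server, the Buffer Manager PLE is an aggregate, and a single node being flushed can drag it down. If that applies to your hardware, a sketch of the per-node breakdown from the same DMV (one row per NUMA node, `instance_name` being the node id):

```sql
-- Page Life Expectancy per NUMA node (Buffer Node object)
SELECT instance_name AS numa_node,
       cntr_value    AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND [object_name] LIKE '%Buffer Node%'
ORDER BY instance_name;
```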
Your SQL Server is performing as designed, given the current hardware
More Hardware?
Now, if your users are experiencing performance issues when retrieving data, then you might benefit from more memory, because SQL Server would be able to keep more data in the buffer cache without having to purge all of it when new data is loaded. Your PLE counter will probably still drop when new data is loaded into your DW database, but it might not drop so dramatically.
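Before asking for more RAM, it is worth seeing what is actually occupying the buffer cache. A sketch using `sys.dm_os_buffer_descriptors` (note this DMV can be slow to scan on large instances, so run it off-peak):

```sql
-- Buffer cache usage per database, in MB (each page is 8 KB)
SELECT DB_NAME(database_id)       AS database_name,
       COUNT(*) * 8 / 1024       AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;
```

If the DW database dominates the cache and the ETL still churns it every 5 minutes, more memory will mostly buy you a less dramatic PLE drop, as described above.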
Baseline
You will have to monitor your system and establish a baseline that shows you its normal behaviour. Anything outside the norm is then worth analysing.
5 Reasons You Must Start Capturing Baseline Data
Back to Basics: Capturing Baselines on Production SQL Servers
Once you know your baseline, you know what values for PLE are good for your DW SQL Server.
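A minimal way to start capturing that baseline is to snapshot the PLE counter on a schedule (e.g. a SQL Agent job every 5 minutes, matching your ETL cadence). The table and job names below are illustrative, not part of any standard tooling:

```sql
-- Hypothetical baseline table; adjust schema/retention to taste
CREATE TABLE dbo.PleBaseline (
    capture_time datetime2 NOT NULL DEFAULT SYSDATETIME(),
    ple_seconds  bigint    NOT NULL
);
GO

-- Run this on a schedule to accumulate the baseline
INSERT INTO dbo.PleBaseline (ple_seconds)
SELECT cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND [object_name] LIKE '%Buffer Manager%';
```

With a few weeks of samples you can chart PLE against the ETL schedule and set the Nagios threshold from your own data rather than the generic 300-second rule.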