SQL Server – Low PLE and data warehouse

data-warehouse, database-design, sql-server

Let's imagine: I have several instances of an application spread across several servers; each instance has its own database, and the data of each database is regularly extracted and loaded into a central data warehouse. Sounds like classic stuff.

The DW is ~1 TB in size, and all the tables are partitioned (monthly) on the creation date of the records.
The DB server has 32 GB of RAM allocated to SQL Server (2014, Enterprise edition).

The thing is that the PLE of this DB server is always low, low meaning less than 120 seconds.
I know that judging what a "good" PLE value is is a complex topic (here or there) and that PLE should be correlated with other indicators, but I have always been told that the higher the PLE, the better (for performance, stability, etc.).
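
(For reference, the PLE value Nagios alerts on can be read directly from sys.dm_os_performance_counters; the sketch below is just one way to do it, and the Buffer Node rows are only relevant on NUMA hardware.)

```sql
-- Quick sketch: current PLE as reported by the Buffer Manager (overall value)
-- and by each Buffer Node (one row per NUMA node).
SELECT [object_name],
       instance_name,        -- node id for Buffer Node rows, blank for Buffer Manager
       cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE N'Page life expectancy%'
  AND ([object_name] LIKE N'%Buffer Manager%'
       OR [object_name] LIKE N'%Buffer Node%');
```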

The process that extracts data from the production instances into the DW runs every 5 minutes.
This process can affect thousands of records, and not necessarily the most recent ones (for instance, customer A is created while customer B, created 6 months ago, is updated, etc.). So each time, SQL Server has to access different pages, and therefore PLE drops.
So, as the process is currently designed, the low PLE seems normal to me.
And actually, maybe it's not such a big deal, because all the other processes using this DW (reporting, aggregation and so on) are stable and performant.
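
(A sketch like the one below, based on sys.dm_os_buffer_descriptors, gives a rough idea of how much of the buffer cache each database occupies right after a load; the per-database grouping is just an example of what can be useful to watch.)

```sql
-- Rough check: how much of the buffer cache each database occupies.
-- Pages are 8 KB, so pages / 128 gives MB.
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*)             AS cached_pages,
       COUNT(*) / 128       AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_pages DESC;
```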

So, questions:

  • Is my understanding correct, and is a low PLE normal in this context?
  • Should I stop worrying and tell the supervision guy not to shout every time Nagios says the PLE is too low?
  • Or should I stop asking dumb questions and ask for more CPU/RAM?

Best Answer

It can be normal for Page Life Expectancy (PLE) to jump up and down on a SQL Server hosting a data warehouse with frequent reloads.

Why?

Well, which data will users normally query? The data that has been most recently inserted, because that is the data that is interesting, unless you are performing analysis over long periods of time and forecasting based on last year's data.

Page Handling in Memory (Buffer Cache)

Microsoft SQL Server keeps an internal counter for each page stored in memory. The counter starts at four and is decremented by one each time the database engine sweeps past the page without any read having occurred in the meantime.

Under this algorithm, there are “Clock Hands” that sweep the cache at regular intervals. Every time the clock hand steps on a not-in-use entry, it decreases the cost by some amount. If the entry is not in use and its cost is zero, the clock hand makes the entry invisible and then attempts to remove it from the cache. In fact, the lifetime of an entry is managed by an embedded reference count in the Clock Entry Info class. After this count goes to 0, the entry is destroyed.

(Reference: An in-depth look at SQL Server Memory–Part 2)
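
The activity of those clock hands is exposed through a DMV, so you can actually watch the sweeps happening. A rough sketch (the selected columns are to the best of my knowledge and may differ slightly between versions):

```sql
-- Rough look at the cache-sweeping clock hands described above.
-- Treat this as a sketch; column availability can vary by SQL Server version.
SELECT name,
       [type],
       clock_hand,
       clock_status,
       rounds_count,
       removed_all_rounds_count
FROM sys.dm_os_memory_cache_clock_hands
WHERE rounds_count > 0
ORDER BY removed_all_rounds_count DESC;
```

(As far as I know, these hands sweep SQL Server's internal caches, while the buffer pool is aged by its own, similar sweep; the quote above describes the general mechanism.)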

In a data warehouse setup, each time you load new data, and each time users request that new data, it has to be read from disk into memory (the buffer cache). To make room, the pages whose internal counter has reached zero are removed from the buffer cache.

Page Life Expectancy

Removing old data pages from the buffer cache and reading new data from disk into the buffer cache causes the Page Life Expectancy value to fall below the recommended threshold of 300. Roughly speaking, the freshly read pages have been in memory for no time at all, so averaging them together with the older pages (which on their own would report a PLE well above 300) can pull the overall value below 300.

A PLE of 300 is not much of a recommendation nowadays:

A PLE of 300 means your entire buffer pool is being effectively flushed and re-read every five minutes. When the threshold guidance for PLE of 300 was first given by Microsoft, around 2005/2006, that number may have made more sense as the average amount of memory on a server was much lower. Nowadays, where servers routinely have 64GB, 128GB, and higher amounts of memory, having roughly that much data being read from disk every five minutes would likely be the cause of a crippling performance issue...

(Reference: Knee-Jerk PerfMon Counters: Page Life Expectancy)
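
Instead of the flat 300 figure, a commonly cited rule of thumb scales the threshold with the size of the buffer pool, roughly (buffer pool in GB / 4) * 300. A hedged sketch of how you could compare the current PLE against such a dynamic threshold:

```sql
-- Hedged sketch: compare current PLE to a threshold scaled to the size of
-- this instance's buffer pool, instead of the flat 300 figure.
-- The (GB / 4) * 300 scaling is a rule of thumb, not a hard rule.
DECLARE @buffer_pool_gb decimal(10, 2);

SELECT @buffer_pool_gb = COUNT(*) / 128.0 / 1024.0   -- 8 KB pages -> GB
FROM sys.dm_os_buffer_descriptors;

SELECT cntr_value                               AS current_ple_seconds,
       CAST(@buffer_pool_gb / 4.0 * 300 AS int) AS suggested_threshold_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE N'Page life expectancy%'
  AND [object_name] LIKE N'%Buffer Manager%';
```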

Your SQL Server is performing as designed, given the current hardware

More Hardware?

Now, if your users are experiencing performance issues when retrieving data, then you might benefit from more memory, because SQL Server would be able to keep more data in the buffer cache without having to purge so much of it when new data is loaded. Your PLE counter will probably still fall when new data is loaded into your DW database, but it might not fall so dramatically.

Baseline

You will have to monitor your system and have a baseline that shows you the normal behaviour. Anything outside the norm is then worth an analysis.

5 Reasons You Must Start Capturing Baseline Data

Baselines are a good idea. Every DBA that I’ve spoken with seems to understand the value. So why is it that so many companies do not have baseline data? I think there are two factors at play. The first is the startup cost, and by cost I really mean time and effort. If you’re going to roll your own method for capturing baselines, you have to put some thought into it, answering the questions:

  • What data are you going to capture?
  • When are you going to capture data?
  • What methods will you use to get the data?
  • Where are you going to store it?
  • How are you going to report on it?
  • How are you going to manage the data long term?

Back to Basics: Capturing Baselines on Production SQL Servers

Once you know your baseline, you know what values for PLE are good for your DW SQL Server.
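
A minimal sketch of what capturing such a baseline could look like (the table name and the schedule are just examples; any monitoring tool that stores history does the same job):

```sql
-- Minimal baseline sketch: persist a PLE sample so the normal daily pattern
-- can be charted. Schedule the INSERT (e.g. every 5 minutes) with a
-- SQL Server Agent job; the table name is only an example.
CREATE TABLE dbo.PleBaseline
(
    captured_at datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
    ple_seconds bigint       NOT NULL
);
GO

INSERT INTO dbo.PleBaseline (ple_seconds)
SELECT cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE N'Page life expectancy%'
  AND [object_name] LIKE N'%Buffer Manager%';
```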