SQL Server Monitoring – Key KPIs to Track

Tags: monitoring, sql-server-2012, t-sql

I built a custom monitoring tool that tracks information from the DMVs and stores it in a small data warehouse; I'm taking snapshots every 5 minutes.

The data I'm retrieving is:

Database:

  • File latency for data and log files (mdf and ldf); see the collection sketch after this list
  • File sizes and space used
  • Transactions
  • Log Growths and Flushes
  • Database CPU used
  • Database Buffer MB used
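
As an illustration, here is a minimal sketch of how the file-level numbers can be pulled from the DMVs; the snapshot table name and layout are placeholders, not my real schema:

```sql
-- Sketch: snapshot per-file latency and size on each collection cycle.
-- dbo.FileStatsSnapshot is a placeholder table name.
CREATE TABLE dbo.FileStatsSnapshot
(
    CollectionTime  datetime2(0)  NOT NULL,
    DatabaseName    sysname       NOT NULL,
    FileName        sysname       NOT NULL,
    FileType        nvarchar(60)  NOT NULL,   -- ROWS (mdf/ndf) or LOG (ldf)
    SizeMB          bigint        NOT NULL,
    ReadStallMs     bigint        NOT NULL,   -- cumulative since startup
    WriteStallMs    bigint        NOT NULL    -- cumulative since startup
);

INSERT INTO dbo.FileStatsSnapshot
SELECT  SYSDATETIME(),
        DB_NAME(vfs.database_id),
        mf.name,
        mf.type_desc,
        mf.size / 128,                        -- 8 KB pages to MB
        vfs.io_stall_read_ms,
        vfs.io_stall_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id;
```

The io_stall figures are cumulative since startup, so the useful number is the delta between two snapshots (optionally divided by the delta in num_of_reads / num_of_writes to get an average latency per I/O).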

SERVER:

  • SQL CPU / IDLE PROCESS / OTHER PROCESSES
  • RAM (TARGET/USED)
  • Page Life Expectancy (see the counter sketch after this list)
  • Cache Hit Ratio
  • User Connections / Database Connections
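
Most of these server-level figures come straight out of sys.dm_os_performance_counters. A sketch for a default instance (named instances report object_name with an MSSQL$InstanceName: prefix instead of SQLServer:):

```sql
-- Sketch: server-level memory and connection counters for a default instance.
SELECT  [object_name],
        counter_name,
        cntr_value
FROM    sys.dm_os_performance_counters
WHERE   (object_name LIKE '%Buffer Manager%'
         AND counter_name IN ('Page life expectancy', 'Buffer cache hit ratio',
                              'Buffer cache hit ratio base'))
   OR   (object_name LIKE '%Memory Manager%'
         AND counter_name IN ('Target Server Memory (KB)', 'Total Server Memory (KB)'))
   OR   (object_name LIKE '%General Statistics%'
         AND counter_name = 'User Connections');
```

Buffer cache hit ratio has to be divided by its matching "base" counter to get a percentage; page life expectancy and the memory counters can be stored as-is.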

LOCKWAITS:

  • General lock, deadlock, and wait information (network, I/O, latch, …); see the wait-stats sketch below
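
A sketch of the waits side: sys.dm_os_wait_stats is cumulative since the last restart, so like the file stats it only becomes meaningful as a delta between snapshots. The list of benign wait types filtered out below is a common but deliberately incomplete example:

```sql
-- Sketch: top waits since startup, excluding a few benign system waits.
-- Like the file stats, these are cumulative, so diff them between snapshots.
SELECT TOP (20)
        wait_type,
        waiting_tasks_count,
        wait_time_ms,
        signal_wait_time_ms                 -- time spent waiting for CPU after the resource was ready
FROM    sys.dm_os_wait_stats
WHERE   wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                          'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH',
                          'XE_TIMER_EVENT', 'LOGMGR_QUEUE', 'DIRTY_PAGE_POLL', 'WAITFOR')
ORDER BY wait_time_ms DESC;
```

Deadlocks themselves don't show up here; the system_health Extended Events session captures deadlock graphs by default and is the usual place to read them from.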

QUERIES:

In this part I'm a bit lost. I would like to be able to detect problematic queries on my server, and to do that I need to store historical information, because some of the DMV data is cumulative; if I want averages (for example of execution_count), I need to snapshot it.
I don't know which key KPIs and thresholds a query should reach to be considered a problematic query.
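
The pattern I have in mind is roughly the sketch below (dbo.QueryStatsSnapshot is a placeholder table): snapshot sys.dm_exec_query_stats on each collection, then diff consecutive snapshots to get per-interval execution counts and CPU per query fingerprint.

```sql
-- Sketch: snapshot the cumulative query stats, then diff consecutive snapshots.
-- dbo.QueryStatsSnapshot is a placeholder name, not an existing table.
CREATE TABLE dbo.QueryStatsSnapshot
(
    CollectionTime          datetime2(0)   NOT NULL,
    query_hash              binary(8)      NOT NULL,
    plan_handle             varbinary(64)  NOT NULL,
    statement_start_offset  int            NOT NULL,
    execution_count         bigint         NOT NULL,
    total_worker_time       bigint         NOT NULL,   -- microseconds
    total_logical_reads     bigint         NOT NULL
);

INSERT INTO dbo.QueryStatsSnapshot
SELECT  SYSDATETIME(), qs.query_hash, qs.plan_handle, qs.statement_start_offset,
        qs.execution_count, qs.total_worker_time, qs.total_logical_reads
FROM    sys.dm_exec_query_stats AS qs;

-- Delta between the two most recent snapshots, rolled up per query fingerprint:
WITH ranked AS
(
    SELECT  *,
            DENSE_RANK() OVER (ORDER BY CollectionTime DESC) AS snap_no
    FROM    dbo.QueryStatsSnapshot
)
SELECT  cur.query_hash,
        SUM(cur.execution_count   - ISNULL(prev.execution_count, 0))   AS executions_in_interval,
        SUM(cur.total_worker_time - ISNULL(prev.total_worker_time, 0)) AS cpu_us_in_interval
FROM    ranked AS cur
LEFT JOIN ranked AS prev
       ON  prev.snap_no = 2
       AND prev.plan_handle            = cur.plan_handle
       AND prev.statement_start_offset = cur.statement_start_offset
WHERE   cur.snap_no = 1
GROUP BY cur.query_hash
ORDER BY cpu_us_in_interval DESC;
```

One caveat: when a plan is evicted from the cache and later recompiled, it comes back under a new plan_handle with counters starting from zero, so deltas around recompiles need a sanity check.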

Also, I don't know whether I should monitor more KPIs in order to detect CPU pressure, RAM pressure, or disk pressure.

Any help on improving it will be appreciated!

Best Answer

If you don't know, then you're not ready to write a tool of this nature. To echo the comments of others, this is why people buy a tool that does it better than we can, and the prices are very reasonable even for small shops with a few servers.

The only people writing decent tools for themselves are highly skilled professionals, sometimes working in larger shops where they can't get management to swallow a six-figure monitoring cost.

If you want to monitor query performance, you need to read up on query fingerprints and fully read through some of Grant Fritchey's books on how plans are cached. It's hard going, and even he makes mistakes; it's MCM-level material.
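
For what it's worth, the fingerprint idea maps onto two columns in sys.dm_exec_query_stats: query_hash (a hash of the normalised statement text) and query_plan_hash (a hash of the plan shape). A minimal sketch of grouping on them:

```sql
-- Sketch: aggregate cached statements by fingerprint (query_hash).
-- Several distinct plan shapes under one query_hash usually means the
-- "same" query is compiling into different plans.
SELECT  qs.query_hash,
        COUNT(DISTINCT qs.query_plan_hash) AS distinct_plan_shapes,
        SUM(qs.execution_count)            AS total_executions,
        SUM(qs.total_worker_time)          AS total_cpu_us
FROM    sys.dm_exec_query_stats AS qs
GROUP BY qs.query_hash
ORDER BY total_cpu_us DESC;
```

Several distinct plan shapes under one fingerprint is usually the first hint that the same logical query is compiling into different plans, which is where the plan-caching reading pays off.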

After that you'll understand a little, but even the paid tools won't point out when something has gone wrong. What they do point out are the top queries (by CPU, disk, or memory), unusual memory grants, or missing indexes. You can collect and store these easily with scripts from the internet, but they aren't alertable thresholds. They never are.
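
For example, the classic "top statements by CPU with their text" query that those tools reimplement looks roughly like this sketch; it's a point-in-time ranking, not an alertable threshold:

```sql
-- Sketch: top cached statements by total CPU, with the statement text.
SELECT TOP (10)
        qs.total_worker_time / 1000   AS total_cpu_ms,
        qs.execution_count,
        qs.total_logical_reads,
        SUBSTRING(st.text,
                  (qs.statement_start_offset / 2) + 1,
                  ((CASE qs.statement_end_offset
                         WHEN -1 THEN DATALENGTH(st.text)
                         ELSE qs.statement_end_offset
                    END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM    sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```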

On your own you might also be able to pinpoint excessively different plan recompilations based on fingerprints and execution details, but that's probably not worth it; otherwise the commercial tools would do it. It's probably too noisy.

There's another book called Healthy SQL which covers a lot of ideas and thresholds, but it gets hazy.

Another great resource is to download the Performance Analysis of Logs (PAL) tool, which has a list of SQL Server counters (along with disk I/O counters and similar things) and defines in XML exactly how it calculates its thresholds and why. These are useful overall, but again I'm not sure I'd alert on them except as a daily report.

I think some company has also published a semi-marketing white paper on why thresholds are pointless and why what you really need are baselines (with statistical analysis on standard deviations, though I'm pretty sure they don't use it for much themselves). But IMHO nobody is sharing easy-to-understand ways to analyse and flag divergent raw baseline data between snapshots. R looks like a promising place to do it, but there's no "how".
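
To make the baseline idea concrete, here's a crude sketch directly in T-SQL against a hypothetical snapshot table (dbo.MetricSnapshot, with made-up columns CollectionTime, MetricName, Value): flag the latest sample when it falls more than two standard deviations from the trailing average. Whether two standard deviations is the right fence is exactly the part nobody has published well.

```sql
-- Hypothetical snapshot table the baseline check runs against.
CREATE TABLE dbo.MetricSnapshot
(
    CollectionTime  datetime2(0) NOT NULL,
    MetricName      sysname      NOT NULL,
    Value           float        NOT NULL
);

-- Flag the newest sample of each metric when it sits more than two standard
-- deviations away from its trailing 7-day baseline.
WITH baseline AS
(
    SELECT  MetricName,
            AVG(Value)   AS avg_value,
            STDEV(Value) AS stdev_value
    FROM    dbo.MetricSnapshot
    WHERE   CollectionTime >= DATEADD(DAY, -7, SYSDATETIME())
    GROUP BY MetricName
),
latest AS
(
    SELECT  MetricName, Value,
            ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY CollectionTime DESC) AS rn
    FROM    dbo.MetricSnapshot
)
SELECT  l.MetricName, l.Value, b.avg_value, b.stdev_value
FROM    latest AS l
JOIN    baseline AS b
        ON b.MetricName = l.MetricName
WHERE   l.rn = 1
  AND   b.stdev_value > 0
  AND   ABS(l.Value - b.avg_value) > 2 * b.stdev_value;
```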