SQL Server – Ensuring Consistent Query Performance Over Time

optimization, sql-server, statistics

We're running an intensive application load (thousands of operations per second) against a SQL Server database with quite a bit of data. Some tables have billions of rows, and several of them see heavy insert and update activity.

Database performance is generally fine, but periodically we hit query performance problems: rather simple queries that previously ran just fine suddenly take 10-100x as long.

This appears to be related to table/index statistics and the query optimiser: most of the time a statistics update fixes the problem, but other times an update of the statistics makes the situation worse (re-running the statistics update then typically resolves it eventually).

What appears to be happening is that the optimiser suddenly decides to use objectively wrong indexes for some queries, after having used the correct ones for days or weeks.

My question is: Why is this happening and what can we do about it?

This database has been running for years with basically the same load, pretty much the same queries, and the same volume of updates. For 99.995% of the queries there should be no reason to ever choose a different index strategy over time, regardless of the input (and, indeed, doing so demonstrably destroys query performance).

As indicated above, automatically updating statistics on a schedule often causes serious problems: if the statistics sample comes out skewed (which appears to happen at least 5% of the time), we end up in a world of pain.

Is there any way to tell SQL Server (for certain tables) that the statistics histogram and density will not change over time, so that it just keeps using the same query plan for queries involving those tables? If not, how can we ensure a predictable result from statistics updates over time (avoiding the skewed-statistics problem described above)?
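To make the second part concrete, the kind of statistics maintenance in question looks roughly like the statements below (table and statistics names are placeholders, not our real schema). My assumption is that forcing a full scan, or pinning an explicit sample rate, would at least make the resulting histogram deterministic, though a full scan is obviously expensive on billion-row tables:

```sql
-- Placeholder object names (dbo.BigTable / IX_BigTable_CustomerId are invented).

-- Default sampled update: the sample, and hence the histogram, can differ from run to run.
UPDATE STATISTICS dbo.BigTable IX_BigTable_CustomerId;

-- Deterministic but expensive on very large tables: read every row.
UPDATE STATISTICS dbo.BigTable IX_BigTable_CustomerId WITH FULLSCAN;

-- SQL Server 2016 SP1 CU4 and later: pin an explicit sample rate so that
-- subsequent automatic/maintenance updates reuse it.
UPDATE STATISTICS dbo.BigTable IX_BigTable_CustomerId
    WITH SAMPLE 10 PERCENT, PERSIST_SAMPLE_PERCENT = ON;
```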

No stored procedures are involved. We do have control over the SQL, so it can potentially be changed, but it's a LOT of code, so it would be unfortunate if we had to change every single query (e.g. by adding an additional clause).

A follow-up question: parameter sniffing appears to be relevant only for stored procedures; is that correct?

Best Answer

I suggest you first determine whether it's the stats or parameter sniffing that is hurting you.

Regardless of the above, I suggest you read Erland Sommarskog's article on the subject.

What to do about it is difficult to say. We don't know whether it is stats or sniffing.

But possibly adding OPTIMIZE FOR can be "the" solution. It is cheaper than RECOMPILE since you don't take the plan compilation hit on each execution, and it gives you predictability. This, of course, assumes you aren't in the situation where the stats differ so much that the same parameter input yields different plans for statistics reasons.
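For illustration only (the table, column and value 42 below are made up; in your real code the variable would be a parameter passed via sp_executesql, and you would pick a value representative of the actual workload):

```sql
-- Hypothetical objects (dbo.Orders / CustomerId); 42 stands in for a "typical" value.
DECLARE @CustomerId int = 42;

-- OPTIMIZE FOR <value>: build the plan using the estimate for 42 on every
-- compilation, whatever value actually arrives at run time.
SELECT OrderId, OrderDate, Amount
FROM   dbo.Orders
WHERE  CustomerId = @CustomerId
OPTION (OPTIMIZE FOR (@CustomerId = 42));

-- OPTIMIZE FOR UNKNOWN: ignore the specific value and optimize for the
-- average density instead.
SELECT OrderId, OrderDate, Amount
FROM   dbo.Orders
WHERE  CustomerId = @CustomerId
OPTION (OPTIMIZE FOR (@CustomerId UNKNOWN));
```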

Try to identify one query and see whether you have one or many plans for it. Test with OPTIMIZE FOR and/or RECOMPILE. The one "global" option you have at database scope is to disable parameter sniffing for the whole database; this makes the optimizer optimize as if it had no clue about the parameter value. All this and more is covered in Erland's article.
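As a rough starting point for that, the sketch below uses only standard DMVs to list statements that currently have more than one plan in cache, and shows the database-wide switch. The switch assumes SQL Server 2016 or later; older versions have traditionally needed trace flag 4136 for the same effect.

```sql
-- Query shapes whose text currently has more than one distinct plan in the
-- plan cache; a likely (but not certain) sign of sniffing-induced plan variation.
SELECT  qs.query_hash,
        COUNT(DISTINCT qs.query_plan_hash) AS distinct_plans,
        SUM(qs.execution_count)            AS total_executions
FROM    sys.dm_exec_query_stats AS qs
GROUP BY qs.query_hash
HAVING  COUNT(DISTINCT qs.query_plan_hash) > 1
ORDER BY distinct_plans DESC;

-- The database-wide option: optimize every query as if the parameter value
-- were unknown (average density), i.e. disable sniffing for this database.
ALTER DATABASE SCOPED CONFIGURATION SET PARAMETER_SNIFFING = OFF;
```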

Parameter sniffing does not only apply to stored procedures. It also applies to parameterized SQL (typically executed via sp_executesql), which is likely far more common nowadays than stored procedures.
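For example (invented object names), a batch like the one below is parameterized, so the plan is compiled for the first sniffed @CustomerId value and then reused for later calls with completely different values, exactly as with a stored procedure parameter:

```sql
-- Parameterized SQL, roughly what a client driver sends under the covers.
-- The plan is compiled using the first sniffed value (here 42) and cached;
-- subsequent executions with very different values reuse that same plan.
EXEC sys.sp_executesql
     N'SELECT OrderId, OrderDate, Amount
       FROM dbo.Orders
       WHERE CustomerId = @CustomerId;',
     N'@CustomerId int',
     @CustomerId = 42;
```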