Sql-server – How to change aggregate function without duplicating SQL (by using SQL)

aggregatecteperformancequery-performancesql-server-2016stored-procedures

In SQL Server 2016 I have a scenario where data will be processed according to different aggregation functions in a large GROUP BY ROLLUP. I would like to have a stored procedure that has a parameter that specifies which aggregation function to use to describe the groupings in a way that does not risk SQL injection and takes advantage of compilation (it is a heavy stored procedure).

My thoughts are to use a collection of queries that summarize the data's groupings on a particular aggregate function. (e.g. agg.DataMin, agg.DataMedian, agg.DataWeightedAverage, and so on). Then use these with the parameter in a CTE

WITH AggData AS
(
   SELECT * FROM agg.DataMin WHERE @AggFunction = 1 
   UNION ALL
   SELECT * FROM agg.DataMedian WHERE @AggFunction = 2 
   UNION ALL
   SELECT * FROM agg.DataWeightedAverage WHERE @AggFunction = 3
)
SELECT ...

My concerns are query performance and industry best practice. The data table is of a reasonable size (2+ Gig). I will have to add many aggregate queries with some being inline table-valued functions for some leave-out aggregations.

In the above, will the queries/table-valued functions only execute when the @AggFunction matches the WHERE condition or will they all execute and filter after the results are returned? If the latter, is there a method to short-circuit the evaluation of the unneeded queries at run-time? Also, is there some standard method to perform this in SQL that I have overlooked?

Best Answer

Contradiction Detection could kick in to make sure only one of the statements is run, and in my simple test it did as long as there was a statement-level recompile hint, but why risk it? For example:

USE tempdb
GO

-- CREATE SCHEMA agg
--DROP TABLE agg.DataMin
--DROP TABLE agg.DataMedian
--DROP TABLE agg.DataWeightedAverage
--GO

CREATE TABLE agg.DataMin ( x INT PRIMARY KEY )
CREATE TABLE agg.DataMedian ( x INT PRIMARY KEY )
CREATE TABLE agg.DataWeightedAverage ( x INT PRIMARY KEY )
GO

INSERT INTO agg.DataMin ( x )
SELECT object_id FROM sys.all_objects

INSERT INTO agg.DataMedian ( x )
SELECT object_id FROM sys.all_objects WHERE type = 'P'

INSERT INTO agg.DataWeightedAverage ( x )
SELECT object_id FROM sys.all_objects WHERE type = 'X'
GO


-- Are there some situations when it wouldn't...
DECLARE @AggFunction INT = 1

;WITH AggData AS
(
   SELECT * FROM agg.DataMin WHERE @AggFunction = 1 
   UNION ALL
   SELECT * FROM agg.DataMedian WHERE @AggFunction = 2 
   UNION ALL
   SELECT * FROM agg.DataWeightedAverage WHERE @AggFunction = 3
)
SELECT *
FROM AggData
OPTION ( RECOMPILE )

My results:

In this simple example, only one table is scanned on the left with the recompile, and 3 tables are scanned on the right, without the recompile. The recompile hint allows the optimizer to "see" the parameter value and act accordingly. In a stored procedure where parameter sniffing would be used, a recompile would also be needed to get the same behaviour, either at statement or stored-proc level.

However I cannot say if there are no situations where contradiction detection would not occur; and you can't prove a negative. To put it another way, I cannot prove contradiction detection would always occur even with a recompile. There may be some unknown situations where even with a recompile it does not occur; excessive complexity springs to mind.

Also, there is no real advantage to using the CTE in your example, so why not keep it simple? You could just write some simple procedural SQL with IF...THEN...ELSE which would guarantee only one of your statements would fire, eg

DECLARE @AggFunction INT = 99

IF @AggFunction = 1
   SELECT * FROM agg.DataMin
ELSE IF @AggFunction = 2
   SELECT * FROM agg.DataMedian
ELSE IF @AggFunction = 3
   SELECT * FROM agg.DataWeightedAverage
ELSE
    RAISERROR( 'Unknown value for parameter @AggFunction (%i).', 16, 1, @AggFunction )

Add some parameter checking while you're at it. Hopefully this meets your requirements of guaranteeing only one statement is compiled when needed, is safe and hopefully simple to implement.

HTH

Related Solutions

Mysql – Stored Procedure – MySQL

In my opinion, a procedure should do one thing and one thing only. The logic should be performed in the application code. Secondly, a procedure should not return a result set. It should do work and then possibly return a status e.g. success, failure or something similar. Also, the name of your procedure is meaningless. Choose a descriptive name.

So, break it up into three procedures (insert, update and delete). Move the select statement into a view rather than a procedure. Let your programmers add the logic in their code to determine which one to call. Make sure you put index on the table.

Never do a select *. Only select the columns you need. It may be that you add columns later on but you will still SELECT *. Your SELECT * will then store all (incl unnecessary columns) columns in memory and, in worst case, run out of memory and be swapping to disk.

Mysql – Prepared statement execution with variable number of parameters to be bound

At least in this example, there's an easier approach than this. Remember that the optimizer is always trying to plan the query execution in a way that the path involving least amount of work necessary to retrieve a valid result set will be the path chosen.

SELECT * FROM t1 WHERE (col1 = 'foo') OR (1 = 1);

The server will always return all of the rows, because 1 = 1 contains all the truthiness required to conclude "yes, this row matches the WHERE clause"... and it will not bother scanning the values in col1 to see if they contain 'foo' because the optimizer has found a more obvious, simpler, "lower cost" approach... (any expression) OR (TRUE) is always true... and the shortest path to truth is the one the optimizer is supposed to find and take. It doesn't need to evaluate whether 1 = 1 row-by-row, since that's a constant expression.

You don't even need a prepared statement, just a query in the procedure:

SELECT id, bar 
  FROM table
 WHERE baz = mandatory
   AND (optional IS NULL OR field = optional)
   AND (optional2 IS NULL OR field2 = optional2);

If the program variable "optional" is null, the optimizer realizes that entire expression (in parentheses on that line) will be true for each row, no matter what, since, in the context of the running query, "optional" is a constant value, and optional is NULL is always going to be true, just like 1 = 1 is always true. The optimizer realizes that this also is a constant expression. The contents of "field" never need be evaluated, because the OR expression has been determined in advance to be true. The same goes for optional2.

On the flip side, if "optional" is not null, then optional IS NULL gets optimized away, since it's impossible for that expression to be true, and you are left with AND (field = optional). Same, again, for "optional2."

The optimizer is left, in each case, with query that can still use indexes as appropriate.

Since the query is in the procedure body, not a prepared statement, SQL injection is impossible since the server has no chance of blurring the distinction between data in variables and the literal query.

Best Answer

Related Solutions

Mysql – Stored Procedure – MySQL

Mysql – Prepared statement execution with variable number of parameters to be bound

Related Question