Sql-server – Does SQL Server Cache Aggregate Results When Duplicated Across Columns

sql-server sql-server-2012

Suppose we have a table Orders containing the columns order_id, total, and discount.

Then we write a query similar to the following:

SELECT
    COUNT(order_id) AS num_orders
    , SUM(total) / COUNT(order_id) as avg_total
    , SUM(discount) / COUNT(order_id) AS avg_discount
FROM Orders

Is the value of COUNT(order_id) preserved across columns, or is it recomputed for each one (i.e., is there a performance hit)? Or is it better to compute the value first and use it in the query, for example:

DECLARE @order_count AS INT 

SELECT 
    @order_count = COUNT(order_id) 
FROM Orders

SELECT
    @order_count AS num_orders
    , SUM(total) / @order_count as avg_total
    , SUM(discount) / @order_count AS avg_discount
FROM Orders

Note that while writing this question I noticed that AVG() is supported since SQL Server 2008. However, I am keeping the question and intend it to be more general: I want to understand how SQL Server handles identical aggregates across columns, since I sometimes run into this in other forms.

Best Answer

SQL Server calculates the COUNT only once. You can see this by looking at the properties of the execution plan for:

create table Orders(order_id int, total int, discount int)


SELECT
    COUNT(order_id) AS num_orders
    , SUM(total) / COUNT(order_id) as avg_total
    , SUM(discount) / COUNT(order_id) AS avg_discount
FROM Orders

The stream aggregate (1) has the following defined values:

[Expr1008] = Scalar Operator(COUNT([tempdb].[dbo].[Orders].[order_id])), 
[Expr1009] = Scalar Operator(COUNT_BIG([tempdb].[dbo].[Orders].[total])), 
[Expr1010] = Scalar Operator(SUM([tempdb].[dbo].[Orders].[total])), 
[Expr1011] = Scalar Operator(COUNT_BIG([tempdb].[dbo].[Orders].[discount])), 
[Expr1012] = Scalar Operator(SUM([tempdb].[dbo].[Orders].[discount]))

Expr1008 is the calculation of the COUNT that you ask about.

There are also COUNT_BIG aggregates for the other two columns. These are needed because SUM(total), for example, must return NULL when COUNT(total) is 0 (that is, when there are no non-NULL values to sum).

This check is carried out by the next compute scalar along (2). It also converts the COUNT result (Expr1008) from bigint to int and labels that as Expr1003:

[Expr1003] = Scalar Operator(CONVERT_IMPLICIT(int,[Expr1008],0)),
[Expr1004] = Scalar Operator(CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END), 
[Expr1005] = Scalar Operator(CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END)

Finally, the leftmost compute scalar (3) uses Expr1003 in the division operation...

[Expr1006] = Scalar Operator([Expr1004]/[Expr1003]), 
[Expr1007] = Scalar Operator([Expr1005]/[Expr1003])

... and outputs columns Expr1003, Expr1006, Expr1007 as the final result
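One way to reproduce these defined values without the graphical plan is to request a text plan. The sketch below assumes the Orders table from the example exists in the current database and that the script is run from a client that treats GO as a batch separator (such as SSMS or sqlcmd). The expression labels (Expr1008 and so on) may differ on your system.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO

-- While SHOWPLAN_TEXT is ON the query is compiled but not executed;
-- the estimated plan text is returned instead of the result set.
SELECT
    COUNT(order_id) AS num_orders
    , SUM(total) / COUNT(order_id) AS avg_total
    , SUM(discount) / COUNT(order_id) AS avg_discount
FROM Orders;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

In the output, the Stream Aggregate line carries a DEFINE list; if the count is shared, it should contain a single COUNT over order_id that both division expressions refer to.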

PS: AVG has been supported for much longer than SQL Server 2008; I imagine it has been available from the beginning. However, it does not have the same semantics as your rewrite in the presence of NULLs anyway.

I assume order_id is the primary key and therefore not nullable. But for a table with 10 orders, of which only two have non-NULL total values (2 and 4), AVG(total) would be 3, whereas SUM(total) / COUNT(order_id) would be 0.6 (or 0 once integer division is taken into account).
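The difference can be checked with a small script. This is a minimal sketch: the temporary table #OrdersDemo and the column aliases are hypothetical names chosen for illustration, and the data mirrors the ten-order example above.

```sql
-- Hypothetical demo table: ten orders, only two of which have a non-NULL total.
CREATE TABLE #OrdersDemo (order_id int NOT NULL PRIMARY KEY, total int NULL);

INSERT INTO #OrdersDemo (order_id, total)
VALUES (1, 2), (2, 4), (3, NULL), (4, NULL), (5, NULL),
       (6, NULL), (7, NULL), (8, NULL), (9, NULL), (10, NULL);

SELECT
    AVG(total) AS avg_ignoring_nulls                          -- averages only the two non-NULL rows: 3
    , SUM(total) / COUNT(order_id) AS sum_over_row_count      -- integer division, 6 / 10: 0
    , SUM(total) / COUNT(total) AS sum_over_nonnull_count     -- divides by the non-NULL count, 6 / 2: 3
    , SUM(total) * 1.0 / COUNT(order_id) AS decimal_ratio     -- multiplying by 1.0 avoids integer division
FROM #OrdersDemo;

DROP TABLE #OrdersDemo;
```

Dividing by COUNT(total) rather than COUNT(order_id) is what brings the hand-written version in line with AVG, because COUNT(column) skips NULLs in the same way AVG does.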

Related Solutions

Sql-server – Does SQL Server cache data results for DAO

While I would usually leave an answer this short as a comment, this seems to be worthy of breaking with good practice.

No. Nonsense. Absolutely not.

The answer that @gbn gave to the question you reference is valid, regardless of the method of query. Stored procedure or adhoc, the same applies... Query results are not cached.

However, the source table and index data and metadata will be cached after the first use (subject to continued use, load, and memory pressure).

That is, the results of a query are evaluated on every execution, but the table(s) (and any indexes, etc.) used by the query will most likely already be in memory.

The "some sort of caching for DAO" is client/API behaviour, irrelevant and unbeknownst to SQL Server.
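To see that it is data pages, not query results, that SQL Server keeps in memory, the buffer pool can be inspected. The query below is a rough sketch, assuming you have VIEW SERVER STATE permission and are connected to the database of interest; the column aliases are illustrative.

```sql
-- Counts the pages of the current database that are currently held in the buffer pool.
SELECT
    COUNT(*) AS cached_pages
    , COUNT(*) * 8 / 1024 AS cached_mb   -- each page is 8 KB
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID();
```

Running it before and after a query that touches a cold table should show the page count rise, while the query itself is still evaluated on every execution.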

Sql-server – Design of an application log database

I did the following

CREATE TABLE L(
Time_Series_TS TIMESTAMP, 
Channel VARCHAR(10), 
Operation VARCHAR(10), 
Function VARCHAR(10), 
Duration INT);

Then

INSERT INTO L VALUES('2014-06-10 09:00:03.457', 'Channel1', 'Operation3', 'Function15', 15);
INSERT INTO L VALUES('2014-06-10 09:00:08.245', 'Channel2', 'Operation5', 'Function10', 22);
INSERT INTO L VALUES('2014-06-10 09:00:22.005', 'Channel1', 'Operation3', 'Function15', 48);
INSERT INTO L VALUES('2014-06-10 09:01:03.457', 'Channel2', 'Operation3', 'Function15', 296);
INSERT INTO L VALUES('2014-06-10 09:01:08.245', 'Channel2', 'Operation5', 'Function10', 225);
INSERT INTO L VALUES('2014-06-10 09:01:22.005', 'Channel1', 'Operation3', 'Function15', 7);
INSERT INTO L VALUES('2014-06-10 09:01:16.245', 'Channel2', 'Operation5', 'Function10', 10);
INSERT INTO L VALUES('2014-06-10 09:01:47.005', 'Channel1', 'Operation3', 'Function15', 20);

I added a few records to your sample for checking. Then ran this query

SELECT MINUTE(Time_Series_TS) AS Minute, Channel, Operation, Function, 
COUNT(*) AS "Count/min", SUM(Duration) AS Duration 
FROM L
GROUP BY Minute, Channel, Operation, Function
ORDER By Minute, Channel, Operation, Function;

Which gave

+--------+----------+------------+------------+-----------+----------+
| Minute | Channel  | Operation  | Function   | Count/min | Duration |
+--------+----------+------------+------------+-----------+----------+
|      0 | Channel1 | Operation3 | Function15 |         2 |       63 |
|      0 | Channel2 | Operation5 | Function10 |         1 |       22 |
|      1 | Channel1 | Operation3 | Function15 |         2 |       27 |
|      1 | Channel2 | Operation3 | Function15 |         1 |      296 |
|      1 | Channel2 | Operation5 | Function10 |         2 |      235 |
+--------+----------+------------+------------+-----------+----------+

This appears to be the result you want (note 63 as the first duration, as per my earlier comment). You can also use HOUR(), DAYOFMONTH(), and even YEAR() to aggregate over those units with this query.

For performance, I did create an index

CREATE INDEX L_Index ON L(Channel, Operation, Function) using BTREE;

and ran EXPLAIN on the query before and after creating it, but there was no difference. This is hardly a surprise, since the optimizer probably decided that there is no point in using an index for such a small table. Obviously, I can't test with your data, but there are a couple of points. If you perform this operation over a large number of records with a large number of fields, you may run into issues, and if you create many indexes, your insert performance will decrease. Is it possible to categorise your data in some way to reduce the number of fields, i.e. split your big table into ones with fewer fields? Check out different scenarios, test, and see what happens with your data, your queries, your application, and your hardware.

[EDIT]

For something more human readable, you might like to try something like

SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS) - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,
..
..

for your first field.

[EDIT - Response to UPDATE-1]

OK, so in my schema, you are indexing by (Minute, Channel, Operation, Function)? See the MySQL documentation on composite indexes. If your queries have a predominantly left-to-right orientation, i.e. you [always | usually] query Channel first, then Operation, then Function, you could try an index on Minute + (the usual three). If it's fairly arbitrary, you could try using 6 indexes, but this will hit insert performance. How much, I can't say, but if this is a DW-type app that performs the analysis, you can batch the inserts and only occasionally take the hit.

You'll have to do a few tests and EXPLAIN your queries with realistic sample data; as I said earlier, with just a few records the optimiser ignores indexes because the table is too small. Interestingly, the MySQL documentation page mentioned above describes a hashing strategy that looks interesting: take MD5 hashes of CONCAT(Your_Column_List_Here).

One other thing I can suggest concerns the expression

SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS) - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,...

Just remove the TIME() function and you'll be storing INTs, which appears to be better than indexes on DATETIMEs - see here for a benchmark. Also, as previously mentioned, you should move your data off Production and perform the OLAP/DW work on another machine. You could also test out the InfiniDB solution that I suggested; it's drop-in compatible with MySQL (no learning curve). Then there are all the NoSQL solutions - we could be here all day :-). Take a look at a few scenarios, evaluate and test, and then choose what best fits your budget and requirements.

Forgot: make your OLAP/DW system read-only for performing queries, so there is no transactional overhead! Make the OLAP/DW tables MyISAM? This last one is controversial - again, test and see.
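As a rough illustration of grouping on an integer minute bucket, the earlier query against table L could be written along these lines in MySQL. The aliases minute_start, count_per_min, and total_duration are hypothetical, and the backticks around Function are a precaution, since FUNCTION is treated as a reserved word in newer MySQL versions.

```sql
-- MySQL sketch: bucket rows by the epoch second at which each minute starts (an integer).
SELECT
    UNIX_TIMESTAMP(Time_Series_TS)
        - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60) AS minute_start
    , Channel
    , Operation
    , `Function`
    , COUNT(*) AS count_per_min
    , SUM(Duration) AS total_duration
FROM L
GROUP BY minute_start, Channel, Operation, `Function`
ORDER BY minute_start, Channel, Operation, `Function`;
```

Unlike MINUTE(), which folds every hour and day onto the values 0 to 59, this bucket stays unique across hours and days, so the same query works over longer time ranges. Whether it is actually faster on your data is something to confirm with EXPLAIN and realistic volumes.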
