SQL Server – Performance considerations for caching aggregate counts

aggregate, cache, sql-server, sql-server-2012

We have an InventoryActivity table that holds transactional changes to item quantity:

CREATE TABLE dbo.InventoryActivity(
    InventoryActivity_uid int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Organization_uid int NOT NULL,
    MasterInventory_uid int NOT NULL,
    AdjustmentType_cd varchar(20) NULL,
    AdjustmentReason_cd varchar(20) NULL,
    Quantity int NULL
)
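
For illustration, here are a few hypothetical activity rows (the adjustment codes and values are made up; the real sample data is in the fiddle linked below):

INSERT INTO dbo.InventoryActivity
    (Organization_uid, MasterInventory_uid, AdjustmentType_cd, AdjustmentReason_cd, Quantity)
VALUES
    (1, 100, 'RECEIVE', 'PURCHASE',  50),  -- stock received
    (1, 100, 'ADJUST',  'SALE',     -10),  -- each change is a signed delta
    (1, 200, 'RECEIVE', 'PURCHASE',  25),
    (2, 100, 'RECEIVE', 'TRANSFER',   5);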

We would like an InventorySummary that aggregates to the current quantity for each organization/item combination. The summary counts should always be derivable from the sum of the transactional records, but we have several different approaches for calculating the aggregate counts:

  1. Stored Procedure
  2. Separate Summary Table
  3. Indexed View

What performance considerations should tip the scales in favor of a particular strategy?
What best practices exist? (I know "best practices" borders on discussion, but I want to know what considerations would come into play to help make the decision.)

Here's a fiddle with some sample code and data.

  1. Stored Procedure

    The simplest option is to just perform the SUM operation fresh every time, but it involves no caching and would likely cause performance issues over time.

    CREATE PROCEDURE dbo.GetInventorySummary 
    AS   
    SELECT Organization_uid,
           MasterInventory_uid,
           SUM(Quantity) AS Quantity
    FROM dbo.InventoryActivity
    GROUP BY Organization_uid, MasterInventory_uid
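
    Even if this option is chosen, the repeated SUM can be made cheaper with a narrow supporting index so the aggregate scans an index rather than the whole table. This is only a sketch (the index name is made up), not part of the original question:

    -- Covering index so the GROUP BY/SUM reads a narrow index instead of the base table
    CREATE INDEX IX_InventoryActivity_Summary
        ON dbo.InventoryActivity (Organization_uid, MasterInventory_uid)
        INCLUDE (Quantity);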
    
  2. Separate Table

    We could create a table to store the current quantities. The plus side is that fetching this data would be trivial. The downside is that we would have to maintain it manually and keep the records in sync every time we write to the InventoryActivity table.

    CREATE TABLE dbo.InventorySummary(
        Organization_uid int NOT NULL,
        MasterInventory_uid int NOT NULL,
        Quantity int NOT NULL,
        PRIMARY KEY (Organization_uid, MasterInventory_uid)
    )
    

    Triggers could help alleviate some of that maintenance.

    CREATE TRIGGER dbo.InventoryActivity_I
    ON dbo.InventoryActivity
    AFTER INSERT   -- covers inserts only; updates and deletes would need similar handling
    AS 
        SET NOCOUNT ON;

        -- Aggregate the inserted rows once so multi-row inserts are applied correctly
        CREATE TABLE #InsertSummaryTemp
        (
          Organization_uid int,
          MasterInventory_uid int,
          Quantity int
        )
        INSERT INTO #InsertSummaryTemp
        SELECT Organization_uid,
               MasterInventory_uid,
               SUM(Quantity) AS Quantity
        FROM INSERTED
        GROUP BY Organization_uid, MasterInventory_uid
    
        -- UPDATE EXISTING RECORDS
        UPDATE InventorySummary
        SET Quantity = s.Quantity + i.Quantity
        FROM InventorySummary s
        JOIN #InsertSummaryTemp i ON s.Organization_uid = i.Organization_uid AND 
                                     s.MasterInventory_uid = i.MasterInventory_uid
    
        -- INSERT NEW RECORDS
        INSERT INTO InventorySummary
            (Organization_uid, MasterInventory_uid, Quantity) 
        SELECT i.Organization_uid, i.MasterInventory_uid, i.Quantity
        FROM #InsertSummaryTemp i
        LEFT JOIN InventorySummary s ON i.Organization_uid = s.Organization_uid AND 
                                        i.MasterInventory_uid = s.MasterInventory_uid
        WHERE s.MasterInventory_uid IS NULL
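
    The trigger above only covers inserts. As a hedged sketch (the trigger name is hypothetical and this is not from the original question), updates and deletes could be folded into the same summary by treating old and new rows as signed deltas:

    CREATE TRIGGER dbo.InventoryActivity_UD
    ON dbo.InventoryActivity
    AFTER UPDATE, DELETE
    AS 
        SET NOCOUNT ON;

        -- Net change per key: new values minus old values
        -- (for a DELETE, inserted is empty, so only the negated old values remain)
        WITH delta AS
        (
            SELECT Organization_uid, MasterInventory_uid, SUM(Quantity) AS Quantity
            FROM
            (
                SELECT Organization_uid, MasterInventory_uid, ISNULL(Quantity, 0) AS Quantity
                FROM inserted
                UNION ALL
                SELECT Organization_uid, MasterInventory_uid, -ISNULL(Quantity, 0)
                FROM deleted
            ) AS changes
            GROUP BY Organization_uid, MasterInventory_uid
        )
        MERGE dbo.InventorySummary AS s
        USING delta AS d
           ON s.Organization_uid = d.Organization_uid
          AND s.MasterInventory_uid = d.MasterInventory_uid
        WHEN MATCHED THEN
            UPDATE SET s.Quantity = s.Quantity + d.Quantity
        WHEN NOT MATCHED THEN
            INSERT (Organization_uid, MasterInventory_uid, Quantity)
            VALUES (d.Organization_uid, d.MasterInventory_uid, d.Quantity);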
    
  3. Indexed View

    Borrowed from this and this, we could create an indexed view, which lowers the maintenance cost associated with option 2. However, there is a performance concern: we are still aggregating all records across the entire history, as opposed to doing a simple read from a table.

    CREATE VIEW dbo.InventorySummaryView
    WITH SCHEMABINDING
    AS
        SELECT
          Organization_uid,
          MasterInventory_uid,
          SUM(ISNULL(Quantity, 0)) AS Quantity,   -- indexed views do not allow SUM over a nullable expression
          COUNT_BIG(*) AS Count
        FROM dbo.InventoryActivity
        GROUP BY Organization_uid, MasterInventory_uid
    GO
    
    CREATE UNIQUE CLUSTERED INDEX PK_InventorySummaryView ON dbo.InventorySummaryView
    (
          Organization_uid,
          MasterInventory_uid
    )
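
    One read-side detail worth knowing (not in the original question): on SQL Server 2012, automatic matching of queries to indexed views is an Enterprise edition feature, so on other editions a read would reference the view with the NOEXPAND hint to get the cheap pre-aggregated lookup:

    SELECT Organization_uid,
           MasterInventory_uid,
           Quantity
    FROM dbo.InventorySummaryView WITH (NOEXPAND);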
    

Best Answer

What performance considerations should tip the scales in favor of a particular strategy?

Essentially, like any form of cache, these strategies let you increase read performance at the expense of writes and disk space.

So, you should consider the following:

  1. The expected number of writes/updates vs. number of reads.
  2. What is more important: fast writes or fast reads.
  3. Whether extra disk space is available and how "cheap" it is.
  4. Whether the summary always has to be up to date, or whether it can be delayed.

The first strategy is a baseline. You don't duplicate the data, writes are as fast as they can be.

The last strategy (indexed view) slows down writes immediately (as they happen), but the aggregated amounts are always up to date.

The second strategy with an explicit summary table gives you greater control over when to perform the aggregation. You can delay the aggregation and perform it when the server is under less load. If you accumulate a bunch of pending changes for calculating the summary and perform these calculations in bulk, it may be more efficient than updating the summary after every single change of the source data.
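
As a rough sketch of that delayed, bulk-refresh variant (the procedure name and scheduling are hypothetical, not something the asker specified), the summary table could simply be rebuilt from the activity table during a quiet window, for example from a nightly SQL Agent job:

CREATE PROCEDURE dbo.RefreshInventorySummary
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    -- Rebuild the cache wholesale; cheaper per row than per-change maintenance,
    -- at the cost of the summary being stale between refreshes
    DELETE FROM dbo.InventorySummary;

    INSERT INTO dbo.InventorySummary (Organization_uid, MasterInventory_uid, Quantity)
    SELECT Organization_uid,
           MasterInventory_uid,
           SUM(ISNULL(Quantity, 0))
    FROM dbo.InventoryActivity
    GROUP BY Organization_uid, MasterInventory_uid;

    COMMIT TRANSACTION;
END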

On the other hand, the engine that maintains the indexed view during updates should be smart enough to update the summary by applying only the changes to the summary without reading through the whole table each time. For example, if you update only, say, 2 rows in 10M table, then to calculate the new SUM(Quantity) the engine can subtract two old values of Quantity and add two new values. I don't have an authoritative source at hand that would confirm that the engine indeed works like this, but it should be fairly easy to test and verify by measuring the amount of reads and writes of a few test statements on a large table.
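
A minimal way to run that check (the test values are made up): turn on I/O statistics, insert a couple of rows into a large InventoryActivity table that has the indexed view in place, and compare the logical reads reported for the statement against the size of the table:

SET STATISTICS IO ON;

-- The plan for this insert includes the extra operators that maintain
-- InventorySummaryView; if the reported reads stay small relative to the
-- table size, the view is being maintained incrementally rather than rebuilt
INSERT INTO dbo.InventoryActivity
    (Organization_uid, MasterInventory_uid, AdjustmentType_cd, AdjustmentReason_cd, Quantity)
VALUES (1, 100, 'ADJUST', 'TEST', -2),
       (1, 200, 'ADJUST', 'TEST',  3);

SET STATISTICS IO OFF;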

This leads to another thing to consider:

  1. The type of aggregation that is done in the indexed view and whether the engine is smart enough to keep it up to date efficiently. For example, it is easy to keep SUM up to date efficiently and not so easy for MIN (see the sketch after this list).
  2. How big is the table that is being aggregated and what percentage of this table changes with each update.
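
SQL Server makes the MIN example concrete: an indexed view may only contain aggregates it can maintain from deltas alone (SUM over a non-nullable expression, plus COUNT_BIG(*)). A definition like the sketch below (hypothetical view name) is accepted as a plain view, but creating the unique clustered index on it is rejected, because deleting the row that holds the current minimum would force a rescan of the whole group:

CREATE VIEW dbo.InventoryMinView
WITH SCHEMABINDING
AS
    SELECT Organization_uid,
           MIN(Quantity) AS MinQuantity,     -- MIN cannot be maintained incrementally
           COUNT_BIG(*)  AS RecordCount
    FROM dbo.InventoryActivity
    GROUP BY Organization_uid
GO

-- Fails: indexed views do not allow MIN or MAX aggregates
CREATE UNIQUE CLUSTERED INDEX PK_InventoryMinView ON dbo.InventoryMinView (Organization_uid)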