SQL Server – Performance considerations for caching aggregate counts

aggregate, cache, sql-server, sql-server-2012

We have an InventoryActivity table that holds transactional changes to item quantity:

CREATE TABLE dbo.InventoryActivity(
    InventoryActivity_uid int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Organization_uid int NOT NULL,
    MasterInventory_uid int NOT NULL,
    AdjustmentType_cd varchar(20) NULL,
    AdjustmentReason_cd varchar(20) NULL,
    Quantity int NULL
)
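
For illustration, here are a few hypothetical activity rows (the adjustment codes and values are made up; the real sample data is in the fiddle linked below):

INSERT INTO dbo.InventoryActivity
    (Organization_uid, MasterInventory_uid, AdjustmentType_cd, AdjustmentReason_cd, Quantity)
VALUES
    (1, 100, 'RECEIVE', 'PURCHASE',  50),  -- stock received
    (1, 100, 'ADJUST',  'SALE',     -10),  -- each change is a signed delta
    (1, 200, 'RECEIVE', 'PURCHASE',  25),
    (2, 100, 'RECEIVE', 'TRANSFER',   5);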

We would like an InventorySummary that aggregates to the current quantity for each organization/item combination. The summary counts should always be derivable from the sum of the transactional records, but we have several different approaches for calculating the aggregate counts:

  1. Stored Procedure
  2. Separate Summary Table
  3. Indexed View

What performance considerations should tip the scales in favor of a particular strategy?
What best practices exist? (I know "best practices" borders on discussion, but I want to know what considerations would come into play to help make the decision.)

Here's a fiddle with some sample code and data.

  1. Stored Procedure

    The simplest option is to just perform the SUM operation fresh every time, but it involves no caching and would likely cause performance issues over time.

    CREATE PROCEDURE dbo.GetInventorySummary 
    AS   
    SELECT Organization_uid,
           MasterInventory_uid,
           SUM(Quantity) AS Quantity
    FROM dbo.InventoryActivity
    GROUP BY Organization_uid, MasterInventory_uid
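
    Even if this option is chosen, the repeated SUM can be made cheaper with a narrow supporting index so the aggregate scans an index rather than the whole table. This is only a sketch (the index name is made up), not part of the original question:

    -- Covering index so the GROUP BY/SUM reads a narrow index instead of the base table
    CREATE INDEX IX_InventoryActivity_Summary
        ON dbo.InventoryActivity (Organization_uid, MasterInventory_uid)
        INCLUDE (Quantity);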
    
  2. Separate Table

    We could create a table to store the current quantities. The plus side is that fetching this data would be trivial. The downside is that we would have to maintain it manually and keep the records in sync every time we write to the InventoryActivity table.

    CREATE TABLE dbo.InventorySummary(
        Organization_uid int NOT NULL,
        MasterInventory_uid int NOT NULL,
        Quantity int NOT NULL,
        PRIMARY KEY (Organization_uid, MasterInventory_uid)
    )
    

    Triggers could help alleviate some of that maintenance.

    CREATE TRIGGER dbo.InventoryActivity_I
    ON dbo.InventoryActivity
    AFTER INSERT   -- covers inserts only; updates and deletes would need similar handling
    AS 
        SET NOCOUNT ON;

        -- Aggregate the inserted rows once so multi-row inserts are applied correctly
        CREATE TABLE #InsertSummaryTemp
        (
          Organization_uid int,
          MasterInventory_uid int,
          Quantity int
        )
        INSERT INTO #InsertSummaryTemp
        SELECT Organization_uid,
               MasterInventory_uid,
               SUM(Quantity) AS Quantity
        FROM INSERTED
        GROUP BY Organization_uid, MasterInventory_uid
    
        -- UPDATE EXISTING RECORDS
        UPDATE InventorySummary
        SET Quantity = s.Quantity + i.Quantity
        FROM InventorySummary s
        JOIN #InsertSummaryTemp i ON s.Organization_uid = i.Organization_uid AND 
                                     s.MasterInventory_uid = i.MasterInventory_uid
    
        -- INSERT NEW RECORDS
        INSERT INTO InventorySummary
            (Organization_uid, MasterInventory_uid, Quantity) 
        SELECT i.Organization_uid, i.MasterInventory_uid, i.Quantity
        FROM #InsertSummaryTemp i
        LEFT JOIN InventorySummary s ON i.Organization_uid = s.Organization_uid AND 
                                        i.MasterInventory_uid = s.MasterInventory_uid
        WHERE s.MasterInventory_uid IS NULL
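
    The trigger above only covers inserts. As a hedged sketch (the trigger name is hypothetical and this is not from the original question), updates and deletes could be folded into the same summary by treating old and new rows as signed deltas:

    CREATE TRIGGER dbo.InventoryActivity_UD
    ON dbo.InventoryActivity
    AFTER UPDATE, DELETE
    AS 
        SET NOCOUNT ON;

        -- Net change per key: new values minus old values
        -- (for a DELETE, inserted is empty, so only the negated old values remain)
        WITH delta AS
        (
            SELECT Organization_uid, MasterInventory_uid, SUM(Quantity) AS Quantity
            FROM
            (
                SELECT Organization_uid, MasterInventory_uid, ISNULL(Quantity, 0) AS Quantity
                FROM inserted
                UNION ALL
                SELECT Organization_uid, MasterInventory_uid, -ISNULL(Quantity, 0)
                FROM deleted
            ) AS changes
            GROUP BY Organization_uid, MasterInventory_uid
        )
        MERGE dbo.InventorySummary AS s
        USING delta AS d
           ON s.Organization_uid = d.Organization_uid
          AND s.MasterInventory_uid = d.MasterInventory_uid
        WHEN MATCHED THEN
            UPDATE SET s.Quantity = s.Quantity + d.Quantity
        WHEN NOT MATCHED THEN
            INSERT (Organization_uid, MasterInventory_uid, Quantity)
            VALUES (d.Organization_uid, d.MasterInventory_uid, d.Quantity);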
    
  3. Indexed View

    Borrowed from this and this, we could create an indexed view, which lowers the maintenance cost associated with option 2. However, there is a performance concern: we are still aggregating all records across the entire history, as opposed to doing a simple read from a table.

    CREATE VIEW dbo.InventorySummaryView
    WITH SCHEMABINDING
    AS
        SELECT
          Organization_uid,
          MasterInventory_uid,
          SUM(ISNULL(Quantity, 0)) AS Quantity,   -- indexed views do not allow SUM over a nullable expression
          COUNT_BIG(*) AS Count
        FROM dbo.InventoryActivity
        GROUP BY Organization_uid, MasterInventory_uid
    GO
    
    CREATE UNIQUE CLUSTERED INDEX PK_InventorySummaryView ON dbo.InventorySummaryView
    (
          Organization_uid,
          MasterInventory_uid
    )
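
    One read-side detail worth knowing (not in the original question): on SQL Server 2012, automatic matching of queries to indexed views is an Enterprise edition feature, so on other editions a read would reference the view with the NOEXPAND hint to get the cheap pre-aggregated lookup:

    SELECT Organization_uid,
           MasterInventory_uid,
           Quantity
    FROM dbo.InventorySummaryView WITH (NOEXPAND);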
    

Best Answer

What performance considerations should tip the scales in favor of a particular strategy?

Essentially, like any form of cache, these strategies let you increase read performance at the expense of writes and disk space.

So, you should consider the following:

  1. The expected number of writes/updates vs. number of reads.
  2. What is more important: fast writes or fast reads.
  3. Whether extra disk space is available and how "cheap" it is.
  4. Whether the summary always has to be up to date, or whether it can be delayed.

The first strategy is a baseline. You don't duplicate the data, writes are as fast as they can be.

The last strategy (indexed view) slows down writes immediately (as they happen), but the aggregated amounts are always up to date.

The second strategy with an explicit summary table gives you greater control over when to perform the aggregation. You can delay the aggregation and perform it when the server is under less load. If you accumulate a bunch of pending changes for calculating the summary and perform these calculations in bulk, it may be more efficient than updating the summary after every single change of the source data.
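
As a rough sketch of that delayed, bulk-refresh variant (the procedure name and scheduling are hypothetical, not something the asker specified), the summary table could simply be rebuilt from the activity table during a quiet window, for example from a nightly SQL Agent job:

CREATE PROCEDURE dbo.RefreshInventorySummary
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    -- Rebuild the cache wholesale; cheaper per row than per-change maintenance,
    -- at the cost of the summary being stale between refreshes
    DELETE FROM dbo.InventorySummary;

    INSERT INTO dbo.InventorySummary (Organization_uid, MasterInventory_uid, Quantity)
    SELECT Organization_uid,
           MasterInventory_uid,
           SUM(ISNULL(Quantity, 0))
    FROM dbo.InventoryActivity
    GROUP BY Organization_uid, MasterInventory_uid;

    COMMIT TRANSACTION;
END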

On the other hand, the engine that maintains the indexed view during updates should be smart enough to update the summary by applying only the changes to the summary without reading through the whole table each time. For example, if you update only, say, 2 rows in 10M table, then to calculate the new SUM(Quantity) the engine can subtract two old values of Quantity and add two new values. I don't have an authoritative source at hand that would confirm that the engine indeed works like this, but it should be fairly easy to test and verify by measuring the amount of reads and writes of a few test statements on a large table.
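
A minimal way to run that check (the test values are made up): turn on I/O statistics, insert a couple of rows into a large InventoryActivity table that has the indexed view in place, and compare the logical reads reported for the statement against the size of the table:

SET STATISTICS IO ON;

-- The plan for this insert includes the extra operators that maintain
-- InventorySummaryView; if the reported reads stay small relative to the
-- table size, the view is being maintained incrementally rather than rebuilt
INSERT INTO dbo.InventoryActivity
    (Organization_uid, MasterInventory_uid, AdjustmentType_cd, AdjustmentReason_cd, Quantity)
VALUES (1, 100, 'ADJUST', 'TEST', -2),
       (1, 200, 'ADJUST', 'TEST',  3);

SET STATISTICS IO OFF;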

This leads to another thing to consider:

  1. The type of aggregation that is done in the indexed view and whether the engine is smart enough to keep it up to date efficiently. For example, it is easy to keep SUM up to date efficiently and not so easy for MIN (see the sketch after this list).
  2. How big is the table that is being aggregated and what percentage of this table changes with each update.
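
SQL Server makes the MIN example concrete: an indexed view may only contain aggregates it can maintain from deltas alone (SUM over a non-nullable expression, plus COUNT_BIG(*)). A definition like the sketch below (hypothetical view name) is accepted as a plain view, but creating the unique clustered index on it is rejected, because deleting the row that holds the current minimum would force a rescan of the whole group:

CREATE VIEW dbo.InventoryMinView
WITH SCHEMABINDING
AS
    SELECT Organization_uid,
           MIN(Quantity) AS MinQuantity,     -- MIN cannot be maintained incrementally
           COUNT_BIG(*)  AS RecordCount
    FROM dbo.InventoryActivity
    GROUP BY Organization_uid
GO

-- Fails: indexed views do not allow MIN or MAX aggregates
CREATE UNIQUE CLUSTERED INDEX PK_InventoryMinView ON dbo.InventoryMinView (Organization_uid)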