SQL Server – Calculating stock quantity based on change log

Tags: optimization, running-totals, sql-server, sql-server-2014

Imagine that you have the following table structure:

LogId | ProductId | FromPositionId | ToPositionId | Date                 | Quantity
-----------------------------------------------------------------------------------
1     | 123       | 0              | 10002        | 2018-01-01 08:10:22  | 5
2     | 123       | 0              | 10003        | 2018-01-03 15:15:10  | 9
3     | 123       | 10002          | 10004        | 2018-01-07 21:08:56  | 3
4     | 123       | 10004          | 0            | 2018-02-09 10:03:23  | 1

FromPositionId and ToPositionId are stock positions.
Some position IDs have special meaning, for example 0. An event from or to position 0 means that stock was created or removed: from 0 could be stock arriving from a delivery, and to 0 could be a shipped order.
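To make the convention concrete, here is how the first and last sample rows above might be written (the column list is inferred from the sample table; the values are the same as in the rows shown):

-- Delivery: 5 units of product 123 arrive at position 10002 (stock created, so FromPositionId = 0)
INSERT INTO dbo.ProductPositionLog (LogId, ProductId, FromPositionId, ToPositionId, [Date], Quantity)
VALUES (1, 123, 0, 10002, '2018-01-01 08:10:22', 5);

-- Shipped order: 1 unit of product 123 leaves position 10004 (stock removed, so ToPositionId = 0)
INSERT INTO dbo.ProductPositionLog (LogId, ProductId, FromPositionId, ToPositionId, [Date], Quantity)
VALUES (4, 123, 10004, 0, '2018-02-09 10:03:23', 1);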

This table currently holds around 5.5 million rows. We calculate the stock value for each product and position into a cache table on a schedule using a query that looks something like this:

WITH t AS
(
    SELECT ToPositionId AS PositionId, SUM(Quantity) AS Quantity, ProductId 
    FROM ProductPositionLog
    GROUP BY ToPositionId, ProductId
    UNION
    SELECT FromPositionId AS PositionId, -SUM(Quantity) AS Quantity, ProductId 
    FROM ProductPositionLog
    GROUP BY FromPositionId, ProductId
)

SELECT t.ProductId, t.PositionId, SUM(t.Quantity) AS Quantity
FROM t
WHERE NOT t.PositionId = 0
GROUP BY t.ProductId, t.PositionId
HAVING SUM(t.Quantity) > 0
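The scheduled job essentially writes this result into the cache table. A rough, untested sketch of that wrapper (StockQuantityCache is just an illustrative name for our cache table, and the truncate-and-reload strategy is simplified):

TRUNCATE TABLE dbo.StockQuantityCache;

WITH t AS
(
    SELECT ToPositionId AS PositionId, SUM(Quantity) AS Quantity, ProductId 
    FROM ProductPositionLog
    GROUP BY ToPositionId, ProductId
    UNION
    SELECT FromPositionId AS PositionId, -SUM(Quantity) AS Quantity, ProductId 
    FROM ProductPositionLog
    GROUP BY FromPositionId, ProductId
)
INSERT INTO dbo.StockQuantityCache (ProductId, PositionId, Quantity)
SELECT t.ProductId, t.PositionId, SUM(t.Quantity) AS Quantity
FROM t
WHERE NOT t.PositionId = 0
GROUP BY t.ProductId, t.PositionId
HAVING SUM(t.Quantity) > 0;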

Even though this completes in a reasonable amount of time (around 20 seconds), I feel like this is a pretty inefficient way of calculating the stock values.
We rarely do anything but INSERTs in this table, but sometimes we go in and adjust the quantity or remove a row manually because of mistakes by the people generating these rows.

I had an idea of creating "checkpoints" in a separate table, calculating the value up to a specific point in time and using that as a start value when creating our stock quantity cache table:

ProductId | PositionId | Date                | Quantity
-------------------------------------------------------
123       | 10002      | 2018-01-07 21:08:56 | 2
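The cache rebuild would then start from the latest checkpoints and only aggregate log rows created after them. A rough, untested sketch of what I have in mind (StockCheckpoint is the hypothetical checkpoint table above, and I assume a full set of checkpoints exists as of a single @CheckpointDate):

DECLARE @CheckpointDate DATETIME = '2018-01-07 21:08:56';

WITH t AS
(
    -- Movements logged after the checkpoint date
    SELECT ToPositionId AS PositionId, ProductId, SUM(Quantity) AS Quantity
    FROM ProductPositionLog
    WHERE [Date] > @CheckpointDate
    GROUP BY ToPositionId, ProductId
    UNION ALL
    SELECT FromPositionId AS PositionId, ProductId, -SUM(Quantity) AS Quantity
    FROM ProductPositionLog
    WHERE [Date] > @CheckpointDate
    GROUP BY FromPositionId, ProductId
    UNION ALL
    -- Starting quantities taken from the checkpoints
    SELECT PositionId, ProductId, Quantity
    FROM StockCheckpoint
    WHERE [Date] = @CheckpointDate
)
SELECT t.ProductId, t.PositionId, SUM(t.Quantity) AS Quantity
FROM t
WHERE NOT t.PositionId = 0
GROUP BY t.ProductId, t.PositionId
HAVING SUM(t.Quantity) > 0;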

The fact that we sometimes change rows poses a problem for this approach: we would also have to remember to remove any checkpoint created after the log row we changed. This could be handled by never calculating checkpoints all the way up to the present, but instead leaving a month between the last checkpoint and now (we very rarely make changes that far back).

The need to occasionally change rows is hard to avoid, and I would like to keep being able to do it. It's not shown in this structure, but the log events are sometimes tied to records in other tables, and adding another log row to correct the quantity is sometimes not possible.

The log table is, as you can imagine, growing pretty fast, and the time to calculate will only increase.

So to my question, how would you solve this? Is there a more efficient way of calculating the current stock value? Is my idea of checkpoints a good one?

We're running SQL Server 2014 Web (12.0.5511)

Execution plan: https://www.brentozar.com/pastetheplan/?id=Bk8gyc68Q

I actually gave the wrong execution time above; 20 seconds was the time the complete update of the cache took. This query takes somewhere around 6-10 seconds to run (8 seconds when I created this query plan). There's also a join in this query that was not in the original question.

Best Answer

Sometimes you can improve query performance just by doing a little bit of tuning instead of changing your entire query. I noticed in your actual query plan that your query spills to tempdb in three places. Here's one example:

(screenshot: tempdb spill warning from the actual execution plan)

Resolving those tempdb spills may improve performance. If Quantity is always non-negative then you can replace UNION with UNION ALL, which will likely change the hash union operator to something that doesn't require a memory grant. Your other tempdb spills are caused by cardinality estimation issues. You're on SQL Server 2014 and using the new CE, so it may be difficult to improve the estimates because the query optimizer won't use multi-column statistics. As a quick fix, consider the MIN_GRANT_PERCENT query hint made available in SQL Server 2014 SP2. Your query's memory grant is only 49104 KB and the maximum available grant is 5054840 KB, so hopefully bumping it up won't impact concurrency too much. 10% is a reasonable starting guess, but you may need to adjust it up and down depending on your hardware and data. Putting that all together, this is what your query might look like:

WITH t AS
(
    SELECT ToPositionId AS PositionId, SUM(Quantity) AS Quantity, ProductId 
    FROM ProductPositionLog
    GROUP BY ToPositionId, ProductId
    UNION ALL
    SELECT FromPositionId AS PositionId, -SUM(Quantity) AS Quantity, ProductId 
    FROM ProductPositionLog
    GROUP BY FromPositionId, ProductId
)

SELECT t.ProductId, t.PositionId, SUM(t.Quantity) AS Quantity
FROM t
WHERE NOT t.PositionId = 0
GROUP BY t.ProductId, t.PositionId
HAVING SUM(t.Quantity) > 0
OPTION (MIN_GRANT_PERCENT = 10);

If you wish to improve performance further, I recommend trying out indexed views instead of building and maintaining your own checkpoint table. Indexed views are significantly easier to get right than a custom solution involving your own materialized table or triggers. They add a small amount of overhead to all DML operations, but they may allow you to remove some of the nonclustered indexes that you currently have. Indexed views appear to be supported in the Web edition of the product.

There are some restrictions on indexed views (for example, they can't contain UNION), so you'll need to create a pair of them. Below is an example implementation, along with the fake data that I used for testing:

CREATE TABLE dbo.ProductPositionLog (
    LogId BIGINT NOT NULL,
    ProductId BIGINT NOT NULL,
    FromPositionId BIGINT NOT NULL,
    ToPositionId BIGINT NOT NULL,
    Quantity INT NOT NULL,
    FILLER VARCHAR(20),
    PRIMARY KEY (LogId)
);

INSERT INTO dbo.ProductPositionLog WITH (TABLOCK)
SELECT RN, RN % 100, RN % 3999, 3998 - (RN % 3999), RN % 10, REPLICATE('Z', 20)
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q;

CREATE INDEX NCI1 ON dbo.ProductPositionLog (ToPositionId, ProductId) INCLUDE (Quantity);
CREATE INDEX NCI2 ON dbo.ProductPositionLog (FromPositionId, ProductId) INCLUDE (Quantity);

GO    

CREATE VIEW ProductPositionLog_1
WITH SCHEMABINDING  
AS  
   SELECT ToPositionId AS PositionId, SUM(Quantity) AS Quantity, ProductId, COUNT_BIG(*) CNT
    FROM dbo.ProductPositionLog
    WHERE ToPositionId <> 0
    GROUP BY ToPositionId, ProductId
GO  

CREATE UNIQUE CLUSTERED INDEX IDX_V1   
    ON ProductPositionLog_1 (PositionId, ProductId);  
GO  

CREATE VIEW ProductPositionLog_2
WITH SCHEMABINDING  
AS  
   SELECT FromPositionId AS PositionId, SUM(Quantity) AS Quantity, ProductId, COUNT_BIG(*) CNT
    FROM dbo.ProductPositionLog
    WHERE FromPositionId <> 0
    GROUP BY FromPositionId, ProductId
GO  

CREATE UNIQUE CLUSTERED INDEX IDX_V2   
    ON ProductPositionLog_2 (PositionId, ProductId);  
GO  

Without the indexed views the query takes about 2.7 seconds to finish on my machine. I get a similar plan to yours except mine runs in serial:

(screenshot: serial execution plan without the indexed views)

I believe that you'll need to query the indexed views with the NOEXPAND hint because you aren't on enterprise edition. Here's one way to do that:

WITH t AS
(
    SELECT PositionId, Quantity, ProductId 
    FROM ProductPositionLog_1 WITH (NOEXPAND)
    UNION ALL
    SELECT PositionId, Quantity, ProductId 
    FROM ProductPositionLog_2 WITH (NOEXPAND)
)
SELECT t.ProductId, t.PositionId, SUM(t.Quantity) AS Quantity
FROM t
GROUP BY t.ProductId, t.PositionId
HAVING SUM(t.Quantity) > 0;

This query has a simpler plan and finishes in under 400 ms on my machine:

(screenshot: execution plan using the indexed views)

The best part is that you won't have to change any of the application code that loads data into the ProductPositionLog table. You simply need to verify that the DML overhead of the pair of indexed views is acceptable.
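If you want a rough sense of that overhead, one simple (if crude) approach is to time a sample insert against the test table above with the views in place and compare it to the same insert after dropping them, for example:

-- Rough, illustrative test: measure the cost of a single insert while the indexed views exist.
-- Run once with the views, then again after dropping them, and compare the output.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

INSERT INTO dbo.ProductPositionLog (LogId, ProductId, FromPositionId, ToPositionId, Quantity, FILLER)
VALUES (99999999, 123, 0, 10002, 5, REPLICATE('Z', 20));

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

-- Clean up the test row
DELETE FROM dbo.ProductPositionLog
WHERE LogId = 99999999;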