Sql-server – Is any caching mechanism used for non-persisted computed columns

cache, computed-column, sql-server

Does SQL Server cache the result of a non-persisted computed column so it can be reused without incurring cost of recalculation?

Additional Context
I've always advised that if you use computed columns you should mark them PERSISTED, unless you expect inserts/updates to be more frequent than reads, or you need better performance on inserts/updates than on reads. The cost of the calculation has to be paid on one side or the other, so the decision is really about where you want to pay it.

There's also the consideration of the additional storage for the computed data, but typically that's pretty negligible and cheap, so not much of a consideration.

However, I wanted to check if my advice is entirely accurate, since SQL may be more intelligent… i.e. Once SQL's calculated a computed column, it could record this value in memory so that it doesn't have to recalculate the value on subsequent queries. SQL could have a timestamp on the cached value and another on the underlying record data to say if that record's been changed since the computed value was calculated to determine whether the cached value is still valid.

Is there anything like this in place / does it depend on available resources (e.g. memory), or other factors (e.g. does the cached value have a TTL beyond that of the process lifetime)? I've never read anything implying that this exists, but I'd be surprised if there weren't some optimization going on under the covers.

Best Answer

The non-persisted computed column values are not cached in memory.

Here's a brief-ish demo that confirms this. First I'll create a database for our testing:

CREATE DATABASE CC_BufferPool_Test;
GO
USE [CC_BufferPool_Test];
GO

And then create a table without a computed column as a baseline. The column sizes in this table are specifically chosen so that 4 rows will exactly fill one 8KB data page (and thus one page in SQL Server's buffer cache):

CREATE TABLE T
(
    C1 INT NOT NULL IDENTITY PRIMARY KEY,
    Filler CHAR(2011) NOT NULL
);
GO

INSERT INTO T (Filler) VALUES(REPLICATE('A', 2011));
GO 4
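
To sanity-check the claim that four rows fill a page, one option is to ask SQL Server for the physical row size. This is only a sketch using sys.dm_db_index_physical_stats in DETAILED mode; the byte breakdown in the comments is my own working, assuming the standard fixed-length row format, and is not part of the original demo.

```sql
-- Leaf level of the clustered index on dbo.T: page count, row count, record size.
SELECT
    ips.index_level,
    ips.page_count,
    ips.record_count,
    ips.avg_record_size_in_bytes,
    ips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID(N'dbo.T'), NULL, NULL, 'DETAILED') AS ips
WHERE ips.index_level = 0;

-- Assumed arithmetic (not taken from the original answer):
--   4 bytes row header + 2015 bytes fixed data (INT + CHAR(2011))
--   + 2 bytes column count + 1 byte NULL bitmap = 2022 bytes per record
--   + 2 bytes slot array entry                  = 2024 bytes per row
--   4 rows x 2024 bytes = 8096 bytes = 8192 - 96-byte page header
```

If the reported record size differs on your build, the "exactly one page" property may not hold and the column width would need adjusting.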

Next I'll create the same table, but with the addition of a computed column:

CREATE TABLE T_WithCC
(
    C1 INT NOT NULL IDENTITY PRIMARY KEY,
    C2 AS 'Row #' + CONVERT(CHAR(8), C1, 1),
    Filler CHAR(2011) NOT NULL
);
GO

INSERT INTO T_WithCC(Filler) VALUES(REPLICATE('A', 2011));
GO 4

Next, I'll make sure these 8 rows are written to disk, and then clear the buffer cache, and then select all 8 rows so that they get pulled into memory:

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SELECT * FROM T;
SELECT * FROM dbo.T_WithCC;

Finally, I'll use this query that I totally did not steal from Aaron Bertrand* to count how many buffer pool pages each table is using.

From that, you can see there is only one page in memory for each of those two tables. If the computed column were stored in memory, it would have pushed the second table's rows onto a second page.
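
As a counter-check, you could repeat the experiment with a persisted version of the same column. The sketch below uses a hypothetical third table, dbo.T_WithPersistedCC (my name, not from the original demo), and assumes SQL Server treats the C2 expression as deterministic, which PERSISTED requires. Because the stored C2 value makes each row wider, I would expect the four rows to no longer fit on a single page.

```sql
-- Hypothetical table: same shape as T_WithCC, but C2 is stored in the row.
CREATE TABLE dbo.T_WithPersistedCC
(
    C1 INT NOT NULL IDENTITY PRIMARY KEY,
    C2 AS 'Row #' + CONVERT(CHAR(8), C1, 1) PERSISTED,
    Filler CHAR(2011) NOT NULL
);
GO

INSERT INTO dbo.T_WithPersistedCC (Filler) VALUES (REPLICATE('A', 2011));
GO 4

-- Flush dirty pages, empty the buffer pool, then read the rows back in.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO

SELECT * FROM dbo.T_WithPersistedCC;
```

Re-running the buffer pool query afterwards should show whether this table occupies more pages than T_WithCC.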

You can also see the calculation occurring each time in the actual execution plan:
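
If you want to look at the plan yourself, one way (a sketch, not necessarily how the original plan was captured) is to have SQL Server return the actual plan as XML next to the results:

```sql
-- Return the actual execution plan as an extra XML result set.
SET STATISTICS XML ON;

SELECT C1, C2
FROM dbo.T_WithCC;

SET STATISTICS XML OFF;

-- In the plan XML, look for a <ComputeScalar> element whose ScalarString
-- contains the C2 expression ('Row #' + CONVERT(...)) rather than a plain
-- column reference.
```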

If you do need to "cache" non-persisted computed columns for specific queries, you can actually create a nonclustered index on them. That way they are stored in the index, but not on the base table.
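
A minimal sketch of that idea against the demo table follows. The index name IX_T_WithCC_C2 is hypothetical, and the COLUMNPROPERTY check is just a precaution to confirm the expression is indexable before trying.

```sql
-- 1 means the computed column meets the requirements for indexing.
SELECT COLUMNPROPERTY(OBJECT_ID(N'dbo.T_WithCC'), 'C2', 'IsIndexable') AS IsIndexable;

-- Hypothetical index name. The computed values are materialised in the
-- index pages only; the base table rows stay unchanged.
CREATE NONCLUSTERED INDEX IX_T_WithCC_C2
    ON dbo.T_WithCC (C2);

-- A query that needs only C2 (plus the clustering key C1) can be answered
-- from the index. With just 4 rows the optimizer may still pick either
-- access path.
SELECT C1, C2
FROM dbo.T_WithCC
WHERE C2 = 'Row #1';
```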

*see Determine SQL Server memory use by database and object, and here's the code:

;WITH src AS
(
    SELECT
        [Object] = o.name,
        [Type] = o.type_desc,
        [Index] = COALESCE(i.name, ''),
        [Index_Type] = i.type_desc,
        p.[object_id],
        p.index_id,
        au.allocation_unit_id
    FROM sys.partitions AS p
        INNER JOIN sys.allocation_units AS au
            ON p.hobt_id = au.container_id
        INNER JOIN sys.objects AS o
            ON p.[object_id] = o.[object_id]
        INNER JOIN sys.indexes AS i
            ON o.[object_id] = i.[object_id]
            AND p.index_id = i.index_id
    WHERE
        au.[type] IN (1,2,3)
        AND o.is_ms_shipped = 0
)
SELECT
    src.[Object],
    src.[Type],
    src.[Index],
    src.Index_Type,
    buffer_pages = COUNT_BIG(b.page_id),
    buffer_mb = COUNT_BIG(b.page_id) / 128
FROM src
    INNER JOIN sys.dm_os_buffer_descriptors AS b
        ON src.allocation_unit_id = b.allocation_unit_id
WHERE b.database_id = DB_ID()
GROUP BY
    src.[Object],
    src.[Type],
    src.[Index],
    src.Index_Type
ORDER BY buffer_pages DESC;

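Related Solutions

Sql-server – Index on Persisted Computed column needs key lookup to get columns in the computed expression

Explanation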

The real question is why the optimizer felt the need to retrieve A, B, and C for the index seek at all. We would expect it to read the Comp column using a nonclustered index scan, and then perform a seek on the same index (alias T2) to locate the Top 1 record.

The query optimizer expands computed column references before optimization begins, to give it a chance to assess the costs of various query plans. For some queries, expanding the definition of a computed column allows the optimizer to find more efficient plans.
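
To illustrate the matching in the other direction, here is a sketch that assumes the dbo.T table and the filtered index IX_T_Comp_D defined further down in this answer. The predicate spells out the computed column's expression instead of naming Comp; in simple queries like this one the optimizer may match it back to the computed column and use the index. The literal 'a-b-c' is made up.

```sql
-- The WHERE clause repeats the definition of Comp (A + '-' + B + '-' + C).
-- D IS NOT NULL is included so the filtered index is eligible.
SELECT
    T1.ID,
    T1.D
FROM dbo.T AS T1
WHERE
    T1.A + '-' + T1.B + '-' + T1.C = 'a-b-c'
    AND T1.D IS NOT NULL;
```

Whether the match actually happens depends on the query shape, which is exactly what the rest of this explanation is about.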

When the optimizer encounters a correlated subquery, it attempts to 'unroll it' to a form it finds easier to reason about. If it cannot find a more effective simplification, it resorts to rewriting the correlated subquery as an apply (a correlated join):

[Execution plan image: apply rewrite]

It just so happens that this apply unrolling puts the logical query tree into a form that does not work well with project normalization (a later stage that looks to match general expressions to computed columns, among other things).

In your case, the way the query is written interacts with internal details of the optimizer such that the expanded expression definition is not matched back to the computed column, and you end up with a seek that references columns A, B, and C instead of the computed column, Comp. This is the root cause.
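
For reference, the kind of query shape being discussed is a correlated TOP (1) subquery in the SELECT list. The original query is not reproduced here, so the following is only my reconstruction from the workaround queries below and should be read as hypothetical:

```sql
-- Hypothetical reconstruction: a correlated scalar subquery, which the
-- optimizer rewrites internally as an apply.
SELECT
    T1.ID,
    T1.Comp,
    T1.D,
    D2 =
    (
        SELECT TOP (1)
            T2.D
        FROM dbo.T AS T2
        WHERE
            T2.Comp = T1.Comp
            AND T2.D > T1.D
        ORDER BY
            T2.D ASC
    )
FROM dbo.T AS T1
WHERE
    T1.D IS NOT NULL
ORDER BY
    T1.Comp;
```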

Workaround

One idea to work around this side-effect is to write the query as an apply manually:

SELECT
    T1.ID,
    T1.Comp,
    T1.D,
    CA.D2
FROM dbo.T AS T1
CROSS APPLY
(  
    SELECT TOP (1)
        D2 = T2.D
    FROM dbo.T AS T2
    WHERE
        T2.Comp = T1.Comp
        AND T2.D > T1.D
    ORDER BY
        T2.D ASC
) AS CA
WHERE
    T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
ORDER BY
    T1.Comp;

Unfortunately, this query will not use the filtered index as we would hope either. The inequality test on column D inside the apply rejects NULLs, so the apparently redundant predicate WHERE T1.D IS NOT NULL is optimized away.

Without that explicit predicate, the filtered index matching logic decides it cannot use the filtered index. There are a number of ways to work around this second side-effect, but the easiest is probably to change the cross apply to an outer apply (mirroring the logic of the rewrite the optimizer performed earlier on the correlated subquery):

SELECT
    T1.ID,
    T1.Comp,
    T1.D,
    CA.D2
FROM dbo.T AS T1
OUTER APPLY
(  
    SELECT TOP (1)
        D2 = T2.D
    FROM dbo.T AS T2
    WHERE
        T2.Comp = T1.Comp
        AND T2.D > T1.D
    ORDER BY
        T2.D ASC
) AS CA
WHERE
    T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
ORDER BY
    T1.Comp;

Now the optimizer does not need to use the apply rewrite itself (so the computed column matching works as expected) and the predicate is not optimized away either, so the filtered index can be used for both data access operations, and the seek uses the Comp column on both sides:

[Execution plan image: outer apply plan]

This would generally be preferred over adding A, B, and C as INCLUDEd columns in the filtered index, because it addresses the root cause of the problem, and does not require widening the index unnecessarily.
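
For comparison, the less attractive alternative mentioned above would look roughly like this. It is a sketch with a hypothetical index name; the columns come from the table definition shown below.

```sql
-- Hypothetical wider variant of the filtered index. Including A, B and C
-- lets the seek on the expanded expression avoid a key lookup, at the cost
-- of storing up to three extra varchar(20) values per index row.
CREATE NONCLUSTERED INDEX IX_T_Comp_D_Wide
ON dbo.T (Comp, D)
INCLUDE (A, B, C)
WHERE D IS NOT NULL;
```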

Persisted computed columns

As a side note, it is not necessary to mark the computed column as PERSISTED, if you don't mind repeating its definition in a CHECK constraint:

CREATE TABLE dbo.T 
(   
    ID integer IDENTITY(1, 1) NOT NULL,
    A varchar(20) NOT NULL,
    B varchar(20) NOT NULL,
    C varchar(20) NOT NULL,
    D date NULL,
    E varchar(20) NULL,
    Comp AS A + '-' + B + '-' + C,

    CONSTRAINT CK_T_Comp_NotNull
        CHECK (A + '-' + B + '-' + C IS NOT NULL),

    CONSTRAINT PK_T_ID 
        PRIMARY KEY (ID)
);

CREATE NONCLUSTERED INDEX IX_T_Comp_D
ON dbo.T (Comp, D) 
WHERE D IS NOT NULL;

The computed column is only required to be PERSISTED in this case if you want to use a NOT NULL constraint or to reference the Comp column directly (instead of repeating its definition) in a CHECK constraint.
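
For contrast, a sketch of the PERSISTED variant is shown here. The table name dbo.T_Persisted and the constraint names are hypothetical, and the CHECK condition is purely illustrative; the point is only that it can reference Comp directly and that NOT NULL can be declared on the column.

```sql
-- Hypothetical table, named so it does not clash with dbo.T above.
CREATE TABLE dbo.T_Persisted
(
    ID integer IDENTITY(1, 1) NOT NULL,
    A varchar(20) NOT NULL,
    B varchar(20) NOT NULL,
    C varchar(20) NOT NULL,
    D date NULL,
    E varchar(20) NULL,

    -- NOT NULL is only allowed here because the column is PERSISTED.
    Comp AS A + '-' + B + '-' + C PERSISTED NOT NULL,

    -- Illustrative rule that references Comp instead of repeating its definition.
    CONSTRAINT CK_T_Persisted_Comp
        CHECK (Comp <> '--'),

    CONSTRAINT PK_T_Persisted_ID
        PRIMARY KEY (ID)
);
```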

Sql-server – SQL Server 2014 – Compute scalar over computed indexed column

I believe the "Compute Scalar" operator in the plan that uses the index is not actually being executed. On my test rig with 1,000,000 sample rows, shown below, the query without the nonclustered index is a lot slower than the query that uses the nonclustered index on the MyHash column.

USE tempdb;
SET NOCOUNT ON;

IF EXISTS (SELECT 1
    FROM sys.objects o
    WHERE o.name = 'HTest'
        AND o.type = 'U')
DROP TABLE dbo.HTest;

CREATE TABLE dbo.HTest
(
    HTest_ID INT NOT NULL
        CONSTRAINT PK_HTest_ID
        PRIMARY KEY CLUSTERED
        IDENTITY(1,1)
    , V1 VARCHAR(255) NOT NULL
    , V2 VARCHAR(255) NOT NULL
    , V3 VARCHAR(255) NOT NULL
    , V4 VARCHAR(255) NOT NULL
);

INSERT INTO dbo.HTest (V1, V2, V3, V4)
SELECT TOP(1000000) 
    o1.name
    , o2.name
    , o3.name
    , o4.name
FROM sys.objects o1
    , sys.objects o2
    , sys.objects o3
    , sys.objects o4;

ALTER TABLE dbo.HTest
ADD MyHash AS(CAST(HASHBYTES('SHA1', V1 + V2 + V3 + V4) AS VARBINARY(20)));

Below we have two test runs, one without the index, and the 2nd one with an index on the MyHash column.

IF EXISTS (SELECT 1
    FROM sys.indexes i
    WHERE i.name = 'IX_Htest_MyHash'
)
DROP INDEX IX_Htest_MyHash
ON dbo.HTest;

PRINT (N'');
PRINT (N'-----set stats io on---------------------------------------------------');
PRINT (N'');

SET STATISTICS IO, TIME ON;

PRINT (N'');
PRINT (N'-----run 1 (no index)--------------------------------------------------');
PRINT (N'');

SELECT MyHash
FROM dbo.HTest;

PRINT (N'');
PRINT (N'-----end of run 1------------------------------------------------------');
PRINT (N'');

CREATE INDEX IX_Htest_MyHash
ON dbo.HTest(MyHash);

PRINT (N'');
PRINT (N'-----run 2 (with index)------------------------------------------------');
PRINT (N'');

SELECT MyHash
FROM dbo.HTest;

PRINT (N'');
PRINT (N'-----end of run 2------------------------------------------------------');
PRINT (N'');

SET STATISTICS IO, TIME OFF;

The salient bits from the output are:

-----run 1 (no index)--------------------------------------------------

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.
Table 'HTest'. Scan count 1, logical reads 9409, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1560 ms,  elapsed time = 7516 ms.

-----end of run 1------------------------------------------------------

-----run 2 (with index)------------------------------------------------

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'HTest'. Scan count 1, logical reads 4227, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 188 ms,  elapsed time = 4912 ms.

-----end of run 2------------------------------------------------------

As you can see, run 2 clocks in with far less CPU time (188 ms versus 1,560 ms) and fewer than half the logical reads (4,227 versus 9,409) of run 1.

The plan for run 1:

The plan for run 2:

Looking at the XML for the 2nd plan, you can see the <ComputeScalar> operator is actually just a lookup. Look at the 5th line <ScalarOperator ScalarString="[tempdb].[dbo].[HTest].[MyHash]">:

  <ComputeScalar>
    <DefinedValues>
      <DefinedValue>
        <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[HTest]" Column="MyHash" ComputedColumn="true" />
        <ScalarOperator ScalarString="[tempdb].[dbo].[HTest].[MyHash]">
          <Identifier>
            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[HTest]" Column="MyHash" ComputedColumn="true" />
          </Identifier>
        </ScalarOperator>
      </DefinedValue>
    </DefinedValues>
    <RelOp AvgRowSize="21" EstimateCPU="1.10016" EstimateIO="3.11572" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1000000" LogicalOp="Index Scan" NodeId="1" Parallel="false" PhysicalOp="Index Scan" EstimatedTotalSubtreeCost="4.21587" TableCardinality="1000000">
      <OutputList>
        <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[HTest]" Column="MyHash" ComputedColumn="true" />
      </OutputList>
      <RunTimeInformation>
        <RunTimeCountersPerThread Thread="0" ActualRows="1000000" ActualRowsRead="1000000" ActualEndOfScans="1" ActualExecutions="1" />
      </RunTimeInformation>
      <IndexScan Ordered="false" ForcedIndex="false" ForceSeek="false" ForceScan="false" NoExpandHint="false" Storage="RowStore">
        <DefinedValues>
          <DefinedValue>
            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[HTest]" Column="MyHash" ComputedColumn="true" />
          </DefinedValue>
        </DefinedValues>
        <Object Database="[tempdb]" Schema="[dbo]" Table="[HTest]" Index="[IX_Htest_MyHash]" IndexKind="NonClustered" />
      </IndexScan>
    </RelOp>
  </ComputeScalar>

Presumably, the plan for the 2nd query still shows a Compute Scalar for the computed column, even though the values come straight from the index and no calculation takes place.
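
If you want to confirm how the column is stored, a catalog query along these lines (a sketch, to be run in tempdb after the test script) lists the computed columns on dbo.HTest, whether each is persisted, and any index that contains it:

```sql
SELECT
    cc.name,
    cc.definition,
    cc.is_persisted,
    IndexName = i.name          -- NULL when the column is in no index
FROM sys.computed_columns AS cc
    LEFT JOIN sys.index_columns AS ic
        ON ic.[object_id] = cc.[object_id]
        AND ic.column_id = cc.column_id
    LEFT JOIN sys.indexes AS i
        ON i.[object_id] = ic.[object_id]
        AND i.index_id = ic.index_id
WHERE cc.[object_id] = OBJECT_ID(N'dbo.HTest');
```

For the test above I would expect is_persisted to be 0 with IX_Htest_MyHash listed, which fits the observation that the hash values are read from the index and not recalculated.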
