Sql-server – Strange behaviour with Computed Columns in SQL-Server


Whilst reading through my 70-433 exam book, I thought of something that I can see not working, yet I believe it does. The passage read something like:

The column must also be marked as PERSISTED, which means that SQL Server physically stores the result of the computed column's expression in the data row instead of calculating it each time it is referenced in a query.

From this I understand two things:

1. A non-persisted computed column is calculated every time that it is referenced in a query.
2. Because nothing is stored for the computed column, I assume no index can be created for the column.

After reading it, I thought that this was a bit strange as I have managed to create an index on a non-persisted column in a previous project.

How can an index be created for something that is not persisted and is this detrimental in the long run?

To prove this I have run the following SQL statement:

CREATE TABLE testTable
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    telephone VARCHAR(14),
    -- SUBSTRING is 1-based in T-SQL: a start of 0 with length 5 returns only 4 characters
    c_areaCode AS (SUBSTRING(telephone,1,5)),
    cp_areaCode AS (SUBSTRING(telephone,1,5)) PERSISTED
);

INSERT INTO testTable VALUES('09823 000000');
INSERT INTO testTable VALUES('09824 000000');
INSERT INTO testTable VALUES('09825 000000');

CREATE NONCLUSTERED INDEX IX_NotPersisted ON testTable(c_areaCode);
CREATE NONCLUSTERED INDEX IX_Persisted ON testTable(cp_areaCode);

And then run the following queries:

DBCC FREEPROCCACHE
DBCC FREESYSTEMCACHE('ALL');
DBCC DROPCLEANBUFFERS
GO
SELECT cp_areaCode FROM testTable;
GO
SELECT c_areaCode FROM testTable;

Having looked at the query plan for the above code, I can see that both select queries are using the non-persisted index. Again, how?


Best Answer

2. Because nothing is stored for the computed column, I assume no index can be created for the column.

This assumption is not true: either kind can be indexed. The computed column must be deterministic in either case, but when the computed column is persisted, the requirement that the computation also be precise is relaxed (i.e. it can involve floating point operations).
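One way to check these requirements for a particular column is to ask the metadata directly. The sketch below assumes the testTable from the question lives in the dbo schema of the current database; each COLUMNPROPERTY call returns 1 or 0.

```sql
-- Assumption: the question's testTable exists as dbo.testTable in the current database.
SELECT
    c.name         AS column_name,
    c.is_persisted AS is_persisted,
    COLUMNPROPERTY(c.object_id, c.name, 'IsDeterministic') AS is_deterministic,
    COLUMNPROPERTY(c.object_id, c.name, 'IsPrecise')       AS is_precise,
    COLUMNPROPERTY(c.object_id, c.name, 'IsIndexable')     AS is_indexable
FROM sys.computed_columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.testTable');
```

Comparing is_persisted with is_indexable for the two columns gives a direct check on the claim that persistence is not a prerequisite for indexing.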

How can an index be created for something that is not persisted and is this detrimental in the long run?

The result of the function is 'persisted' in the index in either case - the only difference is whether it is persisted in the table.
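A rough way to see this for yourself, again assuming the question's testTable and its two indexes exist in the dbo schema, is to force each index in turn and compare the plans and I/O. The LIKE pattern is only an illustrative filter chosen to match the sample phone numbers.

```sql
SET STATISTICS IO ON;

-- Force the index built on the non-persisted computed column.
SELECT c_areaCode
FROM dbo.testTable WITH (INDEX (IX_NotPersisted))
WHERE c_areaCode LIKE '098%';

-- Force the index built on the persisted computed column.
SELECT cp_areaCode
FROM dbo.testTable WITH (INDEX (IX_Persisted))
WHERE cp_areaCode LIKE '098%';

SET STATISTICS IO OFF;
```

If both queries are answered from the nonclustered index alone, with no lookup into the clustered index, that is consistent with the computed value being stored in the index rows in both cases.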

Related Solutions

Sql-server – Question about non-clustered index storage in SQL Server

Nonclustered indexes always include a row locator.

For a heap this will be an 8 byte RID (File:Page:Slot). On a table with a clustered index it will be the clustered index key column(s). The index always stores copies of the key values, not a pointer to them. This duplication of CI key values into all nonclustered indexes is why it is often recommended that the CI key be narrow and not frequently updated.

In the table shown in the question, the clustered index key is a 4 byte integer and may also include a 4 byte uniquifier for any duplicate key values.

In your case, as the NCIs are not declared as unique, the CI key will be appended to the NCI key.

For unique nonclustered indexes, the CI key would be added as included column(s) in the leaf pages unless explicitly made part of the key.

See Kalen Delaney: More About Nonclustered Index Keys for some additional information about how you can see this for yourself.

With these 4 rows of data all three indexes only consume a single 8KB data page.

SELECT index_id,
       index_level,
       page_count,
       record_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('people'), NULL, NULL, 'DETAILED')

Returns

+----------+-------------+------------+--------------+
| index_id | index_level | page_count | record_count |
+----------+-------------+------------+--------------+
|        1 |           0 |          1 |            4 |
|        2 |           0 |          1 |            4 |
|        3 |           0 |          1 |            4 |
+----------+-------------+------------+--------------+

The additional page shown in use by sys.allocation_units.total_pages is an IAM page. This is not used for storing data but just for tracking the pages and extents comprising the index.
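For reference, the page counts mentioned above can be read per index with a query along these lines. This is a sketch that assumes the same people table as the previous query; the join on hobt_id applies to in-row and row-overflow allocation units only.

```sql
-- Assumption: the table is named people, as in the query above.
SELECT
    i.name         AS index_name,
    au.type_desc   AS allocation_type,
    au.total_pages AS total_pages,   -- includes the IAM page
    au.used_pages  AS used_pages,
    au.data_pages  AS data_pages     -- pages that actually hold rows
FROM sys.partitions AS p
JOIN sys.indexes AS i
    ON  i.object_id = p.object_id
    AND i.index_id  = p.index_id
JOIN sys.allocation_units AS au
    ON au.container_id = p.hobt_id   -- valid for IN_ROW_DATA and ROW_OVERFLOW_DATA
WHERE p.object_id = OBJECT_ID(N'people')
  AND au.type IN (1, 3);
```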

Sql-server – Index on Persisted Computed column needs key lookup to get columns in the computed expression

Why is a Key Lookup required to get A, B and C when they are not referenced in the query at all? I assume they are being used to calculate Comp, but why?

Columns A, B, and C are referenced in the query plan - they are used by the seek on T2.

Also, why can the query use the index on t2, but not on t1?

The optimizer decided that scanning the clustered index was cheaper than scanning the filtered nonclustered index and then performing a lookup to retrieve the values for columns A, B, and C.

Explanation

The real question is why the optimizer felt the need to retrieve A, B, and C for the index seek at all. We would expect it to read the Comp column using a nonclustered index scan, and then perform a seek on the same index (alias T2) to locate the Top 1 record.

The query optimizer expands computed column references before optimization begins, to give it a chance to assess the costs of various query plans. For some queries, expanding the definition of a computed column allows the optimizer to find more efficient plans.

When the optimizer encounters a correlated subquery, it attempts to 'unroll it' to a form it finds easier to reason about. If it cannot find a more effective simplification, it resorts to rewriting the correlated subquery as an apply (a correlated join):

[Image: Apply rewrite]

It just so happens that this apply unrolling puts the logical query tree into a form that does not work well with project normalization (a later stage that looks to match general expressions to computed columns, among other things).

In your case, the way the query is written interacts with internal details of the optimizer such that the expanded expression definition is not matched back to the computed column, and you end up with a seek that references columns A, B, and C instead of the computed column, Comp. This is the root cause.
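The original query is not reproduced in this text. Judging only from the table definition and the rewrites discussed in this answer, it presumably used a correlated subquery of roughly the following shape; treat every detail here as an assumption rather than the asker's actual code.

```sql
-- Assumed shape of the original query (reconstructed, not quoted from the question).
SELECT
    T1.ID,
    T1.Comp,
    T1.D,
    D2 =
    (
        -- Correlated subquery: the optimizer rewrites this as an apply.
        SELECT TOP (1)
            T2.D
        FROM dbo.T AS T2
        WHERE
            T2.Comp = T1.Comp
            AND T2.D > T1.D
        ORDER BY
            T2.D ASC
    )
FROM dbo.T AS T1
WHERE
    T1.D IS NOT NULL
ORDER BY
    T1.Comp;
```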

Workaround

One idea to work around this side-effect is to write the query as an apply manually:

SELECT
    T1.ID,
    T1.Comp,
    T1.D,
    CA.D2
FROM dbo.T AS T1
CROSS APPLY
(  
    SELECT TOP (1)
        D2 = T2.D
    FROM dbo.T AS T2
    WHERE
        T2.Comp = T1.Comp
        AND T2.D > T1.D
    ORDER BY
        T2.D ASC
) AS CA
WHERE
    T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
ORDER BY
    T1.Comp;

Unfortunately, this query will not use the filtered index as we would hope either. The inequality test on column D inside the apply rejects NULLs, so the apparently redundant predicate WHERE T1.D IS NOT NULL is optimized away.

Without that explicit predicate, the filtered index matching logic decides it cannot use the filtered index. There are a number of ways to work around this second side-effect, but the easiest is probably to change the cross apply to an outer apply (mirroring the logic of the rewrite the optimizer performed earlier on the correlated subquery):

SELECT
    T1.ID,
    T1.Comp,
    T1.D,
    CA.D2
FROM dbo.T AS T1
OUTER APPLY
(  
    SELECT TOP (1)
        D2 = T2.D
    FROM dbo.T AS T2
    WHERE
        T2.Comp = T1.Comp
        AND T2.D > T1.D
    ORDER BY
        T2.D ASC
) AS CA
WHERE
    T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
ORDER BY
    T1.Comp;

Now the optimizer does not need to use the apply rewrite itself (so the computed column matching works as expected) and the predicate is not optimized away either, so the filtered index can be used for both data access operations, and the seek uses the Comp column on both sides:

[Image: Outer Apply Plan]

This would generally be preferred over adding A, B, and C as INCLUDEd columns in the filtered index, because it addresses the root cause of the problem, and does not require widening the index unnecessarily.
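For comparison, the alternative argued against here would look something like the following. It is a sketch that assumes the filtered index is named IX_T_Comp_D, is keyed on (Comp, D), and already exists, so it is rebuilt in place.

```sql
-- Assumption: IX_T_Comp_D already exists on dbo.T with key (Comp, D).
-- Widening it with A, B and C avoids the lookup but treats the symptom, not the cause.
CREATE NONCLUSTERED INDEX IX_T_Comp_D
ON dbo.T (Comp, D)
INCLUDE (A, B, C)
WHERE D IS NOT NULL
WITH (DROP_EXISTING = ON);
```

Every row in the filtered index then carries three extra varchar(20) values, which is the unnecessary widening mentioned above.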

Persisted computed columns

As a side note, it is not necessary to mark the computed column as PERSISTED, if you don't mind repeating its definition in a CHECK constraint:

CREATE TABLE dbo.T 
(   
    ID integer IDENTITY(1, 1) NOT NULL,
    A varchar(20) NOT NULL,
    B varchar(20) NOT NULL,
    C varchar(20) NOT NULL,
    D date NULL,
    E varchar(20) NULL,
    Comp AS A + '-' + B + '-' + C,

    CONSTRAINT CK_T_Comp_NotNull
        CHECK (A + '-' + B + '-' + C IS NOT NULL),

    CONSTRAINT PK_T_ID 
        PRIMARY KEY (ID)
);

CREATE NONCLUSTERED INDEX IX_T_Comp_D
ON dbo.T (Comp, D) 
WHERE D IS NOT NULL;

The computed column is only required to be PERSISTED in this case if you want to use a NOT NULL constraint or to reference the Comp column directly (instead of repeating its definition) in a CHECK constraint.
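For contrast, a persisted variant that uses a NOT NULL constraint directly instead of the CHECK constraint might be declared like this. The table, constraint, and index names with the T_Persisted suffix are hypothetical, chosen only so the example does not collide with dbo.T.

```sql
-- Hypothetical table: same columns as dbo.T, but Comp is PERSISTED NOT NULL.
CREATE TABLE dbo.T_Persisted
(
    ID integer IDENTITY(1, 1) NOT NULL,
    A varchar(20) NOT NULL,
    B varchar(20) NOT NULL,
    C varchar(20) NOT NULL,
    D date NULL,
    E varchar(20) NULL,
    -- NOT NULL on a computed column is only allowed together with PERSISTED.
    Comp AS A + '-' + B + '-' + C PERSISTED NOT NULL,

    CONSTRAINT PK_T_Persisted_ID
        PRIMARY KEY (ID)
);

CREATE NONCLUSTERED INDEX IX_T_Persisted_Comp_D
ON dbo.T_Persisted (Comp, D)
WHERE D IS NOT NULL;
```

The trade-off is that the persisted value takes up space in every row of the table, whereas the non-persisted version stores it only in the index.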