SQL Server – Changing primary key from IDENTITY to a persisted computed column using COALESCE

Tags: computed-column, identity, sql-server

In an attempt to decouple an application from our monolithic database, we've tried to change the INT IDENTITY columns of various tables to be a PERSISTED computed column that uses COALESCE. Basically, we need to give the decoupled application the ability to update the database for common data shared across many applications, while still allowing existing applications to create data in these tables without the need for code or procedure modification.

So essentially, we've moved from a column definition of:

PkId INT IDENTITY(1,1) PRIMARY KEY

to:

PkId AS COALESCE(old_id, external_id, new_id) PERSISTED NOT NULL,
old_id INT NULL, -- Values here are from existing records of PkId before table change
external_id INT NULL,
new_id INT IDENTITY(2000000,1) NOT NULL

In all cases the PkId is also a PRIMARY KEY, and in all but one case it is CLUSTERED. All tables have the same foreign keys and indexes as before. In essence, the new format allows the PkId to be supplied by the decoupled application (as external_id), but also allows the PkId to be the IDENTITY column value, which lets existing code that relies on the IDENTITY column (via SCOPE_IDENTITY and @@IDENTITY) continue to work as it used to.
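
To illustrate the intent, here is a minimal sketch of the two insert paths this structure is meant to support; the table and non-key column names (dbo.TableName, SomeValue) are placeholders, not our actual schema:

-- Decoupled application: supplies the shared id explicitly via external_id,
-- so PkId = external_id for this row (old_id is NULL).
INSERT INTO dbo.TableName (external_id, SomeValue)
VALUES (12345, N'row created by the decoupled app');

-- Existing application: inserts exactly as before and still reads the key back
-- via SCOPE_IDENTITY(); since old_id and external_id are NULL, PkId = new_id.
INSERT INTO dbo.TableName (SomeValue)
VALUES (N'row created by existing code');

SELECT SCOPE_IDENTITY() AS NewPkId;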

The problem we've had is that we've come across a couple of queries that used to run in an acceptable time but now blow out completely. The query plans generated for these queries are nothing like they were before.

Given the new column is a PRIMARY KEY, the same data type as before, and PERSISTED, I would have expected queries and query plans to behave the same as they did before. Should the COMPUTED PERSISTED INT PkId essentially behave the same way as an explicit INT definition in terms of how SQL Server will produce the execution plan? Are there other likely issues with this approach that you can see?

The purpose of this change was to allow us to change the table definition without modifying existing procedures and code. Given these issues, I don't feel we can go with this approach.

Best Answer

FIRST

You probably don't need all three columns: old_id, external_id, and new_id. The new_id column, being an IDENTITY, will have a new value generated for each row, even when you insert into external_id. But old_id and external_id are pretty much mutually exclusive: a row either already has an old_id value, or that column will simply be NULL when external_id or new_id is used. Since you won't be adding a new "external" id to a row that already exists (i.e. one that has an old_id value), and there won't be any new values coming in for old_id, one column can be used for both purposes.

So, get rid of the external_id column and rename old_id to be something like old_or_external_id or whatever. This shouldn't require any real changes to anything, yet reduces some of the complication. At most you might need to call the column external_id, even if it contains "old" values, if app code is already written to insert into external_id.

That reduces the new structure to be just:

PkId AS COALESCE(old_or_external_id, new_id, -1) PERSISTED NOT NULL,
old_or_external_id INT NULL, -- values from existing record OR passed in from app
new_id INT IDENTITY(2000000, 1) NOT NULL

Now you have only added 8 bytes per row instead of 12 bytes (assuming that you aren't using the SPARSE option or Data Compression). And you didn't need to change any code, T-SQL or App code.
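
For reference, here is a minimal sketch of what a table with the reduced structure could look like; the table, constraint, and non-key column names are placeholders:

CREATE TABLE dbo.TableName
(
    old_or_external_id INT NULL,  -- values from existing records OR passed in from the app
    new_id             INT IDENTITY(2000000, 1) NOT NULL,  -- auto-generated for existing code
    PkId               AS COALESCE(old_or_external_id, new_id, -1) PERSISTED NOT NULL
                       CONSTRAINT PK_TableName PRIMARY KEY CLUSTERED,
    SomeValue          NVARCHAR(100) NOT NULL
);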

SECOND

Continuing down this path of simplification, let's look at what we have left:

  • The old_or_external_id column either has values already, or will be given a new value from the app, or will be left as NULL.
  • The new_id will always have a new value generated, but that value will only be used if the old_or_external_id column is NULL.

There is never a time when you would need values in both old_or_external_id and new_id. Yes, there will be times when both columns have values due to new_id being an IDENTITY, but those new_id values are ignored. Again, these two fields are mutually exclusive. So what now?

Now we can look into why we needed the external_id in the first place. Considering that it is possible to insert into an IDENTITY column using SET IDENTITY_INSERT {table_name} ON;, you could get away with making no schema changes at all, and only modify your app code to wrap its INSERT statements / operations in SET IDENTITY_INSERT {table_name} ON; and SET IDENTITY_INSERT {table_name} OFF; statements. You then need to determine what starting value to reseed the IDENTITY column to (for newly generated values): it needs to be well above the values that the app code will be inserting, because inserting a value higher than the current identity value causes the next auto-generated value to continue from that higher value. Inserting a value that is below the IDENT_CURRENT value, however, is always safe.

Combining the old_or_external_id and new_id columns also does not increase the chances of running into an overlapping value situation between auto-generated values and app-generated values, since the intention of having the 2 (or even 3) columns was to combine them into a single Primary Key value, and Primary Key values are always unique anyway.

In this approach, you just need to:

  • Leave tables as being:

    PkId INT IDENTITY(1,1) PRIMARY KEY
    

    This adds 0 bytes to each row, instead of 8, or even 12.

  • Determine the starting range for app-generated values. These will be greater than the current MAX value in each table, but less than what will become the minimum value for the auto-generated values.
  • Determine what value the auto-generated range should start at. There should be plenty of room above the current MAX value for the app-generated values, and plenty of room beyond that for the auto-generated values to grow, keeping in mind that the upper limit of INT is just over 2.14 billion. You can then set this new minimum seed value via DBCC CHECKIDENT (see the sketch after this list).
  • Wrap app code INSERTs in SET IDENTITY_INSERT {table_name} ON; and SET IDENTITY_INSERT {table_name} OFF; statements.
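
A sketch of those steps, assuming a hypothetical dbo.TableName whose current MAX(PkId) is well below 2,000,000 (all names and values here are placeholders):

-- One-time: move the auto-generated range well above anything the app will supply.
DBCC CHECKIDENT (N'dbo.TableName', RESEED, 2000000);

-- App code: supply an explicit PkId in the reserved range (current MAX .. 2,000,000),
-- which is below IDENT_CURRENT and therefore does not disturb the IDENTITY seed.
SET IDENTITY_INSERT dbo.TableName ON;

INSERT INTO dbo.TableName (PkId, SomeValue)
VALUES (1500000, N'row created by the decoupled app');

SET IDENTITY_INSERT dbo.TableName OFF;

-- Existing code is untouched: it omits PkId and keeps using SCOPE_IDENTITY().
INSERT INTO dbo.TableName (SomeValue)
VALUES (N'row created by existing code');

SELECT SCOPE_IDENTITY() AS NewPkId;  -- 2000001 or higher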

SECOND, Part B

A variation on the approach noted directly above would be to have the app code insert values starting with -1 and going down from there, leaving the IDENTITY values as the only ones going up. The benefit here is that you not only avoid complicating the schema, you also never need to worry about overlapping IDs (i.e. the app-generated values running into the auto-generated range). This is only an option if you aren't already using negative ID values (and it seems pretty rare for people to use negative values in auto-generated columns, so this should be a viable possibility in most situations).

In this approach, you just need to:

  • Leave tables as being:

    PkId INT IDENTITY(1,1) PRIMARY KEY
    

    This adds 0 bytes to each row, instead of 8, or even 12.

  • The starting range for app-generated values will be -1.
  • Wrap app code INSERTs in SET IDENTITY_INSERT {table_name} ON; and SET IDENTITY_INSERT {table_name} OFF; statements.

Here you still need to do the IDENTITY_INSERT, but: you don't add any new columns, don't need to "reseed" any IDENTITY columns, and have no future risk of overlaps.
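
A sketch of the negative-range variation, again with placeholder names; how the app picks the next negative value (here a simple MIN(PkId) - 1 lookup) is an assumption and would need to be made concurrency-safe in practice:

DECLARE @NextAppId INT;

-- Next unused negative id: -1 if none exist yet, otherwise one below the current minimum.
SELECT @NextAppId = COALESCE(MIN(PkId), 0) - 1
FROM dbo.TableName
WHERE PkId < 0;

SET IDENTITY_INSERT dbo.TableName ON;

INSERT INTO dbo.TableName (PkId, SomeValue)
VALUES (@NextAppId, N'row created by the decoupled app');

SET IDENTITY_INSERT dbo.TableName OFF;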

SECOND, Part C

One last variation of this approach would be to swap out the IDENTITY columns and use Sequences instead. The reason to take this approach is to have the app code insert values that are positive and above the auto-generated range (not below it), with no need for SET IDENTITY_INSERT ON / OFF.

In this approach, you just need to:

  • Create Sequences using CREATE SEQUENCE
  • Copy the IDENTITY column to a new column that doesn't have the IDENTITY property, but does have a DEFAULT Constraint using the NEXT VALUE FOR function:

    PkId INT PRIMARY KEY CONSTRAINT [DF_TableName_NextID] DEFAULT (NEXT VALUE FOR...)
    

    This adds 0 bytes to each row, instead of 8, or even 12.

  • The starting range for app-generated values will be well above what you expect the auto-generated values to ever reach.
  • There is no need to wrap app code INSERTs in SET IDENTITY_INSERT {table_name} ON; / OFF; statements, since there is no longer an IDENTITY column (see the sketch after this list).
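
A sketch of what a table could look like after swapping the IDENTITY column for a Sequence; the sequence, table, and column names are placeholders that only mirror the constraint-name pattern above:

CREATE SEQUENCE dbo.TableNameID
    AS INT
    START WITH 1
    INCREMENT BY 1;

CREATE TABLE dbo.TableName
(
    PkId      INT NOT NULL
              CONSTRAINT DF_TableName_NextID DEFAULT (NEXT VALUE FOR dbo.TableNameID)
              CONSTRAINT PK_TableName PRIMARY KEY CLUSTERED,
    SomeValue NVARCHAR(100) NOT NULL
);

-- Existing code inserts without PkId and gets the next auto-generated value:
INSERT INTO dbo.TableName (SomeValue)
VALUES (N'row created by existing code');

-- App code supplies an explicit PkId well above the auto-generated range,
-- with no IDENTITY column involved and therefore no SET IDENTITY_INSERT needed:
INSERT INTO dbo.TableName (PkId, SomeValue)
VALUES (2000000, N'row created by the decoupled app');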

HOWEVER, due to the requirement that code using either SCOPE_IDENTITY() or @@IDENTITY still functions properly, switching to Sequences is not currently an option, as it appears that there is no equivalent of those functions for Sequences :-(