How to Add an Identity Field as New Cluster in SQL Server

clustered-indexidentitysql-server-2012

I have a very large table (88m rows, nearly 300GB in size) that has a poorly planned clustered index. I want to add a BIGINT identity field, and make it the new clustered index (having the desirable attributes of unique, narrow, static, and ever-increasing).

The problem is that I also want this new identity field to be ordered in a certain way on the current records (based on an existing date field that is not the current cluster). This is so when we purge the oldest records, those will all be contiguous on disk. (The date field isn't unique, though, so isn't a good candidate for the cluster by itself.)

How can I get the values for this new identity field ordered the way I want?

The first idea I had was to make a new table with the identify field, and just INSERT INTO rows from the current table ORDER BY the date field? Would that work?

If I don't have enough disk space to do that, is there any way to do that "in-place"? Perhaps by changing the clustered index to the date field (wait a long time), then add the identity (and wait a long time again)? Are the values of the identity guaranteed to be ordered by the current cluster?

Best Answer

I just performed the following experiment on a SQL Server 2012 box:

CREATE TABLE rdtest (label varchar(100), value int);

INSERT INTO rdtest VALUES
       ('one', 1)
      ,('two', 4)
      ,('three', 9)
      ,('four', 16)
      ,('five', 25)
;

ALTER TABLE rdtest ADD test_id int IDENTITY(1,1);

SELECT * FROM rdtest;

Result:

label    value       test_id
-------- ----------- -----------
one      1           1
two      4           2
three    9           3
four     16          4
five     25          5

ALTER TABLE rdtest DROP COLUMN test_id;

ALTER TABLE rdtest ALTER COLUMN value int NOT NULL;
GO

ALTER TABLE rdtest ADD CONSTRAINT PK_rdtest PRIMARY KEY ([value] DESC);

ALTER TABLE rdtest ADD test_id int IDENTITY(1,1);

SELECT * FROM rdtest;

Results:

label    value       test_id
-------- ----------- -----------
five     25          1
four     16          2
three    9           3
two      4           4
one      1           5

And, one more test run:

ALTER TABLE rdtest DROP CONSTRAINT PK_rdtest;

ALTER TABLE rdtest DROP COLUMN test_id;

ALTER TABLE rdtest ALTER COLUMN label varchar(100) NOT NULL;
GO

ALTER TABLE rdtest ADD CONSTRAINT PK_rdtest PRIMARY KEY ([label] ASC);

ALTER TABLE rdtest ADD test_id int IDENTITY(1,1);

SELECT * FROM rdtest;

Results:

label    value       test_id
-------- ----------- -----------
five     25          1
four     16          2
one      1           3
three    9           4
two      4           5

I suppose this isn't a guarantee that the clustered index dictates the order a new IDENTITY column gets filled in - but, I'd say it strongly indicates that it's very likely.

Related Solutions

Sql-server – How to reset all of the identity values for a table, and reseed the auto-number in Sql Server 2008

The easiest way would be to start a new table. Inject the 30,000 rows you want to keep, then drop the old table and rename the new table.

BEGIN TRANSACTION;

CREATE TABLE dbo.copy_of_foo
(
  ID INT IDENTITY(1,1) PRIMARY KEY,
  col2 ..., ...
);

INSERT dbo.copy_of_foo(col2, ...) 
  SELECT TOP 30000 col2, ...
  FROM dbo.foo
  ORDER BY ...;

DROP TABLE dbo.foo;

EXEC sp_rename N'dbo.copy_of_foo', N'foo', N'OBJECT';

COMMIT TRANSACTION;

You may need to nest the order by and add OPTION (MAXDOP 1) if you want to ensure that you get the 30K most recent rows and the IDENTITY values are assigned in order from oldest to newest, e.g. hopefully you have some other column, such as a date time column, that can help identify the 30000 rows you want to keep:

INSERT dbo.copy_of_foo(col2, ...)
  SELECT col2, ...
  FROM 
  (
    SELECT TOP 30000 col2, ...
      FROM dbo.foo 
      ORDER BY date_time_column DESC
  ) AS x
  ORDER BY date_time_column
  OPTION (MAXDOP 1);

Keep in mind though that if someone reported an exception yesterday with row #42,000,564, good luck finding it today. As I suggested in the comment, maybe you shouldn't be exposing these obviously-otherwise-absolutely-meaningless surrogate identifiers to users.

Postgresql – How does PostgreSQL physically order new records on disk (after a cluster on primary key)

Rows in postgresql have no fixed order. Not only are records placed where ever there is free space records can also move around. This is because when a row is updated a new version of the row is created in a new location while the old version continues to live in it's old location until it's removed by vacuum.

A CLUSTER operations sorts all rows but does indeed not affect how postgresql add's rows. So the data won't stay sorted. However postgresql keeps statistics among which is the correlation for each column between the order of the rows in the tables and the sorted order of that column. So the planner can still optimize it's plan based on the statistics that tell it the table is still mostly sorted even if some rows have been added after a cluster operation (or have been moved by updates).

Best Answer

Related Solutions

Sql-server – How to reset all of the identity values for a table, and reseed the auto-number in Sql Server 2008

Postgresql – How does PostgreSQL physically order new records on disk (after a cluster on primary key)

Related Question