Sql-server – Fastest way to add a field to a large table

sql serversql server 2014

I have a fairly large table (approx 400 million rows) and I need to add new column of type bit with NOT NULL constraint. My approach is to add the field without NOT NULL constraint, set the field to 0 and the alter the column to add the NOT NULL constraint.
This is a pretty slow process. Is there a quicker approach?

I am on SQL Server 2014 Enterprise Edition

Best Answer

As you are on Enterprise Edition and version >= 2012 then the information in Online non-NULL with values column add in SQL Server 2012 should apply.

Alter table yourtable 
    add newcolumn bit not null default 0;

Should be metadata only and, unlike your proposed solution, will not have to write anything to all rows.

Summary

For large tables, it is nearly always preferable to move the data instead of altering the table. To get maximum speed, use one of the following patterns. Let us call the table you want to change S (for source).

Create an empty copy of S. Let us call that T
ALTER TABLE on T. This is very fast because T has no rows
If T has a cluster index, remove all other indexes but the cluster
If T is a heap, remove all indexes
Put the database in SIMPLE recovery mode (if you can, but be aware of the backup implications)
Use one of the two options provided in the Copy Data section below
Rebuild the indexes on T
Drop S and rename to T

Copy Data

In step 6 above, you will need to move data from S to T at highest possible speed. The following are two options:

Use INSERT ... SELECT
Use SSIS

Of these two, SSIS is the fastest. With INSERT ... SELECT you are restricted to a single thread. However, INSERT ... SELECT is much easier. See below for more details.

Option 1: INSERT ... SELECT

If T is a heap, do this:

INSERT INTO T WITH (TABLOCK) SELECT ... FROM S

This is minimally logged.

If T has a clustered index, do this:

DBCC TRACEON (610)
INSERT INTO T SELECT ... FROM S

This is also be minimally logged. The problem with this option is that you can only run one INSERT statement in parallel. Because SQL Server does not do parallel INSERT ... SELECT, this restricts you to around 40-80MB/sec - which is quite slow.

Option 2: Use SSIS

This option is a lot more trouble, but also faster. The idea is this:

Drop all indexes on T
Create an SSIS package that takes a parameter that selects a subset of the rows in S (using some filter that is supported by an index). The SSIS package then moves the rows to T using the TABLOCK hint in bulk mode
Execute multiple copies of the SSIS package, each operating on their own, distinct subset of S (so you don't duplicate rows in T)

This is MUCH faster because the packages can run in parallel. With this technique, I can typically move tables at GB/sec.

Sql-server – How much disk space do I need to add a primary key to a very large table

I would also add that not only data pages will be moved when creating a clustered index - also all the non-clustered indexes will have to be rebuilt since each row in non-clustered index contains clustered key value. All these data movements need to be logged and sorting process will heavily use tempDb since you probably don't have 500 GBs of memory available, right? ;)

You'll find plenty of debates and different opinions about using heaps and I'm of course speculating here but most likely there's a reason why such a big table doesn't already have a clustered index defined on top of it. It's not exactly the best practise but sometimes it does make sense to define a table without clustered index (quick, minimally logged inserts) with a few narrow non-clustered indexes that cover frequent queries.

Decision about adding any kind of index (primary, clustered, non-clustered, XML, spatial, ...) depends of course on the problem you're trying to solve. In OLTP system you typically optimise your indexes to improve write performance while in reporting systems like OLAP you usually need to work on read types of workloads.

And few suggestions to wrap this up: - primary key doesn't have to be clustered - if the table is not partitioned you should definitely consider partition it - if your goal is to make certain types of queries faster (and not trying to enforce uniqueness of your bigint column) you should consider creating additional (or modifying existing) non-clustered (possibly filtered) index or indexed view (perfect for aggregate queries)

Hope it helps, good luck!

Best Answer

Related Solutions

Sql-server – Change Data Types large Live table

Summary

Copy Data

Sql-server – How much disk space do I need to add a primary key to a very large table

Related Question