SQL Server: optimization techniques for rebuilding a very large table

sql-server, sql-server-2008

We generally use SQL Compare to generate upgrade/change scripts, and we also use the "Force column order" option by default. This basically means that the column order will be preserved going from database version A to B, so in situations where the column order would otherwise change, a table rebuild becomes necessary.

Well, somebody decided it would be a good idea to sandwich a new column in-between existing columns when making a schema change to a table containing a very large amount of data (obviously ill-advised). Here are some more details of what I am faced with:

  • The table has about 1.3 billion rows
  • A clustered index exists on the table
  • A few non-clustered indexes exist on the table
  • The server the database instance is running on is very limited from a resource perspective

The script generated by SQL Compare handles this by dropping all indexes on the table, throwing the data into a temp table, dropping the table, re-creating the table with the new schema, and inserting the data from the temp table into it. I feel like this can be optimized. Here is what I came up with to do so (a rough sketch of this plan follows the list):

  1. bcp data out to file
  2. disable indexes
  3. truncate table
  4. rebuild table based on new schema
  5. bcp data into new table schema
  6. enable indexes
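
For illustration only, here is a minimal sketch of that plan, assuming a hypothetical table MyDb.dbo.BigTable, a hypothetical nonclustered index IX_BigTable_SomeColumn, and local file paths. The bcp switches (server, authentication, batch size) would need to match your environment, and a format file is needed on the way back in because the column order changes:

    -- 1. bcp the data out to a native-format file (run from a command prompt):
    --    bcp MyDb.dbo.BigTable out D:\staging\BigTable.dat -n -S MyServer -T

    -- 2. disable the nonclustered indexes so they are not maintained during the load
    --    (do not disable the clustered index, or the table becomes inaccessible)
    ALTER INDEX IX_BigTable_SomeColumn ON dbo.BigTable DISABLE;

    -- 3. + 4. truncate, then drop and re-create the table with the new column order;
    --    note that dropping the table also drops its indexes, so they would have
    --    to be re-created rather than merely re-enabled
    TRUNCATE TABLE dbo.BigTable;
    -- DROP TABLE dbo.BigTable; CREATE TABLE dbo.BigTable (...new schema...);

    -- 5. bcp the data back in (command prompt again), using a format file to map
    --    the old column order onto the new one, with a batch size and TABLOCK hint:
    --    bcp MyDb.dbo.BigTable in D:\staging\BigTable.dat -f BigTable.fmt -S MyServer -T -b 100000 -h "TABLOCK"

    -- 6. a disabled index is re-enabled by rebuilding it (if the table was dropped
    --    and re-created instead, the indexes have to be re-created from scratch)
    ALTER INDEX ALL ON dbo.BigTable REBUILD;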

One obvious problem, if I am not mistaken, is that at some point the table has to be dropped and recreated, since the column order is changing. Dropping the table will also drop the indexes that I would like to keep in place, won't it? Has anybody faced a similar situation? Does my plan look like a better way to reduce potential downtime? Any ideas or advice will be greatly appreciated.

Best Answer

Code that relies on the order of columns is asking for trouble anyway. It should never be a problem, since as a DBA you're profiling for SELECT * statements and ostracizing the perpetrators, correct?

When I do have to perform such an operation, usually because of some dumb piece of legacy software, I follow these steps (a rough T-SQL sketch follows the list):

  1. rename existing table and named constraints
  2. create table with new schema
  3. bulk insert from the old table into the new one; if required, break on the id and do it in batches of 100,000* records using a while loop (* a number that suits the hardware & environment)
  4. drop old table - reclaim space here
  5. create indexes with proper fillfactors
  6. create constraints (after indexes)
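
If it helps, here is a minimal T-SQL sketch of those steps. The table dbo.BigTable, its columns, the constraint name PK_BigTable, the batch size, and the fill factors are all placeholders you would swap for your own:

    -- 1. rename the existing table and its named constraints out of the way
    EXEC sp_rename 'dbo.BigTable', 'BigTable_old';
    EXEC sp_rename 'dbo.PK_BigTable', 'PK_BigTable_old';

    -- 2. create the table with the new schema / column order
    CREATE TABLE dbo.BigTable
    (
        Id        INT          NOT NULL,
        ColA      VARCHAR(50)  NOT NULL,
        NewColumn INT          NULL,       -- the column being sandwiched in
        ColB      DATETIME     NOT NULL
    );

    -- 3. copy the rows across in key-range batches to keep the transaction log in check
    DECLARE @BatchSize INT = 100000,       -- tune to the hardware & environment
            @MinId     INT,
            @MaxId     INT;

    SELECT @MinId = MIN(Id), @MaxId = MAX(Id) FROM dbo.BigTable_old;

    WHILE @MinId <= @MaxId
    BEGIN
        INSERT INTO dbo.BigTable (Id, ColA, NewColumn, ColB)
        SELECT Id, ColA, NULL, ColB
        FROM dbo.BigTable_old
        WHERE Id >= @MinId
          AND Id <  @MinId + @BatchSize;

        SET @MinId = @MinId + @BatchSize;
    END

    -- 4. drop the old table to reclaim space
    DROP TABLE dbo.BigTable_old;

    -- 5. create the indexes with suitable fill factors
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_Id
        ON dbo.BigTable (Id) WITH (FILLFACTOR = 90);
    CREATE NONCLUSTERED INDEX IX_BigTable_ColB
        ON dbo.BigTable (ColB) WITH (FILLFACTOR = 90);

    -- 6. re-create constraints last (defaults, checks, foreign keys)
    ALTER TABLE dbo.BigTable
        ADD CONSTRAINT CK_BigTable_ColA CHECK (ColA <> '');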