I am working with a table in which all character columns are `NVARCHAR`, some of them `NVARCHAR(MAX)`. We are converting all of these to `VARCHAR` and specifying a character width based upon the actual usage in production. The production data uses a range of 2 up to 900 characters of actual used width for any given column, and we are going to add padding of 10% where applicable.
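For reference, the "actual used width plus 10% padding" per column can be measured with a query along these lines (the table and column names here are only illustrative):

```sql
-- Widest value actually stored in a given column, plus 10% padding.
-- LEN() counts characters (ignoring trailing spaces), which matches
-- the width we want to declare; DATALENGTH() would count bytes instead.
SELECT
    MAX(LEN([SubType]))                 AS MaxUsedWidth,
    CEILING(MAX(LEN([SubType])) * 1.1)  AS ProposedWidthWithPadding
FROM dbo.Listings;
```

Repeat (or generate via dynamic SQL over `sys.columns`) for each column being converted.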
```sql
-- Insert statements for procedure here
UPDATE Listings WITH (ROWLOCK)
SET [SubType] = 'S'
WHERE @idSettings = idSettings
  AND (@idRetsClass = 0 OR idRetsClass = @idRetsClass)
  AND (@idRetsSetting = 0 OR idRetsSetting = @idRetsSetting)
  AND IsNew = 1
  AND ([SubType] LIKE '%Single Family Home%' OR [SubType] LIKE '%Modular%'
       OR [SubType] LIKE '%Mobile Home%' OR [SubType] LIKE '% Story%'
       OR [SubType] = '' OR [SubType] = 'residential - S'
       OR [SubType] = '1 House on Lot' OR [SubType] = '2 Houses on Lot'
       OR [SubType] = 'Detached' OR [SubType] LIKE '%single family%'
       OR [SubType] = 'ranch' OR [SubType] = 'Semi-Detached'
       OR [SubType] = 'single' OR [SubType] = 'one family'
       OR [SubType] = 'Residential' OR [SubType] = 'Ranch Type'
       OR [SubType] = '2 or More Stories' OR [SubType] = 'Cape Cod'
       OR [SubType] = 'Split Level' OR [SubType] = 'Bi-Level'
       OR [SubType] = 'Detached Single' OR [SubType] = 'Single-Family Homes'
       OR [SubType] = 'house' OR [SubType] = 'detached housing'
       OR [SubType] = 'det')
```
This is a large overhaul of a table that consists of literally 140 `NVARCHAR` columns, 11 of them `MAX`. I am dropping 30 indexes and recreating them afterwards.
My question is: in what situations is `VARCHAR(MAX)` preferred? Only when you expect 4000 or more characters? What should I learn about and prepare for when doing this? Will this improve performance when a clustered index update that affects the clustered key has to update all of the non-clustered indexes? We have update procedures timing out in which 75% to 95% of the displayed query execution plan is spent on a clustered index update.
Best Answer
This is a little lengthy, perhaps, but your actual problem(s) cannot be solved by looking at execution plans. There are two main issues, and both are architectural.
Distractions
Let's start with the items that are not your major problem areas. These are things that should still be looked into, as it definitely helps performance to use the datatypes that you actually need and not just a general, one-size-fits-most datatype. There is a very good reason why the different datatypes exist: if storing 100 characters in `NVARCHAR(MAX)` had no negative impact on queries (or any other aspect of the system), then everything would be stored as `NVARCHAR(MAX)`. However, cleaning up these areas won't lead to true scalability.

to MAX, or not to MAX
Ok. Having only character columns is not necessarily a bad thing, though most often there is at least one field of a numeric type as an ID, and there are certainly valid cases for the scenario described so far. Nor is there anything inherently bad about `MAX` fields, since SQL Server will store the data on the data page (i.e. in row) if it fits there, and in that situation a `MAX` column should perform about as well as a non-MAX value of the same datatype. But yes, a bunch of `MAX`-type fields is a sign of sloppiness in the data modeling, and it is far more likely that most (or all) of that `MAX` data is stored in separate data pages (i.e. off row) that need an extra lookup, hence less efficient.

VARCHAR vs NVARCHAR
Ok, but why exactly? (Yes, I know the info and comments that follow this statement add clarity, but I am going in order to preserve the conversational aspect for a reason.) Each datatype has its place. `VARCHAR` is 1 byte per character and can represent 256 characters (most of the time) as defined on a single code page. While character values 0 through 127 are the same between code pages, character values between 128 and 255 can change from code page to code page.

Please note that it is possible for `VARCHAR` data to take up 2 bytes per character and represent more than 256 characters. For more information on Double-Byte Character Sets, please see the following answer: Storing Japanese characters in a table.

`NVARCHAR` is stored as UTF-16 (Little Endian) and is either 2 or 4 bytes per character, which can represent the full Unicode spectrum. So, if your data will ever need to store more characters than can be represented by a single code page, then switching to `VARCHAR`
won't truly help you.

Prior to converting to `VARCHAR`, you need to make sure that you are not storing any Unicode characters: check whether any rows contain data that cannot be converted to `VARCHAR` without losing anything.
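Such a check can be sketched as follows (the table and column names are illustrative). It works because converting to `VARCHAR` replaces any character not on the target code page with `?`, so a round-trip that changes the value indicates data loss:

```sql
-- Find rows whose [SubType] would lose characters if converted to VARCHAR.
-- (Table and column names are only an example; run per NVARCHAR column.)
SELECT idSettings, [SubType]
FROM   dbo.Listings
WHERE  [SubType] <> CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(MAX), [SubType]));
```

Note that a literal `?` already present in the data can round-trip cleanly, so this is a practical screen rather than an airtight proof.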
`NVARCHAR` works like this: the declared max length of an `NVARCHAR` field is a number of 2-byte units, so `NVARCHAR(50)` allows for a maximum of 100 bytes. How many characters fit into those 100 bytes depends on how many 4-byte characters there are: with none, all 50 characters fit; with every character taking 4 bytes, only 25 fit; and many combinations lie in between.

Another thing to consider regarding space taken up by
`VARCHAR` vs `NVARCHAR`: starting in SQL Server 2008 (Enterprise and Developer editions only!) you can enable Row or Page compression on tables, indexes, and indexed views. For situations where much of the data within an `NVARCHAR` field can actually fit within `VARCHAR` without any data loss, compression will allow characters that do fit into `VARCHAR` to be stored as 1 byte, and only characters that require 2 or 4 bytes will take up that much space. This should remove one of the larger reasons that people often choose to stick with `VARCHAR`. For more info on compression, please see the MSDN page for Creating Compressed Tables and Indexes. Please note that data in `MAX` datatypes that is stored off-row is not compressible.

Real Areas of Concern
The following areas should be addressed if you want this table to be truly scalable.
Problemo Numero Uno
Uh, what? Have you added all of those values up? Given how many `MAX` fields you have, it is possible that one or more of those fields holds 900 characters, and even though that should equate to 1800 bytes, the value stored on the main data page is just a 24-byte pointer (not always 24 bytes, as the size varies based on several factors). And that could be why there are so many `MAX` fields: they couldn't fit in another `NVARCHAR(100)` column (taking up to 200 bytes), but they did have room for a 24-byte pointer.

If the goal is to improve performance, then converting the full strings to codes is, on some levels, a step in the right direction. You are drastically reducing the size of each row, which is more efficient for the buffer pool and disk I/O, and shorter strings take less time to compare. That's good, but not awesome.
If the goal is to dramatically improve performance, then converting to codes is the wrong step in the right direction. It still relies upon string-based scans (with 30 indexes and 140 columns, there are likely to be a lot of scans, unless most of the fields are not used for filtering), and I assume those will be case-insensitive scans at that, which are less efficient than case-sensitive or binary scans (i.e. using a case-sensitive, or binary, collation).
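To illustrate the collation point (table and column names are illustrative; the answer's point concerns the collation a column is *defined* with, whereas per-query `COLLATE` on a column generally prevents an index seek on it):

```sql
-- Under a typical case-insensitive collation, this matches 'Single Family',
-- 'SINGLE FAMILY', etc. -- every comparison pays for case folding:
SELECT COUNT(*)
FROM   dbo.Listings
WHERE  [SubType] = 'single family';

-- A binary comparison matches exact bytes only, making each compare cheaper:
SELECT COUNT(*)
FROM   dbo.Listings
WHERE  [SubType] COLLATE Latin1_General_100_BIN2 = 'Single Family';
```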
Additionally, converting to string-based codes ultimately misses the point of how to properly optimize a transactional system. Are these codes going to be entered on a search form? Having people use `'S'` for `[SubType]` is far less meaningful than searching on `'Single Family'`.

There is a way to retain your full descriptive text while both reducing the space used and greatly speeding up queries: create a lookup table. You should have a table named `[SubType]` that stores each of the descriptive terms distinctly and has a `[SubTypeID]` for each one. If the data is part of the system (i.e. an `enum`), then the `[SubTypeID]` field should not be an `IDENTITY` field, as the data should get populated via a release script. If the values are entered by end users, then the `[SubTypeID]` field should be an `IDENTITY`. In both situations:

- `[SubTypeID]` is the Primary Key.
- There is most likely no need for `[SubTypeID]` to be an `INT`; consider `SMALLINT`. If you start numbering at 1 (either manually or via an IDENTITY seed), you get a max of 32,767 values. But if you start at the lowest value, -32,768, you get the full 65,536 values to use.
- The second field is `[SubType]` (same as the table name), or maybe `[SubTypeDescription]`.
- Put a `UNIQUE INDEX` on `[SubTypeDescription]`. Keep in mind that indexes have a max size of 900 bytes. If the max length of this data in Production is 900 characters, and if you do need `NVARCHAR`, then this might work with compression enabled, OR use `VARCHAR` only if you definitely do NOT need to store Unicode characters, ELSE enforce uniqueness via an `AFTER INSERT, UPDATE` trigger.
- The `[Listings]` table gets a `[SubTypeID]` field.
- The `[SubTypeID]` field in the `[Listings]` table is a Foreign Key, referencing `[SubType].[SubTypeID]`.
- Queries can now `JOIN` the `[SubType]` and `[Listings]` tables and search on the full text of `[SubTypeDescription]` (case-insensitive, even, same as current functionality), while using that ID to perform a very efficient search on the indexed FK field in `[Listings]`.

This approach can (and should) be applied to other fields in this table (and other tables) that behave similarly.
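As a sketch, the lookup table described above might look like the following (names, the `SMALLINT` key, and the `VARCHAR(900)` width are assumptions based on the discussion, not a prescription):

```sql
-- Lookup table holding each descriptive term exactly once.
CREATE TABLE dbo.SubType
(
    SubTypeID          SMALLINT     NOT NULL
        CONSTRAINT PK_SubType PRIMARY KEY,
    SubTypeDescription VARCHAR(900) NOT NULL
        CONSTRAINT UQ_SubType_SubTypeDescription UNIQUE
);

-- Narrow, indexable FK on the main table.
ALTER TABLE dbo.Listings
    ADD SubTypeID SMALLINT NULL
        CONSTRAINT FK_Listings_SubType
        FOREIGN KEY REFERENCES dbo.SubType (SubTypeID);

-- Searches keep the full descriptive text while the actual filtering
-- happens on the small indexed key:
SELECT L.idSettings
FROM   dbo.Listings AS L
INNER JOIN dbo.SubType AS ST
        ON ST.SubTypeID = L.SubTypeID
WHERE  ST.SubTypeDescription = 'Single Family Home';
```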
Problemo Numero Dos
If this is a transactional system and not a data warehouse, then I would say that, generally, 140 columns is too many to handle efficiently. I highly doubt that all 140 fields are used at the same time and/or have the same use cases. The fact that 11 are `MAX` is not in itself a problem if they truly need to hold more than 4000 characters. BUT, having 30 indexes on a transactional table is again a bit unwieldy (as you are clearly seeing).

Is there a technical reason why the table needs to have all 140 fields? Can those fields be split into a few smaller groups? Consider the following:
- Keep the main table named `[Listing]` (I prefer to stick with singular words so that the ID field can simply be `TableName + "ID"`).
- The `[Listing]` table has this PK: `[ListingID] INT IDENTITY(1, 1) NOT NULL CONSTRAINT [PK_Listing] PRIMARY KEY`.
- Secondary tables are named `[Listing{GroupName}]` (e.g. `[ListingPropertyAttribute]` -- "attributes" as in NumberOfBedrooms, NumberOfBathrooms, etc.).
- The `[ListingPropertyAttribute]` table has this PK: `[ListingID] INT NOT NULL CONSTRAINT [PK_ListingPropertyAttribute] PRIMARY KEY CONSTRAINT [FK_ListingPropertyAttribute_Listing] FOREIGN KEY REFERENCES [Listing] ([ListingID])`.
- Note that there is no `IDENTITY` here: the secondary tables share the core table's ID values.
- The `[Listing]` table gets both `[CreatedDate]` and `[LastModifiedDate]` fields.
- The secondary tables only get a `[LastModifiedDate]` field. The assumption is that all secondary tables get their rows populated at the same time as the "core" table (i.e. all rows should always be represented across all "secondary" tables). Hence the `[CreatedDate]` value in the "core" `[Listing]` table would always be the same across all "secondary" tables, on a per-row basis, so there is no need to duplicate it across the "secondary" tables. But they can each be updated at different times.

This structure does increase the number of JOINs that many queries will need, though one or more views can be created to encapsulate the more frequently used JOINs, for coding convenience. But on the plus side, each row becomes much narrower, updates touch only the group of columns that actually changed, and each table carries only the indexes its own columns need.
Recap
The current model is designed to be inefficient, and it seems to be fulfilling that design goal (i.e. it is slow). If you want the system to be fast, then the data model needs to be designed to be efficient, not merely less inefficient.