SQL Server – What are the current best practices concerning varchar sizing in SQL Server

Tags: best-practices, performance, sql-server, storage, varchar

I'm trying to understand the best way to decide how big varchar columns should be, both from storage and performance perspectives.

Performance
From my research, it seems that varchar(max) should only be used if you really need it, i.e. if the column must accommodate more than 8000 characters. One reason is the lack of indexing (though I'm a little suspicious of indexing on varchar fields in general; I'm pretty new to DB principles, so maybe that's unfounded), another is the lack of compression (more a storage concern).

In general, people seem to recommend declaring varchar(n) only as large as you actually need: oversizing is bad because queries must account for the maximum possible size. But it's also been stated that the engine uses half the declared size as its estimate of the average actual size of the data. That would imply that one should determine, from the data, what the average size is, double it, and use that as n. For data with very low but non-zero variability, though, this implies up to a 2x oversizing over the maximum size, which seems like a lot, but maybe it isn't? Insights would be appreciated.
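
For what it's worth, here is the sort of quick experiment I have been using to check that claim (a rough sketch; the table names, row count, and value length are just illustrative):

    -- Two tables holding identical 50-character values; only the declared maximum differs.
    CREATE TABLE dbo.GrantNarrow (val varchar(100)  NOT NULL);
    CREATE TABLE dbo.GrantWide   (val varchar(8000) NOT NULL);

    INSERT INTO dbo.GrantNarrow (val)
    SELECT TOP (100000) REPLICATE('x', 50)
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b;

    INSERT INTO dbo.GrantWide (val)
    SELECT val FROM dbo.GrantNarrow;

    -- Run each with the actual execution plan on and compare the Memory Grant
    -- property (or watch sys.dm_exec_query_memory_grants). The data and the sort
    -- are identical, but the varchar(8000) query should request a much larger
    -- grant if the half-of-declared-size estimate holds.
    SELECT val FROM dbo.GrantNarrow ORDER BY val OPTION (MAXDOP 1);
    SELECT val FROM dbo.GrantWide   ORDER BY val OPTION (MAXDOP 1);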

Storage
After reading about how in-row vs. out-of-row storage works, and keeping in mind that only the actual data is stored, it seems to me that the choice of n has little or no bearing on storage (beyond making sure it's big enough to hold everything). Even using varchar(max) shouldn't have any impact on storage. Instead, a goal might be to limit the actual size of each data row to ~8000 bytes if possible. Is that an accurate read on things?
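
Here is the kind of quick check that led me to that read (a sketch; names and lengths are just illustrative):

    -- Three declared sizes, one actual value.
    CREATE TABLE dbo.StorageDemo (
        v100  varchar(100),
        v8000 varchar(8000),
        vmax  varchar(max)
    );

    INSERT INTO dbo.StorageDemo
    VALUES (REPLICATE('a', 50), REPLICATE('a', 50), REPLICATE('a', 50));

    -- All three columns report 50 bytes: a variable-length column stores the
    -- actual data (plus a 2-byte length entry in the row), not the declared maximum.
    SELECT DATALENGTH(v100)  AS v100_bytes,
           DATALENGTH(v8000) AS v8000_bytes,
           DATALENGTH(vmax)  AS vmax_bytes
    FROM dbo.StorageDemo;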

Context
Some of our customer data fluctuates a little, so for those columns we generally make the declared size a little wider than it needs to be, say 15-20% bigger. I was wondering if there were any other special considerations; for example, someone I work with told me to use 2^n - 1 sizes (I have found no evidence that's actually a thing, though).

I'm talking about initial table creation. A customer will tell us that they are going to start sending us a new table, and will send sample data (or just the first production data set), which we look at and then create a table on our end to hold the data. We want the table on our end to handle future imports as well as what's in the sample. But certain values are bound to get longer, so we pad the columns.

The question is how much, and are there technical guidelines?

Best Answer

Regardless of specific datatype, you need to be able to store whatever the application requests to be stored. You cannot specify something smaller than the max size of what will actually be saved.

You also neither need nor want to specify a column length larger than the maximum size that will actually be stored, for a variety of reasons: query memory allocation, potentially filling up the maximum row size and leaving no room for adding columns in the future, etc.
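
To make the row-size point concrete, here is a minimal sketch (the table and column names are hypothetical). SQL Server will create a table whose declared maximums sum past the per-row limit, but it warns you at CREATE time, and rows that actually exceed the limit get values pushed off-row, which costs extra I/O to follow:

    -- Creates successfully, but with a warning that the maximum row size
    -- exceeds the allowed maximum of 8060 bytes.
    CREATE TABLE dbo.TooWide (
        a varchar(5000) NOT NULL,
        b varchar(5000) NOT NULL
    );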

True, variable-length string and binary columns do not have the storage implications that fixed-length datatypes (string / binary / numeric / date / etc.) do (although some of those implications can be nullified via data compression or the SPARSE column definition option). However, as you pointed out, even if there is no direct storage implication, there is still the performance implication of overestimating the memory required for queries.
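
For reference, minimal sketches of those two mitigations (names are illustrative): ROW compression stores fixed-length types in a variable-length format on disk, and SPARSE makes a mostly-NULL fixed-length column cost nothing for its NULLs.

    -- Row compression applied at table creation.
    CREATE TABLE dbo.CompressedDemo (
        id      int      NOT NULL,
        created datetime NOT NULL
    ) WITH (DATA_COMPRESSION = ROW);

    -- A mostly-NULL fixed-length column declared SPARSE.
    CREATE TABLE dbo.SparseDemo (
        id     int NOT NULL,
        rarely int SPARSE NULL
    );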

Be sensible. Use only what you need. Considerations can be made if there is a high probability that a column's length will need to increase in the near future, but keep in mind that it is easier to expand the size of a column than to reduce it. Yes, some work will be involved, but since that work is merely "potential" while the performance implications of over-sizing are "actual", it is often best to define columns based on what you actually need, not on what you maybe-kinda-sorta think you might need in the future. Many changes that are talked about never happen, and often the changes that are required cannot be foreseen. Go with what you know.

Instead, a goal might be to limit the actual size of each data row to ~8000 bytes if possible.

I am not exactly sure what you are getting at here. SQL Server physically limits you to just over 8000 bytes per row (8060 bytes, to be exact). The LOB types (VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), XML, and the deprecated TEXT, NTEXT, and IMAGE types) allow for going beyond that limitation, but only by leaving a pointer in the row: 16 or more bytes, depending on the type and, for the MAX types, on the size of the value being stored off-row. The actual physical limit of the data page does not change.
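
If you want to see where the bytes actually land, something along these lines works (a sketch; the table name is hypothetical). The large value goes to LOB_DATA pages, and the in-row portion on the IN_ROW_DATA page is just the pointer:

    CREATE TABLE dbo.LobDemo (id int NOT NULL, doc varchar(max) NOT NULL);

    -- Build a 12,000-byte value (the CONVERT keeps REPLICATE from capping at 8000).
    INSERT INTO dbo.LobDemo
    VALUES (1, REPLICATE(CONVERT(varchar(max), 'x'), 12000));

    -- Page counts per allocation unit: expect the data in LOB_DATA, not IN_ROW_DATA.
    SELECT au.type_desc, au.total_pages, au.used_pages
    FROM sys.allocation_units AS au
    JOIN sys.partitions AS p
      ON au.container_id IN (p.hobt_id, p.partition_id)
    WHERE p.object_id = OBJECT_ID(N'dbo.LobDemo');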

Your goal should be to use the least amount of physical space to store what the app / business needs to store without breaking or truncating such that the incomplete value loses meaning or causes problems downstream. If you need to store a 12,000 character thingy, then use VARCHAR(MAX) because that is what is needed. If you are storing a phone number or postal / zip code, then it would be unwise to use VARCHAR(100), and irresponsible to use VARCHAR(MAX).
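
In DDL terms, that guidance might look like this (a sketch; the lengths are illustrative, not prescriptive):

    CREATE TABLE dbo.CustomerContact (
        Phone      varchar(20)  NOT NULL,  -- generous for E.164 plus punctuation
        PostalCode varchar(10)  NOT NULL,  -- covers ZIP+4 and similar formats
        Notes      varchar(max) NULL       -- genuinely unbounded text earns MAX
    );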

some of our customer data fluctuates a little, so we generally make columns just a little wider than they need to be, say 15-20% bigger, for those columns. I was wondering if there were any other special considerations;

Don't all systems have at least some data that fluctuates? Any system that stores a person's name would qualify, right? There is a fairly large variance in length of names. And then you have someone like Prince go and change their name to a symbol and now you have an entirely different problem that is not length. This is just how things are.

But, to play devil's advocate for a moment: how can the "15-20% larger than what is needed" value not be the actual needed value? Let's say that there's a discussion about adding a new column, and someone suggests 50 characters, then someone else says, "well, 20% more is 60 so let's do 60 because someone might have 60." If it's true that a customer might have 60, then 60 is, and was always, the actual needed value, and 50 was wrong the whole time.

Of course, it would help if there was some indication as to the source of the data because:

  1. if you make "URL" 1024 and someone needs 1060, then it needed to be 1060 (similarly, if you make URL VARCHAR and get complaints that it is messing up Unicode characters which are now allowed in domain names, then it needed to be NVARCHAR), but
  2. if someone wants to add 1000 characters to a 500 character-limit comment field, then it still only needed to be 500. People can be less verbose in comments (a huge challenge for me ;-), but ProductSKU better be big enough to fit all of the customer's SKUs.

I'm talking about initial table creation. A customer will tell us that they are going to start sending us a new table, and will send sample data (or just the first production data set), which we look at and then create a table on our end to hold the data. We want the table on our end to handle future imports as well as what's in the sample. But certain values are bound to get longer, so we pad the columns. The question is how much, and are there technical guidelines?

You are making a lot of assumptions here. Sure, some fields might get bigger. But then again, they might not. Or some could get smaller. Some can change from being non-Unicode to being Unicode (once they realize that the world is getting smaller and one cannot assume that last names will only ever contain basic ASCII / US English characters). Or they could stop sending a field. Or they could add one or more fields in the future. Any combination of these and other things. So why focus only on VARCHAR columns? What if they are currently sending an INT value and in a year or two they reach the max value and start sending a BIGINT? What if they have a "status" field with values 0 - 5? Are you just going to assume INT, which is "padded" since it allows for growth, but should probably be TINYINT?
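
For the "status" example, the contrast is a single line of DDL (a hypothetical sketch; both columns shown only to compare sizes):

    CREATE TABLE dbo.FeedStatusDemo (
        StatusInt  int     NOT NULL, -- 4 bytes, "padded" for growth that may never come
        StatusTiny tinyint NOT NULL  -- 1 byte; the 0 - 5 domain still has headroom up to 255
    );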

The only thing that you can safely predict is that trying to predict how your customers' data will change will be wrong more often than it is correct. And being correct is a matter of luck / coincidence (if not luck, then just go play the lottery ;-).

So the guideline is:

  1. Don't waste time and energy on trying to answer an unanswerable question.
  2. Instead, focus on getting as much information as possible regarding your customer's actual data, and go with that (i.e. data-driven decision making ;-).

You already have example data, great. But please don't forget that you also have your customer's contact info: phone and/or email. Contact them! Ask them for their data specs (just as in your system, the data currently in their system might have a max length of 35 while their system defines the column as VARCHAR(50) and will accept up to that length, in which case you should use 50). And ask them if they have any near-term plans to change any of those datatypes (type and/or size).
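
And once the sample file is loaded somewhere queryable, the "data-driven" part can be as simple as profiling it and comparing the results against the customer's stated specs (a sketch; the staging table and column names are hypothetical):

    -- Observed maximums in the sample...
    SELECT MAX(LEN(LastName))   AS max_lastname_len,
           MAX(LEN(ProductSKU)) AS max_sku_len
    FROM staging.CustomerSample;
    -- ...then size to the customer's declared spec, not just the sample:
    -- if their system defines LastName as VARCHAR(50), declare VARCHAR(50),
    -- because their system will happily send values up to that length.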