Sql-server – Should I used varchar(max) or varchar(4000) SPARSE

sparse-columnsql servervarchar

I have a "comment"-type column that is rarely used — around 6% non-null in a population of 3 million records. The average length (when used) is 6 characters, and the max so far is around 3KB. A maximum of 4000 characters is reasonable for this field. I have two options:

comments varchar(max) NULL -- this is the current column definition
comments varchar(4000) SPARSE NULL

My current understanding is that in both cases, a NULL value would require no storage — just the column's NULL bit set and a length of 0 in the row metadata.

But for the non-null cases, does one have a clear advantage over the other?

The extra 4-byte pointer for sparse columns with values suggests they are always stored off-row like text or very large varchar(max) fields. Is that the case?

If so, I'd lean toward using varchar(max), since it only stores values off-row if the total row length exceeds 8KB, and the majority of my values are short and unlikely to put a row over the limit.

I haven't seen this particular situation addressed in the BOL, so I'm hoping someone here knows enough about the innards of MSSQL to give some insight.

(If it matters, I'm currently using 2008R2, but hoping to upgrade soon to 2014.)

Best Answer

There is no advantage for the non-NULL cases when using SPARSE, and in fact, there are two stated disadvantages:

  • an extra 4 bytes per each non-NULL value
  • slightly longer access time

As you pretty much already gathered, the SPARSE option only makes sense for fixed-length datatypes; I can't think of a single reason to use it on variable-length types.

I am not sure that the extra 4 bytes implies anything about it being stored off-row, and the MAX types aren't entirely off-row when they exceed 8000 bytes as there is then the 16-byte pointer in the row to that off-row location.

Stick with VARCHAR(4000), no SPARSE, and I would even consider making it NOT NULL DEFAULT('') (an empty string is still 0 bytes, but now you don't need to mess with the NULL indicator, and can a comment really be "unknown" as opposed to "no comment"?).