SQL Server – the best way to store optional values in a table

null | performance | sql-server | sql-server-2019

I have a large table that will be partitioned (about 300 million rows). The table currently has 57 columns. We also have an 'Option' column, which holds values that are null for most records. The column type is varchar(max). In other words, this column stores certain business characteristics of the row. In the extreme case, when all of these values are present, the length of this column will exceed 8,000 characters; it may also be 0 characters.

There will be around 300 inserts per second and around 200-250 updates per second on this table (some of those updates also touch the Option column, for example to add a new value).

The question is whether it is a better approach to create new columns for these values (which would require an additional 63 columns) that would mostly be null for a given row, or whether it is better to keep everything in varchar(max).

Best Answer

Per your comment:

There will be around 300 inserts per second and around 200-250 updates per second on this table (some of those updates also touch the Option column, for example to add a new value).

This leaves us with two options:

  1. Store the optional columns in their own table(s) and create a view to stitch everything together. In this case, the PK of the optional table(s) is also the primary key of the main table, migrated through an FK constraint (see the sketch after this list).
  2. Just deal with the small per-row penalty for the extra NULL columns (roughly 1 bit per column in the row's NULL bitmap). Whether that is acceptable depends on how often the optional 63 columns are populated versus the other rows, and what the actual differential in storage would be.

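A minimal sketch of option 1, using hypothetical names throughout (`Orders`, `OrderOptions`, and a handful of made-up option columns stand in for your real schema):

```sql
-- Main table: only the frequently populated columns live here.
CREATE TABLE dbo.Orders
(
    OrderId     bigint       NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
    CustomerId  int          NOT NULL,
    CreatedAt   datetime2(0) NOT NULL
    -- ... the other frequently populated columns ...
);

-- Optional columns get their own table. Its PK is also an FK back to
-- the main table, so there is at most one optional row per order, and
-- rows only exist for orders that actually have option values.
CREATE TABLE dbo.OrderOptions
(
    OrderId  bigint        NOT NULL
        CONSTRAINT PK_OrderOptions PRIMARY KEY
        CONSTRAINT FK_OrderOptions_Orders
            FOREIGN KEY REFERENCES dbo.Orders (OrderId),
    OptionA  varchar(100)  NULL,
    OptionB  decimal(19,4) NULL,
    OptionC  date          NULL
    -- ... remaining optional columns ...
);
GO

-- The view stitches everything back together. Orders with no options
-- simply show NULLs for the optional columns.
CREATE VIEW dbo.OrdersWithOptions
AS
SELECT o.OrderId, o.CustomerId, o.CreatedAt,
       op.OptionA, op.OptionB, op.OptionC
FROM dbo.Orders AS o
LEFT JOIN dbo.OrderOptions AS op
    ON op.OrderId = o.OrderId;
```
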
With respect to explicit columns vs. VARCHAR(MAX), it is definitely worth defining/creating the columns, if for nothing else than the sanity of your end users. You can certainly group related columns together (those that are/will be NULL together) in their own tables if the data size is relatively large, or just deal with having one (potentially) wide table in which most of the optional columns are empty.

Do not use EAV (Entity-Attribute-Value). There are plenty of SO questions and articles you can find outlining why, and I won't repeat them here.

OK, so the option with VARCHAR(max) is the worst option? Why would a new partitioned table with only the optional columns be a better solution?

I'd say the worst option would still be EAV. Using a VARCHAR column to store multiple values violates first normal form (which exists for a very good reason), but from a practical standpoint, here are the drawbacks:

  1. You can't ensure the datatypes of the stored values are consistent
  2. Making sense of the data requires someone who knows how the data is encoded, and then they have to extract it. Formats like JSON and XML can reduce that somewhat, but they have storage overhead and are still not as easy to work with as just writing a simple SELECT statement (see the comparison after this list).
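
To make drawback 2 concrete, here is a hedged comparison (the table and column names are made up): pulling one value out of a JSON blob versus reading a typed column.

```sql
-- Hypothetical: the option data encoded as JSON in a varchar(max)
-- 'Options' column. Every query must parse the blob, JSON_VALUE always
-- returns a string, and nothing enforces the datatype on write.
SELECT OrderId,
       TRY_CAST(JSON_VALUE(Options, '$.creditLimit') AS decimal(19,4)) AS CreditLimit
FROM dbo.Orders
WHERE JSON_VALUE(Options, '$.creditLimit') IS NOT NULL;

-- With an explicit column, the datatype is enforced on write, the query
-- is trivial to read, and the column can be indexed directly.
SELECT OrderId, CreditLimit
FROM dbo.OrderOptions
WHERE CreditLimit IS NOT NULL;
```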

A relevant example from earlier in my career: there was a need to store derived characteristics for each customer in order to calculate a price. These varied based on which price was being determined, and there were over 500 characteristics in total. Someone "solved" the problem by concatenating all of the characteristics into one string and loading that into a VARCHAR column.

This had the following disadvantages:

  1. A huge amount of effort had to go into ensuring the resulting string was valid, which complicated the business rules and made changes difficult (a lot of testing, since changing one thing could break another)
  2. Parsing the data was intensive and required operations that couldn't be performed easily in the database. So there was a whole other program that would pull the data out, parse the 500-odd columns, then return them. Slow, and the typical developer/user had no clue how anything worked. And if you only wanted "Characteristic 26B" or whatever, you had to process every row, whether or not "26B" made any sense at all for that particular row.

Currently, I deal with a lot of aggregated data that is ridiculously wide (usually 100+ columns, most NULLable). Given the type of workload (ad hoc queries selecting anywhere between 5-20% of the columns), I will usually create a table for each NULLable column or subject area, then combine everything into a single view. SQL Server is great at join elimination, so we've been able to cut the amount of data read for a given query substantially. This results in faster performance, and we don't have to rebuild the table every time a new data element is added (just create a new table, load the data, then alter the view), as sketched below.
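
A sketch of that pattern, reusing the hypothetical names from the earlier example. Join elimination applies because the LEFT JOIN is on the optional table's primary key, so the optimizer can prove the join changes no rows and can skip the optional table entirely when none of its columns are selected:

```sql
-- This query never touches dbo.OrderOptions: the LEFT JOIN is on
-- OrderOptions' primary key, so it can neither add nor remove rows,
-- and the optimizer eliminates the join, reading only dbo.Orders.
SELECT OrderId, CustomerId
FROM dbo.OrdersWithOptions;

-- Adding a new data element: create a new table, load it, alter the view.
CREATE TABLE dbo.OrderShipping
(
    OrderId    bigint       NOT NULL
        CONSTRAINT PK_OrderShipping PRIMARY KEY
        CONSTRAINT FK_OrderShipping_Orders
            FOREIGN KEY REFERENCES dbo.Orders (OrderId),
    Carrier    varchar(50)  NULL,
    ShippedAt  datetime2(0) NULL
);
GO

ALTER VIEW dbo.OrdersWithOptions
AS
SELECT o.OrderId, o.CustomerId, o.CreatedAt,
       op.OptionA, op.OptionB, op.OptionC,
       sh.Carrier, sh.ShippedAt
FROM dbo.Orders AS o
LEFT JOIN dbo.OrderOptions  AS op ON op.OrderId = o.OrderId
LEFT JOIN dbo.OrderShipping AS sh ON sh.OrderId = o.OrderId;
```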

If your situation were closer to that of a DWH (a lot of inserts, relatively few updates, and a lot of ad-hoc queries), then a clustered columnstore index would be a good option, but that doesn't seem to be the case here.
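
For completeness, a clustered columnstore index would look like this (again, the table name is made up, and this is not being recommended for the update-heavy workload described above):

```sql
-- Replaces the table's rowstore with column-oriented storage: great for
-- scan-heavy, insert-mostly analytic workloads, poor under frequent updates.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_OrdersHistory
ON dbo.OrdersHistory;
```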