I have a large table that will be partitioned (about 300 million rows). There are currently 57 columns in the table. We also have an 'Option' column of type varchar(max), which holds values that are null for most records. In other words, this column stores certain business characteristics of the row. In the extreme case, when all of these values occur, the length of this column will exceed 8,000 characters; it may also be empty.
There will be around 300 inserts per second and around 200-250 updates per second on this table (some updates also modify the Option column, for example by adding a new value).
The question is whether it is a better approach to create new columns for these values (which would require an additional 63 columns, mostly null for any single row), or whether it is better to keep everything in varchar(max)?
Best Answer
Per your comment, this leaves us with two options: explicit columns or VARCHAR(MAX).

With respect to explicit columns vs. VARCHAR(MAX), it is definitely worth defining/creating the columns, if for nothing else than the sanity of your end users. You can certainly group related columns together (those that are/will be NULL together) in their own tables if the data size is relatively large, or just deal with having one (potentially) wide table with mostly NULL values in it.

Do not use EAV. There are plenty of SO questions and articles you can find outlining why, and I won't repeat them here.
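As a sketch of the "group related columns into their own tables" approach (all table and column names here are hypothetical, not taken from the question):

```sql
-- Core table keeps the always-populated columns.
CREATE TABLE dbo.Orders
(
    OrderID    BIGINT       NOT NULL PRIMARY KEY,
    CustomerID BIGINT       NOT NULL,
    CreatedAt  DATETIME2(3) NOT NULL
    -- ... remaining always-populated columns ...
);

-- Characteristics that are NULL for most rows live in a side table;
-- a row exists here only when at least one of these values is present.
CREATE TABLE dbo.OrderShippingOptions
(
    OrderID        BIGINT        NOT NULL PRIMARY KEY
        REFERENCES dbo.Orders (OrderID),
    GiftWrap       BIT           NULL,
    DeliveryWindow VARCHAR(20)   NULL,
    Instructions   NVARCHAR(500) NULL
);
```

Queries that don't need these characteristics never touch the side table, and rows without them cost no storage there at all.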
I'd say the worst option would still be EAV, but using a VARCHAR column to store multiple values violates first normal form (which exists for a very good reason). From a practical standpoint, the main drawback is that no individual value can be referenced directly: every consumer has to parse the string apart in the SELECT statement (or in application code) before anything can be used.

A relevant example from earlier in my career: there was a need to store derived characteristics for each customer in order to calculate a price. These varied based on which price was being determined, and there were over 500 characteristics in total. Someone "solved" the problem by concatenating all of the characteristics into one string and loading that into a VARCHAR column. This had the predictable disadvantages of a delimited string: nothing could be queried, validated, or indexed without first parsing the value back apart.
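To illustrate the parsing burden (using a hypothetical pipe-delimited value, not the actual format from that system), extracting even the third characteristic forces string-splitting logic into every query:

```sql
-- Every query that needs one characteristic must re-implement the parsing.
DECLARE @Characteristics VARCHAR(8000) = 'A|42|RED|NORTH';

-- Extract the third value the hard way, with nested CHARINDEX calls:
SELECT SUBSTRING(
           @Characteristics,
           CHARINDEX('|', @Characteristics, CHARINDEX('|', @Characteristics) + 1) + 1,
           CHARINDEX('|', @Characteristics + '|',
               CHARINDEX('|', @Characteristics, CHARINDEX('|', @Characteristics) + 1) + 1)
             - CHARINDEX('|', @Characteristics, CHARINDEX('|', @Characteristics) + 1) - 1
       ) AS ThirdCharacteristic;  -- 'RED'
```

With explicit columns, the same request is just `SELECT Color FROM ...`, and the value can be typed, constrained, and indexed.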
Currently, I deal with a lot of aggregated data that is ridiculously wide (usually 100+ columns, most NULLable). Given the type of workload (ad hoc queries selecting anywhere between 5-20% of the columns), I will usually create a table for each NULLable column or subject area, then combine everything into a single view. SQL Server is great at join elimination, so we've been able to cut the amount of data read for a given query substantially. This results in faster performance and we don't have to rebuild the table every time a new data element is added (just create a new table, load the data, then alter the view).
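A minimal sketch of that table-per-subject-area-plus-view pattern (names are illustrative). Because the side table is joined on its primary key, SQL Server can eliminate the LEFT JOIN for any query that doesn't reference that table's columns:

```sql
CREATE TABLE dbo.FactBase
(
    RowID     BIGINT NOT NULL PRIMARY KEY,
    CoreValue INT    NOT NULL
);

CREATE TABLE dbo.FactExtraA
(
    RowID   BIGINT         NOT NULL PRIMARY KEY
        REFERENCES dbo.FactBase (RowID),
    ExtraA1 DECIMAL(18, 4) NULL,
    ExtraA2 VARCHAR(50)    NULL
);
GO

-- The view presents one logical wide table to ad hoc users.
CREATE VIEW dbo.FactWide
AS
SELECT b.RowID,
       b.CoreValue,
       a.ExtraA1,
       a.ExtraA2
FROM dbo.FactBase AS b
LEFT JOIN dbo.FactExtraA AS a
       ON a.RowID = b.RowID;
GO

-- This query never touches FactExtraA: since a.RowID is that table's
-- primary key, the LEFT JOIN cannot change the row count, so the
-- optimizer eliminates the join entirely.
SELECT RowID, CoreValue
FROM dbo.FactWide;
```

Adding a new data element then means creating a new side table, loading it, and altering the view, rather than rebuilding the wide table.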
If your situation were closer to that of a DWH (many inserts, relatively few updates, and a lot of ad-hoc queries), then a clustered columnstore index would be a good option, but that doesn't seem to be the case here.
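For completeness, if the workload ever did shift toward that DWH pattern, the index itself is simple to create (table and index names here are illustrative):

```sql
-- Only appropriate for insert-heavy, scan-heavy workloads; the
-- frequent updates described in the question would fragment a
-- columnstore's delta/deleted-row structures badly.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_BigTable
    ON dbo.BigTable;
```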