Sql-server – the best practice for storing large data when nvarchar(4000) is sufficient 99% of the time

sql-server sql-server-2005

I have a table with about 100MM rows that stores information about a user.

CREATE TABLE [dbo].[UserData](
        [UserDataID] [int] IDENTITY(1,1) NOT NULL,
        [UserID] [int] NOT NULL,
        [DataId] int NOT NULL,
        [DataValue] [nvarchar](4000) NOT NULL,
        [EncryptedDataValue] [varbinary](max) NULL)

I need to be able to store values in DataValue that are longer than nvarchar(4000) allows. But only a very small percentage of the rows actually need this column to be nvarchar(max). I know that as soon as DataValue is over nvarchar(4000), internally, sql will store the data as a blob[?], substantially increasing the time it takes to make this change. (I'm not sure how read/write time will be affected later.)

I thought of a few potential options…

1. Change DataValue from nvarchar(4000) to nvarchar(max) and just eat the time it takes to make the change, not worrying that only 1% of the rows are using MAX?
2. Alongside DataValue, add a DataValueXL column that is nvarchar(max) and introduce application logic to save in the appropriate column depending on the size of the data? (Marking both as NULL)
3. Create a new table FKed to UserDataId to store only large DataValues > 4000?

Which–if any–should I go with?

thanks

Best Answer

as soon as DataValue is over nvarchar(4000), internally, sql will store the data as a blob

This is incorrect. SQL Server can use off-row storage for both max and non-max types. In other words, your NVARCHAR(4000) values may also be stored off-row. The actual storage location depends on whether the row fits on a page and on the combination of the sp_tableoption settings 'large value types out of row' and 'text in row'.
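To see where the data for a table actually lives, the catalog views can be queried directly. The following is a minimal sketch, assuming the dbo.UserData table from the question. It reads the two table options from sys.tables and then lists page counts per allocation unit type (IN_ROW_DATA, ROW_OVERFLOW_DATA, LOB_DATA).

```sql
-- Current values of the two table options for the table from the question.
SELECT name,
       large_value_types_out_of_row,  -- 0 = keep max values in-row when they fit
       text_in_row_limit              -- 0 = 'text in row' is off
FROM sys.tables
WHERE object_id = OBJECT_ID(N'dbo.UserData');

-- Pages used per allocation unit type for the same table.
SELECT au.type_desc,
       SUM(au.total_pages) AS total_pages,
       SUM(au.used_pages)  AS used_pages
FROM sys.partitions AS p
JOIN sys.allocation_units AS au
    ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)     -- in-row and row-overflow
    OR (au.type = 2       AND au.container_id = p.partition_id) -- LOB
WHERE p.object_id = OBJECT_ID(N'dbo.UserData')
GROUP BY au.type_desc;

-- Changing the option, if desired (0 is the default):
-- EXEC sp_tableoption N'dbo.UserData', 'large value types out of row', 0;
```

If ROW_OVERFLOW_DATA shows used pages while the column is still nvarchar(4000), that would be the off-row storage for a non-max type described above.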

Your best bet is to change the column to nvarchar(max). SQL Server will store the value in-row whenever possible, offering fast access. This makes application programming much simpler, as you don't have to pick the storage location (column) based on size. You should only be concerned if you object to adding the first max type column to a table (e.g. it will prevent online rebuild operations), but you are already willing to add one, so that should not be an issue.
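The change itself is a single statement. This is a sketch against the dbo.UserData table from the question, not a tested migration script. How long it runs on 100MM rows is not something it claims, so it is worth trying on a restored copy first. NOT NULL is restated so the column's nullability stays explicit.

```sql
-- Widen the column; NOT NULL is repeated to keep the existing nullability.
ALTER TABLE dbo.UserData
    ALTER COLUMN DataValue nvarchar(max) NOT NULL;

-- Confirm the new definition: max types report max_length = -1.
SELECT c.name,
       t.name AS type_name,
       c.max_length,
       c.is_nullable
FROM sys.columns AS c
JOIN sys.types   AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.UserData')
  AND c.name = N'DataValue';
```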

Related Solutions

Sql-server – Data type for storing an array of flags (a bitmap/bit array)

I can't advocate strongly enough not to use a single field for this.

I'm currently dealing with maintaining a very large dataset with a bigint bitmask field and it's a bit of a performance nightmare.

If you check a single bit, it's fine. If you check more than one bit, performance degrades very quickly.

Due to the nature of bitmask integers, the data distribution will be very imbalanced and you'll get suboptimal plans.

Multiple bit checks result in range or index scans with a function running against every row. It's a mess.
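For illustration, the kind of predicate being described probably looks like the following. The table dbo.FlaggedItem and its Flags column are hypothetical names, not taken from the original dataset.

```sql
-- Hypothetical table with a bigint bitmask column.
CREATE TABLE dbo.FlaggedItem (
    ItemID int    NOT NULL PRIMARY KEY CLUSTERED,
    Flags  bigint NOT NULL
);

-- Single-bit check: is bit value 4 set?
SELECT ItemID
FROM dbo.FlaggedItem
WHERE (Flags & 4) = 4;

-- Multi-bit check: are bit values 1 and 4 both set?
-- The bitwise expression has to be computed for each row examined.
SELECT ItemID
FROM dbo.FlaggedItem
WHERE (Flags & 5) = 5;
```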

My workaround was simple: I made one table per condition to be checked, each storing the PK of the rows that meet that condition. This is counter-intuitive initially, but the space needed is low (you only store the PK) and lookups are lightning fast, especially if you use a UNIQUE CLUSTERED INDEX.

You can add as many conditions as you want without affecting your main table, and updates also don't affect your main table.

Indexing is simple, since you just index each lookup table individually. And since the clustered key is the same on your main table and on the lookups, all your evaluations are merge joins, which are very efficient.
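A minimal sketch of that layout, assuming a main table keyed on a single int column. All table, column, and index names here are hypothetical.

```sql
-- Hypothetical main table; the bitmask column is gone.
CREATE TABLE dbo.Item (
    ItemID int NOT NULL,
    CONSTRAINT PK_Item PRIMARY KEY CLUSTERED (ItemID)
);

-- One narrow table per condition: a row exists only when the condition holds.
CREATE TABLE dbo.ItemIsActive (ItemID int NOT NULL);
CREATE UNIQUE CLUSTERED INDEX UCX_ItemIsActive ON dbo.ItemIsActive (ItemID);

CREATE TABLE dbo.ItemIsVerified (ItemID int NOT NULL);
CREATE UNIQUE CLUSTERED INDEX UCX_ItemIsVerified ON dbo.ItemIsVerified (ItemID);

-- "Active AND verified" becomes a join on the shared clustered key.
SELECT i.ItemID
FROM dbo.Item AS i
JOIN dbo.ItemIsActive   AS a ON a.ItemID = i.ItemID
JOIN dbo.ItemIsVerified AS v ON v.ItemID = i.ItemID;
```

Setting or clearing a condition is then an INSERT or DELETE on the small table, which leaves dbo.Item untouched.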

Sql-server – Getting “Conversion failed when converting the nvarchar value to data type int.”

Write your query this way. It uses a stricter test for integers and drops ISNUMERIC, which returns 1 for values such as '-.'.

SELECT
    zip,
    coordinates
FROM (
    SELECT
        -- Convert only values made up entirely of digits; anything else yields NULL.
        CASE WHEN ZIP5 > '' AND ZIP5 NOT LIKE '%[^0-9]%'
            THEN CONVERT(int, [ZIP5]) END AS zip,
        CONVERT(varchar(max), geom) AS coordinates
    FROM
        [SpatialData].[dbo].[zip5]
    WHERE
        ZIP5 > '' AND ZIP5 NOT LIKE '%[^0-9]%'
) AS t1
WHERE
    zip >= 85000 AND
    zip < 86000

See this Connect item

SQL Server is free to evaluate the WHERE and SELECT clauses in whatever order it decides is optimal. A view or derived table is not materialized; it can be expanded into the outer query.

What SQL Server is compiling is really a query that looks like this:

SELECT
    CONVERT(int, [ZIP5]) AS zip,
    CONVERT(varchar(max), geom) AS coordinates
FROM 
    [SpatialData].[dbo].[zip5]
WHERE
    ISNUMERIC([ZIP5]) = 1
    AND CONVERT(int, [ZIP5]) >= 85000
    AND CONVERT(int, [ZIP5]) < 86000

You can inspect the query plan of your original, but my guess from looking at the structure is this: the WHERE clause uses the expression CONVERT(int, [ZIP5]) twice, so it makes sense for the optimizer to compute that expression once, while retrieving data from the table. This puts the processing of the SELECT clause before the WHERE, so your ISNUMERIC() = 1 never got a chance to filter out the bad eggs.

This is by design.
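The difference between the two tests can be seen side by side on a few sample values. This is a small sketch using a table variable with made-up data, not the original zip5 table. The conversion sits inside a CASE so that only digit-only values reach CONVERT.

```sql
-- Made-up sample values: one clean ZIP, the '-.' case mentioned above,
-- an empty string, and a value with a letter in it.
DECLARE @zip TABLE (ZIP5 nvarchar(10) NOT NULL);
INSERT INTO @zip (ZIP5) VALUES (N'85001');
INSERT INTO @zip (ZIP5) VALUES (N'-.');
INSERT INTO @zip (ZIP5) VALUES (N'');
INSERT INTO @zip (ZIP5) VALUES (N'8500A');

SELECT ZIP5,
       ISNUMERIC(ZIP5) AS IsNumericResult,
       -- Digits-only test used in the rewritten query.
       CASE WHEN ZIP5 > '' AND ZIP5 NOT LIKE '%[^0-9]%' THEN 1 ELSE 0 END AS DigitsOnly,
       -- Guarded conversion: NULL for anything that is not all digits.
       CASE WHEN ZIP5 > '' AND ZIP5 NOT LIKE '%[^0-9]%'
            THEN CONVERT(int, ZIP5) END AS ZipAsInt
FROM @zip;
```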
