SQL Server – How to avoid tempdb spills

sql server

I have several tempdb spills in queries that currently serve business purposes. Everything I find online says that tempdb spills should never be ignored because they can cause performance problems, but it's still a bit unclear to me why that is. I have also read that SQL Server will request additional memory for a query if it notices its estimate was wrong for a certain operator. It would seem logical, then, that there would only be a performance problem when SQL Server is actually out of memory, and that would happen regardless: the memory requirement and the available memory do not change just because the estimate is corrected.
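As an aside, what a running query asked for versus what it actually used can be watched with a standard DMV (a minimal sketch; run it while the query executes):

-- If used_memory_kb reaches granted_memory_kb, the operator has no room left
-- in its grant and may start spilling to tempdb.
SELECT
    [session_id]
    ,[requested_memory_kb]
    ,[granted_memory_kb]
    ,[used_memory_kb]
    ,[max_used_memory_kb]
FROM
    sys.dm_exec_query_memory_grants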

Do tempdb spills mean that the server needs more memory, or is there some other reason why they are bad that I am not accounting for? I do notice performance drops for queries whose tempdb spills are large, so I assume the problem will only get worse as data is added.
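For what it's worth, spills can at least be captured server-wide with Extended Events while investigating (a minimal sketch; the session name is made up, sort_warning and hash_warning are the documented spill events):

CREATE EVENT SESSION [TrackSpills] ON SERVER
ADD EVENT sqlserver.sort_warning (
    ACTION (sqlserver.sql_text, sqlserver.database_name)),
ADD EVENT sqlserver.hash_warning (
    ACTION (sqlserver.sql_text, sqlserver.database_name))
ADD TARGET package0.event_file (SET filename = N'TrackSpills')
GO

ALTER EVENT SESSION [TrackSpills] ON SERVER STATE = START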

To avoid tempdb spills, I see the following advice mentioned online:

  1. Make sure statistics are up to date.
  2. Build indexes in a correct manner.
  3. Adjust query if possible.

I believe I've done all of this, though I am definitely happy to be proven wrong. It is also worth noting that I still get tempdb spills after converting said queries into parameterized queries.
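For point 1, statistics freshness can be double-checked like this (a sketch using sys.dm_db_stats_properties):

-- When were the statistics on Attribute last updated, how many rows were
-- sampled, and how much has the table changed since?
SELECT
    [s].[name]
    ,[sp].[last_updated]
    ,[sp].[rows]
    ,[sp].[rows_sampled]
    ,[sp].[modification_counter]
FROM
    sys.stats AS [s]
CROSS APPLY
    sys.dm_db_stats_properties([s].[object_id], [s].[stats_id]) AS [sp]
WHERE
    [s].[object_id] = OBJECT_ID(N'dbo.Attribute')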

I'd really like insight into how a sort operator, and more specifically a tempdb spill, can be completely avoided, whether by table design or any other possibility I haven't thought of. I am finding very little information about this online apart from the rules above, which I have already applied. Even a prominent book on SQL Server performance mentions the word spill only twice. Is there any good resource that really goes into depth on this?

This is the sort operator on the production server:

[Screenshot: the sort operator with its tempdb spill warning]

Below is the query I am trying to execute. With 390 thousand Attribute records locally it does not spill, but with 5 million it does. Either way the execution plan contains a sort operator, and that operator is what I am trying to remove because it spills on the production server. How can I remove the sort operator or otherwise avoid the tempdb spill it causes?

Execute this script if you wish to follow along with an example. Otherwise skip it.

USE [master]
GO

DROP DATABASE IF EXISTS [TempDbSpill]
GO

CREATE DATABASE [TempDbSpill]
GO

USE [TempDbSpill]

CREATE TABLE [Product] (
    [Id] UNIQUEIDENTIFIER NOT NULL
)

CREATE TABLE [AttributeType] (
    [Id] UNIQUEIDENTIFIER NOT NULL,
    [Description] NVARCHAR(MAX) NOT NULL
)

CREATE TABLE [AttributeValue] (
    [Id] UNIQUEIDENTIFIER NOT NULL,
    [Value] NVARCHAR(MAX) NOT NULL
)

CREATE TABLE [Attribute] (
    [Id] UNIQUEIDENTIFIER NOT NULL,
    [AttributeTypeId] UNIQUEIDENTIFIER NOT NULL,
    [ProductId] UNIQUEIDENTIFIER NOT NULL,
    [ValueId] UNIQUEIDENTIFIER NULL
)

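-- In this fresh database, INFORMATION_SCHEMA.COLUMNS should return 9 rows (one per
-- column of the four tables above), so the five-way cross join below yields
-- 9^5 = 59,049 products.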
INSERT INTO
    [Product] ([Id])
SELECT
    NEWID()
FROM
    INFORMATION_SCHEMA.COLUMNS AS [a]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [b]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [c]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [d]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [e]

INSERT INTO
    [AttributeType] ([Id], [Description])
SELECT
    NEWID(), CONCAT(NEWID(), NEWID(), NEWID(), NEWID())
FROM
    INFORMATION_SCHEMA.COLUMNS AS [a]
INSERT INTO
    [AttributeType] ([Id], [Description])
SELECT
    '960BE057-EB5B-4746-9A96-0806723433E9', 'Description'

INSERT INTO
    [AttributeValue] ([Id], [Value])
SELECT
    NEWID(), CONCAT(NEWID(), NEWID(), NEWID(), NEWID())
FROM
    INFORMATION_SCHEMA.COLUMNS AS [a]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [b]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [c]
CROSS JOIN
    INFORMATION_SCHEMA.COLUMNS AS [d]
INSERT INTO
    [AttributeValue] ([Id], [Value])
SELECT
    '98082f59-2cbf-439a-a0c5-b3f56aa8d71a', 'Value'

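-- Give every product a pseudo-random set of attributes: a product and a
-- type/value pair are matched when they hash into the same bucket (0-9999).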
INSERT INTO
    [Attribute] ([Id], [AttributeTypeId], [ProductId], [ValueId])
SELECT
    NEWID(), [ChecksumAttribute].[AttributeTypeId], [Product].[Id], [ChecksumAttribute].[AttributeValueId]
FROM
    [Product]
JOIN (
    SELECT
        [AttributeType].[Id] AS [AttributeTypeId]
        ,[AttributeValue].[Id] AS [AttributeValueId]
        ,ABS(CHECKSUM(CONCAT([AttributeType].[Id], [AttributeValue].[Id]))) % 10000 AS [Checksum]
    FROM
        [AttributeType]
    CROSS JOIN
        [AttributeValue]
) AS [ChecksumAttribute] ON ABS(CHECKSUM([Product].[Id])) % 10000 = [ChecksumAttribute].[Checksum]


ALTER TABLE [Product] ADD CONSTRAINT [PK_Product] PRIMARY KEY CLUSTERED ([Id])
ALTER TABLE [Attribute] ADD CONSTRAINT [PK_Attribute] PRIMARY KEY CLUSTERED ([Id])
ALTER TABLE [AttributeValue] ADD CONSTRAINT [PK_AttributeValue] PRIMARY KEY CLUSTERED ([Id])

Below is the query demonstrating the problem.

SELECT
    [GroupByResult].[AttributeTypeId]
    ,[AttributeValue].[Value]
    ,[GroupByResult].[Count]
FROM (
    SELECT
        [AttributeTypeId]
        ,[ValueId]
        ,COUNT(*) AS [Count]
    FROM
        [Attribute]
    WHERE
        [ProductId] IN 
        (
            SELECT
                [Id]
            FROM
                [Product]
            WHERE
                EXISTS (
                    SELECT
                        1
                    FROM
                        [Attribute]
                    JOIN
                        [AttributeValue] ON [Attribute].[ValueId] = [AttributeValue].[Id]
                    WHERE
                        [Product].[Id] = [Attribute].[ProductId] AND
                        ([Attribute].[AttributeTypeId] = '960BE057-EB5B-4746-9A96-0806723433E9') AND [Attribute].[ValueId] IN ('98082f59-2cbf-439a-a0c5-b3f56aa8d71a'))
        )
    GROUP BY
        [AttributeTypeId]
        ,[ValueId]
) AS [GroupByResult]
JOIN
    [AttributeValue] ON [GroupByResult].[ValueId] = [AttributeValue].[Id]
OPTION (MAXDOP 1, RECOMPILE)

As expected, and as clearly seen in the execution plan, an index is recommended.

[Screenshot: execution plan with a missing-index recommendation]

CREATE NONCLUSTERED INDEX [IX_Attribute_SuggestedIndex1] ON [dbo].[Attribute] ([AttributeTypeId],[ValueId]) INCLUDE ([ProductId])

However, after applying this index the sort operator remains, and a second execution leads to another index being recommended.

[Screenshot: execution plan recommending a second index]

CREATE NONCLUSTERED INDEX [IX_Attribute_SuggestedIndex2] ON [dbo].[Attribute] ([ProductId]) INCLUDE ([AttributeTypeId],[ValueId])

However, after applying this index there is again no luck: the sort operator is still there.

[Screenshot: execution plan still showing the sort operator]

I've tried pretty much every combination of indexes possible, including making (AttributeTypeId, ProductId, ValueId) the primary key, to no avail.

Additionally, below is another query with a sort operator which spills on the production server.

SELECT TOP 1001
    *
FROM
    [Product]
WHERE
    EXISTS (
        SELECT
            1
        FROM
            [Attribute]
        WHERE
            [Product].[Id] = [Attribute].[ProductId] AND
            [Attribute].[AttributeTypeId] = '960BE057-EB5B-4746-9A96-0806723433E9' AND
            [Attribute].[ValueId] IN ('98082f59-2cbf-439a-a0c5-b3f56aa8d71a'))
OPTION (MAXDOP 1, RECOMPILE)

Again, I want to understand how I can avoid a tempdb spill by any means necessary.
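For completeness, one stopgap I have come across (rather than a real fix) is forcing a larger memory grant with the MIN_GRANT_PERCENT query hint, available from SQL Server 2012 SP3 / 2014 SP2 / 2016 onwards. For example, on the query above, where the 10 is an arbitrary starting point:

-- Same query as above with a minimum grant forced; this masks the
-- misestimate rather than fixing it.
SELECT TOP 1001
    *
FROM
    [Product]
WHERE
    EXISTS (
        SELECT
            1
        FROM
            [Attribute]
        WHERE
            [Product].[Id] = [Attribute].[ProductId] AND
            [Attribute].[AttributeTypeId] = '960BE057-EB5B-4746-9A96-0806723433E9' AND
            [Attribute].[ValueId] IN ('98082f59-2cbf-439a-a0c5-b3f56aa8d71a'))
OPTION (MAXDOP 1, RECOMPILE, MIN_GRANT_PERCENT = 10)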

EDIT: JSON as an alternative was suggested. This is new to me and still leads to the following problems: the sample SELECT queries below are slow compared to the ones above using the EAV model, and I need to be able to query values of any type, not just UNIQUEIDENTIFIER (suppose ValueId really were a JSON value in this case).

CREATE TABLE [Product2] (
    [Id] UNIQUEIDENTIFIER NOT NULL,
    [Attributes] NVARCHAR(MAX)
)

INSERT INTO
    [Product2] ([Id], [Attributes])
SELECT
    [Product].[Id]
    ,[Json].[Value]
FROM
    [Product]
CROSS APPLY (
    SELECT (
        SELECT
            [Attribute].[AttributeTypeId]
            ,[Attribute].[ValueId]
        FROM
            [Attribute]
        WHERE
            [Product].[Id] = [Attribute].[ProductId]
        FOR JSON AUTO
    ) AS [Value]
) AS [Json]

SELECT TOP 1001
    *
FROM
    [Product2]
WHERE
    EXISTS (
        SELECT
            1
        FROM
            OPENJSON([Attributes]) WITH([AttributeTypeId] UNIQUEIDENTIFIER '$.AttributeTypeId', [ValueId] UNIQUEIDENTIFIER '$.ValueId')
        WHERE
            [AttributeTypeId] = '960BE057-EB5B-4746-9A96-0806723433E9' AND
            [ValueId] IN ('98082f59-2cbf-439a-a0c5-b3f56aa8d71a'))
OPTION (MAXDOP 1, RECOMPILE)

SELECT
    [AttributeTypeId]
    ,[ValueId]
    ,COUNT(*)
FROM
    [Product2]
CROSS APPLY OPENJSON([Attributes]) WITH([AttributeTypeId] UNIQUEIDENTIFIER '$.AttributeTypeId', [ValueId] UNIQUEIDENTIFIER '$.ValueId')
GROUP BY
    [AttributeTypeId]
    ,[ValueId]
OPTION (MAXDOP 1, RECOMPILE)

Best Answer

OK. You've got the Entity-Attribute-Value (EAV) pattern there, and it's notorious for poor performance. If the tables are not too big, you won't notice, but don't expect blazing-fast performance: you should expect tempdb spills, memory grant warnings, locking issues, and poor execution plans. So if you really care about optimal performance, choose a different way to model the data.

SQL Server has this really great feature where you model entities as rows and attributes as columns. It's called the relational model, and it's the modeling approach that the SQL Server engine is optimized for.

If you can't use the relational model for your Product attributes because they are too variable, then model a subset of the attributes using XML or JSON, which SQL Server also supports. E.g.:

CREATE TABLE [Product] 
(
    [Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    Name nvarchar(20) not null,
    ProductType int not null,
    Active bit not null default 1,
    Attributes nvarchar(max),
    constraint ck_is_valid_json check (isjson(Attributes) = 1)
)

The JSON should be a simple list of name/value pairs, like this

{"Description":"My Product","ManufacturerName":"Acme Widgets","ValidOn":"2019-01-01T12:00:00","SafetyStockLevel":10}

You can do lookups like this

SELECT *
FROM Product p
WHERE  json_value(p.Attributes,'$.Description') = 'My Cool Product'

And if you have performance-critical attribute lookups, put an indexed computed column on Product that "promotes" the attribute and gives you fast lookups, like this

alter table product add Description as json_value(Attributes,'$.Description') persisted

create index ix_product_description on product(Description)
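With the column promoted and indexed, an equality lookup can reference it directly and seek instead of parsing JSON per row (a sketch):

-- Seeks ix_product_description; no per-row json_value evaluation.
SELECT *
FROM Product p
WHERE p.Description = 'My Cool Product'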

On top of that, the table design for EAV is not optimal. You should not use UNIQUEIDENTIFIERs for the keys, and even if you do, the key structure should look like this:

CREATE TABLE [Product] (
    [Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID()
)

CREATE TABLE [AttributeType] (
    [Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
    [Description] NVARCHAR(MAX) NOT NULL
)

CREATE TABLE [Attribute] (
    [ProductId] UNIQUEIDENTIFIER NOT NULL REFERENCES Product,
    [AttributeTypeId] UNIQUEIDENTIFIER NOT NULL REFERENCES AttributeType,
    [Value] NVARCHAR(MAX) NULL,
    CONSTRAINT [PK_Attribute] PRIMARY KEY CLUSTERED (ProductId, AttributeTypeId)
)

In EAV, accessing the attributes for a single entity is the primary access path, so the attributes should be clustered by the entity ID (here, ProductId). And the AttributeValue table doesn't really belong.
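To illustrate (the literal value is a placeholder): with the clustered key on (ProductId, AttributeTypeId), the existence check from the question should become a narrow clustered-key seek per product, with nothing left over to sort or spill.

SELECT p.*
FROM Product p
WHERE EXISTS (
    SELECT 1
    FROM [Attribute] a
    WHERE a.ProductId = p.Id
      AND a.AttributeTypeId = '960BE057-EB5B-4746-9A96-0806723433E9'
      -- Value is NVARCHAR(MAX), so it cannot be an index key column;
      -- it is evaluated as a residual predicate on the seek.
      AND a.[Value] = N'Value')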