Sql-server – OrderBy int slow for large table join

execution-planperformancequery-performancesql server

I have roughly 1.5 million records. This query works fine for smaller data, but it gets a second slower for every few hundred thousand records. For 1.5 million, the query is noticeably struggling.

I mocked up a whole new Code First project to verify the problem is independent of my implementation, and to provide this example query:

    SELECT TOP 1 Widgets.*
    FROM [WidgetSandbox].[dbo].[Widgets] Widgets
    INNER JOIN [WidgetSandbox].[dbo].[Status] Statuses ON Widgets.StatusId = Statuses.Id
    INNER JOIN [WidgetSandbox].[dbo].[Colors] Colors ON Widgets.ColorCode = Colors.ColorCode
    INNER JOIN [WidgetSandbox].[dbo].[Sizes] Sizes ON Widgets.SizeId = Sizes.Id
    WHERE Statuses.Name = 'Available'
    AND Colors.Name = 'Red'
    ORDER BY Sizes.DiameterInches

DiameterInches is an int, as a metaphor for "PriorityLevel" in my actual code.

If I comment out ORDER BY Sizes.DiameterInches, it returns immediately, but if I want to find "the smallest available red widget", then it crawls.

Is there a better way?

Execution plan XML: https://gist.github.com/RobertBaldini/57c8b61d135cc5c84c38b2da243611ad

DDL: https://gist.github.com/RobertBaldini/3740c7bb85eea47d7fe63cb8602ac2d6

Repo (data loader takes several minutes): https://github.com/RobertBaldini/WidgetSandbox

Best Answer

You should retype the various Name columns from nvarchar(max) to nvarchar([right size]). It's unlikely that names will be up to 2GB in length, and making them max sized prevents them being used as a key in an index. A good general rule of thumb is to avoid large object data types wherever possible.

You'll probably need to make that change to the EF code, but in T-SQL:

-- Guessing at 100 characters maximum
ALTER TABLE dbo.[Status]
ALTER COLUMN Name nvarchar(100) NOT NULL;

ALTER TABLE dbo.Colors
ALTER COLUMN Name nvarchar(100) NOT NULL;

ALTER TABLE dbo.Sizes
ALTER COLUMN Name nvarchar(100) NOT NULL;

ALTER TABLE dbo.Widgets
ALTER COLUMN Name nvarchar(100) NOT NULL;

I have also changed the column definition to NOT NULL there.

Anyway, given indexes:

CREATE NONCLUSTERED INDEX IX_dbo_Sizes__DiameterInches
ON dbo.Sizes (DiameterInches);

CREATE NONCLUSTERED INDEX IX_dbo_Widgets__SizeId_ColorCode_StatusId__Name
ON dbo.Widgets (SizeId, ColorCode, StatusId)
INCLUDE (Name);

...you should end up with a plan that avoids lookups and sorts, something like:

With the small number of rows in Status and Colors, it is probably not worth indexing their Name columns right now, but that could change over time. In any case, if the name columns should be unique, you should constrain them to be so using a unique constraint or unique nonclustered index, for example:

CREATE UNIQUE NONCLUSTERED INDEX UQ_dbo_Colors__Name
ON dbo.Colors (Name);

CREATE UNIQUE NONCLUSTERED INDEX UQ_dbo_Status__Name
ON dbo.[Status] (Name);

CREATE UNIQUE NONCLUSTERED INDEX UQ_dbo_Sizes__Name
ON dbo.Sizes (Name);

CREATE UNIQUE NONCLUSTERED INDEX UQ_dbo_Widgets__Name
ON dbo.Widgets (Name);

Second approach

With the same indexes, if you are able to change the SQL, you could also separate out the Color and Status lookups:

DECLARE 
    @StatusId integer,
    @ColorCode nvarchar(6);

SELECT @StatusId = ST.Id 
FROM dbo.[Status] AS ST 
WHERE ST.Name = N'Available';

SELECT @ColorCode = C.ColorCode 
FROM dbo.Colors AS C 
WHERE C.Name = N'Red';

SELECT TOP (1) 
    W.*
FROM dbo.Widgets AS W
JOIN dbo.Sizes AS S
    ON W.SizeId = S.Id
WHERE
    W.StatusId = @StatusId
    AND W.ColorCode = @ColorCode
ORDER BY
    S.DiameterInches;

This makes for a simpler set of operations that also happen to make life easier for the optimizer:

Note also that Unicode string literals should be prefixed with N for correct data typing.

Oh, and you should also patch your instance. It is currently SQL Server 2008 Service Pack 1 - Service Pack 4 has been available for some time now.

Related Solutions

Sql-server – Optimising join on large table

Your ix_hugetable looks quite useless because:

it is the clustered index (PK)
the INCLUDE makes no difference because a clustered index INCLUDEs all non-key columns (non-key values at lowest leaf = INCLUDEd = what a clustered index is)

In addition: - added or fk should be first - ID is first = not much use

Try changing the clustered key to (added, fk, id) and drop ix_hugetable. You've already tried (fk, added, id). If nothing else, you'll save a lot of disk space and index maintenance

Another option might be to try the FORCE ORDER hint with table order boh ways and no JOIN/INDEX hints. I try not to use JOIN/INDEX hints personally because you remove options for the optimiser. Many years ago I was told (seminar with a SQL Guru) that FORCE ORDER hint can help when you have huge table JOIN small table: YMMV 7 years later...

Oh, and let us know where the DBA lives so we can arrange for some percussion adjustment

Edit, after 02 Jun update

The 4th column is not part of the non-clustered index so it uses the clustered index.

Try changing the NC index to INCLUDE the value column so it doesn't have to access the value column for the clustered index

create nonclustered index ix_hugetable on dbo.hugetable (
    fk asc, added asc
) include(value)

Note: If value is not nullable then it is the same as COUNT(*) semantically. But for SUM it need the actual value, not existence.

As an example, if you change COUNT(value) to COUNT(DISTINCT value) without changing the index it should break the query again because it has to process value as a value, not as existence.

The query needs 3 columns: added, fk, value. The first 2 are filtered/joined so are key columns. value is just used so can be included. Classic use of a covering index.

Mysql – Int comparison very slow in query

First, a query or (update) statement with a condition like WHERE posts < comments that compares 2 columns cannot effectively use your indexes so it will probably have to do a full table scan. It might be better if you had a composite index, (posts, comments) or the other way around, but it would still need to do a full index scan.

If the rows to be updated are very few, say 100, it's not very efficient to scan 1 million rows to update only 100.

So, the easiest thing would probably be to add such a composite index and test. If the resulting efficiency from the full index scan is acceptable, you can keep using this query.

Another thing you could do - since you are using MariaDB - is to add a VIRTUAL (computed) column and index it:

ALTER TABLE posts
  ADD COLUMN comment_posts_diff INT AS (columns - posts) PERSISTENT,
  ADD INDEX more_comments_ix (comment_posts_diff) ;

Then you only need to change the condition of your queries to use:

WHERE comments - posts > 0

or:

WHERE comment_posts_diff > 0

and the index will be effectively used.

Note that there are 2 variants of computed columns, the VIRTUAL and the PERSISTENT ones. As it's obvious from the names, the persistent columns actually take space on disks and the values are automatically modified when any of the columns in their definition is modified. Their advantage is that they can be indexed.

If the 'more comments' though is just a constant string that applies to rows with comments > posts and there is another string ('less comment'?) that applies to the rest of the rows or a similar need, you don't really need the column notes and any of the above. Your queries could just compute the respective value (more comments, less comment, no comments) at run time.

Or you could have a computed column to do the work for you (vitual this time, not persistent):

ALTER TABLE posts
  DROP COLUMN notes,
  ADD COLUMN notes VARCHAR(20) AS
      CASE WHEN columns > posts THEN 'more comments'
           WHEN columns = 0     THEN 'no comments'
           WHEN columns = posts THEN 'comments = posts'
           ELSE 'less comments'
      END
    VIRTUAL ;

Best Answer

Second approach

Related Solutions

Sql-server – Optimising join on large table

Mysql – Int comparison very slow in query

Related Question