Sql-server – SQL Server: Clustered index, sorting and pagination

clustered-indexdata-pagesindexmulti-tenantsql server

In my application, several times I have to show results that are paginated and sorted by some field.

For example a simple user list sorted by last name. Because of that and because I also have logical deletion and it's a multi-tenancy application, I generally use a CLUSTERED INDEX like this:

CREATE CLUSTERED INDEX [idx] ON [Users]
(
     IsDeleted ASC,
     [AccountId] ASC,
     [LastName] ASC
)

That means that a query like paginated like SELECT TOP(20) * FROM Users WHERE IsDeleted = 0 AND AccountId = xxx is sorted by LastName. I know it's not guaranteed to be sorted, but in practice it always is.

However, reading here Kimberly Tripp blog post on clustered indexes, she says it's a horrible idea to do it like so. And it's even worse because the IsDeleted (BIT) field doesn't let me to set the

However, if I change the CLUSTERED INDEX to just the unique ID, I would need to start using ORDER BY LastName, which in practice is very slow.

My table has a few million records (tens of millions at most) and it's used in general as follows:

  • Querying data. Most of the time.
  • Bulk updates/inserts, where only the data modified is under the subset IsDeleted = 0, AccountId = xxxx (only non-deleted data for a single account is updated in bulk).

Question:

What would be the recommended index (and how to sort) for these kind of tables?

Another example for these kind of tables would be a survey results table that has the following columns IsDeleted (BIT), AccountId (FK GUID), UserId (FK GUID), QuestionKey (NVARCHAR), AnswerValue (TEXT), where my CLUSTERED KEY would be probably (IsDeleted, AccountId, UserId, QuestionKey) and 99% of time I would query the table or update it in bulk by the 3 first fields

WHERE IsDeleted = 0 
AND AccountId = xxx 
AND UserId = yyy

or even the 4 fields: ... AND QuestionKey = 'country'

EDIT:

One of the primary reasons I did this is because the bulk updates and the queries are always limited to 1 or a small number of pages. Having an Identity column would require the queries and the updates to read/write in most pages.

EDIT 2:

Following Joe Obbish's example:

This query:

SELECT TOP (20) * 
FROM Users2
WHERE IsDeleted = 0 
AND AccountId = '46FC5693-7446-415A-8626-8937365460D1'
ORDER BY [LastName];
  • Having 1 cluster index in (IsDeleted, AccountId, LastName) results to this:

CPU time = 3 ms, elapsed time = 3 ms.

Table 'Users2'. Scan count 1, logical reads 5, physical reads 4, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

  • Having one cluster index in a new PK ID (NEWID()) column (that way the data is randomly sorted internally) and one non-clustered (IsDeleted, AccountId, LastName) results to this:

CPU time = 16 ms, elapsed time = 18 ms.

Table 'Users2'. Scan count 1, logical reads 533, physical reads 5, read-ahead reads 1240, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Note the IO and time. It's way slower and requires more IO if the data is not stored together. It might take more space, but the speed difference is notable. Am I wrong to do it like that?

Best Answer

First I need to reiterate what RDFozz said in his answer about including an explicit ORDER BY. If SQL Server does an allocation order scan you may get the wrong results. Including ORDER BY in the query doesn't cause a performance hit. Why not do it?

From a query performance perspective, the index that you want depends on how many rows you return at a time and how many columns are actually needed from the table.

First I'll throw around 6.5 million rows into a table with six tenants:

CREATE TABLE dbo.Users2 (
    IsDeleted Bit NOT NULL,
    [AccountId] UNIQUEIDENTIFIER NOT NULL,
    [LastName] NVARCHAR(50) NOT NULL,
    [UsefulColumn] NVARCHAR(20) NOT NULL,
    [OtherColumns] NVARCHAR(100) NOT NULL
);

CREATE CLUSTERED INDEX [idx] ON [Users2]
(
     IsDeleted ASC,
     [AccountId] ASC,
     [LastName] ASC
);

CREATE TABLE #ids (id INT NOT NULL IDENTITY (0, 1), [AccountId] UNIQUEIDENTIFIER NOT NULL);
INSERT INTO #ids
SELECT TOP 6 NEWID()
FROM master..spt_values;

INSERT INTO [Users2] WITH (TABLOCK)
SELECT 
CASE WHEN t1.number % 10 = 1 THEN 1 ELSE 0 END
, #ids.[AccountId]
, LEFT(REPLACE(CONVERT(NVARCHAR(50), NEWID()), '-', ''), 12)
, REPLICATE(N'Z', 20)
, REPLICATE(N'Z', 100)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
LEFT OUTER JOIN #ids ON ABS(t1.number % 6) = #ids.id;

DROP TABLE #ids;

-- get an ID: FFA7D6D8-63E8-422B-B5E7-F7020871CDB4
SELECT TOP 1 [AccountId] FROM Users2
WHERE IsDeleted = 0
ORDER BY [AccountId] DESC;

When running a query similar to yours:

SELECT TOP (20) * 
FROM Users2
WHERE IsDeleted = 0 
AND AccountId = 'FFA7D6D8-63E8-422B-B5E7-F7020871CDB4'
ORDER BY [LastName];

I get a clustered index seek as expected:

query plan

Is using the clustered index the best option? It depends. If you don't need to select every column from the table then you can define a smaller covering index which returns the data you need without an explicit sort. Having a smaller covering index is good for performance as you page out further into the data. Suppose that you only need UsefulColumn and not the OtherColumns column. You could define the following index:

CREATE NONCLUSTERED INDEX [idx_1] ON [Users2]
(
     [AccountId] ASC,
     [LastName] ASC
) 
INCLUDE ([UsefulColumn]) 
WHERE IsDeleted = 0;

It's pretty large. For my test case it's around 28% of the size of the data. For this index, it's important to note that changing the clustered key of the table will not have a big impact on its size. SQL Server stores the clustered key columns in the leaf node of the index unless they are already included in the index. This can be demonstrated with a simple test:

CREATE TABLE dbo.IX_TEST (
    COL1 BIGINT NOT NULL,
    COL2 BIGINT NOT NULL,
    FILLER VARCHAR(6) NOT NULL,
    PRIMARY KEY (COL1)
);

INSERT INTO dbo.IX_TEST WITH (TABLOCK)
SELECT TOP (1000000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 6)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

EXEC sp_spaceused 'IX_TEST'; -- 96 KB

CREATE INDEX COL1 ON dbo.IX_TEST (COL1)
EXEC sp_spaceused 'IX_TEST'; -- 14032 KB

CREATE INDEX COL2 ON dbo.IX_TEST (COL2)
EXEC sp_spaceused 'IX_TEST'; --- 35920 KB

Going back to our table, if I created an index on just the UsefulColumn column there would be an overhead (with no compression) of approximately 1 byte for the IsDeleted column, 16 bytes for the AccountId, 2 * the average length of the LastName column, and 0 or 4 bytes for the internal uniquifier (only happens for duplicate last names). For my test data that's quite a bit of overhead:

1 + 16 + 2 * 12 + 0 = 41 bytes

However for the idx_1 index I defined above it's around only 1 bytes (1 for IsDeleted and 0 for the uniquifier, assuming not very many duplicate last names). The index is mainly large because I'm using wide columns as key columns. Keep in mind that the indexes will be the same size with your current clustering key, but changing the clustering key of the table to a thinner set of columns will greatly decrease the size of the index only defined on UsefulColumn.

For any TOP value I should get a nice index seek on the covering index. This query:

SELECT TOP (200000) [LastName], [UsefulColumn] 
FROM Users2
WHERE IsDeleted = 0 AND AccountId = 'FFA7D6D8-63E8-422B-B5E7-F7020871CDB4'
ORDER BY [LastName];

Has the following plan:

covering plan

Even if you need every single column from the table the index defined above can still give good enough performance. You'll get a key lookup for every row that you return. For a single end user, doing key lookup on 20 rows it shouldn't be noticeable. In your test you saw execution times of 3 ms and 18 ms. However, if you have a highly concurrent workload it could make a difference. Only you can evaluate that properly.

Even without the caveats above, when selecting many rows you may notice a large performance difference:

clust compare

The first query has an index hint to force the use of the index. The query optimizer thinks that plan will be much more expensive than the plan that uses the clustered index.

If I had to guess, I would say that you probably don't need to use a clustered index to perform your paging. If you have a lot of indexes on the table you might benefit from creating a new covering index for your pagination queries and by defining the clustered index on a smaller, unique set of columns.