Sql-server – MSSQL efficient paging including a join on big tables

offset-fetchpagingsql serversql server 2014

We have a table with products and a table with a current selection/assortment of products. Now need to join them create an export file by some .net code.

Currently using query like this:

SELECT * FROM ASSORTMENT a
INNER JOIN PRODUCTS p on a.clientID = 38 and a.productID = p.productID
ORDER BY a.ID_ROW OFFSET 100000 ROWS FETCH NEXT 1000 ROWS ONLY

Problem is that the assortment can contain discontinued products, that are not existing in products table anymore, so need a join to verify. Also need each time exactly same amount of rows per select, e.g. 1000.

The query gets slower if the offset gets higher because it does a full join of products and assortment table each time. On 3 Million products and an assortment of 200.000 articles it takes multiple seconds to fetch next 1000 if the offset is over 50.0000

Is there a more efficient way of doing this?

Best Answer

You didn't describe the problem very precisely so I made some guesses about table structures and data volume. I put 3 million rows in both tables:

CREATE TABLE dbo.ASSORTMENT (
    ID_ROW BIGINT NOT NULL,
    clientID INT NOT NULL,
    productID BIGINT NOT NULL,
    FILLER VARCHAR(2000) NOT NULL,
    PRIMARY KEY (ID_ROW)
);

INSERT INTO dbo.ASSORTMENT WITH (TABLOCK)
SELECT t.RN, 30 + t.RN % 15, t.RN, REPLICATE('Z', 2000)
FROM
(
    SELECT TOP (3000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t (RN)
OPTION (MAXDOP 1);

CREATE TABLE dbo.PRODUCTS (
    productID BIGINT NOT NULL,
    FILLER VARCHAR(2000) NOT NULL,
    PRIMARY KEY (productID)
);

INSERT INTO dbo.PRODUCTS WITH (TABLOCK)
SELECT t.RN, REPLICATE('Z', 2000)
FROM
(
    SELECT TOP (3000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t (RN)
OPTION (MAXDOP 1);

Getting the first thousand rows is fast. It only requires 15 milliseconds of cpu time on my machine:

SELECT *
FROM dbo.ASSORTMENT a
INNER JOIN dbo.PRODUCTS p on a.clientID = 38 and a.productID = p.productID
ORDER BY a.ID_ROW OFFSET 0 ROWS FETCH NEXT 1000 ROWS ONLY
OPTION (MAXDOP 1);

If I change the OFFSET to 50000 rows the query plan looks the same but the query now takes 1700 ms of CPU time on my machine:

bad query plan

Why is the query so much slower? ORDER BY with an OFFSET requires SQL Server to scan past the unwanted data. For my data set SQL Server SQL server reads 764993 rows from ASSORTMENT before finding the 51k rows that it needs. It also does 51k index seeks to PRODUCTS before returning the 1000 rows that you requested.

Let's take a step back. Do you really need to select all columns from both tables? Your question said that the join to PRODUCTS is only there for data integrity, so why do you need all columns from that table? Unnecessary SELECT * is a bad habit which is best avoided. It would make a great New Year's resolution!

One standard solution would be to only select the columns that you need and to make covering indexes on the two tables. I'm going to assume that you absolutely need all columns from both tables. In that case you want the query optimizer to do the scan over as little data as possible, but in terms of number of rows and number of columns. To make that possible I'll create the following indexes:

CREATE INDEX IX_ASSORTMENT ON dbo.ASSORTMENT (clientID, ID_ROW) INCLUDE (productID);

CREATE INDEX IX_PRODUCTS ON dbo.PRODUCTS (productID);

A query to get just the clustered indexes of both tables now just takes 63 ms of CPU time on my machine:

SELECT a.ID_ROW, p.productID
FROM dbo.ASSORTMENT a
INNER JOIN dbo.PRODUCTS p on a.clientID = 38 and a.productID = p.productID
ORDER BY a.ID_ROW OFFSET 50000 ROWS FETCH NEXT 1000 ROWS ONLY
OPTION (MAXDOP 1);

The big differences are that SQL Server doesn't have to read rows for which ProductId is not 38 and it doesn't need to read data pages that contain unnecessary columns. Adding all of the columns back into the query bumps up the CPU time requirement to 80 ms but that's still better than a 20X improvement than before:

SELECT a.*, p.*
FROM
(
    SELECT a.ID_ROW, p.productID
    FROM dbo.ASSORTMENT a
    INNER JOIN dbo.PRODUCTS p on a.clientID = 38 and a.productID = p.productID
    ORDER BY a.ID_ROW OFFSET 50000 ROWS FETCH NEXT 1000 ROWS ONLY
) t
LEFT OUTER JOIN dbo.ASSORTMENT a ON t.ID_ROW = a.ID_ROW
LEFT OUTER JOIN dbo.PRODUCTS p ON t.productID = p.productID 
OPTION (MAXDOP 1);

If you need the query to be faster than that you can consider using an anchored solution as described here.