Sql-server – How to hint many-to-many join in SQL Server

I have 3 "large" tables that join on a pair of columns (both ints).

Table1 has ~200 million rows
Table2 has ~1.5 million rows
Table3 has ~6 million rows

Each table has a clustered index on Key1, Key2, and then one more column. Key1 has low cardinality and is very skewed. It is always referenced in the WHERE clause. Key2 is never mentioned in the WHERE clause. Each join is many-to-many.

The problem is with cardinality estimation. The output estimation of each join gets smaller instead of larger. This results in final estimates of low hundreds when the actual result is well into the millions.

Is there any way for me to clue the CE into making better estimates?

SELECT 1
FROM Table1 t1
     JOIN Table2 t2
       ON t1.Key1 = t2.Key1
          AND t1.Key2 = t2.Key2
     JOIN Table3 t3
       ON t1.Key1 = t3.Key1
          AND t1.Key2 = t3.Key2
WHERE t1.Key1 = 1;

Solutions I have tried:

Creating multi-column stats on Key1, Key2
Creating tons of filtered stats on Key1 (This helps quite a bit, but I end up with thousands of user-created stats in the database.)

Masked execution plan (sorry for the bad masking)

In the case I'm looking at, the result has 9 million rows. The new CE estimates 180 rows; the legacy CE estimates 6100 rows.

Here's a reproducible example:

DROP TABLE IF EXISTS #Table1, #Table2, #Table3;
CREATE TABLE #Table1 (Key1 INT NOT NULL, Key2 INT NOT NULL, T1Key3 INT NOT NULL, CONSTRAINT pk_t1 PRIMARY KEY CLUSTERED (Key1, Key2, T1Key3));
CREATE TABLE #Table2 (Key1 INT NOT NULL, Key2 INT NOT NULL, T2Key3 INT NOT NULL, CONSTRAINT pk_t2 PRIMARY KEY CLUSTERED (Key1, Key2, T2Key3));
CREATE TABLE #Table3 (Key1 INT NOT NULL, Key2 INT NOT NULL, T3Key3 INT NOT NULL, CONSTRAINT pk_t3 PRIMARY KEY CLUSTERED (Key1, Key2, T3Key3));

-- Table1 
WITH Numbers
     AS (SELECT TOP (1000000) Number = ROW_NUMBER() OVER(ORDER BY t1.number)
         FROM master..spt_values t1
              CROSS JOIN master..spt_values t2),
     DataSize (Key1, NumberOfRows)
     AS (SELECT 1, 2000 UNION
         SELECT 2, 10000 UNION
         SELECT 3, 25000 UNION
         SELECT 4, 50000 UNION
         SELECT 5, 200000)
INSERT INTO #Table1
SELECT Key1
     , Key2 = ROW_NUMBER() OVER (PARTITION BY Key1, T1Key3 ORDER BY Number)
     , T1Key3
FROM DataSize
     CROSS APPLY (SELECT TOP(NumberOfRows) 
                         Number
                       , T1Key3 = Number%(Key1*Key1) + 1 
                  FROM Numbers
                  ORDER BY Number) size;

-- Table2 (same Key1, Key2 values; smaller number of distinct third Key)
WITH Numbers
     AS (SELECT TOP (1000000) Number = ROW_NUMBER() OVER(ORDER BY t1.number)
         FROM master..spt_values t1
              CROSS JOIN master..spt_values t2)
INSERT INTO #Table2
SELECT DISTINCT 
       Key1
     , Key2
     , T2Key3
FROM #Table1
     CROSS APPLY (SELECT TOP (Key1*10) 
                         T2Key3 = Number
                  FROM Numbers
                  ORDER BY Number) size;

-- Table2 (same Key1, Key2 values; smallest number of distinct third Key)
WITH Numbers
     AS (SELECT TOP (1000000) Number = ROW_NUMBER() OVER(ORDER BY t1.number)
         FROM master..spt_values t1
              CROSS JOIN master..spt_values t2)
INSERT INTO #Table3
SELECT DISTINCT 
       Key1
     , Key2
     , T3Key3
FROM #Table1
     CROSS APPLY (SELECT TOP (Key1) 
                         T3Key3 = Number
                  FROM Numbers
                  ORDER BY Number) size;


DROP TABLE IF EXISTS #a;
SELECT col = 1 
INTO #a
FROM #Table1 t1
     JOIN #Table2 t2
       ON t1.Key1 = t2.Key1
          AND t1.Key2 = t2.Key2
WHERE t1.Key1 = 1;

DROP TABLE IF EXISTS #b;
SELECT col = 1 
INTO #b
FROM #Table1 t1
     JOIN #Table2 t2
       ON t1.Key1 = t2.Key1
          AND t1.Key2 = t2.Key2
     JOIN #Table3 t3
       ON t1.Key1 = t3.Key1
          AND t1.Key2 = t3.Key2
WHERE t1.Key1 = 1;

Best Answer

Just to be clear, the optimizer already knows that it's a many-to-many join. If you force merge joins and look at an estimated plan you can see a property for the join operator which tells you if the join could be many-to-many. The problem that you need to solve here is bumping up the cardinality estimates, presumably so you get a more efficient query plan for the part of the query that you left out.

The first thing that I would try is putting the results of the join from Object3 and Object5 into a temp table. For the plan that you posted it's just a single column on 51393 rows, so it should hardly take up any space in tempdb. You can gather full stats on the temp table and that alone might be enough to get a sufficient accurate final cardinality estimate. Gathering full stats on Object1 may help as well. Cardinality estimates often get worse as you traverse from a plan from right to left.

If that doesn't work you can try the ENABLE_QUERY_OPTIMIZER_HOTFIXES query hint if you don't already have it enabled at the database or server level. Microsoft locks plan-affecting performance fixes for SQL Server 2016 behind that setting. Some of them relate to cardinality estimates, so perhaps you'll get lucky and one of the fixes will help with your query. You can also try using the legacy cardinality estimator with a FORCE_LEGACY_CARDINALITY_ESTIMATION query hint. Certain data sets may get better estimates with the legacy CE.

As a last resort you can manually increase the cardinality estimate by whatever factor you like using Adam Machanic's MANY() function. I talk about it in another answer but it looks like the link is dead. If you're interested I can try to dig something up.

Best Answer

Related Solutions

Sql-server – Huge cross product view query speed

Mysql – Optimizing a query so MySQL will use the indexes for the joins

Related Question