SQL Server – Improving Join Performance

sql serverssis-2016

I have a table on sql server lets say TableA.
and a another table as TableB.Both of these tables are heavily loaded.

Table A is loaded with records 35883788. It is holding a lot of duplicates on column program_id.

Table B is loaded with records 27343331. It is also holding a duplicate on column on program_id.
For some business reason we need to keep duplicate records on program_id on both the tables.

Now I am performing a left join on given table as:

select 
a.*,b.date_of_birth,
datediff(year,b.date_of_birth,a.from_date)-
(case when dateadd(year,
           datediff(year,b.date_of_birth,a.from_date),b.date_of_birth)
           > a.from_date then 1 else 0
                   end) as age_final,
case when datediff(yyyy,b.date_of_birth,a.from_date) is null then ''
     when datediff(yyyy,b.date_of_birth,a.from_date) <=1 then 'less than 1 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=17 then 'then 1 - 17 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=29 then 'then 19 - 29 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=39 then 'then 30 - 39 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=49 then 'then 40 - 49 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=59 then 'then 50 - 59 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=64 then 'then 60 - 64 year'
else 'More than 64 years' end as Age_band
from TableA a 
left join ( select program_id,date_of_birth,max(process_date) 
            from TableB 
            group by program_id,date_of_birth) b
    on a.program_id=b.program_id;

my inner query is giving me a unique set of program_id as using group by statement.

None of these tables are indexed. If I have to create an index on both the tables, what kind of index do I need to have on program_id for both the tables.

what all other things I can take care of to improve the joining performance.

need suggestion to optimize the join and indexes.

Best Answer

DDL & DML at the bottom

Standard indexes that could be created based on the information given

CREATE NONCLUSTERED INDEX IX_program_id_date_of_birth_process_date
ON dbo.TableB (program_id,date_of_birth,process_date);
CREATE NONCLUSTERED INDEX IX_TableA
ON dbo.TableA(program_id);

Running the query

SET STATISTICS IO, TIME ON;
select 
a.*,b.date_of_birth
from dbo.TableA a 
left join ( select program_id,date_of_birth,max(process_date)  as maxprocess_date
            from dbo.TableB 
            group by program_id,date_of_birth) b
    on a.program_id=b.program_id;

enter image description here

The NC index on TableA is used since ID is included in the NC index and no other columns are in TableA except the program_id coluln, which is a key column of that index.


But, since you are using select a.* you will probably have to add all these columns to the NC index on TableA as included columns, which will increase your index size depending on the amount of columns in the table.

If we add a varchar(3) column and populate it with a value:

ALTER TABLE dbo.TableA ADD bla varchar(3) 
DEFAULT ('bla')
WITH VALUES;

The index is no longer used:

enter image description here

Because of the additional bla column added:

enter image description here

To fix this, recreate the index with the bla column included.

DROP  INDEX IX_TableA ON dbo.TableA;
CREATE NONCLUSTERED INDEX IX_TableA
ON dbo.TableA(program_id)
INCLUDE(bla);

Additionally

None of these tables are indexed.

Heap tables are not ideal, you should consider adding a clustered index that makes sense and reading up on indexing.

More on the difference between heap tables and tables with a Clustered index here

select a.*

Using Select * is not a good practice, consider naming all columns used in your select statement separately.

Bad habits to kick : using SELECT *


DDL + DML used for testing

CREATE TABLE 
dbo.TableA(ID INT IDENTITY(1,1) PRIMARY KEY,
           program_id INT);
CREATE TABLE 
dbo.TableB(ID INT IDENTITY(1,1) PRIMARY KEY,
           program_id INT,
           date_of_birth date,
           process_date datetime2)
-- Dataset reduced by / 10 ~= 3 588 378

INSERT INTO 
dbo.TableA WITH(TABLOCK)  (program_id)
SELECT TOP(3588378) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) / 1000
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2;
-- values from 0 to 3588

-- Dataset reduced by / 10 ~= 2 734 333
INSERT INTO 
dbo.TableB WITH(TABLOCK) (program_id,date_of_birth,process_date) 
SELECT TOP(2734333) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) / 1000,
CAST('01/01/1990' AS date),
DATEADD(MINUTE,-ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),GETDATE()) -- one minute less to get distinct process_dates
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2;
-- values from 0 to 2734