SQL Server – Improving Join Performance

sql serverssis-2016

I have a table on sql server lets say TableA.
and a another table as TableB.Both of these tables are heavily loaded.

Table A is loaded with records 35883788. It is holding a lot of duplicates on column program_id.

Table B is loaded with records 27343331. It is also holding a duplicate on column on program_id.
For some business reason we need to keep duplicate records on program_id on both the tables.

Now I am performing a left join on given table as:

select 
a.*,b.date_of_birth,
datediff(year,b.date_of_birth,a.from_date)-
(case when dateadd(year,
           datediff(year,b.date_of_birth,a.from_date),b.date_of_birth)
           > a.from_date then 1 else 0
                   end) as age_final,
case when datediff(yyyy,b.date_of_birth,a.from_date) is null then ''
     when datediff(yyyy,b.date_of_birth,a.from_date) <=1 then 'less than 1 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=17 then 'then 1 - 17 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=29 then 'then 19 - 29 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=39 then 'then 30 - 39 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=49 then 'then 40 - 49 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=59 then 'then 50 - 59 year'
     when datediff(yyyy,b.date_of_birth,a.from_date) <=64 then 'then 60 - 64 year'
else 'More than 64 years' end as Age_band
from TableA a 
left join ( select program_id,date_of_birth,max(process_date) 
            from TableB 
            group by program_id,date_of_birth) b
    on a.program_id=b.program_id;

my inner query is giving me a unique set of program_id as using group by statement.

None of these tables are indexed. If I have to create an index on both the tables, what kind of index do I need to have on program_id for both the tables.

what all other things I can take care of to improve the joining performance.

need suggestion to optimize the join and indexes.

Best Answer

DDL & DML at the bottom

Standard indexes that could be created based on the information given

CREATE NONCLUSTERED INDEX IX_program_id_date_of_birth_process_date
ON dbo.TableB (program_id,date_of_birth,process_date);
CREATE NONCLUSTERED INDEX IX_TableA
ON dbo.TableA(program_id);

Running the query

SET STATISTICS IO, TIME ON;
select 
a.*,b.date_of_birth
from dbo.TableA a 
left join ( select program_id,date_of_birth,max(process_date)  as maxprocess_date
            from dbo.TableB 
            group by program_id,date_of_birth) b
    on a.program_id=b.program_id;

The NC index on TableA is used since ID is included in the NC index and no other columns are in TableA except the program_id coluln, which is a key column of that index.

But, since you are using select a.* you will probably have to add all these columns to the NC index on TableA as included columns, which will increase your index size depending on the amount of columns in the table.

If we add a varchar(3) column and populate it with a value:

ALTER TABLE dbo.TableA ADD bla varchar(3) 
DEFAULT ('bla')
WITH VALUES;

The index is no longer used:

Because of the additional bla column added:

To fix this, recreate the index with the bla column included.

DROP  INDEX IX_TableA ON dbo.TableA;
CREATE NONCLUSTERED INDEX IX_TableA
ON dbo.TableA(program_id)
INCLUDE(bla);

Additionally

None of these tables are indexed.

Heap tables are not ideal, you should consider adding a clustered index that makes sense and reading up on indexing.

More on the difference between heap tables and tables with a Clustered index here

select a.*

Using Select * is not a good practice, consider naming all columns used in your select statement separately.

Bad habits to kick : using SELECT *

DDL + DML used for testing

CREATE TABLE 
dbo.TableA(ID INT IDENTITY(1,1) PRIMARY KEY,
           program_id INT);
CREATE TABLE 
dbo.TableB(ID INT IDENTITY(1,1) PRIMARY KEY,
           program_id INT,
           date_of_birth date,
           process_date datetime2)
-- Dataset reduced by / 10 ~= 3 588 378

INSERT INTO 
dbo.TableA WITH(TABLOCK)  (program_id)
SELECT TOP(3588378) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) / 1000
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2;
-- values from 0 to 3588

-- Dataset reduced by / 10 ~= 2 734 333
INSERT INTO 
dbo.TableB WITH(TABLOCK) (program_id,date_of_birth,process_date) 
SELECT TOP(2734333) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) / 1000,
CAST('01/01/1990' AS date),
DATEADD(MINUTE,-ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),GETDATE()) -- one minute less to get distinct process_dates
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2;
-- values from 0 to 2734

Related Solutions

SQL Server 2008 R2 – Managing Multiple Files and Filegroups

1: yes.

2: depends;) The query execution is still driven by the optimizer. It wont do paralellization for small result sets.

3: Ah - what would that be good given that you are still reading one backup file ;)?

4: Not running the files on one underlying SAN and thus improoving your IO budget?

The most brutal thing I have ever seen like that had nearly 30 files all on separate SAN volumes (count of hard discs going close to 200) and it was done as every LUN had a queue limit in the driver of 255 outstanding requests, which the SAN (with a 32 gigabyte cache) was just not caring about ;) THat thing was pulling in nearly 1.5 gigabyte per second over multiple fiber connections.

Sql-server – SQL join query to show rows with non-existent rows in one table

Thank you for SQLfiddle and sample data! I wish more questions started this way.

If you want all members regardless of whether they have an entry for that date, you want a LEFT OUTER JOIN. You were very close with this version however a little trick with outer joins is that if you add a filter to the outer table in the WHERE clause, you turn an outer join to an inner join, because it will exclude any rows that are NULL on that side (because it doesn't know if NULL would match the filter or not).

I modified the first query to get a row for every member:

SELECT Members.Member_ID
      ,Time_Entry.Date_Start
      ,Time_Entry.Hours_Actual
      ,Time_Entry.Hours_Bill
FROM dbo.Members
  LEFT OUTER JOIN dbo.Time_Entry
--^^^^ changed from FULL to LEFT
  ON Members.Member_ID = Time_Entry.Member_ID
  AND Time_Entry.Date_Start = '20131110';
--^^^ changed from WHERE to AND

I'll leave it as an exercise for the reader to take it from there and add the other columns, formatting, COALESCE etc.

Some other notes:

please always use the schema prefix when creating and referencing objects
please always use a length when converting to varchar etc.
stay away from ambiguous, regional date formats like mm-dd-yyyy

consider using aliases to make your queries easier to read. E.g. the above could be re-written as:

SELECT m.Member_ID
  ,t.Date_Start
  ,t.Hours_Actual
  ,t.Hours_Bill
FROM dbo.Members AS m
LEFT OUTER JOIN dbo.Time_Entry AS t
ON m.Member_ID = t.Member_ID
AND t.Date_Start = '20131110';

... a lot tidier, IMHO, as long as you use sensible aliases.

Best Answer

Related Solutions

SQL Server 2008 R2 – Managing Multiple Files and Filegroups

Sql-server – SQL join query to show rows with non-existent rows in one table

Related Question