SQL Server – Joining two tables, one containing a large amount of data. Need help optimising the query

sql-server t-sql

I have been given the job of taking two tables (trimmed-down CREATE scripts below) and merging their data into a third table.

CREATE TABLE dbo.Ztest_nominal
(
  BMCode       varchar(20)     NOT NULL,
  nomcode      varchar(500)    NOT NULL,
  Description  nvarchar(4000)  NULL,
  LanguageID   int             NOT NULL,
  CONSTRAINT PK_Ztest_nominal PRIMARY KEY CLUSTERED (BMCode ASC)
)

CREATE TABLE dbo.Ztest_Participation
(
  ParticipationID       int IDENTITY(1,1)  NOT NULL,
  BMCode                varchar(20)        NOT NULL,
  LatestVersionNo       varchar(5)         NOT NULL,
  LastVersionSubmitted  varchar(5)         NOT NULL,
  CONSTRAINT PK_Ztest_Participation PRIMARY KEY CLUSTERED (BMCode ASC)
)

The third table's columns, which will take the contents of the above (basically a merge of the two tables):

[ParticipationID] [int] IDENTITY(1,1) NOT NULL,
[BMCode] [varchar](20) NOT NULL,
[LatestVersionNo] [varchar](5) NOT NULL,
[LastVersionSubmitted] [varchar](5) NOT NULL,
[nomcode] [varchar](500) NOT NULL,
[Description] [nvarchar](4000) NULL,
[LanguageID] [int] NOT NULL
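
For reference, a minimal CREATE TABLE sketch of that third table (the name FinalResults is an assumption to match the answer below; add whatever constraints and indexes you actually need):

CREATE TABLE dbo.FinalResults  -- assumed name; replace with your actual table name
(
  ParticipationID       int IDENTITY(1,1)  NOT NULL,
  BMCode                varchar(20)        NOT NULL,
  LatestVersionNo       varchar(5)         NOT NULL,
  LastVersionSubmitted  varchar(5)         NOT NULL,
  nomcode               varchar(500)       NOT NULL,
  Description           nvarchar(4000)     NULL,
  LanguageID            int                NOT NULL
)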

Ztest_nominal has about 63,000,000 rows and Ztest_Participation about 62,000.

The SQL I was going to use (with an INSERT INTO) is below (I did have the inner SELECT outside before I gave up and posted here!):

SELECT 
  p.ParticipationId,
  p.LatestVersionNo,
  n.*
FROM
  Ztest_nominal n
INNER JOIN
  (
    SELECT 
      ParticipationId,
      LatestVersionNo,
      BMCode
    FROM 
      Ztest_participation
    WHERE 
      LatestVersionNo = LastVersionSubmitted
  ) p 
    ON p.BMCode = n.BMCode

However, I feel it's going to need some optimisation so it's not slow; I've also already had out-of-memory errors when running it.

Any help would be appreciated.

Best Answer

So I rewrote your query (under a couple of assumptions) to hopefully set it up for "great success". The three main things the queries below do are:

1: Pre-filter the data from Ztest_Participation into a temp table. Doing this filtering up front pre-materializes just the data you need to JOIN to, which simplifies the final INSERT query for the query optimizer. Materializing it into a temp table can also leverage some of TempDB's performance benefits later on (such as the data potentially being held in memory).

2: The final INSERT query leverages the TABLOCK hint, which results in less log generation and removes the potential for many row-lock escalations, leading to optimized INSERT performance. (Additional information on the benefits of TABLOCK here; a quick recovery-model check related to this point is sketched just after this list.)

3: I simplified your JOIN query between #Ztest_Participation_Filtered and Ztest_nominal by removing the derived table (that filtering now happens in the temp-table step instead). This should also help the query optimizer generate a performant execution plan.
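
As an aside on point 2: the reduced logging from TABLOCK (minimal logging) generally only applies when the database is in the SIMPLE or BULK_LOGGED recovery model, so it's worth a quick check first. A minimal sketch:

-- Check the current database's recovery model; minimal logging with TABLOCK
-- generally requires SIMPLE or BULK_LOGGED rather than FULL
SELECT name, recovery_model_desc
  FROM sys.databases
 WHERE name = DB_NAME()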

-- Pre-filter Ztest_Participation down into a temp table
SELECT P.ParticipationID, 
       P.BMCode, 
       P.LatestVersionNo, 
       P.LastVersionSubmitted
  INTO #Ztest_Participation_Filtered
  FROM Ztest_Participation AS P
 WHERE P.LatestVersionNo = P.LastVersionSubmitted

-- Create the appropriate index needed for our JOIN later on
CREATE CLUSTERED INDEX IX_TEMP_Ztest_Participation_Filtered_BMCode 
    ON #Ztest_Participation_Filtered (BMCode)

-- Final insert into table "FinalResults" (replace this with your actual new table's name)
INSERT INTO FinalResults WITH (TABLOCK) -- TABLOCK hint to reduce log generation and eliminate potential for many row lock escalations
     ( ParticipationID, 
       BMCode, 
       LatestVersionNo, 
       LastVersionSubmitted, 
       nomcode, 
       [Description], 
       LanguageID
     )
SELECT PF.ParticipationID, 
       PF.BMCode, 
       PF.LatestVersionNo, 
       PF.LastVersionSubmitted, 
       N.nomcode, 
       N.[Description], 
       N.LanguageID
  FROM #Ztest_Participation_Filtered AS PF
 INNER JOIN Ztest_nominal AS N -- Simplified JOIN logic to make the query optimizer's job easier
    ON PF.BMCode = N.BMCode

*Note: one assumption my query above makes is that you want to reuse the ParticipationID from Ztest_Participation in your new third table. You won't be able to INSERT into that column with your current schema, which defines ParticipationID as an IDENTITY column in the new table. You'll either need to remove that attribute before the INSERT and add it back afterwards, or temporarily turn on IDENTITY_INSERT for the new table (but turn it back off after the INSERT).
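
For example, a minimal sketch of the IDENTITY_INSERT route (assuming, as above, that the new table is called FinalResults):

-- Allow explicit values to be inserted into the IDENTITY column
SET IDENTITY_INSERT dbo.FinalResults ON

-- ... run the INSERT above (it already lists ParticipationID explicitly,
--     which IDENTITY_INSERT requires) ...

-- Turn it back off once the load has finished
SET IDENTITY_INSERT dbo.FinalResults OFF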

On my humble laptop, with the schemas you provided above and with 63,000,000 and 62,000 semi-randomly generated records filling the tables respectively, the above batch of queries runs in a total runtime of 0 seconds. You can find the execution plan it generated for the FinalResults INSERT query on my end here.