SQL Server – Joining two tables, one containing a large amount of data. Need help optimising the query

sql-server t-sql

I have been given the job of taking two tables (trimmed-down CREATE scripts below) and merging their data into a third table.

CREATE TABLE dbo.Ztest_nominal
(
  BMCode       varchar(20)     NOT NULL,
  nomcode      varchar(500)    NOT NULL,
  Description  nvarchar(4000)  NULL,
  LanguageID   int             NOT NULL,
  CONSTRAINT PK_Ztest_nominal PRIMARY KEY CLUSTERED (BMCode ASC)
)

CREATE TABLE dbo.Ztest_Participation
(
  ParticipationID       int IDENTITY(1,1)  NOT NULL,
  BMCode                varchar(20)        NOT NULL,
  LatestVersionNo       varchar(5)         NOT NULL,
  LastVersionSubmitted  varchar(5)         NOT NULL,
  CONSTRAINT PK_Ztest_Participation PRIMARY KEY CLUSTERED (BMCode ASC)
)

The third table's columns, which will take the contents of the above (basically a merge of the two tables):

[ParticipationID] [int] IDENTITY(1,1) NOT NULL,
[BMCode] [varchar](20) NOT NULL,
[LatestVersionNo] [varchar](5) NOT NULL,
[LastVersionSubmitted] [varchar](5) NOT NULL,
[nomcode] [varchar](500) NOT NULL,
[Description] [nvarchar](4000) NULL,
[LanguageID] [int] NOT NULL
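
For reference, a minimal CREATE TABLE sketch of that third table (the name FinalResults is an assumption to match the answer below; add whatever constraints and indexes you actually need):

CREATE TABLE dbo.FinalResults  -- assumed name; replace with your actual table name
(
  ParticipationID       int IDENTITY(1,1)  NOT NULL,
  BMCode                varchar(20)        NOT NULL,
  LatestVersionNo       varchar(5)         NOT NULL,
  LastVersionSubmitted  varchar(5)         NOT NULL,
  nomcode               varchar(500)       NOT NULL,
  Description           nvarchar(4000)     NULL,
  LanguageID            int                NOT NULL
)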

Ztest_nominal has about 63,000,000 rows and Ztest_Participation about 62,000.

The SQL I was going to use (with an INSERT INTO) is below (I did have the inner SELECT outside before I gave up and posted here!):

SELECT 
  p.ParticipationId,
  p.LatestVersionNo,
  n.*
FROM
  Ztest_nominal n
INNER JOIN
  (
    SELECT 
      ParticipationId,
      LatestVersionNo,
      BMCode
    FROM 
      Ztest_participation
    WHERE 
      LatestVersionNo = LastVersionSubmitted
  ) p 
    ON p.BMCode = n.BMCode

However, I feel it's going to need some optimisation so it's not slow; I've also already had out-of-memory errors when running it.

Any help would be appreciated.

Best Answer

So I rewrote your query (under a couple of assumptions) to hopefully set it up for "great success". The three main things the queries below do are:

1: Pre-filter the data from Ztest_Participation into a temp table. Doing this filtering up front pre-materializes just the data you need to JOIN to, which simplifies the final INSERT query for the query optimizer. Materializing it into a temp table can also leverage some of TempDB's performance benefits later on (such as the data potentially being held in memory).

2: The final INSERT query leverages the TABLOCK hint, which results in less log generation and removes the potential for many row-lock escalations, leading to optimized INSERT performance. (Additional information on the benefits of TABLOCK here; a quick recovery-model check related to this point is sketched just after this list.)

3: I simplified your JOIN query between #Ztest_Participation_Filtered and Ztest_nominal by removing the derived table (that filtering now happens in the temp-table step instead). This should also help the query optimizer generate a performant execution plan.
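
As an aside on point 2: the reduced logging from TABLOCK (minimal logging) generally only applies when the database is in the SIMPLE or BULK_LOGGED recovery model, so it's worth a quick check first. A minimal sketch:

-- Check the current database's recovery model; minimal logging with TABLOCK
-- generally requires SIMPLE or BULK_LOGGED rather than FULL
SELECT name, recovery_model_desc
  FROM sys.databases
 WHERE name = DB_NAME()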

-- Pre-filter Ztest_Participation down into a temp table
SELECT P.ParticipationID, 
       P.BMCode, 
       P.LatestVersionNo, 
       P.LastVersionSubmitted
  INTO #Ztest_Participation_Filtered
  FROM Ztest_Participation AS P
 WHERE P.LatestVersionNo = P.LastVersionSubmitted

-- Create the appropriate index needed for our JOIN later on
CREATE CLUSTERED INDEX IX_TEMP_Ztest_Participation_Filtered_BMCode 
    ON #Ztest_Participation_Filtered (BMCode)

-- Final insert into table "FinalResults" (replace this with your actual new table's name)
INSERT INTO FinalResults WITH (TABLOCK) -- TABLOCK hint to reduce log generation and eliminate potential for many row lock escalations
     ( ParticipationID, 
       BMCode, 
       LatestVersionNo, 
       LastVersionSubmitted, 
       nomcode, 
       [Description], 
       LanguageID
     )
SELECT PF.ParticipationID, 
       PF.BMCode, 
       PF.LatestVersionNo, 
       PF.LastVersionSubmitted, 
       N.nomcode, 
       N.[Description], 
       N.LanguageID
  FROM #Ztest_Participation_Filtered AS PF
 INNER JOIN Ztest_nominal AS N -- Simplified JOIN logic to make the query optimizer's job easier
    ON PF.BMCode = N.BMCode

*Note: one assumption my query above makes is that you want to reuse the ParticipationID from Ztest_Participation in your new third table. You won't be able to INSERT into that column with your current schema, which defines ParticipationID as an IDENTITY column in the new table. You'll either need to remove that attribute before the INSERT and add it back afterwards, or temporarily turn on IDENTITY_INSERT for the new table (but turn it back off after the INSERT).
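
For example, a minimal sketch of the IDENTITY_INSERT route (assuming, as above, that the new table is called FinalResults):

-- Allow explicit values to be inserted into the IDENTITY column
SET IDENTITY_INSERT dbo.FinalResults ON

-- ... run the INSERT above (it already lists ParticipationID explicitly,
--     which IDENTITY_INSERT requires) ...

-- Turn it back off once the load has finished
SET IDENTITY_INSERT dbo.FinalResults OFF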

On my humble laptop, with the schemas you provided above and with 63,000,000 and 62,000 semi-randomly generated records filling the tables respectively, the above batch of queries runs in a total runtime of 0 seconds. You can find the execution plan it generated for the FinalResults INSERT query on my end here.