SQL Server – Optimizing performance of INSERT INTO table from INNER JOIN between two large tables

performance, query-performance, sql-server, sql-server-2012

I have the following query:

DECLARE @accountCard TABLE (
    numUsu VARCHAR(10)
    ,quantity INT
    )

/*Takes less than a minute*/
INSERT INTO @accountCard
SELECT a.numUsu
    ,count(*)
FROM tableB a
INNER JOIN tableA b ON b.numUsu = a.numUsu
GROUP BY a.numUsu

/*This is the one that takes too long*/
INSERT INTO finalTable
SELECT tarjeta = 'FIXED'
    ,fieldA = c.fieldA
    ,fieldB = c.fieldB
    ,fieldC = convert(BIGINT, b.numUsu)
    ,fieldD = ''
    ,fieldE = ''
    ,fieldF = rtrim(fieldF)
    ,fieldG = rtrim(fieldG)
    ,fieldH
    ,fieldI
FROM @accountCard a
INNER JOIN tableB b ON a.numUsu = b.numUsu
INNER JOIN tableA c ON b.numUsu = c.numUsu
WHERE quantity = 1

tableB has 2.7 million rows, and tableA has 1.7 million. Both tables have non-clustered, non-unique indexes on the column I join on (numUsu). Currently the second INSERT takes too long: we left it running for over 2 days and it still had not finished.

I looked at the Estimated Execution Plan in SQL Server, and the most expensive operator was a Key Lookup (Clustered) on tableA, using the primary key, which isn't numUsu.

What can I do to speed it up?

Update 1:

select session_id, 
    status, 
    command,
    blocking_session_id,
    wait_type, 
    wait_time,
    last_wait_type,
    wait_resource
from sys.dm_exec_requests r
where r.session_id >= 50
and r.session_id <> @@spid;

There's only one session running and the result is:

+------------+-----------+---------+---------------------+----------------+-----------+----------------+---------------+
| session_id |  status   | command | blocking_session_id |   wait_type    | wait_time | last_wait_type | wait_resource |
+------------+-----------+---------+---------------------+----------------+-----------+----------------+---------------+
|         59 | suspended | INSERT  |                   0 | PAGEIOLATCH_SH |         0 | PAGEIOLATCH_SH | 5:1:2431256   |
+------------+-----------+---------+---------------------+----------------+-----------+----------------+---------------+

Best Answer

The easiest way to speed up your query is to insert the data from the first query into a #temp table instead of a table variable. Table variables do not get statistics in SQL Server, and the query optimizer will always assume that your table variable has one row, which can lead to very poor performance when the table variable actually contains many more rows than that. Based on your description of the plan, it's possible that the query optimizer is doing full scans of tableA many times, which could explain why your query did not finish after 2 days.
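If you want to see the one-row estimate for yourself, here is a rough sketch (the table variable and its column are made up for the demo): load a table variable with a lot of rows, then look at the actual execution plan for the final SELECT.

DECLARE @t TABLE (numUsu VARCHAR(10));

-- Fill the table variable with 100,000 rows
INSERT INTO @t
SELECT TOP (100000) CONVERT(VARCHAR(10), ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

-- With "Include Actual Execution Plan" enabled, the scan of @t shows an
-- estimated number of rows of 1 even though it actually returns 100,000.
SELECT COUNT(*)
FROM @t;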

Adjusting your code to use a temp table is straightforward. It would look something like this:

CREATE TABLE #accountCard (
    numUsu VARCHAR(10),
    quantity INT
);

INSERT INTO #accountCard
SELECT a.numUsu, count(*)
FROM tableB a 
INNER JOIN  tableA b ON b.numUsu = a.numUsu
GROUP BY a.numUsu;
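The second INSERT then only needs the table reference swapped from @accountCard to #accountCard; everything else can stay as in your original query:

INSERT INTO finalTable
SELECT tarjeta = 'FIXED'
    ,fieldA = c.fieldA
    ,fieldB = c.fieldB
    ,fieldC = CONVERT(BIGINT, b.numUsu)
    ,fieldD = ''
    ,fieldE = ''
    ,fieldF = RTRIM(fieldF)
    ,fieldG = RTRIM(fieldG)
    ,fieldH
    ,fieldI
FROM #accountCard a
INNER JOIN tableB b ON a.numUsu = b.numUsu
INNER JOIN tableA c ON b.numUsu = c.numUsu
WHERE quantity = 1;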

In my experience, table variables can be useful for logging data that you do not want to lose when a transaction is rolled back. For example, you might detect an error in your application and need to roll back a statement; troubleshooting data written to a table variable is preserved after the rollback, whereas data written to a temp table is not. Table variables can also be useful for a process that runs many times per day (in the thousands at least) and inserts a small number of rows, as long as the queries that reference the table variable would not benefit from statistics on it.
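As a quick illustration of the rollback behaviour (a minimal sketch with made-up names), the row written to the table variable survives the rollback while the row written to the temp table does not:

DECLARE @log TABLE (msg VARCHAR(100));
CREATE TABLE #log (msg VARCHAR(100));

BEGIN TRANSACTION;
    INSERT INTO @log VALUES ('captured before the rollback');
    INSERT INTO #log VALUES ('captured before the rollback');
ROLLBACK TRANSACTION;

SELECT COUNT(*) AS rows_in_table_variable FROM @log; -- 1: survives the rollback
SELECT COUNT(*) AS rows_in_temp_table     FROM #log; -- 0: rolled back with the transaction

DROP TABLE #log;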

If you think there's any chance that your queries would benefit from statistics on the table, I would err on the side of caution and use a temp table. The query in your question looks like something that processes a large amount of data and would definitely benefit from statistics, so I would use a temp table.

If you'd like to learn more about the differences between table variables and temp tables, @Martin Smith wrote an excellent summary here.

There is potentially another performance problem with your query. You said that the numUsu column is not unique, but you are joining on it twice. Are you sure that your insert query returns the correct number of rows? Consider the case where a single value of numUsu has 100 rows in tableA and 100 rows in tableB. With the joins that you have, you will end up with 100 × 100 = 10,000 rows for that one row in your temp table. Is that what you want to happen?

Here's a simple demo to demonstrate the kind of trap that you can fall into:

CREATE TABLE #accountCard
( numUsu varchar(10));

CREATE TABLE #tableA
( numUsu varchar(10));

CREATE TABLE #tableB
( numUsu varchar(10));

INSERT INTO #accountCard
VALUES ('1');

INSERT INTO #tableA
VALUES ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1');

INSERT INTO #tableB
VALUES ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1'), ('1');

SELECT COUNT(*) -- returns 100: the join produces 10 × 10 rows
FROM #accountCard a
INNER JOIN #tableB b ON a.numUsu = b.numUsu
INNER JOIN #tableA c ON b.numUsu = c.numUsu;

Of course, depending on your data it's possible that you don't have this problem. I just wanted to mention it as a possible cause for the long running query.