Batching a large data set into smaller batches, e.g. 100,000 or 1,000,000 rows, will make the load run better than one big insert. The same is true of SSIS, of course, since it batches inserts as well.
The difference in time that you show in your example is fairly small. That does not give much hope for dramatic speed improvements, but it does suggest that other import methods are worth trying.
EDITS included below.
If you are willing to first extract to a .csv file, you could use BCP to extract from a view to the .csv file, and then use BULK INSERT to load that file into your database. BULK INSERT is described here: http://msdn.microsoft.com/en-us/library/ms188365.aspx
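As a rough sketch of the extract step (the view name, output path, and server name below are placeholders; -T uses Windows authentication, -c exports character data, and -t sets the field terminator):
bcp "SELECT col1, col2 FROM dbo.SourceView" queryout "C:\staging\source_data.csv" -c -t"," -S MyServer -T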
If you want to do the imports all in T-SQL then you might use the BULK INSERT command. When you look at its parameters, you will see that it includes KILOBYTES_PER_BATCH and ROWS_PER_BATCH as ways of controlling batch size.
Because of the speed of BCP and BULK INSERT, along with that control of the batch size, I believe this would be one of the quicker methods to use. But it does require the intermediate .csv file.
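A minimal sketch of the load step, assuming the hypothetical file and table names used above (the batch value is illustrative):
BULK INSERT dbo.TargetTable
FROM 'C:\staging\source_data.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    ROWS_PER_BATCH = 100000, -- approximate number of rows in the file; helps the server optimize the load
    TABLOCK                  -- table-level lock, which can enable minimal logging into a heap
);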
Other issues: is your data coming from the same server, or from another server? If the same server, then access is fairly easy. If it is on another server and you do not want the intermediate .csv file, you can create a linked server and query the data from the other server.
However, as you noted, this will mean that you have to manage the batches yourself. This is similar to what you described as a 'cursor loop', although it probably does not need a cursor, just a loop selecting the next 'n' rows that you want to copy, as sketched below. If the data is on another server, the overhead of this approach will be higher.
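A minimal sketch of such a loop, assuming hypothetical source and target tables that share an increasing integer id key, with the target starting out empty and 100,000 rows copied per batch:
DECLARE @batch_size int = 100000;
DECLARE @last_id int = 0;
DECLARE @rows_copied int = 1;

WHILE @rows_copied > 0
BEGIN
    -- copy the next batch of rows above the current watermark
    INSERT INTO dbo.TargetTable (id, col1, col2)
    SELECT TOP (@batch_size) s.id, s.col1, s.col2
    FROM dbo.SourceTable AS s
    WHERE s.id > @last_id
    ORDER BY s.id;

    SET @rows_copied = @@ROWCOUNT;

    -- advance the watermark to the highest id copied so far
    SELECT @last_id = ISNULL(MAX(id), @last_id) FROM dbo.TargetTable;
END;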
I think it could be helpful for you to think of MapReduce as (essentially) a distributed query engine. I know it isn't one-to-one, but with SELECT and aggregate functions such as sum(), max(), etc. being very much like a Map() operation, and GROUP BY being very much like a Reduce(), there is quite a bit of similarity.
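For instance, in a hypothetical aggregate query against your locations table, the per-row evaluation and aggregate functions play the Map role and the grouping plays the Reduce role:
SELECT
criteria_3,                    -- grouping key, analogous to the key emitted by a Map() step
count(*) AS matching_rows,     -- aggregation within each group, analogous to Reduce()
max(criteria_1) AS max_criteria_1
FROM locations
GROUP BY criteria_3;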
In your case, on the same hardware, I think the only benefit you will be able to gain from using MapReduce is the distribution of the work across multiple cores. Remember that query processes in PostgreSQL only use a single core per query, so a lot of your 64 cores go to waste if you're running this one query at a time.
Parallelization
Perhaps you'd be able to break up the query so that it runs over segments of your locations table, and then run those queries in parallel using a connection pooler like pgBouncer?
For some good info about accessing data in sequential chunks (pagination), check out this amazing blog post by Markus Winand.
Here is an example of how you might break this into chunks (warning: untested SQL ahead):
SELECT
id
FROM
(SELECT * FROM locations LIMIT x OFFSET y) AS locs_fragment
WHERE
a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
criteria_2 = 23 AND
criteria_3 = 'ABCD' AND
(criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
criteria_5 = 'A' AND
criteria_6 = 1;
Note that the sub-select SELECT * FROM locations LIMIT x OFFSET y, given no additional criteria, will extract rows in ctid order. By selecting appropriate LIMIT and OFFSET values for a range of queries, you can essentially run the query in parallel.
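For example (a chunk size of 1,000,000 rows is purely illustrative), each of four parallel sessions would use a different slice of the table in the sub-select, keeping the outer WHERE clause the same:
SELECT * FROM locations LIMIT 1000000 OFFSET 0;       -- session 1: first million rows
SELECT * FROM locations LIMIT 1000000 OFFSET 1000000; -- session 2: second million rows
SELECT * FROM locations LIMIT 1000000 OFFSET 2000000; -- session 3: third million rows
SELECT * FROM locations LIMIT 1000000 OFFSET 3000000; -- session 4: fourth million rows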
Denormalization
Don't rule out denormalization as an option as well. In this case, it can help you get the results you want much more quickly. Of course, denormalization will only really help if your query criteria are fairly stable over time, so that you can effectively pre-calculate result sets.
I'd say there are two decent options:
(1) Use partial indexing, where a given index is designed around some set of criteria, for example:
CREATE INDEX idx_locations_criteria_set_0
ON locations (id)
WHERE
a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
criteria_2 = 23 AND
criteria_3 = 'ABCD' AND
(criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
criteria_5 = 'A' AND
criteria_6 = 1;
With such an index defined, if you make a query with a matching predicate, an index scan is performed rather than having to perform all the calculations for the SELECT statement.
(2) Another denormalization option is to use triggers during INSERT on your locations table to verify whether certain conditions are met and, when they are, store the id of that locations entry in another table, thus making a simple table scan all you need to do at a later time to find matching entries.
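A rough sketch of that approach, with a made-up side table and trigger function, and with the predicate abbreviated to just the equality checks from the query above:
CREATE TABLE locations_criteria_set_0 (id integer PRIMARY KEY);

CREATE OR REPLACE FUNCTION record_matching_location() RETURNS trigger AS $$
BEGIN
    -- only the equality conditions are shown; the full predicate would mirror the query above
    IF NEW.criteria_2 = 23 AND NEW.criteria_3 = 'ABCD' AND NEW.criteria_5 = 'A' AND NEW.criteria_6 = 1 THEN
        INSERT INTO locations_criteria_set_0 (id) VALUES (NEW.id);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_locations_criteria_set_0
AFTER INSERT ON locations
FOR EACH ROW EXECUTE PROCEDURE record_matching_location();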
Both of these cases will require additional storage space, as well as some minor INSERT overhead for predicate checking, but they allow you to distribute your predicate checking over time rather than being forced to do it all at once when a query is issued.
Conclusions
As with any advice on the internet, take mine with a grain of salt! :P I've done some work in both of these areas, but that doesn't make me an expert in your application area. Run some small scale tests on either approach to see what works best for your application.
Best Answer
You say that you're loading data from one table to another in the same database without transformations. SSIS isn't really the right tool for the job. With a reasonable table structure, properly written T-SQL inserts will perform better than loading through SSIS, because you avoid the overhead of sending data to another server, passing it through buffers, and sending it back to the destination server. Behind the scenes SSIS is just doing a BULK INSERT, which isn't going to be faster than a T-SQL INSERT except in edge cases.
In the interest of giving a complete answer, I'll describe a few of the features of SSIS which can be helpful from a data loading performance perspective. Consider a table in SQL Server with a complex five-column clustered index and many nonclustered indexes, and suppose you can't disable the NCIs. You might get the best possible loading performance by splitting the insert into batches, because smaller sorts will be used to do the NCI maintenance. It may be difficult to code an efficient batching solution in SQL Server with a five-column clustered key, but this is trivial to do in SSIS: it can read the data in order and split it into batches for you just by changing a parameter. So in that case you may get better performance with SSIS, because the T-SQL code wasn't written to be as efficient as possible.
For another example, a BULK INSERT is able to take a BU lock on a target heap. BU locks are compatible with other BU locks, so you can get multiple concurrent bulk inserts all loading minimally logged data into the same heap without blocking. That cannot be accomplished with T-SQL alone, because the BU lock is not available in that context. SQL Server 2017 offers parallel inserts into heaps in T-SQL, so the benefit of a BU lock is diminished when loading data from a table with an identical structure. An example where a BU lock is better than anything in T-SQL would be loading from a table with a clustered index into a heap on SQL Server 2014 Standard Edition: in T-SQL you can only run a single insert at a time, but in SSIS you could have many concurrent inserts running.
In summary, SSIS is almost certainly the wrong tool for the scenario that you described. You should expect to see better performance with properly written T-SQL code, with the exception of a few edge cases, because it avoids the overhead inherent in moving data between servers.
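For reference, the properly written T-SQL path being recommended here often amounts to a single INSERT ... SELECT; a minimal sketch with placeholder table and column names:
-- TABLOCK can enable minimal logging when the target is a heap (recovery model permitting)
-- and, on recent versions, a parallel insert plan
INSERT INTO dbo.TargetTable WITH (TABLOCK) (col1, col2, col3)
SELECT col1, col2, col3
FROM dbo.SourceTable;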