Sql-server – MS SQL 2017 Batch insert performance degradation

odbc python sql-server sql-server-2017

This one's a bit strange so bear with me:

I'm seeing extreme insert performance degradation after an insert was interrupted in a dirty fashion (the transaction was not committed and the connection was not closed properly).

Before the incident, insert performance was roughly 100m rows/hour with 500k-row batches (about 20 seconds/batch).

After the incident, batch inserts with more than ~80k rows never complete (I left one running for more than a day), whereas an insert with 20k rows completes in ~95 seconds, but that results in an insert performance of only around 18m rows/hour. The machine SQL Server is installed on does not experience any load. As far as I can tell, the server is idling.

What trips me up about this is that dropping the database or even reinstalling the server instance does not resolve the issue. I simply can't reproduce the performance I was seeing pre incident. On top of that, I had observed this exact phenomenon with Azure SQL instances, except that dropping the database "solved" the issue there.

The way this issue presents itself suggests that the problem is not actually with SQL Server but rather on the insert side, in the driver. However, I've tested basically every way to connect to an MS SQL instance from Python, with the same result.

The SQL Server 2017 instance is updated and running on a fairly powerful dedicated machine with NVMe drives located in the same datacenter as the client machine.

The inserts are executed using:

Ubuntu 16.04
Python 2.7
pyodbc 4.0.23 (also tested pymssql+freetds and pypyodbc)
Microsoft ODBC driver 17.2.0.1

Best Answer

Okay, so it is a very specific issue with PyODBC after all:

I create tables with IGNORE_DUP_KEY = ON, as ensuring no duplicate insertions would be much more expensive in our application, and previously inserted IDs get updated shortly afterwards anyway. The incident described in the original question (in combination with an unrelated bug) left the client application in a somewhat erroneous state, trying to re-insert previously inserted IDs. But even this should not normally be an issue.

I'm not sure yet why, but PyODBC seems to take offense at duplicate primary key insertions. It becomes completely unresponsive (ignoring configured timeouts and Ctrl+C) and uses 100% CPU. The SQL Server logs show that it repeatedly kills its connection to the server and reconnects. I'll post this issue over on their GitHub.

Related Solutions

Sql-server – Insert performance with Geography column

After doing some research, including looking at this question, it appears it isn't directly possible to efficiently bulk load any of the CLR-based types, including geography. ¹

You said that adding just the geography column added a significant amount of time to the load process -- this may, in fact, be entirely reasonable, depending on the amount of data that's going into that column.

The most concerning thing is that it's taking so long to load the data without the geography column involved.

When doing large data loads, maximizing throughput is the most important part of optimizing the process, particularly if the client application and database are separated by a network connection. (By the way, I'm going to assume that using a client application to load the data is the best architectural decision for the problem here.)

To maximize throughput means minimizing other types of overhead, such as data preparation on the client, and network roundtrips between servers.

Based on the comments, you have some type of loop happening that reads a row from the source and writes it to the target immediately. Something like this (pseudo-code):

while (more rows to read from source)
{
    read one row from source
    write one row to target
}

This is inefficient primarily because it maximizes network roundtrips and small chunks of I/O on the target, which are both expensive in aggregate.

Probably the easiest way to modify your code to fix this issue is to introduce a delayed-write scheme, where a chunk of rows is read from the source and flushed to the target all at once. Something like this:

while (more rows to read from source)
{
    read 1..batchSize rows from source
    write 1..batchSize rows to target
}
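The batched loop above could be sketched in Python roughly as follows. This is only an illustration: `batched` and `copy_in_batches` are hypothetical helper names, `read_rows` stands in for whatever iterable your source provides, and `write_batch` stands in for whatever call sends one chunk to the target in a single round trip.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of up to batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def copy_in_batches(read_rows, write_batch, batch_size=1000):
    """Read rows lazily and hand them to write_batch one chunk at a time.

    read_rows   -- any iterable of rows (hypothetical source)
    write_batch -- callable that writes a list of rows to the target
                   (hypothetical; ideally one network round trip per call)
    Returns the total number of rows written.
    """
    total = 0
    for chunk in batched(read_rows, batch_size):
        write_batch(chunk)
        total += len(chunk)
    return total
```

With 2,500 source rows and a batch size of 1,000, `write_batch` is called three times instead of 2,500 times.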

Even using INSERT ... VALUES with multiple rows at a time (note: there is a 1000-row limit per statement) will significantly improve performance. The test harness I constructed saw a 43x improvement between a batch size of 1 and a batch size of 1,000. ²

A note about parameterization. While this is a best practice when it comes to application security, for a batch loading process, parameterized statements can cripple performance because of the associated processing time. If you start batching things together, it's still possible to parameterize the whole statement, but the performance gets worse and worse the more parameters you use (there's a limit of 2,100 parameters per statement anyway). Assuming the data source values are "safe," I would actually recommend doing your own escaping (replace ' with '') for the string columns and concatenating all values directly into the SQL text as literals. By the way, if you're using .NET, make sure you use a StringBuilder to concatenate strings.
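As a rough sketch of the literal-concatenation approach in Python (assuming the values really are trusted): `sql_literal` and `build_insert` are hypothetical helpers, the table and column names are made up, and the N'...' prefix is my own choice to keep Unicode strings intact. Only integers, strings, and NULL are handled here; table and column names are inserted as-is, so they must come from your own code, not from the data.

```python
def sql_literal(value):
    """Render a trusted Python value as a T-SQL literal (ints, strings, None only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, int):
        return str(value)
    # Escape embedded single quotes by doubling them, as described above.
    return "N'" + str(value).replace("'", "''") + "'"


def build_insert(table, columns, rows, max_rows=1000):
    """Build one multi-row INSERT ... VALUES statement.

    max_rows mirrors the 1000-row limit per statement mentioned earlier.
    """
    if not rows:
        raise ValueError("no rows to insert")
    if len(rows) > max_rows:
        raise ValueError("too many rows for a single INSERT ... VALUES")
    values = ",\n".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return "INSERT INTO {} ({}) VALUES\n{};".format(
        table, ", ".join(columns), values
    )
```

For example, `build_insert("dbo.Places", ["Id", "Name"], [(1, "O'Hare"), (2, None)])` produces a single statement with two value tuples, with the apostrophe doubled.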

As for how to move the geography column around efficiently: simply batching the INSERTs may be enough. If not, it may be advantageous to attempt loading (or bulk loading) the text definition of the column (I'm assuming you're given a text list of points or something from the data sources) along with the rest of the data into a staging table, then populate the target by doing an INSERT/SELECT to convert the text definition into the CLR type. (Note: I haven't tested this.)
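The conversion step of the staging-table idea might look something like the statement generated below. This is an untested sketch under several assumptions: the text definition is well-known text (WKT), the SRID is 4326, and all table and column names are hypothetical. `geography::STGeomFromText` is the built-in T-SQL method for building a geography instance from WKT; `staging_to_target_sql` is just an illustrative helper that assembles the SQL text.

```python
def staging_to_target_sql(staging, target, key_col, wkt_col, geo_col, srid=4326):
    """Return T-SQL that copies rows from a staging table into the target,
    converting a WKT text column into a geography column on the server."""
    return (
        "INSERT INTO {target} ({key}, {geo})\n"
        "SELECT {key}, geography::STGeomFromText({wkt}, {srid})\n"
        "FROM {staging};"
    ).format(
        target=target,
        staging=staging,
        key=key_col,
        geo=geo_col,
        wkt=wkt_col,
        srid=int(srid),  # assumption: 4326 unless the data says otherwise
    )
```

The point of the design is that the client only ever sends plain text; the conversion to the CLR type happens in one set-based statement on the server.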

Either way, if you haven't reached network saturation after doing that, and depending on how fast you can read from the data source vs. write to the target, the code could be refactored into a producer/consumer architecture where reads and writes happen asynchronously, and possibly multithreaded on either or both sides. The details of how to do this are out of scope for this site, but I figured it should be mentioned. I don't think you'll have to go to this length to get acceptable performance.
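For what it's worth, a minimal single-producer, single-consumer version of that idea can be sketched in Python 3 with a bounded queue. Everything here is illustrative: `pipeline` is a hypothetical helper, `read_batches` is any iterable yielding batches from the source, and `write_batch` is whatever callable writes one batch to the target.

```python
import queue
import threading

_DONE = object()  # sentinel telling the writer thread to stop


def pipeline(read_batches, write_batch, max_pending=4):
    """Overlap reading and writing: the caller's thread reads batches while
    a background thread writes them. At most max_pending batches are queued."""
    pending = queue.Queue(maxsize=max_pending)
    errors = []

    def consumer():
        while True:
            batch = pending.get()
            if batch is _DONE:
                return
            if errors:
                continue  # after a failure, drain the queue without writing
            try:
                write_batch(batch)
            except Exception as exc:
                errors.append(exc)

    writer = threading.Thread(target=consumer)
    writer.start()
    for batch in read_batches:
        pending.put(batch)  # blocks when the writer falls behind
    pending.put(_DONE)
    writer.join()
    if errors:
        raise errors[0]
```

The bounded queue keeps memory use predictable: if the target is slower than the source, the reader simply waits instead of buffering everything.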

Finally, for completeness of this answer, in terms of indexes, for a one-time load, it's advantageous to create nonclustered indexes after all the data has been loaded into the table. This reduces the number of internal index inserts, much in the same way that batching INSERT statements reduces network roundtrips. It's probably best to create the clustered index, however, before loading, so there isn't a need to rewrite the whole table to give it order (assuming the clustering key is ever-increasing).

¹ Well, probably. If you extract Microsoft.SqlServer.Types from the resource database and reference it in an assembly, it should be possible; however, this is wildly unsupported and will probably break the process horribly if Microsoft updates it between versions.

² This was a single-threaded test inserting constant values of integers and strings, not selecting from a data source; 1,500,000 rows at 201 bytes/row; using INSERT ... VALUES took 771,971ms for a batch size of 1, and 18,120ms for a batch size of 1,000. I also tested the same data with SqlBulkCopy for comparison, and it took 16,576ms for a batch size of 1,000, and 13,829ms for a batch size of 10,000 (this batch size saturated throughput). You can download the source from here -- add the program to a console application; I used .NET 4.0.

SQL Server Connection – Connecting to SQL Server from R on Remote UNIX Machine

I changed the delimiter between the server and the port from a colon (:) to a comma (,). That fixed my problem.

dbhandle <- odbcDriverConnect('driver=ODBC Driver 11 for SQL Server;server=10.XXX.XX.XX,1663;')
