Sql-server – MS SQL 2017 Batch insert performance degradation

odbcpythonsql serversql-server-2017

This one's a bit strange so bear with me:

I'm seeing extreme insert performance degradation after an insert was interrupted in a dirty fashion (the transaction was not committed and the connection was not closed properly).

Pre incident, insert performance was roughly 100m rows/hour with 500k batches (about 20 seconds/batch).

Post incident, batch inserts with more than ~80k rows will never complete (left it running for more than a day) whereas an insert with 20k rows completes in ~95 seconds but that results in an insert performance of only around 18m rows/hour. The machine the SQL server is installed on does not experience any load. As far as I can tell the server is idling.

What trips me up about this is that dropping the database or even reinstalling the server instance does not resolve the issue. I simply can't reproduce the performance I was seeing pre incident. On top of that, I had observed this exact phenomenon with Azure SQL instances, except that dropping the database "solved" the issue there.

The way this issue presents itself indicates that the problem is not actually with the SQL server but rather on the insert side / the driver. However I've tested basically all possible ways to connect to a MS SQL instance from python with the same result.

The SQL Server 2017 instance is updated and running on a fairly powerful dedicated machine with NVMe drives located in the same datacenter as the client machine.

The inserts are executed using:

  • ubuntu 16.04
  • Python 2.7
  • pyodbc 4.0.23 (also tested pymssql+freetds and pypyodbc)
  • Microsoft ODBC driver 17.2.0.1

Best Answer

Okay, so it is a very specific issue with PyODBC afterall:

I create tables with IGNORE_DUP_KEY = ON as ensuring no duplicate insertions would be much more expensive in our application and previously inserted IDs will get updated shortly afterwards anyways. The incident described in the original question (in combination with an unrelated bug) left the client application in a somewhat erroneous state, trying to re-insert previously inserted IDs. But even this should not be an issue normally.

I'm not sure yet why but PyODBC seems to take offense to duplicate primary key insertions. It becomes completely unresponsive (ignoring configured timeouts and ctrl+c) and using 100% CPU. SQL Server logs show it repeatedly kills its connection to the server and reconnects. I'll post this issue over on their github.