Sql-server – Bulk SQL INSERT into Azure SQL Database using spark causes blocking/contention

apache-sparkazurequery-performancesql server

I am running the following code using microsoft's sql sparkconnector to write a 1-2 Billion dataframe into Azure SQL Database.

df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.mode("append") \
.option("url", secrets.db.url) \
.option("dbtable", 'tableName') \
.option("user", secrets.db.user) \
.option("password", secrets.db.password) \
.option("batchsize", 1048576) \
.option("schemaCheckEnabled", "false") \
.option("BulkCopyTimeout", 3600) \
.save()

This is a snapshot of the DB Utilization graph

And These are the first few rows from the following query using sp_whoisactive

EXEC sp_WhoIsActive
@find_block_leaders = 1,
@sort_order = '[blocked_session_count] DESC'

The wait_info column's value is Resource_Semaphore.

Configs:
My dataframe is partitioned over 2100 partitions on a cluster of 900 cores
My database is 14 vcores in General Purpose tier on Azure.

My query is incredibly slow because of this blocking. It's almost like it's running one bulk insert from my cluster at a time. Any suggestions on what to change to speed it up? or any insights into why it's blocking?

Best Answer

You could be hitting the Resource Governance limits for the Azure SQL DB. Query the sys.dm_db_resource_stats while running the insert and you will see more detail on what is happening.

Possible workarounds could be to load into TempDB first of all and then process into your main DB (TempDB has higher resource governor limits) or scale up for the bulk load then scale back down afterwards. Also consider partitioning of the table you are inserting into and consider the tips here