Excessive log throttling on Azure SQL on business critical tier

azure-sql-database

We moved from the Biz Critical vCore Gen4_8 tier to Gen5_8 over the holidays. We had to move because our log growth is constantly pushing us into the 1 TB limit of Gen4. Not only are our queries taking twice as long in Gen5, but we are seeing a lot of log rate throttles as shown here:

enter image description here

Why are there three different log rate throttles – what is the difference between them? If you remove the non-important CXCONSUMER wait, almost half of our waits are from the 3 throttles.

I can reproduce the log rate throttles with nothing else running in our pool with a simple "SELECT TOP 20000000 * INTO newtable FROM oldtable" query. We are selecting from a very narrow table (4 int and 2 float columns) that has a CCI and no fragmentation.

enter image description here

We contacted support and they responded with "its because of poor indexing or a very inefficient query plan". Obviously, there is nothing wrong with this query. It runs in 1:50 on Gen5 consistently, 49 secs in Gen4, and 20 secs on my old laptop.

When I run this query, the DMVs show the I/O for this query as: total_logical_writes=150258, total_physical_reads=44080.

Given that it takes 110 seconds to run, can we then say that each write and read is a single I/O which would make our average IOPS about 1766 ((150258+44080)/110)? If not, is there a good way to measure how many IOPS a query hits? I bring this up because the DB limit for Gen5_8 shows "Target IOPS" = 20000, but I think we are well below that. Also is "target" an actual limit or an estimate of the throughput we could expect?

We never had throttling under Gen4 and are paying $50k a year for this service tier and our app is barely usable now – any suggestions/thoughts on where we go from here would be much appreciated!

Best Answer

After a few weeks with the Azure product team we have great answers to these questions.

The main bottleneck in Azure for us is Log I/O. Each logical processor has 6 MB/s throughput. So, on Gen4_8, we have 16 logical processors and get 96 MB/s log I/O throughput. On Gen5_8, there are 8 processors so the throughput is 48 MB/S.

The reason for these limits is that if they allowed higher log I/O throughput they wouldn't be able to guarantee that the log shipping job that runs every 5 minutes would be able to capture all the transactions that occurred.

There is also a major issue with using MAXDOP on Azure right now. By adding MAXDOP 1 to the SELECT INTO statements and our parallel inserts, our performance improved tremendously. Do NOT use MAXDOP 8 or higher on Gen4 or Gen5! I tested on Gen5_16 and MAXDOP 8 adds a 10 fold increase to run times. On Gen4_8, our production pool, our parallel insert for loading 1 billion records goes from 40 minutes with MAXDOP 1 to 3.5 hours with MAXDOP 8 (yes, which means we aren't really doing a parallel insert).