SQL Server – quite big data flow – disk became the bottleneck … what now?

performance, sql-server

After a lot of searching and monitoring of the system, we are running out of ideas on how to solve our issue efficiently … hence this question. We suspect (are rather sure, in fact) that our storage has become the bottleneck of our system, but we have no experience with how to solve this.

Situation/context: we have a single SQL Server (Standard Edition) instance with:

  • 2 x Intel Xeon Gold 6230 CPUs => a combined count of 40 cores (80 logical)
  • 192 GB RAM
  • 2 x 500 GB SSD in RAID 1 (OS)
  • 3 x 2 TB SSD in RAID 5 (SQL)
  • 1 x 2 TB non-RAID

The data we are processing is a continuous flow of inserts throughout the day. Files are pushed to our server and processed, 20 files at a time, generating batched inserts. This data needs to be queried by other processes as soon as it enters the database. To give an idea of the data flow (of the troublesome database): the "SQL Server Activity Monitor -> Data File I/O" currently shows (early morning, a calm period on the system) 12 MB/sec read and 150 MB/sec write.

It is also important to mention that we have some indexes (including a clustered index on an identity field) that get fragmented fast. Those indexes need to be rebuilt now and then, which slows the whole system down, and during such periods we fall behind on the input flow. If we don't rebuild them, the queries simply slow down other systems/processes too much. These indexes (except the clustered one) must remain and are kept as minimal as possible. No foreign keys are used on this part of the database. So it is an OLTP system.
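As an illustration, a typical way to check that fragmentation and decide between a reorganize and a rebuild (table and index names are placeholders, not our real schema; note that ONLINE rebuilds are an Enterprise Edition feature, so on Standard Edition a rebuild is an offline operation):

    -- Check fragmentation of the indexes on the active insert table
    -- (dbo.FeedActive and IX_FeedActive_Lookup are placeholder names)
    SELECT i.name AS index_name,
           ps.avg_fragmentation_in_percent,
           ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.FeedActive'), NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i
      ON i.object_id = ps.object_id AND i.index_id = ps.index_id
    WHERE ps.page_count > 1000;   -- ignore tiny indexes

    -- Reorganize for moderate fragmentation, rebuild (offline on Standard Edition) when heavy
    ALTER INDEX IX_FeedActive_Lookup ON dbo.FeedActive REORGANIZE;
    -- ALTER INDEX IX_FeedActive_Lookup ON dbo.FeedActive REBUILD WITH (MAXDOP = 8);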

This continuously inserted data is divided across 4 tables in the troublesome database. Each week the tables are renamed; we keep 4-5 tables before the oldest one gets truncated and renamed to become the active table with the most recent entries.
Each of those 4 tables (and the renamed variants) is about 100 GB. Once renamed, only read actions (still a lot per second) are done on these "less recent" tables.
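To make the rotation concrete, a simplified sketch of what it looks like (the table names here are invented for illustration, not our real naming scheme):

    -- Retire the current table under an archive name, then reuse the oldest
    -- (truncated) table as the new active insert target.
    EXEC sp_rename 'dbo.FeedActive', 'FeedArchive_Week25';

    TRUNCATE TABLE dbo.FeedArchive_Week20;              -- oldest week, no longer needed
    EXEC sp_rename 'dbo.FeedArchive_Week20', 'FeedActive';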

Things we did (all active now; the numbers above are from this configuration):

  • We isolated the troublesome database (data file) to the RAID 5 array. Other user databases and all logs were (temporarily) moved to our unsafe non-RAID disk.
  • The system databases (except the tempdb data + log files) remained on RAID 5, but they don't seem to cause any I/O worth mentioning.
  • Delayed durability is set to FORCED to avoid waiting on transaction log flushes. Data loss is never good, but in our case it isn't the end of the world, so we preferred the speed gain.
  • MAXDOP is set to 16 to avoid parallelism-related latching.
  • A clustered index was created on an identity column and we suspect index rebuilds are faster now. Previously we worked with a heap table. We should test again without this index, but as you can imagine it takes a while to add a clustered index, so we prefer to test all other possibilities before dropping it again. (A sketch of these settings follows this list.)
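Roughly how those settings were applied (database, table and index names are placeholders):

    -- Trade full durability for faster commits on the troublesome database
    ALTER DATABASE FeedDb SET DELAYED_DURABILITY = FORCED;

    -- Instance-wide MAXDOP
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 16;
    RECONFIGURE;

    -- Clustered index on the identity column (the table used to be a heap)
    CREATE UNIQUE CLUSTERED INDEX CIX_FeedActive_Id ON dbo.FeedActive (Id);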

The main issue we see in Activity Monitor now is a big Buffer I/O fluctuation (2,000-17,000 ms/sec), but it is always above 1,200 ms/sec once we can't process the input fast enough.

We also monitored some disk performance counters. Disk response times are fine, except during burst periods (which are almost continuous now), when latencies go up to 45-60 ms.
Another performance counter, "Avg. Disk Queue Length", shows a dramatic average of 235 (min 0, max 27,583).
We also searched for heavy queries using the DMV query found in another post (https://stackoverflow.com/questions/48425245/what-can-cause-high-buffer-i-o-on-my-clients-server), but most activity comes from the inserts (most logical reads, most logical writes, and most physical reads too).
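For anyone who wants to cross-check such PerfMon latencies from inside SQL Server, the virtual file stats DMV gives the same picture broken down per database file:

    -- Average read/write latency per database file since the last restart
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms,
           vfs.num_of_reads,
           vfs.num_of_writes
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    ORDER BY avg_write_latency_ms DESC;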

After all this info, what can we do next?
This could also be general advice on how to handle such heavy data flows properly: good books, courses, …
Any advice is welcome, as we expect even greater data flows in the future.

Thanks in advance

Edit 2020-06-21 (additional info based on the comment from rois):

Here are the wait stats, although they will also reflect activity from before we applied some of the tweaks … no? We do see latches now and then, but since the MAXDOP change they don't have a big impact on the system as a whole (at least based on Activity Monitor, and of course my second screenshot shows latch waits right now … :)).
[Wait stats screenshot]
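For reference, wait stats like these can be pulled from sys.dm_os_wait_stats; a query roughly like the following (not necessarily the exact one behind the screenshot) gives the same kind of overview:

    -- Cumulative waits since the last restart or wait-stats clear,
    -- with a few of the usual benign wait types filtered out
    SELECT TOP (15)
           wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'XE_TIMER_EVENT',
                            'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'BROKER_TO_FLUSH')
    ORDER BY wait_time_ms DESC;
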
I also added a screenshot of the current I/O visible in Activity Monitor. Of course there is less input during the mornings, but there is still about 90 MB/s of writes to the actual data file. Regarding the tempdb comment: tempdb was moved to the non-RAID disk (also visible in the screenshot).
[Data File I/O screenshot]

Best Answer

I must say I'm not sure exactly what problem you are facing. I can only give some ideas about what could be wrong.

I speculated it could be tempdb, but in that case IO_COMPLETION waits would be much higher. Still, I find the I/O on tempdb and the database puzzling.

First, the CXPACKET waits. As you noted, you changed MAXDOP. If you haven't restarted the instance or reset the wait stats in the meantime, they are probably so high because MAXDOP was 0 before. So generally I would ignore them. (Though you could try setting MAXDOP to 8 and increasing the cost threshold for parallelism if you haven't.)
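If you want a clean baseline after the configuration changes, the wait stats can be cleared and the parallelism settings adjusted roughly like this (the cost threshold value of 50 is only a commonly suggested starting point, not something specific to your workload):

    -- Reset the cumulative wait stats so they reflect the current configuration
    DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

    -- Parallelism settings (assumes 'show advanced options' is already enabled)
    EXEC sp_configure 'max degree of parallelism', 8;
    EXEC sp_configure 'cost threshold for parallelism', 50;
    RECONFIGURE;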

I see quite a lot of waiting on schema stability locks (LCK_M_SCH_S). It seems there were some DDL changes which blocked other processes. You mentioned you converted the heap table to a clustered one, so my guess is those waits are the result of that conversion.

Then there is PAGEIOLATCH_EX, which points to the inserts you mentioned. In this case the first culprit would be the indexes. I would urge you to review them once more and check whether they are all really used and needed. Maybe some can be combined or turned into filtered indexes.
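A quick way to see whether an index is actually read, or is only being maintained on every insert (the table name is a placeholder):

    SELECT i.name AS index_name,
           us.user_seeks, us.user_scans, us.user_lookups,
           us.user_updates          -- maintenance cost paid on every insert
    FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS us
           ON us.object_id = i.object_id
          AND us.index_id = i.index_id
          AND us.database_id = DB_ID()
    WHERE i.object_id = OBJECT_ID('dbo.FeedActive')
    ORDER BY us.user_updates DESC;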

I would consider compressing tables/indexes. Compressed pages take less space in memory, so it might allow you to "gain" more memory. Use sp_estimate_data_compression_savings to check whether there are any savings to be had.
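For example (schema and table names are placeholders; data compression is available on Standard Edition from SQL Server 2016 SP1 onwards):

    -- Estimate the savings first
    EXEC sp_estimate_data_compression_savings
         @schema_name = 'dbo',
         @object_name = 'FeedActive',
         @index_id = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    -- If the estimate looks worthwhile:
    -- ALTER INDEX ALL ON dbo.FeedActive REBUILD WITH (DATA_COMPRESSION = PAGE);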

You don't mention any triggers. If you have any, it would be a good idea to double-check that they don't do anything funky.
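A quick way to rule them out is to list all DML triggers in the database:

    SELECT OBJECT_NAME(t.parent_id) AS table_name,
           t.name                   AS trigger_name,
           t.is_disabled
    FROM sys.triggers AS t
    WHERE t.parent_class_desc = 'OBJECT_OR_COLUMN';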

Or you simply hit the limits of your hardware.