Sql-server – Should I switch the database LOG volumes from IO1 to ST1

I've got a SQL 2014 database that's been running in AWS for a couple of years now, in a dual-server mirrored configuration, using separate EBS volumes for data, logs, etc. When the system was created, the performance characteristics of the various EBS volume seemed to indicate that IO1 was the safest option for the kind of latency and throughput we needed for the log volume (it's a pretty busy database and at the time we were willing to pay extra to make sure the transition from self-hosted when smoothly). Since then, AWS has added a new much cheaper volume type, ST1, that's backed by HDD instead of SSD (see AWS EBS volume types). We're also now interested in cutting costs if we can do so without a major impact on performance. Amazon specifically lists 'Log Processing' as one of the use cases for this type of volume. The performance characteristics of ST1 allow for better throughput (MB/sec) but fewer IOPS than IO1 or GP2 type volumes, and given my understanding of database logs, that would seem to be perfect, so I'd like to make a switch.

However, in checking the CloudWatch metrics for these volumes (we actually have two databases on two servers with each server as principal for one of the database and as mirror for the other database), the 'Write Bandwidth (KiB/s) is FAR below the 500 MiB/s limit (generally <3 MiB/s), but the 'Write Throughput (Ops/s)' metric frequently goes up to about 2000. The docs say that for ST1 the Max. IOPS/Volume is 500, assuming a 1Mib I/O size, but it's not clear whether the 500MiB/s limit comes from the IOPS limit or vice-versa. It's also not clear whether this Ops/s is the same as IOPS. Although logic would seem to indicate that they are, I have other volumes on those systems that I recently switched to ST1, and CloudWatch shows up to 10,000 Write Ops/s for those, so maybe not.

So, my questions are:

Would it be prudent to switch the log volumes to ST1?
If yes, what is the best way to go about doing this?

Taking the databases offline, even for a few minutes, to make the switch is NOT an option.

AWS may provide a 'seamless' 'online' conversion of these volumes (there is a warning that it may or may not be available for these volumes, possibly because they were created before that features was guaranteed to be available). I converted the EBS volumes we use for backups on the same servers from GP2 to ST1 this week and had no problems at all, but if those volumes had performance issues during the conversion process, it wouldn't have affected the system anyway, as they only get used at night when we do the database backups. There is no indication I can find about the performance impact of the conversion process itself, nor whether or not it can be cancelled should it cause issues.

The other two possible migration methods I can think of are:

Add new ST1 volumes to both the principal and mirror, add a new log file and set the log file parameters such that it will attempt to put all the log data into the new file. Then run log backups, delete the old log file, and then remove the old log volume.
Add a new ST1 volume to the mirror, take the mirror offline, move the log file from the old volume to the new one, then bring the mirror back up, failover, and repeat with the principal.

Any advice or insight into the inner workings of EBS or SQL 2014 would be greatly appreciated. For example, how exactly does EBS enforce performance limits on the various types of volumes? Will SQL 2014 automatically increase the size of units it writes to the log in response to an increase in log file latency? (If it automatically flushes to the log every time data is pending and the previous flush completes, for example, such a change might simply increase how much is written each time and have little impact on overall throughput, and I'm much more concerned with throughput than latency).

Best Answer

(First two paragraphs are repeats of my comments.)

When Amazon specifies "log processing" as a use case, I'd suspect they're talking about infrequently-flushed IIS logs and the like, not database logs. And although ST1 is specified as not suited for "small, random" workloads, my reading of http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#inefficiency would be that it's not suited for "small and/or random" workloads. But this, and your questions of how the limits are enforced/calculated, are probably something you need to get clarified from AWS.

"Will SQL 2014 automatically increase the size of units it writes to the log in response to an increase in log file latency?" I'm pretty sure it won't by default. But if you can afford some data loss risk, then delayed durability may be just what you need: https://sqlperformance.com/2014/04/io-subsystem/delayed-durability-in-sql-server-2014

If you do end up moving to ST1, AWS's online conversion option sounds far riskier than "Add a new ST1 volume to the mirror, take the mirror offline, move the log file from the old volume to the new one, then bring the mirror back up, failover, and repeat with the principal", which should work well for this. (Moving or dropping the primary log file with the database online is not an option.)

Best Answer

Related Solutions

Resolving AWS EBS I/O Bottleneck Issues

MongoDB – Sharded Cluster Reliability on AWS

Related Question