SQL Server – Using Multiple PVSCSI with VMware

sql-server, virtualisation, vmware

Regarding SQL Server virtualization, I've been trying to find information on whether there is a positive performance impact from separating data devices and log devices onto different Paravirtual SCSI (PVSCSI) adapters, similar to what is done here.

There was a scenario at a client where an additional PVSCSI adapter was added and the log devices were moved to it, showing considerable performance gains. Yet the doubt remains whether the gain was due to this separation or simply due to the fact that an additional PVSCSI adapter was now present.

As is known, log disks are typically written sequentially, while data disks follow a more random read/write pattern, and there are performance benefits to placing these two different kinds of files on separate disks.

But what about the controllers? Is there also a benefit to keeping these different patterns on separate PVSCSI controllers?

Does anyone have any insight on this?

Thanks in advance

Best Answer

I'll answer in two parts: first, why the traditional answer about separating sequential and random IO often doesn't apply.

Then I'll discuss the potential benefits of separating files at the Windows physicaldisk level, and of adding additional vHBAs and distributing the physicaldisks among them.

Expecting benefit from separating random and sequential disk IO at the Windows physicaldisk level typically assumes HDD devices for the data storage. It also typically assumes that separate Windows physicaldisks mean separate HDD devices. The idea is that one set of HDDs handles mainly sequential disk IO with very limited disk head movement (eg the HDDs hosting a single busy txlog*) while a separate set of HDDs handles random disk IO.

Those assumptions rarely hold today - especially in a VM. First of all, unless the VM's Windows physicaldisks are RDMs, multiple of them could be in a single datastore - or maybe multiple datastores are on a single ESXi host LUN. So what is separated in the guest can be commingled at the ESXi host level.

But let's say that RDMs are used, or that each guest physicaldisk is on its own datastore, on its own ESXi LUN. Even then, IO that is separated into sequential and random in the guest is often commingled at the array, because the LUNs presented to the ESXi host may be carved from the same single pool of disk devices. Almost every storage array does this now - either exclusively or as an option to ease management and increase array efficiency/resource utilization.

Finally, so much storage today is either all flash or hybrid flash+HDD. With no head movement to worry about, flash doesn't care about separation of sequential from random… doesn't even care about IO weaving.

So… those are all the reasons separating sequential from random may not be all that beneficial. Next, why spreading files across physicaldisks and spreading physicaldisks across vHBAs can still boost performance anyway.

*I purposefully mentioned a single transaction log in this HDD example. When several separate sequential disk IO streams (eg 8 busy transaction logs) are taking place on the same HDDs - unless somehow almost all of the activity is within SAN cache - constant head movement among the sequential IO tracks leads to IO weaving. That's a specific kind of disk head thrashing which leads to disk latency that is "worse than random". Happens on RAID5 and RAID10, although RAID10 can tolerate just a little bit more variation in this regard than RAID5 before significant degradation.


Now - given that longwinded talk about how separating sequential from random might not help - how can spreading files across physicaldisks still help? How can spreading physicaldisks among vHBAs help?

It's all about disk IO queues.

Any Windows physicaldisk or LogicalDisk can have up to 255 outstanding disk IOs at a time in what perfmon reports as "Current Disk Queue Length". From the outstanding disk IOs in the physicaldisk queue, storport can pass up to 254 to the minidriver. But the minidriver may also have both a service queue (passed down to the next lower level) and a wait queue. And storport can be told to lower the number it passes on from 254.

In a VMware Windows guest, the pvscsi driver has a default "device" queue depth of 64, where the device is a physicaldisk. So although perfmon could show up to 255 disk IOs in "current disk queue length" for a single physicaldisk, only up to 64 of them would be passed to the next level at a time (unless defaults are changed).
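To make the queue layering concrete, here's a tiny Python sketch (purely illustrative - nothing here is driver code) of how outstanding IOs against one physicaldisk split between the pvscsi device service queue and the wait queue at the default device queue depth of 64.

```python
# Illustrative only: models the split of outstanding IOs for one physicaldisk
# between the pvscsi device service queue and its wait queue.
PERFMON_LIMIT = 255        # max "Current Disk Queue Length" per physicaldisk
PVSCSI_DEVICE_QD = 64      # default pvscsi device queue depth

def device_queue_split(outstanding_io: int, device_qd: int = PVSCSI_DEVICE_QD):
    """Return (in_service_queue, in_wait_queue) for one physicaldisk."""
    outstanding_io = min(outstanding_io, PERFMON_LIMIT)
    in_service = min(outstanding_io, device_qd)
    in_wait = outstanding_io - in_service
    return in_service, in_wait

print(device_queue_split(40))    # (40, 0)  - everything is passed down
print(device_queue_split(150))   # (64, 86) - 86 IOs wait at the device level
```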

How many disk IOs can be outstanding to one busy transaction log at a time? Well, transaction log writes can be up to 60kb in size. During a high scale ETL, I'll often see every write to the txlog at 60kb. The txlog writer can have up to 32 writes of 60kb outstanding to one txlog at a time. So what if I've got a busy staging txlog and a busy dw txlog on the same physicaldisk, with default VMware settings? If both txlogs are maxing out at 32 outstanding 60kb writes each, that physicaldisk is at its queue depth of 64. Now… what if there are also flatfiles as an ETL source on the physicaldisk? Well… between reads to the flatfiles and txlog writes, some IOs would have to use the wait queue, because only 64 can get out at a time. For databases with busy txlogs like that, whether physical server or virtual, I recommend the txlog on its own physicaldisk, with nothing else on the physicaldisk. That prevents queueing at that level and also eliminates any concern with the contents of multiple files interleaving (which is a much, much lesser concern these days).
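Quick back-of-the-envelope arithmetic for that two-txlog scenario, using the numbers above (an illustration, not a measurement):

```python
# Two busy txlogs sharing one physicaldisk, default pvscsi settings
writes_per_txlog = 32      # max outstanding txlog writes per transaction log
write_size_kb = 60         # txlog writes observed at 60kb during heavy ETL
txlogs_on_disk = 2
device_qd = 64             # default pvscsi device queue depth

outstanding_writes = writes_per_txlog * txlogs_on_disk
print(outstanding_writes, "outstanding txlog writes vs a device queue depth of", device_qd)
print(outstanding_writes * write_size_kb, "KB of txlog writes in flight")
# 64 vs 64: the device queue is already full, so any flatfile reads against the
# same physicaldisk land in the wait queue
```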

How many disk IOs can be outstanding to a rowfile at a time (from SQL Server's perspective, not necessarily submitted to lower levels)? There isn't really a limit in SQL Server itself (that I've found, anyway). But assuming the file is on a single Windows physicaldisk (I do not recommend using striped dynamic disks for SQL Server, that's a topic for another time), there is a limit. It's the 255 I mentioned before.

With the magic of SQL Server readahead and asynchronous IO, I've seen 4 concurrent queries, each running in serial, drive a total "current disk queue length" of over 1200! Because of the 255 limit, that isn't even possible with all rowfile contents on a single physicaldisk. It was against a primary filegroup with 8 files, each on its own physicaldisk.
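Putting rough numbers on why that burst needs multiple physicaldisks (my arithmetic, using the limits already mentioned):

```python
import math

# Readahead burst of ~1200 outstanding IOs against the primary filegroup
total_oio = 1200
per_disk_limit = 255       # max "Current Disk Queue Length" per physicaldisk

print(math.ceil(total_oio / per_disk_limit))   # 5 - impossible on one physicaldisk

# With the filegroup spread over 8 files on 8 physicaldisks:
files = 8
print(total_oio // files)                      # 150 outstanding per physicaldisk, under the 255 cap
# (still above the default pvscsi device queue depth of 64, so some IOs sit in
#  per-device wait queues unless the defaults are raised)
```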

So readahead reads can be very aggressive, and can stress IO queues. They can be so aggressive that other rowfile reads and writes end up waiting. If transaction logs are on the same physicaldisk as rowfiles, during simultaneous readahead reads and txlog writes it's very easy for waiting to take place. Even if that waiting isn't at the "current disk queue length" level, it may be waiting at the device queue (64 by default with pvscsi).

Backup reads against rowfiles can also be aggressive, especially if buffercount has been tuned in order to maximize backup throughput.

There's one more SQL Server IO type to be aware of when considering isolating txlogs: query spill to tempdb. When query spill takes place, each spilling worker writes to tempdb. Got a lot of parallel workers all spilling at the same time? That can be quite a write load. Keeping a busy txlog and important rowfiles away from that can be really helpful :-)

Now, it is possible to change the default device queue depth for the pvscsi driver. It defaults to 64, and can be set as high as 254, which is the most storport will pass on. But be careful changing this. I always recommend aligning the guest device queue depth with the underlying ESXi host LUN queue depth, and setting the ESXi host LUN queue depth per array best practices. Using an EMC VNX? Host LUN queue depth should be 32. Guest uses RDMs? Great. Set guest pvscsi device queue depth to 32 so it's aligned with the ESXi host LUN queue depth. EMC VMAX? Typically 64 at the ESXi host level, 64 in guest. Pure/XtremIO/IBM FlashSystem? Sometimes host LUN queue depth will be set as high as 256! Go ahead and set pvscsi device queue depth to 254 (the max possible) then.
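If it helps, here's that alignment rule as a tiny Python sketch. The array numbers are just the examples from this answer, so treat them as placeholders and follow your array vendor's current best practices:

```python
# Example host LUN queue depths from the paragraph above - check your array's docs.
HOST_LUN_QD_EXAMPLES = {
    "EMC VNX": 32,
    "EMC VMAX": 64,
    "Pure/XtremIO/IBM FlashSystem": 256,
}

STORPORT_MAX = 254   # storport won't pass more than this per device

def guest_pvscsi_device_qd(host_lun_qd: int) -> int:
    """Align the guest pvscsi device queue depth with the ESXi host LUN queue depth."""
    return min(host_lun_qd, STORPORT_MAX)

for array, host_qd in HOST_LUN_QD_EXAMPLES.items():
    print(f"{array}: host LUN QD {host_qd} -> guest device QD {guest_pvscsi_device_qd(host_qd)}")
# EMC VNX: 32 -> 32, EMC VMAX: 64 -> 64, all-flash at 256 -> capped at 254
```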

Here's a link with instructions. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053145

The link also talks about requestringpages - what are those? They determine the queue depth for the pvscsi adapter itself. Each page gives 32 slots of adapter queue depth. By default, requestringpages is 8, for an adapter queue depth of 256. It can be set as high as 32, for 1024 adapter queue depth slots.
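If you want to script the change, the KB boils down to a single registry value in the guest. Here's a minimal Python sketch using the standard winreg module - this is my recollection of the KB, so verify the exact key and value there before using it, run it elevated inside the guest, and reboot afterwards:

```python
# Hedged sketch: set the pvscsi ring pages / device queue depth value described
# in the linked VMware KB. Verify the key and value string against the KB first.
# Requires Windows, an elevated session, and a guest reboot to take effect.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device"
# 32 ring pages -> 1024 adapter queue slots; 254 is the max pvscsi device queue depth
DRIVER_PARAM = "RequestRingPages=32,MaxQueueDepth=254"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "DriverParameter", 0, winreg.REG_SZ, DRIVER_PARAM)

print("pvscsi DriverParameter set; reboot the guest for it to take effect")
```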

Let's say everything is at default. I've got 8 physicaldisks with rowfiles on them, and SQL Server is lightly busy. There's an average "current disk queue length" of 32 across the 8, and none are higher than 64 (everything fits in the various device service queues). Great - that gives 256 OIO. It fits in the device service queues, it fits in the adapter service queue, so all 256 make it out of the guest to the queues at the ESXi host level.

But… what if things get a little busier, say an average of 64 with some physicaldisks' queues as high as 128? For those devices with more than 64 outstanding, the overage is in a wait queue. And if more than 256 are in the devices' service queues across the 8 physicaldisks, the overage there is in a wait queue until slots in the adapter service queue open up.
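Here's that scenario as a rough model (again, just illustrative Python), with the default device queue depth of 64 and adapter queue depth of 256:

```python
# Rough model: 8 physicaldisks sharing one pvscsi adapter, all defaults
DEVICE_QD = 64        # default pvscsi device queue depth
ADAPTER_QD = 256      # 8 request ring pages x 32 slots

def queue_picture(outstanding_per_disk):
    """Return (IOs waiting at the device level, IOs waiting for adapter slots)."""
    in_device_service = [min(oio, DEVICE_QD) for oio in outstanding_per_disk]
    device_wait = sum(oio - svc for oio, svc in zip(outstanding_per_disk, in_device_service))
    wants_adapter = sum(in_device_service)
    adapter_wait = max(wants_adapter - ADAPTER_QD, 0)
    return device_wait, adapter_wait

# Lightly busy: average of 32 per physicaldisk - nothing waits, all 256 reach the host
print(queue_picture([32] * 8))                            # (0, 0)

# Busier: average of 64, two physicaldisks spiking to 128
print(queue_picture([128, 128, 64, 64, 32, 32, 32, 32]))  # (128, 128)
# 128 IOs wait at the device level and another 128 wait for adapter slots to open up
```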

In that case, adding another pvscsi vHBA and spreading the physicaldisks between them doubles the total adapter queue depth to 512. More IO can be passed from guest to host at the same time.

Something similar could be achieved by staying at one pvscsi adapter and increasing requestringpages. Going to 16 would yield 512 slots, and 32 yields 1024 slots.

When possible, I recommend going wide (adding adapters) before going deep (increasing adapter queue depth). But… on many of the busiest systems, gotta do both: put 4 vHBAs on the guest, and increase requestringpages to 32.
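The wide-vs-deep capacity math in one little helper (my own framing of the numbers above):

```python
SLOTS_PER_RING_PAGE = 32

def adapter_queue_slots(num_pvscsi_adapters: int, request_ring_pages: int = 8) -> int:
    """Total adapter-level queue slots available to the guest."""
    return num_pvscsi_adapters * request_ring_pages * SLOTS_PER_RING_PAGE

print(adapter_queue_slots(1))        # 256  - one adapter, defaults
print(adapter_queue_slots(2))        # 512  - go wide: add a second vHBA
print(adapter_queue_slots(1, 16))    # 512  - go deep: more ring pages instead
print(adapter_queue_slots(4, 32))    # 4096 - busiest systems: wide *and* deep
```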

There are lots of other considerations, too. Things like SIOC and adaptive queue depth throttling if VMDKs are used, configuration of multipathing, configuration of the ESXi adapter beyond LUN queue depth, etc.

But I don't want to overstay my welcome :-)

Lonny Niederstadt @sqL_handLe