Well, since nobody wants to answer... :)
Searching Synaptic for "bench" turns up a lot of benchmarking suites capable of testing different aspects of a machine. The only one I had heard of before is phoronix-test-suite, which I'm sure is very comprehensive, although my short attention span didn't allow me to figure out how to use it.
Then I found UnixBench, which is described as
UnixBench is the original BYTE UNIX benchmark suite, updated and
revised by many people over the years.
The purpose of UnixBench is to provide a basic indicator of the
performance of a Unix-like system; ... These test results
are then compared to the scores from a baseline system to produce an
index value, which is generally easier to handle than the raw scores.
Multi-CPU systems are handled. ... The tests compare Unix systems by
comparing their results to a set of scores set by running the code on
a benchmark system, which is a SPARCstation 20-61 (rated at 10.0).
UnixBench is mentioned by Linode as a tool for VM performance testing in this blog post:
Using identical hardware, KVM Linodes are much faster compared to Xen.
For example, in our UnixBench testing a KVM Linode scored 3x better
than a Xen Linode.
The test suite is NOT in the Ubuntu repositories, but it is trivial to download and build:
wget https://github.com/kdlucas/byte-unixbench/archive/master.zip
unzip ./master.zip
cd ./byte-unixbench-master/UnixBench
./Run
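Note that ./Run builds the binaries the first time it is executed, so a C toolchain has to be present. On a minimal Ubuntu install something like the following should cover it (package names assumed, from the standard repositories):
sudo apt-get install build-essential perl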
The tests take a while to finish. The output looks like this:
------------------------------------------------------------------------
Benchmark Run: Mon Oct 15 2012 23:55:22 - 00:23:16
4 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 12015218.4 lps (10.0 s, 7 samples)
Double-Precision Whetstone 2214.8 MWIPS (10.1 s, 7 samples)
Execl Throughput 896.9 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 58968.3 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 14578.6 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 422068.2 KBps (30.0 s, 2 samples)
Pipe Throughput 70993.3 lps (10.0 s, 7 samples)
Pipe-based Context Switching 16001.5 lps (10.0 s, 7 samples)
Process Creation 1861.8 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 2525.5 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 737.8 lpm (60.1 s, 2 samples)
System Call Overhead 432496.2 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 12015218.4 1029.6
Double-Precision Whetstone 55.0 2214.8 402.7
Execl Throughput 43.0 896.9 208.6
File Copy 1024 bufsize 2000 maxblocks 3960.0 58968.3 148.9
File Copy 256 bufsize 500 maxblocks 1655.0 14578.6 88.1
File Copy 4096 bufsize 8000 maxblocks 5800.0 422068.2 727.7
Pipe Throughput 12440.0 70993.3 57.1
Pipe-based Context Switching 4000.0 16001.5 40.0
Process Creation 126.0 1861.8 147.8
Shell Scripts (1 concurrent) 42.4 2525.5 595.6
Shell Scripts (8 concurrent) 6.0 737.8 1229.7
System Call Overhead 15000.0 432496.2 288.3
========
System Benchmarks Index Score 249.7
------------------------------------------------------------------------
Benchmark Run: Tue Oct 16 2012 00:23:16 - 00:51:20
4 CPUs in system; running 4 parallel copies of tests
Dhrystone 2 using register variables 42619039.2 lps (10.0 s, 7 samples)
Double-Precision Whetstone 8274.0 MWIPS (10.4 s, 7 samples)
Execl Throughput 3398.5 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 68332.4 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 21462.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 718205.6 KBps (30.0 s, 2 samples)
Pipe Throughput 149713.5 lps (10.0 s, 7 samples)
Pipe-based Context Switching 61968.3 lps (10.0 s, 7 samples)
Process Creation 5321.7 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 5957.1 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 812.6 lpm (60.1 s, 2 samples)
System Call Overhead 1557391.5 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 42619039.2 3652.0
Double-Precision Whetstone 55.0 8274.0 1504.4
Execl Throughput 43.0 3398.5 790.4
File Copy 1024 bufsize 2000 maxblocks 3960.0 68332.4 172.6
File Copy 256 bufsize 500 maxblocks 1655.0 21462.9 129.7
File Copy 4096 bufsize 8000 maxblocks 5800.0 718205.6 1238.3
Pipe Throughput 12440.0 149713.5 120.3
Pipe-based Context Switching 4000.0 61968.3 154.9
Process Creation 126.0 5321.7 422.4
Shell Scripts (1 concurrent) 42.4 5957.1 1405.0
Shell Scripts (8 concurrent) 6.0 812.6 1354.3
System Call Overhead 15000.0 1557391.5 1038.3
========
System Benchmarks Index Score 592.5
This means that the VPS in question scores 249.7 for a single task and 592.5 for parallel processing.
My desktop machine, which has specs similar to or lower than the physical machine my VPS runs on, produced a score of 1409.7 for a single task and 5156.3 for parallel processing. Exactly the kind of metric I was looking for.
Another important metric is network speed. I found a script that downloads test files from several locations and measures the download speed. The script can be run with:
wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash
(although it would probably be safer to download the script and inspect its contents before running it)
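For example, a minimal safer variant of the same one-liner (just downloading and reading the script before executing it) would look something like:
wget freevps.us/downloads/bench.sh
less ./bench.sh     # read through it before trusting it
bash ./bench.sh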
To monitor disk I/O latency there is the ioping utility, which can be installed from the Ubuntu repositories:
# ioping . -c 10
4096 bytes from . (ext4 /dev/disk/...): request=1 time=16.4 ms
4096 bytes from . (ext4 /dev/disk/...): request=2 time=16.1 ms
...
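For reference, getting it installed and running the same test boils down to (assuming the stock Ubuntu package):
sudo apt-get install ioping
ioping . -c 10    # 10 requests against the current directory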
Best Answer
So, long story short: the only way an IOMMU will help you is if you start assigning hardware resources directly to the VM. Just having it doesn't make things faster.
It would help to know exactly which motherboard/CPU is advertising this feature. An IOMMU is a system-specific I/O mapping mechanism and can be used with most devices.
IOMMU sounds like a generic name for Intel VT-d and AMD-Vi. In that case I don't think you can multiplex devices; it's a lot like PCI passthrough before all these fancy virtualization instructions existed :). SR-IOV is different: the peripheral itself must carry the support. The hardware knows it's being virtualized and can delegate a hardware slice of itself to the VM. Many VMs can talk to an SR-IOV device concurrently with very low overhead.
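To make the SR-IOV part a bit more concrete: on a NIC that supports it, the virtual functions are typically created through sysfs and then show up as separate PCI devices that can be handed to guests. A rough sketch (the interface name enp1s0 and the VF count are hypothetical):
cat /sys/class/net/enp1s0/device/sriov_totalvfs                 # how many VFs the card can expose
echo 4 | sudo tee /sys/class/net/enp1s0/device/sriov_numvfs     # create 4 virtual functions
lspci | grep -i "virtual function"                              # each VF appears as its own PCI device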
The only thing faster than SR-IOV is PCI passthrough, though in that case only one VM can make use of the device; not even the host operating system can use it. PCI passthrough would be useful for, say, a VM running a demanding database that would benefit from being attached directly to a Fibre Channel SAN.
Getting closer to the hardware does have limitations, however: it makes your VMs less portable, for example in deployments that require live migration. This applies to both SR-IOV and PCI passthrough.
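If you want to verify that the IOMMU is actually active before attempting passthrough, a quick sanity check looks roughly like this (on Intel the intel_iommu=on kernel parameter usually has to be added; the PCI address below is hypothetical):
dmesg | grep -e DMAR -e IOMMU     # was an IOMMU detected at boot?
ls /sys/kernel/iommu_groups/      # non-empty only when the IOMMU is in use
qemu-system-x86_64 ... -device vfio-pci,host=01:00.0    # hand that device to a guest via VFIO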
Default virtualized Linux deployments usually use VirtIO, which is pretty fast to begin with.