Well, since nobody wants to answer... :)
Searching Synaptic for "bench" turns up a lot of benchmarking suites capable of testing different aspects of a machine. The only one I had heard of before is phoronix-test-suite, which I'm sure is very comprehensive, although my short attention span didn't allow me to figure out how to use it.
Then I found UnixBench, which is described as
UnixBench is the original BYTE UNIX benchmark suite, updated and
revised by many people over the years.
The purpose of UnixBench is to provide a basic indicator of the
performance of a Unix-like system; ... These test results
are then compared to the scores from a baseline system to produce an
index value, which is generally easier to handle than the raw scores.
Multi-CPU systems are handled. ... The tests compare Unix systems by
comparing their results to a set of scores set by running the code on
a benchmark system, which is a SPARCstation 20-61 (rated at 10.0).
UnixBench is mentioned by Linode as a tool for VM performance testing in this blog post:
Using identical hardware, KVM Linodes are much faster compared to Xen.
For example, in our UnixBench testing a KVM Linode scored 3x better
than a Xen Linode.
The test suite is NOT in the Ubuntu repositories, but it is trivial to download and build:
wget https://github.com/kdlucas/byte-unixbench/archive/master.zip
unzip ./master.zip
cd ./byte-unixbench-master/UnixBench
./Run
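Note that ./Run builds the binaries the first time it is executed, so a C toolchain has to be present. On a minimal Ubuntu install something like the following should cover it (package names assumed, from the standard repositories):
sudo apt-get install build-essential perl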
The tests take a while to finish. The output looks like this:
------------------------------------------------------------------------
Benchmark Run: Mon Oct 15 2012 23:55:22 - 00:23:16
4 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 12015218.4 lps (10.0 s, 7 samples)
Double-Precision Whetstone 2214.8 MWIPS (10.1 s, 7 samples)
Execl Throughput 896.9 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 58968.3 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 14578.6 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 422068.2 KBps (30.0 s, 2 samples)
Pipe Throughput 70993.3 lps (10.0 s, 7 samples)
Pipe-based Context Switching 16001.5 lps (10.0 s, 7 samples)
Process Creation 1861.8 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 2525.5 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 737.8 lpm (60.1 s, 2 samples)
System Call Overhead 432496.2 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 12015218.4 1029.6
Double-Precision Whetstone 55.0 2214.8 402.7
Execl Throughput 43.0 896.9 208.6
File Copy 1024 bufsize 2000 maxblocks 3960.0 58968.3 148.9
File Copy 256 bufsize 500 maxblocks 1655.0 14578.6 88.1
File Copy 4096 bufsize 8000 maxblocks 5800.0 422068.2 727.7
Pipe Throughput 12440.0 70993.3 57.1
Pipe-based Context Switching 4000.0 16001.5 40.0
Process Creation 126.0 1861.8 147.8
Shell Scripts (1 concurrent) 42.4 2525.5 595.6
Shell Scripts (8 concurrent) 6.0 737.8 1229.7
System Call Overhead 15000.0 432496.2 288.3
========
System Benchmarks Index Score 249.7
------------------------------------------------------------------------
Benchmark Run: Tue Oct 16 2012 00:23:16 - 00:51:20
4 CPUs in system; running 4 parallel copies of tests
Dhrystone 2 using register variables 42619039.2 lps (10.0 s, 7 samples)
Double-Precision Whetstone 8274.0 MWIPS (10.4 s, 7 samples)
Execl Throughput 3398.5 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 68332.4 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 21462.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 718205.6 KBps (30.0 s, 2 samples)
Pipe Throughput 149713.5 lps (10.0 s, 7 samples)
Pipe-based Context Switching 61968.3 lps (10.0 s, 7 samples)
Process Creation 5321.7 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 5957.1 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 812.6 lpm (60.1 s, 2 samples)
System Call Overhead 1557391.5 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 42619039.2 3652.0
Double-Precision Whetstone 55.0 8274.0 1504.4
Execl Throughput 43.0 3398.5 790.4
File Copy 1024 bufsize 2000 maxblocks 3960.0 68332.4 172.6
File Copy 256 bufsize 500 maxblocks 1655.0 21462.9 129.7
File Copy 4096 bufsize 8000 maxblocks 5800.0 718205.6 1238.3
Pipe Throughput 12440.0 149713.5 120.3
Pipe-based Context Switching 4000.0 61968.3 154.9
Process Creation 126.0 5321.7 422.4
Shell Scripts (1 concurrent) 42.4 5957.1 1405.0
Shell Scripts (8 concurrent) 6.0 812.6 1354.3
System Call Overhead 15000.0 1557391.5 1038.3
========
System Benchmarks Index Score 592.5
This means that the VPS in question scores 249.7 for a single task and 592.5 for parallel processing.
My desktop machine, which has specs similar to or lower than the physical machine my VPS runs on, produced a score of 1409.7 for a single task and 5156.3 for parallel processing. Exactly the kind of metric I was looking for.
Another important metric is network speed. I found a script that downloads test files from several locations and measures the download speed. The script can be run with:
wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash
(although it would probably be safer to download the script and inspect its contents before running it)
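For example, a minimal safer variant of the same one-liner (just downloading and reading the script before executing it) would look something like:
wget freevps.us/downloads/bench.sh
less ./bench.sh     # read through it before trusting it
bash ./bench.sh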
To monitor disk I/O latency there is the ioping utility, which can be installed from the Ubuntu repositories:
# ioping . -c 10
4096 bytes from . (ext4 /dev/disk/...): request=1 time=16.4 ms
4096 bytes from . (ext4 /dev/disk/...): request=2 time=16.1 ms
...
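For reference, getting it installed and running the same test boils down to (assuming the stock Ubuntu package):
sudo apt-get install ioping
ioping . -c 10    # 10 requests against the current directory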
Best Answer
So, long story short: the only way an IOMMU will help you is if you start assigning hardware resources directly to the VM. Just having it doesn't make things faster.
It would help to know exactly which motherboard/CPU is advertising this feature. An IOMMU is a system-specific I/O mapping mechanism and can be used with most devices.
IOMMU sounds like a generic name for Intel VT-d and AMD-Vi. In that case I don't think you can multiplex devices; it's a lot like PCI passthrough before all these fancy virtualization instructions existed :). SR-IOV is different: the peripheral itself must carry the support. The hardware knows it's being virtualized and can delegate a hardware slice of itself to the VM. Many VMs can talk to an SR-IOV device concurrently with very low overhead.
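To make the SR-IOV part a bit more concrete: on a NIC that supports it, the virtual functions are typically created through sysfs and then show up as separate PCI devices that can be handed to guests. A rough sketch (the interface name enp1s0 and the VF count are hypothetical):
cat /sys/class/net/enp1s0/device/sriov_totalvfs                 # how many VFs the card can expose
echo 4 | sudo tee /sys/class/net/enp1s0/device/sriov_numvfs     # create 4 virtual functions
lspci | grep -i "virtual function"                              # each VF appears as its own PCI device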
The only thing faster than SR-IOV is PCI passthrough, though in that case only one VM can make use of the device; not even the host operating system can use it. PCI passthrough would be useful for, say, a VM running a demanding database that would benefit from being attached directly to a Fibre Channel SAN.
Getting closer to the hardware does have limitations, however: it makes your VMs less portable, for example in deployments that require live migration. This applies to both SR-IOV and PCI passthrough.
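If you want to verify that the IOMMU is actually active before attempting passthrough, a quick sanity check looks roughly like this (on Intel the intel_iommu=on kernel parameter usually has to be added; the PCI address below is hypothetical):
dmesg | grep -e DMAR -e IOMMU     # was an IOMMU detected at boot?
ls /sys/kernel/iommu_groups/      # non-empty only when the IOMMU is in use
qemu-system-x86_64 ... -device vfio-pci,host=01:00.0    # hand that device to a guest via VFIO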
Default virtualized Linux deployments usually use VirtIO, which is pretty fast to begin with.