Linux – How to figure out what’s freezing up the machine

arch linux, btrfs, freeze, vmware

I am running Arch on this machine:

3.40GHz i7 hexacore (4930K)

16GB DDR3 1600MHz RAM

2xSamsung 840 EVO SSDs in Raid0 (using BTRFS raid)

When I run VMware on my Arch host with a few VMs (2 or 3), giving each of them about 2-4 cores and 2GB RAM, my system starts having random freezes. Every couple of minutes, the system freezes for anywhere from 10 to 30 seconds and then recovers, only to freeze again about 30 seconds later. This continues until I shut down the VMs. During a freeze, the mouse still moves fine, but applications on the host stop responding – VMware doesn't respond, Firefox (which is also open on the host) doesn't respond, etc.

When the freeze happens, if I have process monitor running, it does show several cores maxed out by vmware, but at the same time, there are other unused cores. I also have more than enough RAM – the VMs use a total of 6GB, and the host has 10GB left over. I have 0 swap space, so there's no way swapping is slowing anything down.

There are reports that virtual machines may run slowly on btrfs because it fragments files at the filesystem level. As far as I can tell, however, fragmentation is only a problem on traditional hard disks – SSDs don't have read heads that seek, so they don't care if a file is highly fragmented.

This never used to happen when I was running Debian 7, so I'm pretty sure it's not a hardware problem.

What tools can I run to figure out why my system keeps freezing up? I've tried top/htop, and iotop (nothing is writing or reading excessively when the system freezes up). There doesn't appear to be any kind of activity monitor for btrfs to tell if it's having problems keeping up with writing/reading anything. Is there anything else I can try?

Best Answer

From the btrfs gotchas page:

Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.

  • On servers and workstations this affects databases and virtual machine images.

    • The nodatacow mount option may be of use here, with associated gotchas.

    ...

  • Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs). You can use filefrag to locate heavily fragmented files (may not work correctly with compression).
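As a rough illustration of the filefrag suggestion above, here is a minimal Python sketch that walks a directory, runs `filefrag` on each regular file, and reports those at or above the 10000-extent mark quoted from the gotchas page. The function names and the threshold default are my own choices, not part of btrfs or e2fsprogs. It assumes `filefrag` is installed and prints its usual summary line of the form `name: N extents found`. As the quote notes, the counts may be misleading on compressed files.

```python
import re
import subprocess
from pathlib import Path

# Assumed filefrag summary format: "disk.vmdk: 12345 extents found"
# (or "... 1 extent found" for an unfragmented file).
_EXTENTS_RE = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def parse_extent_count(line):
    """Return (path, extent_count) from a filefrag summary line, or None."""
    match = _EXTENTS_RE.match(line.strip())
    if match is None:
        return None
    return match.group("path"), int(match.group("count"))


def fragmented_files(directory, threshold=10000):
    """Yield (path, extents) for regular files with >= threshold extents."""
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        # Run filefrag on one file at a time; files it cannot read simply
        # produce no parsable summary line and are skipped.
        result = subprocess.run(
            ["filefrag", str(path)], capture_output=True, text=True
        )
        for line in result.stdout.splitlines():
            parsed = parse_extent_count(line)
            if parsed is not None and parsed[1] >= threshold:
                yield parsed
```

To use it, iterate over `fragmented_files("/path/to/vm/images")` (a placeholder path) and print each result; VM disk images that show up with tens of thousands of extents would be consistent with the symptoms described above.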

I had problems similar to the ones you describe, with VirtualBox. The nodatacow option for btrfs did not help noticeably on my system. I also tried the auto-defragment option (mentioned as a possible solution for application databases in desktop environments), but it did not improve things enough to make the behaviour acceptable either.

In the end I shrank my btrfs partition and the Logical Volume it lives in, created a new LV, formatted it as ext4, and moved my VM disk images (VirtualBox) onto that "partition".
