There are oh so many reasons to have multiple swap areas (they don't need to be files), even if you only have a single spindle.
20-20 hindsight: You deployed a machine with a single swap area, then eventually realised it's not enough. You can't redeploy the machine at will, but you can make another swap area (probably a file) until redoing the partition layout becomes an option.
Resizing or moving swap areas: You can't resize swap areas (as mentioned by Evan Teitelman). And you can't just `swapoff`, make a new swap area and then `swapon` again unless you have enough RAM: `swapoff` wants to move all the swapped-out pages to RAM before letting go of the swap area. So you make a temporary swap area, `swapoff` the original, wait till all the pages have moved from the old swap area to the temporary one, resize the original swap partition, `mkswap` it, then `swapon` the resized one and `swapoff` the temporary one. The swapped pages are copied from the temporary swap area to the resized one, and you're done. If you're moving swap areas, you don't even need a temporary area: `mkswap` the new one, `swapon` it, then `swapoff` the old one, and everything's moved.
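The resize dance above can be sketched as a shell sequence. Everything here is an assumption to be adapted: `/dev/sdb2` stands in for your existing swap partition, `/swap.tmp` for the temporary file, and the sizes are placeholders; all of it must run as root.

```shell
# Create a temporary swap area big enough to hold the current swap contents.
dd if=/dev/zero of=/swap.tmp bs=1M count=2048   # 2 GiB; adjust to your swap usage
chmod 600 /swap.tmp
mkswap /swap.tmp
swapon /swap.tmp

# Release the original; its pages migrate to RAM and the temporary area.
swapoff /dev/sdb2

# ... resize /dev/sdb2 with your partitioning tool of choice ...

# Re-initialise and re-enable the resized partition, then drop the temporary one.
mkswap /dev/sdb2
swapon /dev/sdb2
swapoff /swap.tmp
rm /swap.tmp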
Crazy fast swapping: modern disks employ zone bit recording. The first zone of the disk is the fastest. You may want to measure the disk, and create a partition covering exactly the first, fastest zone of the drive. This may be smaller than your intended swap size. So you add multiple partitions on several disks, using the same technique.
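A rough way to find the fast zones is to time raw reads at different offsets. This is only a sketch under assumptions: `/dev/sdX` is a placeholder for the disk to probe, the offsets are examples, and it needs root (direct reads on the raw device).

```shell
# Probe throughput near the start and near the end of the disk.
# iflag=direct bypasses the page cache so the numbers reflect the platter.
DISK=/dev/sdX
for offset_gib in 0 400; do
    echo "Reading at ${offset_gib} GiB:"
    dd if="$DISK" of=/dev/null bs=1M count=256 \
       skip=$((offset_gib * 1024)) iflag=direct 2>&1 | tail -1
done
```

Where throughput drops off noticeably is where one zone ends; size your fast swap partition to stop there.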
Crazy fast swapping, the sequel: alternatively, once you know where your disks' fastest zones are, you can make high priority swap areas in the first zone, lower priority swap areas in the second zone, etc. This way your swapping system automatically knows to load balance across all fast disk zones, prefer the faster zones, and use the slower zones as an overflow area when the need arises.
Symmetric load balancing: on a nicely built system with many spindles (like a server), I like to have multiple swap partitions occupying the beginning of every disk (to take advantage of zone bit recording). They all have identical priorities, so the kernel will load-balance the swap. One spindle may give you 100 MB/s, but swap across all spindles could give you a multiple of that. (naïvely speaking)
Bottleneck-aware load balancing: in practice, however, there are other bottlenecks in place. So, for instance, a 16 disk server may have four 6 Gbps SATA ports, each with a four-port multiplier and four disks sharing the bandwidth. If you know about this, you can organise your swap spaces so Disk 1 on Ports 1–4 have the highest priority, the second disks on ports 1–4 have the second highest priority, etc. This will load balance swapping but not overwhelm the port multipliers.
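The priority schemes above are expressed with `swapon -p` or the `pri=` mount option. A hypothetical `/etc/fstab` fragment for the port-multiplier layout (device names are assumptions) might look like:

```shell
# /etc/fstab fragment: first disk on each port shares the top priority,
# second disks share the next, and so on. Equal priorities are
# load-balanced; lower priorities are used as overflow.
/dev/sda2  none  swap  sw,pri=3  0  0
/dev/sde2  none  swap  sw,pri=3  0  0
/dev/sdb2  none  swap  sw,pri=2  0  0
/dev/sdf2  none  swap  sw,pri=2  0  0
```

Higher numbers mean higher priority; the kernel stripes across areas of equal priority before touching lower ones.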
Swapping across devices with different performance: (as mentioned by Luke) if your system isn't a brand new server, and it's grown organically over the years, it may have block devices that are significantly faster than others. You'll want to swap to the fastest device first, then to the next fastest, etc.
Size considerations: (courtesy of David Kohen) maybe putting all your swap on one drive leaves a few gigs free on the drive (this sounds like a 2001 scenario, but there are plenty of old or embedded devices where this could be an issue). Split it across all drives, and on top of all the other benefits above, you get better disk space usage per drive. It's one thing to lose a couple of gigs per spindle, and another to lose 300 gigs from one disk.
Emergencies: you have exactly 96 hours to submit your PhD thesis, and your last experiment (the one that's likely to get you that Nobel prize as well as funky mixed-case letters after your name) is sucking memory at impressive rates. You're almost out of swap. You create a swap file with a priority less than the priority of your main swap device — the kernel will use it as overflow swap space. You could even install swapd to do this for you automatically, so you'll also have plenty of swap space for those huge `emacs` and LaTeX runs.
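Adding overflow swap in a hurry is a four-command job. The path and size below are examples, and it needs root; `dd` is used rather than `fallocate` because some filesystems reject sparse swap files.

```shell
# Create and enable an emergency overflow swap file.
dd if=/dev/zero of=/var/swap.emergency bs=1M count=4096   # 4 GiB
chmod 600 /var/swap.emergency
mkswap /var/swap.emergency
swapon -p 1 /var/swap.emergency   # lower priority than the main swap device
```

With the lower priority, the kernel only touches this file once the main swap area is full.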
Swapping across different media: Linux can't swap to character devices, but there are lots of different media, physical and virtual: SSDs (note: you probably don't want to swap on SSDs), dozens of shockingly different types of spinning hard disks, floppies (yes, you can swap on a floppy — you can always shoot yourself in the foot with Unix), DRBD volumes, iSCSI, LVM volumes, LUKS encrypted partitions, etc (including surreal, mind-boggling layered combinations of these — swap on LUKS on LVM on a parallel port ZIP drive over iSCSI over IEEE802.3ad aggregated Ethernet? No problem, you filthy pervert). These are niche scenarios, and are meant to support niche requirements.
Best Answer
Since kernel 2.6.28, Linux has used a split Least Recently Used (LRU) page replacement strategy. Pages with a filesystem source, such as program text or shared libraries, belong to the file cache. Pages without filesystem backing are called anonymous pages and consist of runtime data such as stack space reserved for applications. Pages belonging to the file cache are typically cheaper to evict from memory, as they can simply be read back from disk when needed. Since anonymous pages have no filesystem backing, they must remain in memory as long as a program needs them, unless there is swap space to store them in.
It is a common misconception that a swap partition would somehow slow down your system. Not having a swap partition does not mean that the kernel won't evict pages from memory; it just means that the kernel has fewer choices about which pages to evict. The amount of swap available does not affect how much of it is used.
Linux can cope with the absence of a swap space because, by default, the kernel memory accounting policy may overcommit memory. The downside is that when physical memory is exhausted, and the kernel cannot swap anonymous pages to disk, the out-of-memory-killer (OOM-killer) mechanism will start killing off memory-hogging "rogue" processes to free up memory for other processes.
The `vm.swappiness` option is a modifier that changes the balance between reclaiming pages from the file cache and swapping out anonymous pages. The file cache is given an arbitrary priority value of 200, from which the `vm.swappiness` modifier is deducted (`file_prio = 200 - vm.swappiness`). Anonymous pages, by default, start out with 60 (`anon_prio = vm.swappiness`). This means that, by default, the priority weights stand moderately in favour of anonymous pages (`anon_prio = 60`, `file_prio = 200 - 60 = 140`). The behaviour is defined in `mm/vmscan.c` in the kernel source tree.

Given a `vm.swappiness` of `100`, the priorities would be equal (`file_prio = 200 - 100 = 100`, `anon_prio = 100`). This would make sense for an I/O-heavy system where you don't want pages from the file cache evicted in favour of anonymous pages.

Conversely, setting `vm.swappiness` to `0` will prevent the kernel from evicting anonymous pages in favour of pages from the file cache. This might be useful if programs do most of their caching themselves, which is the case with some databases. On desktop systems this might improve interactivity, but the downside is that I/O performance will likely take a hit.

The default value has most likely been chosen as an approximate middle ground between these two extremes. As with any performance parameter, adjusting `vm.swappiness` should be based on benchmark data comparable to real workloads, not just a gut feeling.
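The priority arithmetic above can be reproduced with a trivial shell sketch. Note this only mirrors the formula described for `mm/vmscan.c`; it does not query the running kernel (the live value lives in `/proc/sys/vm/swappiness`).

```shell
# Priority weights as described above: the file cache starts at 200,
# and vm.swappiness sets the anonymous side directly.
swappiness=60                     # the usual default
anon_prio=$swappiness
file_prio=$((200 - swappiness))
echo "anon_prio=$anon_prio file_prio=$file_prio"
# → anon_prio=60 file_prio=140
```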