What could be using 6GB of swap?

resources swap

I have a mystery: what is using 6GB of my swap? My kernel version is 4.15.9-300.fc27.x86_64.

This happened following some crashes. dmesg shows I had a segfault in a gnome-shell process (which belonged to gdm) and later some firefox processes (Chrome_~dThread, in libxul.so). coredumpctl -r shows no other crashes on my current boot.

1. free and df -t tmpfs

# free -h
              total        used        free      shared  buff/cache   available
Mem:           7.7G        1.2G        290M        5.4G        6.1G        761M
Swap:          7.8G        6.0G        1.8G

# swapoff -a
swapoff: /dev/dm-1: swapoff failed: Cannot allocate memory

# df -h -t tmpfs
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G   17M  3.9G   1% /dev/shm
tmpfs           3.9G  1.9M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           3.9G   40K  3.9G   1% /tmp
tmpfs           786M   20K  786M   1% /run/user/1000

I also manually checked the mount namespace of every process for any extra tmpfs. There was no other mounted tmpfs (or they were the same ones, so only 17M; there were fewer than 10 different mount namespaces).

2. ipcs

# ipcs --human

------ Message Queues --------
key        msqid      owner      perms      size         messages    

------ Shared Memory Segments --------
key        shmid      owner      perms      size       nattch     status      
0x00000000 20643840   alan-sysop 600          512K     2          dest         
0x00000000 22970369   alan-sysop 600           36K     2          dest         
0x00000000 20774914   alan-sysop 600          512K     2          dest         
0x00000000 20905987   alan-sysop 600          3.7M     2          dest         
0x00000000 23461892   alan-sysop 600            2M     2          dest         
0x00000000 20873221   alan-sysop 600          3.7M     2          dest         
0x00000000 22511622   alan-sysop 600            2M     2          dest         
0x00000000 28278791   alan-sysop 600           60K     2          dest         
0x00000000 23003144   alan-sysop 600           36K     2          dest         
0x00000000 27394057   alan-sysop 600           60K     2          dest         
0x00000000 29622282   alan-sysop 600          156K     2          dest         
0x00000000 27426828   alan-sysop 600           60K     2          dest         
0x00000000 28246029   alan-sysop 600           60K     2          dest         
0x00000000 29655054   alan-sysop 600          156K     2          dest         
0x00000000 29687823   alan-sysop 600          512K     2          dest         

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x002fa327 98304      root       600        2
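
The ipcs listing shows segment sizes but not how much of each segment sits in swap. A minimal sketch of one way to get totals, assuming the kernel exposes rss and swap columns in /proc/sysvipc/shm (the column names are an assumption; a missing column is counted as 0). The helper name shm_totals is made up for illustration:

```python
def shm_totals(table_text):
    """Sum the size, rss and swap columns (in bytes) of /proc/sysvipc/shm text."""
    totals = {"size": 0, "rss": 0, "swap": 0}
    lines = table_text.splitlines()
    if not lines:
        return totals
    header = lines[0].split()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip malformed rows
        row = dict(zip(header, fields))
        for name in totals:
            # Assumption: columns are named exactly "size", "rss" and "swap";
            # a missing column is counted as 0.
            totals[name] += int(row.get(name, 0))
    return totals


def main(path="/proc/sysvipc/shm"):
    try:
        with open(path) as f:
            totals = shm_totals(f.read())
    except OSError as err:
        print(f"cannot read {path}: {err}")
        return
    for name, value in totals.items():
        print(f"{name}: {value / 2**20:.1f} MiB")


if __name__ == "__main__":
    main()
```

With the segments listed above (about 13.5M in total), this would at most confirm that SysV shared memory is far too small to explain 6GB.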

3. Process memory

The per-process swap usage script says process memory only accounts for 54MB of swap:

PID=1 swapped 2292 KB (systemd)
PID=605 swapped 4564 KB (systemd-udevd)
PID=791 swapped 324 KB (auditd)
PID=793 swapped 148 KB (audispd)
PID=797 swapped 232 KB (sedispatch)
PID=816 swapped 120 KB (mcelog)
PID=824 swapped 1544 KB (ModemManager)
PID=826 swapped 152 KB (rngd)
PID=827 swapped 300 KB (avahi-daemon)
PID=829 swapped 1688 KB (abrtd)
PID=830 swapped 836 KB (systemd-logind)
PID=831 swapped 432 KB (dbus-daemon)
PID=843 swapped 368 KB (chronyd)
PID=848 swapped 312 KB (avahi-daemon)
PID=854 swapped 476 KB (gssproxy)
PID=871 swapped 1140 KB (abrt-dump-journ)
PID=872 swapped 1280 KB (abrt-dump-journ)
PID=873 swapped 1236 KB (abrt-dump-journ)
PID=874 swapped 14196 KB (firewalld)
PID=911 swapped 592 KB (mbim-proxy)
PID=926 swapped 1356 KB (NetworkManager)
PID=943 swapped 17936 KB (libvirtd)
PID=953 swapped 200 KB (atd)
PID=955 swapped 560 KB (crond)
PID=1267 swapped 284 KB (dnsmasq)
PID=1268 swapped 316 KB (dnsmasq)
PID=10397 swapped 160 KB (gpg-agent)
PID=14862 swapped 552 KB (systemd-journal)
PID=18131 swapped 28 KB (login)
PID=18145 swapped 384 KB (bash)
Overall swap used: 54008 KB
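
For reference, a rough sketch of how such a per-process report can be produced. This is not the script used above; it assumes the VmSwap field of /proc/PID/status is an acceptable source, and the function names are made up for illustration:

```python
import glob
import re


def vm_swap_kb(status_text):
    """Return the VmSwap value in kB from /proc/PID/status text, or 0 if absent."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def process_name(status_text):
    """Return the Name field from /proc/PID/status text."""
    match = re.search(r"^Name:\s+(.*)$", status_text, re.MULTILINE)
    return match.group(1) if match else "?"


def main():
    total = 0
    paths = glob.glob("/proc/[0-9]*/status")
    for path in sorted(paths, key=lambda p: int(p.split("/")[2])):
        try:
            with open(path) as f:
                text = f.read()
        except OSError:
            continue  # process exited while we were scanning
        kb = vm_swap_kb(text)
        if kb:
            pid = path.split("/")[2]
            print(f"PID={pid} swapped {kb} KB ({process_name(text)})")
            total += kb
    print(f"Overall swap used: {total} KB")


if __name__ == "__main__":
    main()
```

Note that proc(5) documents VmSwap as covering anonymous private pages only, with shmem swap usage not included. A total like 54MB therefore says nothing about swapped-out tmpfs, memfd or other shmem-backed pages.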

  1. So far I am assuming that there is no negligent program which used umount -l on a full tmpfs. I haven't tried to scrape /proc/*/fd for anyone holding such a hidden tmpfs open.

  2. I suppose I am also assuming no-one has constructed a giant memfd and is holding it open… haha why would I even suspect such a thing… sob.

The memfd names attached to processes seem innocent to me:

# ls -l /proc/*/fd/* 2>/dev/null|grep /memfd:
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/20889/fd/37 -> /memfd:xshmfence (deleted)
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/20889/fd/53 -> /memfd:xshmfence (deleted)
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/20889/fd/54 -> /memfd:xshmfence (deleted)
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/20889/fd/55 -> /memfd:xshmfence (deleted)
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/20889/fd/57 -> /memfd:xshmfence (deleted)
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/20889/fd/60 -> /memfd:xshmfence (deleted)
lrwx------. 1 alan-sysop alan-sysop 64 Mar 18 22:52 /proc/21004/fd/6 -> /memfd:pulseaudio (deleted)

These memfds seem innocent because process 20889 is my current Xorg, which post-dates the 6GB of swap. Similarly, process 21004 is my pulseaudio process, and it was started after the 6GB of swap had built up.

In theory the ones I'm worried about could also be in limbo though, attached to a unix socket message and never read.


EDIT1

After stopping systemd-logind – which native Xorg responds to by dying – and restarting Xorg, I see the entire 6GB of swap wiped out.

Note I forgot I needed to start logind again. Although Lennart told me logind is not supposed to be bus-activated, logind immediately restarted. This is from journalctl -b, i.e. the system log, with no messages removed in between:

Mar 18 23:14:12 alan-laptop systemd[1]: Stopped Login Service.
Mar 18 23:14:12 alan-laptop dbus-daemon[831]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1
Mar 18 23:14:12 alan-laptop systemd[1]: Starting Login Service...

This is relevant in that logind then went through a cycle of a few crashes. This is expected for this version of logind (PRs to fix it have been merged upstream, following my issue reports).

So this doesn't quite isolate an individual cause, and I really should have checked the fds logind was holding before killing it.

Question

Is there any possible swap user I have missed in the above checks? (The non-destructive ones, prior to EDIT1).

Is there a better way to get usage reports for any of the possible users I listed above? That is, either an alternative that corrects some inaccuracy I haven't noticed, or something that is easier to run and gives a quick result when this happens again?

Does anyone have a nice script to check for fds holding open a "hidden" tmpfs (a tmpfs which was detached with umount -l)?

Does anyone have a nice way to check memory usage of memfds?

Is there any way to check for massive memfds having been left in limbo in an unread unix socket message? (Did any of these geniuses think about this at all when implementing memfds, which were explicitly intended for passing over unix sockets?)

EDIT2: Am I right to guess that a file descriptor of a graphics device (DRM) can hold a reference to swappable memory? Note logind holds such file descriptors.

Best Answer

EDIT1 After stopping systemd-logind - which native Xorg responds to by dying - and restarting Xorg, I see the entire 6GB of swap wiped out.

After the second time, I can confirm that this is a bug in systemd-logind. logind remembers to close the copy of the DRM fd which it holds, but it fails to close the copy which is held in PID1 (used to support "seamless" restart of logind):

$ sudo lsof /dev/dri/card0 | grep systemd
[sudo] password for alan-sysop: 
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
systemd      1       root   16u   CHR  226,0      0t0 14690 /dev/dri/card0
systemd      1       root   87u   CHR  226,0      0t0 14690 /dev/dri/card0
systemd      1       root  101u   CHR  226,0      0t0 14690 /dev/dri/card0
systemd      1       root  106u   CHR  226,0      0t0 14690 /dev/dri/card0
systemd      1       root  110u   CHR  226,0      0t0 14690 /dev/dri/card0
systemd-l  860       root   21u   CHR  226,0      0t0 14690 /dev/dri/card0
systemd-l  860       root   25u   CHR  226,0      0t0 14690 /dev/dri/card0

This feels very much like a known bug, which should already be fixed in v238 of systemd.
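
If lsof is not handy, the same check can be approximated by reading the fd symlinks under /proc directly. A minimal sketch, assuming it is run as root so that every process's fd directory is readable; count_fds_to is a made-up helper name, and the proc_root parameter exists only to make it testable:

```python
import collections
import glob
import os


def count_fds_to(target, proc_root="/proc"):
    """Count, per PID, the file descriptors whose symlink points at target."""
    counts = collections.Counter()
    for link in glob.glob(os.path.join(proc_root, "[0-9]*", "fd", "*")):
        try:
            if os.readlink(link) == target:
                # Path layout is <proc_root>/<pid>/fd/<n>.
                pid = int(link.split(os.sep)[-3])
                counts[pid] += 1
        except OSError:
            continue  # fd closed, process gone, or permission denied
    return counts


if __name__ == "__main__":
    for pid, n in sorted(count_fds_to("/dev/dri/card0").items()):
        print(f"PID {pid}: {n} fd(s) open on /dev/dri/card0")
```

This only counts descriptors. It cannot tell whether two descriptors share one open file description or how much buffer memory is attached to each, so a growing count for PID 1 across login sessions is a hint rather than proof.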


Indeed, logind seems to be leaking a DRM fd this way every time I log in and out of GNOME. Presumably this bug only becomes obvious when you have display servers shut down uncleanly, so they don't get a chance to deallocate the buffers attached to their DRM fd.

EDIT2: Am I right to guess that a file descriptor of a graphics device (DRM) can hold a reference to swappable memory? Note logind holds such file descriptors.

Answer: yes.

filp

SHMEM file node used as backing storage for swappable buffer objects.

-- https://www.kernel.org/doc/html/v4.15/gpu/drm-mm.html

As I understand it, "SHMEM file node" here is something that does the exact same job as a tmpfs file / memfd. The above quote is regarding a "GEM buffer object"...

The mmap system call can't be used directly to map GEM objects, as they don't have their own file handle. Two alternative methods currently co-exist to map GEM objects to userspace... The second method uses the mmap system call on the DRM file handle.

-- https://01.org/linuxgraphics/gfx-docs/drm/drm-memory-management.html#id-1.3.4.6.6.8

CONCLUSION: someone should really double-check the current code in logind as it relates to the closing of file handles :).


Appendix: how you might try to rule out memfds

Does anyone have a nice way to check memory usage of memfds?

The memory usage of memfds can be read using stat --dereference or du -D on the magic symlink in /proc/$PID: either under fd/$FD for a file descriptor, or (which you forgot) under map_files/... for memory-mapped objects.

I don't have a really convenient tool for this, but you can at least search for the most massive individual FDs or mapped files. (The example below is not additional evidence; it was taken after the 6GB of swap usage went away).

$ sudo du -aLh /proc/*/map_files/ /proc/*/fd/ | sort -h | tail -n 10
du: cannot access '/proc/self/fd/3': No such file or directory
du: cannot access '/proc/thread-self/fd/3': No such file or directory
108M    /proc/10397/map_files/7f1e141b4000-7f1e1ad84000
111M    /proc/14862/map_files/
112M    /proc/10397/map_files/
113M    /proc/18324/map_files/7efdda2fb000-7efddaafb000
121M    /proc/18324/map_files/7efdea2fb000-7efdeaafb000
129M    /proc/18324/map_files/7efdc82fb000-7efdc8afb000
129M    /proc/18324/map_files/7efdd42fb000-7efdd4afb000
129M    /proc/18324/map_files/7efde52fb000-7efde5afb000
221M    /proc/26350/map_files/
3.9G    /proc/18324/map_files/

$ ps -x -q 18324
  PID TTY      STAT   TIME COMMAND
18324 pts/1    S+     0:00 journalctl -b -f

$ ps -x -q 26350
  PID TTY      STAT   TIME COMMAND
26350 ?        Sl     4:35 /usr/lib64/firefox/firefox

$ sudo ls -l /proc/18324/map_files/7efde52fb000-7efde5afb000
lr--------. 1 root root 64 Mar 19 00:32 /proc/18324/map_files/7efde52fb000-7efde5afb000 -> /var/log/journal/f211872a957d411a9315fd911006ef03/user-1001@c3f024d4b01f4531b9b69e0876e42af8-00000000002e2acf-00055bbea4d9059d.journal
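
The du listing above mixes memfds with ordinary mapped files such as the journal. A rough sketch that restricts the search to memfds, assuming a memfd can be recognised by a link target starting with "/memfd:" (as in the ls output in the question) and that st_blocks on the magic symlink reflects the same allocation du reports. The function names are made up for illustration, and root is assumed:

```python
import glob
import os


def is_memfd(link_target):
    """Assumption: memfd link targets look like '/memfd:NAME (deleted)'."""
    return link_target.startswith("/memfd:")


def memfd_usage(proc_root="/proc"):
    """Yield (allocated_bytes, path, target) once per distinct memfd inode."""
    seen = set()
    for sub in ("fd", "map_files"):
        for link in glob.glob(os.path.join(proc_root, "[0-9]*", sub, "*")):
            try:
                target = os.readlink(link)
                if not is_memfd(target):
                    continue
                st = os.stat(link)  # follows the magic symlink to the memfd
            except OSError:
                continue  # fd closed, process gone, or permission denied
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # same memfd reached through another fd or mapping
            seen.add(key)
            yield st.st_blocks * 512, link, target


if __name__ == "__main__":
    for size, link, target in sorted(memfd_usage())[-10:]:
        print(f"{size / 2**20:8.1f} MiB  {link} -> {target}")
```

Like the du approach, this cannot see a memfd that is referenced only from an unread unix socket message, since no process has it in its fd table or mappings.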