How to identify all of the configured memory limits for a service started using systemd

configuration, memory, systemd

I am chasing an error trying to apply a new tune to postgres.

The exact error is:

2018-11-07 22:14:49 EST [7099]: [1-1] FATAL:  could not map anonymous shared memory: Cannot allocate memory
2018-11-07 22:14:49 EST [7099]: [2-1] HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 35301089280 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.

I am familiar with this error. Tuning various instances of postgres is a monthly task for the engineers I work with. The solutions are to either pull back our postgres tune or manage settings like shmall and ulimit.

In this case we are tuning a postgres installation that was created by someone else and has some cruft from a few years of runtime and upgrades. This installation started on a CentOS 5 install and is now on CentOS 7. The old SysV install on CentOS 5 applied several controls on memory limits including:

  • /etc/sysconfig/postgresql.d/ulimit.sh
  • /etc/sysconfig/postgresql.d/memory-cap
  • Extremely conservative settings for shmmax and shmall
  • Scripts from another vendor or sysadmin which intentionally force certain values by altering config files
  • /etc/sysctl.conf

Since the upgrade from CentOS 5 to CentOS 7, there now appear to be additional controls on memory limits which were applied when the service was converted from SysV init to systemd.

For example, systemctl cat postgresql.service shows:

# /usr/lib/systemd/system/postgresql.service
[Unit]
Description=PostgreSQL database server
After=network.target

[Service]
Type=forking
User=postgres
Group=postgres
Environment=PGPORT=5432
Environment=PGDATA=/opt/pgsql/data
OOMScoreAdjust=-1000
LimitSTACK=16384
ExecStart=/opt/pgsql/bin/pg_ctl start -D ${PGDATA} -s -o "-p ${PGPORT}" -w -l ${PGDATA}/serverlog
ExecStop=/opt/pgsql/bin/pg_ctl stop -D ${PGDATA} -s -m fast
ExecReload=/opt/pgsql/bin/pg_ctl reload -D ${PGDATA} -s
TimeoutSec=300

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/postgresql.service.d/memory-cap.conf
#
# THIS FILE IS AUTO-GENERATED by /opt/pgsql/bin/tune.sh
# DO NOT MODIFY, it will be overwritten on next postgres startup.
# If you need to make a change, then disable the tuner:
#
# ln -s /dev/null /etc/systemd/system/postgresql.service.d/tune.conf
#

[Service]
LimitAS=12884901888
# /etc/systemd/system/postgresql.service.d/tune.conf
# /usr/lib/systemd/system/postgresql.service.d/use-system-timezone.conf
# Disable automatically setting the timezone by masking this drop-in file:
# ln -s /dev/null /etc/systemd/system/postgresql.service.d/use-system-timezone.conf
# Then you need to:
# systemctl daemon-reload
[Service]
ExecStartPre=/opt/pgsql/bin/use-system-timezone.sh

Now coming around to my actual question: There are clearly several layers of kernel settings, per-user limits, and service configurations which can each impose limits on shmmax, shmall, ulimit, and related settings. How do I determine, either from configuration or at runtime, what limits a systemd service actually has applied when it is started?

If I can identify what the limits are at runtime, I can then start grepping config files and scripts to find where those are set. Once I find them, I can set the values as they need to be. I'm hoping there is a flag I can set to get systemd or my postgres process to log its effective settings when it starts as a service.

I am comfortable with what these values should be set to; there are just too many layers which might be forcing or overriding those values. I want to learn which configuration locations I need to touch.

My perception is that I can have situations where a systemd LimitFOO setting has a different value than sysctl -w kernel.shmfoo, which in turn differs from /etc/someconfig/serviceuser/limit.foo. I need to determine which limits are actually being applied so that I can correctly change them to tune the service I am running.

Best Answer

As you point out in your question, there are several limits in play:

  1. the System V IPC ones, such as shmall, shmmax, etc.
  2. the RLIMIT ones (which are often set and inspected by the ulimit command in the shell, so you might know them by that name.)
  3. the cgroup limits (particularly the memory cgroup, in your case), which is a new way to apply limits to groups of processes in modern kernels.

systemd manages the latter two, in particular using cgroups as the main mechanism for limiting and accounting. It does have some limited support for System V IPC, but not really for its limits.

Let's break down these three separate concepts and look into how to inspect and tune the limits on each, related to systemd.

System V IPC

systemd has some limited support for System V IPC (for example, cleaning up IPC objects when a service stops, running a service in its own IPC namespace, or mounting a private tmpfs (backed by shm) on /tmp for a single service), but for the most part it doesn't manage System V IPC limits and doesn't do any accounting on them.

System V IPC limits are therefore managed exclusively through sysctl, and you can inspect them with something like:

$ sysctl kernel.shmmax kernel.shmall kernel.shmmni
kernel.shmmax = 18446744073692774399
kernel.shmall = 18446744073692774399
kernel.shmmni = 4096

And tune them with sysctl -w.

systemd only gets involved in setting these limits insofar as it ships systemd-sysctl.service, which applies the values from /etc/sysctl.conf and /etc/sysctl.d/*.conf at boot. Other than that, it's all sysctl, which also reads these limits directly from the kernel.
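If you need to change these persistently, a drop-in under /etc/sysctl.d/ is the usual place, which systemd-sysctl will then apply on every boot. As a rough sketch (the file name and the values below are purely illustrative, not a recommendation for your tune):

$ sudo tee /etc/sysctl.d/99-postgres-tune.conf <<'EOF'
# illustrative values only -- use whatever your tune actually calls for
kernel.shmmax = 68719476736
kernel.shmall = 16777216
EOF
$ sudo sysctl --system      # re-apply all sysctl configuration files now
$ sysctl kernel.shmmax kernel.shmall   # verify the running values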

RLIMITs (ulimit)

These limits are set per-process and inherited by subprocesses (so typically they are the same through a process tree, but not necessarily.)

systemd allows setting those per service, so that the limits are set as configured when the service starts.

These are configured by directives such as LimitSTACK=, LimitAS=, etc., which you already mention in your question. You can see the full list of RLIMIT directives in the systemd.exec(5) man page, which also correlates them to the familiar ulimit commands.
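If you want to override one of these for a single service, the usual mechanism is a drop-in file, just like the ones already present in your setup. A minimal sketch (the drop-in file name and the LimitAS value here are only examples; in your case the existing memory-cap.conf is auto-generated, so you would first disable that tuner as its header describes):

$ sudo mkdir -p /etc/systemd/system/postgresql.service.d
$ sudo tee /etc/systemd/system/postgresql.service.d/limits-override.conf <<'EOF'
# hypothetical drop-in for illustration only
[Service]
LimitAS=infinity
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart postgresql.service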

You can inspect the current limits for a running unit by using the systemctl show command, which dumps the internal state of the unit from systemd.

For example:

$ systemctl show postgresql.service | grep ^Limit
LimitSTACK=16384
LimitSTACKSoft=16384
LimitAS=12884901888
LimitASSoft=12884901888
... (other RLIMITs omitted for terseness) ...

You can also inspect what the kernel thinks the limits are, by looking at /proc/$pid/limits (remember, these are per-process, so you need to look at individual PIDs.)

For example:

$ cat /proc/12345/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max stack size            16384                16384                bytes     
Max address space         12884901888          12884901888          bytes     
... (other RLIMITs omitted for terseness) ...
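If you don't want to look the PID up by hand, you can ask systemd for the unit's main PID and feed it straight to /proc. This assumes a reasonably recent systemctl that supports --value and a unit that has a main PID (a forking unit like this one should):

$ cat /proc/$(systemctl show -p MainPID --value postgresql.service)/limits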

cgroups (memory cgroup)

Finally, cgroups are the main mechanism by which systemd manages services, providing limits and accounting.

There are many cgroups available and supported by systemd (like CPU, Memory, IO, Tasks, etc.), but for this discussion, let's focus on the memory cgroup (since these are the limits involved in your issue, and we looked at the corresponding memory limits for SysV IPC and RLIMITs too.)

Same as with the RLIMITs, you can also use systemctl show to look at the memory accounting provided by systemd by using cgroups:

$ systemctl show postgresql.service | grep ^Memory
MemoryCurrent=631328768
MemoryAccounting=yes
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
MemoryDenyWriteExecute=yes

You'll see that memory accounting is enabled (MemoryAccounting=yes) but none of the limits are set (they are all infinity.) The list of limits may vary depending on your version of systemd and kernel; this output is from systemd 239 on kernel 4.20-rc0, which has "low", "high", "max", "limit" and a separate limit specifically for swap.
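For completeness, if you ever wanted to impose (or lift) one of these cgroup limits yourself, you could do it with a drop-in as above, or at runtime with systemctl set-property. The 32G value here is purely illustrative, and note that older systemd/cgroup-v1 setups use MemoryLimit= rather than MemoryMax=:

$ sudo systemctl set-property postgresql.service MemoryMax=32G
$ # or, only for the current boot, without writing a persistent drop-in:
$ sudo systemctl set-property --runtime postgresql.service MemoryMax=32G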

One more point you may find interesting is that you'll be able to tell how much memory the service is using, through the MemoryCurrent= value. That value is taken from the kernel's cgroup accounting; it is a fresh measurement of the memory currently used by that service.
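If you just want that single number (for scripting or monitoring, say), you can query the property directly:

$ systemctl show -p MemoryCurrent --value postgresql.service
631328768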

You can also see that information when you use systemctl status on the service:

$ systemctl status postgresql.service
● postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
 Main PID: 12345 (postgresql)
    Tasks: 10 (limit: 4321)
   Memory: 602M
   CGroup: /system.slice/postgresql.service
           └─12345 /usr/lib/postgresql/postgresql

As you can see, systemd is reporting memory usage (Memory: 602M), which comes from the cgroup information. You can also see that Tasks accounting is enabled (through the corresponding cgroup) and that the service is currently using 10 tasks out of a limit of 4321.
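The same kind of property query works for the tasks accounting, should you need it:

$ systemctl show -p TasksCurrent,TasksMax postgresql.service
TasksCurrent=10
TasksMax=4321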

The status output also includes information about the underlying cgroup, named after the service (every service runs in its own cgroup), which you can then use to inspect the cgroup limits and accounting information directly from the kernel.

For example:

$ cd /sys/fs/cgroup/memory/system.slice/postgresql.service/
$ cat memory.limit_in_bytes 
9223372036854771712
$ cat memory.usage_in_bytes 
631328768

(The number 9223372036854771712 is 2^63 - 4096, which in this case represents infinity within a 64-bit counter.)

You can look at the kernel documentation for the memory cgroup for more details on these limits and counters. There are two versions of cgroup in the kernel (cgroup-v1 and cgroup-v2), so you might find some significant differences in your system if it's using cgroup-v2 instead. systemd supports both (and a hybrid model where both are used), so querying the limits and counters using systemctl should give you a consistent view regardless of what version of cgroups is enabled on the kernel.