Linux – Why would ps *very* occasionally fail to find a valid process

linux procfs ps

I've run into an odd issue in which a ps -o args -p <pid> command very occasionally fails to find the process, even though it is definitely running on the server. The processes are long-running wrapper scripts used to launch some Java apps.

The "in the wild" occurrences always seem to happen early in the morning, when the servers are quite heavily loaded, so there is some evidence that the issue is related to disk load. However, by running the same ps command in a tight loop, I can eventually reproduce the problem: roughly once every few hundred runs I get an error.

By running the following bash script, I've managed to generate strace output for both a failed and a successful run:

while [ $? == 0 ] ; do strace -o fail.out ps -o args -p <pid> >/dev/null ; done ; strace -o good.out ps -o args -p <pid>

Comparing fail.out and good.out, I can see that the getdents system call in the failed run returns far fewer entries than in the successful run (673 entries versus 1,205 in the traces below):

grep getdents good.out
  getdents(5, /* 1174 entries */, 32768)  = 32760
  getdents(5, /* 31 entries */, 32768)    = 992
  getdents(5, /* 0 entries */, 32768)     = 0

grep getdents fail.out
  getdents(5, /* 673 entries */, 32768)   = 16728
  getdents(5, /* 0 entries */, 32768)     = 0

… and that shorter list doesn't include the PID I'm looking for, so the process is not found.
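One way to narrow this down is to take ps out of the picture and list /proc directly in the same kind of tight loop. The following is only a minimal sketch, not something from the original investigation: the function names pid_in_listing and count_misses are made up for illustration, and it assumes that ls reads the /proc directory in much the same way ps does.

```shell
# pid_in_listing PID: succeed if PID shows up in a fresh directory listing of /proc.
pid_in_listing() {
    ls /proc | grep -x -- "$1" >/dev/null
}

# count_misses PID RUNS: list /proc RUNS times and print how many listings
# lacked PID even though /proc/PID itself still existed at that moment.
count_misses() {
    local pid runs misses i
    pid=$1
    runs=$2
    misses=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        if ! pid_in_listing "$pid" && [ -d "/proc/$pid" ]; then
            misses=$((misses + 1))
        fi
        i=$((i + 1))
    done
    echo "$misses"
}
```

Running something like count_misses <pid> 10000 on an affected server should print a non-zero number if the short listings come from the kernel side rather than from ps itself.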

You can ignore this section: the ENOTTY errors are explained by dave_thompson's comment below and are unrelated.

Additionally, the failed run gets some ENOTTY errors that don't appear in the successful run. Near the beginning of the output I see

ioctl(1, TIOCGWINSZ, 0x7fffe19db310) = -1 ENOTTY (Inappropriate ioctl for device)
ioctl(1, TCGETS, 0x7fffe19db280) = -1 ENOTTY (Inappropriate ioctl for device)

And at the end I see a single

ioctl(1, TCGETS, 0x7fffe19db0d0) = -1 ENOTTY (Inappropriate ioctl for device)

The failed ioctl at the end happens right before ps returns, but after ps has already printed an empty result set, so I'm not sure whether they're related. I do know that these errors appear in all of the failed strace outputs I have and in none of the successful ones.

I have absolutely no idea why getdents would occasionally fail to return the full list of processes. I've now reached the point where I'm going to slap a band-aid on the whole thing by changing the control script that checks the wrapper script so that it calls ps a second time if the first call fails. Still, I'd be interested to know if anyone has any idea what's going on here.

The system is running kernel 4.16.13-1.el7.elrepo.x86_64 on CentOS 7, with procps-ng version 3.3.10-17.el7_5.2.x86_64.
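For reference, the band-aid could look roughly like the sketch below. The retry function is a hypothetical helper, not part of the actual control script; it simply reruns a command until it succeeds or the attempts run out, and relies on ps exiting non-zero when it finds no matching process.

```shell
# retry TRIES CMD...: run CMD until it succeeds, at most TRIES times.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    local tries
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
    done
    return 1
}

# Example (with a real PID in place of $pid):
#   retry 2 ps -o args= -p "$pid"
```

Since a miss happens only once every few hundred runs, two attempts make a false negative very unlikely, but this hides the symptom rather than explaining it.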

Best Answer

Consider reading the information you need directly from the /proc filesystem rather than through a tool such as ps. The information you're looking for ("args") is in the file /proc/$pid/cmdline, with the arguments separated by NUL bytes instead of spaces.

You can use this sed one-liner to get the args of process $pid:

sed -e 's/\x00\?$/\n/' -e 's/\x00/ /g' "/proc/$pid/cmdline"

This command is equivalent to:

ps -o args= -p "$pid"

(Using args= in ps will omit the header.)

The sed command first replaces the trailing NUL byte (if any) with a newline, then replaces all remaining NUL bytes (which separate the individual arguments) with spaces, producing the same format you're seeing from ps.
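If the control script needs this in more than one place, the same idea can be wrapped in a small function. This is a sketch under a few assumptions: proc_args is a made-up name, it uses tr instead of sed, and it reports a missing process through its exit status rather than an error message.

```shell
# proc_args PID: print PID's command line on one line, read straight from /proc.
# Returns non-zero if /proc/PID/cmdline cannot be read (process gone).
proc_args() {
    local file args
    file="/proc/$1/cmdline"
    [ -r "$file" ] || return 1
    # Arguments are NUL-terminated: turn every NUL into a space first, so the
    # shell variable never has to hold a NUL byte.
    args=$(tr '\0' ' ' < "$file") || return 1
    # Drop the single trailing space left behind by the final NUL.
    printf '%s\n' "${args% }"
}
```

One difference from ps to keep in mind: for kernel threads the cmdline file is empty, so this prints a blank line where ps would show the thread name in brackets.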


Regarding listing processes on the system: ps does it by listing the directories in /proc, but that procedure has inherent race conditions. Processes start and exit while ps is running, so what you get is not really a snapshot but an approximation. In particular, ps may show processes that have already terminated by the time it prints its results, or omit processes that started while it was running but weren't returned by the kernel when it listed the contents of /proc.

I always assumed that a process that exists before ps starts and still exists after it finishes would never be missed. In other words, I assumed the kernel would guarantee that such processes are always included, even with a lot of churn from other processes being created and destroyed. What you're describing implies that's not the case. I'm still skeptical, but given the known race conditions in how ps works, it's at least plausible that listing PIDs from /proc might miss an existing one.

It would be possible to verify that by checking the Linux kernel source, but I haven't done that (yet), so I can't say for sure whether there is a race condition that would miss a long-running process, as you describe.


The other part is the way ps works. Even if you pass it a single PID with the -p argument, it still lists all the existing PIDs. It could take a shortcut in that case: skip listing the entries in /proc and go directly to /proc/$pid.

I can't say why it was implemented this way. Perhaps it's because most ps options are "filters" on the process list, so implementing -p the same way was easier, while a shortcut straight to /proc/$pid might need a separate code path or duplicated code. Another hypothesis is that some combinations of -p with other options would still require the listing, so it may be complex to determine exactly which cases allow the shortcut and which don't.


Which takes us to the workaround: go straight to /proc/$pid without listing the full set of PIDs on the system. That avoids all the known races and gets the information you need straight from the source.

It's a bit ugly, but if the issue you describe does exist, this should be a reliable way to fetch that information.
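For the control script's "is the wrapper still running?" check, the direct lookup could be sketched as below. The function name wrapper_running and the idea of matching on the wrapper script's path are assumptions for illustration; the actual control script isn't shown in the question.

```shell
# wrapper_running PID NAME: succeed if /proc/PID/cmdline is readable and one of
# its arguments is exactly NAME (for example, the wrapper script's path).
wrapper_running() {
    local file
    file="/proc/$1/cmdline"
    [ -r "$file" ] || return 1
    # Put one argument per line, then look for a fixed-string, whole-line match.
    tr '\0' '\n' < "$file" | grep -Fx -- "$2" >/dev/null
}
```

Matching on an argument rather than only testing that /proc/$pid exists guards against the PID having been reused by an unrelated process.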
