Strace Performance – Why Doesn’t Strace Show Process Waiting?

performancestracetime

The mighty strace has let me down. How is this possible?


time foo shows that foo takes several seconds to run ("real"), but uses negligible cpu time, both in userspace ("user") and in the kernel ("sys"). For the curious, foo is defined below.

So it spends most of its time waiting for something else, not executing CPU instructions. Normally, I can see how it is waiting in strace – i.e. what system call is blocking for a long period of time. Unfortunately this approach didn't work.

strace -ttt -T -C -w foo shows system calls, timestamped, and a summary of the (real) time spent in system calls. But this particular process showed as spending negligible overall (real) time inside system calls.


foo is actually journalctl -b -u dev-hugepages.mount. Except that I had to change the last argument to a different systemd unit each time in order to reproduce this. In other words, the delay I am investigating happened the first time that I try to get the logs for any one systemd unit. EDIT: after answering the main question, I also realized the reason I was having this problem reproducing the delay.

The time spent by this process is a specific issue, apparently it does not occur on all systems. https://github.com/systemd/systemd/issues/7963

Best Answer

The usual reason for hitting this issue, is that the process is blocking in page faults. These are reads or possibly writes to files performed through a memory mapping aka mmap(). You may have noticed some mmap() in the trace of system calls.

If you had used the /usr/bin/time program instead of the time shell builtin, you might also have noticed:

0.04user 0.10system 0:02.29elapsed 6%CPU (0avgtext+0avgdata 40464maxresident)k
73632inputs+0outputs (376major+1081minor)pagefaults 0swaps

major pagefaults are the ones that require filesystem IO. minor pagefaults are much less significant (probably only a "TLB miss").

I suspect inputs are the total number of pages read. Currently, I think file mapped pages are always the same size. 4096 bytes in most cases, but you can check getconf PAGESIZE.

So this represents ~290 megabytes, read at something over 100 megabytes per second, a standard speed for a hard disk like mine. Mystery solved!


Note also, you are assuming that you have a whole free CPU for this process. Otherwise, the process could simply be blocked waiting for other processes to yield the CPU.

strace only shows when the process enters (and then leaves) the kernel due to a system call. Or when a unix signal is delivered. However there are other types of interrupts which strace does not show at all. So these include

  • Page faults.
  • The timer interrupt. This is used to switch to a different process, when the current one has exhausted its allocated time slice on the CPU.
Related Question