Linux – Why do `time` and `strace -c` differ

linuxstracesyscallstime

To my understanding, time would record the total amount of time spend in syscalls. Then I would expect the cumulative totals for sys time as reported by time and strace -fc to be the same. But they are wildly different (13.5 vs 0.005). What is happening?

# time php index.php >/dev/null

real  0m16.292s
user  0m2.728s
sys   0m13.548s

# strace -fc php index.php >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.72    0.000365           0     54845           munmap
 10.09    0.000044           0     36705           mmap
  6.19    0.000027           0     18249           madvise
  0.00    0.000000           0       289           read
  0.00    0.000000           0       119           write
  0.00    0.000000           0       118         3 open
  0.00    0.000000           0       118           close
  0.00    0.000000           0        23           stat

[ cut 0 duration syscalls for brevity ]

100.00    0.000436                110951        82 total

(I've rerun these tests for about 50 times and they both give consistent results)

Best Answer

A process can spend any amount of time not doing any system call.

For instance a shell doing a while :; do :; done will spend an infinite amount of time not doing any system call and not spending any sys CPU time, only user CPU time.

strace -c tries to count the sys CPU time by each system call. A nanosleep(1000000000) will spend close to 0ms CPU time but 1s wall clock time.

$ bash -c 'time strace -fc sleep 1'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  [...]
  0.00    0.000000           0         1           nanosleep
  [...]
100.00    0.000000                    33         3 total

real    0m1.006s
user    0m0.001s
sys     0m0.003s

Comparing the sys time with the time reported by strace would make more sense but note that not all sys CPU time accounted to the process is spent in those syscalls. Accessing mmaped data for instance can make the process spend a lot of system time without doing any system call.

Here, those numerous mmaps() calls could be to allocate memory. Those are instantaneous, as they just update some mappings. But the first time you write to those, that's where some system time is needed to back those writes with actual memory.

Or they could be to map object files like shared libraries (that's also a likely possibility as the number is not far from the number of open()s). Then again, the mmap() is quick, but reading the memory later does also mean some page faults and actual time to read the data off disk, which is not accounted to any system call.

More fundamentally, if you do time strace -f your-application as in an earlier version of your question, time will time both the command and strace.

strace adds a lot of overhead. It does itself a few system calls for each system call of the straced application.

Doing

strace -fc time your-application

instead of

time strace -fc your-application

is more likely to give you a better match.

What I find though is that for the wait*() system calls that processes make to wait their children, strace counts the time reported by those wait*() system calls as system time, which means that the time of the children processes (at least the ones that are waited for) is counted several times. That matters in strace -f time cmd as time does run cmd in a child process and waits for it.

$ strace -c time head -c 100M /dev/urandom > /dev/null
0.00user 0.76system 0:00.76elapsed 99%CPU (0avgtext+0avgdata 1796maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.762288      762288         1           wait4
  0.00    0.000000           0         1           read
  0.00    0.000000           0       112           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         6           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         4           rt_sigaction
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.762288                   142         3 total

time and strace report the same sys time (as returned by the wait4() system call), but with -f:

$ strace -fc time head -c 100M /dev/urandom > /dev/null
strace: Process 2298 attached
0.01user 1.33system 0:01.91elapsed 69%CPU (0avgtext+0avgdata 1920maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.60    1.331335     1331335         1           wait4
 39.43    0.961497          75     12804           read
  5.94    0.144825           6     25711           write
  0.01    0.000148          11        13         6 open
  0.00    0.000104           8        13           mmap
  0.00    0.000094          19         5         3 execve
  0.00    0.000063           8         8           mprotect
  0.00    0.000050           6         9           close
  0.00    0.000041           7         6         6 access
  0.00    0.000037           5         7           fstat
  0.00    0.000031          16         2           munmap
  0.00    0.000030           8         4           brk
  0.00    0.000007           4         2           arch_prctl
  0.00    0.000006           6         1         1 ioctl
  0.00    0.000000           0         4           rt_sigaction
  0.00    0.000000           0         1           clone
------ ----------- ----------- --------- --------- ----------------
100.00    2.438268                 38591        16 total

The 1.33 is the time reported by the one wait4() system call that time does. That reports the sys time of head (time's child).

However, strace also tries to get the system time for every syscall of its head grandchild on top of that, which means that is counted twice (not exactly). That stracing of the grand child does still incur some overhead accounted to the straced process as we get 1.33 as opposed to 0.76 earlier. That is considerably reduced if I force strace, time and head to run on the same processor (with taskset 1).

Related Question