Performance – Why Does `perf stat` Show 0 Context Switches?

I ran a shell pipeline under perf stat, using taskset 0x1 to pin the whole pipeline to a single CPU. I know taskset 0x1 had an effect, because it more than doubled the throughput of the pipeline. However, perf stat shows 0 context switches between the different processes of the pipeline.

So what exactly does perf stat mean by context switches?

I think I was interested in the number of context switches to/from the individual tasks in the pipeline. Is there a better way to measure that?

This was in the context of comparing dd bs=1M </dev/zero, to dd bs=1M </dev/zero | dd bs=1M >/dev/null. If I can measure context switches as desired, I assume that it would be useful in quantifying why the first version is several times more "efficient" than the second.

$ rpm -q perf
perf-4.15.0-300.fc27.x86_64
$ uname -r
4.15.17-300.fc27.x86_64

$ perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C18366+0 records in
18366+0 records out
19258146816 bytes (19 GB, 18 GiB) copied, 5.0566 s, 3.8 GB/s

 Performance counter stats for 'taskset 0x1 sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':

       5059.273255      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               414      page-faults:u             #    0.082 K/sec                  
        36,915,934      cycles:u                  #    0.007 GHz                    
         9,511,905      instructions:u            #    0.26  insn per cycle         
         2,480,746      branches:u                #    0.490 M/sec                  
           188,295      branch-misses:u           #    7.59% of all branches        

       5.061473119 seconds time elapsed

$ perf stat sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C6637+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.04907 s, 1.7 GB/s
6636+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.0492 s, 1.7 GB/s
sh: Interrupt

 Performance counter stats for 'sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':

       3560.269345      task-clock:u (msec)       #    0.878 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               355      page-faults:u             #    0.100 K/sec                  
        32,302,387      cycles:u                  #    0.009 GHz                    
         4,823,855      instructions:u            #    0.15  insn per cycle         
         1,167,126      branches:u                #    0.328 M/sec                  
            88,982      branch-misses:u           #    7.62% of all branches        

       4.052844128 seconds time elapsed

Best Answer

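perf was silently failing to count context switches because you were not root. The :u suffix on every event name in your output (for example context-switches:u) means that only user-space activity was counted, and context switches are counted in the kernel.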
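Whether an unprivileged user can count kernel-side events such as context switches is controlled by the kernel.perf_event_paranoid sysctl. The following is a minimal sketch for checking it. describe_paranoid is a hypothetical helper name, and the two-way split is a simplification of the levels in the upstream kernel documentation (distribution kernels may add stricter levels).

```shell
#!/bin/sh
# describe_paranoid LEVEL: hypothetical helper that says what an
# unprivileged "perf stat" of your own processes is allowed to count.
describe_paranoid() {
    if [ "$1" -ge 2 ]; then
        # Kernel profiling is disallowed, so perf falls back to :u events.
        echo "user-only"
    else
        echo "user+kernel"
    fi
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
else
    echo "perf_event_paranoid is not readable on this system"
fi
```

If it reports user-only, either running perf as root or lowering the value (for example with sudo sysctl kernel.perf_event_paranoid=1) should make the kernel-side counts visible.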

Linux pipe buffers are 64 KiB by default. In both runs below, the counts work out to very close to 2 context switches per 64 KiB transferred. I'm not sure exactly how that arises, but I suspect perf only counts context switches away from the measured dd processes, either to the other dd or to the idle task for that CPU.
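For a per-task view that does not depend on perf, the kernel also exposes cumulative context-switch counters in /proc/PID/status. This is only a sketch: ctxt_switches is a hypothetical helper name, and the counters cover the whole lifetime of the task rather than a measurement window, so you would sample them twice and subtract.

```shell
#!/bin/sh
# ctxt_switches PID: print the voluntary and nonvoluntary context-switch
# counters that Linux keeps for one task.
ctxt_switches() {
    grep -E '^(non)?voluntary_ctxt_switches:' "/proc/$1/status"
}

# Example: counters for the current shell.
ctxt_switches "$$"
```

Running it against the PID of each dd in the pipeline (for example, the PIDs printed by pgrep -x dd) shows how the switches are split between the two processes.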

$ sudo perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero|dd bs=1M >/dev/null'
^C14508+0 records in
14507+0 records out
15211692032 bytes (15 GB, 14 GiB) copied, 3.87098 s, 3.9 GB/s
14508+0 records in
14508+0 records out
15212740608 bytes (15 GB, 14 GiB) copied, 3.87044 s, 3.9 GB/s
taskset: Interrupt

 Performance counter stats for 'taskset 0x1 sh -c dd bs=1M </dev/zero|dd bs=1M >/dev/null':

       3872.597645      task-clock (msec)         #    1.000 CPUs utilized          
           464,325      context-switches          #    0.120 M/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               928      page-faults               #    0.240 K/sec                  
    11,099,016,844      cycles                    #    2.866 GHz                    
    13,765,220,898      instructions              #    1.24  insn per cycle         
     3,053,464,009      branches                  #  788.480 M/sec                  
        15,462,959      branch-misses             #    0.51% of all branches        

       3.874121023 seconds time elapsed

$ echo $((15212740608 / 464325))
32763

$ sudo perf stat sh -c 'dd bs=1M </dev/zero|dd bs=1M >/dev/null'
^C7031+0 records in
7031+0 records out
7032+0 records in
7031+0 records out
7372537856 bytes (7.4 GB, 6.9 GiB) copied, 4.27436 s, 1.7 GB/s
7372537856 bytes (7.4 GB, 6.9 GiB) copied, 4.27414 s, 1.7 GB/s
sh: Interrupt

 Performance counter stats for 'sh -c dd bs=1M </dev/zero|dd bs=1M >/dev/null':

       3736.056509      task-clock (msec)         #    0.873 CPUs utilized          
           218,047      context-switches          #    0.058 M/sec                  
               206      cpu-migrations            #    0.055 K/sec                  
               877      page-faults               #    0.235 K/sec                  
     8,328,413,541      cycles                    #    2.229 GHz                    
     7,617,859,285      instructions              #    0.91  insn per cycle         
     1,671,904,009      branches                  #  447.505 M/sec                  
        13,827,669      branch-misses             #    0.83% of all branches        

       4.277591869 seconds time elapsed

$ echo $((7372537856 / 218047))
33811
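The two divisions above give bytes per context switch. The same numbers can be turned around into context switches per pipe buffer, which is the figure the "2 per 64k" observation refers to. A small sketch, assuming the default 64 KiB (65536-byte) pipe buffer; switches_per_pipe_buffer is a hypothetical helper name.

```shell
#!/bin/sh
# switches_per_pipe_buffer BYTES SWITCHES: context switches per 64 KiB
# transferred (65536 bytes is the assumed default pipe buffer size).
switches_per_pipe_buffer() {
    awk -v bytes="$1" -v switches="$2" \
        'BEGIN { printf "%.2f\n", switches * 65536 / bytes }'
}

switches_per_pipe_buffer 15212740608 464325   # pinned run (taskset 0x1)
switches_per_pipe_buffer 7372537856 218047    # unpinned run
```

This prints 2.00 for the pinned run and 1.94 for the unpinned one.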

Older versions of perf ~2.6.x

I'm using perf version: 2.6.35.14-106.

Capture all the output

I don't have the -x switch on my Fedora 14 system so I'm not sure if that's your actual problem or not. I'll investigate on a newer Ubuntu 12.10 system later on but this worked for me:

$ (perf stat -ecache-misses ls ) > stat.log 2>&1
$
$ more stat.log 
maccheck.txt
sample.txt
stat.log

 Performance counter stats for 'ls':

              13209  cache-misses            

        0.018231264  seconds time elapsed

I only want perf's output

You could try this: the output from ls gets redirected to /dev/null, while the output from perf (both STDERR and STDOUT) goes to the file stat.log.

$ (perf stat -ecache-misses ls > /dev/null ) > stat.log 2>&1
$ more stat.log 

 Performance counter stats for 'ls':

              12949  cache-misses            

        0.022831281  seconds time elapsed
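The two redirections above work because perf stat writes its statistics to stderr by default, while the measured command writes to stdout. The sketch below reproduces the mechanism without needing perf installed. fake_perf_stat is a hypothetical stand-in, not part of perf.

```shell
#!/bin/sh
# fake_perf_stat CMD...: hypothetical stand-in for "perf stat CMD". It runs
# CMD and then writes one statistics line to stderr.
fake_perf_stat() {
    "$@"
    echo "stats for '$*'" >&2
}

log=$(mktemp)

# Capture everything: the command's stdout and the statistics reach the log.
(fake_perf_stat echo hello) > "$log" 2>&1
cat "$log"

# Statistics only: the command's stdout is discarded inside the subshell,
# so only the stderr line is left for the outer redirection to capture.
(fake_perf_stat echo hello > /dev/null) > "$log" 2>&1
cat "$log"

rm -f "$log"
```

The first cat shows both lines; the second shows only the statistics line.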

Newer versions of perf 3.x+

I'm using perf version: 3.5.7

Capturing only perf's output

With the newer versions of perf there are dedicated options for controlling where its output gets sent. One choice is to send it to a file via the -o|--output option: give either of those switches a filename to capture the output.

-o file, --output file
    Print the output into the designated file.

The alternative is to redirect the output to an alternate file descriptor, 3 for example. All you need to do is open that file descriptor before perf writes to it.

--log-fd
    Log output to fd, instead of stderr. Complementary to --output, and 
    mutually exclusive with it. --append may be used here. Examples: 
       3>results perf stat --log-fd 3 -- $cmd
       -or-
       3>>results perf stat --log-fd 3 --append -- $cmd

So if we wanted to collect the perf output for the ls command you could use this command:

$ 3>results.log perf stat --log-fd 3 ls > /dev/null
$ 
$ more results.log

 Performance counter stats for 'ls':

          2.498964 task-clock                #    0.806 CPUs utilized          
                 0 context-switches          #    0.000 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
               258 page-faults               #    0.103 M/sec                  
           880,752 cycles                    #    0.352 GHz                    
           597,809 stalled-cycles-frontend   #   67.87% frontend cycles idle   
           652,087 stalled-cycles-backend    #   74.04% backend  cycles idle   
         1,261,424 instructions              #    1.43  insns per cycle        
                                             #    0.52  stalled cycles per insn [55.31%]
     <not counted> branches                
     <not counted> branch-misses           

       0.003102139 seconds time elapsed

If you use the --append version then the output of multiple commands will be appended to the same log file, results.log in our case.
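The difference between opening descriptor 3 with 3> and with 3>> can be seen without perf. In this sketch report_to_fd3 is a hypothetical stand-in for "perf stat --log-fd 3 CMD"; it only mimics where the report is written.

```shell
#!/bin/sh
# report_to_fd3 CMD...: hypothetical stand-in for "perf stat --log-fd 3 CMD".
# It runs CMD quietly and writes its report to file descriptor 3.
report_to_fd3() {
    "$@" > /dev/null
    echo "stats for '$*'" >&3
}

results=$(mktemp)

# 3> truncates the file when the descriptor is opened ...
3>"$results" report_to_fd3 echo first
# ... while 3>> appends, so reports from several runs accumulate.
3>>"$results" report_to_fd3 echo second

cat "$results"
rm -f "$results"
```

The log ends up with one line per run because the second invocation opened the file in append mode.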

Installing perf

Installation is pretty trivial:

Fedora

$ yum install perf

Ubuntu/Debian

$ apt-get install linux-tools-common linux-tools

Understanding Linux Perf sched-switch and context-switches

I know this question is pretty old (Feb 16), but here is a response in case it helps someone else. The problem is that you've entered '-F 999', indicating that you want to sample the events at a frequency of 999 times a second. For 'trace' events, you generally don't want to do sampling. For instance, when I select sched:sched_switch, I want to see every context switch. If you enter -F 999 then you will only get a sampling of the context switches. If you look at the output of your 'perf record' command with something like:

perf script --verbose -I --header -i perf.dat -F comm,pid,tid,cpu,time,period,event,trace,ip,sym,dso > perf.txt

then you would see that the 'period' (the number between the timestamp and the event name) would not (usually) be == 1.

If you use a 'perf record' command without -F (like the one further down), you'll see a period of 1 in the 'perf script' output, like:

Binder:695_5   695/2077  [000] 16231.700440:          1         sched:sched_switch: prev_comm=Binder:695_5 prev_pid=2077 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:17 next_pid=7665 next_prio=120

A long-winded explanation, but basically: don't do that (where 'that' is '-F 999').

If you just do something like:

perf record -a -g -e sched:sched_switch -e sched:sched_blocked_reason -e sched:sched_stat_sleep -e sched:sched_stat_wait sleep 5

then the output would show every context switch with the call stack for each event. And you might need to do:

echo 1 > /proc/sys/kernel/sched_schedstats

to get the sched_stat events.
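Once every sched_switch event is recorded, the perf script text can be summarized to show which task switched to which, which is one way to count switches between individual processes. This is a sketch only: count_switch_pairs is a hypothetical helper name, and it assumes perf script lines carrying prev_comm=, prev_pid=, next_comm= and next_pid= fields as in the sample line shown earlier.

```shell
#!/bin/sh
# count_switch_pairs: read "perf script" text on stdin and print how often
# each "prev_comm -> next_comm" pair occurs, most frequent first.
count_switch_pairs() {
    awk '
        /sched:sched_switch:/ {
            p = index($0, "prev_comm="); q = index($0, " prev_pid=")
            n = index($0, "next_comm="); m = index($0, " next_pid=")
            if (!p || !q || !n || !m) next
            # "prev_comm=" and "next_comm=" are both 10 characters long.
            prev = substr($0, p + 10, q - p - 10)
            nxt  = substr($0, n + 10, m - n - 10)
            pairs[prev " -> " nxt]++
        }
        END { for (k in pairs) print pairs[k], k }
    ' | sort -rn
}

# Example using the sample line from above.
count_switch_pairs <<'EOF'
Binder:695_5   695/2077  [000] 16231.700440:          1         sched:sched_switch: prev_comm=Binder:695_5 prev_pid=2077 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:17 next_pid=7665 next_prio=120
EOF
```

With a real recording you would pipe perf script into the function. Lines that are not sched_switch events, such as call-stack frames, are ignored.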