Linux – How to change Linux context-switch frequency


How is it possible to change the Linux (linaro, ubuntu, debian) context-switch frequency?

I am okay with trading a less responsive system for a more efficient one.

EDIT1: I have a main process which I want to run as fast as possible (maximal clock cycles per second), so I thought of reducing the context-switch frequency (= increasing the timeslice). The question is how to do it, and whether it would have a significant effect. Can I calculate the cost of a context switch? Meaning, if I double the timeslice, can I estimate what my performance gain will be, in %, for the main process I care about?

Best Answer

If your task is the only process requesting time on a specific CPU, there will be no context switches between tasks :-). But the CPU may still be interrupted, causing a context switch into the kernel and back. And one possible cause is the pre-emption timer, checking if there is another task to run on this CPU...
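Before tuning anything, it may be worth checking how often your process is actually being switched out. The kernel exposes per-process counters in /proc/<pid>/status. The following is a minimal sketch that reads them; ctxt_switches is just an illustrative helper name, not a standard tool.

```shell
#!/bin/sh
# Print "<voluntary> <nonvoluntary>" context-switch counts from a
# /proc/<pid>/status file. ctxt_switches is a hypothetical helper name.
ctxt_switches() {
    awk '/^voluntary_ctxt_switches:/    { v = $2 }
         /^nonvoluntary_ctxt_switches:/ { n = $2 }
         END { print v + 0, n + 0 }' "$1"
}

# Example: counters for the current shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    ctxt_switches "/proc/$$/status"
fi
```

Point it at /proc/<pid>/status of your main process and sample it twice a few seconds apart. If the nonvoluntary count barely grows, the process is rarely being pre-empted by other tasks, and a longer timeslice probably has little to gain.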

Linux can avoid generating any pre-emption timer interrupts on a CPU when there is no reason for them. See CONFIG_NO_HZ_FULL. To use this feature, it must be enabled when the kernel is built, and it must also be enabled using a boot option.

https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
https://lwn.net/Articles/549580/ "(Nearly) full tickless operation in 3.10"

By default, no CPU will be an adaptive-ticks CPU. The "nohz_full=" boot parameter specifies the adaptive-ticks CPUs. For example, "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks CPUs. Note that you are prohibited from marking all of the CPUs as adaptive-tick CPUs [...]
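To check both conditions on a running system, something like the sketch below may help. It makes some assumptions: the build config is looked up in /boot/config-$(uname -r), which is a distro convention and not guaranteed; /sys/devices/system/cpu/nohz_full may be missing on older kernels; and expand_cpu_list is a hypothetical helper that only handles plain "a,b-c" lists, not the stride syntax the kernel also accepts.

```shell
#!/bin/sh
# Expand a kernel CPU list such as "1,6-8" into "1 6 7 8".
# Hypothetical helper; plain numbers and ranges only.
expand_cpu_list() {
    list=$1
    out=""
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            *-*)
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="$out $part"
                ;;
        esac
    done
    IFS=$old_ifs
    echo $out    # unquoted on purpose, to drop the leading space
}

# Was the kernel built with NO_HZ_FULL? (config path is an assumption)
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    grep '^CONFIG_NO_HZ_FULL' "$cfg" || echo "CONFIG_NO_HZ_FULL not set in $cfg"
fi

# Which CPUs, if any, are adaptive-ticks CPUs right now?
if [ -r /sys/devices/system/cpu/nohz_full ]; then
    echo "adaptive-ticks CPUs: $(expand_cpu_list "$(cat /sys/devices/system/cpu/nohz_full)")"
fi
```

If the second check prints nothing useful, looking for nohz_full= in /proc/cmdline shows whether the boot option was passed at all.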

LWN.net says "according to Ingo Molnar, as much as 1% of the CPU's time will be saved" for adaptive-ticks CPUs. The kernel document says this has six different costs, and there is also a list of "KNOWN ISSUES".

This gain is relatively small, particularly compared to the potential throughput gains of reducing the frequency of context-switches between multiple tasks, as referenced in this answer: How to change the length of time-slices used by the Linux CPU scheduler?
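For the "what would I gain in %" part of the question, a back-of-the-envelope model is possible, though it is only a sketch. Assume every timeslice of length t is followed by one switch with direct cost c; then the share of CPU time lost is c / (t + c). The function name overhead_pct and the numbers below are placeholders for illustration, not measurements, and the model ignores cache and TLB effects, which may well dominate in practice.

```shell
#!/bin/sh
# overhead_pct SLICE_US SWITCH_US
# Percent of CPU time spent switching under the simple model c / (t + c).
# Hypothetical helper; both arguments are in microseconds.
overhead_pct() {
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.2f\n", 100 * c / (t + c) }'
}

# Placeholder values: 4 ms and 8 ms slices, 5 us assumed switch cost.
overhead_pct 4000 5
overhead_pct 8000 5
```

Under this model, doubling the timeslice roughly halves the overhead, so whether that matters depends almost entirely on the real value of c for your workload and hardware.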

Small print: these measurements pre-date Spectre, Meltdown, KPTI and x86 ASID support :-(. And I guess they also apply to somewhat older hardware. Ask a kernel expert or run your own measurements on how the cost of context-switches has changed on your specific kernel version and hardware... PTI was largely supposed to be mitigated by ASID, except for software that calls into the kernel very frequently, the main example being databases. But I don't have a good grasp on the numbers.

Molnar's hope in the original RFC patch was that with time, it "will likely be enabled by most Linux distros". I notice Fedora 28 provides a default kernel built with NO_HZ_FULL support. Debian 9 does not, however.

More recently, Linux v4.17 removes a residual 1 Hz timer tick from the nohz_full CPUs. I imagine the effect on throughput is quite small :-), but I've been trying to follow the status of NO_HZ_FULL benefits when there are multiple runnable processes on a CPU:

once we reach 0 Hz we can [then] remove the periodic tick assumption from nr_running>=2 as well, by essentially interrupting busy tasks only as frequently as the sched_latency constraints require us to do - once every 4-40 msecs, depending on nr_running.

This is a bit confusing, as pre-emption already started using a separate, more precise tick back in v2.6.25-rc1, in commit 8f4d37ec073c, "sched: high-res preemption tick". (Found via this comment on the same LWN.net article: https://lwn.net/Articles/549754/)

Related Solutions

Linux – How to add CPU frequency governors to the Linux kernel

You'll have to find the code for that specific governor and add it to your kernel tree before recompiling, or you can write a suitable Makefile and compile the governor as a module. The code for the governor should be in drivers/cpufreq/. For example, for the lulzactive governor: drivers/cpufreq/cpufreq_lulzactive.c

Record time of every process or thread context switch

I don't have an answer but you might find one amongst the tools, examples and resources written or listed by Brendan Gregg on the perf command and Linux kernel ftrace and debugfs.

On my Raspberry Pi these tools were in package perf-tools-unstable. The perf command was actually in /usr/bin/perf_3.16.

Of interest may be this discussion and context-switch benchmark by Benoit Sigoure, and the lat_ctx test from the fairly old lmbench suite.

They may need some work to run on the Pi. For example, with tsuna/contextswitch I edited the while line in get_iterations() in timectxswws.c to while (iterations * ws_pages * 4096UL < 4294967295UL) {, and removed -march=native -mno-avx from the Makefile.

Using perf record for 10 seconds on the Pi over ssh whilst simultaneously doing while sleep .1;do echo hi;done in another ssh:

sudo timeout -10 perf_3.16 record -e context-switches -a
sudo perf_3.16 script -f time,pid,comm | less

gives output like this

           sleep 29341 2703976.560357: 
         swapper     0 2703976.562160: 
    kworker/u8:2 29163 2703976.564901: 
         swapper     0 2703976.565737: 
            echo 29342 2703976.565768: 
     migration/3    19 2703976.567549: 
           sleep 29343 2703976.570212: 
     kworker/0:0 28906 2703976.588613: 
     rcu_preempt     7 2703976.609261: 
           sleep 29343 2703976.670674: 
            bash 29066 2703976.671654: 
            echo 29344 2703976.675065: 
            sshd 29065 2703976.675454: 
         swapper     0 2703976.677757:

presumably showing when a context-switch event happened, for which process.
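To get a quick overview from output in that shape, the events can be tallied per command name. This is a minimal sketch, assuming each line ends with the three fields shown above (comm, pid, timestamp); count_switches_by_comm is a hypothetical helper name, and a comm that itself contains spaces would be reduced to its last word.

```shell
#!/bin/sh
# Read "comm pid time:" lines on stdin and print "<count> <comm>",
# busiest first. Hypothetical helper for summarising perf script output.
count_switches_by_comm() {
    awk 'NF >= 3 { n[$(NF - 2)]++ }
         END { for (c in n) print n[c], c }' |
        sort -k1,1nr -k2
}

# Usage, with the same perf invocation as above:
#   sudo perf_3.16 script -f time,pid,comm | count_switches_by_comm
```

This only counts events per name; it says nothing about how long each switch took.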
