Change Time-Slices in Linux CPU Scheduler

linux kernel-tuning

Is it possible to increase the length of the time-slices which the Linux CPU scheduler allows a process to run for? How could I do this?

Background knowledge

This question asks how to reduce how frequently the kernel forces a switch between different processes running on the same CPU. This is the kernel feature described as "pre-emptive multi-tasking". It is generally good, because it stops an individual process from hogging the CPU and making the system completely unresponsive. However, switching between processes has a cost, so there is a tradeoff.

If you have one process which uses all the CPU time it can get, and another process which interacts with the user, then switching more frequently can reduce response delays.

If you have two processes which use all the CPU time they can get, then switching less frequently can allow them to get more work done in the same time.

Motivation

I am posting this based on my initial reaction to the question How to change Linux context-switch frequency?

I do not personally want to change the timeslice. However I vaguely remember this being a thing, with the CONFIG_HZ build-time option. So I want to know what the current situation is. Is the CPU scheduler time-slice still based on CONFIG_HZ?

Also, in practice build-time tuning is very limiting. For Linux distributions, it is much more practical if they can have a single kernel per CPU architecture, and allow configuring it at runtime or at least at boot-time. If tuning the time-slice is still relevant, is there a new method which does not lock it down at build-time?

Best Answer

For most RHEL7 servers, Red Hat suggests increasing sched_min_granularity_ns to 10 ms and sched_wakeup_granularity_ns to 15 ms. (Source. Technically that link says 10 μs, which would be 1000 times smaller; that appears to be a mistake.)

We can try to understand this suggestion in more detail.

Increasing sched_min_granularity_ns

On current Linux kernels, CPU time slices are allocated to tasks by CFS, the Completely Fair Scheduler. CFS can be tuned using a few sysctl settings.

  • kernel.sched_min_granularity_ns
  • kernel.sched_latency_ns
  • kernel.sched_wakeup_granularity_ns

You can set sysctls temporarily, until the next reboot, or permanently in a configuration file which is applied on each boot. To learn how to apply this type of setting, look up "sysctl" or read the short introduction here.
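For example, the Red Hat values above could be applied temporarily (as root) with `sysctl -w kernel.sched_min_granularity_ns=10000000` and similarly for the wakeup setting. To make them survive reboots, a sketch of a drop-in file (the filename 90-sched-tuning.conf is my own choice, not anything official) could look like:

```
# /etc/sysctl.d/90-sched-tuning.conf
# Values are in nanoseconds: 10 ms and 15 ms respectively.
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
```

Running `sysctl -p /etc/sysctl.d/90-sched-tuning.conf` (or rebooting) then applies the file.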

sched_min_granularity_ns is the most prominent setting. In the original sched-design-CFS.txt this was described as the only "tunable" setting, "to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads."

In other words, we can change this setting to reduce overheads from context-switching, and therefore improve throughput at the cost of responsiveness ("latency").

I think of this CFS setting as mimicking the previous build-time setting, CONFIG_HZ. In the first version of the CFS code, the default value was 1 ms, equivalent to 1000 Hz, for "desktop" usage. Other supported values of CONFIG_HZ were 250 Hz (the default) and 100 Hz for the "server" end. 100 Hz was also useful when running Linux on very slow CPUs; this was one of the reasons given when CONFIG_HZ was first added as a build setting on x86.

It sounds reasonable to try changing this value up to 10 ms (i.e. 100 Hz), and measure the results. Remember the sysctls are measured in ns. 1 ms = 1,000,000 ns.
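A quick shell sketch of the unit conversion, assuming the 10 ms target:

```shell
# Convert the target time-slice from milliseconds to the
# nanoseconds which the sysctl expects.
target_ms=10
target_ns=$((target_ms * 1000 * 1000))
echo "kernel.sched_min_granularity_ns=$target_ns"
# prints: kernel.sched_min_granularity_ns=10000000
```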

We can see this old-school tuning for 'server' was still very relevant in 2011, for throughput in some high-load benchmark tests: https://events.static.linuxfound.org/slides/2011/linuxcon/lcna2011_rajan.pdf

And perhaps a couple of other settings

The default values of the three settings above look relatively close to each other. It makes me want to keep things simple and multiply them all by the same factor :-). But when I tried to look into this, it seemed some more specific tuning might also be relevant, since you are tuning for throughput.

sched_wakeup_granularity_ns concerns "wake-up pre-emption". I.e. it controls when a task woken by an event is able to immediately pre-empt the currently running process. The 2011 slides showed performance differences for this setting as well.

See also "Disable WAKEUP_PREEMPT" in this 2010 reference by IBM, which suggests that "for some workloads" this default-on feature "can cost a few percent of CPU utilization".

SUSE Linux has a doc suggesting that setting this larger than half of sched_latency_ns will effectively disable wake-up pre-emption, and then "short duty cycle tasks will be unable to compete with CPU hogs effectively".
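That rule of thumb can be checked against a running system with a read-only sketch like the following (no root needed; note that on newer kernels these may no longer be exposed as sysctls — I believe they moved under /sys/kernel/debug/sched/ around 5.13 — so the script falls back gracefully):

```shell
# Compare sched_wakeup_granularity_ns against half of sched_latency_ns.
# Per the SUSE doc, wakeup > latency/2 effectively disables
# wake-up pre-emption.
latency=$(sysctl -n kernel.sched_latency_ns 2>/dev/null || echo 0)
wakeup=$(sysctl -n kernel.sched_wakeup_granularity_ns 2>/dev/null || echo 0)
if [ "$latency" -eq 0 ]; then
    echo "tunables not exposed as sysctls on this kernel"
elif [ "$wakeup" -gt $((latency / 2)) ]; then
    echo "wake-up pre-emption effectively disabled"
else
    echo "wake-up pre-emption active"
fi
```

For example, with the Red Hat-style wakeup value of 15 ms and a latency of 24 ms, 15 ms is greater than 12 ms, so wake-up pre-emption would effectively be off.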

The SUSE document also provides more detailed descriptions of the other settings. You should definitely check what the current default values are on your own systems, though. For example, the default values on my system seem slightly different to what the SUSE doc says.

https://www.suse.com/documentation/opensuse121/book_tuning/data/sec_tuning_taskscheduler_cfs.html

If you experiment with any of these scheduling variables, I think you should also be aware that all three are scaled (multiplied) by 1+log_2 of the number of CPUs. This scaling can be disabled using kernel.sched_tunable_scaling. I could be missing something, but this seems surprising e.g. if you are considering the responsiveness of servers providing interactive apps and running at/near full load, and how that responsiveness will vary with the number of CPUs per server.
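To make that scaling concrete, here is a sketch of the multiplication (the 8-CPU count and 3 ms base value are assumptions purely for illustration; the kernel uses an integer log2):

```shell
# With kernel.sched_tunable_scaling=1 (logarithmic, the default), the
# scheduler tunables are multiplied by 1 + log2(number of CPUs).
cpus=8            # assumed CPU count, for illustration
base_ns=3000000   # assumed base value of 3 ms, for illustration

factor=1
n=$cpus
while [ "$n" -gt 1 ]; do   # compute 1 + integer log2(cpus)
    factor=$((factor + 1))
    n=$((n / 2))
done

echo "scaling factor: $factor"                    # 1 + log2(8) = 4
echo "effective value: $((base_ns * factor)) ns"  # 12000000 ns = 12 ms
```

This is mainly to illustrate why a value tuned and measured on one machine may not transfer directly to a machine with a different CPU count.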

Suggestion if your workload has large numbers of threads / processes

I also came across a 2013 suggestion for a couple of other settings, which may gain significant throughput if your workload has large numbers of threads. (Or perhaps more accurately, it re-gains the throughput which they had obtained on pre-CFS kernels.)

Ignore CONFIG_HZ

I think you don't need to worry about what CONFIG_HZ is set to. My understanding is it is not relevant on current kernels, assuming you have reasonable timer hardware. See also commit 8f4d37ec073c, "sched: high-res preemption tick", found via this comment in a thread about the change: https://lwn.net/Articles/549754/ .

(If you look at the commit, I wouldn't worry that SCHED_HRTICK depends on X86. That requirement seems to have been dropped in some more recent commit).
