Content notice: this post includes links to various Linux discussion and code. Some linked content does not meet the current Code of Conduct for StackExchange or for Linux. Mostly they "insult the code [but not the person]". However, some language is used that should simply not be repeated. I ask you to avoid imitating, parroting, or debating such language.
Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
On 05/07/2019 12:38, Peter Zijlstra wrote:
On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
My cpu "iowait" time appears to be reported incorrectly. Do you know why
this could happen?
Because iowait is a magic random number that has no sane meaning.
Personally I'd prefer to just delete the whole thing, except ABI :/
Also see the comment near nr_iowait()
Thanks. I take [the problems mentioned in current documentation] as being different problems, but you mean there is not much demand (or point) to "fix" my issue.
I found my problem. It was already noticed five years ago, and it would not be trivial to fix.
"iowait" time is updated by the function account_idle_time()
:
/*
* Account for idle time.
* @cputime: the CPU time spent in idle wait
*/
void account_idle_time(u64 cputime)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0)
cpustat[CPUTIME_IOWAIT] += cputime;
else
cpustat[CPUTIME_IDLE] += cputime;
}
This works as I expected, if you approximate CPU time by "sampling" it with the traditional timer interrupt ("tick"). However, it may not work if the tick is turned off during idle time to save power - NO_HZ_IDLE. It may also fail if you allow the tick to be turned off for performance reasons - NO_HZ_FULL - because that requires starting VIRT_CPU_ACCOUNTING. Most Linux kernels use the power-saving feature. Some embedded systems do not use either feature. Here is my explanation:
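To see which of these options your own kernel was built with, you can grep its config, if your distribution ships it. This is a sketch, not guaranteed to work on every system: the config file path varies between distributions.

```shell
# Look up the tick / accounting options in the running kernel's config.
# The config may live in /boot/config-$(uname -r) or /proc/config.gz,
# depending on the distribution; some ship neither.
cfg=/boot/config-$(uname -r)
if [ -r "$cfg" ]; then
    grep -E 'CONFIG_(NO_HZ_IDLE|NO_HZ_FULL|VIRT_CPU_ACCOUNTING)' "$cfg"
elif [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | grep -E 'CONFIG_(NO_HZ_IDLE|NO_HZ_FULL|VIRT_CPU_ACCOUNTING)'
else
    echo "kernel config not available"
fi
```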
When the IO is complete, the device sends an interrupt. The kernel interrupt handler wakes the process using try_to_wake_up(). It subtracts one from the nr_iowait counter:

```c
	if (p->in_iowait) {
		delayacct_blkio_end(p);
		atomic_dec(&task_rq(p)->nr_iowait);
	}
```
If the process is woken on an idle CPU, that CPU calls account_idle_time(). Depending on which configuration applies, this is called either from tick_nohz_account_idle_ticks() from __tick_nohz_idle_restart_tick(), or from vtime_task_switch() from finish_task_switch().
By this time, ->nr_iowait has already been decremented. If it has been reduced to zero, then no iowait time will be recorded.
This effect can vary: it depends which CPU the process is woken on. If the process is woken on the same CPU that received the IO completion interrupt, the idle time can be accounted earlier, before ->nr_iowait is decremented. In my case, I found that CPU 0 handles the ahci interrupt, by looking at watch cat /proc/interrupts.
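A quick way to see how a device's interrupts are distributed across CPUs (the device name "ahci" here is from my system; substitute the name from your own /proc/interrupts):

```shell
dev=ahci    # device name to look for; taken from my system, substitute yours
head -n 1 /proc/interrupts             # header line: one column per CPU
grep "$dev" /proc/interrupts || echo "no interrupt lines match \"$dev\""
```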
I tested this with a simple sequential read:

```shell
dd if=largefile iflag=direct bs=1M of=/dev/null
```

If I pin the command to CPU 0 using taskset -c 0 ..., I see "correct" values for iowait. If I pin it to a different CPU, I see much lower values. If I run the command normally, the result varies with scheduler behaviour, which has changed between kernel versions. On recent kernels (4.17, 5.1, 5.2-rc5-ish), the command seems to spend about 1/4 of its time on CPU 0, judging from the fact that the reported "iowait" time is reduced to about that fraction.
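The iowait values in a test like this can be read straight from /proc/stat. Here is a minimal sketch of comparing before/after samples for one CPU; the workload line is the dd command from above, commented out so the skeleton runs anywhere:

```shell
# Print the iowait ticks for cpuN from /proc/stat.
read_iowait() { awk -v c="cpu$1" '$1 == c { print $6 }' /proc/stat; }

before=$(read_iowait 0)
# Run the workload here, pinned to CPU 0, e.g.:
# taskset -c 0 dd if=largefile iflag=direct bs=1M of=/dev/null
sleep 1
after=$(read_iowait 0)
echo "cpu0 iowait delta: $((after - before)) ticks"
```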
(Not explained: I have not confirmed exactly why suppressing NO_HZ_IDLE gives "correct" iowait for each CPU on 4.17+, but not on 4.16 or 4.15.

Running this test on my virtual machine seems to reproduce "correct" iowait, for each (or any) CPU. This is due to IRQ_TIME_ACCOUNTING. That feature is also used in my tests outside the VM, but I get many more interrupts when testing inside the VM. Specifically, there are more than 1000 "Function call interrupts" per second on the virtual CPU that "dd" runs on.)
So you should not rely too much on the details of my explanation :-)
There is some background about "iowait" here: How does a CPU know there is IO pending? The answer there quotes the counter-intuitive idea that cumulative iowait "may decrease in certain conditions". I wondered if my simple test might be triggering such an undocumented condition?
Yes.
When I first looked this up, I found talk of "hiccups". Also, the problem was illustrated by showing that the cumulative "iowait" time was non-monotonic. That is, it sometimes jumped backwards (decreased). It was not as straightforward as the test above.
However, when they investigated, they found the same fundamental problem. A solution was proposed and prototyped by Peter Zijlstra and Hidetoshi Seto respectively. The problem is explained in the cover message:
[RFC PATCH 0/8] rework iowait accounting (2014-07-07)
I found no evidence of progress beyond this. There was an open question on one of the details. Also, the full series touched specific code for the PowerPC, S390, and IA64 CPU architectures. So I say this is not trivial to fix.
Best Answer
What you look for should be found inside this virtual file:

and the reverse in

From drivers/base/cpu.c we see that the source displayed is the kernel variable cpu_isolated_map:

and cpu_isolated_map is exactly what gets set by kernel/sched/core.c at boot:

But as you observed, someone could have modified the affinity of processes, including daemon-spawned ones, cron, systemd and so on. If that happens, new processes will be spawned inheriting the modified affinity mask, not the one set by isolcpus.

So the above will give you isolcpus as you requested, but that might still not be helpful.

Supposing that you find out that isolcpus has been issued, but has not "taken", this unwanted behaviour could be caused by some process realizing that it is bound only to CPU 0, believing by mistake that it is in monoprocessor mode, and helpfully attempting to "set things right" by resetting the affinity mask. If that is the case, you might try to isolate CPUs 0-5 instead of 1-6, and see whether this happens to work.
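Whatever set or reset the mask, a process's effective affinity can be checked directly from procfs; a minimal sketch, inspecting the shell's own status file:

```shell
# Cpus_allowed / Cpus_allowed_list show the affinity mask the kernel
# will actually use for this process (here: the shell itself).
grep -E 'Cpus_allowed(_list)?:' /proc/self/status
```

Comparing this against the isolcpus= setting on the kernel command line shows whether the boot-time mask survived.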