Linux – How to verify `nice` is working


Looking at different jobs running on a system with shared resources, it seems the nice values are being ignored. Many jobs with nice set at 19 are running at 100% CPU load, while many other jobs with nice set at 0 are running as low as 10% CPU load.
All of these processes are demanding and, if run on an idle system, would max out every CPU given to them (e.g. NAMD).

I read here that

"…while [a nice] value is adjustable it can be ignored by the kernel's scheduler in Linux implementations."

Is this true? Is it possible that the kernel could be ignoring the nice value? It seems this is what is going on, but how can I be certain? I do not want to make this into an issue with the sys admin without being more sure. I have read the related posts "How is nice working?" and "nice not really helping on Linux", but these do not discuss nice failing to affect CPU load.

Could it be that once a task is given resources, it will hold on to them for some time before reassigning them to the higher priority task? The low priority task has been running for days while the higher priority task repeatedly starts lots of short but demanding calculations which run for less than 10 minutes. Could it be that in between the short tasks the system gives resources to the low priority task which then holds onto them?

I believe the system I am experiencing this on is a StackIQ-wrapped CentOS 6.5 installation (though I could easily be mistaken on some detail).

Best Answer

10 minutes is very much long-term as far as Linux's scheduler is concerned. Time slices are something like 10ms.

When you're looking at CPU usage percentages, keep in mind that top adds up the per-thread usage of multi-threaded processes. So a 10-thread process that has each thread getting 10% active time will show up as using 100% of a CPU.
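To make the arithmetic concrete, here is a minimal sketch of how per-thread figures add up to the per-process number. The function name and the sample figures are made up for illustration; running top in threads mode (`top -H`) should show the individual threads instead of the sum.

```python
def process_cpu_percent(per_thread_percent):
    """Sum per-thread CPU percentages into one per-process figure."""
    return sum(per_thread_percent)


# Hypothetical process: 10 threads, each active 10% of the time.
threads = [10.0] * 10
print(process_cpu_percent(threads))  # 100.0, i.e. "100% of a CPU"
```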

Linux's scheduler won't starve a nice 19 task (because deadlock bugs are hard to avoid if a process can be descheduled forever), so even nice 19 won't stop a task from getting some CPU time. If it has a lot of threads, it may still use significant CPU resources.
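As a rough illustration of "not starved", the sketch below assumes the CFS scheduler and its nice-to-weight table, where nice 0 maps to weight 1024 and nice 19 to weight 15. It estimates how one CPU would be split between runnable threads that are all competing for it. This is a simplification: it only applies when tasks actually contend for the same CPU (a nice 19 task on an otherwise idle core still gets 100%), and group scheduling can change the picture.

```python
# Two entries from the kernel's nice-to-weight table (assumption: CFS is in use).
WEIGHT = {0: 1024, 19: 15}


def cpu_shares(nice_values):
    """Return the approximate fraction of one CPU each competing thread gets."""
    weights = [WEIGHT[n] for n in nice_values]
    total = sum(weights)
    return [w / total for w in weights]


# One nice 0 thread and one nice 19 thread fighting over a single CPU.
for nice, share in zip([0, 19], cpu_shares([0, 19])):
    print(f"nice {nice:>2}: {share:.1%}")
```

Under these assumptions the nice 19 thread ends up with roughly 1.4% of the CPU rather than zero.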

If some of the processes are blocking on I/O, especially virtual memory paging, their CPU usage % will go way down. Run something like dstat to see CPU usage breakdowns, disk, network, paging, and context switches. It's like vmstat but colourized and nicer.
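If dstat is not installed, much of the same CPU breakdown can be derived from the aggregate `cpu` line in `/proc/stat`. The sketch below is a minimal example, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are made up. It turns two samples into percentages, so a high iowait share would suggest the jobs are blocked on I/O rather than losing out to other tasks.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def breakdown(before, after):
    """Percentage of elapsed CPU time per category between two samples."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values()) or 1  # avoid dividing by zero on identical samples
    return {k: 100.0 * v / total for k, v in delta.items()}


# Made-up samples; on a real system, read the first line of /proc/stat twice,
# a few seconds apart.
a = parse_cpu_line("cpu  100 50 30 800 20 0 0 0 0 0")
b = parse_cpu_line("cpu  150 100 40 850 60 0 0 0 0 0")
print(breakdown(a, b))
```

The "nice" counter is the time spent running niced user processes, which is useful here for seeing how much of the machine the low-priority jobs are really taking.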

Make sure your processes really are niced the way you think they are, by looking at the NI column in top. (It's unlikely that different threads in the same process will have different nice levels, but I think it's possible.)
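The same check can be scripted. The sketch below reads the nice value of every thread of a process from `/proc/<pid>/task/<tid>/stat`, where nice is field 19. The function names are made up for illustration, and the code assumes a Linux `/proc` layout.

```python
import glob


def nice_from_stat(stat_line):
    """Extract the nice value (field 19) from the contents of a /proc stat file."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split on the
    # last ')'. After that, index 0 is field 3 (state), so field 19 is index 16.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[16])


def thread_nice_values(pid):
    """Map each thread ID of `pid` to its nice value."""
    result = {}
    for path in glob.glob(f"/proc/{pid}/task/*/stat"):
        tid = int(path.split("/")[4])
        try:
            with open(path) as f:
                result[tid] = nice_from_stat(f.read())
        except OSError:
            pass  # the thread exited while we were looking
    return result
```

Calling `thread_nice_values(pid)` with the PID of one of the jobs and checking that every value matches what you expect should settle whether any threads are running at a different nice level.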

If you've been using renice, remember that it's not recursive. renice-ing a parent process won't affect existing children, only future children.
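One way around this is to walk the process tree yourself. The sketch below is a minimal example, not a polished tool: it assumes you have already built a `ppid_of` mapping of PID to parent PID (for example from field 4 of each `/proc/<pid>/stat`), and both helper names are made up. It uses `os.setpriority`, which is subject to the usual permission rules (an unprivileged user can normally only raise nice values on their own processes).

```python
import os


def descendants(ppid_of, root):
    """Return all PIDs below `root`, given a mapping of pid -> parent pid."""
    children = {}
    for pid, ppid in ppid_of.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def renice_tree(ppid_of, root, nice):
    """Apply `nice` to `root` and to every existing descendant."""
    for pid in [root] + descendants(ppid_of, root):
        try:
            os.setpriority(os.PRIO_PROCESS, pid, nice)
        except (ProcessLookupError, PermissionError):
            pass  # process already gone, or not ours to change


# Made-up tree: 2 is a child of 1; 3 and 4 are children of 2; 5 is a child of 3.
tree = {2: 1, 3: 2, 4: 2, 5: 3, 6: 1}
print(sorted(descendants(tree, 2)))  # [3, 4, 5]
```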
