Looking at different jobs running on a system with shared resources, it seems the nice values are being ignored. Many jobs with nice set to 19 are running at 100% CPU load, while many other jobs with nice set to 0 are running as low as 10% CPU load.
All of these processes are demanding and, if run on an idle system, would max out every CPU given to them (e.g. NAMD).
I read here that
"…while [a nice] value is adjustable it can be ignored by the kernel's scheduler in Linux implementations."
Is this true? Is it possible that the kernel could be ignoring the nice value? It seems this is what is going on, but how can I be certain? I do not want to make this into an issue with the sys admin without being more sure. I have read the related posts "How is nice working?" and "nice not really helping on Linux", but these do not discuss nice not working under CPU load.
Could it be that once a task is given resources, it will hold on to them for some time before reassigning them to the higher priority task? The low priority task has been running for days while the higher priority task repeatedly starts lots of short but demanding calculations which run for less than 10 minutes. Could it be that in between the short tasks the system gives resources to the low priority task which then holds onto them?
I believe the system I am experiencing this on is a StackIQ-wrapped CentOS 6.5 installation (though I could easily be mistaken on some detail).
Best Answer
10 minutes is very much long-term as far as Linux's scheduler is concerned. Time slices are something like 10ms.
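To put rough numbers on what niceness should do over an interval that long, here is a minimal sketch. It assumes the CFS scheduler with its usual load weights (1024 for nice 0, 15 for nice 19), threads competing for a single core, and no cgroup or autogroup adjustments; the function name cpu_share is made up for illustration.

```python
# Load weights for nice 0 and nice 19 from the kernel's CFS weight table.
# Assumption: CFS is in use and no cgroup/autogroup weighting applies.
NICE_0_WEIGHT = 1024
NICE_19_WEIGHT = 15


def cpu_share(own_weight, own_threads, other_weight, other_threads):
    """Fraction of one contended core that goes to the 'own' group of threads."""
    own = own_weight * own_threads
    other = other_weight * other_threads
    return own / (own + other)


# One nice-19 thread against one nice-0 thread on the same core.
single = cpu_share(NICE_19_WEIGHT, 1, NICE_0_WEIGHT, 1)
# Ten nice-19 threads against one nice-0 thread on the same core.
many = cpu_share(NICE_19_WEIGHT, 10, NICE_0_WEIGHT, 1)

print(f"1 thread at nice 19:   {single:.1%}")   # 1.4%
print(f"10 threads at nice 19: {many:.1%}")     # 12.8%
```

These shares only apply while both sides are actually runnable on the same core. A nice 19 job sitting alone on a core still gets all of it.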
When you're looking at CPU usage percentages, keep in mind that top adds up the per-thread usage of multi-threaded processes. So a 10-thread process that has each thread getting 10% active time will show up as using 100% of a CPU.
Linux's scheduler won't starve a nice 19 task (because deadlock bugs are hard to avoid if a process can be descheduled forever), so even nice 19 won't stop a task from getting some CPU time. If it has a lot of threads, it may still use significant CPU resources.
If some of the processes are blocking on I/O, especially virtual memory paging, their CPU usage % will go way down. Run something like dstat to see CPU usage breakdowns, disk, network, paging, and context switches. It's like vmstat but colourized and nicer.
Make sure your processes really are niced the way you think they are, by looking at the NI column in top. (It's unlikely that different threads in the same process will have different nice levels, but I think possible.)
If you've been using renice, remember that it's not recursive. renice-ing a parent process won't affect existing children, only future children.
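If you want to check this outside of top, the per-thread nice value and accumulated CPU time can be read straight from /proc. The following is a rough sketch, assuming a Linux /proc filesystem with the field layout described in proc(5); the helper names parse_stat, thread_stats, and report are made up for illustration.

```python
import os
import sys


def parse_stat(raw, ticks):
    """Parse one /proc/<pid>/task/<tid>/stat line.

    Returns (tid, name, nice, cpu_seconds).
    """
    # The command name sits in parentheses and may contain spaces,
    # so split around the last ')' rather than on whitespace.
    head, _, tail = raw.rpartition(")")
    tid_str, _, name = head.partition("(")
    fields = tail.split()
    # fields[0] is the state (field 3 in proc(5)), so field n is fields[n - 3]:
    # utime = field 14, stime = field 15, nice = field 19.
    utime, stime = int(fields[11]), int(fields[12])
    nice = int(fields[16])
    return int(tid_str), name, nice, (utime + stime) / ticks


def thread_stats(pid):
    """Yield (tid, name, nice, cpu_seconds) for every thread of pid."""
    ticks = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # the thread exited while we were scanning
        yield parse_stat(raw, ticks)


def report(pid):
    """Print one line per thread so mismatched nice values stand out."""
    for tid, name, nice, cpu in thread_stats(pid):
        print(f"tid={tid:<8} nice={nice:<4} cpu={cpu:>10.2f}s  {name}")


if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

Run it with a PID as the only argument. Threads of one process listed with different nice values, or a child process that kept its old value after a renice of its parent, would show up here directly.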