I've gotten an initial explanation of this test case from Stefan Seyfried, who wrote the paper this example was taken from. The problem here is that the CPU scheduler part of cgroups always aims to keep any available CPU busy; it never enforces a hard limit if everything fits at once.
In the case where two processes (high and low here) are running on two or more cores, the scheduler simply keeps high on one core and low on the other. Both then run all the time, at close to 100% usage, because they can do so without hitting the situation where the scheduler doesn't give them enough CPU time. cpu.shares scheduling only kicks in when there's a shortage.
In the second case, both processes are pinned to the same CPU. Then the CPU sharing logic has to do something useful with the relative cpu.shares numbers to balance them out, and it does that as hoped.
Hard limits on CPU usage aren't likely to appear until after the CFS Bandwidth Control patch hits. At that point it may be possible to get something more like what I was hoping for.
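A minimal sketch of the two scenarios, assuming cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu (the group names and share values are illustrative, and root is required):

```shell
# Create two groups with a 2:1 cpu.shares ratio (illustrative values).
mkdir /sys/fs/cgroup/cpu/high /sys/fs/cgroup/cpu/low
echo 2048 > /sys/fs/cgroup/cpu/high/cpu.shares
echo 1024 > /sys/fs/cgroup/cpu/low/cpu.shares

# Scenario 1: on a multi-core machine, each busy loop gets its own core,
# both run at ~100%, and the shares never come into play.
# Scenario 2: pin both loops to CPU 0 so they contend; only then does the
# scheduler divide time according to the 2:1 ratio.
taskset -c 0 sh -c 'while :; do :; done' &
echo $! > /sys/fs/cgroup/cpu/high/tasks
taskset -c 0 sh -c 'while :; do :; done' &
echo $! > /sys/fs/cgroup/cpu/low/tasks
```

Watching the two loops in top should show roughly a 2:1 CPU split in the pinned case, and ~100% each in the unpinned case.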
Standalone printf
Part of the "expense" of invoking a process is that several resource-intensive things have to happen:
- The executable has to be loaded from disk. This is slow because the disk has to be accessed to read the binary blob the executable is stored as.
- The executable is typically built against dynamic libraries, so files secondary to the executable also have to be loaded (i.e. more binary blob data read from disk).
- Operating system overhead. Each process you invoke incurs overhead: a process ID has to be created for it, and space in memory has to be carved out both to house the binary data loaded in the first two steps and to populate the various structures that store things such as the process's environment (environment variables, etc.).
An excerpt of an strace of /usr/bin/printf:
$ strace /usr/bin/printf "%s\n" "hello world"
execve("/usr/bin/printf", ["/usr/bin/printf", "%s\\n", "hello world"], [/* 91 vars */]) = 0
brk(0) = 0xe91000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd155a6b000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=242452, ...}) = 0
mmap(NULL, 242452, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd155a2f000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p\357!\3474\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1956608, ...}) = 0
mmap(0x34e7200000, 3781816, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x34e7200000
mprotect(0x34e7391000, 2097152, PROT_NONE) = 0
mmap(0x34e7591000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x191000) = 0x34e7591000
mmap(0x34e7596000, 21688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x34e7596000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd155a2e000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd155a2c000
arch_prctl(ARCH_SET_FS, 0x7fd155a2c720) = 0
mprotect(0x34e7591000, 16384, PROT_READ) = 0
mprotect(0x34e701e000, 4096, PROT_READ) = 0
munmap(0x7fd155a2f000, 242452) = 0
brk(0) = 0xe91000
brk(0xeb2000) = 0xeb2000
brk(0) = 0xeb2000
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=99158752, ...}) = 0
mmap(NULL, 99158752, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd14fb9b000
close(3) = 0
fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd155a6a000
write(1, "hello world\n", 12hello world
) = 12
close(1) = 0
munmap(0x7fd155a6a000, 4096) = 0
close(2) = 0
exit_group(0) = ?
Looking through the above you can get a sense of the additional resources /usr/bin/printf has to consume because it is a standalone executable.
Builtin printf
With the builtin version of printf, all the libraries it depends on, as well as its own binary code, were already loaded into memory when Bash was invoked, so none of that cost has to be paid again. Effectively, when you call one of Bash's builtin "commands", you're really making what amounts to a function call, since everything has already been loaded.
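One way to see the cost difference is to time a loop over each version (a rough illustration; exact numbers vary by machine and shell):

```shell
# Builtin printf: no fork/exec, effectively a function call inside Bash.
time for i in $(seq 1000); do printf '%s\n' 'hello' > /dev/null; done

# Standalone printf: each iteration forks and execs /usr/bin/printf,
# paying the process-creation and dynamic-loading costs every time.
time for i in $(seq 1000); do /usr/bin/printf '%s\n' 'hello' > /dev/null; done
```

You can confirm which version a bare printf resolves to with type -a printf; Bash prefers the builtin unless you give an explicit path.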
An analogy
If you've ever worked with a programming language such as Perl, it's the equivalent of calling system("mycmd") or using backticks (`mycmd`). When you do either of those things, you fork a separate process with its own overhead, versus using the functions offered through Perl's core.
Anatomy of Linux Process Management
There's a pretty good article on IBM developerWorks that breaks down the various aspects of how Linux processes are created and destroyed, along with the different C library functions involved. The article is titled Anatomy of Linux process management - Creation, management, scheduling, and destruction. It's also available as a PDF.
Best Answer
grep -i 'a' is equivalent to grep '[Aa]' in an ASCII-only locale. In a Unicode locale, character equivalences and conversions can be complex, so grep may have to do extra work to determine which characters are equivalent. The relevant locale setting is LC_CTYPE, which determines how bytes are interpreted as characters.
In my experience, GNU grep can be slow when invoked in a UTF-8 locale. If you know that you're searching for ASCII characters only, invoking it in an ASCII-only locale may be faster. I expect that grep -i 'a' and grep '[Aa]' would then produce indistinguishable timings.
That being said, I can't reproduce your finding with GNU grep on Debian jessie (but you didn't specify your test file). If I set an ASCII locale (LC_ALL=C), grep -i is faster. The effect depends on the exact nature of the string; for example, a string with repeated characters reduces the performance (which is to be expected).
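A rough way to test this yourself (the file contents are arbitrary, and en_US.UTF-8 is an assumed locale name; substitute one installed on your system, e.g. C.UTF-8):

```shell
# Build a ~7 MB ASCII test file of lowercase letters.
seq 1000000 | tr '0-9' 'a-j' > /tmp/grep-test

# Case-insensitive search under a UTF-8 locale vs. the C locale.
time LC_ALL=en_US.UTF-8 grep -ic 'abc' /tmp/grep-test
time LC_ALL=C grep -ic 'abc' /tmp/grep-test

# In the C locale, -i and an explicit bracket expression should behave alike.
time LC_ALL=C grep -c '[Aa][Bb][Cc]' /tmp/grep-test
```

The absolute timings depend heavily on your grep version and input, so compare the relative numbers rather than the raw values.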