No, you can set the maximum number of kernel threads to a very high number.
Note that the word "threads" is used for many different things; it may be that Intel's use of the term causes confusion.
Update re kernel threads
Here are some Linux kernel threads running in CoLinux under Vista on an AMD Athlon 64 X2 dual-core CPU.
$ ps -eLf
UID PID PPID LWP C NLWP STIME TTY TIME CMD
root 1 0 1 0 1 17:24 ? 00:00:00 init [2]
root 2 0 2 0 1 17:24 ? 00:00:00 [kthreadd]
root 3 2 3 0 1 17:24 ? 00:00:00 [ksoftirqd/0]
root 4 2 4 0 1 17:24 ? 00:00:00 [events/0]
root 5 2 5 0 1 17:24 ? 00:00:00 [khelper]
root 21 2 21 0 1 17:24 ? 00:00:00 [kblockd/0]
root 22 2 22 0 1 17:24 ? 00:00:00 [kseriod]
root 41 2 41 0 1 17:24 ? 00:00:00 [pdflush]
root 42 2 42 0 1 17:24 ? 00:00:00 [pdflush]
root 43 2 43 0 1 17:24 ? 00:00:00 [kswapd0]
root 44 2 44 0 1 17:24 ? 00:00:00 [aio/0]
root 727 2 727 0 1 17:24 ? 00:00:00 [kjournald]
LWP is the thread ID. (See man ps: "-L Show threads, possibly with LWP and NLWP columns" … "LWP lwp (light weight process, or thread) ID of the lwp being reported. (alias spid, tid)")
kthreadd is the kernel thread daemon; I believe it is responsible for spawning all the other kernel threads. Note that I am not showing daemons like klogd, which (as far as I know) do not execute in ring 0.
The number of kernel threads != the number of CPU cores (see the title of the question).
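To make the kthreadd relationship concrete, here is a minimal sketch of a Linux kernel module that starts one kernel thread with kthread_run (this is illustrative, not taken from the system above; names like demo_thread_fn are mine). The new task is actually forked by kthreadd on the module's behalf, so it shows up in ps -eLf with PID 2 as its parent, just like the bracketed entries in the listing.

/* demo_kthread.c - hedged sketch: a minimal kernel module that starts one
 * kernel thread; function and variable names are illustrative only. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *demo_task;

/* The thread body runs entirely in kernel mode, like the [bracketed]
 * entries in the ps listing above. */
static int demo_thread_fn(void *data)
{
    while (!kthread_should_stop()) {        /* exit cleanly when asked to */
        pr_info("demo_kthread: still running\n");
        msleep(1000);
    }
    return 0;
}

static int __init demo_init(void)
{
    /* kthread_run() = kthread_create() + wake_up_process(). */
    demo_task = kthread_run(demo_thread_fn, NULL, "demo_kthread");
    return IS_ERR(demo_task) ? PTR_ERR(demo_task) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(demo_task);     /* blocks until demo_thread_fn returns */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");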
Kernel threads consist of a set of registers, a stack, and a few corresponding kernel data structures.
…
The purported advantage of kernel threads over processes is faster creation and context switching compared with processes.
…
Kernel threads are considered “lightweight,” and one would expect the number of threads to only be limited by address space and processor time
…
In particular, operating system kernels tend to see kernel threads as a special kind of process rather than a unique entity. For example, in the Solaris kernel threads are called “light weight processes” (LWP’s). Linux actually creates kernel threads using a special variation of fork called “clone,” and until recently gave each thread a separate process ID. Because of this heritage, in practice kernel threads tend to be closer in memory and time cost to processes than user-level threads …
(Multiple Flows of Control in Migratable Parallel Programs, 2006)
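To make the "clone" remark above concrete, here is a hedged sketch (mine, not from the quoted paper) that creates a thread directly with the glibc clone(2) wrapper, sharing the address space with its parent roughly the way a thread library does:

/* clone_demo.c - illustrative sketch of creating a thread with clone(2).
 * Build with: gcc clone_demo.c -o clone_demo */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared_counter = 0;      /* visible to both tasks via CLONE_VM */

static int thread_fn(void *arg)
{
    (void)arg;
    shared_counter = 42;            /* writes the parent's memory, not a copy */
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return 1;

    /* The stack grows downward on x86, so pass the top of the allocation.
     * CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND is roughly what a
     * thread library asks for; without CLONE_THREAD the child still gets
     * its own PID, matching the "separate process ID" heritage above. */
    int pid = clone(thread_fn, stack + stack_size,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                    NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    printf("parent sees shared_counter = %d\n", shared_counter);  /* 42 */
    free(stack);
    return 0;
}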
I can't definitively answer the "kernel threads" question for Linux. For Windows, I can tell you that the "kernel threads" are simply threads created from some other kernel mode routine, running procedures that never enter user mode. When the scheduler picks a thread for execution it resumes its previous state (user or kernel, whatever that was); the CPU doesn't need to "tell the difference". The thread executes in kernel mode because that's what it was doing the last time it was executing.
In Windows these are typically created with the so-called "System" process as their parent, but they can actually be created in any process. (So, in Unix, can they have a parent ID of zero, i.e. belong to no process?) Which process hosts them doesn't actually matter unless the thread tries to use process-level resources.
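For the Windows side, here is a hedged sketch of what that creation looks like (it assumes a WDK driver build environment and is illustrative only, not taken from any particular source): a driver starts such a thread with PsCreateSystemThread, and passing NULL for the process handle is what places it in the System process.

/* demo_systhread.c - hedged sketch of a minimal Windows driver that starts
 * a system thread; names are illustrative and this needs the WDK to build. */
#include <ntddk.h>

static HANDLE g_thread_handle;
static volatile BOOLEAN g_stop;

/* Runs entirely in kernel mode and never enters user mode. */
static VOID demo_thread_routine(PVOID context)
{
    UNREFERENCED_PARAMETER(context);
    while (!g_stop) {
        LARGE_INTEGER delay;
        delay.QuadPart = -10 * 1000 * 1000;   /* 1 second, relative time */
        KeDelayExecutionThread(KernelMode, FALSE, &delay);
    }
    PsTerminateSystemThread(STATUS_SUCCESS);
}

static VOID demo_unload(PDRIVER_OBJECT driver)
{
    UNREFERENCED_PARAMETER(driver);
    g_stop = TRUE;            /* a real driver would also wait for the
                                 thread to exit before unloading */
    ZwClose(g_thread_handle);
}

NTSTATUS DriverEntry(PDRIVER_OBJECT driver, PUNICODE_STRING registry_path)
{
    UNREFERENCED_PARAMETER(registry_path);
    driver->DriverUnload = demo_unload;

    /* NULL process handle => the thread is created in the System process;
     * a real process handle would host it in that process instead. */
    return PsCreateSystemThread(&g_thread_handle, THREAD_ALL_ACCESS,
                                NULL, NULL, NULL,
                                demo_thread_routine, NULL);
}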
As for the addresses assigned by the compiler... There are a couple of possible ways to think about this. One part of it is that the compiler really doesn't pick addresses for much of anything; almost everything a compiler produces (in a modern environment) is in terms of offsets. A given local variable is at some offset from wherever the stack pointer will be when the routine is instantiated. (Note that stacks themselves are at dynamically assigned addresses, just like heap allocations are.) A routine entry point is at some offset from the start of the code section it's in. Etc.
The second part of the answer is that addresses, such as they are, are assigned by the linker, not the compiler. Which really just defers the question - how can it do this? By which I guess you mean, how does it know what addresses will be available at runtime? The answer is "practically all of them."
Remember that every process starts out as an almost completely blank slate, with a new instantiation of user-mode address space. For example, every process has its own instance of address 0x10000. So aside from having to avoid a few things that are at well-known (to the linker, anyway) locations within each process on the platform, the linker is free to put things where it wants them within the process address space. It doesn't have to know or care where anything else already is.
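A hedged sketch of that point, assuming an ordinary Linux/glibc build (the variable names are mine): after a fork, parent and child print the very same virtual address for a global, yet a write through that address only changes the writer's own copy.

/* same_address.c - sketch: the same virtual address names different storage
 * in different processes. Build with: gcc same_address.c -o same_address */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int looks_shared = 1;     /* one symbol, so one virtual address in the image */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                       /* child process */
        looks_shared = 99;                /* touches only the child's copy */
        printf("child : &looks_shared = %p, value = %d\n",
               (void *)&looks_shared, looks_shared);
        return 0;
    }
    waitpid(pid, NULL, 0);
    /* Prints the same %p the child printed, but the value is still 1. */
    printf("parent: &looks_shared = %p, value = %d\n",
           (void *)&looks_shared, looks_shared);
    return 0;
}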
The third part is that nearly everything (except those OS-defined things that are at well-known addresses) can be moved to different addresses at run time, due to Address Space Layout Randomization, which exists on both Windows and Linux (Linux released it first, in fact). So it doesn't actually matter where the linker put things.
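And a small sketch of the ASLR point, assuming a position-independent build with randomization enabled (the default on most modern Linux distributions): run the same binary twice and the code, stack, and heap addresses all move.

/* aslr_demo.c - sketch: print a few addresses, then run the program twice.
 * With ASLR and a PIE build the values differ from run to run.
 * Build with: gcc aslr_demo.c -o aslr_demo */
#include <stdio.h>
#include <stdlib.h>

int a_global = 0;

int main(void)
{
    int a_local = 0;
    void *a_heap_block = malloc(16);

    printf("code  (main)     : %p\n", (void *)main);
    printf("data  (a_global) : %p\n", (void *)&a_global);
    printf("stack (a_local)  : %p\n", (void *)&a_local);
    printf("heap  (malloc)   : %p\n", a_heap_block);

    free(a_heap_block);
    return 0;
}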
Best Answer
Switching between kernel-level threads requires a full context switch, which involves changing a large set of processor registers that define the current memory map and permissions. It also evicts some or all of the processor cache.
User-level threads just require a small amount of bookkeeping within one kernel thread or process.
However, the difference isn't big if your threads are predominantly doing I/O operations, as those have to go through the kernel in any case. It matters most if you're trying to implement some kind of simulation with a very large number of independent processes. In that case you need to pay careful attention to which thread synchronisation mechanisms you use, as some of them also go up to the kernel and trigger a context switch.
http://www.cs.rochester.edu/u/cli/research/switch.pdf "In general, the indirect cost of context switch ranges from several microseconds to more than one thousand microseconds for our workload."
Edit: user-level threads maintain a stack per thread, and may or may not save the general-purpose registers, depending on the architecture and the clobber rules of the calling convention. A switch can be as simple as dumping a few registers to the stack, jumping to a new address, and popping a few registers; all of that may still be in the cache if the target thread ran recently.
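As a hedged sketch of that kind of purely user-level switch (using POSIX ucontext rather than hand-written assembly; the names are mine), two contexts ping-pong inside a single kernel thread. Note that swapcontext also saves and restores the signal mask, which does cost a system call; real user-level thread libraries usually switch only the stack pointer and a handful of callee-saved registers.

/* uswitch.c - sketch of a user-level "thread" switch with POSIX ucontext.
 * Build with: gcc uswitch.c -o uswitch */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];          /* the worker's private stack */

static void worker(void)
{
    for (int i = 0; i < 3; i++) {
        printf("worker: step %d\n", i);
        swapcontext(&worker_ctx, &main_ctx);  /* save self, resume main */
    }
}

int main(void)
{
    getcontext(&worker_ctx);                  /* initialize, then customize */
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link = &main_ctx;           /* resume main if worker returns */
    makecontext(&worker_ctx, worker, 0);

    for (int i = 0; i < 3; i++) {
        printf("main  : resuming worker\n");
        swapcontext(&main_ctx, &worker_ctx);  /* save main, run worker */
    }
    return 0;
}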
Kernel-level context switches also change the memory map (by reloading the page-table base register, which typically flushes or invalidates TLB entries) and involve crossing between privilege levels (the processor "rings"). See "Performance Considerations"