Linux – How does the ELF loader determine the initial stack size

elf, linux-kernel, memory, stack, x86

I'm studying the ELF specification (http://www.skyfree.org/linux/references/ELF_Format.pdf), and one point that is not clear to me about the program loading process is how the stack is initialized, and how large the initial stack mapping is. Here's the test (on Ubuntu x86-64):

$ cat test.s
.text
  .global _start
_start:
  mov $0x3c,%eax
  mov $0,%edi
  syscall
$ as test.s -o test.o && ld test.o
$ gdb a.out -q
Reading symbols from a.out...(no debugging symbols found)...done.
(gdb) b _start
Breakpoint 1 at 0x400078
(gdb) run
Starting program: ~/a.out 

Breakpoint 1, 0x0000000000400078 in _start ()
(gdb) print $sp
$1 = (void *) 0x7fffffffdf00
(gdb) info proc map
process 20062
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000        0x0 ~/a.out
      0x7ffff7ffa000     0x7ffff7ffd000     0x3000        0x0 [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0 [vdso]
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0 [vsyscall]

The ELF specification has very little to say about how or why this stack page exists in the first place, but I can find references that say that the stack should be initialized with SP pointing to argc, with argv, envp and the auxiliary vector just above that, and I have confirmed this. But how much space is available below SP? On my system there are 0x1FF00 bytes mapped below SP, but presumably this is counting down from the top of the stack at 0x7ffffffff000, and there are 0x21000 bytes in the full mapping. What influences this number?

I am aware that the page just below the stack is a "guard page" that automatically becomes writable and "grows down the stack" if I write to it (presumably so that naive stack handling "just works"), but if I allocate a huge stack frame then I could overshoot the guard page and segfault, so I want to determine how much space is already properly allocated to me right at process start.
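For reference, here is a small C sketch of the same measurement done from inside the process (an illustration I put together, not part of the assembly test above): it prints the [stack] line from /proc/self/maps and the address of a local variable, which roughly stands in for the stack pointer. Note that by the time main() runs, the C runtime's startup code may already have grown the mapping slightly compared to what GDB shows at _start.

/* stackmap.c - illustrative sketch only: print the [stack] mapping from
 * /proc/self/maps and compare it with the address of a local variable,
 * which approximates the stack pointer at this point in the program. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    int local;                  /* its address approximates the current SP */
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[256];

    if (!maps)
        return 1;

    while (fgets(line, sizeof line, maps)) {
        if (strstr(line, "[stack]")) {
            unsigned long start, end;
            sscanf(line, "%lx-%lx", &start, &end);
            printf("stack mapping: %#lx-%#lx (%lu KiB)\n",
                   start, end, (end - start) / 1024);
            printf("local variable at %p, %#lx bytes above the mapping base\n",
                   (void *)&local, (unsigned long)&local - start);
        }
    }
    fclose(maps);
    return 0;
}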

EDIT: Some more data makes me even more unsure what's going on. The test is the following:

.text
  .global _start
_start:
  subq $0x7fe000,%rsp
  movq $1,(%rsp)
  mov $0x3c,%eax
  mov $0,%edi
  syscall

I played with different values of the constant 0x7fe000 here to see what happens, and for this value it is nondeterministic whether I get a segfault or not. According to GDB, the subq instruction on its own expands the size of the stack mapping, which is mysterious to me (how does Linux know what's in my register?), but this program will usually crash GDB on exit for some reason. It can't be ASLR causing the nondeterminism, because I'm not using a GOT or any PLT section, and the executable is loaded at the same virtual addresses every time. So is this some randomness from the PID or physical memory bleeding through? All in all I'm very confused about how much stack is actually legally available for random access, and how much is requested by changing RSP or by writing to areas "just out of range" of legal memory.

Best Answer

I don't believe this question is really about ELF. As far as I know, ELF defines a way to "flat pack" a program image into a file and then reassemble it ready for first execution. The definition of what the stack is and how it's implemented sits somewhere between CPU-specific and OS-specific, where the OS behaviour hasn't been elevated to POSIX. Though no doubt the ELF specification makes some demands about what it needs on the stack.

Minimum stack allocation

From your question:

I am aware that the page just below the stack is a "guard page" that automatically becomes writable and "grows down the stack" if I write to it (presumably so that naive stack handling "just works"), but if I allocate a huge stack frame then I could overshoot the guard page and segfault, so I want to determine how much space is already properly allocated to me right at process start.

I'm struggling to find an authoritative reference for this, but I have found enough non-authoritative references to suggest it is incorrect.

From what I've read, the guard page is used to catch accesses outside the maximum stack allocation, not to drive "normal" stack growth. The actual memory allocation (mapping pages to addresses) is done on demand: when an unmapped address between stack-base and stack-base - max-stack-size + 1 is accessed, the CPU raises a page fault, and the kernel handles it by mapping in a page of memory rather than escalating to a segmentation fault.

So accessing the stack inside the maximum allocation shouldn't cause a segmentation fault, as you've discovered.
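If it helps, here is a rough C sketch of that behaviour (my own illustration, not taken from any documentation): it recurses with a roughly page-sized stack frame, touching each frame, until it approaches the soft RLIMIT_STACK. Each new frame faults in a page that the kernel maps on demand, well beyond the small mapping that exists at process start.

/* grow.c - illustrative sketch only: demonstrate on-demand stack growth by
 * recursing with a roughly page-sized frame until we approach the soft
 * RLIMIT_STACK.  Each call touches its frame, so the kernel has to map a
 * new page rather than deliver SIGSEGV. */
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

static long grow(long depth, long budget_bytes)
{
    char frame[4096];                           /* roughly one page per call */
    memset(frame, (int)(depth & 0xff), sizeof frame);   /* fault the page in */

    /* Count each call generously (array plus call overhead) and stop well
     * before the budget so the limit is never actually reached. */
    if ((depth + 1) * (long)(sizeof frame + 256) >= budget_bytes)
        return depth;
    return grow(depth + 1, budget_bytes);
}

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    /* Treat "unlimited" as 8 MiB for this demonstration, and leave headroom. */
    long limit = (rl.rlim_cur == RLIM_INFINITY) ? (8L << 20) : (long)rl.rlim_cur;
    long depth = grow(0, limit - (256L << 10));

    printf("soft RLIMIT_STACK = %ld bytes, recursed through about %ld KiB\n",
           limit, depth * 4096 / 1024);
    return 0;
}

Compile it with gcc -O0 grow.c so the compiler doesn't turn the tail call into a loop and reuse a single frame.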

Maximum stack allocation

The place to look is the Linux documentation on process creation and image loading (fork(2), clone(2), execve(2)). The documentation of execve(2) mentions something interesting:

Limits on size of arguments and environment

...snip...

On kernel 2.6.23 and later, most architectures support a size limit derived from the soft RLIMIT_STACK resource limit (see getrlimit(2))

...snip...

This confirms that the limit depends on architecture support and points to where it is configured (getrlimit(2)).

RLIMIT_STACK

This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).

Since Linux 2.6.23, this limit also determines the amount of space used for the process's command-line arguments and environment variables; for details, see execve(2).
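As an illustration of querying and adjusting that limit programmatically (my own sketch, not from the man page), getrlimit(2) and setrlimit(2) can be used like this; raising the soft limit to the hard limit lets the stack grow further before SIGSEGV is delivered:

/* rlim.c - illustrative sketch only: report RLIMIT_STACK and raise the soft
 * limit up to the hard limit. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu bytes, hard limit: %llu bytes\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;           /* raise the soft limit to the cap */
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        perror("setrlimit");

    return 0;
}

The same soft limit is what ulimit -s reports in the shell (in KiB).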

Growing the stack by changing the RSP register

I don't know x86 assembler, but I'll draw your attention to the "Stack Fault Exception", which x86 CPUs can raise when something is wrong with the stack segment referenced by the SS register. Please do correct me if I'm wrong, but I believe that on x86-64 SS:SP has effectively become RSP, so if I understand correctly a Stack Fault Exception could be triggered by decrementing RSP (subq $0x7fe000,%rsp).

See page 222 here: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce.html
