Linux – First process in a new Linux user namespace needs to call setuid()

linux namespace uid users

I'm learning about Linux user namespaces and I'm observing a strange behavior which isn't completely clear to me.

I've created a range of UIDs in the initial user namespace to which I can map UIDs from a child user namespace via the newuidmap command. These are my settings:

$ grep '^woky:' /etc/subuid
woky:200000:10000
$ id -u
1000
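
The /etc/subuid entry above has the form name:start:count. As a rough sketch (a simplified model, not the actual logic of newuidmap, whose checks are more involved), the delegated range can be parsed and a requested outside range tested against it. The function names parse_subuid and range_is_delegated are made up for this illustration.

```python
def parse_subuid(text):
    """Return {name: [(start, count), ...]} from /etc/subuid-style text."""
    ranges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, start, count = line.split(":")
        ranges.setdefault(name, []).append((int(start), int(count)))
    return ranges


def range_is_delegated(ranges, name, start, count):
    """True if [start, start + count) lies entirely inside one range delegated to name.

    Assumption: this models only the basic containment check.
    """
    return count > 0 and any(
        s <= start and start + count <= s + c
        for s, c in ranges.get(name, [])
    )


ranges = parse_subuid("woky:200000:10000\n")
print(range_is_delegated(ranges, "woky", 200000, 10000))  # True
print(range_is_delegated(ranges, "woky", 1000, 1))        # False
```

Under this model, the range [200000, 210000) is delegated to woky, while woky's own UID 1000 is not part of it.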

Then I tried to create a new user namespace and map its UID range [0, 10000) to [200000, 210000) in the parent user namespace:

  • First terminal:

    $ PS1='% ' unshare -U bash
    % echo $$
    1337
    % id
    uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
    
  • Second terminal:

    $ ps -p 1337 -o uid
      UID
     1000
    $ newuidmap 1337 0 200000 10000
    $ ps -p 1337 -o uid
      UID
     1000
    
  • First terminal:

    % id
    uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
    

So the UIDs inside and outside the new user namespace did not change, even though newuidmap completed successfully.

Then I found the following article, http://www.itinken.com/blog/2016/Sep/exploring-unprivileged-containers/, which opened my eyes a bit. I repeated the previous scenario, but instead of the unshare command I used the following test-unshare.py script, which I took from the article and slightly modified:

#!/usr/bin/python3
import os
from cffi import FFI
CLONE_NEWUSER = 0x10000000
ffi = FFI()
ffi.cdef('int unshare(int flags);')
libc = ffi.dlopen(None)

libc.unshare(CLONE_NEWUSER)
print("user id = %d, process id = %d" % (os.getuid(), os.getpid()))
input("Press Enter to continue...")
# The uid must be set to 0 to avoid losing capabilities when creating the shell.
os.setuid(0)
os.execlp('/bin/bash', 'bash')

  • First terminal:

    $ python3 ./test-unshare.py
    user id = 65534, process id = 1337
    Press Enter to continue...
    
  • Second terminal:

    $ ps -p 1337 -o uid
      UID
     1000
    $ newuidmap 1337 0 200000 10000
    $ ps -p 1337 -o uid
      UID
     1000
    
  • First terminal:

    <Enter>
    bash: /home/woky/.bashrc: Permission denied
    bash-4.4# id
    uid=0(root) gid=65534(nobody) groups=65534(nobody)
    
  • Second terminal:

    $ ps -p 1337 -o uid
      UID
    200000
    

Now it looks like what I expected from the beginning. My theory about why the UIDs in the first example didn't change is the following:

The unshare command called execve(2) to run /bin/bash without first calling setuid(2). The shell therefore lost all its capabilities (as mentioned in user_namespaces(7)) and cannot change its UID from 65534. In the second case, the process changed its UID to 0, because it had the capabilities to do so, and Linux mapped it to 200000 outside the new user namespace (according to /proc/1337/uid_map, which newuidmap wrote). This would mean that the first process in a new user namespace has to call setuid(START_UID), or otherwise it is stuck at 65534 after execve(2).

Is it correct?

The article says the following about my first example (which is equivalent to the Python code in the article's first example):

If you just try this out, you'll probably find that it doesn't really quite work, this is because the uid map has the be set before the shell is executed.

But I cannot conclude this from the information in the man pages, nor do the man pages explicitly state that setuid(2) needs to be called in the first process in a new user namespace.
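
One way to probe the capability part of this theory empirically is to look at the CapEff line in /proc/<pid>/status and test the CAP_SETUID bit. The following is a minimal sketch; the helper names are made up for this illustration, and the two sample status texts are hypothetical values, not output captured from the sessions above.

```python
CAP_SETUID = 7  # bit number of CAP_SETUID in <linux/capability.h>


def effective_caps(status_text):
    """Return the CapEff bitmask from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(status_text, cap_bit):
    """True if the given capability bit is set in the effective set."""
    return bool((effective_caps(status_text) >> cap_bit) & 1)


def read_status(pid="self"):
    """Read the status file of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return f.read()


# Hypothetical sample texts, for illustration only.
sample_with_caps = "Name:\tpython3\nCapEff:\t000001ffffffffff\n"
sample_without_caps = "Name:\tbash\nCapEff:\t0000000000000000\n"

print(has_cap(sample_with_caps, CAP_SETUID))     # True
print(has_cap(sample_without_caps, CAP_SETUID))  # False
```

On a live system, has_cap(read_status(1337), CAP_SETUID) could be run from the second terminal for the shell started by unshare -U and for the Python script while it waits for Enter.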

However, in this scenario, the process in the new user namespace didn't have to call setuid(2) and yet its UID changed:

  • First terminal:

    $ PS1='% ' unshare -U bash
    % echo $$
    1337
    % id
    uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
    
  • Second terminal:

    $ ps -p 1337 -o uid
      UID
     1000
    $ echo '500000 1000 1' >/proc/1337/uid_map
    $ ps -p 1337 -o uid
      UID
     1000
    
  • First terminal:

    % id
    uid=500000 gid=65534(nobody) groups=65534(nobody)
    

Please, explain all situations in depth.

My journey started when I tried to understand what the /etc/subuid file is for. It's used by Docker and LXC, but only a few documents explain it. Sorry for the verbosity. It took me a really long time to comprehend it and I still don't understand it fully, so I'm collecting here all I know.

BONUS: Explain /etc/subuid, its relation to user namespaces, why it is required for Docker and LXC, and why it is a generic interface on Linux distributions. The man page is brief, and articles on the Internet mostly document how to make something in LXC/Docker work. (The actual explanation is in newuidmap(1).)

Best Answer

See https://unix.stackexchange.com/a/110245/301641

As that answer explains, a UID shows up as nobody (65534) when it is not mapped in the namespace from which it is viewed.

Situation 1: The process starts out as 65534 (nobody) inside and 1000 (woky) outside. newuidmap maps only the outside UIDs [200000, 210000), but the process runs as outside UID 1000, which is out of that range, so it still has no mapping. Nothing changes.

Situation 2: The process again starts out as 65534 (nobody) inside and 1000 (woky) outside, and after newuidmap its outside UID 1000 is still unmapped, because only the outside UIDs [200000, 210000) are mapped. But this time you call setuid(0) right after the map is written (note that setuid can never succeed before uid_map is written). Inside UID 0 is in the map, so the UID changes from the overflow value to a normally mapped value (outside 200000, inside 0).

Situation 3: The process again starts out as 65534 (nobody) inside and 1000 (woky) outside, but the line written to uid_map covers outside UID 1000. The process's UID therefore changes from the overflow value to a normally mapped value (outside 1000, inside 500000) without any setuid call.
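
The three situations can be reproduced with a small model of the uid_map lookup. This is only an illustrative sketch of the translation rule described above, not kernel code; the function names are made up, and 65534 is assumed to be the overflow UID, as in the transcripts.

```python
OVERFLOW_UID = 65534  # value shown as "nobody" in the transcripts


def parse_uid_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def uid_seen_inside(outside_uid, entries):
    """UID reported inside the namespace for a process with the given outside UID."""
    for inside, outside, count in entries:
        if outside <= outside_uid < outside + count:
            return inside + (outside_uid - outside)
    return OVERFLOW_UID  # unmapped: shown as the overflow UID


def uid_seen_outside(inside_uid, entries):
    """Outside UID for a mapped inside UID, or None if it is unmapped."""
    for inside, outside, count in entries:
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    return None


newuidmap_map = parse_uid_map("0 200000 10000")  # situations 1 and 2
manual_map = parse_uid_map("500000 1000 1")      # situation 3

print(uid_seen_inside(1000, newuidmap_map))  # situation 1: 65534
print(uid_seen_outside(0, newuidmap_map))    # situation 2, after setuid(0): 200000
print(uid_seen_inside(1000, manual_map))     # situation 3: 500000
```

In situation 1 the outside UID 1000 finds no matching entry and falls back to the overflow value; in situation 2 the process moves itself to inside UID 0, which has an entry; in situation 3 the entry is written so that it covers the UID the process already has.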
