Linux – First process in a new Linux user namespace needs to call setuid()

linux, namespace, uid, users

I'm learning about Linux user namespaces and I'm observing a strange behavior which isn't completely clear to me.

I've created a range of UIDs in the initial user namespace to which I can map UIDs from a child user namespace via the newuidmap command. These are my settings:

$ grep '^woky:' /etc/subuid
woky:200000:10000
$ id -u
1000

Then I tried to create a new user namespace and map its UID range [0-10000) to [200000-210000) in the parent user namespace:

First terminal:

$ PS1='% ' unshare -U bash
% echo $$
1337
% id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

Second terminal:

$ ps -p 1337 -o uid
  UID
 1000
$ newuidmap 1337 0 200000 10000
$ ps -p 1337 -o uid
  UID
 1000

First terminal:

% id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

So the UIDs inside and outside the new user namespace did not change, even though newuidmap completed successfully.

Then I found the following article, http://www.itinken.com/blog/2016/Sep/exploring-unprivileged-containers/, which opened my eyes a bit. I repeated the previous scenario, but instead of the unshare command I used the following test-unshare.py script, which I took from the article and slightly modified:

#!/usr/bin/python3
import os
from cffi import FFI
CLONE_NEWUSER = 0x10000000
ffi = FFI()
ffi.cdef('int unshare(int flags);')
libc = ffi.dlopen(None)

libc.unshare(CLONE_NEWUSER)
print("user id = %d, process id = %d" % (os.getuid(), os.getpid()))
input("Press Enter to continue...")
# The uid must be set to 0 to avoid losing capabilities when creating the shell.
os.setuid(0)
os.execlp('/bin/bash', 'bash')

First terminal:

$ python3 ./test-unshare.py
user id = 65534, process id = 1337
Press Enter to continue...

Second terminal:

$ ps -p 1337 -o uid
  UID
 1000
$ newuidmap 1337 0 200000 10000
$ ps -p 1337 -o uid
  UID
 1000

First terminal:

<Enter>
bash: /home/woky/.bashrc: Permission denied
bash-4.4# id
uid=0(root) gid=65534(nobody) groups=65534(nobody)

Second terminal:
```
$ ps -p 1337 -o uid
  UID
200000
```

Now it looks like what I expected from the beginning. My theory about why the UIDs in the first example didn't change is the following:

unshare called execve(2) to run /bin/bash without first calling setuid(2). The shell then lost all its capabilities (as mentioned in user_namespaces(7)) and could not change its UID from 65534. In the second case, the process changed its UID to 0, because it had the capabilities to do so, and Linux mapped it to 200000 outside the new user namespace (according to /proc/1337/uid_map, which newuidmap wrote). This would mean that the first process in a new user namespace has to call setuid(START_UID), or otherwise it would be stuck at 65534 after execve(2).

Is it correct?

The article says the following about my first example (which is equivalent to the Python code in its first example):

If you just try this out, you'll probably find that it doesn't really quite work, this is because the uid map has the be set before the shell is executed.

But I cannot conclude this from the man pages, nor do the man pages explicitly state that setuid(2) needs to be called in the first process in a new user namespace.

However, in this scenario, the process in the new user namespace didn't have to call setuid(2) and yet its UID changed:

First terminal:

$ PS1='% ' unshare -U bash
% echo $$
1337
% id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

Second terminal:

$ ps -p 1337 -o uid
  UID
 1000
$ echo '500000 1000 1' >/proc/1337/uid_map
$ ps -p 1337 -o uid
  UID
 1000

First terminal:

% id
uid=500000 gid=65534(nobody) groups=65534(nobody)

Please explain all three situations in depth.

My journey started when I tried to understand what the /etc/subuid file is for. It's used by Docker and LXC, but only a few documents explain it. Sorry for the verbosity. It took me a really long time to comprehend it and I still don't understand it fully, so I'm collecting here all I know.

BONUS: Explain /etc/subuid, its relation to user namespaces, why it is required for Docker and LXC, and why it is a generic interface on Linux distributions. The man page is brief, and articles on the Internet mostly document how to make something in LXC/Docker work. (The actual explanation is in newuidmap(1).)

Best Answer

See this answer: https://unix.stackexchange.com/a/110245/301641

As it explains, a UID that is not mapped in the current user namespace is shown as nobody (the overflow UID, 65534).

Situation 1: Initially the process is 65534 (nobody) inside and 1000 (woky) outside. After newuidmap, its UID is still unmapped: only the outside range [200000-210000) is mapped, and the process's outside UID 1000 is not in that range. So nothing changes.

Situation 2: The starting point is the same. After newuidmap, outside UID 1000 is still unmapped for the same reason. But you call setuid(0) right after the map is written (note that setuid can never succeed before uid_map is written), and inside UID 0 is in the map. So the UID changes from the overflow value to a normally mapped value (200000 outside, 0 inside).

Situation 3: The starting point is the same. This time the map you write covers outside UID 1000, so the UID changes from the overflow value to a normally mapped value (1000 outside, 500000 inside) without any setuid call.
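The three situations can be modeled with a small Python sketch. It is only an illustration of the translation rule described above, not kernel code; the names `to_inside`, `to_outside`, `newuidmap_map` and `manual_map` are made up for this example, and 65534 is assumed to be the value of /proc/sys/kernel/overflowuid.

```python
OVERFLOW_UID = 65534  # assumed default of /proc/sys/kernel/overflowuid

def to_inside(outside_uid, uid_map):
    """Translate a parent-namespace UID into the child namespace.

    uid_map is a list of (inside_start, outside_start, length) tuples,
    in the same order as the columns of /proc/PID/uid_map.
    """
    for inside_start, outside_start, length in uid_map:
        if outside_start <= outside_uid < outside_start + length:
            return inside_start + (outside_uid - outside_start)
    return OVERFLOW_UID  # unmapped: reported as "nobody"

def to_outside(inside_uid, uid_map):
    """Translate a child-namespace UID into the parent namespace.

    Returns None when the UID is unmapped (setuid() to it would fail).
    """
    for inside_start, outside_start, length in uid_map:
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None

newuidmap_map = [(0, 200000, 10000)]  # newuidmap 1337 0 200000 10000
manual_map = [(500000, 1000, 1)]      # echo '500000 1000 1' > uid_map

if __name__ == "__main__":
    print(to_inside(1000, newuidmap_map))  # situation 1: 65534
    print(to_outside(0, newuidmap_map))    # situation 2: 200000
    print(to_inside(1000, manual_map))     # situation 3: 500000
```

In situations 1 and 2 the process keeps outside UID 1000 until it calls setuid, so only the explicit setuid(0) moves it into the mapped range.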

Related Solutions

Linux – LXC: Any security difference between root and end-user owned unprivileged containers

In other words: may root-owned unprivileged containers be "less unprivileged" than ones owned by standard accounts?

I don't think so. What matters is what's in /proc/$PID/uid_map of the processes in the container's user namespace, not what's in /etc/subuid. Suppose you execute the following from the initial user namespace (that is, not from the container) for the $PID of a process running in the container:

$ cat /proc/$PID/uid_map
0 200000 1000

This means that UID range [0-1000) of the process $PID will be mapped to UID range [200000-201000) outside of its user namespace (of the container). UIDs outside of the [200000-201000) range will be mapped to 65534 ($(cat /proc/sys/kernel/overflowuid)) in the container. This can happen for instance if you don't create a new PID namespace. In that case, the process in the container would see processes outside, but their UID would be 65534.
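As a rough illustration of how to read such a file, the following Python sketch parses uid_map text and prints the half-open ranges it describes. The helper names `parse_uid_map` and `describe` are hypothetical, and the sample string simply mirrors the output shown above.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(tuple(int(field) for field in fields))
    return entries

def describe(entries):
    """Render each entry as 'inside range -> outside range'."""
    return [
        f"[{inside}-{inside + length}) -> [{outside}-{outside + length})"
        for inside, outside, length in entries
    ]

if __name__ == "__main__":
    # Sample text in the format of /proc/$PID/uid_map
    sample = "         0     200000       1000\n"
    for line in describe(parse_uid_map(sample)):
        print(line)
```

On a real system the text would come from reading /proc/$PID/uid_map for a process in the container.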

So with proper UID mapping, even if the container is started by root, its processes will have unprivileged UIDs outside of it.

Subordinate UIDs in /etc/subuid are not in any way linked to a single UID outside. The purpose of this file is to allow unprivileged users to start containers which use more than one UID (which is the case for most Linux operating systems). By default, an unprivileged user can only map their own UID. That is, if your UID is 1000 and $PID refers to a process in the container, you can only do

echo "$N 1000 1" >/proc/$PID/uid_map

for any $N as an unprivileged user. Everything else is not permitted. If you could map a longer range, i.e.

echo "$N 1000 50" >/proc/$PID/uid_map

you would gain access to UIDs [1000-1050) outside of the container through the container. And of course, if you could change the start of the outer UID range, you'd have an easy way to get root. So /etc/subuid defines the outer ranges which you are allowed to use. This file is used by newuidmap, which is setuid root.

$ cat /etc/subuid
woky:200000:50
$ echo '0 200000 50' >/proc/$PID/uid_map
-bash: echo: write error: Operation not permitted
$ newuidmap $PID 0 200000 50
$ # success
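The range check can be sketched in Python as follows. This is a simplified model of the idea, under the assumption that a requested outer range must fit entirely inside one subordinate range of the calling user; the real newuidmap performs additional checks, and `parse_subuid` and `range_allowed` are names invented for this example.

```python
def parse_subuid(text):
    """Parse /etc/subuid-style lines 'name:start:count' into tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, start, count = line.split(":")
        entries.append((name, int(start), int(count)))
    return entries

def range_allowed(user, outside_start, length, entries):
    """True if [outside_start, outside_start + length) fits in one of user's ranges."""
    return any(
        name == user
        and start <= outside_start
        and outside_start + length <= start + count
        for name, start, count in entries
    )

if __name__ == "__main__":
    entries = parse_subuid("woky:200000:50\n")
    print(range_allowed("woky", 200000, 50, entries))  # True: exactly the granted range
    print(range_allowed("woky", 200000, 51, entries))  # False: one UID too many
    print(range_allowed("woky", 0, 1, entries))        # False: outer root is not granted
```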

The details are much more complicated and I'm probably not the proper person to explain them, but I guess this is better than having no answer. :-) You might want to check the man pages user_namespaces(7) and newuidmap(1), and my own research in "First process in a new Linux user namespace needs to call setuid()?". Unfortunately, I'm not entirely sure how LXC uses this file.
