Linux – Bind Mounts of Device Nodes Break with EACCES in Root of tmpfs

bind-mount linux mount namespace

A common scenario for setting up a container/sandbox is wanting to create a minimal set of device nodes in a new tmpfs (rather than exposing the host /dev), and the only (unprivileged) way I know to do this is by bind-mounting the desired ones into it. The commands I'm using (inside unshare -mc --keep-caps) are:

mkdir /tmp/x
mount -t tmpfs none /tmp/x
touch /tmp/x/null
mount -o bind /dev/null /tmp/x/null

with the intent of moving the mount on top of /dev. However, even before doing the move, running echo > /tmp/x/null produces a "Permission denied" error (EACCES).

Yet if I additionally perform:

mkdir /tmp/x/y
touch /tmp/x/y/null
mount -o bind /dev/null /tmp/x/y/null
echo > /tmp/x/y/null

the write succeeds as it should. I've played around with this quite a bit, but can't find a root cause or reason this should be happening. It's possible to work around it by putting the bind-mounted nodes in a subdirectory and symlinks to them in the top-level of the filesystem that will become the new /dev, but it seems like this shouldn't be necessary.

What's going on? Is there a reasonable explanation for this? Or is it some access control logic gone wrong?

Best Answer

Well, this seems to be a very interesting effect, which is a consequence of three mechanisms combined together.

The first (trivial) point is that when you redirect output to a file, the shell opens the target with the O_CREAT flag so that the file is created if it does not yet exist.

The second thing to consider is the fact that /tmp/x is a tmpfs mountpoint, while /tmp/x/y is an ordinary directory. Given that you mount tmpfs with no options, the mountpoint's permissions automagically change so that it becomes world-writable and has a sticky bit (1777, which is a usual set of permissions for /tmp, so this feels like a sane default), while the permissions for /tmp/x/y are probably 0755 (depends on your umask).
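The difference between the two directories can be checked directly. This is a minimal sketch in POSIX sh, assuming GNU coreutils stat; the helper name is_sticky_world_writable is made up for illustration:

```shell
# Hypothetical helper: succeeds if DIR is both sticky (01000) and
# world-writable (0002), as a tmpfs mounted without mode= is described above.
is_sticky_world_writable() (
    # GNU stat prints the permission bits in octal, e.g. 1777 or 755.
    mode=$(stat -c '%a' -- "$1") || exit 2
    # The leading 0 makes the shell read the value as an octal constant.
    [ $(( 0$mode & 01002 )) -eq $(( 01002 )) ]
)
```

In the setup from the question, is_sticky_world_writable /tmp/x would be expected to succeed and is_sticky_world_writable /tmp/x/y to fail, assuming a typical umask such as 022.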

Finally, the third part of the puzzle is the way you set up the user namespace: you instruct unshare(1) to map the UID/GID of your host user to the same UID/GID in the new namespace. This is the only mapping in the new namespace, so translating any other UID between the parent and child namespaces yields the so-called overflow UID, which by default is 65534, the nobody user (see user_namespaces(7), section Unmapped user and group IDs). As a result, /dev/null (and its bind mounts) appears to be owned by nobody inside the child user namespace, since the host's root user has no mapping there:

$ ls -l /dev/null
crw-rw-rw- 1 nobody nobody 1, 3 Nov 25 21:54 /dev/null
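Whether a given host UID is covered by the mapping can be read from the uid_map file. The following is a rough sketch in POSIX sh with awk; the helper name host_uid_is_mapped and its optional second argument are assumptions for illustration:

```shell
# Hypothetical helper: succeeds if outside (parent-namespace) UID $1 is
# covered by a uid_map. $2 is the map file and defaults to the current
# process's map.
# Each map line is: <first ID inside> <first ID outside> <count>.
host_uid_is_mapped() {
    awk -v uid="$1" '
        uid >= $2 && uid < $2 + $3 { found = 1 }
        END { exit !found }
    ' "${2:-/proc/self/uid_map}"
}
```

With a single-line map such as 1000 1000 1 (what a user with UID 1000 would likely see here), host_uid_is_mapped 0 fails, so files owned by the host's root show up under the overflow UID, which can be read from /proc/sys/kernel/overflowuid.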

Combining all the facts together we come to the following: echo > /tmp/x/null tries to open an existing file with O_CREAT option, while this file resides inside the world-writable sticky directory and is owned by nobody, who is not the owner of the directory containing it.

Now, read openat(2) carefully, word by word:

EACCES

Where O_CREAT is specified, the protected_fifos or protected_regular sysctl is enabled, the file already exists and is a FIFO or regular file, the owner of the file is neither the current user nor the owner of the containing directory, and the containing directory is both world- or group-writable and sticky. For details, see the descriptions of /proc/sys/fs/protected_fifos and /proc/sys/fs/protected_regular in proc(5).

Isn't this brilliant? This seems almost like our case... except for the fact that the man page talks only about regular files and FIFOs and says nothing about device nodes.

Well, let's take a look at the code which actually implements this. We can see that, essentially, it first checks for exceptional cases which must succeed (the first if), and then it just denies the access for any other case if the sticky directory is world-writable (the second if, first condition):

static int may_create_in_sticky(umode_t dir_mode, kuid_t dir_uid,
        struct inode * const inode)
{
  if ((!sysctl_protected_fifos && S_ISFIFO(inode->i_mode)) ||
      (!sysctl_protected_regular && S_ISREG(inode->i_mode)) ||
      likely(!(dir_mode & S_ISVTX)) ||
      uid_eq(inode->i_uid, dir_uid) ||
      uid_eq(current_fsuid(), inode->i_uid))
    return 0;

  if (likely(dir_mode & 0002) ||
      (dir_mode & 0020 &&
       ((sysctl_protected_fifos >= 2 && S_ISFIFO(inode->i_mode)) ||
        (sysctl_protected_regular >= 2 && S_ISREG(inode->i_mode))))) {
    const char *operation = S_ISFIFO(inode->i_mode) ?
          "sticky_create_fifo" :
          "sticky_create_regular";
    audit_log_path_denied(AUDIT_ANOM_CREAT, operation);
    return -EACCES;
  }
  return 0;
}

So, if the target file is a char device (not a regular file or a FIFO), the kernel still denies opening it with O_CREAT when this file is in the world-writable sticky directory.
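To make the two branches easier to follow, here is a userspace model of the quoted function in POSIX sh. It is only a sketch for reasoning about the logic: it mirrors the snippet above rather than any particular kernel version, and the argument order and the file-type labels (fifo, reg, chr) are invented for this example:

```shell
# Model of the quoted check, for reasoning only (this is not kernel code).
# Args: dir_mode (octal, e.g. 01777) dir_uid file_uid fsuid
#       file_type (fifo|reg|chr|...) protected_fifos protected_regular
# Prints "allow" or "deny".
may_create_in_sticky() (
    dir_mode=$1 dir_uid=$2 file_uid=$3 fsuid=$4 ftype=$5 pf=$6 pr=$7

    # First "if": the exceptional cases that always pass.
    if { [ "$pf" -eq 0 ] && [ "$ftype" = fifo ]; } ||
       { [ "$pr" -eq 0 ] && [ "$ftype" = reg ]; } ||
       [ $(( $dir_mode & 01000 )) -eq 0 ] ||
       [ "$file_uid" -eq "$dir_uid" ] ||
       [ "$fsuid" -eq "$file_uid" ]; then
        echo allow
        exit 0
    fi

    # Second "if", first condition: a world-writable sticky directory
    # denies everything that got this far, whatever the file type.
    if [ $(( $dir_mode & 0002 )) -ne 0 ]; then
        echo deny
        exit 0
    fi

    # Second "if", second condition: group-writable sticky directory,
    # only for FIFOs/regular files with the sysctl set to 2 or higher.
    if [ $(( $dir_mode & 0020 )) -ne 0 ] &&
       { { [ "$pf" -ge 2 ] && [ "$ftype" = fifo ]; } ||
         { [ "$pr" -ge 2 ] && [ "$ftype" = reg ]; }; }; then
        echo deny
        exit 0
    fi

    echo allow
)
```

For instance, may_create_in_sticky 01777 1000 65534 1000 chr 1 1 models the bind-mounted /tmp/x/null (directory owned by UID 1000, file owned by nobody) and prints deny, while the same call with 0755 for the directory mode, as for /tmp/x/y, prints allow. The UID 1000 is just an example value.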

To prove that I found the reason correctly, we may check that the problem disappears in any of the following cases:

mount tmpfs with -o mode=777 — this will not make the mountpoint have a sticky bit;
open /tmp/x/null as O_WRONLY, but without O_CREAT option (to test this, write a program calling open("/tmp/x/null", O_WRONLY | O_CREAT) and open("/tmp/x/null", O_WRONLY), then compile and run it under strace -e trace=openat to see the returned values for each call).
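The O_CREAT comparison can also be approximated from the shell, in place of the small C program suggested above. This sketch assumes GNU dd, whose conv=nocreat option drops O_CREAT from the open flags; the two function names are made up for illustration:

```shell
# Open FILE for writing the way a shell redirection does
# (O_WRONLY|O_CREAT|O_TRUNC), writing nothing.
# "true" is used instead of ":" so that a failed redirection does not
# terminate a non-interactive POSIX shell.
open_with_creat() {
    true > "$1"
}

# Open FILE for writing without O_CREAT (and without truncating it).
# Assumes GNU dd; conv=nocreat makes dd fail if FILE does not exist.
open_without_creat() {
    dd if=/dev/null of="$1" conv=nocreat,notrunc 2>/dev/null
}
```

If the explanation above is right, open_with_creat /tmp/x/null should fail with "Permission denied" in the setup from the question, while open_without_creat /tmp/x/null should succeed. Running either under strace -e trace=openat shows the flags actually passed.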

I'm not sure whether this behavior should be considered a kernel bug or not, but the documentation for openat(2) clearly does not cover all the cases when this syscall actually fails with EACCES.

Related Solutions

Linux – Why doesn’t mount respect the read only option for bind mounts

UPDATE 2016-07-20:

The following are true for 4.5 kernels, but not true for 4.3 kernels (This is wrong. See update #2 below):

The kernel has two flags that control read-only:

MS_RDONLY: indicates whether the filesystem itself (the superblock) is read-only
MNT_READONLY: indicates whether the "user" wants this particular mount to be read-only

On a 4.5 kernel, doing a mount -o bind,ro will actually do the trick. For example, this:

# mkdir /tmp/test
# mkdir /tmp/test/a /tmp/test/b
# mount -t tmpfs none /tmp/test/a
# mkdir /tmp/test/a/d
# mount -o bind,ro /tmp/test/a/d /tmp/test/b

will create a read-only bind mount of /tmp/test/a/d to /tmp/test/b, which will be visible in /proc/mounts as:

none /tmp/test/a tmpfs rw,relatime 0 0
none /tmp/test/b tmpfs ro,relatime 0 0

A more detailed view is visible in /proc/self/mountinfo, which takes into consideration the user view (namespace). The relevant lines will be these:

363 74 0:49 / /tmp/test/a rw,relatime shared:273 - tmpfs none rw
368 74 0:49 /d /tmp/test/b ro,relatime shared:273 - tmpfs none rw

On the second line, you can see that it says both ro (MNT_READONLY, the per-mount flag) and rw (MS_RDONLY not set on the superblock).
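To pull just those two option fields out of mountinfo, something like the following sketch can be used (POSIX sh with awk; the function name mountinfo_opts is hypothetical). It relies on the documented layout in which the per-mount options are the sixth field and the superblock options are the third field after the "-" separator:

```shell
# Reads mountinfo-formatted lines on stdin and prints, per mount:
#   <mount point> <per-mount options> <superblock options>
# The number of optional fields before the "-" separator varies, so the
# separator is located first.
mountinfo_opts() {
    awk '{
        for (i = 7; i <= NF; i++)
            if ($i == "-") break
        if (i + 3 <= NF) print $5, $6, $(i + 3)
    }'
}
```

Typical use would be mountinfo_opts < /proc/self/mountinfo | grep /tmp/test. For the two lines shown above it prints /tmp/test/a rw,relatime rw and /tmp/test/b ro,relatime rw.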

The end result is this:

# echo a > /tmp/test/a/d/f
# echo a > /tmp/test/b/f
-su: /tmp/test/b/f: Read-only file system

UPDATE 2016-07-20 #2:

A bit more digging into this shows that the behavior in fact depends on the version of libmount which is part of util-linux. Support for this was added with this commit and was released with version 2.27:

commit 9ac77b8a78452eab0612523d27fee52159f5016a
Author: Karel Zak 
Date:   Mon Aug 17 11:54:26 2015 +0200

    libmount: add support for "bind,ro"

    Now it's necessary t use two mount(8) calls to create a read-only
    mount:

      mount /foo /bar -o bind
      mount /bar -o remount,ro,bind

    This patch allows to specify "bind,ro" and the remount is done
    automatically by libmount by additional mount(2) syscall. It's not
    atomic of course.

    Signed-off-by: Karel Zak

which also provides the workaround. The behavior can be seen using strace on an older and a newer mount:

Old:

mount("/tmp/test/a/d", "/tmp/test/b", 0x222e240, MS_MGC_VAL|MS_RDONLY|MS_BIND, NULL) = 0 <0.000681>

New:

mount("/tmp/test/a/d", "/tmp/test/b", 0x1a8ee90, MS_MGC_VAL|MS_RDONLY|MS_BIND, NULL) = 0 <0.011492>
mount("none", "/tmp/test/b", NULL, MS_RDONLY|MS_REMOUNT|MS_BIND, NULL) = 0 <0.006281>

Conclusion:

To achieve the desired result one needs to run two commands (as @Thomas already said):

mount SRC DST -o bind
mount DST -o remount,ro,bind

Newer versions of mount (util-linux >=2.27) do this automatically when one runs

mount SRC DST -o bind,ro

Debian – bind mounts by systemd don’t magically work with systemd-tmpfiles

bind is unreliable when defined in fstab on a system with systemd. Systemd parses the fstab and tries to work out what order to mount and bind things in. In my own experience it gets this wrong 100% of the time. The best option is to move all your binds out of fstab and write your own xxx.mount unit files for systemd. That way you gain control over the ordering, etc.
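As a rough illustration of such a unit, the sketch below prints a bind-mount unit for a given source and target. The helper name bind_mount_unit and the example paths are hypothetical; the directives used (What=, Where=, Type=, Options=, RequiresMountsFor=) are standard systemd ones, but the ordering that suits a particular system is an assumption to verify:

```shell
# Hypothetical helper: prints a systemd mount unit that bind-mounts SRC ($1)
# on DST ($2). The heredoc body is left unindented on purpose.
bind_mount_unit() {
    cat <<EOF
[Unit]
Description=Bind mount $1 on $2
# Pull in and order after whatever mounts provide the source path.
RequiresMountsFor=$1

[Mount]
What=$1
Where=$2
Type=none
Options=bind

[Install]
WantedBy=local-fs.target
EOF
}
```

The unit file name has to match the Where= path; systemd-escape -p --suffix=mount /srv/dst prints the required name (srv-dst.mount for that example path). The output of bind_mount_unit /srv/src /srv/dst would then be saved under /etc/systemd/system/ with that name, followed by systemctl daemon-reload and systemctl enable --now on the unit.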
