A bind mount is just... well... a bind mount, i.e. it's not a new mount. It just "links"/"exposes"/"considers" a subdirectory as a new mount point. As such it cannot alter the mount parameters. That's why you're getting complaints:
# mount /mnt/1/lala /mnt/2 -o bind,ro
mount: warning: /mnt/2 seems to be mounted read-write.
But as you said a normal bind mount works:
# mount /mnt/1/lala /mnt/2 -o bind
And then a ro remount also works:
# mount /mnt/1/lala /mnt/2 -o bind,remount,ro
However, what happens is that you're changing the whole mount and not just this bind mount. If you take a look at /proc/mounts you'll see that both the bind mount and the original mount change to read-only:
/dev/loop0 /mnt/1 ext2 ro,relatime,errors=continue,user_xattr,acl 0 0
/dev/loop0 /mnt/2 ext2 ro,relatime,errors=continue,user_xattr,acl 0 0
So what you're doing is like changing the initial mount to a read-only mount and then doing a bind mount which will of course be read-only.
UPDATE 2016-07-20:
The following are true for 4.5 kernels, but not true for 4.3 kernels (This is wrong. See update #2 below):
The kernel has two flags that control read-only:
- MS_RDONLY: indicating whether the filesystem itself (the superblock) is read-only
- MNT_READONLY: indicating whether this particular mount point is read-only, i.e. whether the "user" wants it read-only
On a 4.5 kernel, doing a mount -o bind,ro will actually do the trick. For example, this:
# mkdir /tmp/test
# mkdir /tmp/test/a /tmp/test/b
# mount -t tmpfs none /tmp/test/a
# mkdir /tmp/test/a/d
# mount -o bind,ro /tmp/test/a/d /tmp/test/b
will create a read-only bind mount of /tmp/test/a/d to /tmp/test/b, which will be visible in /proc/mounts as:
none /tmp/test/a tmpfs rw,relatime 0 0
none /tmp/test/b tmpfs ro,relatime 0 0
A more detailed view is visible in /proc/self/mountinfo, which takes into consideration the user view (namespace). The relevant lines will be these:
363 74 0:49 / /tmp/test/a rw,relatime shared:273 - tmpfs none rw
368 74 0:49 /d /tmp/test/b ro,relatime shared:273 - tmpfs none rw
Where on the second line, you can see that it says both ro (MNT_READONLY, the per-mount option) and rw (!MS_RDONLY, the superblock option at the end of the line).
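To make the two option columns easier to tell apart, here is a small Python sketch that splits a mountinfo line into its per-mount options (the sixth field) and its superblock options (the last field after the "-" separator). The function name parse_mountinfo_line is made up for this example; the sample lines are the two shown above.

```python
def parse_mountinfo_line(line):
    """Split one /proc/self/mountinfo line into the fields used here."""
    fields = line.split()
    # Optional fields (e.g. "shared:273") end at a lone "-" separator.
    sep = fields.index("-")
    return {
        "mount_id": int(fields[0]),
        "root": fields[3],                       # path inside the filesystem
        "mount_point": fields[4],
        "mount_opts": fields[5].split(","),      # per-mount flags (MNT_*)
        "fs_type": fields[sep + 1],
        "source": fields[sep + 2],
        "super_opts": fields[sep + 3].split(","),  # superblock flags
    }


SAMPLE = [
    "363 74 0:49 / /tmp/test/a rw,relatime shared:273 - tmpfs none rw",
    "368 74 0:49 /d /tmp/test/b ro,relatime shared:273 - tmpfs none rw",
]

for entry in map(parse_mountinfo_line, SAMPLE):
    print(entry["mount_point"], entry["mount_opts"], entry["super_opts"])
```

A mount whose mount_opts contain ro while its super_opts contain rw is read-only only through that mount point; the underlying filesystem is still writable elsewhere.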
The end result is this:
# echo a > /tmp/test/a/d/f
# echo a > /tmp/test/b/f
-su: /tmp/test/b/f: Read-only file system
UPDATE 2016-07-20 #2:
A bit more digging into this shows that the behavior in fact depends on the version of libmount which is part of util-linux. Support for this was added with this commit and was released with version 2.27:
commit 9ac77b8a78452eab0612523d27fee52159f5016a
Author: Karel Zak
Date: Mon Aug 17 11:54:26 2015 +0200
libmount: add support for "bind,ro"
Now it's necessary t use two mount(8) calls to create a read-only
mount:
mount /foo /bar -o bind
mount /bar -o remount,ro,bind
This patch allows to specify "bind,ro" and the remount is done
automatically by libmount by additional mount(2) syscall. It's not
atomic of course.
Signed-off-by: Karel Zak
which also provides the workaround. The behavior can be seen using strace on an older and a newer mount:
Old:
mount("/tmp/test/a/d", "/tmp/test/b", 0x222e240, MS_MGC_VAL|MS_RDONLY|MS_BIND, NULL) = 0 <0.000681>
New:
mount("/tmp/test/a/d", "/tmp/test/b", 0x1a8ee90, MS_MGC_VAL|MS_RDONLY|MS_BIND, NULL) = 0 <0.011492>
mount("none", "/tmp/test/b", NULL, MS_RDONLY|MS_REMOUNT|MS_BIND, NULL) = 0 <0.006281>
Conclusion:
To achieve the desired result one needs to run two commands (as @Thomas already said):
mount SRC DST -o bind
mount DST -o remount,ro,bind
Newer versions of mount (util-linux >=2.27) do this automatically when one runs
mount SRC DST -o bind,ro
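For illustration only, the same two-call sequence can be issued directly through mount(2). This is a minimal Python sketch using ctypes, assuming Linux and root privileges; the helper names bind_ro_steps and bind_mount_ro are made up for this example, and the flag values are the ones from <sys/mount.h>. Like the libmount version, it is not atomic.

```python
import ctypes
import os

# Flag values as defined in <sys/mount.h> on Linux.
MS_RDONLY = 0x1
MS_REMOUNT = 0x20
MS_BIND = 0x1000


def bind_ro_steps(src, dst):
    """Return the (source, target, flags) triples in the order they must run."""
    return [
        (src, dst, MS_BIND),                             # plain bind mount
        ("none", dst, MS_REMOUNT | MS_BIND | MS_RDONLY),  # make this mount ro
    ]


def bind_mount_ro(src, dst):
    """Create a read-only bind mount of src at dst (needs CAP_SYS_ADMIN)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p)
    libc.mount.restype = ctypes.c_int
    for source, target, flags in bind_ro_steps(src, dst):
        rc = libc.mount(os.fsencode(source), os.fsencode(target),
                        None, flags, None)
        if rc != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), target)
```

Calling bind_mount_ro("/tmp/test/a/d", "/tmp/test/b") as root would correspond to the two commands above; without privileges the first call is expected to fail with EPERM.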
Best Answer
Well, this seems to be a very interesting effect, which is a consequence of three mechanisms combined together.
The first (trivial) point is that when you redirect something to a file, the shell opens the target file with the O_CREAT option to be sure that the file will be created if it does not yet exist.
The second thing to consider is the fact that /tmp/x is a tmpfs mountpoint, while /tmp/x/y is an ordinary directory. Given that you mount tmpfs with no options, the mountpoint's permissions automagically change so that it becomes world-writable and has a sticky bit (1777, which is the usual set of permissions for /tmp, so this feels like a sane default), while the permissions for /tmp/x/y are probably 0755 (depends on your umask).
Finally, the third part of the puzzle is the way you set up the user namespace: you instruct unshare(1) to map the UID/GID of your host user to the same UID/GID in the new namespace. This is the only mapping in the new namespace, so trying to translate any other UID between the parent/child namespaces will result in the so-called overflow UID, which by default is 65534, the nobody user (see user_namespaces(7), section "Unmapped user and group IDs"). This makes /dev/null (and its bind-mounts) be owned by nobody inside the child user namespace (as there is no mapping for the host's root user in the child user namespace).
Combining all the facts together we come to the following: echo > /tmp/x/null tries to open an existing file with the O_CREAT option, while this file resides inside a world-writable sticky directory and is owned by nobody, who is not the owner of the directory containing it.
Now, read the description of EACCES in openat(2) carefully, word by word.
Isn't this brilliant? This seems almost like our case... Except for the fact that the man page talks only about ordinary files and FIFOs and says nothing about device nodes.
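The overflow-UID translation mentioned above can be sketched in a few lines of Python. This is only a model of the lookup, assuming the uid_map line format "inside-start outside-start length" from user_namespaces(7); the function name map_into_namespace and the UID 1000 are hypothetical, chosen just for the example.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def map_into_namespace(outside_uid, uid_map_lines, overflow=OVERFLOW_UID):
    """Translate a parent-namespace UID into the child namespace.

    Each uid_map line is "inside_start outside_start length".
    UIDs that fall in no range come out as the overflow UID.
    """
    for line in uid_map_lines:
        inside_start, outside_start, length = (int(x) for x in line.split())
        if outside_start <= outside_uid < outside_start + length:
            return inside_start + (outside_uid - outside_start)
    return overflow


# Hypothetical single mapping: host UID 1000 maps to UID 1000 inside.
uid_map = ["1000 1000 1"]
print(map_into_namespace(1000, uid_map))  # the mapped user keeps its UID
print(map_into_namespace(0, uid_map))     # host root has no mapping
```

With only that one mapping, host root (the owner of /dev/null) has no counterpart inside the namespace, so files it owns show up as belonging to the overflow UID.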
Well, let's take a look at the code which actually implements this. We can see that, essentially, it first checks for exceptional cases which must succeed (the first if), and then it just denies the access for any other case if the sticky directory is world-writable (the second if, first condition).
So, if the target file is a char device (not a regular file or a FIFO), the kernel still denies opening it with O_CREAT when this file is in a world-writable sticky directory.
To prove that I found the reason correctly, we may check that the problem disappears in any of the following cases:
- mount tmpfs with -o mode=777: this will not make the mountpoint have a sticky bit;
- open /tmp/x/null as O_WRONLY, but without the O_CREAT option (to test this, write a program calling open("/tmp/x/null", O_WRONLY | O_CREAT) and open("/tmp/x/null", O_WRONLY), then compile and run it under strace -e trace=openat to see the returned values for each call).
I'm not sure whether this behavior should be considered a kernel bug or not, but the documentation for openat(2) clearly does not cover all the cases when this syscall actually fails with EACCES.
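As a rough illustration of the above, here is a Python sketch with two parts. may_create_in_sticky is a toy model of the two-step check as described in this answer (with the protected_fifos/protected_regular sysctls as I understand them; the group-writable branch of the second if is left out), not a copy of the kernel code, and details may differ between kernel versions. probe_open is a stand-in for the small test program suggested above. Both function names, and the UIDs used in the example, are made up.

```python
import errno
import os
import stat


def may_create_in_sticky(dir_mode, dir_uid, file_mode, file_uid, fsuid,
                         protected_fifos=1, protected_regular=1):
    """Toy model of the check; returns 0 (allowed) or -EACCES."""
    # First "if": exceptional cases which must succeed.
    if ((not protected_fifos and stat.S_ISFIFO(file_mode))
            or (not protected_regular and stat.S_ISREG(file_mode))
            or not (dir_mode & stat.S_ISVTX)   # directory is not sticky
            or file_uid == dir_uid             # file owned by directory owner
            or fsuid == file_uid):             # file owned by the caller
        return 0
    # Second "if", first condition only: world-writable sticky directory.
    if dir_mode & 0o002:
        return -errno.EACCES
    return 0


def probe_open(path):
    """Open path with and without O_CREAT; report 'ok' or the errno name."""
    results = {}
    for label, flags in (("O_WRONLY|O_CREAT", os.O_WRONLY | os.O_CREAT),
                         ("O_WRONLY", os.O_WRONLY)):
        try:
            fd = os.open(path, flags, 0o666)
        except OSError as exc:
            results[label] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[label] = "ok"
    return results


# Hypothetical numbers: a char device owned by nobody (65534) inside a
# 1777 directory owned by UID 1000, opened by UID 1000.
char_dev = stat.S_IFCHR | 0o666
print(may_create_in_sticky(0o1777, 1000, char_dev, 65534, 1000))
print(may_create_in_sticky(0o0777, 1000, char_dev, 65534, 1000))
```

Running probe_open("/tmp/x/null") inside the namespace from the question should show the difference between the two flag combinations; the model only mirrors the reasoning and is no substitute for checking with strace.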