Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

fs: allow to mount beneath top mount

Various distributions are adding, or are in the process of adding, support
for system extensions and, in the future, configuration extensions through
various tools. A more detailed explanation of system and configuration
extensions can be found in the manpage listed below at [1].

System extension images may, dynamically at runtime, extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ are overmounted with it
("merging"). When they are deactivated, the mount point is disassembled,
again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. Usually they are isolated much further, up to the point
where they become almost indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1, which, as one can see, is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.
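
To make the shared:N / master:N notation a bit more concrete, here is a
minimal sketch of how a tool could decode it. It is illustration only and
not part of this series: the struct propagation type and the
parse_propagation() helper are hypothetical names, and the input is
assumed to be the propagation tags exactly as printed in the PROPAGATION
column above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Peer group ids decoded from a propagation string; 0 means "tag absent". */
struct propagation {
	int shared;	/* id taken from a "shared:N" tag */
	int master;	/* id taken from a "master:N" tag */
};

/*
 * Hypothetical helper: look for "shared:N" and "master:N" anywhere in
 * @tags (e.g. "shared:24 master:1") and return the ids that were found.
 */
static struct propagation parse_propagation(const char *tags)
{
	struct propagation p = { 0, 0 };
	const char *s;

	s = strstr(tags, "shared:");
	if (s && sscanf(s, "shared:%d", &p.shared) != 1)
		p.shared = 0;

	s = strstr(tags, "master:");
	if (s && sscanf(s, "master:%d", &p.master) != 1)
		p.master = 0;

	return p;
}
```

For the service rootfs shown above this yields shared == 24 and
master == 1, i.e. the mount is a peer in group 24 and receives
propagation from group 1. A private mount carries neither tag, so both
fields stay 0.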

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slightly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60

So whereas services are isolated OS components, a container is treated
like a separate world, and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows mounts to be propagated into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. The new mount api, however, does support
inserting mounts directly. For the interested reader, the blogpost in [2],
where I explain the old and the new approach to inserting mounts into
mount namespaces, might be worth reading.

Containers, of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system, including all services, is designed so that it can be
easily updated in various fine-grained ways without having to enter every
single service's mount namespace, which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means there is an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.
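
For readers who prefer code, the reparenting that "mounting beneath"
boils down to can be sketched with a toy model. This is not kernel code:
struct toy_mount, toy_mount_beneath() and toy_umount() are made-up names
that only track a parent pointer and a mountpoint path, and all locking,
propagation and reference counting is left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a mount: just the parent link and where it is attached. */
struct toy_mount {
	struct toy_mount *parent;
	const char *mountpoint;	/* path on @parent this mount sits on */
};

/*
 * Slide @new_mnt in beneath @top: @new_mnt takes over @top's old parent
 * and mountpoint, and @top is re-attached on the root of @new_mnt.
 */
static void toy_mount_beneath(struct toy_mount *new_mnt, struct toy_mount *top)
{
	assert(top->parent != NULL);	/* nothing can go beneath a root */

	new_mnt->parent = top->parent;
	new_mnt->mountpoint = top->mountpoint;
	top->parent = new_mnt;
	top->mountpoint = "/";		/* root of @new_mnt in this toy model */
}

/* Detach @top and return the mount that is revealed underneath it. */
static struct toy_mount *toy_umount(struct toy_mount *top)
{
	struct toy_mount *revealed = top->parent;

	top->parent = NULL;
	top->mountpoint = NULL;
	return revealed;
}
```

In this model, updating /usr means calling toy_mount_beneath() with the
new image and the currently visible mount, and then toy_umount() on the
old top mount. At no point is the original parent's /usr directory the
thing that is visible.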

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:

fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");

After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:

fchdir(fd_oldroot);
umount2(".", MNT_DETACH);

Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.

The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount. IOW, it will receive propagation from the host:

mount --make-shared /
unshare --mount --propagation=slave

Now a new mount on the /mnt dentry in that mount namespace is
created. As it can be confusing, it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs:

mount -t tmpfs tmpfs /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs

Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:

mount --bind /opt /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1

The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:

TARGET    SOURCE           FSTYPE PROPAGATION
└─/mnt    /dev/sda2[/opt]  ext4   master:1
  └─/mnt  tmpfs            tmpfs

But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.

Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace. The
following example assumes a shared rootfs mount / with peer group
id 1:

mount --bind /opt /opt

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1

If another mount is mounted on top of the /opt mount at the /opt
dentry:

mount --bind /tmp /opt

The following clunky mount tree will result:

TARGET      SOURCE           FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt      /dev/sda2[/tmp]  ext4   405    29        shared:1
  └─/opt    /dev/sda2[/opt]  ext4   188    405       shared:1
    └─/opt  /dev/sda2[/tmp]  ext4   404    188       shared:1

The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.

When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.

(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)

The main point is that such mounts allow userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in the first example of (2), the
one using mount namespaces, reveals the /opt mount which was mounted
beneath it.

In the second example of (2), where a mount was mounted beneath the top
mount in the same mount namespace, unmounting the top mount would unmount
both the top mount and the mount beneath. In the process the original
mount would be remounted on top of the rootfs mount / at the /opt dentry
again.

This, again, is a result of mount propagation, only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt

after which we'll have updated from a btrfs filesystem to an xfs
filesystem without ever revealing the underlying mountpoint.
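
At the system call level the --beneath step above could look roughly
like the following. This is only a sketch under a few assumptions:
mount_beneath() is a hypothetical wrapper name, @fd_mnt is assumed to be
a detached mount obtained elsewhere (e.g. from fsmount() or
open_tree(OPEN_TREE_CLONE)), and the fallback values for the syscall
number and the flags are the ones I'd expect from the uapi headers in
case the installed headers predate them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <sys/syscall.h>	/* SYS_move_mount, if the headers know it */
#include <unistd.h>		/* syscall() */

/* Fallbacks for older headers (assumed values, see lead-in). */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Hypothetical wrapper: attach the detached mount referred to by
 * @fd_mnt beneath the top mount at @target.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int mount_beneath(int fd_mnt, const char *target)
{
	return (int)syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, target,
			    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
}
```

After a successful mount_beneath(fd_mnt, "/mnt"), a plain
umount2("/mnt", MNT_DETACH) of the top mount would reveal the new
filesystem. On kernels without this series the call is expected to fail
with EINVAL because the flag is unknown.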

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows a mount to be explicitly
moved beneath the top mount, adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:

mount --bind /opt /opt
mount --bind /tmp /opt

will cause the same mount tree as:

mount --bind /opt /opt
mount --beneath /tmp /opt

because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenario is:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt

This forces the allocation of a new separate peer group for the /opt
mount. Afterwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) @mnt_to is moved to a different mountpoint while acquiring
@namespace_sem to lock @mnt_to.
(3) @mnt_to is unmounted while acquiring @namespace_sem to lock
@mnt_to.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt_to->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>

+350 -50
+304 -48
fs/namespace.c
···
 	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
 }

+/**
+ * mnt_set_mountpoint_beneath - mount a mount beneath another one
+ *
+ * @new_parent: the source mount
+ * @top_mnt: the mount beneath which @new_parent is mounted
+ * @new_mp: the new mountpoint of @top_mnt on @new_parent
+ *
+ * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and
+ * parent @top_mnt->mnt_parent and mount it on top of @new_parent at
+ * @new_mp. And mount @new_parent on the old parent and old
+ * mountpoint of @top_mnt.
+ *
+ * Context: This function expects namespace_lock() and lock_mount_hash()
+ *          to have been acquired in that order.
+ */
+static void mnt_set_mountpoint_beneath(struct mount *new_parent,
+				       struct mount *top_mnt,
+				       struct mountpoint *new_mp)
+{
+	struct mount *old_top_parent = top_mnt->mnt_parent;
+	struct mountpoint *old_top_mp = top_mnt->mnt_mp;
+
+	mnt_set_mountpoint(old_top_parent, old_top_mp, new_parent);
+	mnt_change_mountpoint(new_parent, new_mp, top_mnt);
+}
+
+
 static void __attach_mnt(struct mount *mnt, struct mount *parent)
 {
 	hlist_add_head_rcu(&mnt->mnt_hash,
···
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 }

-/*
- * vfsmount lock must be held for write
+/**
+ * attach_mnt - mount a mount, attach to @mount_hashtable and parent's
+ *              list of child mounts
+ * @parent: the parent
+ * @mnt: the new mount
+ * @mp: the new mountpoint
+ * @beneath: whether to mount @mnt beneath or on top of @parent
+ *
+ * If @beneath is false, mount @mnt at @mp on @parent. Then attach @mnt
+ * to @parent's child mount list and to @mount_hashtable.
+ *
+ * If @beneath is true, remove @mnt from its current parent and
+ * mountpoint and mount it on @mp on @parent, and mount @parent on the
+ * old parent and old mountpoint of @mnt. Finally, attach @parent to
+ * @mnt_hashtable and @parent->mnt_parent->mnt_mounts.
+ *
+ * Note, when __attach_mnt() is called @mnt->mnt_parent already points
+ * to the correct parent.
+ *
+ * Context: This function expects namespace_lock() and lock_mount_hash()
+ *          to have been acquired in that order.
  */
-static void attach_mnt(struct mount *mnt,
-			struct mount *parent,
-			struct mountpoint *mp)
+static void attach_mnt(struct mount *mnt, struct mount *parent,
+		       struct mountpoint *mp, bool beneath)
 {
-	mnt_set_mountpoint(parent, mp, mnt);
-	__attach_mnt(mnt, parent);
+	if (beneath)
+		mnt_set_mountpoint_beneath(mnt, parent, mp);
+	else
+		mnt_set_mountpoint(parent, mp, mnt);
+	/*
+	 * Note, @mnt->mnt_parent has to be used. If @mnt was mounted
+	 * beneath @parent then @mnt will need to be attached to
+	 * @parent's old parent, not @parent. IOW, @mnt->mnt_parent
+	 * isn't the same mount as @parent.
+	 */
+	__attach_mnt(mnt, mnt->mnt_parent);
 }

 void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
···
 	hlist_del_init(&mnt->mnt_mp_list);
 	hlist_del_init_rcu(&mnt->mnt_hash);

-	attach_mnt(mnt, parent, mp);
+	attach_mnt(mnt, parent, mp, false);

 	put_mountpoint(old_mp);
 	mnt_add_count(old_parent, -1);
···
 				goto out;
 			lock_mount_hash();
 			list_add_tail(&q->mnt_list, &res->mnt_list);
-			attach_mnt(q, parent, p->mnt_mp);
+			attach_mnt(q, parent, p->mnt_mp, false);
 			unlock_mount_hash();
 		}
 	}
···
 	return 0;
 }

-/*
- * @source_mnt : mount tree to be attached
- * @nd : place the mount tree @source_mnt is attached
- * @parent_nd : if non-null, detach the source_mnt from its parent and
- *              store the parent mount and mountpoint dentry.
- *              (done when source_mnt is moved)
+enum mnt_tree_flags_t {
+	MNT_TREE_MOVE = BIT(0),
+	MNT_TREE_BENEATH = BIT(1),
+};
+
+/**
+ * attach_recursive_mnt - attach a source mount tree
+ * @source_mnt: mount tree to be attached
+ * @top_mnt: mount that @source_mnt will be mounted on or mounted beneath
+ * @dest_mp: the mountpoint @source_mnt will be mounted at
+ * @flags: modify how @source_mnt is supposed to be attached
  *
  * NOTE: in the table below explains the semantics when a source mount
  * of a given type is attached to a destination mount of a given type.
···
  * applied to each mount in the tree.
  * Must be called without spinlocks held, since this function can sleep
  * in allocations.
+ *
+ * Context: The function expects namespace_lock() to be held.
+ * Return: If @source_mnt was successfully attached 0 is returned.
+ *         Otherwise a negative error code is returned.
  */
 static int attach_recursive_mnt(struct mount *source_mnt,
-			struct mount *dest_mnt,
-			struct mountpoint *dest_mp,
-			bool moving)
+				struct mount *top_mnt,
+				struct mountpoint *dest_mp,
+				enum mnt_tree_flags_t flags)
 {
 	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
 	HLIST_HEAD(tree_list);
-	struct mnt_namespace *ns = dest_mnt->mnt_ns;
+	struct mnt_namespace *ns = top_mnt->mnt_ns;
 	struct mountpoint *smp;
-	struct mount *child, *p;
+	struct mount *child, *dest_mnt, *p;
 	struct hlist_node *n;
-	int err;
+	int err = 0;
+	bool moving = flags & MNT_TREE_MOVE, beneath = flags & MNT_TREE_BENEATH;

-	/* Preallocate a mountpoint in case the new mounts need
-	 * to be tucked under other mounts.
+	/*
+	 * Preallocate a mountpoint in case the new mounts need to be
+	 * mounted beneath mounts on the same mountpoint.
 	 */
 	smp = get_mountpoint(source_mnt->mnt.mnt_root);
 	if (IS_ERR(smp))
···
 			goto out;
 	}

+	if (beneath)
+		dest_mnt = top_mnt->mnt_parent;
+	else
+		dest_mnt = top_mnt;
+
 	if (IS_MNT_SHARED(dest_mnt)) {
 		err = invent_group_ids(source_mnt, true);
 		if (err)
 			goto out;
 		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
-		lock_mount_hash();
-		if (err)
-			goto out_cleanup_ids;
+	}
+	lock_mount_hash();
+	if (err)
+		goto out_cleanup_ids;
+
+	if (IS_MNT_SHARED(dest_mnt)) {
 		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
 			set_mnt_shared(p);
-	} else {
-		lock_mount_hash();
 	}
+
 	if (moving) {
+		if (beneath)
+			dest_mp = smp;
 		unhash_mnt(source_mnt);
-		attach_mnt(source_mnt, dest_mnt, dest_mp);
+		attach_mnt(source_mnt, top_mnt, dest_mp, beneath);
 		touch_mnt_namespace(source_mnt->mnt_ns);
 	} else {
 		if (source_mnt->mnt_ns) {
 			/* move from anon - the caller will destroy */
 			list_del_init(&source_mnt->mnt_ns->list);
 		}
-		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
+		if (beneath)
+			mnt_set_mountpoint_beneath(source_mnt, top_mnt, smp);
+		else
+			mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
 		commit_tree(source_mnt);
 	}

···
 	return err;
 }

-static struct mountpoint *lock_mount(struct path *path)
+/**
+ * do_lock_mount - lock mount and mountpoint
+ * @path: target path
+ * @beneath: whether the intention is to mount beneath @path
+ *
+ * Follow the mount stack on @path until the top mount @mnt is found. If
+ * the initial @path->{mnt,dentry} is a mountpoint lookup the first
+ * mount stacked on top of it. Then simply follow @{mnt,mnt->mnt_root}
+ * until nothing is stacked on top of it anymore.
+ *
+ * Acquire the inode_lock() on the top mount's ->mnt_root to protect
+ * against concurrent removal of the new mountpoint from another mount
+ * namespace.
+ *
+ * If @beneath is requested, acquire inode_lock() on @mnt's mountpoint
+ * @mp on @mnt->mnt_parent must be acquired. This protects against a
+ * concurrent unlink of @mp->mnt_dentry from another mount namespace
+ * where @mnt doesn't have a child mount mounted @mp. A concurrent
+ * removal of @mnt->mnt_root doesn't matter as nothing will be mounted
+ * on top of it for @beneath.
+ *
+ * In addition, @beneath needs to make sure that @mnt hasn't been
+ * unmounted or moved from its current mountpoint in between dropping
+ * @mount_lock and acquiring @namespace_sem. For the !@beneath case @mnt
+ * being unmounted would be detected later by e.g., calling
+ * check_mnt(mnt) in the function it's called from. For the @beneath
+ * case however, it's useful to detect it directly in do_lock_mount().
+ * If @mnt hasn't been unmounted then @mnt->mnt_mountpoint still points
+ * to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will
+ * point to @mnt->mnt_root and @mnt->mnt_mp will be NULL.
+ *
+ * Return: Either the target mountpoint on the top mount or the top
+ *         mount's mountpoint.
+ */
+static struct mountpoint *do_lock_mount(struct path *path, bool beneath)
 {
-	struct vfsmount *mnt;
+	struct vfsmount *mnt = path->mnt;
 	struct dentry *dentry;
-	struct mountpoint *mp;
+	struct mountpoint *mp = ERR_PTR(-ENOENT);

 	for (;;) {
-		dentry = path->dentry;
+		struct mount *m;
+
+		if (beneath) {
+			m = real_mount(mnt);
+			read_seqlock_excl(&mount_lock);
+			dentry = dget(m->mnt_mountpoint);
+			read_sequnlock_excl(&mount_lock);
+		} else {
+			dentry = path->dentry;
+		}
+
 		inode_lock(dentry->d_inode);
 		if (unlikely(cant_mount(dentry))) {
 			inode_unlock(dentry->d_inode);
-			return ERR_PTR(-ENOENT);
+			goto out;
 		}

 		namespace_lock();
+
+		if (beneath && (!is_mounted(mnt) || m->mnt_mountpoint != dentry)) {
+			namespace_unlock();
+			inode_unlock(dentry->d_inode);
+			goto out;
+		}

 		mnt = lookup_mnt(path);
 		if (likely(!mnt))
···

 		namespace_unlock();
 		inode_unlock(dentry->d_inode);
+		if (beneath)
+			dput(dentry);
 		path_put(path);
 		path->mnt = mnt;
 		path->dentry = dget(mnt->mnt_root);
···
 		inode_unlock(dentry->d_inode);
 	}

+out:
+	if (beneath)
+		dput(dentry);
+
 	return mp;
+}
+
+static inline struct mountpoint *lock_mount(struct path *path)
+{
+	return do_lock_mount(path, false);
 }

 static void unlock_mount(struct mountpoint *where)
···
 	      d_is_dir(mnt->mnt.mnt_root))
 		return -ENOTDIR;

-	return attach_recursive_mnt(mnt, p, mp, false);
+	return attach_recursive_mnt(mnt, p, mp, 0);
 }

 /*
···
 	return err;
 }

-static int do_move_mount(struct path *old_path, struct path *new_path)
+/**
+ * path_overmounted - check if path is overmounted
+ * @path: path to check
+ *
+ * Check if path is overmounted, i.e., if there's a mount on top of
+ * @path->mnt with @path->dentry as mountpoint.
+ *
+ * Context: This function expects namespace_lock() to be held.
+ * Return: If path is overmounted true is returned, false if not.
+ */
+static inline bool path_overmounted(const struct path *path)
+{
+	rcu_read_lock();
+	if (unlikely(__lookup_mnt(path->mnt, path->dentry))) {
+		rcu_read_unlock();
+		return true;
+	}
+	rcu_read_unlock();
+	return false;
+}
+
+/**
+ * can_move_mount_beneath - check that we can mount beneath the top mount
+ * @from: mount to mount beneath
+ * @to: mount under which to mount
+ *
+ * - Make sure that @to->dentry is actually the root of a mount under
+ *   which we can mount another mount.
+ * - Make sure that nothing can be mounted beneath the caller's current
+ *   root or the rootfs of the namespace.
+ * - Make sure that the caller can unmount the topmost mount ensuring
+ *   that the caller could reveal the underlying mountpoint.
+ * - Ensure that nothing has been mounted on top of @from before we
+ *   grabbed @namespace_sem to avoid creating pointless shadow mounts.
+ * - Prevent mounting beneath a mount if the propagation relationship
+ *   between the source mount, parent mount, and top mount would lead to
+ *   nonsensical mount trees.
+ *
+ * Context: This function expects namespace_lock() to be held.
+ * Return: On success 0, and on error a negative error code is returned.
+ */
+static int can_move_mount_beneath(const struct path *from,
+				  const struct path *to,
+				  const struct mountpoint *mp)
+{
+	struct mount *mnt_from = real_mount(from->mnt),
+		     *mnt_to = real_mount(to->mnt),
+		     *parent_mnt_to = mnt_to->mnt_parent;
+
+	if (!mnt_has_parent(mnt_to))
+		return -EINVAL;
+
+	if (!path_mounted(to))
+		return -EINVAL;
+
+	if (IS_MNT_LOCKED(mnt_to))
+		return -EINVAL;
+
+	/* Avoid creating shadow mounts during mount propagation. */
+	if (path_overmounted(from))
+		return -EINVAL;
+
+	/*
+	 * Mounting beneath the rootfs only makes sense when the
+	 * semantics of pivot_root(".", ".") are used.
+	 */
+	if (&mnt_to->mnt == current->fs->root.mnt)
+		return -EINVAL;
+	if (parent_mnt_to == current->nsproxy->mnt_ns->root)
+		return -EINVAL;
+
+	for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
+		if (p == mnt_to)
+			return -EINVAL;
+
+	/*
+	 * If the parent mount propagates to the child mount this would
+	 * mean mounting @mnt_from on @mnt_to->mnt_parent and then
+	 * propagating a copy @c of @mnt_from on top of @mnt_to. This
+	 * defeats the whole purpose of mounting beneath another mount.
+	 */
+	if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
+		return -EINVAL;
+
+	/*
+	 * If @mnt_to->mnt_parent propagates to @mnt_from this would
+	 * mean propagating a copy @c of @mnt_from on top of @mnt_from.
+	 * Afterwards @mnt_from would be mounted on top of
+	 * @mnt_to->mnt_parent and @mnt_to would be unmounted from
+	 * @mnt->mnt_parent and remounted on @mnt_from. But since @c is
+	 * already mounted on @mnt_from, @mnt_to would ultimately be
+	 * remounted on top of @c. Afterwards, @mnt_from would be
+	 * covered by a copy @c of @mnt_from and @c would be covered by
+	 * @mnt_from itself. This defeats the whole purpose of mounting
+	 * @mnt_from beneath @mnt_to.
+	 */
+	if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int do_move_mount(struct path *old_path, struct path *new_path,
+			 bool beneath)
 {
 	struct mnt_namespace *ns;
 	struct mount *p;
···
 	struct mountpoint *mp, *old_mp;
 	int err;
 	bool attached;
+	enum mnt_tree_flags_t flags = 0;

-	mp = lock_mount(new_path);
+	mp = do_lock_mount(new_path, beneath);
 	if (IS_ERR(mp))
 		return PTR_ERR(mp);

···
 	p = real_mount(new_path->mnt);
 	parent = old->mnt_parent;
 	attached = mnt_has_parent(old);
+	if (attached)
+		flags |= MNT_TREE_MOVE;
 	old_mp = old->mnt_mp;
 	ns = old->mnt_ns;

···
 	 */
 	if (attached && IS_MNT_SHARED(parent))
 		goto out;
+
+	if (beneath) {
+		err = can_move_mount_beneath(old_path, new_path, mp);
+		if (err)
+			goto out;
+
+		err = -EINVAL;
+		p = p->mnt_parent;
+		flags |= MNT_TREE_BENEATH;
+	}
+
 	/*
 	 * Don't move a mount tree containing unbindable mounts to a destination
 	 * mount which is shared.
···
 		if (p == old)
 			goto out;

-	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
-				   attached);
+	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp, flags);
 	if (err)
 		goto out;

···
 	if (err)
 		return err;

-	err = do_move_mount(&old_path, path);
+	err = do_move_mount(&old_path, path, false);
 	path_put(&old_path);
 	return err;
 }
···
 		err = -ENOENT;
 		goto discard_locked;
 	}
-	rcu_read_lock();
-	if (unlikely(__lookup_mnt(path->mnt, dentry))) {
-		rcu_read_unlock();
+	if (path_overmounted(path)) {
 		err = 0;
 		goto discard_locked;
 	}
-	rcu_read_unlock();
 	mp = get_mountpoint(dentry);
 	if (IS_ERR(mp)) {
 		err = PTR_ERR(mp);
···
 	if (flags & ~MOVE_MOUNT__MASK)
 		return -EINVAL;

+	if ((flags & (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP)) ==
+	    (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP))
+		return -EINVAL;
+
 	/* If someone gives a pathname, they aren't permitted to move
 	 * from an fd that requires unmount as we can't get at the flag
 	 * to clear it afterwards.
···
 	if (flags & MOVE_MOUNT_SET_GROUP)
 		ret = do_set_group(&from_path, &to_path);
 	else
-		ret = do_move_mount(&from_path, &to_path);
+		ret = do_move_mount(&from_path, &to_path,
+				    (flags & MOVE_MOUNT_BENEATH));

 out_to:
 	path_put(&to_path);
···
 		root_mnt->mnt.mnt_flags &= ~MNT_LOCKED;
 	}
 	/* mount old root on put_old */
-	attach_mnt(root_mnt, old_mnt, old_mp);
+	attach_mnt(root_mnt, old_mnt, old_mp, false);
 	/* mount new_root on / */
-	attach_mnt(new_mnt, root_parent, root_mp);
+	attach_mnt(new_mnt, root_parent, root_mp, false);
 	mnt_add_count(root_parent, -1);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
 	/* A moved mount should not expire automatically */
+41 -1
fs/pnode.c
···
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct hlist_head *list;

-static inline bool peers(struct mount *m1, struct mount *m2)
+static inline bool peers(const struct mount *m1, const struct mount *m2)
 {
 	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
···
 static inline int do_refcount_check(struct mount *mnt, int count)
 {
 	return mnt_get_count(mnt) > count;
+}
+
+/**
+ * propagation_would_overmount - check whether propagation from @from
+ *                               would overmount @to
+ * @from: shared mount
+ * @to:   mount to check
+ * @mp:   future mountpoint of @to on @from
+ *
+ * If @from propagates mounts to @to, @from and @to must either be peers
+ * or one of the masters in the hierarchy of masters of @to must be a
+ * peer of @from.
+ *
+ * If the root of the @to mount is equal to the future mountpoint @mp of
+ * the @to mount on @from then @to will be overmounted by whatever is
+ * propagated to it.
+ *
+ * Context: This function expects namespace_lock() to be held and that
+ *          @mp is stable.
+ * Return: If @from overmounts @to, true is returned, false if not.
+ */
+bool propagation_would_overmount(const struct mount *from,
+				 const struct mount *to,
+				 const struct mountpoint *mp)
+{
+	if (!IS_MNT_SHARED(from))
+		return false;
+
+	if (IS_MNT_NEW(to))
+		return false;
+
+	if (to->mnt.mnt_root != mp->m_dentry)
+		return false;
+
+	for (const struct mount *m = to; m; m = m->mnt_master) {
+		if (peers(from, m))
+			return true;
+	}
+
+	return false;
 }

 /*
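The peer and master walk in propagation_would_overmount() can be tried out in user space with a much smaller model. The following is only a sketch under stated assumptions: `struct sketch_mount`, `sketch_peers()` and `sketch_would_overmount()` are hypothetical stand-ins, the `shared` and `is_new` booleans replace the kernel's IS_MNT_SHARED() and IS_MNT_NEW() tests, and a plain pointer replaces the dentry comparison. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct mount. */
struct sketch_mount {
	int group_id;                      /* 0 means "not in a peer group" */
	bool shared;                       /* stand-in for IS_MNT_SHARED() */
	bool is_new;                       /* stand-in for IS_MNT_NEW() */
	const void *root;                  /* stand-in for mnt.mnt_root */
	const struct sketch_mount *master; /* stand-in for mnt_master */
};

/* Two mounts are peers if they share a non-zero peer group id. */
static bool sketch_peers(const struct sketch_mount *a,
			 const struct sketch_mount *b)
{
	return a->group_id == b->group_id && a->group_id != 0;
}

/*
 * Returns true if a mount attached at @mp_dentry on @from would be
 * propagated on top of @to: @from must be shared, @to must already be
 * attached, the root of @to must be the future mountpoint, and @to or
 * one of its masters must be a peer of @from.
 */
bool sketch_would_overmount(const struct sketch_mount *from,
			    const struct sketch_mount *to,
			    const void *mp_dentry)
{
	assert(from != NULL && to != NULL);

	if (!from->shared)
		return false;

	if (to->is_new)
		return false;

	if (to->root != mp_dentry)
		return false;

	/* Walk @to and then its chain of masters looking for a peer. */
	for (const struct sketch_mount *m = to; m; m = m->master) {
		if (sketch_peers(from, m))
			return true;
	}

	return false;
}
```

In this model a slave mount whose master is a peer of the parent is reported as overmounted, while a private mount with no master is not, which mirrors the two cases the kernel-doc comment describes.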
+3
fs/pnode.h
···
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
 int count_mounts(struct mnt_namespace *ns, struct mount *mnt);
+bool propagation_would_overmount(const struct mount *from,
+				 const struct mount *to,
+				 const struct mountpoint *mp);
 #endif /* _LINUX_PNODE_H */
+2 -1
include/uapi/linux/mount.h
···
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
 #define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
-#define MOVE_MOUNT__MASK		0x00000177
+#define MOVE_MOUNT_BENEATH		0x00000200 /* Mount beneath top mount */
+#define MOVE_MOUNT__MASK		0x00000377

 /*
  * fsopen() flags.
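The flag checks this patch adds to the move_mount() entry point can be exercised in isolation. The sketch below copies the three flag values from the hunk above and repeats the two tests from fs/namespace.c (unknown bits, then the MOVE_MOUNT_BENEATH and MOVE_MOUNT_SET_GROUP combination). The function name `sketch_check_move_mount_flags()` is hypothetical and only illustrates the logic; it is not a kernel or libc interface.

```c
#include <assert.h>
#include <errno.h>

/* Values as defined in include/uapi/linux/mount.h after this patch. */
#define MOVE_MOUNT_SET_GROUP	0x00000100 /* Set sharing group instead */
#define MOVE_MOUNT_BENEATH	0x00000200 /* Mount beneath top mount */
#define MOVE_MOUNT__MASK	0x00000377

/* The widened mask has to admit the new flag. */
static_assert((MOVE_MOUNT__MASK & MOVE_MOUNT_BENEATH) == MOVE_MOUNT_BENEATH,
	      "MOVE_MOUNT__MASK must include MOVE_MOUNT_BENEATH");

/*
 * Hypothetical helper mirroring the flag validation order in the patch:
 * reject unknown bits first, then reject requests that combine
 * "mount beneath" with "set sharing group".
 */
int sketch_check_move_mount_flags(unsigned int flags)
{
	if (flags & ~MOVE_MOUNT__MASK)
		return -EINVAL;

	if ((flags & (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP)) ==
	    (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP))
		return -EINVAL;

	return 0;
}
```

Each flag is accepted on its own; only the combination of both, or any bit outside the mask, yields -EINVAL.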