Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Merge tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull setgid inheritance updates from Christian Brauner:
"This contains the work to make setgid inheritance consistent between
modifying a file and when changing ownership or mode as this has been
a repeated source of very subtle bugs. The gist is that we perform the
same permission checks in the write path as we do in the ownership and
mode changing paths after this series where we're currently doing
different things.

We've already made setgid inheritance a lot more consistent and
reliable in the last releases by moving setgid stripping from the
individual filesystems up into the vfs. This aims to make the logic
even more consistent and easier to understand and also to fix
long-standing overlayfs setgid inheritance bugs. Miklos was nice
enough to just let me carry the trivial overlayfs patches from Amir
too.

Below is a more detailed explanation how the current difference in
setgid handling lead to very subtle bugs exemplified via overlayfs
which is a victim of the current rules. I hope this explains why I
think taking the regression risk here is worth it.

A long while ago I found a few setgid inheritance bugs in overlayfs in
the write path in certain conditions. Amir recently picked this back
up in [1] and I jumped on board to fix this more generally.

On the surface all that overlayfs would need to fix setgid inheritance
would be to call file_remove_privs() or file_modified() but actually
that isn't enough because the setgid inheritance api is wildly
inconsistent in that area.

Before this pr setgid stripping in file_remove_privs()'s old
should_remove_suid() helper was inconsistent with other parts of the
vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is
S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups
and the caller isn't privileged over the inode although we require
this already in setattr_prepare() and setattr_copy() and so all
filesystem implement this requirement implicitly because they have to
use setattr_{prepare,copy}() anyway.

But the inconsistency shows up in setgid stripping bugs for overlayfs
in xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping
works correctly when performing various write-like operations as an
unprivileged user (fallocate, reflink, write, etc.):

echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k

The test basically creates a file with 6666 permissions. While the
file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP
set.

On a regular filesystem like xfs what will happen is:

sys_fallocate()
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> setattr_copy()

In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.

But we also see that ATTR_KILL_SGID won't be set because while the
file is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.

So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE.

Now notify_change() sees that ATTR_KILL_SUID is set and does:

ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);

which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

Now we call into the filesystem's ->setattr() inode operation which
will end up calling setattr_copy(). Since ATTR_MODE is set we will
hit:

if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
if (!vfsgid_in_group_p(vfsgid) &&
!capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
mode &= ~S_ISGID;
inode->i_mode = mode;
}

and since the caller in the test is neither capable nor in the group
of the inode the S_ISGID bit is stripped.

But assume the file isn't suid then ATTR_KILL_SUID won't be raised
which has the consequence that neither the setgid nor the suid bits
are stripped even though it should be stripped because the inode isn't
in the caller's groups and the caller isn't privileged over the inode.

If overlayfs is in the mix things become a bit more complicated and
the bug shows up more clearly.

When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to
file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be
raised but because the check in notify_change() is questioning the
ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped
the S_ISGID bit isn't removed even though it should be stripped:

sys_fallocate()
-> vfs_fallocate()
-> ovl_fallocate()
-> file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> ovl_setattr()
/* TAKE ON MOUNTER'S CREDS */
-> ovl_do_notify_change()
-> notify_change()
/* GIVE UP MOUNTER'S CREDS */
/* TAKE ON MOUNTER'S CREDS */
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = attr_force | kill;
-> notify_change()

The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have
notify_change() not pointlessly requiring S_IXGRP again. It doesn't
make any sense in the first place because the caller must calculate
the flags via should_remove_suid() anyway which would raise
ATTR_KILL_SGID

Note that some xfstests will now fail as these patches will cause the
setgid bit to be lost in certain conditions for unprivileged users
modifying a setgid file when they would've been kept otherwise. I
think this risk is worth taking and I explained and mentioned this
multiple times on the list [2].

Enforcing the rules consistently across write operations and
chmod/chown will lead to losing the setgid bit in cases were it
might've been retained before.

While I've mentioned this a few times but it's worth repeating just to
make sure that this is understood. For the sake of maintainability,
consistency, and security this is a risk worth taking.

If we really see regressions for workloads the fix is to have special
setgid handling in the write path again with different semantics from
chmod/chown and possibly additional duct tape for overlayfs. I'll
update the relevant xfstests with if you should decide to merge this
second setgid cleanup.

Before that people should be aware that there might be failures for
fstests where unprivileged users modify a setgid file"

Link: https://lore.kernel.org/linux-fsdevel/20221003123040.900827-1-amir73il@gmail.com [1]
Link: https://lore.kernel.org/linux-fsdevel/20221122142010.zchf2jz2oymx55qi@wittgenstein [2]

* tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: use consistent setgid checks in is_sxid()
ovl: remove privs in ovl_fallocate()
ovl: remove privs in ovl_copyfile()
attr: use consistent sgid stripping checks
attr: add setattr_should_drop_sgid()
fs: move should_remove_suid()
attr: add in_group_or_capable()

+140 -56
+1 -1
Documentation/trace/ftrace.rst
··· 2940 2940 bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement 2941 2941 bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action 2942 2942 bash-1994 [000] .... 4342.324899: do_truncate <-do_last 2943 - bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate 2943 + bash-1994 [000] .... 4342.324899: setattr_should_drop_suidgid <-do_truncate 2944 2944 bash-1994 [000] .... 4342.324899: notify_change <-do_truncate 2945 2945 bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change 2946 2946 bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
+68 -6
fs/attr.c
··· 18 18 #include <linux/evm.h> 19 19 #include <linux/ima.h> 20 20 21 + #include "internal.h" 22 + 23 + /** 24 + * setattr_should_drop_sgid - determine whether the setgid bit needs to be 25 + * removed 26 + * @mnt_userns: user namespace of the mount @inode was found from 27 + * @inode: inode to check 28 + * 29 + * This function determines whether the setgid bit needs to be removed. 30 + * We retain backwards compatibility and require setgid bit to be removed 31 + * unconditionally if S_IXGRP is set. Otherwise we have the exact same 32 + * requirements as setattr_prepare() and setattr_copy(). 33 + * 34 + * Return: ATTR_KILL_SGID if setgid bit needs to be removed, 0 otherwise. 35 + */ 36 + int setattr_should_drop_sgid(struct user_namespace *mnt_userns, 37 + const struct inode *inode) 38 + { 39 + umode_t mode = inode->i_mode; 40 + 41 + if (!(mode & S_ISGID)) 42 + return 0; 43 + if (mode & S_IXGRP) 44 + return ATTR_KILL_SGID; 45 + if (!in_group_or_capable(mnt_userns, inode, 46 + i_gid_into_vfsgid(mnt_userns, inode))) 47 + return ATTR_KILL_SGID; 48 + return 0; 49 + } 50 + 51 + /** 52 + * setattr_should_drop_suidgid - determine whether the set{g,u}id bit needs to 53 + * be dropped 54 + * @mnt_userns: user namespace of the mount @inode was found from 55 + * @inode: inode to check 56 + * 57 + * This function determines whether the set{g,u}id bits need to be removed. 58 + * If the setuid bit needs to be removed ATTR_KILL_SUID is returned. If the 59 + * setgid bit needs to be removed ATTR_KILL_SGID is returned. If both 60 + * set{g,u}id bits need to be removed the corresponding mask of both flags is 61 + * returned. 62 + * 63 + * Return: A mask of ATTR_KILL_S{G,U}ID indicating which - if any - setid bits 64 + * to remove, 0 otherwise. 65 + */ 66 + int setattr_should_drop_suidgid(struct user_namespace *mnt_userns, 67 + struct inode *inode) 68 + { 69 + umode_t mode = inode->i_mode; 70 + int kill = 0; 71 + 72 + /* suid always must be killed */ 73 + if (unlikely(mode & S_ISUID)) 74 + kill = ATTR_KILL_SUID; 75 + 76 + kill |= setattr_should_drop_sgid(mnt_userns, inode); 77 + 78 + if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) 79 + return kill; 80 + 81 + return 0; 82 + } 83 + EXPORT_SYMBOL(setattr_should_drop_suidgid); 84 + 21 85 /** 22 86 * chown_ok - verify permissions to chown inode 23 87 * @mnt_userns: user namespace of the mount @inode was found from ··· 204 140 vfsgid = i_gid_into_vfsgid(mnt_userns, inode); 205 141 206 142 /* Also check the setgid bit! */ 207 - if (!vfsgid_in_group_p(vfsgid) && 208 - !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) 143 + if (!in_group_or_capable(mnt_userns, inode, vfsgid)) 209 144 attr->ia_mode &= ~S_ISGID; 210 145 } 211 146 ··· 314 251 inode->i_ctime = attr->ia_ctime; 315 252 if (ia_valid & ATTR_MODE) { 316 253 umode_t mode = attr->ia_mode; 317 - vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); 318 - if (!vfsgid_in_group_p(vfsgid) && 319 - !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) 254 + if (!in_group_or_capable(mnt_userns, inode, 255 + i_gid_into_vfsgid(mnt_userns, inode))) 320 256 mode &= ~S_ISGID; 321 257 inode->i_mode = mode; 322 258 } ··· 437 375 } 438 376 } 439 377 if (ia_valid & ATTR_KILL_SGID) { 440 - if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) { 378 + if (mode & S_ISGID) { 441 379 if (!(ia_valid & ATTR_MODE)) { 442 380 ia_valid = attr->ia_valid |= ATTR_MODE; 443 381 attr->ia_mode = inode->i_mode;
+1 -1
fs/fuse/file.c
··· 1313 1313 return err; 1314 1314 1315 1315 if (fc->handle_killpriv_v2 && 1316 - should_remove_suid(file_dentry(file))) { 1316 + setattr_should_drop_suidgid(&init_user_ns, file_inode(file))) { 1317 1317 goto writethrough; 1318 1318 } 1319 1319
+28 -36
fs/inode.c
··· 1949 1949 EXPORT_SYMBOL(touch_atime); 1950 1950 1951 1951 /* 1952 - * The logic we want is 1953 - * 1954 - * if suid or (sgid and xgrp) 1955 - * remove privs 1956 - */ 1957 - int should_remove_suid(struct dentry *dentry) 1958 - { 1959 - umode_t mode = d_inode(dentry)->i_mode; 1960 - int kill = 0; 1961 - 1962 - /* suid always must be killed */ 1963 - if (unlikely(mode & S_ISUID)) 1964 - kill = ATTR_KILL_SUID; 1965 - 1966 - /* 1967 - * sgid without any exec bits is just a mandatory locking mark; leave 1968 - * it alone. If some exec bits are set, it's a real sgid; kill it. 1969 - */ 1970 - if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) 1971 - kill |= ATTR_KILL_SGID; 1972 - 1973 - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) 1974 - return kill; 1975 - 1976 - return 0; 1977 - } 1978 - EXPORT_SYMBOL(should_remove_suid); 1979 - 1980 - /* 1981 1952 * Return mask of changes for notify_change() that need to be done as a 1982 1953 * response to write or truncate. Return 0 if nothing has to be changed. 1983 1954 * Negative value on error (change should be denied). 1984 1955 */ 1985 - int dentry_needs_remove_privs(struct dentry *dentry) 1956 + int dentry_needs_remove_privs(struct user_namespace *mnt_userns, 1957 + struct dentry *dentry) 1986 1958 { 1987 1959 struct inode *inode = d_inode(dentry); 1988 1960 int mask = 0; ··· 1963 1991 if (IS_NOSEC(inode)) 1964 1992 return 0; 1965 1993 1966 - mask = should_remove_suid(dentry); 1994 + mask = setattr_should_drop_suidgid(mnt_userns, inode); 1967 1995 ret = security_inode_need_killpriv(dentry); 1968 1996 if (ret < 0) 1969 1997 return ret; ··· 1995 2023 if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode)) 1996 2024 return 0; 1997 2025 1998 - kill = dentry_needs_remove_privs(dentry); 2026 + kill = dentry_needs_remove_privs(file_mnt_user_ns(file), dentry); 1999 2027 if (kill < 0) 2000 2028 return kill; 2001 2029 ··· 2457 2485 EXPORT_SYMBOL(current_time); 2458 2486 2459 2487 /** 2488 + * in_group_or_capable - check whether caller is CAP_FSETID privileged 2489 + * @mnt_userns: user namespace of the mount @inode was found from 2490 + * @inode: inode to check 2491 + * @vfsgid: the new/current vfsgid of @inode 2492 + * 2493 + * Check wether @vfsgid is in the caller's group list or if the caller is 2494 + * privileged with CAP_FSETID over @inode. This can be used to determine 2495 + * whether the setgid bit can be kept or must be dropped. 2496 + * 2497 + * Return: true if the caller is sufficiently privileged, false if not. 2498 + */ 2499 + bool in_group_or_capable(struct user_namespace *mnt_userns, 2500 + const struct inode *inode, vfsgid_t vfsgid) 2501 + { 2502 + if (vfsgid_in_group_p(vfsgid)) 2503 + return true; 2504 + if (capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) 2505 + return true; 2506 + return false; 2507 + } 2508 + 2509 + /** 2460 2510 * mode_strip_sgid - handle the sgid bit for non-directories 2461 2511 * @mnt_userns: User namespace of the mount the inode was created from 2462 2512 * @dir: parent directory inode ··· 2499 2505 return mode; 2500 2506 if (S_ISDIR(mode) || !dir || !(dir->i_mode & S_ISGID)) 2501 2507 return mode; 2502 - if (in_group_p(i_gid_into_mnt(mnt_userns, dir))) 2508 + if (in_group_or_capable(mnt_userns, dir, 2509 + i_gid_into_vfsgid(mnt_userns, dir))) 2503 2510 return mode; 2504 - if (capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID)) 2505 - return mode; 2506 - 2507 2511 return mode & ~S_ISGID; 2508 2512 } 2509 2513 EXPORT_SYMBOL(mode_strip_sgid);
+9 -1
fs/internal.h
··· 150 150 * inode.c 151 151 */ 152 152 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); 153 - extern int dentry_needs_remove_privs(struct dentry *dentry); 153 + int dentry_needs_remove_privs(struct user_namespace *, struct dentry *dentry); 154 + bool in_group_or_capable(struct user_namespace *mnt_userns, 155 + const struct inode *inode, vfsgid_t vfsgid); 154 156 155 157 /* 156 158 * fs-writeback.c ··· 257 255 #endif 258 256 259 257 ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *pos); 258 + 259 + /* 260 + * fs/attr.c 261 + */ 262 + int setattr_should_drop_sgid(struct user_namespace *mnt_userns, 263 + const struct inode *inode);
+2 -2
fs/ocfs2/file.c
··· 1991 1991 } 1992 1992 } 1993 1993 1994 - if (file && should_remove_suid(file->f_path.dentry)) { 1994 + if (file && setattr_should_drop_suidgid(&init_user_ns, file_inode(file))) { 1995 1995 ret = __ocfs2_write_remove_suid(inode, di_bh); 1996 1996 if (ret) { 1997 1997 mlog_errno(ret); ··· 2279 2279 * inode. There's also the dinode i_size state which 2280 2280 * can be lost via setattr during extending writes (we 2281 2281 * set inode->i_size at the end of a write. */ 2282 - if (should_remove_suid(dentry)) { 2282 + if (setattr_should_drop_suidgid(&init_user_ns, inode)) { 2283 2283 if (meta_level == 0) { 2284 2284 ocfs2_inode_unlock_for_extent_tree(inode, 2285 2285 &di_bh,
+4 -4
fs/open.c
··· 54 54 } 55 55 56 56 /* Remove suid, sgid, and file capabilities on truncate too */ 57 - ret = dentry_needs_remove_privs(dentry); 57 + ret = dentry_needs_remove_privs(mnt_userns, dentry); 58 58 if (ret < 0) 59 59 return ret; 60 60 if (ret) ··· 723 723 return -EINVAL; 724 724 if ((group != (gid_t)-1) && !setattr_vfsgid(&newattrs, gid)) 725 725 return -EINVAL; 726 - if (!S_ISDIR(inode->i_mode)) 727 - newattrs.ia_valid |= 728 - ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV; 729 726 inode_lock(inode); 727 + if (!S_ISDIR(inode->i_mode)) 728 + newattrs.ia_valid |= ATTR_KILL_SUID | ATTR_KILL_PRIV | 729 + setattr_should_drop_sgid(mnt_userns, inode); 730 730 /* Continue to send actual fs values, not the mount values. */ 731 731 error = security_path_chown( 732 732 path,
+25 -3
fs/overlayfs/file.c
··· 517 517 const struct cred *old_cred; 518 518 int ret; 519 519 520 + inode_lock(inode); 521 + /* Update mode */ 522 + ovl_copyattr(inode); 523 + ret = file_remove_privs(file); 524 + if (ret) 525 + goto out_unlock; 526 + 520 527 ret = ovl_real_fdget(file, &real); 521 528 if (ret) 522 - return ret; 529 + goto out_unlock; 523 530 524 531 old_cred = ovl_override_creds(file_inode(file)->i_sb); 525 532 ret = vfs_fallocate(real.file, mode, offset, len); ··· 536 529 ovl_copyattr(inode); 537 530 538 531 fdput(real); 532 + 533 + out_unlock: 534 + inode_unlock(inode); 539 535 540 536 return ret; 541 537 } ··· 577 567 const struct cred *old_cred; 578 568 loff_t ret; 579 569 570 + inode_lock(inode_out); 571 + if (op != OVL_DEDUPE) { 572 + /* Update mode */ 573 + ovl_copyattr(inode_out); 574 + ret = file_remove_privs(file_out); 575 + if (ret) 576 + goto out_unlock; 577 + } 578 + 580 579 ret = ovl_real_fdget(file_out, &real_out); 581 580 if (ret) 582 - return ret; 581 + goto out_unlock; 583 582 584 583 ret = ovl_real_fdget(file_in, &real_in); 585 584 if (ret) { 586 585 fdput(real_out); 587 - return ret; 586 + goto out_unlock; 588 587 } 589 588 590 589 old_cred = ovl_override_creds(file_inode(file_out)->i_sb); ··· 621 602 622 603 fdput(real_in); 623 604 fdput(real_out); 605 + 606 + out_unlock: 607 + inode_unlock(inode_out); 624 608 625 609 return ret; 626 610 }
+2 -2
include/linux/fs.h
··· 3133 3133 extern struct inode *new_inode_pseudo(struct super_block *sb); 3134 3134 extern struct inode *new_inode(struct super_block *sb); 3135 3135 extern void free_inode_nonrcu(struct inode *inode); 3136 - extern int should_remove_suid(struct dentry *); 3136 + extern int setattr_should_drop_suidgid(struct user_namespace *, struct inode *); 3137 3137 extern int file_remove_privs(struct file *); 3138 3138 3139 3139 /* ··· 3564 3564 3565 3565 static inline bool is_sxid(umode_t mode) 3566 3566 { 3567 - return (mode & S_ISUID) || ((mode & S_ISGID) && (mode & S_IXGRP)); 3567 + return mode & (S_ISUID | S_ISGID); 3568 3568 } 3569 3569 3570 3570 static inline int check_sticky(struct user_namespace *mnt_userns,