Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace updates from Eric Biederman:
"Long ago and far away when user namespaces were young it was realized
that allowing fresh mounts of proc and sysfs with only user namespace
permissions could violate the basic rule that only root gets to decide
if proc or sysfs should be mounted at all.

Some hacks were put in place to reduce the worst of the damage that could
be done, and the common sense rule was adopted that fresh mounts of
proc and sysfs should allow no more than bind mounts of proc and
sysfs. Unfortunately that rule has not been fully enforced.

There are two kinds of gaps in that enforcement. Only filesystems
mounted on empty directories of proc and sysfs should be ignored but
the test for empty directories was insufficient. So in my tree
directories on proc, sysctl and sysfs that will always be empty are
created specially. Every other technique is imperfect as an ordinary
directory can have entries added even after a readdir returns and
shows that the directory is empty. Special creation of directories
for mount points makes the code in the kernel a smidge clearer about
its purpose. I asked container developers from the various container
projects to help test this and no holes were found in the set of mount
points on proc and sysfs that are created specially.

This set of changes also starts enforcing that the mount flags of fresh
mounts of proc and sysfs are consistent with the existing mount of
proc and sysfs. I expected this to be the boring part of the work but
unfortunately unprivileged userspace winds up mounting fresh copies of
proc and sysfs with noexec and nosuid clear when root set those flags
on the previous mount of proc and sysfs. So for now only the atime,
read-only and nodev attributes which userspace happens to keep
consistent are enforced. Dealing with the noexec and nosuid
attributes remains for another time.

This set of changes also addresses an issue with how open file
descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
/proc/<pid>/fd has been triggering a WARN_ON that has not been
meaningful since it was added (as all of the code in the kernel was
converted) and is now actively wrong.

There is also a short list of issues that have not been fixed yet that
I will mention briefly.

It is possible to rename a directory from below to above a bind mount.
At which point any directory pointers below the renamed directory can
be walked up to the root directory of the filesystem. With user
namespaces enabled a bind mount of the bind mount can be created
allowing the user to pick a directory whose children they can rename
to outside of the bind mount. This is challenging to fix and doubly
so because all obvious solutions must touch code that is in the
performance-critical part of pathname resolution.

As mentioned above there is also a question of how to ensure that
developers by accident or with purpose do not introduce executable
files on sysfs and proc and in doing so introduce security regressions
in the current userspace that will not be immediately obvious and as
such are likely to require breaking userspace in painful ways once
they are recognized."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Remove incorrect debugging WARN in prepend_path
mnt: Update fs_fully_visible to test for permanently empty directories
sysfs: Create mountpoints with sysfs_create_mount_point
sysfs: Add support for permanently empty directories to serve as mount points.
kernfs: Add support for always empty directories.
proc: Allow creating permanently empty directories that serve as mount points
sysctl: Allow creating permanently empty directories that serve as mountpoints.
fs: Add helper functions for permanently empty directories.
vfs: Ignore unlocked mounts in fs_fully_visible
mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
mnt: Refactor the logic for mounting sysfs and proc in a user namespace

+340 -101
+4 -8
arch/s390/hypfs/inode.c
···
 	.show_options	= hypfs_show_options,
 };

-static struct kobject *s390_kobj;
-
 static int __init hypfs_init(void)
 {
 	int rc;
···
 		rc = -ENODATA;
 		goto fail_hypfs_sprp_exit;
 	}
-	s390_kobj = kobject_create_and_add("s390", hypervisor_kobj);
-	if (!s390_kobj) {
-		rc = -ENOMEM;
+	rc = sysfs_create_mount_point(hypervisor_kobj, "s390");
+	if (rc)
 		goto fail_hypfs_diag0c_exit;
-	}
 	rc = register_filesystem(&hypfs_type);
 	if (rc)
 		goto fail_filesystem;
 	return 0;

 fail_filesystem:
-	kobject_put(s390_kobj);
+	sysfs_remove_mount_point(hypervisor_kobj, "s390");
 fail_hypfs_diag0c_exit:
 	hypfs_diag0c_exit();
 fail_hypfs_sprp_exit:
···
 static void __exit hypfs_exit(void)
 {
 	unregister_filesystem(&hypfs_type);
-	kobject_put(s390_kobj);
+	sysfs_remove_mount_point(hypervisor_kobj, "s390");
 	hypfs_diag0c_exit();
 	hypfs_sprp_exit();
 	hypfs_vm_exit();
+2 -4
drivers/firmware/efi/efi.c
···
 early_param("efi", parse_efi_cmdline);

 struct kobject *efi_kobj;
-static struct kobject *efivars_kobj;

 /*
  * Let's not leave out systab information that snuck into
···
 		goto err_remove_group;

 	/* and the standard mountpoint for efivarfs */
-	efivars_kobj = kobject_create_and_add("efivars", efi_kobj);
-	if (!efivars_kobj) {
+	error = sysfs_create_mount_point(efi_kobj, "efivars");
+	if (error) {
 		pr_err("efivars: Subsystem registration failed.\n");
-		error = -ENOMEM;
 		goto err_remove_group;
 	}

+4 -6
fs/configfs/mount.c
···
 }


-static struct kobject *config_kobj;
-
 static int __init configfs_init(void)
 {
 	int err = -ENOMEM;
···
 	if (!configfs_dir_cachep)
 		goto out;

-	config_kobj = kobject_create_and_add("config", kernel_kobj);
-	if (!config_kobj)
+	err = sysfs_create_mount_point(kernel_kobj, "config");
+	if (err)
 		goto out2;

 	err = register_filesystem(&configfs_fs_type);
···
 	return 0;
 out3:
 	pr_err("Unable to register filesystem!\n");
-	kobject_put(config_kobj);
+	sysfs_remove_mount_point(kernel_kobj, "config");
 out2:
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
···
 static void __exit configfs_exit(void)
 {
 	unregister_filesystem(&configfs_fs_type);
-	kobject_put(config_kobj);
+	sysfs_remove_mount_point(kernel_kobj, "config");
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
 }
-11
fs/dcache.c
···
 				vfsmnt = &mnt->mnt;
 				continue;
 			}
-			/*
-			 * Filesystems needing to implement special "root names"
-			 * should do so with ->d_dname()
-			 */
-			if (IS_ROOT(dentry) &&
-			   (dentry->d_name.len != 1 ||
-			    dentry->d_name.name[0] != '/')) {
-				WARN(1, "Root dentry has weird name <%.*s>\n",
-				     (int) dentry->d_name.len,
-				     dentry->d_name.name);
-			}
 			if (!error)
 				error = is_mounted(vfsmnt) ? 1 : 2;
 			break;
+4 -7
fs/debugfs/inode.c
···
 }
 EXPORT_SYMBOL_GPL(debugfs_initialized);

-
-static struct kobject *debug_kobj;
-
 static int __init debugfs_init(void)
 {
 	int retval;

-	debug_kobj = kobject_create_and_add("debug", kernel_kobj);
-	if (!debug_kobj)
-		return -EINVAL;
+	retval = sysfs_create_mount_point(kernel_kobj, "debug");
+	if (retval)
+		return retval;

 	retval = register_filesystem(&debug_fs_type);
 	if (retval)
-		kobject_put(debug_kobj);
+		sysfs_remove_mount_point(kernel_kobj, "debug");
 	else
 		debugfs_registered = true;

+3 -6
fs/fuse/inode.c
···
 }

 static struct kobject *fuse_kobj;
-static struct kobject *connections_kobj;

 static int fuse_sysfs_init(void)
 {
···
 		goto out_err;
 	}

-	connections_kobj = kobject_create_and_add("connections", fuse_kobj);
-	if (!connections_kobj) {
-		err = -ENOMEM;
+	err = sysfs_create_mount_point(fuse_kobj, "connections");
+	if (err)
 		goto out_fuse_unregister;
-	}

 	return 0;

···

 static void fuse_sysfs_cleanup(void)
 {
-	kobject_put(connections_kobj);
+	sysfs_remove_mount_point(fuse_kobj, "connections");
 	kobject_put(fuse_kobj);
 }

+37 -1
fs/kernfs/dir.c
···
 		goto out_unlock;

 	ret = -ENOENT;
+	if (parent->flags & KERNFS_EMPTY_DIR)
+		goto out_unlock;
+
 	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
 		goto out_unlock;

···
 	kn->dir.root = parent->dir.root;
 	kn->ns = ns;
 	kn->priv = priv;
+
+	/* link in */
+	rc = kernfs_add_one(kn);
+	if (!rc)
+		return kn;
+
+	kernfs_put(kn);
+	return ERR_PTR(rc);
+}
+
+/**
+ * kernfs_create_empty_dir - create an always empty directory
+ * @parent: parent in which to create a new directory
+ * @name: name of the new directory
+ *
+ * Returns the created node on success, ERR_PTR() value on failure.
+ */
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name)
+{
+	struct kernfs_node *kn;
+	int rc;
+
+	/* allocate */
+	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, KERNFS_DIR);
+	if (!kn)
+		return ERR_PTR(-ENOMEM);
+
+	kn->flags |= KERNFS_EMPTY_DIR;
+	kn->dir.root = parent->dir.root;
+	kn->ns = NULL;
+	kn->priv = NULL;

 	/* link in */
 	rc = kernfs_add_one(kn);
···
 	mutex_lock(&kernfs_mutex);

 	error = -ENOENT;
-	if (!kernfs_active(kn) || !kernfs_active(new_parent))
+	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
+	    (new_parent->flags & KERNFS_EMPTY_DIR))
 		goto out;

 	error = 0;
+2
fs/kernfs/inode.c
···
 	case KERNFS_DIR:
 		inode->i_op = &kernfs_dir_iops;
 		inode->i_fop = &kernfs_dir_fops;
+		if (kn->flags & KERNFS_EMPTY_DIR)
+			make_empty_dir_inode(inode);
 		break;
 	case KERNFS_FILE:
 		inode->i_size = kn->attr.size;
+95
fs/libfs.c
···
 	.readlink = generic_readlink
 };
 EXPORT_SYMBOL(simple_symlink_inode_operations);
+
+/*
+ * Operations for a permanently empty directory.
+ */
+static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return ERR_PTR(-ENOENT);
+}
+
+static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+				 struct kstat *stat)
+{
+	struct inode *inode = d_inode(dentry);
+	generic_fillattr(inode, stat);
+	return 0;
+}
+
+static int empty_dir_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return -EPERM;
+}
+
+static int empty_dir_setxattr(struct dentry *dentry, const char *name,
+			      const void *value, size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_getxattr(struct dentry *dentry, const char *name,
+				  void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int empty_dir_removexattr(struct dentry *dentry, const char *name)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct inode_operations empty_dir_inode_operations = {
+	.lookup		= empty_dir_lookup,
+	.permission	= generic_permission,
+	.setattr	= empty_dir_setattr,
+	.getattr	= empty_dir_getattr,
+	.setxattr	= empty_dir_setxattr,
+	.getxattr	= empty_dir_getxattr,
+	.removexattr	= empty_dir_removexattr,
+	.listxattr	= empty_dir_listxattr,
+};
+
+static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+	/* An empty directory has two entries . and .. at offsets 0 and 1 */
+	return generic_file_llseek_size(file, offset, whence, 2, 2);
+}
+
+static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
+{
+	dir_emit_dots(file, ctx);
+	return 0;
+}
+
+static const struct file_operations empty_dir_operations = {
+	.llseek		= empty_dir_llseek,
+	.read		= generic_read_dir,
+	.iterate	= empty_dir_readdir,
+	.fsync		= noop_fsync,
+};
+
+
+void make_empty_dir_inode(struct inode *inode)
+{
+	set_nlink(inode, 2);
+	inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	inode->i_uid = GLOBAL_ROOT_UID;
+	inode->i_gid = GLOBAL_ROOT_GID;
+	inode->i_rdev = 0;
+	inode->i_size = 2;
+	inode->i_blkbits = PAGE_SHIFT;
+	inode->i_blocks = 0;
+
+	inode->i_op = &empty_dir_inode_operations;
+	inode->i_fop = &empty_dir_operations;
+}
+
+bool is_empty_dir_inode(struct inode *inode)
+{
+	return (inode->i_fop == &empty_dir_operations) &&
+		(inode->i_op == &empty_dir_inode_operations);
+}
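The helpers above recognise a permanently empty directory by the identity of its operation tables, not by counting links or entries, which is why the check cannot be fooled by an ordinary directory that merely happens to be empty at the moment. A minimal userspace sketch of that idea follows. Every `model_` name, the struct layouts, and the table contents are hypothetical stand-ins for illustration, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's operation tables. */
struct model_inode_ops { const char *label; };
struct model_file_ops { const char *label; };

struct model_inode {
	unsigned int nlink;
	const struct model_inode_ops *i_op;
	const struct model_file_ops *i_fop;
};

/* Private tables: only model_make_empty_dir() can install them. */
static const struct model_inode_ops empty_dir_iops = { "empty-iops" };
static const struct model_file_ops empty_dir_fops = { "empty-fops" };

/* Tables of an ordinary directory, for contrast. */
static const struct model_inode_ops plain_dir_iops = { "plain-iops" };
static const struct model_file_ops plain_dir_fops = { "plain-fops" };

void model_make_plain_dir(struct model_inode *inode)
{
	inode->nlink = 2;	/* currently has no subdirectories */
	inode->i_op = &plain_dir_iops;
	inode->i_fop = &plain_dir_fops;
}

void model_make_empty_dir(struct model_inode *inode)
{
	inode->nlink = 2;
	inode->i_op = &empty_dir_iops;
	inode->i_fop = &empty_dir_fops;
}

/* Link-count heuristic: true for any directory without subdirectories. */
bool model_looks_empty(const struct model_inode *inode)
{
	return inode->nlink <= 2;
}

/* Identity test: true only for directories created as permanently empty. */
bool model_is_empty_dir(const struct model_inode *inode)
{
	return inode->i_fop == &empty_dir_fops &&
	       inode->i_op == &empty_dir_iops;
}

void model_empty_dir_example(void)
{
	struct model_inode plain, empty;

	model_make_plain_dir(&plain);
	model_make_empty_dir(&empty);

	/* The heuristic cannot tell the two apart; the identity test can. */
	assert(model_looks_empty(&plain) && model_looks_empty(&empty));
	assert(!model_is_empty_dir(&plain));
	assert(model_is_empty_dir(&empty));
}
```

Because the tables are file-local, nothing outside the file can make an inode pass the identity test by accident.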
+33 -6
fs/namespace.c
···
 	return err;
 }

+static bool fs_fully_visible(struct file_system_type *fs_type, int *new_mnt_flags);
+
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
···
 		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
 			flags |= MS_NODEV;
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
+		}
+		if (type->fs_flags & FS_USERNS_VISIBLE) {
+			if (!fs_fully_visible(type, &mnt_flags))
+				return -EPERM;
 		}
 	}

···
 	return chrooted;
 }

-bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
 	bool visible = false;

···
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;

-		/* This mount is not fully visible if there are any child mounts
-		 * that cover anything except for empty directories.
+		/* Verify the mount flags are equal to or more permissive
+		 * than the proposed new mount.
+		 */
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+		    !(new_flags & MNT_READONLY))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+		    !(new_flags & MNT_NODEV))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
+		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
+			continue;
+
+		/* This mount is not fully visible if there are any
+		 * locked child mounts that cover anything except for
+		 * empty directories.
 		 */
 		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
 			struct inode *inode = child->mnt_mountpoint->d_inode;
-			if (!S_ISDIR(inode->i_mode))
-				goto next;
-			if (inode->i_nlink > 2)
+			/* Only worry about locked mounts */
+			if (!(mnt->mnt.mnt_flags & MNT_LOCKED))
+				continue;
+			/* Is the directory permanetly empty? */
+			if (!is_empty_dir_inode(inode))
 				goto next;
 		}
+		/* Preserve the locked attributes */
+		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
+							MNT_LOCK_NODEV    | \
+							MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
 	next:	;
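The flag test added to fs_fully_visible can be read as a small predicate: a fresh mount may stand in for an existing one only if it keeps every locked attribute, and on success it inherits the locks. The sketch below models that predicate for a single existing mount in userspace. The `M_*` flag values and the `model_` functions are hypothetical, chosen for this example rather than taken from the kernel, and the real function additionally walks every mount in the namespace and inspects child mounts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values chosen for this sketch; not the kernel's. */
enum {
	M_READONLY      = 1 << 0,
	M_NODEV         = 1 << 1,
	M_RELATIME      = 1 << 2,
	M_NOATIME       = 1 << 3,
	M_ATIME_MASK    = M_RELATIME | M_NOATIME,
	M_LOCK_READONLY = 1 << 4,
	M_LOCK_NODEV    = 1 << 5,
	M_LOCK_ATIME    = 1 << 6,
	M_LOCK_MASK     = M_LOCK_READONLY | M_LOCK_NODEV | M_LOCK_ATIME
};

/*
 * Return true when a fresh mount requesting *new_flags may stand in for
 * an existing mount carrying `existing`. On success the locked attributes
 * of the existing mount are carried over into *new_flags.
 */
bool model_flags_allow(int existing, int *new_flags)
{
	int want = *new_flags;

	if ((existing & M_LOCK_READONLY) && !(want & M_READONLY))
		return false;
	if ((existing & M_LOCK_NODEV) && !(want & M_NODEV))
		return false;
	if ((existing & M_LOCK_ATIME) &&
	    (existing & M_ATIME_MASK) != (want & M_ATIME_MASK))
		return false;

	/* Preserve the locked attributes on the new mount. */
	*new_flags |= existing & M_LOCK_MASK;
	return true;
}

void model_flags_example(void)
{
	int existing = M_READONLY | M_LOCK_READONLY | M_NODEV | M_LOCK_NODEV;
	int rw = M_NODEV;		/* tries to drop read-only */
	int ro = M_READONLY | M_NODEV;	/* keeps both locked attributes */

	assert(!model_flags_allow(existing, &rw));
	assert(model_flags_allow(existing, &ro));
	assert(ro & M_LOCK_READONLY);
	assert(ro & M_LOCK_NODEV);
}
```

Note that noexec and nosuid are deliberately absent from the sketch, matching the pull message's statement that those attributes are not yet enforced.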
+23
fs/proc/generic.c
··· 373 373 WARN(1, "create '/proc/%s' by hand\n", qstr.name); 374 374 return NULL; 375 375 } 376 + if (is_empty_pde(*parent)) { 377 + WARN(1, "attempt to add to permanently empty directory"); 378 + return NULL; 379 + } 376 380 377 381 ent = kzalloc(sizeof(struct proc_dir_entry) + qstr.len + 1, GFP_KERNEL); 378 382 if (!ent) ··· 458 454 return proc_mkdir_data(name, 0, parent, NULL); 459 455 } 460 456 EXPORT_SYMBOL(proc_mkdir); 457 + 458 + struct proc_dir_entry *proc_create_mount_point(const char *name) 459 + { 460 + umode_t mode = S_IFDIR | S_IRUGO | S_IXUGO; 461 + struct proc_dir_entry *ent, *parent = NULL; 462 + 463 + ent = __proc_create(&parent, name, mode, 2); 464 + if (ent) { 465 + ent->data = NULL; 466 + ent->proc_fops = NULL; 467 + ent->proc_iops = NULL; 468 + if (proc_register(parent, ent) < 0) { 469 + kfree(ent); 470 + parent->nlink--; 471 + ent = NULL; 472 + } 473 + } 474 + return ent; 475 + } 461 476 462 477 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, 463 478 struct proc_dir_entry *parent,
+4
fs/proc/inode.c
··· 422 422 inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; 423 423 PROC_I(inode)->pde = de; 424 424 425 + if (is_empty_pde(de)) { 426 + make_empty_dir_inode(inode); 427 + return inode; 428 + } 425 429 if (de->mode) { 426 430 inode->i_mode = de->mode; 427 431 inode->i_uid = de->uid;
+6
fs/proc/internal.h
···
 }
 extern void pde_put(struct proc_dir_entry *);

+static inline bool is_empty_pde(const struct proc_dir_entry *pde)
+{
+	return S_ISDIR(pde->mode) && !pde->proc_iops;
+}
+struct proc_dir_entry *proc_create_mount_point(const char *name);
+
 /*
  * inode.c
  */
+37
fs/proc/proc_sysctl.c
···
 static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;

+/* Support for permanently empty directories */
+
+struct ctl_table sysctl_mount_point[] = {
+	{ }
+};
+
+static bool is_empty_dir(struct ctl_table_header *head)
+{
+	return head->ctl_table[0].child == sysctl_mount_point;
+}
+
+static void set_empty_dir(struct ctl_dir *dir)
+{
+	dir->header.ctl_table[0].child = sysctl_mount_point;
+}
+
+static void clear_empty_dir(struct ctl_dir *dir)
+
+{
+	dir->header.ctl_table[0].child = NULL;
+}
+
 void proc_sys_poll_notify(struct ctl_table_poll *poll)
 {
 	if (!poll)
···
 	struct ctl_table *entry;
 	int err;

+	/* Is this a permanently empty directory? */
+	if (is_empty_dir(&dir->header))
+		return -EROFS;
+
+	/* Am I creating a permanently empty directory? */
+	if (header->ctl_table == sysctl_mount_point) {
+		if (!RB_EMPTY_ROOT(&dir->root))
+			return -EINVAL;
+		set_empty_dir(dir);
+	}
+
 	dir->header.nreg++;
 	header->parent = dir;
 	err = insert_links(header);
···
 	erase_header(header);
 	put_links(header);
 fail_links:
+	if (header->ctl_table == sysctl_mount_point)
+		clear_empty_dir(dir);
 	header->parent = NULL;
 	drop_sysctl_table(&dir->header);
 	return err;
···
 		inode->i_mode |= S_IFDIR;
 		inode->i_op = &proc_sys_dir_operations;
 		inode->i_fop = &proc_sys_dir_file_operations;
+		if (is_empty_dir(head))
+			make_empty_dir_inode(inode);
 	}
 out:
 	return inode;
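The insertion path above enforces two rules: nothing may be registered under a directory already marked permanently empty (-EROFS), and a directory that already has entries may not be turned into one (-EINVAL). A minimal userspace sketch of those two rules follows. `struct model_dir` and `model_insert` are hypothetical, and the boolean `mount_point_marker` argument stands in for the kernel's pointer comparison against `sysctl_mount_point`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a sysctl directory. */
struct model_dir {
	int nentries;
	bool permanently_empty;
};

/*
 * Register something under dir. mount_point_marker is true when the
 * caller is registering the "always empty" sentinel table.
 */
int model_insert(struct model_dir *dir, bool mount_point_marker)
{
	/* Nothing may be added below a permanently empty directory. */
	if (dir->permanently_empty)
		return -EROFS;

	if (mount_point_marker) {
		/* A populated directory cannot become permanently empty. */
		if (dir->nentries != 0)
			return -EINVAL;
		dir->permanently_empty = true;
		return 0;
	}

	dir->nentries++;
	return 0;
}

void model_insert_example(void)
{
	struct model_dir binfmt_misc = { 0, false };

	assert(model_insert(&binfmt_misc, true) == 0);
	assert(model_insert(&binfmt_misc, false) == -EROFS);
	assert(binfmt_misc.nentries == 0);
}
```

The sketch omits the rollback the kernel performs when link insertion fails after the directory has been marked; the real code clears the marker again on that path.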
+3 -6
fs/proc/root.c
···
 		ns = task_active_pid_ns(current);
 		options = data;

-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		/* Does the mounter have privilege over the pid namespace? */
 		if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 			return ERR_PTR(-EPERM);
···
 	.name		= "proc",
 	.mount		= proc_mount,
 	.kill_sb	= proc_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };

 void __init proc_root_init(void)
···
 #endif
 	proc_mkdir("fs", NULL);
 	proc_mkdir("driver", NULL);
-	proc_mkdir("fs/nfsd", NULL); /* somewhere for the nfsd filesystem to be mounted */
+	proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
 #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
 	/* just give it a mountpoint */
-	proc_mkdir("openprom", NULL);
+	proc_create_mount_point("openprom");
 #endif
 	proc_tty_init();
 	proc_mkdir("bus", NULL);
+4 -8
fs/pstore/inode.c
···
 	.kill_sb	= pstore_kill_sb,
 };

-static struct kobject *pstore_kobj;
-
 static int __init init_pstore_fs(void)
 {
-	int err = 0;
+	int err;

 	/* Create a convenient mount point for people to access pstore */
-	pstore_kobj = kobject_create_and_add("pstore", fs_kobj);
-	if (!pstore_kobj) {
-		err = -ENOMEM;
+	err = sysfs_create_mount_point(fs_kobj, "pstore");
+	if (err)
 		goto out;
-	}

 	err = register_filesystem(&pstore_fs_type);
 	if (err < 0)
-		kobject_put(pstore_kobj);
+		sysfs_remove_mount_point(fs_kobj, "pstore");

 out:
 	return err;
+34
fs/sysfs/dir.c
···

 	return kernfs_rename_ns(kn, new_parent, kn->name, new_ns);
 }
+
+/**
+ * sysfs_create_mount_point - create an always empty directory
+ * @parent_kobj:  kobject that will contain this always empty directory
+ * @name: The name of the always empty directory to add
+ */
+int sysfs_create_mount_point(struct kobject *parent_kobj, const char *name)
+{
+	struct kernfs_node *kn, *parent = parent_kobj->sd;
+
+	kn = kernfs_create_empty_dir(parent, name);
+	if (IS_ERR(kn)) {
+		if (PTR_ERR(kn) == -EEXIST)
+			sysfs_warn_dup(parent, name);
+		return PTR_ERR(kn);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(sysfs_create_mount_point);
+
+/**
+ *	sysfs_remove_mount_point - remove an always empty directory.
+ *	@parent_kobj: kobject that will contain this always empty directory
+ *	@name: The name of the always empty directory to remove
+ *
+ */
+void sysfs_remove_mount_point(struct kobject *parent_kobj, const char *name)
+{
+	struct kernfs_node *parent = parent_kobj->sd;
+
+	kernfs_remove_by_name_ns(parent, name, NULL);
+}
+EXPORT_SYMBOL_GPL(sysfs_remove_mount_point);
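Most callers converted in this merge follow the same shape: create the mount point, register the filesystem, and remove the mount point again if registration fails. Because the new calls take a parent and a name and return an error code, callers no longer need to keep a kobject pointer around. The userspace sketch below mocks that ordering. All `mock_` and `model_` names are hypothetical, and the failure switch exists only so the unwind path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mock state standing in for sysfs and the filesystem list (hypothetical). */
static bool mock_mount_point_present;
static bool mock_fs_registered;
static bool mock_register_should_fail;

static int mock_create_mount_point(void)
{
	if (mock_mount_point_present)
		return -EEXIST;
	mock_mount_point_present = true;
	return 0;
}

static void mock_remove_mount_point(void)
{
	mock_mount_point_present = false;
}

static int mock_register_filesystem(void)
{
	if (mock_register_should_fail)
		return -EBUSY;
	mock_fs_registered = true;
	return 0;
}

/* Init: propagate the creation error, undo creation if registration fails. */
int model_fs_init(void)
{
	int err;

	err = mock_create_mount_point();
	if (err)
		return err;

	err = mock_register_filesystem();
	if (err)
		mock_remove_mount_point();
	return err;
}

/* Exit: tear down in the reverse order of init. */
void model_fs_exit(void)
{
	mock_fs_registered = false;
	mock_remove_mount_point();
}

void model_fs_example(void)
{
	mock_register_should_fail = true;
	assert(model_fs_init() == -EBUSY);
	assert(!mock_mount_point_present);	/* unwound on failure */

	mock_register_should_fail = false;
	assert(model_fs_init() == 0);
	assert(mock_mount_point_present && mock_fs_registered);
	model_fs_exit();
}
```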
+1 -4
fs/sysfs/mount.c
···
 	bool new_sb;

 	if (!(flags & MS_KERNMOUNT)) {
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
 			return ERR_PTR(-EPERM);
 	}
···
 	.name		= "sysfs",
 	.mount		= sysfs_mount,
 	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };

 int __init sysfs_init(void)
+2 -4
fs/tracefs/inode.c
···
 	return tracefs_registered;
 }

-static struct kobject *trace_kobj;
-
 static int __init tracefs_init(void)
 {
 	int retval;

-	trace_kobj = kobject_create_and_add("tracing", kernel_kobj);
-	if (!trace_kobj)
+	retval = sysfs_create_mount_point(kernel_kobj, "tracing");
+	if (retval)
 		return -EINVAL;

 	retval = register_filesystem(&trace_fs_type);
+3 -1
include/linux/fs.h
···
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_USERNS_VISIBLE	32	/* FS must already be visible */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
···
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
-extern bool fs_fully_visible(struct file_system_type *);

 extern int current_umask(void);

···
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
 extern const struct file_operations simple_dir_operations;
 extern const struct inode_operations simple_dir_inode_operations;
+extern void make_empty_dir_inode(struct inode *inode);
+extern bool is_empty_dir_inode(struct inode *inode);
 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
 extern int simple_fill_super(struct super_block *, unsigned long, struct tree_descr *);
+3
include/linux/kernfs.h
···
 	KERNFS_LOCKDEP		= 0x0100,
 	KERNFS_SUICIDAL		= 0x0400,
 	KERNFS_SUICIDED		= 0x0800,
+	KERNFS_EMPTY_DIR	= 0x1000,
 };

 /* @flags for kernfs_create_root() */
···
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 void *priv, const void *ns);
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name,
 					 umode_t mode, loff_t size,
+3
include/linux/sysctl.h
···
 void unregister_sysctl_table(struct ctl_table_header * table);

 extern int sysctl_init(void);
+
+extern struct ctl_table sysctl_mount_point[];
+
 #else /* CONFIG_SYSCTL */
 static inline struct ctl_table_header *register_sysctl_table(struct ctl_table * table)
 {
+15
include/linux/sysfs.h
···
 int __must_check sysfs_move_dir_ns(struct kobject *kobj,
 				   struct kobject *new_parent_kobj,
 				   const void *new_ns);
+int __must_check sysfs_create_mount_point(struct kobject *parent_kobj,
+					  const char *name);
+void sysfs_remove_mount_point(struct kobject *parent_kobj,
+			      const char *name);

 int __must_check sysfs_create_file_ns(struct kobject *kobj,
 				      const struct attribute *attr,
···
 				    const void *new_ns)
 {
 	return 0;
+}
+
+static inline int sysfs_create_mount_point(struct kobject *parent_kobj,
+					   const char *name)
+{
+	return 0;
+}
+
+static inline void sysfs_remove_mount_point(struct kobject *parent_kobj,
+					    const char *name)
+{
 }

 static inline int sysfs_create_file_ns(struct kobject *kobj,
+4 -6
kernel/cgroup.c
···
 	.kill_sb = cgroup_kill_sb,
 };

-static struct kobject *cgroup_kobj;
-
 /**
  * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
  * @task: target task
···
 			ss->bind(init_css_set.subsys[ssid]);
 	}

-	cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj);
-	if (!cgroup_kobj)
-		return -ENOMEM;
+	err = sysfs_create_mount_point(fs_kobj, "cgroup");
+	if (err)
+		return err;

 	err = register_filesystem(&cgroup_fs_type);
 	if (err < 0) {
-		kobject_put(cgroup_kobj);
+		sysfs_remove_mount_point(fs_kobj, "cgroup");
 		return err;
 	}

+1 -7
kernel/sysctl.c
···
 	{ }
 };

-#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
-static struct ctl_table binfmt_misc_table[] = {
-	{ }
-};
-#endif
-
 static struct ctl_table fs_table[] = {
 	{
 		.procname	= "inode-nr",
···
 	{
 		.procname	= "binfmt_misc",
 		.mode		= 0555,
-		.child		= binfmt_misc_table,
+		.child		= sysctl_mount_point,
 	},
 #endif
 	{
+4 -6
security/inode.c
···
 }
 EXPORT_SYMBOL_GPL(securityfs_remove);

-static struct kobject *security_kobj;
-
 static int __init securityfs_init(void)
 {
 	int retval;

-	security_kobj = kobject_create_and_add("security", kernel_kobj);
-	if (!security_kobj)
-		return -EINVAL;
+	retval = sysfs_create_mount_point(kernel_kobj, "security");
+	if (retval)
+		return retval;

 	retval = register_filesystem(&fs_type);
 	if (retval)
-		kobject_put(security_kobj);
+		sysfs_remove_mount_point(kernel_kobj, "security");
 	return retval;
 }

+5 -6
security/selinux/selinuxfs.c
···
 };

 struct vfsmount *selinuxfs_mount;
-static struct kobject *selinuxfs_kobj;

 static int __init init_sel_fs(void)
 {
···
 	if (!selinux_enabled)
 		return 0;

-	selinuxfs_kobj = kobject_create_and_add("selinux", fs_kobj);
-	if (!selinuxfs_kobj)
-		return -ENOMEM;
+	err = sysfs_create_mount_point(fs_kobj, "selinux");
+	if (err)
+		return err;

 	err = register_filesystem(&sel_fs_type);
 	if (err) {
-		kobject_put(selinuxfs_kobj);
+		sysfs_remove_mount_point(fs_kobj, "selinux");
 		return err;
 	}

···
 #ifdef CONFIG_SECURITY_SELINUX_DISABLE
 void exit_sel_fs(void)
 {
-	kobject_put(selinuxfs_kobj);
+	sysfs_remove_mount_point(fs_kobj, "selinux");
 	kern_unmount(selinuxfs_mount);
 	unregister_filesystem(&sel_fs_type);
 }
+4 -4
security/smack/smackfs.c
···
 	.llseek		= generic_file_llseek,
 };

-static struct kset *smackfs_kset;
 /**
  * smk_init_sysfs - initialize /sys/fs/smackfs
  *
  */
 static int smk_init_sysfs(void)
 {
-	smackfs_kset = kset_create_and_add("smackfs", NULL, fs_kobj);
-	if (!smackfs_kset)
-		return -ENOMEM;
+	int err;
+	err = sysfs_create_mount_point(fs_kobj, "smackfs");
+	if (err)
+		return err;
 	return 0;
 }
