Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
(in the case of LOOKUP_BENEATH the resolution will still fail if ".."
resolution would resolve a path outside of the root -- while
LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
still disallowed entirely[*].

As Jann explains[1,2], the need for this patch (and the original no-".."
restriction) is explained by observing there is a fairly easy-to-exploit
race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
used to "skip over" nd->root and thus escape to the filesystem above
nd->root.

thread1 [attacker]:
for (;;)
renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;)
openat2(dirb, "b/c/../../etc/shadow",
{ .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

The primary reason for deferring to userspace with -EAGAIN is that an
in-kernel retry loop (or doing a path_is_under() check after re-taking
the relevant seqlocks) can become unreasonably expensive on machines
with lots of VFS activity (nfsd can cause lots of rename_lock updates).
Thus it should be up to userspace how many times they wish to retry the
lookup -- the selftests for this attack indicate that there is a ~35%
chance of the lookup succeeding on the first try even with an attacker
thrashing rename_lock.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check for
magic-links after they are resolved. However this seems unlikely to
be a feature that people *really* need -- it can be added later if
it turns out a lot of people want it.

[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

authored by

Aleksa Sarai and committed by
Al Viro
ab87f9a5 8db52c7e

+27 -16
+27 -16
fs/namei.c
··· 491 491 struct path root; 492 492 struct inode *inode; /* path.dentry.d_inode */ 493 493 unsigned int flags; 494 - unsigned seq, m_seq; 494 + unsigned seq, m_seq, r_seq; 495 495 int last_type; 496 496 unsigned depth; 497 497 int total_link_count; ··· 1793 1793 if (type == LAST_DOTDOT) { 1794 1794 int error = 0; 1795 1795 1796 - /* 1797 - * Scoped-lookup flags resolving ".." is not currently safe -- 1798 - * races can cause our parent to have moved outside of the root 1799 - * and us to skip over it. 1800 - */ 1801 - if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) 1802 - return -EXDEV; 1803 1796 if (!nd->root.mnt) { 1804 1797 error = set_root(nd); 1805 1798 if (error) 1806 1799 return error; 1807 1800 } 1808 - if (nd->flags & LOOKUP_RCU) { 1809 - return follow_dotdot_rcu(nd); 1810 - } else 1811 - return follow_dotdot(nd); 1801 + if (nd->flags & LOOKUP_RCU) 1802 + error = follow_dotdot_rcu(nd); 1803 + else 1804 + error = follow_dotdot(nd); 1805 + if (error) 1806 + return error; 1807 + 1808 + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) { 1809 + /* 1810 + * If there was a racing rename or mount along our 1811 + * path, then we can't be sure that ".." hasn't jumped 1812 + * above nd->root (and so userspace should retry or use 1813 + * some fallback). 1814 + */ 1815 + smp_rmb(); 1816 + if (unlikely(__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq))) 1817 + return -EAGAIN; 1818 + if (unlikely(__read_seqcount_retry(&rename_lock.seqcount, nd->r_seq))) 1819 + return -EAGAIN; 1820 + } 1812 1821 } 1813 1822 return 0; 1814 1823 } ··· 2282 2273 nd->last_type = LAST_ROOT; /* if there are only slashes... */ 2283 2274 nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; 2284 2275 nd->depth = 0; 2276 + 2277 + nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount); 2278 + nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount); 2279 + smp_rmb(); 2280 + 2285 2281 if (flags & LOOKUP_ROOT) { 2286 2282 struct dentry *root = nd->root.dentry; 2287 2283 struct inode *inode = root->d_inode; ··· 2295 2281 nd->path = nd->root; 2296 2282 nd->inode = inode; 2297 2283 if (flags & LOOKUP_RCU) { 2298 - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); 2284 + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); 2299 2285 nd->root_seq = nd->seq; 2300 - nd->m_seq = read_seqbegin(&mount_lock); 2301 2286 } else { 2302 2287 path_get(&nd->path); 2303 2288 } ··· 2306 2293 nd->root.mnt = NULL; 2307 2294 nd->path.mnt = NULL; 2308 2295 nd->path.dentry = NULL; 2309 - 2310 - nd->m_seq = read_seqbegin(&mount_lock); 2311 2296 2312 2297 /* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */ 2313 2298 if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {