Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

exit: support non-blocking pidfds

Passing a non-blocking pidfd to waitid() currently has no effect, i.e. is not
supported. There are users which would like to use waitid() on pidfds that are
O_NONBLOCK and mix it with pidfds that are blocking and both pass them to
waitid().
The expected behavior is to have waitid() return -EAGAIN for non-blocking
pidfds and to block for blocking pidfds without needing to perform any
additional checks for flags set on the pidfd before passing it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child process is
ready yet. Returning -EAGAIN for non-blocking pidfds makes it easier for event
loops that handle EAGAIN specially.

It also makes the API more consistent and uniform. In essence, waitid() is
treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports MSG_DONTWAIT
for pidfds waitid() supports WNOHANG. Both flags are per-call options. In
contrast non-blocking pidfds and non-blocking sockets are a setting on an open
file description affecting all threads in the calling process as well as other
processes that hold file descriptors referring to the same open file
description. Both behaviors, per call and per open file description, have
genuine use-cases.

The implementation should be straightforward:
- If a non-blocking pidfd is passed and WNOHANG is not raised we simply raise
the WNOHANG flag internally. When do_wait() returns indicating that there are
eligible child processes but none have exited yet we set EAGAIN. If no child
process exists we continue returning ECHILD.
- If a non-blocking pidfd is passed and WNOHANG is raised waitid() will
continue returning 0, i.e. it will not set EAGAIN. This ensure backwards
compatibility with applications passing WNOHANG explicitly with pidfds.

A concrete use-case that was brought on-list was Josh's async pidfd library.
Ever since the introduction of pidfds and more advanced async io various
programming languages such as Rust have grown support for async event
libraries. These libraries are created to help build epoll-based event loops
around file descriptors. A common pattern is to automatically make all file
descriptors they manage to O_NONBLOCK.

For such libraries the EAGAIN error code is treated specially. When a function
is called that returns EAGAIN the function isn't called again until the event
loop indicates the the file descriptor is ready. Supporting EAGAIN when
waiting on pidfds makes such libraries just work with little effort.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Link: https://github.com/joshtriplett/async-pidfd
Link: https://lore.kernel.org/r/20200902102130.147672-3-christian.brauner@ubuntu.com

+12 -3
+12 -3
kernel/exit.c
··· 1474 1474 return retval; 1475 1475 } 1476 1476 1477 - static struct pid *pidfd_get_pid(unsigned int fd) 1477 + static struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags) 1478 1478 { 1479 1479 struct fd f; 1480 1480 struct pid *pid; ··· 1484 1484 return ERR_PTR(-EBADF); 1485 1485 1486 1486 pid = pidfd_pid(f.file); 1487 - if (!IS_ERR(pid)) 1487 + if (!IS_ERR(pid)) { 1488 1488 get_pid(pid); 1489 + *flags = f.file->f_flags; 1490 + } 1489 1491 1490 1492 fdput(f); 1491 1493 return pid; ··· 1500 1498 struct pid *pid = NULL; 1501 1499 enum pid_type type; 1502 1500 long ret; 1501 + unsigned int f_flags = 0; 1503 1502 1504 1503 if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED| 1505 1504 __WNOTHREAD|__WCLONE|__WALL)) ··· 1534 1531 if (upid < 0) 1535 1532 return -EINVAL; 1536 1533 1537 - pid = pidfd_get_pid(upid); 1534 + pid = pidfd_get_pid(upid, &f_flags); 1538 1535 if (IS_ERR(pid)) 1539 1536 return PTR_ERR(pid); 1537 + 1540 1538 break; 1541 1539 default: 1542 1540 return -EINVAL; ··· 1548 1544 wo.wo_flags = options; 1549 1545 wo.wo_info = infop; 1550 1546 wo.wo_rusage = ru; 1547 + if (f_flags & O_NONBLOCK) 1548 + wo.wo_flags |= WNOHANG; 1549 + 1551 1550 ret = do_wait(&wo); 1551 + if (!ret && !(options & WNOHANG) && (f_flags & O_NONBLOCK)) 1552 + ret = -EAGAIN; 1552 1553 1553 1554 put_pid(pid); 1554 1555 return ret;