eventpoll: Set epoll timeout if it's in the future

Avoid an edge case where epoll_wait arms a timer and calls schedule()
even if the timer will expire immediately.

For example: if the user has specified an epoll busy poll usecs value
that is equal to or larger than the epoll_wait/epoll_pwait2 timeout, the
busy poll period has already consumed the entire timeout duration.
Calling schedule_hrtimeout_range in that case is unnecessary and only
induces scheduling latency by calling schedule().

This can be measured using a simple bpftrace script:

  tracepoint:sched:sched_switch
  / args->prev_pid == $1 /
  {
    print(kstack());
    print(ustack());
  }

Before this patch is applied:

Testing an epoll_wait app with busy poll usecs set to 1000, and
epoll_wait timeout set to 1ms using the script above shows:

__traceiter_sched_switch+69
__schedule+1495
schedule+32
schedule_hrtimeout_range+159
do_epoll_wait+1424
__x64_sys_epoll_wait+97
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118

epoll_wait+82

This is unexpected; the busy poll usecs should have consumed the
entire timeout and there should be no reason to arm a timer.

After this patch is applied: the same test scenario does not generate a
call to schedule() in the above edge case. If the busy poll usecs are
reduced (for example usecs: 100, epoll_wait timeout 1ms) the timer is
armed as expected.

Fixes: bf3b9f6372c4 ("epoll: Add busy poll support to epoll with socket fds.")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/20250416185826.26375-1-jdamato@fastly.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>

authored by Joe Damato and committed by Christian Brauner 0a65bc27 a681b7c1

+9 -1
fs/eventpoll.c
@@ -1996,6 +1996,14 @@
 	return res;
 }
 
+static int ep_schedule_timeout(ktime_t *to)
+{
+	if (to)
+		return ktime_after(*to, ktime_get());
+	else
+		return 1;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2103,7 +2111,7 @@
 
 		write_unlock_irq(&ep->lock);
 
-		if (!eavail)
+		if (!eavail && ep_schedule_timeout(to))
 			timed_out = !schedule_hrtimeout_range(to, slack,
 							      HRTIMER_MODE_ABS);
 		__set_current_state(TASK_RUNNING);
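
The decision made by ep_schedule_timeout() can be illustrated with a small userspace model. This is only a sketch, not kernel code: the model_* names are hypothetical, the clock is passed in as a parameter instead of being read with ktime_get(), and the busy poll phase is assumed to end exactly busy_usecs after the call started (in practice it would end slightly later).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for ktime_t: a time in nanoseconds. */
typedef int64_t model_ktime_t;

/* Models ktime_after(a, b): true only when a is strictly later than b. */
static bool model_ktime_after(model_ktime_t a, model_ktime_t b)
{
	return a > b;
}

/*
 * Models ep_schedule_timeout(). 'to' is the absolute deadline, or NULL
 * when the caller asked to block without a timeout. 'now' replaces the
 * kernel's ktime_get() so the model is deterministic.
 * Returns nonzero when arming the timer and sleeping still makes sense.
 */
static int model_ep_schedule_timeout(const model_ktime_t *to,
				     model_ktime_t now)
{
	if (to)
		return model_ktime_after(*to, now);
	return 1;
}

/*
 * Assumption for illustration: the call starts at time 0, the deadline
 * is timeout_usecs later, and busy polling ends at exactly busy_usecs.
 */
static int model_should_sleep(int64_t busy_usecs, int64_t timeout_usecs)
{
	model_ktime_t deadline = timeout_usecs * 1000; /* usecs to nsecs */
	model_ktime_t now = busy_usecs * 1000;

	return model_ep_schedule_timeout(&deadline, now);
}
```

Under these assumptions, model_should_sleep(1000, 1000) returns 0, matching the case from the commit message where busy poll usecs of 1000 consume a 1ms timeout, while model_should_sleep(100, 1000) returns 1, matching the case where the timer is still armed.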