sched/rt: Fix double enqueue caused by rt_effective_prio

Double enqueues on the rt runqueue lists have been reported while running
a simple test that spawns a number of threads doing a short sleep/run
pattern while they are concurrently setscheduled between the rt and fair
classes.
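
A minimal userspace reproducer along those lines might look as follows.
This is an illustrative sketch only, not the reported test; the thread
count, priorities and timings are made up, and SCHED_FIFO requires root.
The traces below are from a PREEMPT_RT kernel, where in-kernel rtmutexes
provide the priority boosting the race needs.

  /* repro.c -- build with: gcc -O2 -pthread repro.c -o repro */
  #include <pthread.h>
  #include <sched.h>
  #include <unistd.h>

  #define NTHREADS 16

  static pthread_t workers[NTHREADS];

  static void *worker(void *arg)
  {
          for (;;) {
                  usleep(1000);                   /* short sleep ...    */
                  for (volatile int i = 0; i < 100000; i++)
                          ;                       /* ... then short run */
          }
          return NULL;
  }

  static void *flipper(void *arg)
  {
          struct sched_param rt = { .sched_priority = 10 };
          struct sched_param fair = { .sched_priority = 0 };

          /* hammer the workers with rt <-> fair class switches */
          for (;;) {
                  for (int i = 0; i < NTHREADS; i++) {
                          pthread_setschedparam(workers[i], SCHED_FIFO, &rt);
                          pthread_setschedparam(workers[i], SCHED_OTHER, &fair);
                  }
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t f;

          for (int i = 0; i < NTHREADS; i++)
                  pthread_create(&workers[i], NULL, worker, NULL);
          pthread_create(&f, NULL, flipper, NULL);
          pthread_join(f, NULL);  /* runs until the kernel complains */
          return 0;
  }

On an affected kernel this kind of load eventually triggers: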

WARNING: CPU: 3 PID: 2825 at kernel/sched/rt.c:1294 enqueue_task_rt+0x355/0x360
CPU: 3 PID: 2825 Comm: setsched__13
RIP: 0010:enqueue_task_rt+0x355/0x360
Call Trace:
__sched_setscheduler+0x581/0x9d0
_sched_setscheduler+0x63/0xa0
do_sched_setscheduler+0xa0/0x150
__x64_sys_sched_setscheduler+0x1a/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae

list_add double add: new=ffff9867cb629b40, prev=ffff9867cb629b40,
next=ffff98679fc67ca0.
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 3 PID: 2825 Comm: setsched__13
RIP: 0010:__list_add_valid+0x41/0x50
Call Trace:
enqueue_task_rt+0x291/0x360
__sched_setscheduler+0x581/0x9d0
_sched_setscheduler+0x63/0xa0
do_sched_setscheduler+0xa0/0x150
__x64_sys_sched_setscheduler+0x1a/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae

__sched_setscheduler() uses rt_effective_prio() to handle proper queuing
of priority boosted tasks that are setscheduled while being boosted.
rt_effective_prio() is, however, called twice per __sched_setscheduler()
invocation: first directly by __sched_setscheduler() before dequeuing the
task, and then by __setscheduler() to actually do the priority change.
If the priority of the pi_top_task is concurrently being changed, the
two calls might return different results. If, for example, the first
call returned the same rt priority the task was running at and the
second one a fair priority, the task won't be removed from the rt list
(on_list still set) but will then be enqueued on the fair runqueue. When
it is eventually setscheduled back to rt, it will be seen as already
enqueued and the WARNING/BUG above will be issued.
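
In pseudocode, the pre-fix sequence of events is roughly the following
(a heavily simplified sketch; locking, error paths and most details are
elided, and the comments mark the race window):

  __sched_setscheduler(p, attr /* fair params */, ..., pi=true)
  {
      oldprio = p->prio;                    /* boosted rt priority     */

      /* 1st call: boost still active, returns the old rt priority */
      new_effective_prio = rt_effective_prio(p, newprio);
      if (new_effective_prio == oldprio)
          queue_flags &= ~DEQUEUE_MOVE;

      dequeue_task(rq, p, queue_flags);     /* rt on_list stays set    */

      /* <-- pi_top_task priority changes here --> */

      __setscheduler(rq, p, attr, pi);
          /* 2nd call: now returns a fair priority, hence ... */
          p->prio = rt_effective_prio(p, normal_prio(p));
          p->sched_class = &fair_sched_class;

      enqueue_task(rq, p, queue_flags);     /* enqueued as fair while
                                               the stale rt list entry
                                               is left in place        */
  }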

Fix this by calling rt_effective_prio() only once and then reusing the
return value. While at it, refactor the code as well for clarity.
Concurrent priority inheritance handling is still safe and will
eventually converge to a new state by following the inheritance
chain(s).
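
With the fix applied (see the diff below), the flow becomes, at the same
level of simplification:

  newprio = __normal_prio(policy, attr->sched_priority, attr->sched_nice);
  if (pi)
      newprio = rt_effective_prio(p, newprio);  /* single call      */
  ...
  dequeue_task(rq, p, queue_flags);
  if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
      __setscheduler_params(p, attr);
      __setscheduler_prio(p, newprio);          /* reuses newprio   */
  }
  enqueue_task(rq, p, queue_flags);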

Fixes: 0782e63bc6fe ("sched: Handle priority boosted tasks proper in setscheduler()")
[squashed Peterz changes; added changelog]
Reported-by: Mark Simmons <msimmons@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210803104501.38333-1-juri.lelli@redhat.com

---
 kernel/sched/core.c | 90 ++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 55 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
···
 	dequeue_task(rq, p, flags);
 }
 
-/*
- * __normal_prio - return the priority that is based on the static prio
- */
-static inline int __normal_prio(struct task_struct *p)
+static inline int __normal_prio(int policy, int rt_prio, int nice)
 {
-	return p->static_prio;
+	int prio;
+
+	if (dl_policy(policy))
+		prio = MAX_DL_PRIO - 1;
+	else if (rt_policy(policy))
+		prio = MAX_RT_PRIO - 1 - rt_prio;
+	else
+		prio = NICE_TO_PRIO(nice);
+
+	return prio;
 }
 
 /*
···
  */
 static inline int normal_prio(struct task_struct *p)
 {
-	int prio;
-
-	if (task_has_dl_policy(p))
-		prio = MAX_DL_PRIO-1;
-	else if (task_has_rt_policy(p))
-		prio = MAX_RT_PRIO-1 - p->rt_priority;
-	else
-		prio = __normal_prio(p);
-	return prio;
+	return __normal_prio(p->policy, p->rt_priority, PRIO_TO_NICE(p->static_prio));
 }
 
 /*
···
 	} else if (PRIO_TO_NICE(p->static_prio) < 0)
 		p->static_prio = NICE_TO_PRIO(0);
 
-	p->prio = p->normal_prio = __normal_prio(p);
+	p->prio = p->normal_prio = p->static_prio;
 	set_load_weight(p, false);
 
 	/*
···
 }
 EXPORT_SYMBOL(default_wake_function);
 
+static void __setscheduler_prio(struct task_struct *p, int prio)
+{
+	if (dl_prio(prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
+		p->sched_class = &rt_sched_class;
+	else
+		p->sched_class = &fair_sched_class;
+
+	p->prio = prio;
+}
+
 #ifdef CONFIG_RT_MUTEXES
 
 static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
···
 		} else {
 			p->dl.pi_se = &p->dl;
 		}
-		p->sched_class = &dl_sched_class;
 	} else if (rt_prio(prio)) {
 		if (dl_prio(oldprio))
 			p->dl.pi_se = &p->dl;
 		if (oldprio < prio)
 			queue_flag |= ENQUEUE_HEAD;
-		p->sched_class = &rt_sched_class;
 	} else {
 		if (dl_prio(oldprio))
 			p->dl.pi_se = &p->dl;
 		if (rt_prio(oldprio))
 			p->rt.timeout = 0;
-		p->sched_class = &fair_sched_class;
 	}
 
-	p->prio = prio;
+	__setscheduler_prio(p, prio);
 
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
···
 	set_load_weight(p, true);
 }
 
-/* Actually do priority change: must hold pi & rq lock. */
-static void __setscheduler(struct rq *rq, struct task_struct *p,
-			   const struct sched_attr *attr, bool keep_boost)
-{
-	/*
-	 * If params can't change scheduling class changes aren't allowed
-	 * either.
-	 */
-	if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
-		return;
-
-	__setscheduler_params(p, attr);
-
-	/*
-	 * Keep a potential priority boosting if called from
-	 * sched_setscheduler().
-	 */
-	p->prio = normal_prio(p);
-	if (keep_boost)
-		p->prio = rt_effective_prio(p, p->prio);
-
-	if (dl_prio(p->prio))
-		p->sched_class = &dl_sched_class;
-	else if (rt_prio(p->prio))
-		p->sched_class = &rt_sched_class;
-	else
-		p->sched_class = &fair_sched_class;
-}
-
 /*
  * Check the target process has a UID that matches the current process's:
  */
···
 				const struct sched_attr *attr,
 				bool user, bool pi)
 {
-	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
-		      MAX_RT_PRIO - 1 - attr->sched_priority;
-	int retval, oldprio, oldpolicy = -1, queued, running;
-	int new_effective_prio, policy = attr->sched_policy;
+	int oldpolicy = -1, policy = attr->sched_policy;
+	int retval, oldprio, newprio, queued, running;
 	const struct sched_class *prev_class;
 	struct callback_head *head;
 	struct rq_flags rf;
···
 	p->sched_reset_on_fork = reset_on_fork;
 	oldprio = p->prio;
 
+	newprio = __normal_prio(policy, attr->sched_priority, attr->sched_nice);
 	if (pi) {
 		/*
 		 * Take priority boosted tasks into account. If the new
···
 		 * the runqueue. This will be done when the task deboost
 		 * itself.
 		 */
-		new_effective_prio = rt_effective_prio(p, newprio);
-		if (new_effective_prio == oldprio)
+		newprio = rt_effective_prio(p, newprio);
+		if (newprio == oldprio)
 			queue_flags &= ~DEQUEUE_MOVE;
 	}
···
 
 	prev_class = p->sched_class;
 
-	__setscheduler(rq, p, attr, pi);
+	if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
+		__setscheduler_params(p, attr);
+		__setscheduler_prio(p, newprio);
+	}
 	__setscheduler_uclamp(p, attr);
 
 	if (queued) {