sched/mmcid: Protect transition on weakly ordered systems

Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:

watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
NIP: mm_get_cid+0xe8/0x188
LR: mm_get_cid+0x108/0x188
mm_cid_switch_to+0x3c4/0x52c
__schedule+0x47c/0x700
schedule_idle+0x3c/0x64
do_idle+0x160/0x1b0
cpu_startup_entry+0x48/0x50
start_secondary+0x284/0x288
start_secondary_prolog+0x10/0x14

watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
NIP: plpar_hcall_norets_notrace+0x18/0x2c
LR: queued_spin_lock_slowpath+0xd88/0x15d0
_raw_spin_lock+0x80/0xa0
raw_spin_rq_lock_nested+0x3c/0xf8
mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
sched_mm_cid_exit+0x108/0x22c
do_exit+0xf4/0x5d0
make_task_dead+0x0/0x178
system_call_exception+0x128/0x390
system_call_vectored_common+0x15c/0x2ec

The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.

After decoding a similar issue in the opposite direction, i.e. switching from
per task to per CPU mode, the tool which models the possible scheduling
scenarios failed to come up with a similar loophole.

This showed up only once, was not reproducible, and according to the tooling
is not related to an overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.

The transition mechanism does:

WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
WRITE_ONCE(mm->mm_cid.percpu, new_mode);

fixup()

WRITE_ONCE(mm->mm_cid.transit, 0);

mm_cid_schedin() does:

if (!READ_ONCE(mm->mm_cid.percpu))
...
cid |= READ_ONCE(mm->mm_cid.transit);

so weakly ordered systems can observe percpu == false and transit == 0 even
though the fixup function has not completed yet. As a consequence the task
does not drop the CID when scheduling out before the fixup is complete. That
can exhaust the CID space, so the next task scheduling in loops in
mm_get_cid() with the runqueue lock held, and the fixup thread livelocks on
that lock as seen in the stack traces above.
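
For illustration only (not part of the patch), a minimal user space model of
that race. The names and the MM_CID_TRANSIT value below are placeholders; C11
relaxed atomics permit the same store/load reordering as the PowerPC memory
model, and the flagged outcome is exactly the window described above:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MM_CID_TRANSIT	0x40000000U	/* placeholder value for the model */

static atomic_uint percpu  = 1;		/* per CPU mode before the switch */
static atomic_uint transit = 0;

/* Models the start of the per CPU -> per task switch. The fixup has not
 * run yet, so transit is never cleared in this model.
 */
static void *mode_switch(void *arg)
{
	atomic_store_explicit(&transit, MM_CID_TRANSIT, memory_order_relaxed);
	atomic_store_explicit(&percpu, 0, memory_order_relaxed);
	return NULL;
}

/* Models mm_cid_schedin(): loads percpu first, transit second */
static void *schedin(void *arg)
{
	unsigned int p = atomic_load_explicit(&percpu, memory_order_relaxed);
	unsigned int t = atomic_load_explicit(&transit, memory_order_relaxed);

	/* Seeing the new mode (percpu == 0) without MM_CID_TRANSIT means the
	 * stores were observed out of order.
	 */
	if (!p && !t)
		puts("new mode observed without MM_CID_TRANSIT");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, mode_switch, NULL);
	pthread_create(&b, NULL, schedin, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}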

This could obviously be solved by using:
smp_store_release(&mm->mm_cid.percpu, true);
and
smp_load_acquire(&mm->mm_cid.percpu);

but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.
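
For reference, a sketch (not the patch) of what that rejected variant would
look like in the schedule-in path. The function name mm_cid_schedin_acquire()
is made up; the fields and helpers are those of the pre-fix code shown in the
diff below:

/*
 * Rejected alternative: pair an acquire load in the hotpath with
 * smp_store_release() on mm_cid.percpu in the mode switch code. Correct,
 * but it puts a barrier into every schedule-in on weakly ordered machines.
 */
static __always_inline void mm_cid_schedin_acquire(struct task_struct *next)
{
	struct mm_struct *mm = next->mm;
	unsigned int cpu_cid;

	if (!next->mm_cid.active)
		return;

	cpu_cid = __this_cpu_read(mm->mm_cid.pcpu->cid);
	/* The acquire keeps the later mm_cid.transit loads from being
	 * reordered before the percpu load.
	 */
	if (likely(!smp_load_acquire(&mm->mm_cid.percpu)))
		mm_cid_from_task(next, cpu_cid);
	else
		mm_cid_from_cpu(next, cpu_cid);
}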

That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.

That makes the update of both states atomic, and a concurrent read always
observes a consistent state.
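
A small self-contained demo of the resulting mode state machine. The bit
values are placeholders; only the XOR flip and the TRANSIT clear mirror what
the patch below does:

#include <assert.h>

#define MM_CID_TRANSIT	0x40000000U	/* placeholder bit values */
#define MM_CID_ONCPU	0x80000000U

/* Mirrors WRITE_ONCE(mc->mode, mc->mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU)):
 * one store flips the ownership bit and raises TRANSIT at the same time.
 */
static unsigned int start_transition(unsigned int mode)
{
	return mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU);
}

/* The fixup drops TRANSIT when it is done, as mm_cid_complete_transit()
 * does in the patch.
 */
static unsigned int complete_transition(unsigned int mode)
{
	return mode & ~MM_CID_TRANSIT;
}

int main(void)
{
	unsigned int mode = 0;			/* per task mode */

	mode = start_transition(mode);		/* ONCPU | TRANSIT */
	assert(mode == (MM_CID_ONCPU | MM_CID_TRANSIT));
	mode = complete_transition(mode);	/* per CPU mode */
	assert(mode == MM_CID_ONCPU);

	mode = start_transition(mode);		/* 0 | TRANSIT */
	assert(mode == MM_CID_TRANSIT);
	mode = complete_transition(mode);	/* back to per task mode */
	assert(mode == 0);
	return 0;
}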

The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org


3 files changed: +58 -35

include/linux/rseq_types.h (+2 -4)
···
 /**
  * struct mm_mm_cid - Storage for per MM CID data
  * @pcpu:	Per CPU storage for CIDs associated to a CPU
- * @percpu:	Set, when CIDs are in per CPU mode
- * @transit:	Set to MM_CID_TRANSIT during a mode change transition phase
+ * @mode:	Indicates per CPU and transition mode
  * @max_cids:	The exclusive maximum CID value for allocation and convergence
  * @irq_work:	irq_work to handle the affinity mode change case
  * @work:	Regular work to handle the affinity mode change case
···
 struct mm_mm_cid {
 	/* Hotpath read mostly members */
 	struct mm_cid_pcpu __percpu *pcpu;
-	unsigned int percpu;
-	unsigned int transit;
+	unsigned int mode;
 	unsigned int max_cids;
 
 	/* Rarely used. Moves @lock and @mutex into the second cacheline */
kernel/sched/core.c (+44 -22)
···
  *
  * Mode switching:
  *
+ * The ownership mode is per process and stored in mm:mm_cid::mode with the
+ * following possible states:
+ *
+ *   0:				Per task ownership
+ *   0 | MM_CID_TRANSIT:		Transition from per CPU to per task
+ *   MM_CID_ONCPU:			Per CPU ownership
+ *   MM_CID_ONCPU | MM_CID_TRANSIT:	Transition from per task to per CPU
+ *
  * All transitions of ownership mode happen in two phases:
  *
- * 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the CIDs
- *    and denotes that the CID is only temporarily owned by a task. When
- *    the task schedules out it drops the CID back into the pool if this
- *    bit is set.
+ * 1) mm:mm_cid::mode has the MM_CID_TRANSIT bit set. This is OR'ed on the
+ *    CIDs and denotes that the CID is only temporarily owned by a
+ *    task. When the task schedules out it drops the CID back into the
+ *    pool if this bit is set.
  *
  * 2) The initiating context walks the per CPU space or the tasks to fixup
- *    or drop the CIDs and after completion it clears mm:mm_cid.transit.
- *    After that point the CIDs are strictly task or CPU owned again.
+ *    or drop the CIDs and after completion it clears MM_CID_TRANSIT in
+ *    mm:mm_cid::mode. After that point the CIDs are strictly task or CPU
+ *    owned again.
  *
  * This two phase transition is required to prevent CID space exhaustion
  * during the transition as a direct transfer of ownership would fail:
···
 static bool mm_update_max_cids(struct mm_struct *mm)
 {
 	struct mm_mm_cid *mc = &mm->mm_cid;
+	bool percpu = cid_on_cpu(mc->mode);
 
 	lockdep_assert_held(&mm->mm_cid.lock);
 
···
 	__mm_update_max_cids(mc);
 
 	/* Check whether owner mode must be changed */
-	if (!mc->percpu) {
+	if (!percpu) {
 		/* Enable per CPU mode when the number of users is above max_cids */
 		if (mc->users > mc->max_cids)
 			mc->pcpu_thrs = mm_cid_calc_pcpu_thrs(mc);
···
 	}
 
 	/* Mode change required? */
-	if (!!mc->percpu == !!mc->pcpu_thrs)
+	if (percpu == !!mc->pcpu_thrs)
 		return false;
 
-	/* Set the transition flag to bridge the transfer */
-	WRITE_ONCE(mc->transit, MM_CID_TRANSIT);
-	WRITE_ONCE(mc->percpu, !!mc->pcpu_thrs);
+	/* Flip the mode and set the transition flag to bridge the transfer */
+	WRITE_ONCE(mc->mode, mc->mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU));
+	/*
+	 * Order the store against the subsequent fixups so that
+	 * acquire(rq::lock) cannot be reordered by the CPU before the
+	 * store.
+	 */
+	smp_mb();
 	return true;
 }
···
 
 	WRITE_ONCE(mc->nr_cpus_allowed, weight);
 	__mm_update_max_cids(mc);
-	if (!mc->percpu)
+	if (!cid_on_cpu(mc->mode))
 		return;
 
 	/* Adjust the threshold to the wider set */
···
 	/* Queue the irq work, which schedules the real work */
 	mc->update_deferred = true;
 	irq_work_queue(&mc->irq_work);
+}
+
+static inline void mm_cid_complete_transit(struct mm_struct *mm, unsigned int mode)
+{
+	/*
+	 * Ensure that the store removing the TRANSIT bit cannot be
+	 * reordered by the CPU before the fixups have been completed.
+	 */
+	smp_mb();
+	WRITE_ONCE(mm->mm_cid.mode, mode);
 }
 
 static inline void mm_cid_transit_to_task(struct task_struct *t, struct mm_cid_pcpu *pcp)
···
 			}
 		}
 	}
-	/* Clear the transition bit */
-	WRITE_ONCE(mm->mm_cid.transit, 0);
+	mm_cid_complete_transit(mm, 0);
 }
···
 static inline void mm_cid_transit_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
···
 	struct mm_struct *mm = current->mm;
 
 	mm_cid_do_fixup_tasks_to_cpus(mm);
-	/* Clear the transition bit */
-	WRITE_ONCE(mm->mm_cid.transit, 0);
+	mm_cid_complete_transit(mm, MM_CID_ONCPU);
 }
 
 static bool sched_mm_cid_add_user(struct task_struct *t, struct mm_struct *mm)
···
 	}
 
 	if (!sched_mm_cid_add_user(t, mm)) {
-		if (!mm->mm_cid.percpu)
+		if (!cid_on_cpu(mm->mm_cid.mode))
 			t->mm_cid.cid = mm_get_cid(mm);
 		return;
 	}
 
 	/* Handle the mode change and transfer current's CID */
-	percpu = !!mm->mm_cid.percpu;
+	percpu = cid_on_cpu(mm->mm_cid.mode);
 	if (!percpu)
 		mm_cid_transit_to_task(current, pcp);
 	else
···
 	 * affinity change increased the number of allowed CPUs and the
 	 * deferred fixup did not run yet.
 	 */
-	if (WARN_ON_ONCE(mm->mm_cid.percpu))
+	if (WARN_ON_ONCE(cid_on_cpu(mm->mm_cid.mode)))
 		return false;
 	/*
 	 * A failed fork(2) cleanup never gets here, so @current must have
···
 		if (!mm_update_max_cids(mm))
 			return;
 		/* Affinity changes can only switch back to task mode */
-		if (WARN_ON_ONCE(mm->mm_cid.percpu))
+		if (WARN_ON_ONCE(cid_on_cpu(mm->mm_cid.mode)))
 			return;
 	}
 	mm_cid_fixup_cpus_to_tasks(mm);
···
 void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 {
 	mm->mm_cid.max_cids = 0;
-	mm->mm_cid.percpu = 0;
-	mm->mm_cid.transit = 0;
+	mm->mm_cid.mode = 0;
 	mm->mm_cid.nr_cpus_allowed = p->nr_cpus_allowed;
 	mm->mm_cid.users = 0;
 	mm->mm_cid.pcpu_thrs = 0;
kernel/sched/sched.h (+12 -9)
···
 	__this_cpu_write(mm->mm_cid.pcpu->cid, cid);
 }
 
-static __always_inline void mm_cid_from_cpu(struct task_struct *t, unsigned int cpu_cid)
+static __always_inline void mm_cid_from_cpu(struct task_struct *t, unsigned int cpu_cid,
+					    unsigned int mode)
 {
 	unsigned int max_cids, tcid = t->mm_cid.cid;
 	struct mm_struct *mm = t->mm;
···
 		if (!cid_on_cpu(cpu_cid))
 			cpu_cid = cid_to_cpu_cid(mm_get_cid(mm));
 
-		/* Set the transition mode flag if required */
-		if (READ_ONCE(mm->mm_cid.transit))
+		/* Handle the transition mode flag if required */
+		if (mode & MM_CID_TRANSIT)
 			cpu_cid = cpu_cid_to_cid(cpu_cid) | MM_CID_TRANSIT;
 	}
 	mm_cid_update_pcpu_cid(mm, cpu_cid);
 	mm_cid_update_task_cid(t, cpu_cid);
 }
 
-static __always_inline void mm_cid_from_task(struct task_struct *t, unsigned int cpu_cid)
+static __always_inline void mm_cid_from_task(struct task_struct *t, unsigned int cpu_cid,
+					     unsigned int mode)
 {
 	unsigned int max_cids, tcid = t->mm_cid.cid;
 	struct mm_struct *mm = t->mm;
···
 		if (!cid_on_task(tcid))
 			tcid = mm_get_cid(mm);
 		/* Set the transition mode flag if required */
-		tcid |= READ_ONCE(mm->mm_cid.transit);
+		tcid |= mode & MM_CID_TRANSIT;
 	}
 	mm_cid_update_pcpu_cid(mm, tcid);
 	mm_cid_update_task_cid(t, tcid);
···
 static __always_inline void mm_cid_schedin(struct task_struct *next)
 {
 	struct mm_struct *mm = next->mm;
-	unsigned int cpu_cid;
+	unsigned int cpu_cid, mode;
 
 	if (!next->mm_cid.active)
 		return;
 
 	cpu_cid = __this_cpu_read(mm->mm_cid.pcpu->cid);
-	if (likely(!READ_ONCE(mm->mm_cid.percpu)))
-		mm_cid_from_task(next, cpu_cid);
+	mode = READ_ONCE(mm->mm_cid.mode);
+	if (likely(!cid_on_cpu(mode)))
+		mm_cid_from_task(next, cpu_cid, mode);
 	else
-		mm_cid_from_cpu(next, cpu_cid);
+		mm_cid_from_cpu(next, cpu_cid, mode);
 }
 
 static __always_inline void mm_cid_schedout(struct task_struct *prev)