sched/mmcid: Protect transition on weakly ordered systems

Shrikanth reported a hard lockup which he observed once. The stack traces
show the following CID-related participants:

watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
NIP: mm_get_cid+0xe8/0x188
LR: mm_get_cid+0x108/0x188
mm_cid_switch_to+0x3c4/0x52c
__schedule+0x47c/0x700
schedule_idle+0x3c/0x64
do_idle+0x160/0x1b0
cpu_startup_entry+0x48/0x50
start_secondary+0x284/0x288
start_secondary_prolog+0x10/0x14

watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
NIP: plpar_hcall_norets_notrace+0x18/0x2c
LR: queued_spin_lock_slowpath+0xd88/0x15d0
_raw_spin_lock+0x80/0xa0
raw_spin_rq_lock_nested+0x3c/0xf8
mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
sched_mm_cid_exit+0x108/0x22c
do_exit+0xf4/0x5d0
make_task_dead+0x0/0x178
system_call_exception+0x128/0x390
system_call_vectored_common+0x15c/0x2ec

The task on CPU 11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU 23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.

After decoding a similar issue in the opposite direction, switching from per
task to per CPU mode, the tool which models the possible scenarios failed to
come up with a similar loophole.

This showed up only once, was not reproducible and, according to the
tooling, not related to an overlooked scheduling scenario permutation. But
the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.

The transition mechanism does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, new_mode);

    fixup()

    WRITE_ONCE(mm->mm_cid.transit, 0);

mm_cid_schedin() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
        ...
        cid |= READ_ONCE(mm->mm_cid.transit);

so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence a task will
not drop its CID when scheduling out before the fixup is complete. That
means the CID space can be exhausted, the next task scheduling in loops in
mm_get_cid() with the runqueue lock held, and the fixup thread livelocks on
that held runqueue lock as seen above.
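
For illustration, a rough sketch of the problematic interleaving for the per
CPU to per task direction (new_mode == 0); this is a simplified scenario,
not an exact trace of the reported lockup:

    CPU A (mode change):
        WRITE_ONCE(mm_cid.transit, MM_CID_TRANSIT);
        WRITE_ONCE(mm_cid.percpu, 0);       // may become visible before the transit store

    CPU B (mm_cid_schedin(), concurrent):
        READ_ONCE(mm_cid.percpu)  == 0      // new mode already visible
        READ_ONCE(mm_cid.transit) == 0      // transit store not visible yet

    CPU A:
        fixup();
        WRITE_ONCE(mm_cid.transit, 0);

CPU B concludes that per task mode is fully established and keeps its CID
across schedule out, although the fixup on CPU A has not finished.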

This could obviously be solved by using:

    smp_store_release(&mm->mm_cid.percpu, true);
and
    smp_load_acquire(&mm->mm_cid.percpu);

but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.
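
Spelled out, that rejected variant would pair a release store on the mode
change side with an acquire load in mm_cid_schedin(), roughly like this
(sketch only, not the actual patch):

    /* Mode change side */
    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    smp_store_release(&mm->mm_cid.percpu, new_mode);

    /* mm_cid_schedin() side, runs on every context switch */
    if (!smp_load_acquire(&mm->mm_cid.percpu))
        ...
    cid |= READ_ONCE(mm->mm_cid.transit);

The acquire guarantees that a reader which observes the new percpu value
also observes the transit store which preceded it, but it puts the barrier
into the context switch path.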

That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.

That makes the update of both states atomic and a concurrent read always
observes a consistent state.
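
Condensed from the patch below, the combined scheme looks roughly like this
(sketch; new_mode is shorthand for the final mode value and cid_on_cpu() is
assumed to test the MM_CID_ONCPU bit):

    /* Mode change: flip ONCPU and set TRANSIT with a single store */
    WRITE_ONCE(mc->mode, mc->mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU));
    smp_mb();    /* order the store before the fixup's rq lock acquires */

    /* ... fixup runs ... */

    smp_mb();    /* order the fixups before dropping TRANSIT */
    WRITE_ONCE(mc->mode, new_mode);    /* 0 or MM_CID_ONCPU */

    /* mm_cid_schedin(): one READ_ONCE() yields a consistent mode */
    mode = READ_ONCE(mm->mm_cid.mode);
    if (likely(!cid_on_cpu(mode)))
        /* per task path */
    ...
    cid |= mode & MM_CID_TRANSIT;

The smp_mb() calls live in the rare mode change and fixup completion paths,
not in the scheduler hotpath.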

The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org

---
 include/linux/rseq_types.h |  +2  -4
 kernel/sched/core.c        | +44 -22
 kernel/sched/sched.h       | +12  -9
 3 files changed, 58 insertions(+), 35 deletions(-)

diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
@@ -121,8 +121,7 @@
  /**
   * struct mm_mm_cid - Storage for per MM CID data
   * @pcpu: Per CPU storage for CIDs associated to a CPU
- * @percpu: Set, when CIDs are in per CPU mode
- * @transit: Set to MM_CID_TRANSIT during a mode change transition phase
+ * @mode: Indicates per CPU and transition mode
   * @max_cids: The exclusive maximum CID value for allocation and convergence
   * @irq_work: irq_work to handle the affinity mode change case
   * @work: Regular work to handle the affinity mode change case
@@ -138,8 +139,7 @@
 struct mm_mm_cid {
         /* Hotpath read mostly members */
         struct mm_cid_pcpu __percpu *pcpu;
-        unsigned int percpu;
-        unsigned int transit;
+        unsigned int mode;
         unsigned int max_cids;

         /* Rarely used. Moves @lock and @mutex into the second cacheline */

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
@@ -10297,16 +10297,25 @@
  *
  * Mode switching:
  *
+ * The ownership mode is per process and stored in mm:mm_cid::mode with the
+ * following possible states:
+ *
+ *   0:                             Per task ownership
+ *   0 | MM_CID_TRANSIT:            Transition from per CPU to per task
+ *   MM_CID_ONCPU:                  Per CPU ownership
+ *   MM_CID_ONCPU | MM_CID_TRANSIT: Transition from per task to per CPU
+ *
  * All transitions of ownership mode happen in two phases:
  *
- * 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the CIDs
- *    and denotes that the CID is only temporarily owned by a task. When
- *    the task schedules out it drops the CID back into the pool if this
- *    bit is set.
+ * 1) mm:mm_cid::mode has the MM_CID_TRANSIT bit set. This is OR'ed on the
+ *    CIDs and denotes that the CID is only temporarily owned by a
+ *    task. When the task schedules out it drops the CID back into the
+ *    pool if this bit is set.
  *
  * 2) The initiating context walks the per CPU space or the tasks to fixup
- *    or drop the CIDs and after completion it clears mm:mm_cid.transit.
- *    After that point the CIDs are strictly task or CPU owned again.
+ *    or drop the CIDs and after completion it clears MM_CID_TRANSIT in
+ *    mm:mm_cid::mode. After that point the CIDs are strictly task or CPU
+ *    owned again.
  *
  * This two phase transition is required to prevent CID space exhaustion
  * during the transition as a direct transfer of ownership would fail:
@@ -10420,5 +10411,6 @@
 static bool mm_update_max_cids(struct mm_struct *mm)
 {
         struct mm_mm_cid *mc = &mm->mm_cid;
+        bool percpu = cid_on_cpu(mc->mode);

         lockdep_assert_held(&mm->mm_cid.lock);
@@ -10429,7 +10419,7 @@
         __mm_update_max_cids(mc);

         /* Check whether owner mode must be changed */
-        if (!mc->percpu) {
+        if (!percpu) {
                 /* Enable per CPU mode when the number of users is above max_cids */
                 if (mc->users > mc->max_cids)
                         mc->pcpu_thrs = mm_cid_calc_pcpu_thrs(mc);
@@ -10440,11 +10430,16 @@
         }

         /* Mode change required? */
-        if (!!mc->percpu == !!mc->pcpu_thrs)
+        if (percpu == !!mc->pcpu_thrs)
                 return false;

-        /* Set the transition flag to bridge the transfer */
-        WRITE_ONCE(mc->transit, MM_CID_TRANSIT);
-        WRITE_ONCE(mc->percpu, !!mc->pcpu_thrs);
+        /* Flip the mode and set the transition flag to bridge the transfer */
+        WRITE_ONCE(mc->mode, mc->mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU));
+        /*
+         * Order the store against the subsequent fixups so that
+         * acquire(rq::lock) cannot be reordered by the CPU before the
+         * store.
+         */
+        smp_mb();
         return true;
 }
@@ -10475,6 +10460,6 @@

         WRITE_ONCE(mc->nr_cpus_allowed, weight);
         __mm_update_max_cids(mc);
-        if (!mc->percpu)
+        if (!cid_on_cpu(mc->mode))
                 return;
         /* Adjust the threshold to the wider set */
@@ -10491,6 +10476,16 @@
         /* Queue the irq work, which schedules the real work */
         mc->update_deferred = true;
         irq_work_queue(&mc->irq_work);
+}
+
+static inline void mm_cid_complete_transit(struct mm_struct *mm, unsigned int mode)
+{
+        /*
+         * Ensure that the store removing the TRANSIT bit cannot be
+         * reordered by the CPU before the fixups have been completed.
+         */
+        smp_mb();
+        WRITE_ONCE(mm->mm_cid.mode, mode);
 }

 static inline void mm_cid_transit_to_task(struct task_struct *t, struct mm_cid_pcpu *pcp)
@@ -10546,7 +10521,6 @@
                         }
                 }
         }
-        /* Clear the transition bit */
-        WRITE_ONCE(mm->mm_cid.transit, 0);
+        mm_cid_complete_transit(mm, 0);
 }
 static inline void mm_cid_transit_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
@@ -10618,8 +10594,7 @@
         struct mm_struct *mm = current->mm;

         mm_cid_do_fixup_tasks_to_cpus(mm);
-        /* Clear the transition bit */
-        WRITE_ONCE(mm->mm_cid.transit, 0);
+        mm_cid_complete_transit(mm, MM_CID_ONCPU);
 }

 static bool sched_mm_cid_add_user(struct task_struct *t, struct mm_struct *mm)
@@ -10649,13 +10626,13 @@
         }

         if (!sched_mm_cid_add_user(t, mm)) {
-                if (!mm->mm_cid.percpu)
+                if (!cid_on_cpu(mm->mm_cid.mode))
                         t->mm_cid.cid = mm_get_cid(mm);
                 return;
         }

         /* Handle the mode change and transfer current's CID */
-        percpu = !!mm->mm_cid.percpu;
+        percpu = cid_on_cpu(mm->mm_cid.mode);
         if (!percpu)
                 mm_cid_transit_to_task(current, pcp);
         else
@@ -10694,7 +10671,7 @@
          * affinity change increased the number of allowed CPUs and the
          * deferred fixup did not run yet.
          */
-        if (WARN_ON_ONCE(mm->mm_cid.percpu))
+        if (WARN_ON_ONCE(cid_on_cpu(mm->mm_cid.mode)))
                 return false;
         /*
          * A failed fork(2) cleanup never gets here, so @current must have
@@ -10785,7 +10762,7 @@
                 if (!mm_update_max_cids(mm))
                         return;
                 /* Affinity changes can only switch back to task mode */
-                if (WARN_ON_ONCE(mm->mm_cid.percpu))
+                if (WARN_ON_ONCE(cid_on_cpu(mm->mm_cid.mode)))
                         return;
         }
         mm_cid_fixup_cpus_to_tasks(mm);
@@ -10806,8 +10783,7 @@
 void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 {
         mm->mm_cid.max_cids = 0;
-        mm->mm_cid.percpu = 0;
-        mm->mm_cid.transit = 0;
+        mm->mm_cid.mode = 0;
         mm->mm_cid.nr_cpus_allowed = p->nr_cpus_allowed;
         mm->mm_cid.users = 0;
         mm->mm_cid.pcpu_thrs = 0;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
@@ -3816,7 +3816,8 @@
         __this_cpu_write(mm->mm_cid.pcpu->cid, cid);
 }

-static __always_inline void mm_cid_from_cpu(struct task_struct *t, unsigned int cpu_cid)
+static __always_inline void mm_cid_from_cpu(struct task_struct *t, unsigned int cpu_cid,
+                                            unsigned int mode)
 {
         unsigned int max_cids, tcid = t->mm_cid.cid;
         struct mm_struct *mm = t->mm;
@@ -3843,15 +3842,16 @@
                 if (!cid_on_cpu(cpu_cid))
                         cpu_cid = cid_to_cpu_cid(mm_get_cid(mm));

-                /* Set the transition mode flag if required */
-                if (READ_ONCE(mm->mm_cid.transit))
+                /* Handle the transition mode flag if required */
+                if (mode & MM_CID_TRANSIT)
                         cpu_cid = cpu_cid_to_cid(cpu_cid) | MM_CID_TRANSIT;
         }
         mm_cid_update_pcpu_cid(mm, cpu_cid);
         mm_cid_update_task_cid(t, cpu_cid);
 }

-static __always_inline void mm_cid_from_task(struct task_struct *t, unsigned int cpu_cid)
+static __always_inline void mm_cid_from_task(struct task_struct *t, unsigned int cpu_cid,
+                                             unsigned int mode)
 {
         unsigned int max_cids, tcid = t->mm_cid.cid;
         struct mm_struct *mm = t->mm;
@@ -3878,7 +3876,7 @@
                 if (!cid_on_task(tcid))
                         tcid = mm_get_cid(mm);
                 /* Set the transition mode flag if required */
-                tcid |= READ_ONCE(mm->mm_cid.transit);
+                tcid |= mode & MM_CID_TRANSIT;
         }
         mm_cid_update_pcpu_cid(mm, tcid);
         mm_cid_update_task_cid(t, tcid);
@@ -3887,16 +3885,17 @@
 static __always_inline void mm_cid_schedin(struct task_struct *next)
 {
         struct mm_struct *mm = next->mm;
-        unsigned int cpu_cid;
+        unsigned int cpu_cid, mode;

         if (!next->mm_cid.active)
                 return;

         cpu_cid = __this_cpu_read(mm->mm_cid.pcpu->cid);
-        if (likely(!READ_ONCE(mm->mm_cid.percpu)))
-                mm_cid_from_task(next, cpu_cid);
+        mode = READ_ONCE(mm->mm_cid.mode);
+        if (likely(!cid_on_cpu(mode)))
+                mm_cid_from_task(next, cpu_cid, mode);
         else
-                mm_cid_from_cpu(next, cpu_cid);
+                mm_cid_from_cpu(next, cpu_cid, mode);
 }

 static __always_inline void mm_cid_schedout(struct task_struct *prev)