
sched: Clean up active_mm reference counting

The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

user -> user

When switching between two user tasks, all we need to consider
is switch_mm().

user -> kernel

When switching from a user task to a kernel task (which
doesn't have an associated mm) we retain the last mm in our
active_mm. Increment a reference count on active_mm.

kernel -> kernel

When switching between kernel threads, all we need to do is
pass along the active_mm reference.

kernel -> user

When switching between a kernel and user task, we must switch
from the last active_mm to the next mm, hoping of course that
these are the same. Decrement a reference on the active_mm.

The code handles these in a different order than listed, because as
you'll note, both 'to user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: luto@kernel.org

+29 -18
kernel/sched/core.c
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
···
 	arch_start_context_switch(prev);
 
 	/*
-	 * If mm is non-NULL, we pass through switch_mm(). If mm is
-	 * NULL, we will pass through mmdrop() in finish_task_switch().
-	 * Both of these contain the full memory barrier required by
-	 * membarrier after storing to rq->curr, before returning to
-	 * user-space.
+	 * kernel -> kernel   lazy + transfer active
+	 *   user -> kernel   lazy + mmgrab() active
+	 *
+	 * kernel ->   user   switch + mmdrop() active
+	 *   user ->   user   switch
 	 */
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
-
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+	if (!next->mm) {				// to kernel
+		enter_lazy_tlb(prev->active_mm, next);
+
+		next->active_mm = prev->active_mm;
+		if (prev->mm)				// from user
+			mmgrab(prev->active_mm);
+		else
+			prev->active_mm = NULL;
+	} else {					// to user
+		/*
+		 * sys_membarrier() requires an smp_mb() between setting
+		 * rq->curr and returning to userspace.
+		 *
+		 * The below provides this either through switch_mm(), or in
+		 * case 'prev->active_mm == next->mm' through
+		 * finish_task_switch()'s mmdrop().
+		 */
+
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+		if (!prev->mm) {			// from kernel
+			/* will mmdrop() in finish_task_switch(). */
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);