Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

sched/membarrier: Fix p->mm->membarrier_state racy load

The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which has the
same mm whose state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clearing the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

authored by

Mathieu Desnoyers and committed by
Ingo Molnar
227a4aad 2840cf02

+183 -54
+1 -1
fs/exec.c
··· 1033 1033 } 1034 1034 task_lock(tsk); 1035 1035 active_mm = tsk->active_mm; 1036 + membarrier_exec_mmap(mm); 1036 1037 tsk->mm = mm; 1037 1038 tsk->active_mm = mm; 1038 1039 activate_mm(active_mm, mm); ··· 1826 1825 /* execve succeeded */ 1827 1826 current->fs->in_exec = 0; 1828 1827 current->in_execve = 0; 1829 - membarrier_execve(current); 1830 1828 rseq_execve(current); 1831 1829 acct_update_integrals(current); 1832 1830 task_numa_free(current, false);
+11 -3
include/linux/mm_types.h
··· 383 383 unsigned long highest_vm_end; /* highest vma end address */ 384 384 pgd_t * pgd; 385 385 386 + #ifdef CONFIG_MEMBARRIER 387 + /** 388 + * @membarrier_state: Flags controlling membarrier behavior. 389 + * 390 + * This field is close to @pgd to hopefully fit in the same 391 + * cache-line, which needs to be touched by switch_mm(). 392 + */ 393 + atomic_t membarrier_state; 394 + #endif 395 + 386 396 /** 387 397 * @mm_users: The number of users including userspace. 388 398 * ··· 462 452 unsigned long flags; /* Must use atomic bitops to access */ 463 453 464 454 struct core_state *core_state; /* coredumping support */ 465 - #ifdef CONFIG_MEMBARRIER 466 - atomic_t membarrier_state; 467 - #endif 455 + 468 456 #ifdef CONFIG_AIO 469 457 spinlock_t ioctx_lock; 470 458 struct kioctx_table __rcu *ioctx_table;
+3 -5
include/linux/sched/mm.h
··· 370 370 sync_core_before_usermode(); 371 371 } 372 372 373 - static inline void membarrier_execve(struct task_struct *t) 374 - { 375 - atomic_set(&t->mm->membarrier_state, 0); 376 - } 373 + extern void membarrier_exec_mmap(struct mm_struct *mm); 374 + 377 375 #else 378 376 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS 379 377 static inline void membarrier_arch_switch_mm(struct mm_struct *prev, ··· 380 382 { 381 383 } 382 384 #endif 383 - static inline void membarrier_execve(struct task_struct *t) 385 + static inline void membarrier_exec_mmap(struct mm_struct *mm) 384 386 { 385 387 } 386 388 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+2 -2
kernel/sched/core.c
··· 3358 3358 else 3359 3359 prev->active_mm = NULL; 3360 3360 } else { // to user 3361 + membarrier_switch_mm(rq, prev->active_mm, next->mm); 3361 3362 /* 3362 3363 * sys_membarrier() requires an smp_mb() between setting 3363 - * rq->curr and returning to userspace. 3364 + * rq->curr / membarrier_switch_mm() and returning to userspace. 3364 3365 * 3365 3366 * The below provides this either through switch_mm(), or in 3366 3367 * case 'prev->active_mm == next->mm' through 3367 3368 * finish_task_switch()'s mmdrop(). 3368 3369 */ 3369 - 3370 3370 switch_mm_irqs_off(prev->active_mm, next->mm, next); 3371 3371 3372 3372 if (!prev->mm) { // from kernel
+132 -43
kernel/sched/membarrier.c
··· 30 30 smp_mb(); /* IPIs should be serializing but paranoid. */ 31 31 } 32 32 33 + static void ipi_sync_rq_state(void *info) 34 + { 35 + struct mm_struct *mm = (struct mm_struct *) info; 36 + 37 + if (current->mm != mm) 38 + return; 39 + this_cpu_write(runqueues.membarrier_state, 40 + atomic_read(&mm->membarrier_state)); 41 + /* 42 + * Issue a memory barrier after setting 43 + * MEMBARRIER_STATE_GLOBAL_EXPEDITED in the current runqueue to 44 + * guarantee that no memory access following registration is reordered 45 + * before registration. 46 + */ 47 + smp_mb(); 48 + } 49 + 50 + void membarrier_exec_mmap(struct mm_struct *mm) 51 + { 52 + /* 53 + * Issue a memory barrier before clearing membarrier_state to 54 + * guarantee that no memory access prior to exec is reordered after 55 + * clearing this state. 56 + */ 57 + smp_mb(); 58 + atomic_set(&mm->membarrier_state, 0); 59 + /* 60 + * Keep the runqueue membarrier_state in sync with this mm 61 + * membarrier_state. 62 + */ 63 + this_cpu_write(runqueues.membarrier_state, 0); 64 + } 65 + 33 66 static int membarrier_global_expedited(void) 34 67 { 35 68 int cpu; ··· 89 56 } 90 57 91 58 cpus_read_lock(); 59 + rcu_read_lock(); 92 60 for_each_online_cpu(cpu) { 93 61 struct task_struct *p; 94 62 ··· 104 70 if (cpu == raw_smp_processor_id()) 105 71 continue; 106 72 107 - rcu_read_lock(); 73 + if (!(READ_ONCE(cpu_rq(cpu)->membarrier_state) & 74 + MEMBARRIER_STATE_GLOBAL_EXPEDITED)) 75 + continue; 76 + 77 + /* 78 + * Skip the CPU if it runs a kernel thread. The scheduler 79 + * leaves the prior task mm in place as an optimization when 80 + * scheduling a kthread. 
81 + */ 108 82 p = rcu_dereference(cpu_rq(cpu)->curr); 109 - if (p && p->mm && (atomic_read(&p->mm->membarrier_state) & 110 - MEMBARRIER_STATE_GLOBAL_EXPEDITED)) { 111 - if (!fallback) 112 - __cpumask_set_cpu(cpu, tmpmask); 113 - else 114 - smp_call_function_single(cpu, ipi_mb, NULL, 1); 115 - } 116 - rcu_read_unlock(); 83 + if (p->flags & PF_KTHREAD) 84 + continue; 85 + 86 + if (!fallback) 87 + __cpumask_set_cpu(cpu, tmpmask); 88 + else 89 + smp_call_function_single(cpu, ipi_mb, NULL, 1); 117 90 } 91 + rcu_read_unlock(); 118 92 if (!fallback) { 119 93 preempt_disable(); 120 94 smp_call_function_many(tmpmask, ipi_mb, NULL, 1); ··· 178 136 } 179 137 180 138 cpus_read_lock(); 139 + rcu_read_lock(); 181 140 for_each_online_cpu(cpu) { 182 141 struct task_struct *p; 183 142 ··· 200 157 else 201 158 smp_call_function_single(cpu, ipi_mb, NULL, 1); 202 159 } 203 - rcu_read_unlock(); 204 160 } 161 + rcu_read_unlock(); 205 162 if (!fallback) { 206 163 preempt_disable(); 207 164 smp_call_function_many(tmpmask, ipi_mb, NULL, 1); ··· 220 177 return 0; 221 178 } 222 179 180 + static int sync_runqueues_membarrier_state(struct mm_struct *mm) 181 + { 182 + int membarrier_state = atomic_read(&mm->membarrier_state); 183 + cpumask_var_t tmpmask; 184 + int cpu; 185 + 186 + if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) { 187 + this_cpu_write(runqueues.membarrier_state, membarrier_state); 188 + 189 + /* 190 + * For single mm user, we can simply issue a memory barrier 191 + * after setting MEMBARRIER_STATE_GLOBAL_EXPEDITED in the 192 + * mm and in the current runqueue to guarantee that no memory 193 + * access following registration is reordered before 194 + * registration. 195 + */ 196 + smp_mb(); 197 + return 0; 198 + } 199 + 200 + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) 201 + return -ENOMEM; 202 + 203 + /* 204 + * For mm with multiple users, we need to ensure all future 205 + * scheduler executions will observe @mm's new membarrier 206 + * state. 
207 + */ 208 + synchronize_rcu(); 209 + 210 + /* 211 + * For each cpu runqueue, if the task's mm match @mm, ensure that all 212 + * @mm's membarrier state set bits are also set in in the runqueue's 213 + * membarrier state. This ensures that a runqueue scheduling 214 + * between threads which are users of @mm has its membarrier state 215 + * updated. 216 + */ 217 + cpus_read_lock(); 218 + rcu_read_lock(); 219 + for_each_online_cpu(cpu) { 220 + struct rq *rq = cpu_rq(cpu); 221 + struct task_struct *p; 222 + 223 + p = rcu_dereference(&rq->curr); 224 + if (p && p->mm == mm) 225 + __cpumask_set_cpu(cpu, tmpmask); 226 + } 227 + rcu_read_unlock(); 228 + 229 + preempt_disable(); 230 + smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); 231 + preempt_enable(); 232 + 233 + free_cpumask_var(tmpmask); 234 + cpus_read_unlock(); 235 + 236 + return 0; 237 + } 238 + 223 239 static int membarrier_register_global_expedited(void) 224 240 { 225 241 struct task_struct *p = current; 226 242 struct mm_struct *mm = p->mm; 243 + int ret; 227 244 228 245 if (atomic_read(&mm->membarrier_state) & 229 246 MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY) 230 247 return 0; 231 248 atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state); 232 - if (atomic_read(&mm->mm_users) == 1) { 233 - /* 234 - * For single mm user, single threaded process, we can 235 - * simply issue a memory barrier after setting 236 - * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that 237 - * no memory access following registration is reordered 238 - * before registration. 239 - */ 240 - smp_mb(); 241 - } else { 242 - /* 243 - * For multi-mm user threads, we need to ensure all 244 - * future scheduler executions will observe the new 245 - * thread flag state for this mm. 
246 - */ 247 - synchronize_rcu(); 248 - } 249 + ret = sync_runqueues_membarrier_state(mm); 250 + if (ret) 251 + return ret; 249 252 atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY, 250 253 &mm->membarrier_state); 251 254 ··· 302 213 { 303 214 struct task_struct *p = current; 304 215 struct mm_struct *mm = p->mm; 305 - int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY; 216 + int ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY, 217 + set_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED, 218 + ret; 306 219 307 220 if (flags & MEMBARRIER_FLAG_SYNC_CORE) { 308 221 if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) 309 222 return -EINVAL; 310 - state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY; 223 + ready_state = 224 + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY; 311 225 } 312 226 313 227 /* ··· 318 226 * groups, which use the same mm. (CLONE_VM but not 319 227 * CLONE_THREAD). 320 228 */ 321 - if ((atomic_read(&mm->membarrier_state) & state) == state) 229 + if ((atomic_read(&mm->membarrier_state) & ready_state) == ready_state) 322 230 return 0; 323 - atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state); 324 231 if (flags & MEMBARRIER_FLAG_SYNC_CORE) 325 - atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE, 326 - &mm->membarrier_state); 327 - if (atomic_read(&mm->mm_users) != 1) { 328 - /* 329 - * Ensure all future scheduler executions will observe the 330 - * new thread flag state for this process. 
331 - */ 332 - synchronize_rcu(); 333 - } 334 - atomic_or(state, &mm->membarrier_state); 232 + set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE; 233 + atomic_or(set_state, &mm->membarrier_state); 234 + ret = sync_runqueues_membarrier_state(mm); 235 + if (ret) 236 + return ret; 237 + atomic_or(ready_state, &mm->membarrier_state); 335 238 336 239 return 0; 337 240 } ··· 340 253 * command specified does not exist, not available on the running 341 254 * kernel, or if the command argument is invalid, this system call 342 255 * returns -EINVAL. For a given command, with flags argument set to 0, 343 - * this system call is guaranteed to always return the same value until 344 - * reboot. 256 + * if this system call returns -ENOSYS or -EINVAL, it is guaranteed to 257 + * always return the same value until reboot. In addition, it can return 258 + * -ENOMEM if there is not enough memory available to perform the system 259 + * call. 345 260 * 346 261 * All memory accesses performed in program order from each targeted thread 347 262 * is guaranteed to be ordered with respect to sys_membarrier(). If we use
+34
kernel/sched/sched.h
··· 911 911 912 912 atomic_t nr_iowait; 913 913 914 + #ifdef CONFIG_MEMBARRIER 915 + int membarrier_state; 916 + #endif 917 + 914 918 #ifdef CONFIG_SMP 915 919 struct root_domain *rd; 916 920 struct sched_domain __rcu *sd; ··· 2442 2438 static inline bool sched_energy_enabled(void) { return false; } 2443 2439 2444 2440 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ 2441 + 2442 + #ifdef CONFIG_MEMBARRIER 2443 + /* 2444 + * The scheduler provides memory barriers required by membarrier between: 2445 + * - prior user-space memory accesses and store to rq->membarrier_state, 2446 + * - store to rq->membarrier_state and following user-space memory accesses. 2447 + * In the same way it provides those guarantees around store to rq->curr. 2448 + */ 2449 + static inline void membarrier_switch_mm(struct rq *rq, 2450 + struct mm_struct *prev_mm, 2451 + struct mm_struct *next_mm) 2452 + { 2453 + int membarrier_state; 2454 + 2455 + if (prev_mm == next_mm) 2456 + return; 2457 + 2458 + membarrier_state = atomic_read(&next_mm->membarrier_state); 2459 + if (READ_ONCE(rq->membarrier_state) == membarrier_state) 2460 + return; 2461 + 2462 + WRITE_ONCE(rq->membarrier_state, membarrier_state); 2463 + } 2464 + #else 2465 + static inline void membarrier_switch_mm(struct rq *rq, 2466 + struct mm_struct *prev_mm, 2467 + struct mm_struct *next_mm) 2468 + { 2469 + } 2470 + #endif