Linux kernel mirror: git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO)

Currently the memory error handler handles action-optional errors in a
deferred manner by default. If a recovery-aware application wants to
handle them immediately, it can do so by setting the PF_MCE_EARLY flag.
However, such a signal can be sent only to the main thread, which is
problematic if the application wants a dedicated thread to handle such
signals.

So this patch adds dedicated thread support to the memory error handler.
Each thread carries its own PF_MCE_EARLY flag, so with this patch the AO
signal is sent to the thread with PF_MCE_EARLY set, not to the main
thread. To implement a dedicated thread, call prctl() to set
PF_MCE_EARLY on that thread.

The memory error handler collects the processes to be killed, so this
patch makes it check the PF_MCE_EARLY flag on each thread in the
collecting routines.

There is no behavioral change in any non-early-kill case.

Tony said:

: The old behavior was crazy - someone with a multithreaded process might
: well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
: that thread would see the SIGBUS with si_code = BUS_MCEERR_AO - even if
: that thread wasn't the main thread for the process.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Kamil Iskra <iskra@mcs.anl.gov>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org> [3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Naoya Horiguchi, committed by Linus Torvalds
3ba08129 74614de1

2 files changed: +48 -13
Documentation/vm/hwpoison.txt (+5)

@@ -84,6 +84,11 @@
 	PR_MCE_KILL_EARLY: Early kill
 	PR_MCE_KILL_LATE: Late kill
 	PR_MCE_KILL_DEFAULT: Use system global default
+	Note that if you want to have a dedicated thread which handles
+	the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
+	call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
+	the SIGBUS is sent to the main thread.
+
 	PR_MCE_KILL_GET
 		return current mode
mm/memory-failure.c (+43 -13)

@@ -380,15 +380,44 @@
 	}
 }
 
-static int task_early_kill(struct task_struct *tsk, int force_early)
+/*
+ * Find a dedicated thread which is supposed to handle SIGBUS(BUS_MCEERR_AO)
+ * on behalf of the thread group. Return task_struct of the (first found)
+ * dedicated thread if found, and return NULL otherwise.
+ *
+ * We already hold read_lock(&tasklist_lock) in the caller, so we don't
+ * have to call rcu_read_lock/unlock() in this function.
+ */
+static struct task_struct *find_early_kill_thread(struct task_struct *tsk)
 {
+	struct task_struct *t;
+
+	for_each_thread(tsk, t)
+		if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY))
+			return t;
+	return NULL;
+}
+
+/*
+ * Determine whether a given process is "early kill" process which expects
+ * to be signaled when some page under the process is hwpoisoned.
+ * Return task_struct of the dedicated thread (main thread unless explicitly
+ * specified) if the process is "early kill," and otherwise returns NULL.
+ */
+static struct task_struct *task_early_kill(struct task_struct *tsk,
+					   int force_early)
+{
+	struct task_struct *t;
 	if (!tsk->mm)
-		return 0;
+		return NULL;
 	if (force_early)
-		return 1;
-	if (tsk->flags & PF_MCE_PROCESS)
-		return !!(tsk->flags & PF_MCE_EARLY);
-	return sysctl_memory_failure_early_kill;
+		return tsk;
+	t = find_early_kill_thread(tsk);
+	if (t)
+		return t;
+	if (sysctl_memory_failure_early_kill)
+		return tsk;
+	return NULL;
 }
 
 /*
@@ -410,16 +439,17 @@
 	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
+		struct task_struct *t = task_early_kill(tsk, force_early);
 
-		if (!task_early_kill(tsk, force_early))
+		if (!t)
 			continue;
 		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
					       pgoff, pgoff) {
 			vma = vmac->vma;
 			if (!page_mapped_in_vma(page, vma))
 				continue;
-			if (vma->vm_mm == tsk->mm)
-				add_to_kill(tsk, page, vma, to_kill, tkc);
+			if (vma->vm_mm == t->mm)
+				add_to_kill(t, page, vma, to_kill, tkc);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -440,10 +470,10 @@
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+		struct task_struct *t = task_early_kill(tsk, force_early);
 
-		if (!task_early_kill(tsk, force_early))
+		if (!t)
 			continue;
-
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
					  pgoff) {
 			/*
@@ -453,8 +483,8 @@
 			 * Assume applications who requested early kill want
 			 * to be informed of all such data corruptions.
 			 */
-			if (vma->vm_mm == tsk->mm)
-				add_to_kill(tsk, page, vma, to_kill, tkc);
+			if (vma->vm_mm == t->mm)
+				add_to_kill(t, page, vma, to_kill, tkc);
 		}
 	}
 	read_unlock(&tasklist_lock);