Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

watchdog: implement error handling for failure to set up hardware perf events

If watchdog_nmi_enable() fails to set up the hardware perf event of one
CPU, the entire hard lockup detector is deemed unreliable. Hence, disable
the hard lockup detector and shut down the hardware perf events on all
CPUs.

[dzickus@redhat.com: update comments to explain some code]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Ulrich Obergfell and committed by Linus Torvalds
bcfba4f4 83a80a39

+30
kernel/watchdog.c
···
 	__this_cpu_write(soft_lockup_hrtimer_cnt,
 			 __this_cpu_read(hrtimer_interrupts));
 	__touch_watchdog();
+
+	/*
+	 * watchdog_nmi_enable() clears the NMI_WATCHDOG_ENABLED bit in the
+	 * failure path. Check for failures that can occur asynchronously -
+	 * for example, when CPUs are on-lined - and shut down the hardware
+	 * perf event on each CPU accordingly.
+	 *
+	 * The only non-obvious place this bit can be cleared is through
+	 * watchdog_nmi_enable(), so a pr_info() is placed there. Placing a
+	 * pr_info here would be too noisy as it would result in a message
+	 * every few seconds if the hardlockup was disabled but the softlockup
+	 * enabled.
+	 */
+	if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED))
+		watchdog_nmi_disable(cpu);
 }
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
···
 		goto out_save;
 	}
 
+	/*
+	 * Disable the hard lockup detector if _any_ CPU fails to set up
+	 * set up the hardware perf event. The watchdog() function checks
+	 * the NMI_WATCHDOG_ENABLED bit periodically.
+	 *
+	 * The barriers are for syncing up watchdog_enabled across all the
+	 * cpus, as clear_bit() does not use barriers.
+	 */
+	smp_mb__before_atomic();
+	clear_bit(NMI_WATCHDOG_ENABLED_BIT, &watchdog_enabled);
+	smp_mb__after_atomic();
+
 	/* skip displaying the same error again */
 	if (cpu > 0 && (PTR_ERR(event) == cpu0_err))
 		return PTR_ERR(event);
···
 	else
 		pr_err("disabled (cpu%i): unable to create perf event: %ld\n",
 			cpu, PTR_ERR(event));
+
+	pr_info("Shutting down hard lockup detector on all cpus\n");
+
 	return PTR_ERR(event);
 
 	/* success path */
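
The control flow of this change can be tried outside the kernel. What follows is a minimal userspace sketch in C11, not kernel code. Every sim_* and SIM_* name is a hypothetical stand-in, the "perf event" is just a per-CPU boolean, and the failing CPU is chosen by the caller. A sequentially consistent atomic_fetch_and() is assumed to be a reasonable userspace analogue of the smp_mb__before_atomic() / clear_bit() / smp_mb__after_atomic() sequence. It is not the same primitive.

```c
/* Userspace sketch only; all sim_* / SIM_* names are hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define SIM_NR_CPUS               4
#define SIM_NMI_WATCHDOG_ENABLED  (1UL << 0)
#define SIM_SOFT_WATCHDOG_ENABLED (1UL << 1)

/* Shared enable mask, read by every simulated CPU. */
static atomic_ulong sim_watchdog_enabled =
	SIM_NMI_WATCHDOG_ENABLED | SIM_SOFT_WATCHDOG_ENABLED;

/* Stand-in for the per-CPU hardware perf event. */
static bool sim_perf_event_active[SIM_NR_CPUS];

/*
 * Stand-in for the enable path. Event setup "fails" on failing_cpu,
 * and the failure path clears the shared NMI bit for all CPUs.
 */
static int sim_nmi_enable(int cpu, int failing_cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);

	if (cpu == failing_cpu) {
		/* A seq_cst read-modify-write orders the clear on both sides. */
		atomic_fetch_and(&sim_watchdog_enabled,
				 ~SIM_NMI_WATCHDOG_ENABLED);
		/* Log once here, not in the periodic check. */
		fprintf(stderr,
			"Shutting down hard lockup detector on all cpus\n");
		return -1;
	}

	sim_perf_event_active[cpu] = true;
	return 0;
}

/* Stand-in for the disable path: tear down this CPU's event. */
static void sim_nmi_disable(int cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	sim_perf_event_active[cpu] = false;
}

/*
 * One periodic pass of the per-CPU watchdog: if the shared bit was
 * cleared by some other CPU's failure, shut down the local event.
 */
static void sim_watchdog_tick(int cpu)
{
	if (!(atomic_load(&sim_watchdog_enabled) & SIM_NMI_WATCHDOG_ENABLED))
		sim_nmi_disable(cpu);
}

/* Count CPUs whose simulated perf event is still active. */
static int sim_count_active(void)
{
	int n = 0;

	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		n += sim_perf_event_active[cpu] ? 1 : 0;
	return n;
}
```

To exercise it, call sim_nmi_enable() for CPUs 0 to 2 with failing_cpu set to 3, then for CPU 3 itself, and finally run sim_watchdog_tick() on every CPU. The soft lockup bit stays set while every simulated perf event ends up disabled. Keeping the message in the failure path rather than in the tick mirrors the reasoning in the patch comment: the tick runs repeatedly and would log every few seconds.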