softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression

The touch_nmi_watchdog() routine on x86 ultimately calls
touch_softlockup_watchdog(). The problem is that touching the
softlockup watchdog requires calling the cpu_clock code, which can
take multiple cpu locks and lead to a hard hang if one of those locks
is held by a processor that is not going to return anytime soon (as
can be the case with kgdb, or perhaps with some other kind of
exception).

This patch causes the public version of the
touch_softlockup_watchdog() to defer the cpu clock access to a later
point.

The test case for this problem is to use the following kernel config
options:

CONFIG_KGDB_TESTS=y
CONFIG_KGDB_TESTS_ON_BOOT=y
CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"

It should be noted that the kgdb test suite and these options were not
available until 2.6.26-rc2, so it was necessary to patch the kgdb
test suite into the kernel during the bisection.

I would consider this patch a regression fix because the problem first
appeared in commit 27ec4407790d075c325e1f4da0a19c56953cce23 when some
logic was added to try to periodically sync the clocks. It was
possible to work around this particular problem by simply not
performing the sync anytime the system was in a critical context.
This was ok until commit 3e51f33fcc7f55e6df25d15b55ed10c8b4da84cd,
which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
multi-cpu locks to sync the clocks. It became clear that accessing
this code from an NMI was the source of the lockups. Avoiding access
to the low-level clock code from inside NMI processing also fixed the
problem with the 27ec44... commit.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

authored by Jason Wessel and committed by Ingo Molnar 9c106c11 afd38009

+10 -5
kernel/softlockup.c
···
 	return cpu_clock(this_cpu) >> 30LL;	/* 2^30 ~= 10^9 */
 }
 
-void touch_softlockup_watchdog(void)
+static void __touch_softlockup_watchdog(void)
 {
 	int this_cpu = raw_smp_processor_id();
 
 	__raw_get_cpu_var(touch_timestamp) = get_timestamp(this_cpu);
 }
+
+void touch_softlockup_watchdog(void)
+{
+	__raw_get_cpu_var(touch_timestamp) = 0;
+}
 EXPORT_SYMBOL(touch_softlockup_watchdog);
···
 	unsigned long now;
 
 	if (touch_timestamp == 0) {
-		touch_softlockup_watchdog();
+		__touch_softlockup_watchdog();
 		return;
 	}
···
 	/* do not print during early bootup: */
 	if (unlikely(system_state != SYSTEM_RUNNING)) {
-		touch_softlockup_watchdog();
+		__touch_softlockup_watchdog();
 		return;
 	}
···
 	sched_setscheduler(current, SCHED_FIFO, &param);
 
 	/* initialize timestamp */
-	touch_softlockup_watchdog();
+	__touch_softlockup_watchdog();
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	/*
···
 	 * debug-printout triggers in softlockup_tick().
 	 */
 	while (!kthread_should_stop()) {
-		touch_softlockup_watchdog();
+		__touch_softlockup_watchdog();
 		schedule();
 
 		if (kthread_should_stop())