x86/smp: Put CPUs into INIT on shutdown if possible

Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI or MCE. This is the same problem which
already affects parking in a MWAIT loop.
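
For illustration, the parking loop at the end of stop_this_cpu() boils
down to something like the sketch below (simplified; the real function
also disables the local APIC and removes the CPU from cpus_stop_mask
first):

	/*
	 * HLT stops the CPU only until the next NMI, SMI or MCE. Once
	 * that event has been handled, execution resumes right here and
	 * the loop is fetched again - from text which kexec() may
	 * already have overwritten.
	 */
	for (;;)
		native_halt();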

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down when the CPUs are sitting in
INIT, but a broadcast MCE which makes a HLT-parked CPU resume and execute
overwritten text, page tables or data ends up in a disaster too.

So choose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which do not
use INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de

---
 arch/x86/include/asm/smp.h |  2 ++
 arch/x86/kernel/smp.c      | 39 ++++++++++++++++++++++++++++++++-------
 arch/x86/kernel/smpboot.c  | 19 +++++++++++++++++++
 3 files changed, 53 insertions(+), 7 deletions(-)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -139,6 +139,8 @@
 void native_send_call_func_single_ipi(int cpu);
 void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
+bool smp_park_other_cpus_in_init(void);
+
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
 
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -131,7 +131,7 @@
 }
 
 /*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
  */
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
@@ -172,13 +172,17 @@
 	 * 2) Wait for all other CPUs to report that they reached the
 	 *    HLT loop in stop_this_cpu()
 	 *
-	 * 3) If #2 timed out send an NMI to the CPUs which did not
-	 *    yet report
+	 * 3) If the system uses INIT/STARTUP for CPU bringup, then
+	 *    send all present CPUs an INIT vector, which brings them
+	 *    completely out of the way.
 	 *
-	 * 4) Wait for all other CPUs to report that they reached the
+	 * 4) If #3 is not possible and #2 timed out send an NMI to the
+	 *    CPUs which did not yet report
+	 *
+	 * 5) Wait for all other CPUs to report that they reached the
 	 *    HLT loop in stop_this_cpu()
 	 *
-	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * #4 can obviously race against a CPU reaching the HLT loop late.
 	 * That CPU will have reported already and the "have all CPUs
 	 * reached HLT" condition will be true despite the fact that the
 	 * other CPU is still handling the NMI. Again, there is no
@@ -198,7 +202,7 @@
 	/*
 	 * Don't wait longer than a second for IPI completion. The
 	 * wait request is not checked here because that would
-	 * prevent an NMI shutdown attempt in case that not all
+	 * prevent an NMI/INIT shutdown in case that not all
 	 * CPUs reach shutdown state.
 	 */
 	timeout = USEC_PER_SEC;
@@ -206,7 +210,27 @@
 		udelay(1);
 	}
 
-	/* if the REBOOT_VECTOR didn't work, try with the NMI */
+	/*
+	 * Park all other CPUs in INIT including "offline" CPUs, if
+	 * possible. That's a safe place where they can't resume execution
+	 * of HLT and then execute the HLT loop from overwritten text or
+	 * page tables.
+	 *
+	 * The only downside is a broadcast MCE, but up to the point where
+	 * the kexec() kernel brought all APs online again an MCE will just
+	 * make HLT resume and handle the MCE. The machine crashes and burns
+	 * due to overwritten text, page tables and data. So there is a
+	 * choice between fire and frying pan. The result is pretty much
+	 * the same. Chose frying pan until x86 provides a sane mechanism
+	 * to park a CPU.
+	 */
+	if (smp_park_other_cpus_in_init())
+		goto done;
+
+	/*
+	 * If park with INIT was not possible and the REBOOT_VECTOR didn't
+	 * take all secondary CPUs offline, try with the NMI.
+	 */
 	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
@@ -249,6 +273,7 @@
 		udelay(1);
 	}
 
+done:
 	local_irq_save(flags);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
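Note the ordering in the new code: parking via INIT is attempted right
after the REBOOT_VECTOR wait and independent of cpus_stop_mask being
empty, because it also covers "offline" CPUs which are still sitting in
a HLT loop. The NMI fallback is only reached when the APIC driver has a
special wakeup method installed and INIT parking is therefore not
possible.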
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1465,6 +1465,25 @@
 	cache_aps_init();
 }
 
+bool smp_park_other_cpus_in_init(void)
+{
+	unsigned int cpu, this_cpu = smp_processor_id();
+	unsigned int apicid;
+
+	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
+		return false;
+
+	for_each_present_cpu(cpu) {
+		if (cpu == this_cpu)
+			continue;
+		apicid = apic->cpu_present_to_apicid(cpu);
+		if (apicid == BAD_APICID)
+			continue;
+		send_init_sequence(apicid);
+	}
+	return true;
+}
+
 /*
  * Early setup to make printk work.
  */
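
send_init_sequence() itself is not part of this patch. For reference,
the INIT assert/deassert it performs follows the classic MP startup
protocol and looks roughly like the sketch below (the helper name
park_cpu_via_init() is made up here; the real kernel helper additionally
clears pending APIC errors and uses init_udelay for the delay):

	/* Sketch of the MP protocol INIT sequence, not the verbatim helper */
	static void park_cpu_via_init(u32 phys_apicid)
	{
		/* Assert INIT: the target CPU drops into wait-for-SIPI state */
		apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
			       phys_apicid);
		safe_apic_wait_icr_idle();

		udelay(10);

		/* Deassert INIT. No SIPI follows, so the CPU stays parked */
		apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
		safe_apic_wait_icr_idle();
	}

Because no STARTUP IPI is sent afterwards, the CPU remains in
wait-for-SIPI state, where neither NMI nor SMI can make it resume
execution of stale text.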