An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel
free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous. They can occur at any time, when
the OS is in any state, including when one of the cpus is already
holding a spinlock. Trying to get any lock from MCA/INIT state is
asking for deadlock. Also the state of structures that are protected
by locks is indeterminate, including linked lists.

---

The complicated ia64 MCA process. All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind; it is not as
if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL; they
  end up spinning disabled while the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt. SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined. In particular the MCA/INIT
  handlers cannot rely on the thread pointer; PAL physical mode can
  (and does) modify TP. It is allowed to do that as long as it resets
  TP on return. However MCA/INIT events expose us to these PAL
  internal TP changes. Hence curr_task().

* If an MCA/INIT event occurs while the kernel was running (not user
  space) and the kernel has called PAL then the MCA/INIT handler cannot
  assume that the kernel stack is in a fit state to be used.
  Mainly
  because PAL may or may not maintain the stack pointer internally.
  Because the MCA/INIT handlers cannot trust the kernel stack, they
  have to use their own, per-cpu stacks. The MCA/INIT stacks are
  preformatted with just enough task state to let the relevant handlers
  do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1]. So switching to a new kernel stack means that
  we switch to a new task as well. Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered. The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks. The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not. That is, whether it is on a cpu or is blocked. The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it. The tasks that received an MCA or
  INIT event are no longer running; they have been converted to blocked
  tasks. But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not. Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu.
  That gives us a decent chance of
  getting a valid backtrace of the _original_ task.

* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running. Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu. Then a recursive error gets a
  trace of the failing handler's "task".

[1] My (Keith Owens) original design called for ia64 to separate its
    struct task and the kernel stacks. Then the MCA/INIT data would be
    chained stacks like i386 interrupt stacks. But that required
    radical surgery on the rest of ia64, plus extra hard wired TLB
    entries with their associated performance degradation. David
    Mosberger vetoed that approach. Which meant that separate kernel
    stacks meant separate "tasks" for the MCA/INIT handlers.

---

INIT is less complicated than MCA. Pressing the nmi button or using
the equivalent command on the management console sends INIT to all
cpus. SAL picks one of the cpus as the monarch and the rest are
slaves. All the OS INIT handlers are entered at approximately the same
time. The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen. Alas there are broken
versions of SAL out there. Some drive all the cpus as monarchs. Some
drive them all as slaves. Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves. Some versions
of SAL cannot even cope with returning from the OS; they spin inside
SAL on resume.
The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations. Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors. When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.

---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks. ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed. MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not. ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.
Since the
OS has no idea what unwind data is available for the user space stack,
MCA/INIT never tries to backtrace user space. Which means that the OS
does not bother making the user space process look like a blocked task,
i.e. the OS does not copy pt_regs and switch_stack to the user space
stack. Also the OS has no idea how big the user space RSE and memory
stacks are, which makes it too risky to copy the saved state to a user
mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

mca.c:ia64_mca_modify_original_stack(). That identifies and
verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp. That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping. To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task. You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.

The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET.
You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures.

Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use. For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
···
 	;;
 	lfetch.fault [r16], 128
 	br.ret.sptk.many rp
-END(prefetch_switch_stack)
+END(prefetch_stack)

 GLOBAL_ENTRY(execve)
 	mov r15=__NR_execve			// put syscall number in place
arch/ia64/kernel/mca_drv.c
···
 	struct page *p;

 	/* whether physical address is valid or not */
-	if ( !ia64_phys_addr_valid(paddr) ) 
+	if (!ia64_phys_addr_valid(paddr))
 		return ISOLATE_NG;

 	/* convert physical address to physical page number */
 	p = pfn_to_page(paddr>>PAGE_SHIFT);

 	/* check whether a page number have been already registered or not */
-	for( i = 0; i < num_page_isolate; i++ )
-		if( page_isolate[i] == p )
+	for (i = 0; i < num_page_isolate; i++)
+		if (page_isolate[i] == p)
 			return ISOLATE_OK; /* already listed */

 	/* limitation check */
-	if( num_page_isolate == MAX_PAGE_ISOLATE ) 
+	if (num_page_isolate == MAX_PAGE_ISOLATE)
 		return ISOLATE_NG;

 	/* kick pages having attribute 'SLAB' or 'Reserved' */
-	if( PageSlab(p) || PageReserved(p) ) 
+	if (PageSlab(p) || PageReserved(p))
 		return ISOLATE_NG;

 	/* add attribute 'Reserved' and register the page */
···
  * @peidx: pointer to index of processor error section
  */

-static void 
+static void
 mca_make_peidx(sal_log_processor_info_t *slpi, peidx_table_t *peidx)
 {
-	/* 
+	/*
 	 * calculate the start address of
 	 * "struct cpuid_info" and "sal_processor_static_info_t".
 	 */
···
 }

 /**
- * mca_make_slidx - Make index of SAL error record 
+ * mca_make_slidx - Make index of SAL error record
  * @buffer: pointer to SAL error record
  * @slidx: pointer to index of SAL error record
  *
···
  * 1 if record has platform error / 0 if not
  */
 #define LOG_INDEX_ADD_SECT_PTR(sect, ptr) \
-	{ slidx_list_t *hl = &slidx_pool.buffer[slidx_pool.cur_idx]; \
-	hl->hdr = ptr; \
-	list_add(&hl->list, &(sect)); \
-	slidx_pool.cur_idx = (slidx_pool.cur_idx + 1)%slidx_pool.max_idx; }
+	{slidx_list_t *hl = &slidx_pool.buffer[slidx_pool.cur_idx]; \
+	hl->hdr = ptr; \
+	list_add(&hl->list, &(sect)); \
+	slidx_pool.cur_idx = (slidx_pool.cur_idx + 1)%slidx_pool.max_idx; }

-static int 
+static int
 mca_make_slidx(void *buffer, slidx_table_t *slidx)
 {
 	int platform_err = 0;
···
 		sp = (sal_log_section_hdr_t *)((char*)buffer + ercd_pos);
 		if (!efi_guidcmp(sp->guid, SAL_PROC_DEV_ERR_SECT_GUID)) {
 			LOG_INDEX_ADD_SECT_PTR(slidx->proc_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_MEM_DEV_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_MEM_DEV_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->mem_dev_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_SEL_DEV_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_SEL_DEV_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->sel_dev_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_PCI_BUS_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_PCI_BUS_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->pci_bus_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_SMBIOS_DEV_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_SMBIOS_DEV_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->smbios_dev_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_PCI_COMP_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_PCI_COMP_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->pci_comp_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_SPECIFIC_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_SPECIFIC_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->plat_specific_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_HOST_CTLR_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_HOST_CTLR_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->host_ctlr_err, sp);
-		} else if (!efi_guidcmp(sp->guid, SAL_PLAT_BUS_ERR_SECT_GUID)) {
+		} else if (!efi_guidcmp(sp->guid,
+				SAL_PLAT_BUS_ERR_SECT_GUID)) {
 			platform_err = 1;
 			LOG_INDEX_ADD_SECT_PTR(slidx->plat_bus_err, sp);
 		} else {
···
  * Return value:
  *	0 on Success / -ENOMEM on Failure
  */
-static int 
+static int
 init_record_index_pools(void)
 {
 	int i;
 	int rec_max_size;	/* Maximum size of SAL error records */
 	int sect_min_size;	/* Minimum size of SAL error sections */
 	/* minimum size table of each section */
-	static int sal_log_sect_min_sizes[] = { 
-		sizeof(sal_log_processor_info_t) + sizeof(sal_processor_static_info_t),
+	static int sal_log_sect_min_sizes[] = {
+		sizeof(sal_log_processor_info_t)
+		+ sizeof(sal_processor_static_info_t),
 		sizeof(sal_log_mem_dev_err_info_t),
 		sizeof(sal_log_sel_dev_err_info_t),
 		sizeof(sal_log_pci_bus_err_info_t),
···
 	/* - 3 - */
 	slidx_pool.max_idx = (rec_max_size/sect_min_size) * 2 + 1;
-	slidx_pool.buffer = (slidx_list_t *) kmalloc(slidx_pool.max_idx * sizeof(slidx_list_t), GFP_KERNEL);
+	slidx_pool.buffer = (slidx_list_t *)
+		kmalloc(slidx_pool.max_idx * sizeof(slidx_list_t), GFP_KERNEL);

 	return slidx_pool.buffer ?
 		0 : -ENOMEM;
 }
···
  * is_mca_global - Check whether this MCA is global or not
  * @peidx: pointer of index of processor error section
  * @pbci: pointer to pal_bus_check_info_t
+ * @sos: pointer to hand off struct between SAL and OS
  *
  * Return value:
  *	MCA_IS_LOCAL / MCA_IS_GLOBAL
···
 is_mca_global(peidx_table_t *peidx, pal_bus_check_info_t *pbci,
 	      struct ia64_sal_os_state *sos)
 {
-	pal_processor_state_info_t *psp = (pal_processor_state_info_t*)peidx_psp(peidx);
+	pal_processor_state_info_t *psp =
+		(pal_processor_state_info_t*)peidx_psp(peidx);

-	/* 
+	/*
 	 * PAL can request a rendezvous, if the MCA has a global scope.
-	 * If "rz_always" flag is set, SAL requests MCA rendezvous 
+	 * If "rz_always" flag is set, SAL requests MCA rendezvous
 	 * in spite of global MCA.
 	 * Therefore it is local MCA when rendezvous has not been requested.
 	 * Failed to rendezvous, the system must be down.
···
  * @slidx: pointer of index of SAL error record
  * @peidx: pointer of index of processor error section
  * @pbci: pointer of pal_bus_check_info
+ * @sos: pointer to hand off struct between SAL and OS
  *
  * Return value:
  *	1 on Success / 0 on Failure
  */

 static int
-recover_from_read_error(slidx_table_t *slidx, peidx_table_t *peidx, pal_bus_check_info_t *pbci,
+recover_from_read_error(slidx_table_t *slidx,
+			peidx_table_t *peidx, pal_bus_check_info_t *pbci,
 			struct ia64_sal_os_state *sos)
 {
 	sal_log_mod_error_info_t *smei;
···
  * @slidx: pointer of index of SAL error record
  * @peidx: pointer of index of processor error section
  * @pbci: pointer of pal_bus_check_info
+ * @sos: pointer to hand off struct between SAL and OS
  *
  * Return value:
  *	1 on Success / 0 on Failure
  */

 static int
-recover_from_platform_error(slidx_table_t *slidx,
-			peidx_table_t *peidx, pal_bus_check_info_t *pbci,
+recover_from_platform_error(slidx_table_t *slidx, peidx_table_t *peidx,
+			    pal_bus_check_info_t *pbci,
 			    struct ia64_sal_os_state *sos)
 {
 	int status = 0;
-	pal_processor_state_info_t *psp = (pal_processor_state_info_t*)peidx_psp(peidx);
+	pal_processor_state_info_t *psp =
+		(pal_processor_state_info_t*)peidx_psp(peidx);

 	if (psp->bc && pbci->eb && pbci->bsi == 0) {
 		switch(pbci->type) {
 		case 1: /* partial read */
 		case 3: /* full line(cpu) read */
 		case 9: /* I/O space read */
-			status = recover_from_read_error(slidx, peidx, pbci, sos);
+			status = recover_from_read_error(slidx, peidx, pbci,
+							 sos);
 			break;
 		case 0: /* unknown */
 		case 2: /* partial write */
···
 		case 8: /* write coalescing transactions */
 		case 10: /* I/O space write */
 		case 11: /* inter-processor interrupt message(IPI) */
-		case 12: /* interrupt acknowledge or external task priority cycle */
+		case 12: /* interrupt acknowledge or
+			    external task priority cycle */
 		default:
 			break;
 		}
···
  * @slidx: pointer of index of SAL error record
  * @peidx: pointer of index of processor error section
  * @pbci: pointer of pal_bus_check_info
+ * @sos: pointer to hand off struct between SAL and OS
  *
  * Return value:
  *	1 on Success / 0 on Failure
···
  */

 static int
-recover_from_processor_error(int platform, slidx_table_t *slidx, peidx_table_t *peidx, pal_bus_check_info_t *pbci,
+recover_from_processor_error(int platform, slidx_table_t *slidx,
+			     peidx_table_t *peidx, pal_bus_check_info_t *pbci,
 			     struct ia64_sal_os_state *sos)
 {
-	pal_processor_state_info_t *psp = (pal_processor_state_info_t*)peidx_psp(peidx);
+	pal_processor_state_info_t *psp =
+		(pal_processor_state_info_t*)peidx_psp(peidx);

-	/* 
+	/*
 	 * We cannot recover errors with other than bus_check.
 	 */
-	if (psp->cc || psp->rc || psp->uc) 
+	if (psp->cc || psp->rc || psp->uc)
 		return 0;

 	/*
···
 	 * (e.g. a load from poisoned memory)
 	 * This means "there are some platform errors".
 	 */
-	if (platform) 
+	if (platform)
 		return recover_from_platform_error(slidx, peidx, pbci, sos);
-	/* 
-	 * On account of strange SAL error record, we cannot recover. 
+	/*
+	 * On account of strange SAL error record, we cannot recover.
 	 */
 	return 0;
 }
···
 /**
  * mca_try_to_recover - Try to recover from MCA
  * @rec: pointer to a SAL error record
+ * @sos: pointer to hand off struct between SAL and OS
  *
  * Return value:
  *	1 on Success / 0 on Failure
  */

 static int
-mca_try_to_recover(void *rec, 
-		   struct ia64_sal_os_state *sos)
+mca_try_to_recover(void *rec, struct ia64_sal_os_state *sos)
 {
 	int platform_err;
 	int n_proc_err;
···
 	}

 	/* Make index of processor error section */
-	mca_make_peidx((sal_log_processor_info_t*)slidx_first_entry(&slidx.proc_err)->hdr, &peidx);
+	mca_make_peidx((sal_log_processor_info_t*)
+		       slidx_first_entry(&slidx.proc_err)->hdr, &peidx);

 	/* Extract Processor BUS_CHECK[0] */
 	*((u64*)&pbci) = peidx_check_info(&peidx, bus_check, 0);
···
 		return 0;

 	/* Try to recover a processor error */
-	return recover_from_processor_error(platform_err, &slidx, &peidx, &pbci, sos);
+	return recover_from_processor_error(platform_err, &slidx, &peidx,
+					    &pbci, sos);
 }

 /*
···
 		return -ENOMEM;

 	/* register external mca handlers */
-	if (ia64_reg_MCA_extension(mca_try_to_recover)){ 
+	if (ia64_reg_MCA_extension(mca_try_to_recover)) {
 		printk(KERN_ERR "ia64_reg_MCA_extension failed.\n");
 		kfree(slidx_pool.buffer);
 		return -EFAULT;