Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Documentation/acpi/einj: Correct and streamline text

Streamline and simplify formulations, improve formatting and extend the
injection example in the error injection write up for users which we
carry in Documentation/.

Add a paragraph about checking for EINJ support and expand the ACPI5.0
memory errors section, as requested by Tony.

Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1422553845-30717-1-git-send-email-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>

+118 -70
Documentation/acpi/apei/einj.txt
@@ -1,129 +1,177 @@
 APEI Error INJection
 ~~~~~~~~~~~~~~~~~~~~

-EINJ provides a hardware error injection mechanism
-It is very useful for debugging and testing of other APEI and RAS features.
+EINJ provides a hardware error injection mechanism. It is very useful
+for debugging and testing APEI and RAS features in general.

-To use EINJ, make sure the following are enabled in your kernel
+You need to check whether your BIOS supports EINJ first. For that, look
+for early boot messages similar to this one:
+
+ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)
+
+which shows that the BIOS is exposing an EINJ table - it is the
+mechanism through which the injection is done.
+
+Alternatively, look in /sys/firmware/acpi/tables for an "EINJ" file,
+which is a different representation of the same thing.
+
+It doesn't necessarily mean that EINJ is not supported if those above
+don't exist: before you give up, go into BIOS setup to see if the BIOS
+has an option to enable error injection. Look for something called WHEA
+or similar. Often, you need to enable an ACPI5 support option prior, in
+order to see the APEI,EINJ,... functionality supported and exposed by
+the BIOS menu.
+
+To use EINJ, make sure the following are options enabled in your kernel
 configuration:

 CONFIG_DEBUG_FS
 CONFIG_ACPI_APEI
 CONFIG_ACPI_APEI_EINJ

-The user interface of EINJ is debug file system, under the
-directory apei/einj. The following files are provided.
+The EINJ user interface is in <debugfs mount point>/apei/einj.
+
+The following files belong to it:

 - available_error_type
-  Reading this file returns the error injection capability of the
-  platform, that is, which error types are supported. The error type
-  definition is as follow, the left field is the error type value, the
-  right field is error description.

-    0x00000001	Processor Correctable
-    0x00000002	Processor Uncorrectable non-fatal
-    0x00000004	Processor Uncorrectable fatal
-    0x00000008	Memory Correctable
-    0x00000010	Memory Uncorrectable non-fatal
-    0x00000020	Memory Uncorrectable fatal
-    0x00000040	PCI Express Correctable
-    0x00000080	PCI Express Uncorrectable fatal
-    0x00000100	PCI Express Uncorrectable non-fatal
-    0x00000200	Platform Correctable
-    0x00000400	Platform Uncorrectable non-fatal
-    0x00000800	Platform Uncorrectable fatal
+  This file shows which error types are supported:

-  The format of file contents are as above, except there are only the
-  available error type lines.
+  Error Type Value	Error Description
+  ================	=================
+  0x00000001		Processor Correctable
+  0x00000002		Processor Uncorrectable non-fatal
+  0x00000004		Processor Uncorrectable fatal
+  0x00000008		Memory Correctable
+  0x00000010		Memory Uncorrectable non-fatal
+  0x00000020		Memory Uncorrectable fatal
+  0x00000040		PCI Express Correctable
+  0x00000080		PCI Express Uncorrectable fatal
+  0x00000100		PCI Express Uncorrectable non-fatal
+  0x00000200		Platform Correctable
+  0x00000400		Platform Uncorrectable non-fatal
+  0x00000800		Platform Uncorrectable fatal
+
+  The format of the file contents are as above, except present are only
+  the available error types.

 - error_type
-  This file is used to set the error type value. The error type value
-  is defined in "available_error_type" description.
+
+  Set the value of the error type being injected. Possible error types
+  are defined in the file available_error_type above.

 - error_inject
-  Write any integer to this file to trigger the error
-  injection. Before this, please specify all necessary error
-  parameters.
+
+  Write any integer to this file to trigger the error injection. Make
+  sure you have specified all necessary error parameters, i.e. this
+  write should be the last step when injecting errors.

 - flags
-  Present for kernel version 3.13 and above. Used to specify which
-  of param{1..4} are valid and should be used by BIOS during injection.
-  Value is a bitmask as specified in ACPI5.0 spec for the
+
+  Present for kernel versions 3.13 and above. Used to specify which
+  of param{1..4} are valid and should be used by the firmware during
+  injection. Value is a bitmask as specified in ACPI5.0 spec for the
   SET_ERROR_TYPE_WITH_ADDRESS data structure:
-    Bit 0 - Processor APIC field valid (see param3 below)
-    Bit 1 - Memory address and mask valid (param1 and param2)
-    Bit 2 - PCIe (seg,bus,dev,fn) valid (param4 below)
-  If set to zero, legacy behaviour is used where the type of injection
-  specifies just one bit set, and param1 is multiplexed.
+
+    Bit 0 - Processor APIC field valid (see param3 below).
+    Bit 1 - Memory address and mask valid (param1 and param2).
+    Bit 2 - PCIe (seg,bus,dev,fn) valid (see param4 below).
+
+  If set to zero, legacy behavior is mimicked where the type of
+  injection specifies just one bit set, and param1 is multiplexed.

 - param1
-  This file is used to set the first error parameter value. Effect of
-  parameter depends on error_type specified. For example, if error
-  type is memory related type, the param1 should be a valid physical
-  memory address. [Unless "flag" is set - see above]
+
+  This file is used to set the first error parameter value. Its effect
+  depends on the error type specified in error_type. For example, if
+  error type is memory related type, the param1 should be a valid
+  physical memory address. [Unless "flag" is set - see above]

 - param2
-  This file is used to set the second error parameter value. Effect of
-  parameter depends on error_type specified. For example, if error
-  type is memory related type, the param2 should be a physical memory
-  address mask. Linux requires page or narrower granularity, say,
-  0xfffffffffffff000.
+
+  Same use as param1 above. For example, if error type is of memory
+  related type, then param2 should be a physical memory address mask.
+  Linux requires page or narrower granularity, say, 0xfffffffffffff000.

 - param3
-  Used when the 0x1 bit is set in "flag" to specify the APIC id
+
+  Used when the 0x1 bit is set in "flags" to specify the APIC id

 - param4
-  Used when the 0x4 bit is set in "flag" to specify target PCIe device
+  Used when the 0x4 bit is set in "flags" to specify target PCIe device

 - notrigger
-  The EINJ mechanism is a two step process. First inject the error, then
-  perform some actions to trigger it. Setting "notrigger" to 1 skips the
-  trigger phase, which *may* allow the user to cause the error in some other
-  context by a simple access to the cpu, memory location, or device that is
-  the target of the error injection. Whether this actually works depends
-  on what operations the BIOS actually includes in the trigger phase.

-BIOS versions based in the ACPI 4.0 specification have limited options
-to control where the errors are injected. Your BIOS may support an
-extension (enabled with the param_extension=1 module parameter, or
-boot command line einj.param_extension=1). This allows the address
-and mask for memory injections to be specified by the param1 and
-param2 files in apei/einj.
+  The error injection mechanism is a two-step process. First inject the
+  error, then perform some actions to trigger it. Setting "notrigger"
+  to 1 skips the trigger phase, which *may* allow the user to cause the
+  error in some other context by a simple access to the CPU, memory
+  location, or device that is the target of the error injection. Whether
+  this actually works depends on what operations the BIOS actually
+  includes in the trigger phase.

-BIOS versions using the ACPI 5.0 specification have more control over
-the target of the injection. For processor related errors (type 0x1,
-0x2 and 0x4) the APICID of the target should be provided using the
-param1 file in apei/einj. For memory errors (type 0x8, 0x10 and 0x20)
-the address is set using param1 with a mask in param2 (0x0 is equivalent
-to all ones). For PCI express errors (type 0x40, 0x80 and 0x100) the
-segment, bus, device and function are specified using param1:
+BIOS versions based on the ACPI 4.0 specification have limited options
+in controlling where the errors are injected. Your BIOS may support an
+extension (enabled with the param_extension=1 module parameter, or boot
+command line einj.param_extension=1). This allows the address and mask
+for memory injections to be specified by the param1 and param2 files in
+apei/einj.
+
+BIOS versions based on the ACPI 5.0 specification have more control over
+the target of the injection. For processor-related errors (type 0x1, 0x2
+and 0x4), you can set flags to 0x3 (param3 for bit 0, and param1 and
+param2 for bit 1) so that you have more information added to the error
+signature being injected. The actual data passed is this:
+
+	memory_address = param1;
+	memory_address_range = param2;
+	apicid = param3;
+	pcie_sbdf = param4;
+
+For memory errors (type 0x8, 0x10 and 0x20) the address is set using
+param1 with a mask in param2 (0x0 is equivalent to all ones). For PCI
+express errors (type 0x40, 0x80 and 0x100) the segment, bus, device and
+function are specified using param1:

    31     24 23    16 15    11 10       8 7        0
   +-------------------------------------------------+
   | segment |   bus  | device | function | reserved |
   +-------------------------------------------------+

-An ACPI 5.0 BIOS may also allow vendor specific errors to be injected.
+Anyway, you get the idea, if there's doubt just take a look at the code
+in drivers/acpi/apei/einj.c.
+
+An ACPI 5.0 BIOS may also allow vendor-specific errors to be injected.
 In this case a file named vendor will contain identifying information
 from the BIOS that hopefully will allow an application wishing to use
-the vendor specific extension to tell that they are running on a BIOS
+the vendor-specific extension to tell that they are running on a BIOS
 that supports it. All vendor extensions have the 0x80000000 bit set in
 error_type. A file vendor_flags controls the interpretation of param1
 and param2 (1 = PROCESSOR, 2 = MEMORY, 4 = PCI). See your BIOS vendor
 documentation for details (and expect changes to this API if vendors
 creativity in using this feature expands beyond our expectations).

-Example:
+
+An error injection example:
+
 # cd /sys/kernel/debug/apei/einj
 # cat available_error_type		# See which errors can be injected
 0x00000002	Processor Uncorrectable non-fatal
 0x00000008	Memory Correctable
 0x00000010	Memory Uncorrectable non-fatal
 # echo 0x12345000 > param1		# Set memory address for injection
-# echo 0xfffffffffffff000 > param2	# Mask - anywhere in this page
+# echo $((-1 << 12)) > param2		# Mask 0xfffffffffffff000 - anywhere in this page
 # echo 0x8 > error_type			# Choose correctable memory error
 # echo 1 > error_inject			# Inject now

+You should see something like this in dmesg:
+
+[22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
+[22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
+[22715.834759] EDAC sbridge MC3: TSC 0
+[22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
+[22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
+[22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

 For more information about EINJ, please refer to ACPI specification
 version 4.0, section 17.5 and ACPI 5.0, section 18.6.
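
The injection steps described in the patched einj.txt can be sketched in Python as below. This is only an illustrative sketch and not part of the commit: the function and constant names (page_mask, pack_sbdf, einj_table_present, inject_memory_error, FLAG_*) are hypothetical, the bit values and field layout are copied from the text above, and the real semantics are whatever drivers/acpi/apei/einj.c and the firmware implement. Running the injection for real would need root, a mounted debugfs and a BIOS that exposes EINJ.

```python
from pathlib import Path

# Bit values for the "flags" file, as listed in the patched text.
FLAG_APIC_VALID = 0x1   # bit 0: param3 holds the APIC id
FLAG_MEM_VALID = 0x2    # bit 1: param1/param2 hold address and mask
FLAG_PCIE_VALID = 0x4   # bit 2: param4 holds the PCIe target

MASK64 = (1 << 64) - 1


def page_mask(page_shift=12):
    """Return the 64-bit mask 0xfffffffffffff000 for 4 KiB pages."""
    return (-1 << page_shift) & MASK64


def pack_sbdf(segment, bus, device, function):
    """Pack a PCIe target following the bit-layout diagram in the text:
    segment 31:24, bus 23:16, device 15:11, function 10:8, 7:0 reserved."""
    if not (0 <= segment < 256 and 0 <= bus < 256
            and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("field out of range for the documented layout")
    return (segment << 24) | (bus << 16) | (device << 11) | (function << 8)


def einj_table_present(tables_dir="/sys/firmware/acpi/tables"):
    """Check for the "EINJ" file that the text says to look for."""
    return (Path(tables_dir) / "EINJ").exists()


def inject_memory_error(einj_dir, address, error_type=0x8, mask=None):
    """Write the parameters first and error_inject last, as the text requires.

    einj_dir is assumed to be <debugfs mount point>/apei/einj.
    """
    einj = Path(einj_dir)
    mask = page_mask() if mask is None else mask
    steps = [
        ("param1", address),
        ("param2", mask),
        ("error_type", error_type),
        ("error_inject", 1),  # any integer triggers; must be the final write
    ]
    for name, value in steps:
        (einj / name).write_text(f"{value:#x}\n")
    return steps
```

With these helpers, inject_memory_error("/sys/kernel/debug/apei/einj", 0x12345000) would perform the same four writes as the shell example in the patch, and FLAG_APIC_VALID | FLAG_MEM_VALID gives the 0x3 value the text mentions for processor-related errors.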