Merge tag 'edac/v4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac

-2

Documentation/00-INDEX

··· 152 152 - directory with info about Linux driver model. 153 153 early-userspace/ 154 154 - info about initramfs, klibc, and userspace early during boot. 155 - edac.txt 156 - - information on EDAC - Error Detection And Correction 157 155 efi-stub.txt 158 156 - How to use the EFI boot stub to bypass GRUB or elilo on EFI systems. 159 157 eisa.txt

+1

Documentation/admin-guide/index.rst

··· 59 59 binfmt-misc 60 60 mono 61 61 java 62 + ras 62 63 63 64 .. only:: subproject and html 64 65

+1190

Documentation/admin-guide/ras.rst

··· 1 + .. include:: <isonum.txt> 2 + 3 + ============================================ 4 + Reliability, Availability and Serviceability 5 + ============================================ 6 + 7 + RAS concepts 8 + ************ 9 + 10 + Reliability, Availability and Serviceability (RAS) is a concept used on 11 + servers meant to measure their robustness. 12 + 13 + Reliability 14 + is the probability that a system will produce correct outputs. 15 + 16 + * Generally measured as Mean Time Between Failures (MTBF) 17 + * Enhanced by features that help to avoid, detect and repair hardware faults 18 + 19 + Availability 20 + is the probability that a system is operational at a given time 21 + 22 + * Generally measured as a percentage of downtime over a period of time 23 + * Often uses mechanisms to detect and correct hardware faults at 24 + runtime; 25 + 26 + Serviceability (or maintainability) 27 + is the simplicity and speed with which a system can be repaired or 28 + maintained 29 + 30 + * Generally measured as Mean Time Between Repair (MTBR) 31 + 32 + Improving RAS 33 + ------------- 34 + 35 + In order to reduce system downtime, a system should be capable of detecting 36 + hardware errors and, when possible, correcting them at runtime. It should 37 + also provide mechanisms to detect hardware degradation, in order to warn 38 + the system administrator to take the action of replacing a component before 39 + it causes data loss or system downtime. 40 + 41 + Among the monitoring measures, the most usual ones include: 42 + 43 + * CPU – detect errors at instruction execution and at L1/L2/L3 caches; 44 + * Memory – add error correction logic (ECC) to detect and correct errors; 45 + * I/O – add CRC checksums for transferred data; 46 + * Storage – RAID, journal file systems, checksums, 47 + Self-Monitoring, Analysis and Reporting Technology (SMART). 
48 + 49 + By monitoring the number of occurrences of error detections, it is possible 50 + to identify if the probability of hardware errors is increasing and, in such 51 + a case, do preventive maintenance to replace a degraded component while 52 + those errors are correctable. 53 + 54 + Types of errors 55 + --------------- 56 + 57 + Most mechanisms used on modern systems use technologies like Hamming 58 + Codes that allow error correction when the number of errors on a bit packet 59 + is below a threshold. If the number of errors is above it, those mechanisms 60 + can indicate with a high degree of confidence that an error happened, but 61 + they can't correct it. 62 + 63 + Also, sometimes an error occurs on a component that is not in use. For 64 + example, a part of the memory that is not currently allocated. 65 + 66 + That defines some categories of errors: 67 + 68 + * **Correctable Error (CE)** - the error detection mechanism detected and 69 + corrected the error. Such errors are usually not fatal, although some 70 + Kernel mechanisms allow the system administrator to consider them as fatal. 71 + 72 + * **Uncorrected Error (UE)** - the number of errors was above the error 73 + correction threshold, and the system was unable to auto-correct. 74 + 75 + * **Fatal Error** - when a UE happens on a critical component of the 76 + system (for example, a piece of the Kernel got corrupted by a UE), the 77 + only reliable way to avoid data corruption is to hang or reboot the machine. 78 + 79 + * **Non-fatal Error** - when a UE happens on an unused component, 80 + like a CPU in power down state or an unused memory bank, the system may 81 + still run, eventually replacing the affected hardware by a hot spare, 82 + if available. 83 + 84 + Also, when an error happens in a userspace process, it is also possible to 85 + kill such process and let userspace restart it. 
86 + 87 + The mechanism for handling non-fatal errors is usually complex and may 88 + require the help of some userspace application, in order to apply the 89 + policy desired by the system administrator. 90 + 91 + Identifying a bad hardware component 92 + ------------------------------------ 93 + 94 + Just detecting a hardware flaw is usually not enough, as the system needs 95 + to pinpoint the minimal replaceable unit (MRU) that should be exchanged 96 + to make the hardware reliable again. 97 + 98 + So, it requires not only error logging facilities, but also mechanisms that 99 + will translate the error message to the silkscreen or component label for 100 + the MRU. 101 + 102 + Typically, it is very complex for memory, as modern CPUs interleave memory 103 + from different memory modules, in order to provide better performance. The 104 + DMI BIOS usually has a list of memory module labels, which can be obtained 105 + using the ``dmidecode`` tool. For example, on a desktop machine, it shows:: 106 + 107 + Memory Device 108 + Total Width: 64 bits 109 + Data Width: 64 bits 110 + Size: 16384 MB 111 + Form Factor: SODIMM 112 + Set: None 113 + Locator: ChannelA-DIMM0 114 + Bank Locator: BANK 0 115 + Type: DDR4 116 + Type Detail: Synchronous 117 + Speed: 2133 MHz 118 + Rank: 2 119 + Configured Clock Speed: 2133 MHz 120 + 121 + On the above example, a DDR4 SO-DIMM memory module is located at the 122 + system's memory labeled as "BANK 0", as given by the *bank locator* field. 123 + Please notice that, on such system, the *total width* is equal to the 124 + *data width*. It means that such memory module doesn't have error 125 + detection/correction mechanisms. 126 + 127 + Unfortunately, not all systems use the same field to specify the memory 128 + bank. 
In this example, from an older server, ``dmidecode`` shows:: 129 + 130 + Memory Device 131 + Array Handle: 0x1000 132 + Error Information Handle: Not Provided 133 + Total Width: 72 bits 134 + Data Width: 64 bits 135 + Size: 8192 MB 136 + Form Factor: DIMM 137 + Set: 1 138 + Locator: DIMM_A1 139 + Bank Locator: Not Specified 140 + Type: DDR3 141 + Type Detail: Synchronous Registered (Buffered) 142 + Speed: 1600 MHz 143 + Rank: 2 144 + Configured Clock Speed: 1600 MHz 145 + 146 + There, the DDR3 RDIMM memory module is located at the system's memory labeled 147 + as "DIMM_A1", as given by the *locator* field. Please notice that this 148 + memory module has 64 bits of *data width* and 72 bits of *total width*. So, 149 + it has 8 extra bits to be used by error detection and correction mechanisms. 150 + Such kind of memory is called Error-correcting code memory (ECC memory). 151 + 152 + To make things even worse, it is not uncommon for systems with different 153 + labels on their boards to use exactly the same BIOS, meaning that 154 + the labels provided by the BIOS won't match the real ones. 155 + 156 + ECC memory 157 + ---------- 158 + 159 + As mentioned in the previous section, ECC memory has extra bits to be 160 + used for error correction. So, on 64 bit systems, a memory module 161 + has 64 bits of *data width*, and 72 bits of *total width*. So, there are 162 + 8 extra bits to be used for the error detection and correction 163 + mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_. 164 + 165 + So, when the CPU requests the memory controller to write a word with 166 + *data width*, the memory controller calculates the *syndrome* in real time, 167 + using Hamming code, or some other error correction code, like SECDED+, 168 + producing a code with *total width* size. Such code is then written 169 + on the memory modules. 
170 + 171 + At read, the *total width* bits code is converted back, using the same 172 + ECC code used on write, producing a word with *data width* and a *syndrome*. 173 + The word with *data width* is sent to the CPU, even when errors happen. 174 + 175 + The memory controller also looks at the *syndrome* in order to check if 176 + there was an error, and if the ECC code was able to fix such error. 177 + If the error was corrected, a Corrected Error (CE) happened. If not, an 178 + Uncorrected Error (UE) happened. 179 + 180 + The information about the CE/UE errors is stored on some special registers 181 + at the memory controller and can be accessed by reading such registers, 182 + either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64 183 + bit CPUs, such errors can also be retrieved via the Machine Check 184 + Architecture (MCA)\ [#f3]_. 185 + 186 + .. [#f1] Please notice that several memory controllers allow operation on a 187 + mode called "Lock-Step", where it groups two memory modules together, 188 + doing 128-bit reads/writes. That gives 16 bits for error correction, which 189 + significantly improves the error correction mechanism, at the expense 190 + that, when an error happens, there's no way to know which memory module is 191 + to blame. So, it has to blame both memory modules. 192 + 193 + .. [#f2] Some memory controllers also allow using memory in mirror mode. 194 + On such mode, the same data is written to two memory modules. At read, 195 + the system checks both memory modules, in order to check if both provide 196 + identical data. On such configuration, when an error happens, there's no 197 + way to know which memory module is to blame. So, it has to blame both 198 + memory modules (or 4 memory modules, if the system is also on Lock-step 199 + mode). 200 + 201 + .. [#f3] For more details about the Machine Check Architecture (MCA), 202 + please read Documentation/x86/x86_64/machinecheck at the Kernel tree. 
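The write and read paths described in the ECC memory section can be illustrated with a toy model. The sketch below is not what any real memory controller implements: it assumes a scaled-down SECDED code (a Hamming(7,4) code plus one overall parity bit) protecting 4 data bits with 4 check bits, instead of the 64 data bits and 8 check bits of a real module. The function names ``encode`` and ``decode`` are hypothetical.

```python
def encode(data4):
    """Write path: turn 4 data bits into an 8-bit codeword (toy SECDED)."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]   # Hamming positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit         # extra parity bit over the whole word
    return word + [overall]


def decode(code8):
    """Read path: return (data bits, status), status is OK, CE or UE."""
    word = list(code8[:7])
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)   # position of a single flipped bit
    parity_ok = sum(code8) % 2 == 0

    if syndrome == 0 and parity_ok:
        status = "OK"
    elif not parity_ok:
        # Odd number of flipped bits: treat it as a single-bit error.
        if syndrome:
            word[syndrome - 1] ^= 1         # correct the flipped bit
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped, not fixable.
        status = "UE"

    # The data word is returned even when the error could not be corrected.
    return [word[2], word[4], word[5], word[6]], status
```

Flipping any single bit of ``encode([1, 0, 1, 1])`` before calling ``decode`` yields the original data with status ``CE``, while flipping two bits yields status ``UE``, mirroring the CE/UE distinction made above.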
203 + 204 + EDAC - Error Detection And Correction 205 + ************************************* 206 + 207 + .. note:: 208 + 209 + "bluesmoke" was the name for this device driver subsystem when it 210 + was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net. 211 + That site is mostly archaic now and can be used only for historical 212 + purposes. 213 + 214 + When the subsystem was pushed upstream for the first time, on 215 + Kernel 2.6.16, it was renamed to ``EDAC``. 216 + 217 + Purpose 218 + ------- 219 + 220 + The ``edac`` kernel module's goal is to detect and report hardware errors 221 + that occur within the computer system running under Linux. 222 + 223 + Memory 224 + ------ 225 + 226 + Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the 227 + primary errors being harvested. These types of errors are harvested by 228 + the ``edac_mc`` device. 229 + 230 + Detecting CE events, then harvesting those events and reporting them, 231 + **can**, but need not, be a predictor of future UE events. With 232 + CE events only, the system can and will continue to operate as no data 233 + has been damaged yet. 234 + 235 + However, preventive maintenance and proactive part replacement of memory 236 + modules exhibiting CEs can reduce the likelihood of the dreaded UE events 237 + and system panics. 238 + 239 + Other hardware elements 240 + ----------------------- 241 + 242 + A new feature for EDAC, the ``edac_device`` class of device, was added in 243 + the 2.6.23 version of the kernel. 244 + 245 + This new device type allows for non-memory type of ECC hardware detectors 246 + to have their states harvested and presented to userspace via the sysfs 247 + interface. 248 + 249 + Some architectures have ECC detectors for L1, L2 and L3 caches, 250 + along with DMA engines, fabric switches, main data path switches, 251 + interconnections, and various other hardware data paths. 
If the hardware 252 + reports it, then a edac_device device probably can be constructed to 253 + harvest and present that to userspace. 254 + 255 + 256 + PCI bus scanning 257 + ---------------- 258 + 259 + In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors 260 + in order to determine if errors are occurring during data transfers. 261 + 262 + The presence of PCI Parity errors must be examined with a grain of salt. 263 + There are several add-in adapters that do **not** follow the PCI specification 264 + with regards to Parity generation and reporting. The specification says 265 + the vendor should tie the parity status bits to 0 if they do not intend 266 + to generate parity. Some vendors do not do this, and thus the parity bit 267 + can "float" giving false positives. 268 + 269 + There is a PCI device attribute located in sysfs that is checked by 270 + the EDAC PCI scanning code. If that attribute is set, PCI parity/error 271 + scanning is skipped for that device. The attribute is:: 272 + 273 + broken_parity_status 274 + 275 + and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for 276 + PCI devices. 277 + 278 + 279 + Versioning 280 + ---------- 281 + 282 + EDAC is composed of a "core" module (``edac_core.ko``) and several Memory 283 + Controller (MC) driver modules. On a given system, the CORE is loaded 284 + and one MC driver will be loaded. Both the CORE and the MC driver (or 285 + ``edac_device`` driver) have individual versions that reflect current 286 + release level of their respective modules. 287 + 288 + Thus, to "report" on what version a system is running, one must report 289 + both the CORE's and the MC driver's versions. 290 + 291 + 292 + Loading 293 + ------- 294 + 295 + If ``edac`` was statically linked with the kernel then no loading 296 + is necessary. If ``edac`` was built as modules then simply modprobe 297 + the ``edac`` pieces that you need. 
You should be able to modprobe 298 + hardware-specific modules and have the dependencies load the necessary 299 + core modules. 300 + 301 + Example:: 302 + 303 + $ modprobe amd76x_edac 304 + 305 + loads both the ``amd76x_edac.ko`` memory controller module and the 306 + ``edac_mc.ko`` core module. 307 + 308 + 309 + Sysfs interface 310 + --------------- 311 + 312 + EDAC presents a ``sysfs`` interface for control and reporting purposes. It 313 + lives in the /sys/devices/system/edac directory. 314 + 315 + Within this directory there currently reside 2 components: 316 + 317 + ======= ============================== 318 + mc memory controller(s) system 319 + pci PCI control and status system 320 + ======= ============================== 321 + 322 + 323 + 324 + Memory Controller (mc) Model 325 + ---------------------------- 326 + 327 + Each ``mc`` device controls a set of memory modules [#f4]_. These modules 328 + are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``). 329 + There can be multiple csrows and multiple channels. 330 + 331 + .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely 332 + used to refer to a memory module, although there are other memory 333 + packaging alternatives, like SO-DIMM, SIMM, etc. Along this document, 334 + and inside the EDAC system, the term "dimm" is used for all memory 335 + modules, even when they use a different kind of packaging. 336 + 337 + Memory controllers allow for several csrows, with 8 csrows being a 338 + typical value. Yet, the actual number of csrows depends on the layout of 339 + a given motherboard, memory controller and memory module characteristics. 340 + 341 + Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems) 342 + data transfers to/from the CPU from/to memory. Some newer chipsets allow 343 + for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory 344 + controllers. 
The following example will assume 2 channels: 345 + 346 + +------------+-----------------------+ 347 + | Chip | Channels | 348 + | Select +-----------+-----------+ 349 + | rows | ``ch0`` | ``ch1`` | 350 + +============+===========+===========+ 351 + | ``csrow0`` | DIMM_A0 | DIMM_B0 | 352 + +------------+ | | 353 + | ``csrow1`` | | | 354 + +------------+-----------+-----------+ 355 + | ``csrow2`` | DIMM_A1 | DIMM_B1 | 356 + +------------+ | | 357 + | ``csrow3`` | | | 358 + +------------+-----------+-----------+ 359 + 360 + In the above example, there are 4 physical slots on the motherboard 361 + for memory DIMMs: 362 + 363 + +---------+---------+ 364 + | DIMM_A0 | DIMM_B0 | 365 + +---------+---------+ 366 + | DIMM_A1 | DIMM_B1 | 367 + +---------+---------+ 368 + 369 + Labels for these slots are usually silk-screened on the motherboard. 370 + Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are 371 + channel 1. Notice that there are two csrows possible on a physical DIMM. 372 + These csrows are allocated their csrow assignment based on the slot into 373 + which the memory DIMM is placed. Thus, when 1 DIMM is placed in each 374 + Channel, the csrows cross both DIMMs. 375 + 376 + Memory DIMMs come single or dual "ranked". A rank is a populated csrow. 377 + Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above 378 + will have just one csrow (csrow0). csrow1 will be empty. On the other 379 + hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0 380 + and csrow1 will be populated. The pattern repeats itself for csrow2 and 381 + csrow3. 382 + 383 + The representation of the above is reflected in the directory 384 + tree in EDAC's sysfs interface. Starting in directory 385 + ``/sys/devices/system/edac/mc``, each memory controller will be 386 + represented by its own ``mcX`` directory, where ``X`` is the 387 + index of the MC:: 388 + 389 + ..../edac/mc/ 390 + | 391 + |->mc0 392 + |->mc1 393 + |->mc2 394 + .... 
395 + 396 + Under each ``mcX`` directory each ``csrowX`` is again represented by a 397 + ``csrowX``, where ``X`` is the csrow index:: 398 + 399 + .../mc/mc0/ 400 + | 401 + |->csrow0 402 + |->csrow2 403 + |->csrow3 404 + .... 405 + 406 + Notice that there is no csrow1, which indicates that csrow0 is composed 407 + of a single ranked DIMMs. This should also apply in both Channels, in 408 + order to have dual-channel mode be operational. Since both csrow2 and 409 + csrow3 are populated, this indicates a dual ranked set of DIMMs for 410 + channels 0 and 1. 411 + 412 + Within each of the ``mcX`` and ``csrowX`` directories are several EDAC 413 + control and attribute files. 414 + 415 + ``mcX`` directories 416 + ------------------- 417 + 418 + In ``mcX`` directories are EDAC control and attribute files for 419 + this ``X`` instance of the memory controllers. 420 + 421 + For a description of the sysfs API, please see: 422 + 423 + Documentation/ABI/testing/sysfs-devices-edac 424 + 425 + 426 + ``dimmX`` or ``rankX`` directories 427 + ---------------------------------- 428 + 429 + The recommended way to use the EDAC subsystem is to look at the information 430 + provided by the ``dimmX`` or ``rankX`` directories [#f5]_. 
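As a rough illustration of the recommendation above, the following sketch walks the ``mcX`` directories and collects the per-module attributes from each ``dimmX`` or ``rankX`` directory. The helper names ``read_attr`` and ``list_memory_modules`` are hypothetical, and the sketch assumes that ``rankX`` directories expose the same ``dimm_label``, ``dimm_location`` and ``size`` files as ``dimmX``, which may not hold for every driver.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped content of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def list_memory_modules(edac_root="/sys/devices/system/edac"):
    """Collect one dict per dimmX/rankX directory found under edac/mc/mcX."""
    modules = []
    mc_root = Path(edac_root) / "mc"
    for mc in sorted(mc_root.glob("mc[0-9]*")):
        for pattern in ("dimm[0-9]*", "rank[0-9]*"):
            for mod in sorted(mc.glob(pattern)):
                modules.append({
                    "mc": mc.name,
                    "module": mod.name,
                    "label": read_attr(mod / "dimm_label"),
                    "location": read_attr(mod / "dimm_location"),
                    "size_mb": read_attr(mod / "size"),
                    # Controller-wide counters, repeated for convenience.
                    "mc_ce_count": read_attr(mc / "ce_count"),
                    "mc_ue_count": read_attr(mc / "ue_count"),
                })
    return modules


if __name__ == "__main__":
    for entry in list_memory_modules():
        print(entry)
```

Taking the root directory as a parameter keeps the function usable against a copy of the tree, which is convenient on machines where no EDAC driver is loaded.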
431 + 432 + A typical EDAC system has the following structure under 433 + ``/sys/devices/system/edac/``\ [#f6]_:: 434 + 435 + /sys/devices/system/edac/ 436 + ├── mc 437 + │ ├── mc0 438 + │ │ ├── ce_count 439 + │ │ ├── ce_noinfo_count 440 + │ │ ├── dimm0 441 + │ │ │ ├── dimm_dev_type 442 + │ │ │ ├── dimm_edac_mode 443 + │ │ │ ├── dimm_label 444 + │ │ │ ├── dimm_location 445 + │ │ │ ├── dimm_mem_type 446 + │ │ │ ├── size 447 + │ │ │ └── uevent 448 + │ │ ├── max_location 449 + │ │ ├── mc_name 450 + │ │ ├── reset_counters 451 + │ │ ├── seconds_since_reset 452 + │ │ ├── size_mb 453 + │ │ ├── ue_count 454 + │ │ ├── ue_noinfo_count 455 + │ │ └── uevent 456 + │ ├── mc1 457 + │ │ ├── ce_count 458 + │ │ ├── ce_noinfo_count 459 + │ │ ├── dimm0 460 + │ │ │ ├── dimm_dev_type 461 + │ │ │ ├── dimm_edac_mode 462 + │ │ │ ├── dimm_label 463 + │ │ │ ├── dimm_location 464 + │ │ │ ├── dimm_mem_type 465 + │ │ │ ├── size 466 + │ │ │ └── uevent 467 + │ │ ├── max_location 468 + │ │ ├── mc_name 469 + │ │ ├── reset_counters 470 + │ │ ├── seconds_since_reset 471 + │ │ ├── size_mb 472 + │ │ ├── ue_count 473 + │ │ ├── ue_noinfo_count 474 + │ │ └── uevent 475 + │ └── uevent 476 + └── uevent 477 + 478 + In the ``dimmX`` directories are EDAC control and attribute files for 479 + this ``X`` memory module: 480 + 481 + - ``size`` - Total memory managed by this csrow attribute file 482 + 483 + This attribute file displays, in count of megabytes, the memory 484 + that this csrow contains. 485 + 486 + - ``dimm_dev_type`` - Device type attribute file 487 + 488 + This attribute file will display what type of DRAM device is 489 + being utilized on this DIMM. 490 + Examples: 491 + 492 + - x1 493 + - x2 494 + - x4 495 + - x8 496 + 497 + - ``dimm_edac_mode`` - EDAC Mode of operation attribute file 498 + 499 + This attribute file will display what type of Error detection 500 + and correction is being utilized. 
501 + 502 + - ``dimm_label`` - memory module label control file 503 + 504 + This control file allows this DIMM to have a label assigned 505 + to it. With this label in the module, when errors occur 506 + the output can provide the DIMM label in the system log. 507 + This becomes vital for panic events to isolate the 508 + cause of the UE event. 509 + 510 + DIMM Labels must be assigned after booting, with information 511 + that correctly identifies the physical slot with its 512 + silk screen label. This information is currently very 513 + motherboard specific and determination of this information 514 + must occur in userland at this time. 515 + 516 + - ``dimm_location`` - location of the memory module 517 + 518 + The location can have up to 3 levels, and describe how the 519 + memory controller identifies the location of a memory module. 520 + Depending on the type of memory and memory controller, it 521 + can be: 522 + 523 + - *csrow* and *channel* - used when the memory controller 524 + doesn't identify a single DIMM - e. g. in ``rankX`` dir; 525 + - *branch*, *channel*, *slot* - typically used on FB-DIMM memory 526 + controllers; 527 + - *channel*, *slot* - used on Nehalem and newer Intel drivers. 528 + 529 + - ``dimm_mem_type`` - Memory Type attribute file 530 + 531 + This attribute file will display what type of memory is currently 532 + on this csrow. Normally, either buffered or unbuffered memory. 533 + Examples: 534 + 535 + - Registered-DDR 536 + - Unbuffered-DDR 537 + 538 + .. [#f5] On some systems, the memory controller doesn't have any logic 539 + to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories. 540 + On modern Intel memory controllers, the memory controller identifies the 541 + memory modules directly. On such systems, the directory is called ``dimmX``. 542 + 543 + .. 
[#f6] There are also some ``power`` directories and ``subsystem`` 544 + symlinks inside the sysfs mapping that are automatically created by 545 + the sysfs subsystem. Currently, they serve no purpose. 546 + 547 + ``csrowX`` directories 548 + ---------------------- 549 + 550 + When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX`` 551 + directories. As this API doesn't work properly for Rambus, FB-DIMMs and 552 + modern Intel Memory Controllers, this is being deprecated in favor of 553 + ``dimmX`` directories. 554 + 555 + In the ``csrowX`` directories are EDAC control and attribute files for 556 + this ``X`` instance of csrow: 557 + 558 + 559 + - ``ue_count`` - Total Uncorrectable Errors count attribute file 560 + 561 + This attribute file displays the total count of uncorrectable 562 + errors that have occurred on this csrow. If panic_on_ue is set 563 + this counter will not have a chance to increment, since EDAC 564 + will panic the system. 565 + 566 + 567 + - ``ce_count`` - Total Correctable Errors count attribute file 568 + 569 + This attribute file displays the total count of correctable 570 + errors that have occurred on this csrow. This count is very 571 + important to examine. CEs provide early indications that a 572 + DIMM is beginning to fail. This count field should be 573 + monitored for non-zero values and report such information 574 + to the system administrator. 575 + 576 + 577 + - ``size_mb`` - Total memory managed by this csrow attribute file 578 + 579 + This attribute file displays, in count of megabytes, the memory 580 + that this csrow contains. 581 + 582 + 583 + - ``mem_type`` - Memory Type attribute file 584 + 585 + This attribute file will display what type of memory is currently 586 + on this csrow. Normally, either buffered or unbuffered memory. 
587 + Examples: 588 + 589 + - Registered-DDR 590 + - Unbuffered-DDR 591 + 592 + 593 + - ``edac_mode`` - EDAC Mode of operation attribute file 594 + 595 + This attribute file will display what type of Error detection 596 + and correction is being utilized. 597 + 598 + 599 + - ``dev_type`` - Device type attribute file 600 + 601 + This attribute file will display what type of DRAM device is 602 + being utilized on this DIMM. 603 + Examples: 604 + 605 + - x1 606 + - x2 607 + - x4 608 + - x8 609 + 610 + 611 + - ``ch0_ce_count`` - Channel 0 CE Count attribute file 612 + 613 + This attribute file will display the count of CEs on this 614 + DIMM located in channel 0. 615 + 616 + 617 + - ``ch0_ue_count`` - Channel 0 UE Count attribute file 618 + 619 + This attribute file will display the count of UEs on this 620 + DIMM located in channel 0. 621 + 622 + 623 + - ``ch0_dimm_label`` - Channel 0 DIMM Label control file 624 + 625 + 626 + This control file allows this DIMM to have a label assigned 627 + to it. With this label in the module, when errors occur 628 + the output can provide the DIMM label in the system log. 629 + This becomes vital for panic events to isolate the 630 + cause of the UE event. 631 + 632 + DIMM Labels must be assigned after booting, with information 633 + that correctly identifies the physical slot with its 634 + silk screen label. This information is currently very 635 + motherboard specific and determination of this information 636 + must occur in userland at this time. 637 + 638 + 639 + - ``ch1_ce_count`` - Channel 1 CE Count attribute file 640 + 641 + 642 + This attribute file will display the count of CEs on this 643 + DIMM located in channel 1. 644 + 645 + 646 + - ``ch1_ue_count`` - Channel 1 UE Count attribute file 647 + 648 + 649 + This attribute file will display the count of UEs on this 650 + DIMM located in channel 1. 
651 + 652 + 653 + - ``ch1_dimm_label`` - Channel 1 DIMM Label control file 654 + 655 + This control file allows this DIMM to have a label assigned 656 + to it. With this label in the module, when errors occur 657 + the output can provide the DIMM label in the system log. 658 + This becomes vital for panic events to isolate the 659 + cause of the UE event. 660 + 661 + DIMM Labels must be assigned after booting, with information 662 + that correctly identifies the physical slot with its 663 + silk screen label. This information is currently very 664 + motherboard specific and determination of this information 665 + must occur in userland at this time. 666 + 667 + 668 + System Logging 669 + -------------- 670 + 671 + If logging for UEs and CEs is enabled, then system logs will contain 672 + information indicating that errors have been detected:: 673 + 674 + EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac 675 + EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac 676 + 677 + 678 + The structure of the message is: 679 + 680 + +---------------------------------------+-------------+ 681 + | Content + Example | 682 + +=======================================+=============+ 683 + | The memory controller | MC0 | 684 + +---------------------------------------+-------------+ 685 + | Error type | CE | 686 + +---------------------------------------+-------------+ 687 + | Memory page | 0x283 | 688 + +---------------------------------------+-------------+ 689 + | Offset in the page | 0xce0 | 690 + +---------------------------------------+-------------+ 691 + | The byte granularity | grain 8 | 692 + | or resolution of the error | | 693 + +---------------------------------------+-------------+ 694 + | The error syndrome | 0xb741 | 695 + +---------------------------------------+-------------+ 696 + | Memory row | row 0 + 697 + 
+---------------------------------------+-------------+ 698 + | Memory channel | channel 1 | 699 + +---------------------------------------+-------------+ 700 + | DIMM label, if set prior | DIMM B1 | 701 + +---------------------------------------+-------------+ 702 + | And then an optional, driver-specific | | 703 + | message that may have additional | | 704 + | information. | | 705 + +---------------------------------------+-------------+ 706 + 707 + Both UEs and CEs with no info will lack all but memory controller, error 708 + type, a notice of "no info" and then an optional, driver-specific error 709 + message. 710 + 711 + 712 + PCI Bus Parity Detection 713 + ------------------------ 714 + 715 + On Header Type 00 devices, the primary status is looked at for any 716 + parity error regardless of whether parity is enabled on the device or 717 + not. (The spec indicates parity is generated in some cases). On Header 718 + Type 01 bridges, the secondary status register is also looked at to see 719 + if parity occurred on the bus on the other side of the bridge. 720 + 721 + 722 + Sysfs configuration 723 + ------------------- 724 + 725 + Under ``/sys/devices/system/edac/pci`` are control and attribute files as 726 + follows: 727 + 728 + 729 + - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file 730 + 731 + This control file enables or disables the PCI Bus Parity scanning 732 + operation. Writing a 1 to this file enables the scanning. Writing 733 + a 0 to this file disables the scanning. 734 + 735 + Enable:: 736 + 737 + echo "1" >/sys/devices/system/edac/pci/check_pci_parity 738 + 739 + Disable:: 740 + 741 + echo "0" >/sys/devices/system/edac/pci/check_pci_parity 742 + 743 + 744 + - ``pci_parity_count`` - Parity Count 745 + 746 + This attribute file will display the number of parity errors that 747 + have been detected. 
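The two files described in the Sysfs configuration list above can also be driven from a script. This is a minimal sketch, assuming the files behave exactly as described (``check_pci_parity`` accepts ``0`` or ``1``, ``pci_parity_count`` holds a decimal integer); the helper names are hypothetical and writing to sysfs normally requires root.

```python
from pathlib import Path

# Location given in the text above; passed as a parameter so it can be overridden.
PCI_DIR = Path("/sys/devices/system/edac/pci")


def set_pci_parity_check(enabled, pci_dir=PCI_DIR):
    """Enable (True) or disable (False) PCI bus parity scanning."""
    (Path(pci_dir) / "check_pci_parity").write_text("1" if enabled else "0")


def pci_parity_errors(pci_dir=PCI_DIR):
    """Return the number of PCI parity errors detected so far."""
    return int((Path(pci_dir) / "pci_parity_count").read_text().strip())


if __name__ == "__main__":
    set_pci_parity_check(True)
    print("PCI parity errors:", pci_parity_errors())
```

Given the caveat about adapters whose parity bit can "float", a growing ``pci_parity_count`` is better treated as a hint to investigate than as proof of a failing device.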
748 + 749 + 750 + Module parameters 751 + ----------------- 752 + 753 + - ``edac_mc_panic_on_ue`` - Panic on UE control file 754 + 755 + An uncorrectable error will cause a machine panic. This is usually 756 + desirable. It is a bad idea to continue when an uncorrectable error 757 + occurs - it is indeterminate what was uncorrected and the operating 758 + system context might be so mangled that continuing will lead to further 759 + corruption. If the kernel has MCE configured, then EDAC will never 760 + notice the UE. 761 + 762 + LOAD TIME:: 763 + 764 + module/kernel parameter: edac_mc_panic_on_ue=[0|1] 765 + 766 + RUN TIME:: 767 + 768 + echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue 769 + 770 + 771 + - ``edac_mc_log_ue`` - Log UE control file 772 + 773 + 774 + Generate kernel messages describing uncorrectable errors. These errors 775 + are reported through the system message log. UE statistics 776 + will be accumulated even when UE logging is disabled. 777 + 778 + LOAD TIME:: 779 + 780 + module/kernel parameter: edac_mc_log_ue=[0|1] 781 + 782 + RUN TIME:: 783 + 784 + echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue 785 + 786 + 787 + - ``edac_mc_log_ce`` - Log CE control file 788 + 789 + 790 + Generate kernel messages describing correctable errors. These 791 + errors are reported through the system message log. 792 + CE statistics will be accumulated even when CE logging is disabled. 793 + 794 + LOAD TIME:: 795 + 796 + module/kernel parameter: edac_mc_log_ce=[0|1] 797 + 798 + RUN TIME:: 799 + 800 + echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce 801 + 802 + 803 + - ``edac_mc_poll_msec`` - Polling period control file 804 + 805 + 806 + The time period, in milliseconds, for polling for error information. 807 + Too small a value wastes resources. Too large a value might delay 808 + necessary handling of errors and might lose valuable information for 809 + locating the error. 
1000 milliseconds (once each second) is the current 810 + default. Systems which require all the bandwidth they can get may 811 + increase this. 812 + 813 + LOAD TIME:: 814 + 815 + module/kernel parameter: edac_mc_poll_msec=<period in milliseconds> 816 + 817 + RUN TIME:: 818 + 819 + echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec 820 + 821 + 822 + - ``panic_on_pci_parity`` - Panic on PCI PARITY Error 823 + 824 + 825 + This control file enables or disables panicking when a parity 826 + error has been detected. 827 + 828 + 829 + module/kernel parameter:: 830 + 831 + edac_panic_on_pci_pe=[0|1] 832 + 833 + Enable:: 834 + 835 + echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe 836 + 837 + Disable:: 838 + 839 + echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe 840 + 841 + 842 + 843 + EDAC device type 844 + ---------------- 845 + 846 + In the header file, edac_pci.h, there is a series of edac_device structures 847 + and APIs for the EDAC_DEVICE. 848 + 849 + User space access to an edac_device is through the sysfs interface. 850 + 851 + At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices 852 + will appear. 853 + 854 + There is a three level tree beneath the above ``edac`` directory. For example, 855 + the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net 856 + website) installs itself as:: 857 + 858 + /sys/devices/system/edac/test-instance 859 + 860 + In this directory are various controls, a symlink and one or more ``instance`` 861 + directories. 
862 + 863 + The standard default controls are: 864 + 865 + ============== ======================================================= 866 + log_ce boolean to log CE events 867 + log_ue boolean to log UE events 868 + panic_on_ue boolean to ``panic`` the system if an UE is encountered 869 + (default off, can be set true via startup script) 870 + poll_msec time period between POLL cycles for events 871 + ============== ======================================================= 872 + 873 + The test_device_edac device adds at least one of its own custom control: 874 + 875 + ============== ================================================== 876 + test_bits which in the current test driver does nothing but 877 + show how it is installed. A ported driver can 878 + add one or more such controls and/or attributes 879 + for specific uses. 880 + One out-of-tree driver uses controls here to allow 881 + for ERROR INJECTION operations to hardware 882 + injection registers 883 + ============== ================================================== 884 + 885 + The symlink points to the 'struct dev' that is registered for this edac_device. 886 + 887 + Instances 888 + --------- 889 + 890 + One or more instance directories are present. For the ``test_device_edac`` 891 + case: 892 + 893 + +----------------+ 894 + | test-instance0 | 895 + +----------------+ 896 + 897 + 898 + In this directory there are two default counter attributes, which are totals of 899 + counter in deeper subdirectories. 900 + 901 + ============== ==================================== 902 + ce_count total of CE events of subdirectories 903 + ue_count total of UE events of subdirectories 904 + ============== ==================================== 905 + 906 + Blocks 907 + ------ 908 + 909 + At the lowest directory level is the ``block`` directory. 
There can be 0, 1 910 + or more blocks specified in each instance: 911 + 912 + +-------------+ 913 + | test-block0 | 914 + +-------------+ 915 + 916 + In this directory the default attributes are: 917 + 918 + ============== ================================================ 919 + ce_count which is counter of CE events for this ``block`` 920 + of hardware being monitored 921 + ue_count which is counter of UE events for this ``block`` 922 + of hardware being monitored 923 + ============== ================================================ 924 + 925 + 926 + The ``test_device_edac`` device adds 4 attributes and 1 control: 927 + 928 + ================== ==================================================== 929 + test-block-bits-0 for every POLL cycle this counter 930 + is incremented 931 + test-block-bits-1 every 10 cycles, this counter is bumped once, 932 + and test-block-bits-0 is set to 0 933 + test-block-bits-2 every 100 cycles, this counter is bumped once, 934 + and test-block-bits-1 is set to 0 935 + test-block-bits-3 every 1000 cycles, this counter is bumped once, 936 + and test-block-bits-2 is set to 0 937 + ================== ==================================================== 938 + 939 + 940 + ================== ==================================================== 941 + reset-counters writing ANY thing to this control will 942 + reset all the above counters. 943 + ================== ==================================================== 944 + 945 + 946 + Use of the ``test_device_edac`` driver should enable any others to create their own 947 + unique drivers for their hardware systems. 948 + 949 + The ``test_device_edac`` sample driver is located at the 950 + http://bluesmoke.sourceforge.net project site for EDAC. 951 + 952 + 953 + Usage of EDAC APIs on Nehalem and newer Intel CPUs 954 + -------------------------------------------------- 955 + 956 + On older Intel architectures, the memory controller was part of the North 957 + Bridge chipset. 
Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and 958 + newer Intel architectures integrated an enhanced version of the memory 959 + controller (MC) inside the CPUs. 960 + 961 + This chapter will cover the differences of the enhanced memory controllers 962 + found on newer Intel CPUs, such as ``i7core_edac``, ``sb_edac`` and 963 + ``sbx_edac`` drivers. 964 + 965 + .. note:: 966 + 967 + The Xeon E7 processor families use a separate chip for the memory 968 + controller, called Intel Scalable Memory Buffer. This section doesn't 969 + apply to such families. 970 + 971 + 1) There is one Memory Controller per Quick Path Interconnect 972 + (QPI). At the driver, the term "socket" means one QPI. This is 973 + associated with a physical CPU socket. 974 + 975 + Each MC has 3 physical read channels, 3 physical write channels and 976 + 3 logical channels. The driver currently sees it as just 3 channels. 977 + Each channel can have up to 3 DIMMs. 978 + 979 + The minimum known unit is the DIMM. There is no information about csrows. 980 + As the EDAC API treats the csrow as the minimum unit, the driver sequentially 981 + maps channel/DIMM into different csrows. 982 + 983 + For example, supposing the following layout:: 984 + 985 + Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs 986 + dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 987 + dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400 988 + Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs 989 + dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 990 + Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs 991 + dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 992 + 993 + The driver will map it as:: 994 + 995 + csrow0: channel 0, dimm0 996 + csrow1: channel 0, dimm1 997 + csrow2: channel 1, dimm0 998 + csrow3: channel 2, dimm0 999 + 1000 + That is, it exports one DIMM per csrow. 1001 + 1002 + Each QPI is exported as a different memory controller. 
1003 + 1004 + 2) The MC has the ability to inject errors to test drivers. The drivers 1005 + implement this functionality via some error injection nodes: 1006 + 1007 + For injecting a memory error, there are some sysfs nodes, under 1008 + ``/sys/devices/system/edac/mc/mc?/``: 1009 + 1010 + - ``inject_addrmatch/*``: 1011 + Controls the error injection mask register. It is possible to specify 1012 + several characteristics of the address to match an error code:: 1013 + 1014 + dimm = the affected dimm. Numbers are relative to a channel; 1015 + rank = the memory rank; 1016 + channel = the channel that will generate an error; 1017 + bank = the affected bank; 1018 + page = the page address; 1019 + column (or col) = the address column. 1020 + 1021 + each of the above values can be set to "any" to match any valid value. 1022 + 1023 + At driver init, all values are set to any. 1024 + 1025 + For example, to generate an error at rank 1 of dimm 2, for any channel, 1026 + any bank, any page, any column:: 1027 + 1028 + echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm 1029 + echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank 1030 + 1031 + To return to the default behaviour of matching any, you can do:: 1032 + 1033 + echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm 1034 + echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank 1035 + 1036 + - ``inject_eccmask``: 1037 + specifies what bits will have troubles, 1038 + 1039 + - ``inject_section``: 1040 + specifies what ECC cache section will get the error:: 1041 + 1042 + 3 for both 1043 + 2 for the highest 1044 + 1 for the lowest 1045 + 1046 + - ``inject_type``: 1047 + specifies the type of error, being a combination of the following bits:: 1048 + 1049 + bit 0 - repeat 1050 + bit 1 - ecc 1051 + bit 2 - parity 1052 + 1053 + - ``inject_enable``: 1054 + starts the error generation when something different than 0 is written. 1055 + 1056 + All inject vars can be read. 
root permission is needed for write. 1057 + 1058 + Datasheet states that the error will only be generated after a write on an 1059 + address that matches inject_addrmatch. It seems, however, that reading will 1060 + also produce an error. 1061 + 1062 + For example, the following code will generate an error for any write access 1063 + at socket 0, on any DIMM/address on channel 2:: 1064 + 1065 + echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel 1066 + echo 2 >/sys/devices/system/edac/mc/mc0/inject_type 1067 + echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask 1068 + echo 3 >/sys/devices/system/edac/mc/mc0/inject_section 1069 + echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable 1070 + dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null 1071 + 1072 + For socket 1, it is needed to replace "mc0" by "mc1" at the above 1073 + commands. 1074 + 1075 + The generated error message will look like:: 1076 + 1077 + EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) 1078 + 1079 + 3) Corrected Error memory register counters 1080 + 1081 + Those newer MCs have some registers to count memory errors. The driver 1082 + uses those registers to report Corrected Errors on devices with Registered 1083 + DIMMs. 1084 + 1085 + However, those counters don't work with Unregistered DIMM. As the chipset 1086 + offers some counters that also work with UDIMMs (but with a worse level of 1087 + granularity than the default ones), the driver exposes those registers for 1088 + UDIMM memories. 
1089 + 1090 + They can be read by looking at the contents of ``all_channel_counts/``:: 1091 + 1092 + $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done 1093 + /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0 1094 + 0 1095 + /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1 1096 + 0 1097 + /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2 1098 + 0 1099 + 1100 + What happens here is that errors on different csrows, but at the same 1101 + dimm number will increment the same counter. 1102 + So, in this memory mapping:: 1103 + 1104 + csrow0: channel 0, dimm0 1105 + csrow1: channel 0, dimm1 1106 + csrow2: channel 1, dimm0 1107 + csrow3: channel 2, dimm0 1108 + 1109 + The hardware will increment udimm0 for an error at the first dimm at either 1110 + csrow0, csrow2 or csrow3; 1111 + 1112 + The hardware will increment udimm1 for an error at the second dimm at either 1113 + csrow0, csrow2 or csrow3; 1114 + 1115 + The hardware will increment udimm2 for an error at the third dimm at either 1116 + csrow0, csrow2 or csrow3; 1117 + 1118 + 4) Standard error counters 1119 + 1120 + The standard error counters are generated when an mcelog error is received 1121 + by the driver. Since, with UDIMM, this is counted by software, it is 1122 + possible that some errors could be lost. With RDIMM's, they display the 1123 + contents of the registers 1124 + 1125 + Reference documents used on ``amd64_edac`` 1126 + ------------------------------------------ 1127 + 1128 + ``amd64_edac`` module is based on the following documents 1129 + (available from http://support.amd.com/en-us/search/tech-docs): 1130 + 1131 + 1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD 1132 + Opteron Processors 1133 + :AMD publication #: 26094 1134 + :Revision: 3.26 1135 + :Link: http://support.amd.com/TechDocs/26094.PDF 1136 + 1137 + 2. 
:Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh 1138 + Processors 1139 + :AMD publication #: 32559 1140 + :Revision: 3.00 1141 + :Issue Date: May 2006 1142 + :Link: http://support.amd.com/TechDocs/32559.pdf 1143 + 1144 + 3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h 1145 + Processors 1146 + :AMD publication #: 31116 1147 + :Revision: 3.00 1148 + :Issue Date: September 07, 2007 1149 + :Link: http://support.amd.com/TechDocs/31116.pdf 1150 + 1151 + 4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h 1152 + Models 30h-3Fh Processors 1153 + :AMD publication #: 49125 1154 + :Revision: 3.06 1155 + :Issue Date: 2/12/2015 (latest release) 1156 + :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf 1157 + 1158 + 5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h 1159 + Models 60h-6Fh Processors 1160 + :AMD publication #: 50742 1161 + :Revision: 3.01 1162 + :Issue Date: 7/23/2015 (latest release) 1163 + :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf 1164 + 1165 + 6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h 1166 + Models 00h-0Fh Processors 1167 + :AMD publication #: 48751 1168 + :Revision: 3.03 1169 + :Issue Date: 2/23/2015 (latest release) 1170 + :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf 1171 + 1172 + Credits 1173 + ======= 1174 + 1175 + * Written by Doug Thompson <dougthompson@xmission.com> 1176 + 1177 + - 7 Dec 2005 1178 + - 17 Jul 2007 Updated 1179 + 1180 + * |copy| Mauro Carvalho Chehab 1181 + 1182 + - 05 Aug 2009 Nehalem interface 1183 + - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section 1184 + 1185 + * EDAC authors/maintainers: 1186 + 1187 + - Doug Thompson, Dave Jiang, Dave Peterson et al, 1188 + - Mauro Carvalho Chehab 1189 + - Borislav Petkov 1190 + - original author: Thayne Harbaugh

+178

Documentation/driver-api/edac.rst

··· 1 + Error Detection And Correction (EDAC) Devices 2 + ============================================= 3 + 4 + Main Concepts used at the EDAC subsystem 5 + ---------------------------------------- 6 + 7 + There are several things to be aware of that aren't at all obvious, like 8 + *sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*, 9 + etc... 10 + 11 + These are some of the many terms that are thrown about that don't always 12 + mean what people think they mean (Inconceivable!). In the interest of 13 + creating a common ground for discussion, terms and their definitions 14 + will be established. 15 + 16 + * Memory devices 17 + 18 + The individual DRAM chips on a memory stick. These devices commonly 19 + output 4 and 8 bits each (x4, x8). Grouping several of these in parallel 20 + provides the number of bits that the memory controller expects: 21 + typically 72 bits, in order to provide 64 bits + 8 bits of ECC data. 22 + 23 + * Memory Stick 24 + 25 + A printed circuit board that aggregates multiple memory devices in 26 + parallel. In general, this is the Field Replaceable Unit (FRU) which 27 + gets replaced, in the case of excessive errors. Most often it is also 28 + called DIMM (Dual Inline Memory Module). 29 + 30 + * Memory Socket 31 + 32 + A physical connector on the motherboard that accepts a single memory 33 + stick. Also called "slot" on several datasheets. 34 + 35 + * Channel 36 + 37 + A memory controller channel, responsible for communicating with a group of 38 + DIMMs. Each channel has its own independent control (command) and data 39 + bus, and can be used independently or grouped with other channels. 40 + 41 + * Branch 42 + 43 + It is typically the highest hierarchy on a Fully-Buffered DIMM memory 44 + controller. Typically, it contains two channels. Two channels at the 45 + same branch can be used in single mode or in lockstep mode. 
When 46 + lockstep is enabled, the cacheline is doubled, but it generally brings 47 + some performance penalty. Also, it is generally not possible to point to 48 + just one memory stick when an error occurs, as the error correction code 49 + is calculated using two DIMMs instead of one. Due to that, it is capable 50 + of correcting more errors than on single mode. 51 + 52 + * Single-channel 53 + 54 + The data accessed by the memory controller is contained into one dimm 55 + only. E. g. if the data is 64 bits-wide, the data flows to the CPU using 56 + one 64 bits parallel access. Typically used with SDR, DDR, DDR2 and DDR3 57 + memories. FB-DIMM and RAMBUS use a different concept for channel, so 58 + this concept doesn't apply there. 59 + 60 + * Double-channel 61 + 62 + The data size accessed by the memory controller is interlaced into two 63 + dimms, accessed at the same time. E. g. if the DIMM is 64 bits-wide (72 64 + bits with ECC), the data flows to the CPU using a 128 bits parallel 65 + access. 66 + 67 + * Chip-select row 68 + 69 + This is the name of the DRAM signal used to select the DRAM ranks to be 70 + accessed. Common chip-select rows for single channel are 64 bits, for 71 + dual channel 128 bits. It may not be visible by the memory controller, 72 + as some DIMM types have a memory buffer that can hide direct access to 73 + it from the Memory Controller. 74 + 75 + * Single-Ranked stick 76 + 77 + A Single-ranked stick has 1 chip-select row of memory. Motherboards 78 + commonly drive two chip-select pins to a memory stick. A single-ranked 79 + stick, will occupy only one of those rows. The other will be unused. 80 + 81 + .. _doubleranked: 82 + 83 + * Double-Ranked stick 84 + 85 + A double-ranked stick has two chip-select rows which access different 86 + sets of memory devices. The two rows cannot be accessed concurrently. 87 + 88 + * Double-sided stick 89 + 90 + **DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`. 
91 + 92 + A double-sided stick has two chip-select rows which access different sets 93 + of memory devices. The two rows cannot be accessed concurrently. 94 + "Double-sided" is irrespective of the memory devices being mounted on 95 + both sides of the memory stick. 96 + 97 + * Socket set 98 + 99 + All of the memory sticks that are required for a single memory access or 100 + all of the memory sticks spanned by a chip-select row. A single socket 101 + set has two chip-select rows and if double-sided sticks are used these 102 + will occupy those chip-select rows. 103 + 104 + * Bank 105 + 106 + This term is avoided because it is unclear when needing to distinguish 107 + between chip-select rows and socket sets. 108 + 109 + 110 + Memory Controllers 111 + ------------------ 112 + 113 + Most of the EDAC core is focused on doing Memory Controller error detection. 114 + Memory controllers are handled by calling :c:func:`edac_mc_alloc`, which 115 + internally uses the struct ``mem_ctl_info`` to describe them. This is an 116 + opaque struct for the EDAC drivers: only the EDAC core is allowed to touch it. 117 + 118 + .. kernel-doc:: include/linux/edac.h 119 + 120 + .. kernel-doc:: drivers/edac/edac_mc.h 121 + 122 + PCI Controllers 123 + --------------- 124 + 125 + The EDAC subsystem provides a mechanism to handle PCI controllers by calling 126 + the :c:func:`edac_pci_alloc_ctl_info`. It will use the struct 127 + :c:type:`edac_pci_ctl_info` to describe the PCI controllers. 128 + 129 + .. kernel-doc:: drivers/edac/edac_pci.h 130 + 131 + EDAC Blocks 132 + ----------- 133 + 134 + The EDAC subsystem also provides a generic mechanism to report errors on 135 + other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info` function. 136 + 137 + The structures :c:type:`edac_dev_sysfs_block_attribute`, 138 + :c:type:`edac_device_block`, :c:type:`edac_device_instance` and 139 + :c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device' 140 + representation at sysfs. 
141 + 142 + This set of structures, and the code that implements their APIs, allows registering EDAC type devices which are NOT standard memory or 143 + PCI, like: 144 + 145 + - CPU caches (L1 and L2) 146 + - DMA engines 147 + - Core CPU switches 148 + - Fabric switch units 149 + - PCIe interface controllers 150 + - other EDAC/ECC type devices that can be monitored for 151 + errors, etc. 152 + 153 + It allows for a two-level hierarchy. 154 + 155 + For example, a cache could be composed of L1, L2 and L3 levels of cache. 156 + Each CPU core would have its own L1 cache, while sharing L2 and maybe L3 157 + caches. In such a case, those can be represented via the following sysfs 158 + nodes:: 159 + 160 + /sys/devices/system/edac/.. 161 + 162 + pci/ <existing pci directory (if available)> 163 + mc/ <existing memory device directory> 164 + cpu/cpu0/.. <L1 and L2 block directory> 165 + /L1-cache/ce_count 166 + /ue_count 167 + /L2-cache/ce_count 168 + /ue_count 169 + cpu/cpu1/.. <L1 and L2 block directory> 170 + /L1-cache/ce_count 171 + /ue_count 172 + /L2-cache/ce_count 173 + /ue_count 174 + ... 175 + 176 + The L1 and L2 directories would be "edac_device_block's" 177 + 178 + .. kernel-doc:: drivers/edac/edac_device.h
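The instance/block layout shown in the sysfs tree above can be walked from userspace with a simple glob. The following is a rough sketch under stated assumptions: the function name `edac_device_counts` is hypothetical, and the example assumes the `cpu/cpuN/<block>/ce_count` and `ue_count` layout from the illustration, which a given driver may or may not provide.

```shell
#!/bin/sh
# Sketch: print the CE/UE counters of every block below an edac_device
# directory, e.g. /sys/devices/system/edac/cpu (layout assumed from the
# example tree above: <device>/<instance>/<block>/{ce_count,ue_count}).
edac_device_counts() {
    dev_dir="$1"

    for f in "$dev_dir"/*/*/ce_count; do
        # Skip the unexpanded pattern when nothing matches.
        [ -r "$f" ] || continue

        block_dir=$(dirname "$f")
        ce=$(cat "$f")
        # Print "?" if the block has no readable ue_count file.
        ue=$(cat "$block_dir/ue_count" 2>/dev/null || echo "?")

        # Print the path relative to the device directory.
        echo "${block_dir#"$dev_dir"/}: ce=$ce ue=$ue"
    done
}
```

For the tree above, `edac_device_counts /sys/devices/system/edac/cpu` would print one line per `cpuN/L1-cache` and `cpuN/L2-cache` block; the two-star glob deliberately matches only the block level of the hierarchy.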

+1

Documentation/driver-api/index.rst

··· 26 26 spi 27 27 i2c 28 28 hsi 29 + edac 29 30 miscellaneous 30 31 vme 31 32 80211/index

-812

Documentation/edac.txt

··· 1 - EDAC - Error Detection And Correction 2 - ===================================== 3 - 4 - "bluesmoke" was the name for this device driver when it 5 - was "out-of-tree" and maintained at sourceforge.net - 6 - bluesmoke.sourceforge.net. That site is mostly archaic now and can be 7 - used only for historical purposes. 8 - 9 - When the subsystem was pushed into 2.6.16 for the first time, it was 10 - renamed to 'EDAC'. 11 - 12 - PURPOSE 13 - ------- 14 - 15 - The 'edac' kernel module's goal is to detect and report hardware errors 16 - that occur within the computer system running under linux. 17 - 18 - MEMORY 19 - ------ 20 - 21 - Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the 22 - primary errors being harvested. These types of errors are harvested by 23 - the 'edac_mc' device. 24 - 25 - Detecting CE events, then harvesting those events and reporting them, 26 - *can* but must not necessarily be a predictor of future UE events. With 27 - CE events only, the system can and will continue to operate as no data 28 - has been damaged yet. 29 - 30 - However, preventive maintenance and proactive part replacement of memory 31 - DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events 32 - and system panics. 33 - 34 - OTHER HARDWARE ELEMENTS 35 - ----------------------- 36 - 37 - A new feature for EDAC, the edac_device class of device, was added in 38 - the 2.6.23 version of the kernel. 39 - 40 - This new device type allows for non-memory type of ECC hardware detectors 41 - to have their states harvested and presented to userspace via the sysfs 42 - interface. 43 - 44 - Some architectures have ECC detectors for L1, L2 and L3 caches, 45 - along with DMA engines, fabric switches, main data path switches, 46 - interconnections, and various other hardware data paths. If the hardware 47 - reports it, then a edac_device device probably can be constructed to 48 - harvest and present that to userspace. 
49 - 50 - 51 - PCI BUS SCANNING 52 - ---------------- 53 - 54 - In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors 55 - in order to determine if errors are occurring during data transfers. 56 - 57 - The presence of PCI Parity errors must be examined with a grain of salt. 58 - There are several add-in adapters that do *not* follow the PCI specification 59 - with regards to Parity generation and reporting. The specification says 60 - the vendor should tie the parity status bits to 0 if they do not intend 61 - to generate parity. Some vendors do not do this, and thus the parity bit 62 - can "float" giving false positives. 63 - 64 - There is a PCI device attribute located in sysfs that is checked by 65 - the EDAC PCI scanning code. If that attribute is set, PCI parity/error 66 - scanning is skipped for that device. The attribute is: 67 - 68 - broken_parity_status 69 - 70 - and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for 71 - PCI devices. 72 - 73 - 74 - VERSIONING 75 - ---------- 76 - 77 - EDAC is composed of a "core" module (edac_core.ko) and several Memory 78 - Controller (MC) driver modules. On a given system, the CORE is loaded 79 - and one MC driver will be loaded. Both the CORE and the MC driver (or 80 - edac_device driver) have individual versions that reflect current 81 - release level of their respective modules. 82 - 83 - Thus, to "report" on what version a system is running, one must report 84 - both the CORE's and the MC driver's versions. 85 - 86 - 87 - LOADING 88 - ------- 89 - 90 - If 'edac' was statically linked with the kernel then no loading 91 - is necessary. If 'edac' was built as modules then simply modprobe 92 - the 'edac' pieces that you need. You should be able to modprobe 93 - hardware-specific modules and have the dependencies load the necessary 94 - core modules. 
95 - 96 - Example: 97 - 98 - $> modprobe amd76x_edac 99 - 100 - loads both the amd76x_edac.ko memory controller module and the edac_mc.ko 101 - core module. 102 - 103 - 104 - SYSFS INTERFACE 105 - --------------- 106 - 107 - EDAC presents a 'sysfs' interface for control and reporting purposes. It 108 - lives in the /sys/devices/system/edac directory. 109 - 110 - Within this directory there currently reside 2 components: 111 - 112 - mc memory controller(s) system 113 - pci PCI control and status system 114 - 115 - 116 - 117 - Memory Controller (mc) Model 118 - ---------------------------- 119 - 120 - Each 'mc' device controls a set of DIMM memory modules. These modules 121 - are laid out in a Chip-Select Row (csrowX) and Channel table (chX). 122 - There can be multiple csrows and multiple channels. 123 - 124 - Memory controllers allow for several csrows, with 8 csrows being a 125 - typical value. Yet, the actual number of csrows depends on the layout of 126 - a given motherboard, memory controller and DIMM characteristics. 127 - 128 - Dual channels allows for 128 bit data transfers to/from the CPU from/to 129 - memory. Some newer chipsets allow for more than 2 channels, like Fully 130 - Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels: 131 - 132 - 133 - Channel 0 Channel 1 134 - =================================== 135 - csrow0 | DIMM_A0 | DIMM_B0 | 136 - csrow1 | DIMM_A0 | DIMM_B0 | 137 - =================================== 138 - 139 - =================================== 140 - csrow2 | DIMM_A1 | DIMM_B1 | 141 - csrow3 | DIMM_A1 | DIMM_B1 | 142 - =================================== 143 - 144 - In the above example table there are 4 physical slots on the motherboard 145 - for memory DIMMs: 146 - 147 - DIMM_A0 148 - DIMM_B0 149 - DIMM_A1 150 - DIMM_B1 151 - 152 - Labels for these slots are usually silk-screened on the motherboard. 153 - Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are 154 - channel 1. 
Notice that there are two csrows possible on a physical DIMM. 155 - These csrows are allocated their csrow assignment based on the slot into 156 - which the memory DIMM is placed. Thus, when 1 DIMM is placed in each 157 - Channel, the csrows cross both DIMMs. 158 - 159 - Memory DIMMs come single or dual "ranked". A rank is a populated csrow. 160 - Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above 161 - will have 1 csrow, csrow0. csrow1 will be empty. On the other hand, 162 - when 2 dual ranked DIMMs are similarly placed, then both csrow0 and 163 - csrow1 will be populated. The pattern repeats itself for csrow2 and 164 - csrow3. 165 - 166 - The representation of the above is reflected in the directory 167 - tree in EDAC's sysfs interface. Starting in directory 168 - /sys/devices/system/edac/mc each memory controller will be represented 169 - by its own 'mcX' directory, where 'X' is the index of the MC. 170 - 171 - 172 - ..../edac/mc/ 173 - | 174 - |->mc0 175 - |->mc1 176 - |->mc2 177 - .... 178 - 179 - Under each 'mcX' directory each 'csrowX' is again represented by a 180 - 'csrowX', where 'X' is the csrow index: 181 - 182 - 183 - .../mc/mc0/ 184 - | 185 - |->csrow0 186 - |->csrow2 187 - |->csrow3 188 - .... 189 - 190 - Notice that there is no csrow1, which indicates that csrow0 is composed 191 - of a single ranked DIMMs. This should also apply in both Channels, in 192 - order to have dual-channel mode be operational. Since both csrow2 and 193 - csrow3 are populated, this indicates a dual ranked set of DIMMs for 194 - channels 0 and 1. 195 - 196 - 197 - Within each of the 'mcX' and 'csrowX' directories are several EDAC 198 - control and attribute files. 199 - 200 - 201 - 'mcX' directories 202 - ----------------- 203 - 204 - In 'mcX' directories are EDAC control and attribute files for 205 - this 'X' instance of the memory controllers. 
-
-For a description of the sysfs API, please see:
-	Documentation/ABI/testing/sysfs-devices-edac
-
-
-
-'csrowX' directories
---------------------
-
-When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX
-directories. As this API doesn't work properly for Rambus, FB-DIMMs and
-modern Intel Memory Controllers, this is being deprecated in favor of
-dimmX directories.
-
-In the 'csrowX' directories are EDAC control and attribute files for
-this 'X' instance of csrow:
-
-
-Total Uncorrectable Errors count attribute file:
-
-	'ue_count'
-
-	This attribute file displays the total count of uncorrectable
-	errors that have occurred on this csrow. If panic_on_ue is set
-	this counter will not have a chance to increment, since EDAC
-	will panic the system.
-
-
-Total Correctable Errors count attribute file:
-
-	'ce_count'
-
-	This attribute file displays the total count of correctable
-	errors that have occurred on this csrow. This count is very
-	important to examine. CEs provide early indications that a
-	DIMM is beginning to fail. This count field should be
-	monitored for non-zero values and report such information
-	to the system administrator.
-
-
-Total memory managed by this csrow attribute file:
-
-	'size_mb'
-
-	This attribute file displays, in count of megabytes, the memory
-	that this csrow contains.
-
-
-Memory Type attribute file:
-
-	'mem_type'
-
-	This attribute file will display what type of memory is currently
-	on this csrow. Normally, either buffered or unbuffered memory.
-	Examples:
-		Registered-DDR
-		Unbuffered-DDR
-
-
-EDAC Mode of operation attribute file:
-
-	'edac_mode'
-
-	This attribute file will display what type of Error detection
-	and correction is being utilized.
-
-
-Device type attribute file:
-
-	'dev_type'
-
-	This attribute file will display what type of DRAM device is
-	being utilized on this DIMM.
-	Examples:
-		x1
-		x2
-		x4
-		x8
-
-
-Channel 0 CE Count attribute file:
-
-	'ch0_ce_count'
-
-	This attribute file will display the count of CEs on this
-	DIMM located in channel 0.
-
-
-Channel 0 UE Count attribute file:
-
-	'ch0_ue_count'
-
-	This attribute file will display the count of UEs on this
-	DIMM located in channel 0.
-
-
-Channel 0 DIMM Label control file:
-
-	'ch0_dimm_label'
-
-	This control file allows this DIMM to have a label assigned
-	to it. With this label in the module, when errors occur
-	the output can provide the DIMM label in the system log.
-	This becomes vital for panic events to isolate the
-	cause of the UE event.
-
-	DIMM Labels must be assigned after booting, with information
-	that correctly identifies the physical slot with its
-	silk screen label. This information is currently very
-	motherboard specific and determination of this information
-	must occur in userland at this time.
-
-
-Channel 1 CE Count attribute file:
-
-	'ch1_ce_count'
-
-	This attribute file will display the count of CEs on this
-	DIMM located in channel 1.
-
-
-Channel 1 UE Count attribute file:
-
-	'ch1_ue_count'
-
-	This attribute file will display the count of UEs on this
-	DIMM located in channel 0.
-
-
-Channel 1 DIMM Label control file:
-
-	'ch1_dimm_label'
-
-	This control file allows this DIMM to have a label assigned
-	to it. With this label in the module, when errors occur
-	the output can provide the DIMM label in the system log.
-	This becomes vital for panic events to isolate the
-	cause of the UE event.
-
-	DIMM Labels must be assigned after booting, with information
-	that correctly identifies the physical slot with its
-	silk screen label. This information is currently very
-	motherboard specific and determination of this information
-	must occur in userland at this time.
-
-
-
-SYSTEM LOGGING
---------------
-
-If logging for UEs and CEs is enabled, then system logs will contain
-information indicating that errors have been detected:
-
-EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
-channel 1 "DIMM_B1": amd76x_edac
-
-EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
-channel 1 "DIMM_B1": amd76x_edac
-
-
-The structure of the message is:
-	the memory controller (MC0)
-	Error type (CE)
-	memory page (0x283)
-	offset in the page (0xce0)
-	the byte granularity (grain 8)
-		or resolution of the error
-	the error syndrome (0xb741)
-	memory row (row 0)
-	memory channel (channel 1)
-	DIMM label, if set prior (DIMM B1
-	and then an optional, driver-specific message that may
-		have additional information.
-
-Both UEs and CEs with no info will lack all but memory controller, error
-type, a notice of "no info" and then an optional, driver-specific error
-message.
-
-
-PCI Bus Parity Detection
-------------------------
-
-On Header Type 00 devices, the primary status is looked at for any
-parity error regardless of whether parity is enabled on the device or
-not. (The spec indicates parity is generated in some cases). On Header
-Type 01 bridges, the secondary status register is also looked at to see
-if parity occurred on the bus on the other side of the bridge.
-
-
-SYSFS CONFIGURATION
--------------------
-
-Under /sys/devices/system/edac/pci are control and attribute files as follows:
-
-
-Enable/Disable PCI Parity checking control file:
-
-	'check_pci_parity'
-
-
-	This control file enables or disables the PCI Bus Parity scanning
-	operation. Writing a 1 to this file enables the scanning. Writing
-	a 0 to this file disables the scanning.
-
-	Enable:
-	echo "1" >/sys/devices/system/edac/pci/check_pci_parity
-
-	Disable:
-	echo "0" >/sys/devices/system/edac/pci/check_pci_parity
-
-
-Parity Count:
-
-	'pci_parity_count'
-
-	This attribute file will display the number of parity errors that
-	have been detected.
-
-
-
-MODULE PARAMETERS
------------------
-
-Panic on UE control file:
-
-	'edac_mc_panic_on_ue'
-
-	An uncorrectable error will cause a machine panic. This is usually
-	desirable. It is a bad idea to continue when an uncorrectable error
-	occurs - it is indeterminate what was uncorrected and the operating
-	system context might be so mangled that continuing will lead to further
-	corruption. If the kernel has MCE configured, then EDAC will never
-	notice the UE.
-
-	LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1]
-
-	RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
-
-
-Log UE control file:
-
-	'edac_mc_log_ue'
-
-	Generate kernel messages describing uncorrectable errors. These errors
-	are reported through the system message log system. UE statistics
-	will be accumulated even when UE logging is disabled.
-
-	LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1]
-
-	RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
-
-
-Log CE control file:
-
-	'edac_mc_log_ce'
-
-	Generate kernel messages describing correctable errors. These
-	errors are reported through the system message log system.
-	CE statistics will be accumulated even when CE logging is disabled.
-
-	LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1]
-
-	RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
-
-
-Polling period control file:
-
-	'edac_mc_poll_msec'
-
-	The time period, in milliseconds, for polling for error information.
-	Too small a value wastes resources. Too large a value might delay
-	necessary handling of errors and might loose valuable information for
-	locating the error. 1000 milliseconds (once each second) is the current
-	default. Systems which require all the bandwidth they can get, may
-	increase this.
-
-	LOAD TIME: module/kernel parameter: edac_mc_poll_msec=[0|1]
-
-	RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
-
-
-Panic on PCI PARITY Error:
-
-	'panic_on_pci_parity'
-
-
-	This control file enables or disables panicking when a parity
-	error has been detected.
-
-
-	module/kernel parameter: edac_panic_on_pci_pe=[0|1]
-
-	Enable:
-	echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
-
-	Disable:
-	echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
-
-
-
-EDAC device type
-----------------
-
-In the header file, edac_core.h, there is a series of edac_device structures
-and APIs for the EDAC_DEVICE.
-
-User space access to an edac_device is through the sysfs interface.
-
-At the location /sys/devices/system/edac (sysfs) new edac_device devices will
-appear.
-
-There is a three level tree beneath the above 'edac' directory. For example,
-the 'test_device_edac' device (found at the bluesmoke.sourceforget.net website)
-installs itself as:
-
-	/sys/devices/systm/edac/test-instance
-
-in this directory are various controls, a symlink and one or more 'instance'
-directories.
-
-The standard default controls are:
-
-	log_ce		boolean to log CE events
-	log_ue		boolean to log UE events
-	panic_on_ue	boolean to 'panic' the system if an UE is encountered
-			(default off, can be set true via startup script)
-	poll_msec	time period between POLL cycles for events
-
-The test_device_edac device adds at least one of its own custom control:
-
-	test_bits	which in the current test driver does nothing but
-			show how it is installed. A ported driver can
-			add one or more such controls and/or attributes
-			for specific uses.
-			One out-of-tree driver uses controls here to allow
-			for ERROR INJECTION operations to hardware
-			injection registers
-
-The symlink points to the 'struct dev' that is registered for this edac_device.
-
-INSTANCES
----------
-
-One or more instance directories are present. For the 'test_device_edac' case:
-
-	test-instance0
-
-
-In this directory there are two default counter attributes, which are totals of
-counter in deeper subdirectories.
-
-	ce_count	total of CE events of subdirectories
-	ue_count	total of UE events of subdirectories
-
-BLOCKS
-------
-
-At the lowest directory level is the 'block' directory. There can be 0, 1
-or more blocks specified in each instance.
-
-	test-block0
-
-
-In this directory the default attributes are:
-
-	ce_count	which is counter of CE events for this 'block'
-			of hardware being monitored
-	ue_count	which is counter of UE events for this 'block'
-			of hardware being monitored
-
-
-The 'test_device_edac' device adds 4 attributes and 1 control:
-
-	test-block-bits-0	for every POLL cycle this counter
-				is incremented
-	test-block-bits-1	every 10 cycles, this counter is bumped once,
-				and test-block-bits-0 is set to 0
-	test-block-bits-2	every 100 cycles, this counter is bumped once,
-				and test-block-bits-1 is set to 0
-	test-block-bits-3	every 1000 cycles, this counter is bumped once,
-				and test-block-bits-2 is set to 0
-
-
-	reset-counters		writing ANY thing to this control will
-				reset all the above counters.
-
-
-Use of the 'test_device_edac' driver should enable any others to create their own
-unique drivers for their hardware systems.
-
-The 'test_device_edac' sample driver is located at the
-bluesmoke.sourceforge.net project site for EDAC.
-
-
-NEHALEM USAGE OF EDAC APIs
---------------------------
-
-This chapter documents some EXPERIMENTAL mappings for EDAC API to handle
-Nehalem EDAC driver. They will likely be changed on future versions
-of the driver.
-
-Due to the way Nehalem exports Memory Controller data, some adjustments
-were done at i7core_edac driver. This chapter will cover those differences
-
-1) On Nehalem, there is one Memory Controller per Quick Patch Interconnect
-   (QPI). At the driver, the term "socket" means one QPI. This is
-   associated with a physical CPU socket.
-
-   Each MC have 3 physical read channels, 3 physical write channels and
-   3 logic channels. The driver currently sees it as just 3 channels.
-   Each channel can have up to 3 DIMMs.
-
-   The minimum known unity is DIMMs. There are no information about csrows.
-   As EDAC API maps the minimum unity is csrows, the driver sequentially
-   maps channel/dimm into different csrows.
-
-   For example, supposing the following layout:
-	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
-	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
-	  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
-	Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
-	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
-	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
-	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
-   The driver will map it as:
-	csrow0: channel 0, dimm0
-	csrow1: channel 0, dimm1
-	csrow2: channel 1, dimm0
-	csrow3: channel 2, dimm0
-
-exports one
-DIMM per csrow.
-
-   Each QPI is exported as a different memory controller.
-
-2) Nehalem MC has the ability to generate errors. The driver implements this
-   functionality via some error injection nodes:
-
-   For injecting a memory error, there are some sysfs nodes, under
-   /sys/devices/system/edac/mc/mc?/:
-
-   inject_addrmatch/*:
-	Controls the error injection mask register. It is possible to specify
-	several characteristics of the address to match an error code:
-	   dimm = the affected dimm. Numbers are relative to a channel;
-	   rank = the memory rank;
-	   channel = the channel that will generate an error;
-	   bank = the affected bank;
-	   page = the page address;
-	   column (or col) = the address column.
-	each of the above values can be set to "any" to match any valid value.
-
-	At driver init, all values are set to any.
-
-	For example, to generate an error at rank 1 of dimm 2, for any channel,
-	any bank, any page, any column:
-		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
-		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
-
-	To return to the default behaviour of matching any, you can do:
-		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
-		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
-
-   inject_eccmask:
-       specifies what bits will have troubles,
-
-   inject_section:
-       specifies what ECC cache section will get the error:
-		3 for both
-		2 for the highest
-		1 for the lowest
-
-   inject_type:
-       specifies the type of error, being a combination of the following bits:
-		bit 0 - repeat
-		bit 1 - ecc
-		bit 2 - parity
-
-   inject_enable starts the error generation when something different
-   than 0 is written.
-
-   All inject vars can be read. root permission is needed for write.
-
-   Datasheet states that the error will only be generated after a write on an
-   address that matches inject_addrmatch. It seems, however, that reading will
-   also produce an error.
-
-   For example, the following code will generate an error for any write access
-   at socket 0, on any DIMM/address on channel 2:
-
-   echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
-   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
-   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
-   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
-   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
-   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
-
-   For socket 1, it is needed to replace "mc0" by "mc1" at the above
-   commands.
-
-   The generated error message will look like:
-
-   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
-
-3) Nehalem specific Corrected Error memory counters
-
-   Nehalem have some registers to count memory errors. The driver uses those
-   registers to report Corrected Errors on devices with Registered Dimms.
-
-   However, those counters don't work with Unregistered Dimms. As the chipset
-   offers some counters that also work with UDIMMS (but with a worse level of
-   granularity than the default ones), the driver exposes those registers for
-   UDIMM memories.
-
-   They can be read by looking at the contents of all_channel_counts/
-
-   $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
-	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
-	0
-	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
-	0
-	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
-	0
-
-   What happens here is that errors on different csrows, but at the same
-   dimm number will increment the same counter.
-   So, in this memory mapping:
-	csrow0: channel 0, dimm0
-	csrow1: channel 0, dimm1
-	csrow2: channel 1, dimm0
-	csrow3: channel 2, dimm0
-   The hardware will increment udimm0 for an error at the first dimm at either
-	csrow0, csrow2 or csrow3;
-   The hardware will increment udimm1 for an error at the second dimm at either
-	csrow0, csrow2 or csrow3;
-   The hardware will increment udimm2 for an error at the third dimm at either
-	csrow0, csrow2 or csrow3;
-
-4) Standard error counters
-
-   The standard error counters are generated when an mcelog error is received
-   by the driver. Since, with udimm, this is counted by software, it is
-   possible that some errors could be lost. With rdimm's, they display the
-   contents of the registers
-
-AMD64_EDAC REFERENCE DOCUMENTS USED
------------------------------------
-amd64_edac module is based on the following documents
-(available from http://support.amd.com/en-us/search/tech-docs):
-
-1. Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
-   Opteron Processors
-   AMD publication #: 26094
-   Revision: 3.26
-   Link: http://support.amd.com/TechDocs/26094.PDF
-
-2. Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
-   Processors
-   AMD publication #: 32559
-   Revision: 3.00
-   Issue Date: May 2006
-   Link: http://support.amd.com/TechDocs/32559.pdf
-
-3. Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
-   Processors
-   AMD publication #: 31116
-   Revision: 3.00
-   Issue Date: September 07, 2007
-   Link: http://support.amd.com/TechDocs/31116.pdf
-
-4. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
-   Models 30h-3Fh Processors
-   AMD publication #: 49125
-   Revision: 3.06
-   Issue Date: 2/12/2015 (latest release)
-   Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
-
-5. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
-   Models 60h-6Fh Processors
-   AMD publication #: 50742
-   Revision: 3.01
-   Issue Date: 7/23/2015 (latest release)
-   Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
-
-6. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
-   Models 00h-0Fh Processors
-   AMD publication #: 48751
-   Revision: 3.03
-   Issue Date: 2/23/2015 (latest release)
-   Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf
-
-CREDITS:
-========
-
-Written by Doug Thompson <dougthompson@xmission.com>
-7 Dec 2005
-17 Jul 2007	Updated
-
-(c) Mauro Carvalho Chehab
-05 Aug 2009	Nehalem interface
-
-EDAC authors/maintainers:
-
-	Doug Thompson, Dave Jiang, Dave Peterson et al,
-	Mauro Carvalho Chehab
-	Borislav Petkov
-	original author: Thayne Harbaugh
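
Reviewer note: the removed SYSTEM LOGGING text lists, in order, the fields of an EDAC memory-controller log line (controller, error type, page, offset, grain, syndrome, row, channel, optional DIMM label). As an illustration only, a user-space parser for that layout might look like the sketch below. The struct and function names are hypothetical and are not part of the kernel or of any EDAC tool, and the format string assumes the exact layout of the two sample lines quoted in the removed text.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one EDAC memory-controller log line, in the order given by
 * the removed SYSTEM LOGGING section. Hypothetical names throughout. */
struct edac_log_entry {
	int mc;                 /* memory controller index (MC0 -> 0) */
	char type[3];           /* "CE" or "UE" */
	unsigned long page;     /* memory page */
	unsigned long offset;   /* offset in the page */
	unsigned int grain;     /* byte granularity of the error */
	unsigned long syndrome; /* error syndrome */
	int row;                /* memory row */
	int channel;            /* memory channel */
	char label[32];         /* DIMM label; empty if none was set */
};

/* Returns 0 when all fields up to 'channel' were parsed, -1 otherwise.
 * The quoted DIMM label is treated as optional. */
int parse_edac_log_line(const char *line, struct edac_log_entry *e)
{
	int n;

	memset(e, 0, sizeof(*e));
	/* %lx accepts an optional "0x" prefix; a space in the format
	 * matches any run of whitespace, including a wrapped line. */
	n = sscanf(line,
		   "EDAC MC%d: %2s page %lx, offset %lx, grain %u, "
		   "syndrome %lx, row %d, channel %d \"%31[^\"]\"",
		   &e->mc, e->type, &e->page, &e->offset, &e->grain,
		   &e->syndrome, &e->row, &e->channel, e->label);
	return n >= 8 ? 0 : -1;
}
```

Lines reported with "no info" carry fewer fields, as the removed text notes, so a parser like this would reject them and they would need separate handling.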

+2 -1

MAINTAINERS

···
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git for-next
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git linux_next
 S: Supported
-F: Documentation/edac.txt
+F: Documentation/admin-guide/ras.rst
+F: Documentation/driver-api/edac.rst
 F: drivers/edac/
 F: include/linux/edac.h

-1

drivers/edac/altera_edac.c

···
 #include <linux/uaccess.h>

 #include "altera_edac.h"
-#include "edac_core.h"
 #include "edac_module.h"

 #define EDAC_MOD_STR "altera_edac"

+1 -1

drivers/edac/amd64_edac.h

···
 #include <linux/mmzone.h>
 #include <linux/edac.h>
 #include <asm/msr.h>
-#include "edac_core.h"
+#include "edac_module.h"
 #include "mce_amd.h"

 #define amd64_debug(fmt, arg...) \

+1 -1

drivers/edac/amd76x_edac.c

···
 #include <linux/pci.h>
 #include <linux/pci_ids.h>
 #include <linux/edac.h>
-#include "edac_core.h"
+#include "edac_module.h"

 #define AMD76X_REVISION " Ver: 2.0.2"
 #define EDAC_MOD_STR "amd76x_edac"

-1

drivers/edac/amd8111_edac.c

···
 #include <linux/pci_ids.h>
 #include <asm/io.h>

-#include "edac_core.h"
 #include "edac_module.h"
 #include "amd8111_edac.h"

-1

drivers/edac/amd8131_edac.c

···
 #include <linux/edac.h>
 #include <linux/pci_ids.h>

-#include "edac_core.h"
 #include "edac_module.h"
 #include "amd8131_edac.h"

+1 -1

drivers/edac/cell_edac.c

···
 #include <asm/machdep.h>
 #include <asm/cell-regs.h>

-#include "edac_core.h"
+#include "edac_module.h"

 struct cell_edac_priv
 {

-1

drivers/edac/cpc925_edac.c

···
 #include <linux/platform_device.h>
 #include <linux/gfp.h>

-#include "edac_core.h"
 #include "edac_module.h"

 #define CPC925_EDAC_REVISION " Ver: 1.0.0"

+1 -1

drivers/edac/e752x_edac.c

···
 #include <linux/pci.h>
 #include <linux/pci_ids.h>
 #include <linux/edac.h>
-#include "edac_core.h"
+#include "edac_module.h"

 #define E752X_REVISION " Ver: 2.0.2"
 #define EDAC_MOD_STR "e752x_edac"

+1 -1

drivers/edac/e7xxx_edac.c

···
 #include <linux/pci.h>
 #include <linux/pci_ids.h>
 #include <linux/edac.h>
-#include "edac_core.h"
+#include "edac_module.h"

 #define E7XXX_REVISION " Ver: 2.0.2"
 #define EDAC_MOD_STR "e7xxx_edac"

+60 -257

drivers/edac/edac_core.h drivers/edac/edac_device.h

···
 /*
- * Defines, structures, APIs for edac_core module
+ * Defines, structures, APIs for edac_device
  *
  * (C) 2007 Linux Networx (http://lnxi.com)
  * This file may be distributed under the terms of the
···
  * Refactored for multi-source files:
  * Doug Thompson <norsk5@xmission.com>
  *
+ * Please look at Documentation/driver-api/edac.rst for more info about
+ * EDAC core structs and functions.
  */

-#ifndef _EDAC_CORE_H_
-#define _EDAC_CORE_H_
+#ifndef _EDAC_DEVICE_H_
+#define _EDAC_DEVICE_H_

-#include <linux/kernel.h>
-#include <linux/types.h>
-#include <linux/module.h>
-#include <linux/spinlock.h>
-#include <linux/smp.h>
-#include <linux/pci.h>
-#include <linux/time.h>
-#include <linux/nmi.h>
-#include <linux/rcupdate.h>
 #include <linux/completion.h>
-#include <linux/kobject.h>
-#include <linux/platform_device.h>
-#include <linux/workqueue.h>
+#include <linux/device.h>
 #include <linux/edac.h>
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/sysfs.h>
+#include <linux/workqueue.h>

-#define EDAC_DEVICE_NAME_LEN 31
-#define EDAC_ATTRIB_VALUE_LEN 15
-
-#if PAGE_SHIFT < 20
-#define PAGES_TO_MiB(pages) ((pages) >> (20 - PAGE_SHIFT))
-#define MiB_TO_PAGES(mb) ((mb) << (20 - PAGE_SHIFT))
-#else /* PAGE_SHIFT > 20 */
-#define PAGES_TO_MiB(pages) ((pages) << (PAGE_SHIFT - 20))
-#define MiB_TO_PAGES(mb) ((mb) >> (PAGE_SHIFT - 20))
-#endif
-
-#define edac_printk(level, prefix, fmt, arg...) \
-	printk(level "EDAC " prefix ": " fmt, ##arg)
-
-#define edac_mc_printk(mci, level, fmt, arg...) \
-	printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
-
-#define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \
-	printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
-
-#define edac_device_printk(ctl, level, fmt, arg...) \
-	printk(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg)
-
-#define edac_pci_printk(ctl, level, fmt, arg...) \
-	printk(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg)
-
-/* prefixes for edac_printk() and edac_mc_printk() */
-#define EDAC_MC "MC"
-#define EDAC_PCI "PCI"
-#define EDAC_DEBUG "DEBUG"
-
-extern const char * const edac_mem_types[];
-
-#ifdef CONFIG_EDAC_DEBUG
-extern int edac_debug_level;
-
-#define edac_dbg(level, fmt, ...) \
-do { \
-	if (level <= edac_debug_level) \
-		edac_printk(KERN_DEBUG, EDAC_DEBUG, \
-			    "%s: " fmt, __func__, ##__VA_ARGS__); \
-} while (0)
-
-#else /* !CONFIG_EDAC_DEBUG */
-
-#define edac_dbg(level, fmt, ...) \
-do { \
-	if (0) \
-		edac_printk(KERN_DEBUG, EDAC_DEBUG, \
-			    "%s: " fmt, __func__, ##__VA_ARGS__); \
-} while (0)
-
-#endif /* !CONFIG_EDAC_DEBUG */
-
-#define PCI_VEND_DEV(vend, dev) PCI_VENDOR_ID_ ## vend, \
-	PCI_DEVICE_ID_ ## vend ## _ ## dev
-
-#define edac_dev_name(dev) (dev)->dev_name
-
-#define to_mci(k) container_of(k, struct mem_ctl_info, dev)

 /*
  * The following are the structures to provide for a generic
···

 extern void edac_device_free_ctl_info(struct edac_device_ctl_info *ctl_info);

-#ifdef CONFIG_PCI
-
-struct edac_pci_counter {
-	atomic_t pe_count;
-	atomic_t npe_count;
-};
-
-/*
- * Abstract edac_pci control info structure
+/**
+ * edac_device_add_device: Insert the 'edac_dev' structure into the
+ * edac_device global list and create sysfs entries associated with
+ * edac_device structure.
  *
- */
-struct edac_pci_ctl_info {
-	/* for global list of edac_pci_ctl_info structs */
-	struct list_head link;
-
-	int pci_idx;
-
-	struct bus_type *edac_subsys; /* pointer to subsystem */
-
-	/* the internal state of this controller instance */
-	int op_state;
-	/* work struct for this instance */
-	struct delayed_work work;
-
-	/* pointer to edac polling checking routine:
-	 * If NOT NULL: points to polling check routine
-	 * If NULL: Then assumes INTERRUPT operation, where
-	 * MC driver will receive events
-	 */
-	void (*edac_check) (struct edac_pci_ctl_info * edac_dev);
-
-	struct device *dev; /* pointer to device structure */
-
-	const char *mod_name; /* module name */
-	const char *ctl_name; /* edac controller name */
-	const char *dev_name; /* pci/platform/etc... name */
-
-	void *pvt_info; /* pointer to 'private driver' info */
-
-	unsigned long start_time; /* edac_pci load start time (jiffies) */
-
-	struct completion complete;
-
-	/* sysfs top name under 'edac' directory
-	 * and instance name:
-	 * cpu/cpu0/...
-	 * cpu/cpu1/...
-	 * cpu/cpu2/...
-	 * ...
-	 */
-	char name[EDAC_DEVICE_NAME_LEN + 1];
-
-	/* Event counters for the this whole EDAC Device */
-	struct edac_pci_counter counters;
-
-	/* edac sysfs device control for the 'name'
-	 * device this structure controls
-	 */
-	struct kobject kobj;
-	struct completion kobj_complete;
-};
-
-#define to_edac_pci_ctl_work(w) \
-	container_of(w, struct edac_pci_ctl_info,work)
-
-/* write all or some bits in a byte-register*/
-static inline void pci_write_bits8(struct pci_dev *pdev, int offset, u8 value,
-				   u8 mask)
-{
-	if (mask != 0xff) {
-		u8 buf;
-
-		pci_read_config_byte(pdev, offset, &buf);
-		value &= mask;
-		buf &= ~mask;
-		value |= buf;
-	}
-
-	pci_write_config_byte(pdev, offset, value);
-}
-
-/* write all or some bits in a word-register*/
-static inline void pci_write_bits16(struct pci_dev *pdev, int offset,
-				    u16 value, u16 mask)
-{
-	if (mask != 0xffff) {
-		u16 buf;
-
-		pci_read_config_word(pdev, offset, &buf);
-		value &= mask;
-		buf &= ~mask;
-		value |= buf;
-	}
-
-	pci_write_config_word(pdev, offset, value);
-}
-
-/*
- * pci_write_bits32
+ * @edac_dev: pointer to edac_device structure to be added to the list
+ * 'edac_device' structure.
  *
- * edac local routine to do pci_write_config_dword, but adds
- * a mask parameter. If mask is all ones, ignore the mask.
- * Otherwise utilize the mask to isolate specified bits
- *
- * write all or some bits in a dword-register
- */
-static inline void pci_write_bits32(struct pci_dev *pdev, int offset,
-				    u32 value, u32 mask)
-{
-	if (mask != 0xffffffff) {
-		u32 buf;
-
-		pci_read_config_dword(pdev, offset, &buf);
-		value &= mask;
-		buf &= ~mask;
-		value |= buf;
-	}
-
-	pci_write_config_dword(pdev, offset, value);
-}
-
-#endif /* CONFIG_PCI */
-
-struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
-				   unsigned n_layers,
-				   struct edac_mc_layer *layers,
-				   unsigned sz_pvt);
-extern int edac_mc_add_mc_with_groups(struct mem_ctl_info *mci,
-				      const struct attribute_group **groups);
-#define edac_mc_add_mc(mci) edac_mc_add_mc_with_groups(mci, NULL)
-extern void edac_mc_free(struct mem_ctl_info *mci);
-extern struct mem_ctl_info *edac_mc_find(int idx);
-extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
-extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
-extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
-				      unsigned long page);
-
-void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			      struct mem_ctl_info *mci,
-			      struct edac_raw_error_desc *e);
-
-void edac_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  const u16 error_count,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const char *msg,
-			  const char *other_detail);
-
-/*
- * edac_device APIs
+ * Returns:
+ * 0 on Success, or an error code on failure
  */
 extern int edac_device_add_device(struct edac_device_ctl_info *edac_dev);
+
+/**
+ * edac_device_del_device:
+ * Remove sysfs entries for specified edac_device structure and
+ * then remove edac_device structure from global list
+ *
+ * @dev:
+ * Pointer to struct &device representing the edac device
+ * structure to remove.
+ *
+ * Returns:
+ * Pointer to removed edac_device structure,
+ * or %NULL if device not found.
+ */
 extern struct edac_device_ctl_info *edac_device_del_device(struct device *dev);
+
+/**
+ * edac_device_handle_ue():
+ * perform a common output and handling of an 'edac_dev' UE event
+ *
+ * @edac_dev: pointer to struct &edac_device_ctl_info
+ * @inst_nr: number of the instance where the UE error happened
+ * @block_nr: number of the block where the UE error happened
+ * @msg: message to be printed
+ */
 extern void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,
 				  int inst_nr, int block_nr, const char *msg);
+/**
+ * edac_device_handle_ce():
+ * perform a common output and handling of an 'edac_dev' CE event
+ *
+ * @edac_dev: pointer to struct &edac_device_ctl_info
+ * @inst_nr: number of the instance where the CE error happened
+ * @block_nr: number of the block where the CE error happened
+ * @msg: message to be printed
+ */
 extern void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
 				  int inst_nr, int block_nr, const char *msg);
+
+/**
+ * edac_device_alloc_index: Allocate a unique device index number
+ *
+ * Returns:
+ * allocated index number
+ */
 extern int edac_device_alloc_index(void);
 extern const char *edac_layer_name[];

-/*
- * edac_pci APIs
- */
-extern struct edac_pci_ctl_info *edac_pci_alloc_ctl_info(unsigned int sz_pvt,
-				const char *edac_pci_name);
-
-extern void edac_pci_free_ctl_info(struct edac_pci_ctl_info *pci);
-
-extern void edac_pci_reset_delay_period(struct edac_pci_ctl_info *pci,
-				unsigned long value);
-
-extern int edac_pci_alloc_index(void);
-extern int edac_pci_add_device(struct edac_pci_ctl_info *pci, int edac_idx);
-extern struct edac_pci_ctl_info *edac_pci_del_device(struct device *dev);
-
-extern struct edac_pci_ctl_info *edac_pci_create_generic_ctl(
-				struct device *dev,
-				const char *mod_name);
-
-extern void edac_pci_release_generic_ctl(struct edac_pci_ctl_info *pci);
-extern int edac_pci_create_sysfs(struct edac_pci_ctl_info *pci);
-extern void edac_pci_remove_sysfs(struct edac_pci_ctl_info *pci);
-
-/*
- * edac misc APIs
- */
-extern char *edac_op_state_to_string(int op_state);
-
-#endif /* _EDAC_CORE_H_ */
+#endif
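
Reviewer note: the pci_write_bits8/16/32 helpers removed from this header all follow the same read-modify-write pattern: when the mask is not all ones, bits selected by the mask come from the new value and every other bit keeps what was read back from the register. A user-space sketch of just that masking step, for the 8-bit case, might look like the following. The function name and the `old` parameter (standing in for the byte the kernel helper obtains via pci_read_config_byte()) are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-alone version of the masking step in the removed
 * pci_write_bits8(): returns the byte that would be written back. */
uint8_t merge_bits8(uint8_t old, uint8_t value, uint8_t mask)
{
	if (mask == 0xff)
		return value;	/* full mask: value is written unchanged */

	/* masked bits come from 'value', all other bits from 'old' */
	return (uint8_t)((value & mask) | (old & (uint8_t)~mask));
}
```

For example, merging value 0x0F under mask 0x0F into a register holding 0xA5 yields 0xAF: the low nibble is replaced and the high nibble is preserved.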

+12 -73

drivers/edac/edac_device.c

···
  * 19 Jan 2007
  */

-#include <linux/module.h>
-#include <linux/types.h>
-#include <linux/smp.h>
-#include <linux/init.h>
-#include <linux/sysctl.h>
-#include <linux/highmem.h>
-#include <linux/timer.h>
-#include <linux/slab.h>
-#include <linux/jiffies.h>
-#include <linux/spinlock.h>
-#include <linux/list.h>
-#include <linux/ctype.h>
-#include <linux/workqueue.h>
-#include <asm/uaccess.h>
 #include <asm/page.h>
+#include <asm/uaccess.h>
+#include <linux/ctype.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/spinlock.h>
+#include <linux/sysctl.h>
+#include <linux/timer.h>

-#include "edac_core.h"
+#include "edac_device.h"
 #include "edac_module.h"

 /* lock for the list: 'edac_device_list', manipulation of this list
···
 }
 #endif /* CONFIG_EDAC_DEBUG */

-
-/*
- * edac_device_alloc_ctl_info()
- * Allocate a new edac device control info structure
- *
- * The control structure is allocated in complete chunk
- * from the OS. It is in turn sub allocated to the
- * various objects that compose the structure
- *
- * The structure has a 'nr_instance' array within itself.
- * Each instance represents a major component
- * Example: L1 cache and L2 cache are 2 instance components
- *
- * Within each instance is an array of 'nr_blocks' blockoffsets
- */
 struct edac_device_ctl_info *edac_device_alloc_ctl_info(
 	unsigned sz_private,
 	char *edac_device_name, unsigned nr_instances,
···
 }
 EXPORT_SYMBOL_GPL(edac_device_alloc_ctl_info);

-/*
- * edac_device_free_ctl_info()
- * frees the memory allocated by the edac_device_alloc_ctl_info()
- * function
- */
 void edac_device_free_ctl_info(struct edac_device_ctl_info *ctl_info)
 {
 	edac_device_unregister_sysfs_main_kobj(ctl_info);
···
 	edac_mod_work(&edac_dev->work, jiffs);
 }

-/*
- * edac_device_alloc_index: Allocate a unique device index number
- *
- * Return:
- * allocated index number
- */
 int edac_device_alloc_index(void)
 {
 	static atomic_t device_indexes = ATOMIC_INIT(0);
···
 }
 EXPORT_SYMBOL_GPL(edac_device_alloc_index);

-/**
- * edac_device_add_device: Insert the 'edac_dev' structure into the
- * edac_device global list and create sysfs entries associated with
- * edac_device structure.
- * @edac_device: pointer to the edac_device structure to be added to the list
- * 'edac_device' structure.
- *
- * Return:
- * 0 Success
- * !0 Failure
- */
 int edac_device_add_device(struct edac_device_ctl_info *edac_dev)
 {
 	edac_dbg(0, "\n");
···
 }
 EXPORT_SYMBOL_GPL(edac_device_add_device);

-/**
- * edac_device_del_device:
- * Remove sysfs entries for specified edac_device structure and
- * then remove edac_device structure from global list
- *
- * @dev:
- * Pointer to 'struct device' representing edac_device
- * structure to remove.
- *
- * Return:
- * Pointer to removed edac_device structure,
- * OR NULL if device not found.
- */
 struct edac_device_ctl_info *edac_device_del_device(struct device *dev)
 {
 	struct edac_device_ctl_info *edac_dev;
···
 	return edac_dev->panic_on_ue;
 }

-/*
- * edac_device_handle_ce
- * perform a common output and handling of an 'edac_dev' CE event
- */
 void edac_device_handle_ce(struct edac_device_ctl_info *edac_dev,
 			   int inst_nr, int block_nr, const char *msg)
 {
···
 }
 EXPORT_SYMBOL_GPL(edac_device_handle_ce);

-/*
- * edac_device_handle_ue
- * perform a common output and handling of an 'edac_dev' UE event
- */
 void edac_device_handle_ue(struct edac_device_ctl_info *edac_dev,
 			   int inst_nr, int block_nr, const char *msg)
 {

+2 -2

drivers/edac/edac_device_sysfs.c

··· 1 1 /* 2 2 * file for managing the edac_device subsystem of devices for EDAC 3 3 * 4 - * (C) 2007 SoftwareBitMaker 4 + * (C) 2007 SoftwareBitMaker 5 5 * 6 6 * This file may be distributed under the terms of the 7 7 * GNU General Public License. ··· 15 15 #include <linux/slab.h> 16 16 #include <linux/edac.h> 17 17 18 - #include "edac_core.h" 18 + #include "edac_device.h" 19 19 #include "edac_module.h" 20 20 21 21 #define EDAC_DEVICE_SYMLINK "device"

+1 -83

drivers/edac/edac_mc.c

··· 30 30 #include <linux/bitops.h> 31 31 #include <asm/uaccess.h> 32 32 #include <asm/page.h> 33 - #include "edac_core.h" 33 + #include "edac_mc.h" 34 34 #include "edac_module.h" 35 35 #include <ras/ras_event.h> 36 36 ··· 239 239 kfree(mci); 240 240 } 241 241 242 - /** 243 - * edac_mc_alloc: Allocate and partially fill a struct mem_ctl_info structure 244 - * @mc_num: Memory controller number 245 - * @n_layers: Number of MC hierarchy layers 246 - * layers: Describes each layer as seen by the Memory Controller 247 - * @size_pvt: size of private storage needed 248 - * 249 - * 250 - * Everything is kmalloc'ed as one big chunk - more efficient. 251 - * Only can be used if all structures have the same lifetime - otherwise 252 - * you have to allocate and initialize your own structures. 253 - * 254 - * Use edac_mc_free() to free mc structures allocated by this function. 255 - * 256 - * NOTE: drivers handle multi-rank memories in different ways: in some 257 - * drivers, one multi-rank memory stick is mapped as one entry, while, in 258 - * others, a single multi-rank memory stick would be mapped into several 259 - * entries. Currently, this function will allocate multiple struct dimm_info 260 - * on such scenarios, as grouping the multiple ranks require drivers change. 261 - * 262 - * Returns: 263 - * On failure: NULL 264 - * On success: struct mem_ctl_info pointer 265 - */ 266 242 struct mem_ctl_info *edac_mc_alloc(unsigned mc_num, 267 243 unsigned n_layers, 268 244 struct edac_mc_layer *layers, ··· 436 460 } 437 461 EXPORT_SYMBOL_GPL(edac_mc_alloc); 438 462 439 - /** 440 - * edac_mc_free 441 - * 'Free' a previously allocated 'mci' structure 442 - * @mci: pointer to a struct mem_ctl_info structure 443 - */ 444 463 void edac_mc_free(struct mem_ctl_info *mci) 445 464 { 446 465 edac_dbg(1, "\n"); ··· 617 646 return handlers; 618 647 } 619 648 620 - /** 621 - * edac_mc_find: Search for a mem_ctl_info structure whose index is 'idx'. 
622 - * 623 - * If found, return a pointer to the structure. 624 - * Else return NULL. 625 - */ 626 649 struct mem_ctl_info *edac_mc_find(int idx) 627 650 { 628 651 struct mem_ctl_info *mci = NULL; ··· 641 676 } 642 677 EXPORT_SYMBOL(edac_mc_find); 643 678 644 - /** 645 - * edac_mc_add_mc_with_groups: Insert the 'mci' structure into the mci 646 - * global list and create sysfs entries associated with mci structure 647 - * @mci: pointer to the mci structure to be added to the list 648 - * @groups: optional attribute groups for the driver-specific sysfs entries 649 - * 650 - * Return: 651 - * 0 Success 652 - * !0 Failure 653 - */ 654 679 655 680 /* FIXME - should a warning be printed if no error detection? correction? */ 656 681 int edac_mc_add_mc_with_groups(struct mem_ctl_info *mci, ··· 731 776 } 732 777 EXPORT_SYMBOL_GPL(edac_mc_add_mc_with_groups); 733 778 734 - /** 735 - * edac_mc_del_mc: Remove sysfs entries for specified mci structure and 736 - * remove mci structure from global list 737 - * @pdev: Pointer to 'struct device' representing mci structure to remove. 738 - * 739 - * Return pointer to removed mci structure, or NULL if device not found. 740 - */ 741 779 struct mem_ctl_info *edac_mc_del_mc(struct device *dev) 742 780 { 743 781 struct mem_ctl_info *mci; ··· 994 1046 edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count); 995 1047 } 996 1048 997 - /** 998 - * edac_raw_mc_handle_error - reports a memory event to userspace without doing 999 - * anything to discover the error location 1000 - * 1001 - * @type: severity of the error (CE/UE/Fatal) 1002 - * @mci: a struct mem_ctl_info pointer 1003 - * @e: error description 1004 - * 1005 - * This raw function is used internally by edac_mc_handle_error(). It should 1006 - * only be called directly when the hardware error come directly from BIOS, 1007 - * like in the case of APEI GHES driver. 
1008 - */ 1009 1049 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type, 1010 1050 struct mem_ctl_info *mci, 1011 1051 struct edac_raw_error_desc *e) ··· 1023 1087 } 1024 1088 EXPORT_SYMBOL_GPL(edac_raw_mc_handle_error); 1025 1089 1026 - /** 1027 - * edac_mc_handle_error - reports a memory event to userspace 1028 - * 1029 - * @type: severity of the error (CE/UE/Fatal) 1030 - * @mci: a struct mem_ctl_info pointer 1031 - * @error_count: Number of errors of the same type 1032 - * @page_frame_number: mem page where the error occurred 1033 - * @offset_in_page: offset of the error inside the page 1034 - * @syndrome: ECC syndrome 1035 - * @top_layer: Memory layer[0] position 1036 - * @mid_layer: Memory layer[1] position 1037 - * @low_layer: Memory layer[2] position 1038 - * @msg: Message meaningful to the end users that 1039 - * explains the event 1040 - * @other_detail: Technical details about the event that 1041 - * may help hardware manufacturers and 1042 - * EDAC developers to analyse the event 1043 - */ 1044 1090 void edac_mc_handle_error(const enum hw_event_mc_err_type type, 1045 1091 struct mem_ctl_info *mci, 1046 1092 const u16 error_count,

+245

drivers/edac/edac_mc.h

··· 1 + /* 2 + * Defines, structures, APIs for edac_mc module 3 + * 4 + * (C) 2007 Linux Networx (http://lnxi.com) 5 + * This file may be distributed under the terms of the 6 + * GNU General Public License. 7 + * 8 + * Written by Thayne Harbaugh 9 + * Based on work by Dan Hollis <goemon at anime dot net> and others. 10 + * http://www.anime.net/~goemon/linux-ecc/ 11 + * 12 + * NMI handling support added by 13 + * Dave Peterson <dsp@llnl.gov> <dave_peterson@pobox.com> 14 + * 15 + * Refactored for multi-source files: 16 + * Doug Thompson <norsk5@xmission.com> 17 + * 18 + * Please look at Documentation/driver-api/edac.rst for more info about 19 + * EDAC core structs and functions. 20 + */ 21 + 22 + #ifndef _EDAC_MC_H_ 23 + #define _EDAC_MC_H_ 24 + 25 + #include <linux/kernel.h> 26 + #include <linux/types.h> 27 + #include <linux/module.h> 28 + #include <linux/spinlock.h> 29 + #include <linux/smp.h> 30 + #include <linux/pci.h> 31 + #include <linux/time.h> 32 + #include <linux/nmi.h> 33 + #include <linux/rcupdate.h> 34 + #include <linux/completion.h> 35 + #include <linux/kobject.h> 36 + #include <linux/platform_device.h> 37 + #include <linux/workqueue.h> 38 + #include <linux/edac.h> 39 + 40 + #if PAGE_SHIFT < 20 41 + #define PAGES_TO_MiB(pages) ((pages) >> (20 - PAGE_SHIFT)) 42 + #define MiB_TO_PAGES(mb) ((mb) << (20 - PAGE_SHIFT)) 43 + #else /* PAGE_SHIFT > 20 */ 44 + #define PAGES_TO_MiB(pages) ((pages) << (PAGE_SHIFT - 20)) 45 + #define MiB_TO_PAGES(mb) ((mb) >> (PAGE_SHIFT - 20)) 46 + #endif 47 + 48 + #define edac_printk(level, prefix, fmt, arg...) \ 49 + printk(level "EDAC " prefix ": " fmt, ##arg) 50 + 51 + #define edac_mc_printk(mci, level, fmt, arg...) \ 52 + printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg) 53 + 54 + #define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \ 55 + printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg) 56 + 57 + #define edac_device_printk(ctl, level, fmt, arg...) 
\ 58 + printk(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg) 59 + 60 + #define edac_pci_printk(ctl, level, fmt, arg...) \ 61 + printk(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg) 62 + 63 + /* prefixes for edac_printk() and edac_mc_printk() */ 64 + #define EDAC_MC "MC" 65 + #define EDAC_PCI "PCI" 66 + #define EDAC_DEBUG "DEBUG" 67 + 68 + extern const char * const edac_mem_types[]; 69 + 70 + #ifdef CONFIG_EDAC_DEBUG 71 + extern int edac_debug_level; 72 + 73 + #define edac_dbg(level, fmt, ...) \ 74 + do { \ 75 + if (level <= edac_debug_level) \ 76 + edac_printk(KERN_DEBUG, EDAC_DEBUG, \ 77 + "%s: " fmt, __func__, ##__VA_ARGS__); \ 78 + } while (0) 79 + 80 + #else /* !CONFIG_EDAC_DEBUG */ 81 + 82 + #define edac_dbg(level, fmt, ...) \ 83 + do { \ 84 + if (0) \ 85 + edac_printk(KERN_DEBUG, EDAC_DEBUG, \ 86 + "%s: " fmt, __func__, ##__VA_ARGS__); \ 87 + } while (0) 88 + 89 + #endif /* !CONFIG_EDAC_DEBUG */ 90 + 91 + #define PCI_VEND_DEV(vend, dev) PCI_VENDOR_ID_ ## vend, \ 92 + PCI_DEVICE_ID_ ## vend ## _ ## dev 93 + 94 + #define edac_dev_name(dev) (dev)->dev_name 95 + 96 + #define to_mci(k) container_of(k, struct mem_ctl_info, dev) 97 + 98 + /** 99 + * edac_mc_alloc() - Allocate and partially fill a struct &mem_ctl_info. 100 + * 101 + * @mc_num: Memory controller number 102 + * @n_layers: Number of MC hierarchy layers 103 + * @layers: Describes each layer as seen by the Memory Controller 104 + * @sz_pvt: size of private storage needed 105 + * 106 + * 107 + * Everything is kmalloc'ed as one big chunk - more efficient. 108 + * Only can be used if all structures have the same lifetime - otherwise 109 + * you have to allocate and initialize your own structures. 110 + * 111 + * Use edac_mc_free() to free mc structures allocated by this function. 112 + * 113 + * .. 
note:: 114 + * 115 + * drivers handle multi-rank memories in different ways: in some 116 + * drivers, one multi-rank memory stick is mapped as one entry, while, in 117 + * others, a single multi-rank memory stick would be mapped into several 118 + * entries. Currently, this function will allocate multiple struct dimm_info 119 + * on such scenarios, as grouping the multiple ranks require drivers change. 120 + * 121 + * Returns: 122 + * On success, return a pointer to struct mem_ctl_info pointer; 123 + * %NULL otherwise 124 + */ 125 + struct mem_ctl_info *edac_mc_alloc(unsigned mc_num, 126 + unsigned n_layers, 127 + struct edac_mc_layer *layers, 128 + unsigned sz_pvt); 129 + 130 + /** 131 + * edac_mc_add_mc_with_groups() - Insert the @mci structure into the mci 132 + * global list and create sysfs entries associated with @mci structure. 133 + * 134 + * @mci: pointer to the mci structure to be added to the list 135 + * @groups: optional attribute groups for the driver-specific sysfs entries 136 + * 137 + * Returns: 138 + * 0 on Success, or an error code on failure 139 + */ 140 + extern int edac_mc_add_mc_with_groups(struct mem_ctl_info *mci, 141 + const struct attribute_group **groups); 142 + #define edac_mc_add_mc(mci) edac_mc_add_mc_with_groups(mci, NULL) 143 + 144 + /** 145 + * edac_mc_free() - Frees a previously allocated @mci structure 146 + * 147 + * @mci: pointer to a struct mem_ctl_info structure 148 + */ 149 + extern void edac_mc_free(struct mem_ctl_info *mci); 150 + 151 + /** 152 + * edac_mc_find() - Search for a mem_ctl_info structure whose index is @idx. 153 + * 154 + * @idx: index to be seek 155 + * 156 + * If found, return a pointer to the structure. 157 + * Else return NULL. 158 + */ 159 + extern struct mem_ctl_info *edac_mc_find(int idx); 160 + 161 + /** 162 + * find_mci_by_dev() - Scan list of controllers looking for the one that 163 + * manages the @dev device. 
164 + * 165 + * @dev: pointer to a struct device related with the MCI 166 + * 167 + * Returns: on success, returns a pointer to struct &mem_ctl_info; 168 + * %NULL otherwise. 169 + */ 170 + extern struct mem_ctl_info *find_mci_by_dev(struct device *dev); 171 + 172 + /** 173 + * edac_mc_del_mc() - Remove sysfs entries for mci structure associated with 174 + * @dev and remove mci structure from global list. 175 + * 176 + * @dev: Pointer to struct &device representing mci structure to remove. 177 + * 178 + * Returns: pointer to removed mci structure, or %NULL if device not found. 179 + */ 180 + extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev); 181 + 182 + /** 183 + * edac_mc_find_csrow_by_page() - Ancillary routine to identify what csrow 184 + * contains a memory page. 185 + * 186 + * @mci: pointer to a struct mem_ctl_info structure 187 + * @page: memory page to find 188 + * 189 + * Returns: on success, returns the csrow. -1 if not found. 190 + */ 191 + extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci, 192 + unsigned long page); 193 + 194 + /** 195 + * edac_raw_mc_handle_error() - Reports a memory event to userspace without 196 + * doing anything to discover the error location. 197 + * 198 + * @type: severity of the error (CE/UE/Fatal) 199 + * @mci: a struct mem_ctl_info pointer 200 + * @e: error description 201 + * 202 + * This raw function is used internally by edac_mc_handle_error(). It should 203 + * only be called directly when the hardware error come directly from BIOS, 204 + * like in the case of APEI GHES driver. 205 + */ 206 + void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type, 207 + struct mem_ctl_info *mci, 208 + struct edac_raw_error_desc *e); 209 + 210 + /** 211 + * edac_mc_handle_error() - Reports a memory event to userspace. 
212 + * 213 + * @type: severity of the error (CE/UE/Fatal) 214 + * @mci: a struct mem_ctl_info pointer 215 + * @error_count: Number of errors of the same type 216 + * @page_frame_number: mem page where the error occurred 217 + * @offset_in_page: offset of the error inside the page 218 + * @syndrome: ECC syndrome 219 + * @top_layer: Memory layer[0] position 220 + * @mid_layer: Memory layer[1] position 221 + * @low_layer: Memory layer[2] position 222 + * @msg: Message meaningful to the end users that 223 + * explains the event 224 + * @other_detail: Technical details about the event that 225 + * may help hardware manufacturers and 226 + * EDAC developers to analyse the event 227 + */ 228 + void edac_mc_handle_error(const enum hw_event_mc_err_type type, 229 + struct mem_ctl_info *mci, 230 + const u16 error_count, 231 + const unsigned long page_frame_number, 232 + const unsigned long offset_in_page, 233 + const unsigned long syndrome, 234 + const int top_layer, 235 + const int mid_layer, 236 + const int low_layer, 237 + const char *msg, 238 + const char *other_detail); 239 + 240 + /* 241 + * edac misc APIs 242 + */ 243 + extern char *edac_op_state_to_string(int op_state); 244 + 245 + #endif /* _EDAC_MC_H_ */
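Note on the PAGES_TO_MiB()/MiB_TO_PAGES() macros added in edac_mc.h: they convert between page counts and MiB by shifting by the difference between PAGE_SHIFT and 20 (1 MiB = 2^20 bytes). The following is a minimal userspace sketch, not kernel code. It assumes 4 KiB pages (PAGE_SHIFT of 12), whereas the kernel takes PAGE_SHIFT from the architecture; the demo_* helpers are hypothetical names used only for illustration.

```c
/* Assumption for illustration only: 4 KiB pages. In the kernel,
 * PAGE_SHIFT is provided by the architecture headers. */
#define PAGE_SHIFT 12

/* Same shape as the macros in the edac_mc.h hunk above. */
#if PAGE_SHIFT < 20
#define PAGES_TO_MiB(pages)	((pages) >> (20 - PAGE_SHIFT))
#define MiB_TO_PAGES(mb)	((mb) << (20 - PAGE_SHIFT))
#else
#define PAGES_TO_MiB(pages)	((pages) << (PAGE_SHIFT - 20))
#define MiB_TO_PAGES(mb)	((mb) >> (PAGE_SHIFT - 20))
#endif

/* Hypothetical wrappers so the conversions can be called as functions. */
static unsigned long demo_pages_to_mib(unsigned long pages)
{
	return PAGES_TO_MiB(pages);
}

static unsigned long demo_mib_to_pages(unsigned long mb)
{
	return MiB_TO_PAGES(mb);
}
```

With 4 KiB pages, 256 pages make up 1 MiB, so the page-to-MiB direction rounds down for any count that is not a multiple of 256.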

+1 -1

drivers/edac/edac_mc_sysfs.c

··· 19 19 #include <linux/pm_runtime.h> 20 20 #include <linux/uaccess.h> 21 21 22 - #include "edac_core.h" 22 + #include "edac_mc.h" 23 23 #include "edac_module.h" 24 24 25 25 /* MC EDAC Controls, setable by module parameter, and sysfs */

+1 -1

drivers/edac/edac_module.c

··· 12 12 */ 13 13 #include <linux/edac.h> 14 14 15 - #include "edac_core.h" 15 + #include "edac_mc.h" 16 16 #include "edac_module.h" 17 17 18 18 #define EDAC_VERSION "Ver: 3.0.0"

+3 -1

drivers/edac/edac_module.h

··· 10 10 #ifndef __EDAC_MODULE_H__ 11 11 #define __EDAC_MODULE_H__ 12 12 13 - #include "edac_core.h" 13 + #include "edac_mc.h" 14 + #include "edac_pci.h" 15 + #include "edac_device.h" 14 16 15 17 /* 16 18 * INTERNAL EDAC MODULE:

+11 -79

drivers/edac/edac_pci.c

··· 9 9 * or implied. 10 10 * 11 11 */ 12 - #include <linux/module.h> 13 - #include <linux/types.h> 14 - #include <linux/smp.h> 15 - #include <linux/init.h> 16 - #include <linux/sysctl.h> 17 - #include <linux/highmem.h> 18 - #include <linux/timer.h> 19 - #include <linux/slab.h> 20 - #include <linux/spinlock.h> 21 - #include <linux/list.h> 22 - #include <linux/ctype.h> 23 - #include <linux/workqueue.h> 24 - #include <asm/uaccess.h> 25 12 #include <asm/page.h> 13 + #include <asm/uaccess.h> 14 + #include <linux/ctype.h> 15 + #include <linux/highmem.h> 16 + #include <linux/init.h> 17 + #include <linux/module.h> 18 + #include <linux/slab.h> 19 + #include <linux/smp.h> 20 + #include <linux/spinlock.h> 21 + #include <linux/sysctl.h> 22 + #include <linux/timer.h> 26 23 27 - #include "edac_core.h" 24 + #include "edac_pci.h" 28 25 #include "edac_module.h" 29 26 30 27 static DEFINE_MUTEX(edac_pci_ctls_mutex); 31 28 static LIST_HEAD(edac_pci_list); 32 29 static atomic_t pci_indexes = ATOMIC_INIT(0); 33 30 34 - /* 35 - * edac_pci_alloc_ctl_info 36 - * 37 - * The alloc() function for the 'edac_pci' control info 38 - * structure. The chip driver will allocate one of these for each 39 - * edac_pci it is going to control/register with the EDAC CORE. 40 - */ 41 31 struct edac_pci_ctl_info *edac_pci_alloc_ctl_info(unsigned int sz_pvt, 42 32 const char *edac_pci_name) 43 33 { ··· 58 68 } 59 69 EXPORT_SYMBOL_GPL(edac_pci_alloc_ctl_info); 60 70 61 - /* 62 - * edac_pci_free_ctl_info() 63 - * 64 - * Last action on the pci control structure. 65 - * 66 - * call the remove sysfs information, which will unregister 67 - * this control struct's kobj. When that kobj's ref count 68 - * goes to zero, its release function will be call and then 69 - * kfree() the memory. 
70 - */ 71 71 void edac_pci_free_ctl_info(struct edac_pci_ctl_info *pci) 72 72 { 73 73 edac_dbg(1, "\n"); ··· 195 215 mutex_unlock(&edac_pci_ctls_mutex); 196 216 } 197 217 198 - /* 199 - * edac_pci_alloc_index: Allocate a unique PCI index number 200 - * 201 - * Return: 202 - * allocated index number 203 - * 204 - */ 205 218 int edac_pci_alloc_index(void) 206 219 { 207 220 return atomic_inc_return(&pci_indexes) - 1; 208 221 } 209 222 EXPORT_SYMBOL_GPL(edac_pci_alloc_index); 210 223 211 - /* 212 - * edac_pci_add_device: Insert the 'edac_dev' structure into the 213 - * edac_pci global list and create sysfs entries associated with 214 - * edac_pci structure. 215 - * @pci: pointer to the edac_device structure to be added to the list 216 - * @edac_idx: A unique numeric identifier to be assigned to the 217 - * 'edac_pci' structure. 218 - * 219 - * Return: 220 - * 0 Success 221 - * !0 Failure 222 - */ 223 224 int edac_pci_add_device(struct edac_pci_ctl_info *pci, int edac_idx) 224 225 { 225 226 edac_dbg(0, "\n"); ··· 246 285 } 247 286 EXPORT_SYMBOL_GPL(edac_pci_add_device); 248 287 249 - /* 250 - * edac_pci_del_device() 251 - * Remove sysfs entries for specified edac_pci structure and 252 - * then remove edac_pci structure from global list 253 - * 254 - * @dev: 255 - * Pointer to 'struct device' representing edac_pci structure 256 - * to remove 257 - * 258 - * Return: 259 - * Pointer to removed edac_pci structure, 260 - * or NULL if device not found 261 - */ 262 288 struct edac_pci_ctl_info *edac_pci_del_device(struct device *dev) 263 289 { 264 290 struct edac_pci_ctl_info *pci; ··· 299 351 int edac_idx; 300 352 }; 301 353 302 - /* 303 - * edac_pci_create_generic_ctl 304 - * 305 - * A generic constructor for a PCI parity polling device 306 - * Some systems have more than one domain of PCI busses. 307 - * For systems with one domain, then this API will 308 - * provide for a generic poller. 
309 - * 310 - * This routine calls the edac_pci_alloc_ctl_info() for 311 - * the generic device, with default values 312 - */ 313 354 struct edac_pci_ctl_info *edac_pci_create_generic_ctl(struct device *dev, 314 355 const char *mod_name) 315 356 { ··· 331 394 } 332 395 EXPORT_SYMBOL_GPL(edac_pci_create_generic_ctl); 333 396 334 - /* 335 - * edac_pci_release_generic_ctl 336 - * 337 - * The release function of a generic EDAC PCI polling device 338 - */ 339 397 void edac_pci_release_generic_ctl(struct edac_pci_ctl_info *pci) 340 398 { 341 399 edac_dbg(0, "pci mod=%s\n", pci->mod_name);
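Note on edac_pci_alloc_index() in the hunk above: it returns atomic_inc_return(&pci_indexes) - 1, which hands out 0, 1, 2, ... without a lock. A rough userspace analogue, assuming C11 <stdatomic.h> in place of the kernel's atomic_t API, might look like the sketch below; demo_indexes and demo_alloc_index are hypothetical names, not part of the EDAC code.

```c
#include <stdatomic.h>

/* Hypothetical counter; static storage, so it starts at zero. */
static atomic_int demo_indexes;

/* atomic_fetch_add() returns the value held before the increment,
 * which matches the kernel's atomic_inc_return(&v) - 1 idiom. */
static int demo_alloc_index(void)
{
	return atomic_fetch_add(&demo_indexes, 1);
}
```

Each caller gets a distinct index even when several threads call the function concurrently, because the read and the increment happen as one atomic operation.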

+271

drivers/edac/edac_pci.h

··· 1 + /* 2 + * Defines, structures, APIs for edac_pci and edac_pci_sysfs 3 + * 4 + * (C) 2007 Linux Networx (http://lnxi.com) 5 + * This file may be distributed under the terms of the 6 + * GNU General Public License. 7 + * 8 + * Written by Thayne Harbaugh 9 + * Based on work by Dan Hollis <goemon at anime dot net> and others. 10 + * http://www.anime.net/~goemon/linux-ecc/ 11 + * 12 + * NMI handling support added by 13 + * Dave Peterson <dsp@llnl.gov> <dave_peterson@pobox.com> 14 + * 15 + * Refactored for multi-source files: 16 + * Doug Thompson <norsk5@xmission.com> 17 + * 18 + * Please look at Documentation/driver-api/edac.rst for more info about 19 + * EDAC core structs and functions. 20 + */ 21 + 22 + #ifndef _EDAC_PCI_H_ 23 + #define _EDAC_PCI_H_ 24 + 25 + #include <linux/completion.h> 26 + #include <linux/device.h> 27 + #include <linux/edac.h> 28 + #include <linux/kobject.h> 29 + #include <linux/list.h> 30 + #include <linux/pci.h> 31 + #include <linux/types.h> 32 + #include <linux/workqueue.h> 33 + 34 + #ifdef CONFIG_PCI 35 + 36 + struct edac_pci_counter { 37 + atomic_t pe_count; 38 + atomic_t npe_count; 39 + }; 40 + 41 + /* 42 + * Abstract edac_pci control info structure 43 + * 44 + */ 45 + struct edac_pci_ctl_info { 46 + /* for global list of edac_pci_ctl_info structs */ 47 + struct list_head link; 48 + 49 + int pci_idx; 50 + 51 + struct bus_type *edac_subsys; /* pointer to subsystem */ 52 + 53 + /* the internal state of this controller instance */ 54 + int op_state; 55 + /* work struct for this instance */ 56 + struct delayed_work work; 57 + 58 + /* pointer to edac polling checking routine: 59 + * If NOT NULL: points to polling check routine 60 + * If NULL: Then assumes INTERRUPT operation, where 61 + * MC driver will receive events 62 + */ 63 + void (*edac_check) (struct edac_pci_ctl_info * edac_dev); 64 + 65 + struct device *dev; /* pointer to device structure */ 66 + 67 + const char *mod_name; /* module name */ 68 + const char *ctl_name; /* edac 
controller name */ 69 + const char *dev_name; /* pci/platform/etc... name */ 70 + 71 + void *pvt_info; /* pointer to 'private driver' info */ 72 + 73 + unsigned long start_time; /* edac_pci load start time (jiffies) */ 74 + 75 + struct completion complete; 76 + 77 + /* sysfs top name under 'edac' directory 78 + * and instance name: 79 + * cpu/cpu0/... 80 + * cpu/cpu1/... 81 + * cpu/cpu2/... 82 + * ... 83 + */ 84 + char name[EDAC_DEVICE_NAME_LEN + 1]; 85 + 86 + /* Event counters for the this whole EDAC Device */ 87 + struct edac_pci_counter counters; 88 + 89 + /* edac sysfs device control for the 'name' 90 + * device this structure controls 91 + */ 92 + struct kobject kobj; 93 + }; 94 + 95 + #define to_edac_pci_ctl_work(w) \ 96 + container_of(w, struct edac_pci_ctl_info,work) 97 + 98 + /* write all or some bits in a byte-register*/ 99 + static inline void pci_write_bits8(struct pci_dev *pdev, int offset, u8 value, 100 + u8 mask) 101 + { 102 + if (mask != 0xff) { 103 + u8 buf; 104 + 105 + pci_read_config_byte(pdev, offset, &buf); 106 + value &= mask; 107 + buf &= ~mask; 108 + value |= buf; 109 + } 110 + 111 + pci_write_config_byte(pdev, offset, value); 112 + } 113 + 114 + /* write all or some bits in a word-register*/ 115 + static inline void pci_write_bits16(struct pci_dev *pdev, int offset, 116 + u16 value, u16 mask) 117 + { 118 + if (mask != 0xffff) { 119 + u16 buf; 120 + 121 + pci_read_config_word(pdev, offset, &buf); 122 + value &= mask; 123 + buf &= ~mask; 124 + value |= buf; 125 + } 126 + 127 + pci_write_config_word(pdev, offset, value); 128 + } 129 + 130 + /* 131 + * pci_write_bits32 132 + * 133 + * edac local routine to do pci_write_config_dword, but adds 134 + * a mask parameter. If mask is all ones, ignore the mask. 
135 + * Otherwise utilize the mask to isolate specified bits 136 + * 137 + * write all or some bits in a dword-register 138 + */ 139 + static inline void pci_write_bits32(struct pci_dev *pdev, int offset, 140 + u32 value, u32 mask) 141 + { 142 + if (mask != 0xffffffff) { 143 + u32 buf; 144 + 145 + pci_read_config_dword(pdev, offset, &buf); 146 + value &= mask; 147 + buf &= ~mask; 148 + value |= buf; 149 + } 150 + 151 + pci_write_config_dword(pdev, offset, value); 152 + } 153 + 154 + #endif /* CONFIG_PCI */ 155 + 156 + /* 157 + * edac_pci APIs 158 + */ 159 + 160 + /** 161 + * edac_pci_alloc_ctl_info: 162 + * The alloc() function for the 'edac_pci' control info 163 + * structure. 164 + * 165 + * @sz_pvt: size of the private info at struct &edac_pci_ctl_info 166 + * @edac_pci_name: name of the PCI device 167 + * 168 + * The chip driver will allocate one of these for each 169 + * edac_pci it is going to control/register with the EDAC CORE. 170 + * 171 + * Returns: a pointer to struct &edac_pci_ctl_info on success; %NULL otherwise. 172 + */ 173 + extern struct edac_pci_ctl_info *edac_pci_alloc_ctl_info(unsigned int sz_pvt, 174 + const char *edac_pci_name); 175 + 176 + /** 177 + * edac_pci_free_ctl_info(): 178 + * Last action on the pci control structure. 179 + * 180 + * @pci: pointer to struct &edac_pci_ctl_info 181 + * 182 + * Calls the remove sysfs information, which will unregister 183 + * this control struct's kobj. When that kobj's ref count 184 + * goes to zero, its release function will be call and then 185 + * kfree() the memory. 
186 + */ 187 + extern void edac_pci_free_ctl_info(struct edac_pci_ctl_info *pci); 188 + 189 + /** 190 + * edac_pci_alloc_index: Allocate a unique PCI index number 191 + * 192 + * Returns: 193 + * allocated index number 194 + * 195 + */ 196 + extern int edac_pci_alloc_index(void); 197 + 198 + /** 199 + * edac_pci_add_device(): Insert the 'edac_dev' structure into the 200 + * edac_pci global list and create sysfs entries associated with 201 + * edac_pci structure. 202 + * 203 + * @pci: pointer to the edac_device structure to be added to the list 204 + * @edac_idx: A unique numeric identifier to be assigned to the 205 + * 'edac_pci' structure. 206 + * 207 + * Returns: 208 + * 0 on Success, or an error code on failure 209 + */ 210 + extern int edac_pci_add_device(struct edac_pci_ctl_info *pci, int edac_idx); 211 + 212 + /** 213 + * edac_pci_del_device() 214 + * Remove sysfs entries for specified edac_pci structure and 215 + * then remove edac_pci structure from global list 216 + * 217 + * @dev: 218 + * Pointer to 'struct device' representing edac_pci structure 219 + * to remove 220 + * 221 + * Returns: 222 + * Pointer to removed edac_pci structure, 223 + * or %NULL if device not found 224 + */ 225 + extern struct edac_pci_ctl_info *edac_pci_del_device(struct device *dev); 226 + 227 + /** 228 + * edac_pci_create_generic_ctl() 229 + * A generic constructor for a PCI parity polling device 230 + * Some systems have more than one domain of PCI busses. 231 + * For systems with one domain, then this API will 232 + * provide for a generic poller. 233 + * 234 + * @dev: pointer to struct &device; 235 + * @mod_name: name of the PCI device 236 + * 237 + * This routine calls the edac_pci_alloc_ctl_info() for 238 + * the generic device, with default values 239 + * 240 + * Returns: Pointer to struct &edac_pci_ctl_info on success, %NULL on 241 + * failure. 
242 + */ 243 + extern struct edac_pci_ctl_info *edac_pci_create_generic_ctl( 244 + struct device *dev, 245 + const char *mod_name); 246 + 247 + /** 248 + * edac_pci_release_generic_ctl 249 + * The release function of a generic EDAC PCI polling device 250 + * 251 + * @pci: pointer to struct &edac_pci_ctl_info 252 + */ 253 + extern void edac_pci_release_generic_ctl(struct edac_pci_ctl_info *pci); 254 + 255 + /** 256 + * edac_pci_create_sysfs 257 + * Create the controls/attributes for the specified EDAC PCI device 258 + * 259 + * @pci: pointer to struct &edac_pci_ctl_info 260 + */ 261 + extern int edac_pci_create_sysfs(struct edac_pci_ctl_info *pci); 262 + 263 + /** 264 + * edac_pci_remove_sysfs() 265 + * remove the controls and attributes for this EDAC PCI device 266 + * 267 + * @pci: pointer to struct &edac_pci_ctl_info 268 + */ 269 + extern void edac_pci_remove_sysfs(struct edac_pci_ctl_info *pci); 270 + 271 + #endif
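Note on the pci_write_bits8/16/32() helpers in edac_pci.h: each one performs a masked read-modify-write, touching only the bits selected by the mask and skipping the read entirely when the mask is all ones. The sketch below mirrors the 8-bit variant in plain C. The demo_cfg array and the demo_* accessors are hypothetical stand-ins for PCI config space and for pci_read_config_byte()/pci_write_config_byte(); they are not part of the kernel API.

```c
#include <stdint.h>

/* Hypothetical stand-in for a device's 256-byte PCI config space. */
static uint8_t demo_cfg[256];

static void demo_read_config_byte(int offset, uint8_t *val)
{
	*val = demo_cfg[offset];
}

static void demo_write_config_byte(int offset, uint8_t val)
{
	demo_cfg[offset] = val;
}

/* Same logic as pci_write_bits8(): keep the bits outside the mask,
 * replace the bits inside it. A full mask skips the read. */
static void demo_write_bits8(int offset, uint8_t value, uint8_t mask)
{
	if (mask != 0xff) {
		uint8_t buf;

		demo_read_config_byte(offset, &buf);
		value &= mask;          /* new bits, limited to the mask */
		buf &= (uint8_t)~mask;  /* old bits outside the mask */
		value |= buf;
	}

	demo_write_config_byte(offset, value);
}
```

For example, if the register holds 0xA5 and the caller writes 0xFF with mask 0x0F, only the low nibble changes and the register ends up as 0xAF.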

+1 -12

drivers/edac/edac_pci_sysfs.c

··· 11 11 #include <linux/slab.h> 12 12 #include <linux/ctype.h> 13 13 14 - #include "edac_core.h" 14 + #include "edac_pci.h" 15 15 #include "edac_module.h" 16 16 17 17 #define EDAC_PCI_SYMLINK "device" ··· 418 418 } 419 419 } 420 420 421 - /* 422 - * 423 - * edac_pci_create_sysfs 424 - * 425 - * Create the controls/attributes for the specified EDAC PCI device 426 - */ 427 421 int edac_pci_create_sysfs(struct edac_pci_ctl_info *pci) 428 422 { 429 423 int err; ··· 453 459 return err; 454 460 } 455 461 456 - /* 457 - * edac_pci_remove_sysfs 458 - * 459 - * remove the controls and attributes for this EDAC PCI device 460 - */ 461 462 void edac_pci_remove_sysfs(struct edac_pci_ctl_info *pci) 462 463 { 463 464 edac_dbg(0, "index=%d\n", pci->pci_idx);

-1

drivers/edac/fsl_ddr_edac.c

··· 28 28 #include <linux/of_device.h> 29 29 #include <linux/of_address.h> 30 30 #include "edac_module.h" 31 - #include "edac_core.h" 32 31 #include "fsl_ddr_edac.h" 33 32 34 33 #define EDAC_MOD_STR "fsl_ddr_edac"

+1 -1

drivers/edac/ghes_edac.c

··· 14 14 #include <acpi/ghes.h> 15 15 #include <linux/edac.h> 16 16 #include <linux/dmi.h> 17 - #include "edac_core.h" 17 + #include "edac_module.h" 18 18 #include <ras/ras_event.h> 19 19 20 20 #define GHES_EDAC_REVISION " Ver: 1.0.0"

-1

drivers/edac/highbank_l2_edac.c

··· 21 21 #include <linux/platform_device.h> 22 22 #include <linux/of_platform.h> 23 23 24 - #include "edac_core.h" 25 24 #include "edac_module.h" 26 25 27 26 #define SR_CLR_SB_ECC_INTR 0x0

-1

drivers/edac/highbank_mc_edac.c

··· 22 22 #include <linux/of_platform.h> 23 23 #include <linux/uaccess.h> 24 24 25 - #include "edac_core.h" 26 25 #include "edac_module.h" 27 26 28 27 /* DDR Ctrlr Error Registers */

+1 -1

drivers/edac/i3000_edac.c

··· 14 14 #include <linux/pci.h> 15 15 #include <linux/pci_ids.h> 16 16 #include <linux/edac.h> 17 - #include "edac_core.h" 17 + #include "edac_module.h" 18 18 19 19 #define I3000_REVISION "1.1" 20 20

+1 -1

drivers/edac/i3200_edac.c

··· 13 13 #include <linux/pci_ids.h> 14 14 #include <linux/edac.h> 15 15 #include <linux/io.h> 16 - #include "edac_core.h" 16 + #include "edac_module.h" 17 17 18 18 #include <linux/io-64-nonatomic-lo-hi.h> 19 19

+1 -1

drivers/edac/i5000_edac.c

··· 22 22 #include <linux/edac.h> 23 23 #include <asm/mmzone.h> 24 24 25 - #include "edac_core.h" 25 + #include "edac_module.h" 26 26 27 27 /* 28 28 * Alter this version for the I5000 module when modifications are made

-1

drivers/edac/i5100_edac.c

··· 29 29 #include <linux/mmzone.h> 30 30 #include <linux/debugfs.h> 31 31 32 - #include "edac_core.h" 33 32 #include "edac_module.h" 34 33 35 34 /* register addresses */

+1 -1

drivers/edac/i5400_edac.c

··· 32 32 #include <linux/edac.h> 33 33 #include <linux/mmzone.h> 34 34 35 - #include "edac_core.h" 35 + #include "edac_module.h" 36 36 37 37 /* 38 38 * Alter this version for the I5400 module when modifications are made

+1 -1

drivers/edac/i7300_edac.c

··· 26 26 #include <linux/edac.h> 27 27 #include <linux/mmzone.h> 28 28 29 - #include "edac_core.h" 29 + #include "edac_module.h" 30 30 31 31 /* 32 32 * Alter this version for the I7300 module when modifications are made

+1 -1

drivers/edac/i7core_edac.c

··· 39 39 #include <asm/processor.h> 40 40 #include <asm/div64.h> 41 41 42 - #include "edac_core.h" 42 + #include "edac_module.h" 43 43 44 44 /* Static vars */ 45 45 static LIST_HEAD(i7core_edac_list);

+1 -1

drivers/edac/i82443bxgx_edac.c

··· 29 29 30 30 31 31 #include <linux/edac.h> 32 - #include "edac_core.h" 32 + #include "edac_module.h" 33 33 34 34 #define I82443_REVISION "0.1" 35 35

+1 -1

drivers/edac/i82860_edac.c

··· 14 14 #include <linux/pci.h> 15 15 #include <linux/pci_ids.h> 16 16 #include <linux/edac.h> 17 - #include "edac_core.h" 17 + #include "edac_module.h" 18 18 19 19 #define I82860_REVISION " Ver: 2.0.2" 20 20 #define EDAC_MOD_STR "i82860_edac"

+1 -1

drivers/edac/i82875p_edac.c

··· 18 18 #include <linux/pci.h> 19 19 #include <linux/pci_ids.h> 20 20 #include <linux/edac.h> 21 - #include "edac_core.h" 21 + #include "edac_module.h" 22 22 23 23 #define I82875P_REVISION " Ver: 2.0.2" 24 24 #define EDAC_MOD_STR "i82875p_edac"

+1 -1

drivers/edac/i82975x_edac.c

··· 14 14 #include <linux/pci.h> 15 15 #include <linux/pci_ids.h> 16 16 #include <linux/edac.h> 17 - #include "edac_core.h" 17 + #include "edac_module.h" 18 18 19 19 #define I82975X_REVISION " Ver: 1.0.0" 20 20 #define EDAC_MOD_STR "i82975x_edac"

+1 -1

drivers/edac/ie31200_edac.c

··· 41 41 #include <linux/edac.h> 42 42 43 43 #include <linux/io-64-nonatomic-lo-hi.h> 44 - #include "edac_core.h" 44 + #include "edac_module.h" 45 45 46 46 #define IE31200_REVISION "1.0" 47 47 #define EDAC_MOD_STR "ie31200_edac"

+1 -1

drivers/edac/layerscape_edac.c

··· 16 16 17 17 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 18 18 19 - #include "edac_core.h" 19 + #include "edac_module.h" 20 20 #include "fsl_ddr_edac.h" 21 21 22 22 static const struct of_device_id fsl_ddr_mc_err_of_match[] = {

-1

drivers/edac/mpc85xx_edac.c

··· 25 25 #include <linux/of_platform.h> 26 26 #include <linux/of_device.h> 27 27 #include "edac_module.h" 28 - #include "edac_core.h" 29 28 #include "mpc85xx_edac.h" 30 29 #include "fsl_ddr_edac.h" 31 30

-1

drivers/edac/mv64x60_edac.c

··· 17 17 #include <linux/edac.h> 18 18 #include <linux/gfp.h> 19 19 20 - #include "edac_core.h" 21 20 #include "edac_module.h" 22 21 #include "mv64x60_edac.h" 23 22

-1

drivers/edac/octeon_edac-l2c.c

··· 16 16 17 17 #include <asm/octeon/cvmx.h> 18 18 19 - #include "edac_core.h" 20 19 #include "edac_module.h" 21 20 22 21 #define EDAC_MOD_STR "octeon-l2c"

-1

drivers/edac/octeon_edac-lmc.c

··· 19 19 #include <asm/octeon/octeon.h> 20 20 #include <asm/octeon/cvmx-lmcx-defs.h> 21 21 22 - #include "edac_core.h" 23 22 #include "edac_module.h" 24 23 25 24 #define OCTEON_MAX_MC 4

-1

drivers/edac/octeon_edac-pc.c

··· 15 15 #include <linux/io.h> 16 16 #include <linux/edac.h> 17 17 18 - #include "edac_core.h" 19 18 #include "edac_module.h" 20 19 21 20 #include <asm/octeon/cvmx.h>

-1

drivers/edac/octeon_edac-pci.c

··· 18 18 #include <asm/octeon/cvmx-pci-defs.h> 19 19 #include <asm/octeon/octeon.h> 20 20 21 - #include "edac_core.h" 22 21 #include "edac_module.h" 23 22 24 23 static void octeon_pci_poll(struct edac_pci_ctl_info *pci)

+1 -1

drivers/edac/pasemi_edac.c

··· 26 26 #include <linux/pci.h> 27 27 #include <linux/pci_ids.h> 28 28 #include <linux/edac.h> 29 - #include "edac_core.h" 29 + #include "edac_module.h" 30 30 31 31 #define MODULE_NAME "pasemi_edac" 32 32

+1 -1

drivers/edac/ppc4xx_edac.c

··· 21 21 22 22 #include <asm/dcr.h> 23 23 24 - #include "edac_core.h" 24 + #include "edac_module.h" 25 25 #include "ppc4xx_edac.h" 26 26 27 27 /*

+1 -1

drivers/edac/r82600_edac.c

··· 20 20 #include <linux/pci.h> 21 21 #include <linux/pci_ids.h> 22 22 #include <linux/edac.h> 23 - #include "edac_core.h" 23 + #include "edac_module.h" 24 24 25 25 #define R82600_REVISION " Ver: 2.0.2" 26 26 #define EDAC_MOD_STR "r82600_edac"

+1 -1

drivers/edac/sb_edac.c

··· 27 27 #include <asm/processor.h> 28 28 #include <asm/mce.h> 29 29 30 - #include "edac_core.h" 30 + #include "edac_module.h" 31 31 32 32 /* Static vars */ 33 33 static LIST_HEAD(sbridge_edac_list);

+1 -1

drivers/edac/skx_edac.c

··· 29 29 #include <asm/processor.h> 30 30 #include <asm/mce.h> 31 31 32 - #include "edac_core.h" 32 + #include "edac_module.h" 33 33 34 34 #define SKX_REVISION " Ver: 1.0 " 35 35

+1 -1

drivers/edac/synopsys_edac.c

··· 23 23 #include <linux/module.h> 24 24 #include <linux/platform_device.h> 25 25 26 - #include "edac_core.h" 26 + #include "edac_module.h" 27 27 28 28 /* Number of cs_rows needed per memory controller */ 29 29 #define SYNPS_EDAC_NR_CSROWS 1

+1 -1

drivers/edac/tile_edac.c

··· 30 30 #include <hv/hypervisor.h> 31 31 #include <hv/drv_mshim_intf.h> 32 32 33 - #include "edac_core.h" 33 + #include "edac_module.h" 34 34 35 35 #define DRV_NAME "tile-edac" 36 36

+1 -1

drivers/edac/x38_edac.c

··· 16 16 #include <linux/edac.h> 17 17 18 18 #include <linux/io-64-nonatomic-lo-hi.h> 19 - #include "edac_core.h" 19 + #include "edac_module.h" 20 20 21 21 #define X38_REVISION "1.1" 22 22

-1

drivers/edac/xgene_edac.c

··· 28 28 #include <linux/of_address.h> 29 29 #include <linux/regmap.h> 30 30 31 - #include "edac_core.h" 32 31 #include "edac_module.h" 33 32 34 33 #define EDAC_MOD_STR "xgene_edac"

+31 -123

include/linux/edac.h

··· 18 18 #include <linux/workqueue.h> 19 19 #include <linux/debugfs.h> 20 20 21 + #define EDAC_DEVICE_NAME_LEN 31 22 + 21 23 struct device; 22 24 23 25 #define EDAC_OPSTATE_INVAL -1 ··· 130 128 * fatal (maybe it is on an unused memory area, 131 129 * or the memory controller could recover from 132 130 * it for example, by re-trying the operation). 131 + * @HW_EVENT_ERR_DEFERRED: Deferred Error - Indicates an uncorrectable 132 + * error whose handling is not urgent. This could 133 + * be due to hardware data poisoning where the 134 + * system can continue operation until the poisoned 135 + * data is consumed. Preemptive measures may also 136 + * be taken, e.g. offlining pages, etc. 133 137 * @HW_EVENT_ERR_FATAL: Fatal Error - Uncorrected error that could not 134 138 * be recovered. 139 + * @HW_EVENT_ERR_INFO: Informational - The CPER spec defines a forth 140 + * type of error: informational logs. 135 141 */ 136 142 enum hw_event_mc_err_type { 137 143 HW_EVENT_ERR_CORRECTED, ··· 170 160 * enum mem_type - memory types. For a more detailed reference, please see 171 161 * http://en.wikipedia.org/wiki/DRAM 172 162 * 173 - * @MEM_EMPTY Empty csrow 163 + * @MEM_EMPTY: Empty csrow 174 164 * @MEM_RESERVED: Reserved csrow type 175 165 * @MEM_UNKNOWN: Unknown csrow type 176 166 * @MEM_FPM: FPM - Fast Page Mode, used on systems up to 1995. 
··· 294 284 295 285 /** 296 286 * enum scrub_type - scrubbing capabilities 297 - * @SCRUB_UNKNOWN Unknown if scrubber is available 287 + * @SCRUB_UNKNOWN: Unknown if scrubber is available 298 288 * @SCRUB_NONE: No scrubber 299 289 * @SCRUB_SW_PROG: SW progressive (sequential) scrubbing 300 290 * @SCRUB_SW_SRC: Software scrub only errors ··· 303 293 * @SCRUB_HW_PROG: HW progressive (sequential) scrubbing 304 294 * @SCRUB_HW_SRC: Hardware scrub only errors 305 295 * @SCRUB_HW_PROG_SRC: Progressive hardware scrub from an error 306 - * SCRUB_HW_TUNABLE: Hardware scrub frequency is tunable 296 + * @SCRUB_HW_TUNABLE: Hardware scrub frequency is tunable 307 297 */ 308 298 enum scrub_type { 309 299 SCRUB_UNKNOWN = 0, ··· 336 326 #define OP_RUNNING_POLL_INTR 0x203 337 327 #define OP_OFFLINE 0x300 338 328 339 - /* 340 - * Concepts used at the EDAC subsystem 341 - * 342 - * There are several things to be aware of that aren't at all obvious: 343 - * 344 - * SOCKETS, SOCKET SETS, BANKS, ROWS, CHIP-SELECT ROWS, CHANNELS, etc.. 345 - * 346 - * These are some of the many terms that are thrown about that don't always 347 - * mean what people think they mean (Inconceivable!). In the interest of 348 - * creating a common ground for discussion, terms and their definitions 349 - * will be established. 350 - * 351 - * Memory devices: The individual DRAM chips on a memory stick. These 352 - * devices commonly output 4 and 8 bits each (x4, x8). 353 - * Grouping several of these in parallel provides the 354 - * number of bits that the memory controller expects: 355 - * typically 72 bits, in order to provide 64 bits + 356 - * 8 bits of ECC data. 357 - * 358 - * Memory Stick: A printed circuit board that aggregates multiple 359 - * memory devices in parallel. In general, this is the 360 - * Field Replaceable Unit (FRU) which gets replaced, in 361 - * the case of excessive errors. Most often it is also 362 - * called DIMM (Dual Inline Memory Module). 
363 - * 364 - * Memory Socket: A physical connector on the motherboard that accepts 365 - * a single memory stick. Also called as "slot" on several 366 - * datasheets. 367 - * 368 - * Channel: A memory controller channel, responsible to communicate 369 - * with a group of DIMMs. Each channel has its own 370 - * independent control (command) and data bus, and can 371 - * be used independently or grouped with other channels. 372 - * 373 - * Branch: It is typically the highest hierarchy on a 374 - * Fully-Buffered DIMM memory controller. 375 - * Typically, it contains two channels. 376 - * Two channels at the same branch can be used in single 377 - * mode or in lockstep mode. 378 - * When lockstep is enabled, the cacheline is doubled, 379 - * but it generally brings some performance penalty. 380 - * Also, it is generally not possible to point to just one 381 - * memory stick when an error occurs, as the error 382 - * correction code is calculated using two DIMMs instead 383 - * of one. Due to that, it is capable of correcting more 384 - * errors than on single mode. 385 - * 386 - * Single-channel: The data accessed by the memory controller is contained 387 - * into one dimm only. E. g. if the data is 64 bits-wide, 388 - * the data flows to the CPU using one 64 bits parallel 389 - * access. 390 - * Typically used with SDR, DDR, DDR2 and DDR3 memories. 391 - * FB-DIMM and RAMBUS use a different concept for channel, 392 - * so this concept doesn't apply there. 393 - * 394 - * Double-channel: The data size accessed by the memory controller is 395 - * interlaced into two dimms, accessed at the same time. 396 - * E. g. if the DIMM is 64 bits-wide (72 bits with ECC), 397 - * the data flows to the CPU using a 128 bits parallel 398 - * access. 399 - * 400 - * Chip-select row: This is the name of the DRAM signal used to select the 401 - * DRAM ranks to be accessed. Common chip-select rows for 402 - * single channel are 64 bits, for dual channel 128 bits. 
403 - * It may not be visible by the memory controller, as some 404 - * DIMM types have a memory buffer that can hide direct 405 - * access to it from the Memory Controller. 406 - * 407 - * Single-Ranked stick: A Single-ranked stick has 1 chip-select row of memory. 408 - * Motherboards commonly drive two chip-select pins to 409 - * a memory stick. A single-ranked stick, will occupy 410 - * only one of those rows. The other will be unused. 411 - * 412 - * Double-Ranked stick: A double-ranked stick has two chip-select rows which 413 - * access different sets of memory devices. The two 414 - * rows cannot be accessed concurrently. 415 - * 416 - * Double-sided stick: DEPRECATED TERM, see Double-Ranked stick. 417 - * A double-sided stick has two chip-select rows which 418 - * access different sets of memory devices. The two 419 - * rows cannot be accessed concurrently. "Double-sided" 420 - * is irrespective of the memory devices being mounted 421 - * on both sides of the memory stick. 422 - * 423 - * Socket set: All of the memory sticks that are required for 424 - * a single memory access or all of the memory sticks 425 - * spanned by a chip-select row. A single socket set 426 - * has two chip-select rows and if double-sided sticks 427 - * are used these will occupy those chip-select rows. 428 - * 429 - * Bank: This term is avoided because it is unclear when 430 - * needing to distinguish between chip-select rows and 431 - * socket sets. 432 - * 433 - * Controller pages: 434 - * 435 - * Physical pages: 436 - * 437 - * Virtual pages: 438 - * 439 - * 440 - * STRUCTURE ORGANIZATION AND CHOICES 441 - * 442 - * 443 - * 444 - * PS - I enjoyed writing all that about as much as you enjoyed reading it. 
445 - */ 446 - 447 329 /** 448 330 * enum edac_mc_layer - memory controller hierarchy layer 449 331 * ··· 360 458 361 459 /** 362 460 * struct edac_mc_layer - describes the memory controller hierarchy 363 - * @layer: layer type 461 + * @type: layer type 364 462 * @size: number of components per layer. For example, 365 463 * if the channel layer has two channels, size = 2 366 464 * @is_virt_csrow: This layer is part of the "csrow" when old API ··· 383 481 #define EDAC_MAX_LAYERS 3 384 482 385 483 /** 386 - * EDAC_DIMM_OFF - Macro responsible to get a pointer offset inside a pointer array 387 - * for the element given by [layer0,layer1,layer2] position 484 + * EDAC_DIMM_OFF - Macro responsible to get a pointer offset inside a pointer 485 + * array for the element given by [layer0,layer1,layer2] 486 + * position 388 487 * 389 488 * @layers: a struct edac_mc_layer array, describing how many elements 390 489 * were allocated for each layer 391 - * @n_layers: Number of layers at the @layers array 490 + * @nlayers: Number of layers at the @layers array 392 491 * @layer0: layer0 position 393 492 * @layer1: layer1 position. Unused if n_layers < 2 394 493 * @layer2: layer2 position. Unused if n_layers < 3 395 494 * 396 - * For 1 layer, this macro returns &var[layer0] - &var 495 + * For 1 layer, this macro returns "var[layer0] - var"; 496 + * 397 497 * For 2 layers, this macro is similar to allocate a bi-dimensional array 398 - * and to return "&var[layer0][layer1] - &var" 498 + * and to return "var[layer0][layer1] - var"; 499 + * 399 500 * For 3 layers, this macro is similar to allocate a tri-dimensional array 400 - * and to return "&var[layer0][layer1][layer2] - &var" 501 + * and to return "var[layer0][layer1][layer2] - var". 401 502 * 402 503 * A loop could be used here to make it more generic, but, as we only have 403 504 * 3 layers, this is a little faster. 505 + * 404 506 * By design, layers can never be 0 or more than 3. 
If that ever happens, 405 507 * a NULL is returned, causing an OOPS during the memory allocation routine, 406 508 * with would point to the developer that he's doing something wrong. ··· 431 525 * were allocated for each layer 432 526 * @var: name of the var where we want to get the pointer 433 527 * (like mci->dimms) 434 - * @n_layers: Number of layers at the @layers array 528 + * @nlayers: Number of layers at the @layers array 435 529 * @layer0: layer0 position 436 530 * @layer1: layer1 position. Unused if n_layers < 2 437 531 * @layer2: layer2 position. Unused if n_layers < 3 438 532 * 439 - * For 1 layer, this macro returns &var[layer0] 533 + * For 1 layer, this macro returns "var[layer0]"; 534 + * 440 535 * For 2 layers, this macro is similar to allocate a bi-dimensional array 441 - * and to return "&var[layer0][layer1]" 536 + * and to return "var[layer0][layer1]"; 537 + * 442 538 * For 3 layers, this macro is similar to allocate a tri-dimensional array 443 - * and to return "&var[layer0][layer1][layer2]" 539 + * and to return "var[layer0][layer1][layer2]"; 444 540 */ 445 541 #define EDAC_DIMM_PTR(layers, var, nlayers, layer0, layer1, layer2) ({ \ 446 542 typeof(*var) __p; \ ··· 528 620 }; 529 621 530 622 /** 531 - * edac_raw_error_desc - Raw error report structure 623 + * struct edac_raw_error_desc - Raw error report structure 532 624 * @grain: minimum granularity for an error report, in bytes 533 625 * @error_count: number of errors of the same type 534 626 * @top_layer: top layer of the error (layer[0])