Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
at v2.6.21-rc2
EDAC - Error Detection And Correction

Written by Doug Thompson <norsk5@xmission.com>
7 Dec 2005


EDAC was written by:
        Thayne Harbaugh,
        modified by Dave Peterson, Doug Thompson, et al,
        from the bluesmoke.sourceforge.net project.


============================================================================
EDAC PURPOSE

The 'edac' kernel module goal is to detect and report errors that occur
within the computer system. In the initial release, memory Correctable Errors
(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.

Detecting CE events, then harvesting those events and reporting them,
CAN be a predictor of future UE events.  With CE events, the system can
continue to operate, but with less safety. Preventive maintenance and
proactive part replacement of memory DIMMs exhibiting CEs can reduce
the likelihood of the dreaded UE events and system 'panics'.


In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
in order to determine if errors are occurring on data transfers.
The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do NOT follow the PCI specification
with regards to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity.  Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

[There are patches in the kernel queue which will allow for storage of
quirks of PCI devices reporting false parity positives. The 2.6.18
kernel should have those patches included. When that becomes available,
then EDAC will be patched to utilize that information to "skip" such
devices.]

EDAC will have future error detectors that will be integrated with
EDAC or added to it, in the following list:

        MCE     Machine Check Exception
        MCA     Machine Check Architecture
        NMI     NMI notification of ECC errors
        MSRs    Machine Specific Register error cases
        and other mechanisms.

These errors are usually bus errors, ECC errors, thermal throttling
and the like.


============================================================================
EDAC VERSIONING

EDAC is composed of a "core" module (edac_mc.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE
is loaded and one MC driver will be loaded. Both the CORE and
the MC driver have individual versions that reflect current release
level of their respective modules.  Thus, to "report" on what version
a system is running, one must report both the CORE's and the
MC driver's versions.


LOADING

If 'edac' was statically linked with the kernel then no loading is
necessary.  If 'edac' was built as modules then simply modprobe the
'edac' pieces that you need.  You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary core
modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
core module.


============================================================================
EDAC sysfs INTERFACE

EDAC presents a 'sysfs' interface for control, reporting and attribute
reporting purposes.

EDAC lives in the /sys/devices/system/edac directory. Within this directory
there currently reside 2 'edac' components:

        mc      memory controller(s) system
        pci     PCI control and status system


============================================================================
Memory Controller (mc) Model

First a background on the memory controller's model abstracted in EDAC.
Each 'mc' device controls a set of DIMM memory modules. These modules are
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
be multiple csrows and multiple channels.

Memory controllers allow for several csrows, with 8 csrows being a typical
value. Yet, the actual number of csrows depends on the electrical "loading"
of a given motherboard, memory controller and DIMM characteristics.

Dual channels allow for 128-bit data transfers to the CPU from memory.
Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
(FB-DIMMs). The following example will assume 2 channels:


                Channel 0       Channel 1
        ===================================
        csrow0  | DIMM_A0       | DIMM_B0 |
        csrow1  | DIMM_A0       | DIMM_B0 |
        ===================================

        ===================================
        csrow2  | DIMM_A1       | DIMM_B1 |
        csrow3  | DIMM_A1       | DIMM_B1 |
        ===================================

In the above example table there are 4 physical slots on the motherboard
for memory DIMMs:

        DIMM_A0
        DIMM_B0
        DIMM_A1
        DIMM_B1

Labels for these slots are usually silk screened on the motherboard. Slots
labeled 'A' are channel 0 in this example. Slots labeled 'B'
are channel 1. Notice that there are two csrows possible on a
physical DIMM. These csrows are allocated their csrow assignment
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
is placed in each Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above,
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.
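
The slot-to-csrow relationship in the example table can be sketched in a few
lines of Python. This is only an illustration of the 2-channel example above,
not code from EDAC itself: the SLOT_LAYOUT table and the populated_csrows()
helper are hypothetical names, and a real board may lay out its slots
differently.

```python
# Hypothetical model of the 2-channel example table above.
# slot name -> (channel, csrows the slot can populate, in rank order)
SLOT_LAYOUT = {
    "DIMM_A0": (0, (0, 1)),
    "DIMM_B0": (1, (0, 1)),
    "DIMM_A1": (0, (2, 3)),
    "DIMM_B1": (1, (2, 3)),
}


def populated_csrows(ranks_by_slot):
    """Map each populated csrow to the sorted list of channels backing it.

    ranks_by_slot: dict of slot name -> number of ranks (1 or 2) of the
    DIMM installed in that slot. Empty slots are simply left out.
    """
    populated = {}
    for slot, ranks in ranks_by_slot.items():
        if ranks not in (1, 2):
            raise ValueError("DIMMs are single (1) or dual (2) ranked")
        channel, csrows = SLOT_LAYOUT[slot]
        # A single ranked DIMM fills only the first csrow of its slot,
        # a dual ranked DIMM fills both.
        for csrow in csrows[:ranks]:
            populated.setdefault(csrow, []).append(channel)
    return {csrow: sorted(chans) for csrow, chans in sorted(populated.items())}
```

For instance, populated_csrows({"DIMM_A0": 1, "DIMM_B0": 1}) yields only
csrow0, spanning channels 0 and 1, while passing 2 for both slots yields
csrow0 and csrow1.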

The representation of the above is reflected in the directory tree
in EDAC's sysfs interface. Starting in directory
/sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.


        ..../edac/mc/
                   |
                   |->mc0
                   |->mc1
                   |->mc2
                   ....

Under each 'mcX' directory each 'csrowX' is again represented by a
'csrowX', where 'X' is the csrow index:


        .../mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
                ....

Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
Channels, in order to have dual-channel mode be operational. Since
both csrow2 and csrow3 are populated, this indicates a dual ranked
set of DIMMs for channels 0 and 1.


Within each of the 'mc', 'mcX' and 'csrowX' directories are several
EDAC control and attribute files.


============================================================================
DIRECTORY 'mc'

In directory 'mc' are EDAC system overall control and attribute files:


Panic on UE control file:

        'panic_on_ue'

                An uncorrectable error will cause a machine panic.  This is usually
                desirable.  It is a bad idea to continue when an uncorrectable error
                occurs - it is indeterminate what was uncorrected and the operating
                system context might be so mangled that continuing will lead to further
                corruption. If the kernel has MCE configured, then EDAC will never
                notice the UE.

                LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]

                RUN TIME:  echo "1" >/sys/devices/system/edac/mc/panic_on_ue


Log UE control file:

        'log_ue'

                Generate kernel messages describing uncorrectable errors.  These errors
                are reported through the system message log system.  UE statistics
                will be accumulated even when UE logging is disabled.

                LOAD TIME: module/kernel parameter: log_ue=[0|1]

                RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue


Log CE control file:

        'log_ce'

                Generate kernel messages describing correctable errors.  These
                errors are reported through the system message log system.
                CE statistics will be accumulated even when CE logging is disabled.

                LOAD TIME: module/kernel parameter: log_ce=[0|1]

                RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce


Polling period control file:

        'poll_msec'

                The time period, in milliseconds, for polling for error information.
                Too small a value wastes resources.  Too large a value might delay
                necessary handling of errors and might lose valuable information for
                locating the error.  1000 milliseconds (once each second) is the current
                default.  Systems which require all the bandwidth they can get may
                increase this.

                LOAD TIME: module/kernel parameter: poll_msec=[period in milliseconds]

                RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec


============================================================================
'mcX' DIRECTORIES


In 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controllers:


Counter reset control file:

        'reset_counters'

                This write-only control file will zero all the statistical counters
                for UE and CE errors.  Zeroing the counters will also reset the timer
                indicating how long since the last counter zero.  This is useful
                for computing errors/time.  Since the counters are always reset at
                driver initialization time, no module/kernel parameter is available.

                RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters

                This resets the counters on memory controller 0.


Seconds since last counter reset control file:

        'seconds_since_reset'

                This attribute file displays how many seconds have elapsed since the
                last counter reset. This can be used with the error counters to
                measure error rates.



Memory Controller name attribute file:

        'mc_name'

                This attribute file displays the type of memory controller
                that is being utilized.


Total memory managed by this memory controller attribute file:

        'size_mb'

                This attribute file displays the amount of memory, in megabytes,
                that this instance of memory controller manages.


Total Uncorrectable Errors count attribute file:

        'ue_count'

                This attribute file displays the total count of uncorrectable
                errors that have occurred on this memory controller. If panic_on_ue
                is set this counter will not have a chance to increment,
                since EDAC will panic the system.


Total UE count that had no information attribute file:

        'ue_noinfo_count'

                This attribute file displays the number of UEs that
                have occurred with no information as to which DIMM
                slot is having errors.


Total Correctable Errors count attribute file:

        'ce_count'

                This attribute file displays the total count of correctable
                errors that have occurred on this memory controller. This
                count is very important to examine. CEs provide early
                indications that a DIMM is beginning to fail. This count
                field should be monitored for non-zero values, and such
                information reported to the system administrator.
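
As a rough sketch of how 'ce_count' and 'seconds_since_reset' can be combined
into an error rate, the Python below reads both attribute files from one
'mcX' directory. The helper names read_int() and ce_rate_per_hour() are made
up for this illustration, and the code assumes each file holds a single
decimal integer.

```python
from pathlib import Path


def read_int(path):
    """Read a sysfs-style attribute file holding one decimal integer."""
    return int(Path(path).read_text().strip())


def ce_rate_per_hour(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return correctable errors per hour since the last counter reset."""
    mc_dir = Path(mc_dir)
    ce_count = read_int(mc_dir / "ce_count")
    seconds = read_int(mc_dir / "seconds_since_reset")
    if seconds <= 0:
        # Counters were reset within the last second; no rate yet.
        return 0.0
    return ce_count * 3600.0 / seconds
```

A monitoring script could call ce_rate_per_hour() periodically and notify the
system administrator whenever the result is non-zero.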


Total CE count that had no information attribute file:

        'ce_noinfo_count'

                This attribute file displays the number of CEs that
                have occurred with no information as to which DIMM slot
                is having errors. Memory is handicapped, but operational,
                yet no information is available to indicate which slot
                the failing memory is in. This count field should also
                be monitored for non-zero values.

Device Symlink:

        'device'

                Symlink to the memory controller device.

Sdram memory scrubbing rate:

        'sdram_scrub_rate'

                Read/Write attribute file that controls memory scrubbing. The scrubbing
                rate is set by writing a minimum bandwidth in bytes/sec to the attribute
                file. The rate will be translated to an internal value that gives at
                least the specified rate.

                Reading the file will return the actual scrubbing rate employed.

                If configuration fails or memory scrubbing is not implemented, the value
                of the attribute file will be -1.



============================================================================
'csrowX' DIRECTORIES

In the 'csrowX' directories are EDAC control and attribute files for
this 'X' instance of csrow:


Total Uncorrectable Errors count attribute file:

        'ue_count'

                This attribute file displays the total count of uncorrectable
                errors that have occurred on this csrow. If panic_on_ue is set
                this counter will not have a chance to increment, since EDAC
                will panic the system.


Total Correctable Errors count attribute file:

        'ce_count'

                This attribute file displays the total count of correctable
                errors that have occurred on this csrow. This
                count is very important to examine. CEs provide early
                indications that a DIMM is beginning to fail. This count
                field should be monitored for non-zero values, and such
                information reported to the system administrator.


Total memory managed by this csrow attribute file:

        'size_mb'

                This attribute file displays the amount of memory, in megabytes,
                that this csrow contains.


Memory Type attribute file:

        'mem_type'

                This attribute file will display what type of memory is currently
                on this csrow. Normally, either buffered or unbuffered memory.
                Examples:
                        Registered-DDR
                        Unbuffered-DDR


EDAC Mode of operation attribute file:

        'edac_mode'

                This attribute file will display what type of Error detection
                and correction is being utilized.


Device type attribute file:

        'dev_type'

                This attribute file will display what type of DRAM device is
                being utilized on this DIMM.
                Examples:
                        x1
                        x2
                        x4
                        x8


Channel 0 CE Count attribute file:

        'ch0_ce_count'

                This attribute file will display the count of CEs on this
                DIMM located in channel 0.


Channel 0 UE Count attribute file:

        'ch0_ue_count'

                This attribute file will display the count of UEs on this
                DIMM located in channel 0.


Channel 0 DIMM Label control file:

        'ch0_dimm_label'

                This control file allows this DIMM to have a label assigned
                to it. With this label in the module, when errors occur
                the output can provide the DIMM label in the system log.
                This becomes vital for panic events to isolate the
                cause of the UE event.

                DIMM Labels must be assigned after booting, with information
                that correctly identifies the physical slot with its
                silk screen label. This information is currently very
                motherboard specific and determination of this information
                must occur in userland at this time.


Channel 1 CE Count attribute file:

        'ch1_ce_count'

                This attribute file will display the count of CEs on this
                DIMM located in channel 1.
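
The per-channel CE counters can be scanned from userland to find which DIMMs
are reporting errors. The Python below is a minimal sketch under the
assumption that each 'csrowX' directory holds 'chN_ce_count' files and, where
present, matching 'chN_dimm_label' files; the function name failing_dimms()
is invented for this example.

```python
import re
from pathlib import Path

CE_FILE_RE = re.compile(r"ch(\d+)_ce_count")


def failing_dimms(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return sorted (csrow, channel, label, ce_count) for non-zero counts."""
    hits = []
    for ce_file in Path(mc_dir).glob("csrow*/ch*_ce_count"):
        match = CE_FILE_RE.fullmatch(ce_file.name)
        index = ce_file.parent.name[len("csrow"):]
        if match is None or not index.isdigit():
            continue  # not a per-channel counter of a csrowX directory
        count = int(ce_file.read_text().strip())
        if count == 0:
            continue
        channel = int(match.group(1))
        # The label is empty unless userland assigned one after booting.
        label_file = ce_file.with_name("ch%d_dimm_label" % channel)
        label = label_file.read_text().strip() if label_file.exists() else ""
        hits.append((int(index), channel, label, count))
    return sorted(hits)
```

Each returned tuple names the csrow and channel of a DIMM with correctable
errors, together with its label if one was set.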


Channel 1 UE Count attribute file:

        'ch1_ue_count'

                This attribute file will display the count of UEs on this
                DIMM located in channel 1.


Channel 1 DIMM Label control file:

        'ch1_dimm_label'

                This control file allows this DIMM to have a label assigned
                to it. With this label in the module, when errors occur
                the output can provide the DIMM label in the system log.
                This becomes vital for panic events to isolate the
                cause of the UE event.

                DIMM Labels must be assigned after booting, with information
                that correctly identifies the physical slot with its
                silk screen label. This information is currently very
                motherboard specific and determination of this information
                must occur in userland at this time.


============================================================================
SYSTEM LOGGING

If logging for UEs and CEs is enabled then system logs will have
error notices indicating errors that have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first example, is:
        the memory controller                   (MC0)
        Error type                              (CE)
        memory page                             (0x283)
        offset in the page                      (0xce0)
        the byte granularity                    (grain 8)
                or resolution of the error
        the error syndrome                      (0x6ec3)
        memory row                              (row 0)
        memory channel                          (channel 1)
        DIMM label, if set prior                (DIMM_B1)
        and then an optional, driver-specific message that may
                have additional information.

Both UEs and CEs with no info will lack all but memory controller,
error type, a notice of "no info" and then an optional,
driver-specific error message.
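
The CE message structure described above lends itself to parsing with a
regular expression. The Python sketch below handles only CE lines of exactly
the form shown in the examples; the pattern and the parse_ce_line() name are
assumptions of this sketch, and UE or "no info" messages, whose exact layout
is not shown here, are deliberately not matched.

```python
import re

# Pattern for the CE example lines above (an assumption of this sketch).
CE_LINE_RE = re.compile(
    r'EDAC MC(?P<mc>\d+): CE '
    r'page (?P<page>0x[0-9a-fA-F]+), '
    r'offset (?P<offset>0x[0-9a-fA-F]+), '
    r'grain (?P<grain>\d+), '
    r'syndrome (?P<syndrome>0x[0-9a-fA-F]+), '
    r'row (?P<row>\d+),\s+'
    r'channel (?P<channel>\d+) '
    r'"(?P<label>[^"]*)": (?P<driver_msg>.*)'
)


def parse_ce_line(line):
    """Return the CE message fields as a dict, or None if no CE is found."""
    match = CE_LINE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Decimal fields.
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])
    # Hexadecimal fields (int() accepts the 0x prefix with base 16).
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)
    return fields
```

Because the pattern is applied with search(), a syslog timestamp or host
prefix in front of "EDAC" does not prevent a match.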



============================================================================
PCI Bus Parity Detection


On Header Type 00 devices the primary status is looked at
for any parity error regardless of whether Parity is enabled on the
device.  (The spec indicates parity is generated in some cases).
On Header Type 01 bridges, the secondary status register is also
looked at to see if a parity error occurred on the bus on the other
side of the bridge.


SYSFS CONFIGURATION

Under /sys/devices/system/edac/pci are control and attribute files as follows:


Enable/Disable PCI Parity checking control file:

        'check_pci_parity'


                This control file enables or disables the PCI Bus Parity scanning
                operation. Writing a 1 to this file enables the scanning. Writing
                a 0 to this file disables the scanning.

                Enable:
                echo "1" >/sys/devices/system/edac/pci/check_pci_parity

                Disable:
                echo "0" >/sys/devices/system/edac/pci/check_pci_parity



Panic on PCI PARITY Error:

        'panic_on_pci_parity'


                This control file enables or disables panicking when a parity
                error has been detected.


                module/kernel parameter: panic_on_pci_parity=[0|1]

                Enable:
                echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity

                Disable:
                echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity


Parity Count:

        'pci_parity_count'

                This attribute file will display the number of parity errors that
                have been detected.



=======================================================================
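
The PCI parity control and attribute files can also be driven from a script
rather than with echo. The Python below is a sketch only: the function names
set_pci_parity_check() and pci_parity_count() are hypothetical, and the code
assumes the files accept and return plain decimal text as in the echo
examples above. Writing the control file normally requires root.

```python
from pathlib import Path

PCI_DIR = "/sys/devices/system/edac/pci"


def set_pci_parity_check(enabled, pci_dir=PCI_DIR):
    """Write 1 or 0 to 'check_pci_parity' to enable or disable scanning."""
    value = "1\n" if enabled else "0\n"
    Path(pci_dir, "check_pci_parity").write_text(value)


def pci_parity_count(pci_dir=PCI_DIR):
    """Return the number of PCI parity errors detected so far."""
    return int(Path(pci_dir, "pci_parity_count").read_text().strip())
```

The pci_dir parameter exists so the helpers can be pointed at a scratch
directory when trying them out without touching the live sysfs tree.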