Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

EDAC: Add an Error Check Scrub control feature

Add an Error Check Scrub (ECS) control to manage a memory device's ECS
feature.

ECS is a feature defined in the JEDEC DDR5 SDRAM Specification (JESD79-5). It
allows the DRAM to internally read data, correct single-bit errors, and write
the corrected data bits back to the DRAM array while providing transparency to
error counts.

A DDR5 device contains a number of memory media Field Replaceable Units
(FRUs). The DDR5 ECS feature, and thus the ECS control driver, supports
configuring the ECS parameters per FRU.

Memory devices that support the ECS feature register with the EDAC device
driver, which retrieves the ECS descriptor from the EDAC ECS driver. This
driver exposes sysfs ECS control attributes to userspace via

/sys/bus/edac/devices/<dev-name>/ecs_fruX/.

The common sysfs ECS control interface abstracts the control of arbitrary
ECS functionality into a common set of functions.

Support for the ECS feature is added separately because the control attributes
of the DDR5 ECS feature differ from those of the scrub feature.

The sysfs ECS attribute nodes are only present if the client driver has
implemented the corresponding attribute callback function and passed the
necessary operations to the EDAC RAS feature driver during registration.
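
The visibility rule described above can be illustrated with a minimal
userspace sketch. It is not the kernel code: struct ecs_ops_sketch, the demo
callbacks, and the *_node_mode() helpers are hypothetical names introduced
only for this illustration, and just two of the four attributes are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver ops table: only the presence or
 * absence of each callback matters in this sketch. */
struct ecs_ops_sketch {
	int (*get_threshold)(int fru_id, unsigned int *val);
	int (*set_threshold)(int fru_id, unsigned int val);
	int (*reset)(int fru_id, unsigned int val);
};

/* Demo callbacks (assumptions, not real driver code). */
static int demo_get_threshold(int fru_id, unsigned int *val)
{
	(void)fru_id;
	*val = 256;
	return 0;
}

static int demo_set_threshold(int fru_id, unsigned int val)
{
	(void)fru_id;
	(void)val;
	return 0;
}

/* Three example drivers: no callbacks, getter only, getter and setter. */
static const struct ecs_ops_sketch demo_no_ops = { 0 };
static const struct ecs_ops_sketch demo_ro_ops = {
	.get_threshold = demo_get_threshold,
};
static const struct ecs_ops_sketch demo_rw_ops = {
	.get_threshold = demo_get_threshold,
	.set_threshold = demo_set_threshold,
};

/* Mode of the threshold node: hidden (0) without a getter, read-only
 * (0444) with a getter only, the declared mode when both exist. */
static unsigned int threshold_node_mode(const struct ecs_ops_sketch *ops,
					unsigned int declared_mode)
{
	assert(ops != NULL);
	if (!ops->get_threshold)
		return 0;
	return ops->set_threshold ? declared_mode : 0444;
}

/* Mode of the write-only reset node: present only with a reset callback. */
static unsigned int reset_node_mode(const struct ecs_ops_sketch *ops,
				    unsigned int declared_mode)
{
	assert(ops != NULL);
	return ops->reset ? declared_mode : 0;
}
```

With these definitions, a driver that supplies only a getter would see its
threshold node downgraded to read-only, and a driver without a reset callback
would not get a reset node at all.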

[ bp: Massage, fixup edac_dev_register() retvals. ]

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-4-shiju.jose@huawei.com

Authored by Shiju Jose and committed by Borislav Petkov (AMD)
bcbd069b f90b7381

+356 -2
+74
Documentation/ABI/testing/sysfs-edac-ecs
···
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX
+Date:		March 2025
+KernelVersion:	6.15
+Contact:	linux-edac@vger.kernel.org
+Description:
+		The sysfs EDAC bus devices /<dev-name>/ecs_fruX subdirectory
+		pertains to the memory media ECS (Error Check Scrub) control
+		feature, where <dev-name> directory corresponds to a device
+		registered with the EDAC device driver for the ECS feature.
+		/ecs_fruX belongs to the media FRUs (Field Replaceable Unit)
+		under the memory device.
+
+		The sysfs ECS attr nodes are only present if the parent
+		driver has implemented the corresponding attr callback
+		function and provided the necessary operations to the EDAC
+		device driver during registration.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type
+Date:		March 2025
+KernelVersion:	6.15
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The log entry type of how the DDR5 ECS log is reported.
+
+		- 0 - per DRAM.
+
+		- 1 - per memory media FRU.
+
+		- All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/mode
+Date:		March 2025
+KernelVersion:	6.15
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The mode of how the DDR5 ECS counts the errors.
+		Error count is tracked based on two different modes
+		selected by DDR5 ECS Control Feature - Codeword mode and
+		Row Count mode. If the ECS is under Codeword mode, then
+		the error count increments each time a codeword with check
+		bit errors is detected. If the ECS is under Row Count mode,
+		then the error counter increments each time a row with
+		check bit errors is detected.
+
+		- 0 - ECS counts rows in the memory media that have ECC errors.
+
+		- 1 - ECS counts codewords with errors, specifically, it counts
+		  the number of ECC-detected errors in the memory media.
+
+		- All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/reset
+Date:		March 2025
+KernelVersion:	6.15
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(WO) ECS reset ECC counter.
+
+		- 1 - reset ECC counter to the default value.
+
+		- All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold
+Date:		March 2025
+KernelVersion:	6.15
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) DDR5 ECS threshold count per gigabits of memory cells.
+		The ECS error count is subject to the ECS Threshold count
+		per Gbit, which masks error counts less than the Threshold.
+
+		Supported values are 256, 1024 and 4096.
+
+		All other values are reserved.
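
The values documented in this ABI file can be checked from userspace before
anything is written to sysfs. The following is a minimal sketch under stated
assumptions: the helper names are hypothetical, the device name passed by the
caller is a placeholder, and whether a particular driver rejects reserved
values itself is not stated in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Threshold values the ABI text lists as supported. */
static bool ecs_threshold_is_supported(unsigned long val)
{
	return val == 256 || val == 1024 || val == 4096;
}

/* mode and log_entry_type each define only the values 0 and 1. */
static bool ecs_binary_value_is_defined(unsigned long val)
{
	return val <= 1;
}

/* Build "/sys/bus/edac/devices/<dev-name>/ecs_fruX/<attr>" into buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes. */
static int ecs_attr_path(char *buf, size_t len, const char *dev_name,
			 unsigned int fru, const char *attr)
{
	int n;

	assert(buf != NULL && dev_name != NULL && attr != NULL);
	n = snprintf(buf, len, "/sys/bus/edac/devices/%s/ecs_fru%u/%s",
		     dev_name, fru, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would typically validate the value first, build the path for the
chosen FRU, and only then open the node and write the decimal string.
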
+2
Documentation/edac/scrub.rst
···
 
 Sysfs files are documented in
 `Documentation/ABI/testing/sysfs-edac-scrub`
+
+`Documentation/ABI/testing/sysfs-edac-ecs`
+9
drivers/edac/Kconfig
···
 	  into a unified set of functions.
 	  Say 'y/n' to enable/disable EDAC scrub feature.
 
+config EDAC_ECS
+	bool "EDAC ECS (Error Check Scrub) feature"
+	help
+	  The EDAC ECS feature is optional and is designed to control on-die
+	  error check scrub (e.g., DDR5 ECS) in the system. The common sysfs
+	  ECS interface abstracts the control of various ECS functionalities
+	  into a unified set of functions.
+	  Say 'y/n' to enable/disable EDAC ECS feature.
+
 config EDAC_AMD64
 	tristate "AMD64 (Opteron, Athlon64)"
 	depends on AMD_NB && EDAC_DECODE_MCE
+1
drivers/edac/Makefile
···
 
 edac_core-$(CONFIG_EDAC_DEBUG)	+= debugfs.o
 edac_core-$(CONFIG_EDAC_SCRUB)	+= scrub.o
+edac_core-$(CONFIG_EDAC_ECS)	+= ecs.o
 
 ifdef CONFIG_PCI
 edac_core-y	+= edac_pci.o edac_pci_sysfs.o
+205
drivers/edac/ecs.c
···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The generic ECS driver is designed to support control of on-die error
+ * check scrub (e.g., DDR5 ECS). The common sysfs ECS interface abstracts
+ * the control of various ECS functionalities into a unified set of functions.
+ *
+ * Copyright (c) 2024-2025 HiSilicon Limited.
+ */
+
+#include <linux/edac.h>
+
+#define EDAC_ECS_FRU_NAME "ecs_fru"
+
+enum edac_ecs_attributes {
+	ECS_LOG_ENTRY_TYPE,
+	ECS_MODE,
+	ECS_RESET,
+	ECS_THRESHOLD,
+	ECS_MAX_ATTRS
+};
+
+struct edac_ecs_dev_attr {
+	struct device_attribute dev_attr;
+	int fru_id;
+};
+
+struct edac_ecs_fru_context {
+	char name[EDAC_FEAT_NAME_LEN];
+	struct edac_ecs_dev_attr dev_attr[ECS_MAX_ATTRS];
+	struct attribute *ecs_attrs[ECS_MAX_ATTRS + 1];
+	struct attribute_group group;
+};
+
+struct edac_ecs_context {
+	u16 num_media_frus;
+	struct edac_ecs_fru_context *fru_ctxs;
+};
+
+#define TO_ECS_DEV_ATTR(_dev_attr) \
+	container_of(_dev_attr, struct edac_ecs_dev_attr, dev_attr)
+
+#define EDAC_ECS_ATTR_SHOW(attrib, cb, type, format) \
+static ssize_t attrib##_show(struct device *ras_feat_dev, \
+			     struct device_attribute *attr, char *buf) \
+{ \
+	struct edac_ecs_dev_attr *dev_attr = TO_ECS_DEV_ATTR(attr); \
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \
+	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; \
+	type data; \
+	int ret; \
+ \
+	ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private, \
+		      dev_attr->fru_id, &data); \
+	if (ret) \
+		return ret; \
+ \
+	return sysfs_emit(buf, format, data); \
+}
+
+EDAC_ECS_ATTR_SHOW(log_entry_type, get_log_entry_type, u32, "%u\n")
+EDAC_ECS_ATTR_SHOW(mode, get_mode, u32, "%u\n")
+EDAC_ECS_ATTR_SHOW(threshold, get_threshold, u32, "%u\n")
+
+#define EDAC_ECS_ATTR_STORE(attrib, cb, type, conv_func) \
+static ssize_t attrib##_store(struct device *ras_feat_dev, \
+			      struct device_attribute *attr, \
+			      const char *buf, size_t len) \
+{ \
+	struct edac_ecs_dev_attr *dev_attr = TO_ECS_DEV_ATTR(attr); \
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \
+	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; \
+	type data; \
+	int ret; \
+ \
+	ret = conv_func(buf, 0, &data); \
+	if (ret < 0) \
+		return ret; \
+ \
+	ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private, \
+		      dev_attr->fru_id, data); \
+	if (ret) \
+		return ret; \
+ \
+	return len; \
+}
+
+EDAC_ECS_ATTR_STORE(log_entry_type, set_log_entry_type, unsigned long, kstrtoul)
+EDAC_ECS_ATTR_STORE(mode, set_mode, unsigned long, kstrtoul)
+EDAC_ECS_ATTR_STORE(reset, reset, unsigned long, kstrtoul)
+EDAC_ECS_ATTR_STORE(threshold, set_threshold, unsigned long, kstrtoul)
+
+static umode_t ecs_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
+{
+	struct device *ras_feat_dev = kobj_to_dev(kobj);
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
+	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;
+
+	switch (attr_id) {
+	case ECS_LOG_ENTRY_TYPE:
+		if (ops->get_log_entry_type) {
+			if (ops->set_log_entry_type)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case ECS_MODE:
+		if (ops->get_mode) {
+			if (ops->set_mode)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case ECS_RESET:
+		if (ops->reset)
+			return a->mode;
+		break;
+	case ECS_THRESHOLD:
+		if (ops->get_threshold) {
+			if (ops->set_threshold)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+#define EDAC_ECS_ATTR_RO(_name, _fru_id) \
+	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RO(_name), \
+				      .fru_id = _fru_id })
+
+#define EDAC_ECS_ATTR_WO(_name, _fru_id) \
+	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_WO(_name), \
+				      .fru_id = _fru_id })
+
+#define EDAC_ECS_ATTR_RW(_name, _fru_id) \
+	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RW(_name), \
+				      .fru_id = _fru_id })
+
+static int ecs_create_desc(struct device *ecs_dev, const struct attribute_group **attr_groups,
+			   u16 num_media_frus)
+{
+	struct edac_ecs_context *ecs_ctx;
+	u32 fru;
+
+	ecs_ctx = devm_kzalloc(ecs_dev, sizeof(*ecs_ctx), GFP_KERNEL);
+	if (!ecs_ctx)
+		return -ENOMEM;
+
+	ecs_ctx->num_media_frus = num_media_frus;
+	ecs_ctx->fru_ctxs = devm_kcalloc(ecs_dev, num_media_frus,
+					 sizeof(*ecs_ctx->fru_ctxs),
+					 GFP_KERNEL);
+	if (!ecs_ctx->fru_ctxs)
+		return -ENOMEM;
+
+	for (fru = 0; fru < num_media_frus; fru++) {
+		struct edac_ecs_fru_context *fru_ctx = &ecs_ctx->fru_ctxs[fru];
+		struct attribute_group *group = &fru_ctx->group;
+		int i;
+
+		fru_ctx->dev_attr[ECS_LOG_ENTRY_TYPE] = EDAC_ECS_ATTR_RW(log_entry_type, fru);
+		fru_ctx->dev_attr[ECS_MODE] = EDAC_ECS_ATTR_RW(mode, fru);
+		fru_ctx->dev_attr[ECS_RESET] = EDAC_ECS_ATTR_WO(reset, fru);
+		fru_ctx->dev_attr[ECS_THRESHOLD] = EDAC_ECS_ATTR_RW(threshold, fru);
+
+		for (i = 0; i < ECS_MAX_ATTRS; i++)
+			fru_ctx->ecs_attrs[i] = &fru_ctx->dev_attr[i].dev_attr.attr;
+
+		sprintf(fru_ctx->name, "%s%d", EDAC_ECS_FRU_NAME, fru);
+		group->name = fru_ctx->name;
+		group->attrs = fru_ctx->ecs_attrs;
+		group->is_visible = ecs_attr_visible;
+
+		attr_groups[fru] = group;
+	}
+
+	return 0;
+}
+
+/**
+ * edac_ecs_get_desc - get EDAC ECS descriptors
+ * @ecs_dev: client device, supports ECS feature
+ * @attr_groups: pointer to attribute group container
+ * @num_media_frus: number of media FRUs in the device
+ *
+ * Return:
+ * * %0 - Success.
+ * * %-EINVAL - Invalid parameters passed.
+ * * %-ENOMEM - Dynamic memory allocation failed.
+ */
+int edac_ecs_get_desc(struct device *ecs_dev,
+		      const struct attribute_group **attr_groups, u16 num_media_frus)
+{
+	if (!ecs_dev || !attr_groups || !num_media_frus)
+		return -EINVAL;
+
+	return ecs_create_desc(ecs_dev, attr_groups, num_media_frus);
+}
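
The show and store handlers in ecs.c recover the FRU index from the attribute
pointer they are handed, because each attribute is embedded in a wrapper that
also carries fru_id. A minimal userspace sketch of that pattern follows; the
struct and macro names are hypothetical simplifications, not the kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the generic attribute type. */
struct attr_sketch {
	const char *name;
};

/* Wrapper that pairs the generic attribute with its FRU index. */
struct ecs_attr_sketch {
	struct attr_sketch attr;
	int fru_id;
};

/* Userspace equivalent of the container_of() idea: step back from the
 * address of the embedded member to the start of the enclosing struct. */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Two "mode" attributes, one per FRU, as a per-FRU table would hold. */
static struct ecs_attr_sketch demo_mode_attrs[2] = {
	{ { "mode" }, 0 },
	{ { "mode" }, 1 },
};

/* What a handler does first: map the generic pointer back to the FRU. */
static int fru_id_from_attr(struct attr_sketch *attr)
{
	assert(attr != NULL);
	return SKETCH_CONTAINER_OF(attr, struct ecs_attr_sketch, attr)->fru_id;
}
```

This is why a single show or store function can serve every ecs_fruX
directory: the FRU index travels with the attribute rather than being encoded
in the function.
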
+19
drivers/edac/edac_device.c
···
 			attr_gcnt++;
 			scrub_cnt++;
 			break;
+		case RAS_FEAT_ECS:
+			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
+			break;
 		default:
 			return -EINVAL;
 		}
···
 
 			scrub_cnt++;
 			attr_gcnt++;
+			break;
+		case RAS_FEAT_ECS:
+			if (!ras_features->ecs_ops) {
+				ret = -EINVAL;
+				goto data_mem_free;
+			}
+
+			dev_data = &ctx->ecs;
+			dev_data->ecs_ops = ras_features->ecs_ops;
+			dev_data->private = ras_features->ctx;
+			ret = edac_ecs_get_desc(parent, &ras_attr_groups[attr_gcnt],
+						ras_features->ecs_info.num_media_frus);
+			if (ret)
+				goto data_mem_free;
+
+			attr_gcnt += ras_features->ecs_info.num_media_frus;
 			break;
 		default:
 			ret = -EINVAL;
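
The first hunk above changes how attribute groups are counted: a scrub
feature contributes one group, while an ECS feature contributes one group per
media FRU. A minimal userspace sketch of that counting pass, using
hypothetical type and function names and -1 in place of -EINVAL, could look
like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the feature type and feature descriptor. */
enum feat_sketch { FEAT_SCRUB, FEAT_ECS };

struct feature_sketch {
	enum feat_sketch type;
	unsigned short num_media_frus; /* only meaningful for FEAT_ECS */
};

/* Example feature lists used to exercise the counter. */
static const struct feature_sketch demo_feats[] = {
	{ FEAT_SCRUB, 0 },
	{ FEAT_ECS, 4 },
};
static const struct feature_sketch demo_bad_feats[] = {
	{ (enum feat_sketch)99, 0 },
};

/* Count attribute groups: one per scrub instance, one per FRU for ECS.
 * Returns -1 for an unknown feature type. */
static int count_attr_groups(const struct feature_sketch *feats, int n)
{
	int i, groups = 0;

	assert(feats != NULL || n == 0);
	for (i = 0; i < n; i++) {
		switch (feats[i].type) {
		case FEAT_SCRUB:
			groups++;
			break;
		case FEAT_ECS:
			groups += feats[i].num_media_frus;
			break;
		default:
			return -1;
		}
	}
	return groups;
}
```

Sizing the group array this way is what lets the second hunk hand
edac_ecs_get_desc() a slice starting at the current group index and then
advance that index by num_media_frus.
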
+46 -2
include/linux/edac.h
···
 /* RAS feature type */
 enum edac_dev_feat {
 	RAS_FEAT_SCRUB,
+	RAS_FEAT_ECS,
 	RAS_FEAT_MAX
 };
 
···
 { return -EOPNOTSUPP; }
 #endif /* CONFIG_EDAC_SCRUB */
 
+/**
+ * struct edac_ecs_ops - ECS device operations (all elements optional)
+ * @get_log_entry_type: read the log entry type value.
+ * @set_log_entry_type: set the log entry type value.
+ * @get_mode: read the mode value.
+ * @set_mode: set the mode value.
+ * @reset: reset the ECS counter.
+ * @get_threshold: read the threshold count per gigabits of memory cells.
+ * @set_threshold: set the threshold count per gigabits of memory cells.
+ */
+struct edac_ecs_ops {
+	int (*get_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 *val);
+	int (*set_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 val);
+	int (*get_mode)(struct device *dev, void *drv_data, int fru_id, u32 *val);
+	int (*set_mode)(struct device *dev, void *drv_data, int fru_id, u32 val);
+	int (*reset)(struct device *dev, void *drv_data, int fru_id, u32 val);
+	int (*get_threshold)(struct device *dev, void *drv_data, int fru_id, u32 *threshold);
+	int (*set_threshold)(struct device *dev, void *drv_data, int fru_id, u32 threshold);
+};
+
+struct edac_ecs_ex_info {
+	u16 num_media_frus;
+};
+
+#if IS_ENABLED(CONFIG_EDAC_ECS)
+int edac_ecs_get_desc(struct device *ecs_dev,
+		      const struct attribute_group **attr_groups,
+		      u16 num_media_frus);
+#else
+static inline int edac_ecs_get_desc(struct device *ecs_dev,
+				    const struct attribute_group **attr_groups,
+				    u16 num_media_frus)
+{ return -EOPNOTSUPP; }
+#endif /* CONFIG_EDAC_ECS */
+
 /* EDAC device feature information structure */
 struct edac_dev_data {
-	const struct edac_scrub_ops *scrub_ops;
+	union {
+		const struct edac_scrub_ops *scrub_ops;
+		const struct edac_ecs_ops *ecs_ops;
+	};
 	u8 instance;
 	void *private;
 };
···
 	struct device dev;
 	void *private;
 	struct edac_dev_data *scrub;
+	struct edac_dev_data ecs;
 };
 
 struct edac_dev_feature {
 	enum edac_dev_feat ft_type;
 	u8 instance;
-	const struct edac_scrub_ops *scrub_ops;
+	union {
+		const struct edac_scrub_ops *scrub_ops;
+		const struct edac_ecs_ops *ecs_ops;
+	};
 	void *ctx;
+	struct edac_ecs_ex_info ecs_info;
 };
 
 int edac_dev_register(struct device *parent, char *dev_name,