Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

EDAC: Update memory repair control interface for memory sparing feature

Update memory repair control interface for memory sparing feature.

CXL memory devices can support soft and hard memory sparing at cacheline,
row, bank and rank granularities. Memory sparing is defined as a repair
function that replaces a portion of memory with a portion of functional
memory at that same granularity.

When a CXL device detects an error in memory, it reports to the host
that a repair maintenance operation is needed by sending an event
record with the "maintenance needed" flag set.

The event records contain the device physical address (DPA) and other
attributes of the memory to repair such as bank group, bank, rank, row,
column, channel etc.

The kernel will report the corresponding CXL general media or DRAM trace
event to userspace, and userspace tools (e.g. rasdaemon) will initiate
a repair operation in response to the device request via the sysfs
repair control.

[ bp: Massage. ]

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com

Authored by Shiju Jose and committed by Borislav Petkov (AMD)
81e42fc1 699ea521

+169
+57
Documentation/ABI/testing/sysfs-edac-memory-repair
···
 
 		- ppr - Post package repair.
 
+		- cacheline-sparing
+
+		- row-sparing
+
+		- bank-sparing
+
+		- rank-sparing
+
 		- All other values are reserved.
 
 What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
···
 		to use for a repair operation from the memory device via
 		related error records and trace events, for eg. CXL DRAM
 		and CXL general media error records in CXL memory devices.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
+Date:		March 2025
+KernelVersion:	6.15
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The control attributes for the memory to be repaired.
+		The specific value of attributes to use depends on the
+		portion of memory to repair and will be reported to the host
+		in related error records and be available to userspace
+		in trace events, such as CXL DRAM and CXL general media
+		error records of CXL memory devices.
+
+		When readng back these attributes, it returns the current
+		value of memory requested to be repaired.
+
+		bank_group - The bank group of the memory to repair.
+
+		bank - The bank number of the memory to repair.
+
+		rank - The rank of the memory to repair. Rank is defined as a
+		set of memory devices on a channel that together execute a
+		transaction.
+
+		row - The row number of the memory to repair.
+
+		column - The column number of the memory to repair.
+
+		channel - The channel of the memory to repair. Channel is
+		defined as an interface that can be independently accessed
+		for a transaction.
+
+		sub_channel - The subchannel of the memory to repair.
+
+		The requirement to set these attributes varies based on the
+		repair function. The attributes in sysfs are not present
+		unless required for a repair function.
+
+		For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103
+		soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations,
+		these attributes are not required to set. CXL spec ver 3.1,
+		Section 8.2.9.7.1.4 Table 8-105 memory sparing, these attributes
+		are required to set based on memory sparing granularity.
 
 What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
 Date:		March 2025
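
The ABI text above says userspace sets these attributes before requesting a repair. A minimal userspace sketch of that step follows, assuming the attributes accept a decimal integer followed by a newline. The helper names `write_attr`, `set_repair_attrs` and `struct repair_attr` are hypothetical and are not part of the patch. Which attributes a given sparing granularity needs depends on the device and is not encoded here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write one value to <dir>/<attr>.
 * Returns 0 on success, -1 if the attribute is absent or the write fails. */
static int write_attr(const char *dir, const char *attr, const char *value)
{
	char path[512];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* e.g. attribute not exposed for this repair function */

	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)	/* buffered write errors surface at close */
		ok = 0;
	return ok ? 0 : -1;
}

/* Hypothetical name/value pair for one control attribute. */
struct repair_attr {
	const char *name;	/* e.g. "bank_group", "bank", "rank", "row" */
	unsigned int value;
};

/* Write each attribute in order; stop at the first failure. */
static int set_repair_attrs(const char *dir,
			    const struct repair_attr *attrs, size_t count)
{
	char buf[32];
	size_t i;

	for (i = 0; i < count; i++) {
		snprintf(buf, sizeof(buf), "%u\n", attrs[i].value);
		if (write_attr(dir, attrs[i].name, buf) != 0)
			return -1;
	}
	return 0;
}
```

On a real system `dir` would be a path of the form `/sys/bus/edac/devices/<dev-name>/mem_repairX`. The repair itself is then issued through the `repair` attribute documented in the same file.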
+84
drivers/edac/mem_repair.c
···
 	MR_MIN_DPA,
 	MR_MAX_DPA,
 	MR_NIBBLE_MASK,
+	MR_BANK_GROUP,
+	MR_BANK,
+	MR_RANK,
+	MR_ROW,
+	MR_COLUMN,
+	MR_CHANNEL,
+	MR_SUB_CHANNEL,
 	MEM_DO_REPAIR,
 	MR_MAX_ATTRS
 };
···
 MR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
 MR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
 MR_ATTR_SHOW(nibble_mask, get_nibble_mask, u32, "0x%x\n")
+MR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
+MR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
+MR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
+MR_ATTR_SHOW(row, get_row, u32, "0x%x\n")
+MR_ATTR_SHOW(column, get_column, u32, "%u\n")
+MR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
+MR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
 
 #define MR_ATTR_STORE(attrib, cb, type, conv_func)			\
 static ssize_t attrib##_store(struct device *ras_feat_dev,		\
···
 MR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
 MR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
 MR_ATTR_STORE(nibble_mask, set_nibble_mask, unsigned long, kstrtoul)
+MR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
+MR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
+MR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
+MR_ATTR_STORE(row, set_row, unsigned long, kstrtoul)
+MR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
+MR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
+MR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
 
 #define MR_DO_OP(attrib, cb)						\
 static ssize_t attrib##_store(struct device *ras_feat_dev,		\
···
 				return 0444;
 		}
 		break;
+	case MR_BANK_GROUP:
+		if (ops->get_bank_group) {
+			if (ops->set_bank_group)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MR_BANK:
+		if (ops->get_bank) {
+			if (ops->set_bank)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MR_RANK:
+		if (ops->get_rank) {
+			if (ops->set_rank)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MR_ROW:
+		if (ops->get_row) {
+			if (ops->set_row)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MR_COLUMN:
+		if (ops->get_column) {
+			if (ops->set_column)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MR_CHANNEL:
+		if (ops->get_channel) {
+			if (ops->set_channel)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MR_SUB_CHANNEL:
+		if (ops->get_sub_channel) {
+			if (ops->set_sub_channel)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
 	case MEM_DO_REPAIR:
 		if (ops->do_repair)
 			return a->mode;
···
 	[MR_MIN_DPA] = MR_ATTR_RO(min_dpa, instance),
 	[MR_MAX_DPA] = MR_ATTR_RO(max_dpa, instance),
 	[MR_NIBBLE_MASK] = MR_ATTR_RW(nibble_mask, instance),
+	[MR_BANK_GROUP] = MR_ATTR_RW(bank_group, instance),
+	[MR_BANK] = MR_ATTR_RW(bank, instance),
+	[MR_RANK] = MR_ATTR_RW(rank, instance),
+	[MR_ROW] = MR_ATTR_RW(row, instance),
+	[MR_COLUMN] = MR_ATTR_RW(column, instance),
+	[MR_CHANNEL] = MR_ATTR_RW(channel, instance),
+	[MR_SUB_CHANNEL] = MR_ATTR_RW(sub_channel, instance),
 	[MEM_DO_REPAIR] = MR_ATTR_WO(repair, instance)
 };
 
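
Each of the seven new visibility cases applies the same rule: an attribute with a getter and a setter keeps its declared mode, and one with only a getter is exposed read-only. The rule can be modelled in isolation as below. This is a standalone sketch, not code from the patch; `attr_mode` is a hypothetical name, and returning 0 (hidden) when there is no getter is an assumption about what happens after each `break`, which the hunk does not show.

```c
#include <stdbool.h>

/* Model of the per-attribute visibility rule used in the switch cases above.
 * rw_mode stands in for a->mode (the mode the attribute was declared with). */
static unsigned int attr_mode(bool has_get, bool has_set, unsigned int rw_mode)
{
	if (!has_get)
		return 0;	/* assumption: no getter means the attribute is hidden */

	return has_set ? rw_mode : 0444;	/* getter only: read-only */
}
```

This is why the ABI text can state that attributes are absent from sysfs unless the repair function needs them: a driver that leaves the callbacks unset never exposes the file.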
+28
include/linux/edac.h
···
  * @get_max_dpa: get the maximum supported device physical address (DPA).
  * @get_nibble_mask: get current nibble mask of memory to repair.
  * @set_nibble_mask: set nibble mask of memory to repair.
+ * @get_bank_group: get current bank group of memory to repair.
+ * @set_bank_group: set bank group of memory to repair.
+ * @get_bank: get current bank of memory to repair.
+ * @set_bank: set bank of memory to repair.
+ * @get_rank: get current rank of memory to repair.
+ * @set_rank: set rank of memory to repair.
+ * @get_row: get current row of memory to repair.
+ * @set_row: set row of memory to repair.
+ * @get_column: get current column of memory to repair.
+ * @set_column: set column of memory to repair.
+ * @get_channel: get current channel of memory to repair.
+ * @set_channel: set channel of memory to repair.
+ * @get_sub_channel: get current subchannel of memory to repair.
+ * @set_sub_channel: set subchannel of memory to repair.
  * @do_repair: Issue memory repair operation for the HPA/DPA and
  *	       other control attributes set for the memory to repair.
  *
···
 	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
 	int (*get_nibble_mask)(struct device *dev, void *drv_data, u32 *val);
 	int (*set_nibble_mask)(struct device *dev, void *drv_data, u32 val);
+	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
+	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
+	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
+	int (*get_row)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_row)(struct device *dev, void *drv_data, u32 val);
+	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_column)(struct device *dev, void *drv_data, u32 val);
+	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
+	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
 	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
 };
 
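
To illustrate how a driver might fill in one pair of the new callbacks, here is a userspace-compilable sketch. Only the callback signatures are taken from the header change above. `struct sparing_ops_sketch`, `struct demo_ctx`, `demo_get_row`, `demo_set_row` and `demo_ops` are hypothetical stand-ins, and `struct device` is left opaque because the sketch never dereferences it.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t u32;	/* stand-in for the kernel's u32 */
struct device;		/* opaque here; never dereferenced in this sketch */

/* Cut-down stand-in for the ops structure: only the row callbacks. */
struct sparing_ops_sketch {
	int (*get_row)(struct device *dev, void *drv_data, u32 *val);
	int (*set_row)(struct device *dev, void *drv_data, u32 val);
};

/* Hypothetical driver context holding the row requested for repair. */
struct demo_ctx {
	u32 row;
};

static int demo_get_row(struct device *dev, void *drv_data, u32 *val)
{
	struct demo_ctx *ctx = drv_data;

	(void)dev;
	*val = ctx->row;	/* report the currently requested row */
	return 0;
}

static int demo_set_row(struct device *dev, void *drv_data, u32 val)
{
	struct demo_ctx *ctx = drv_data;

	(void)dev;
	ctx->row = val;		/* remember the row until a repair is issued */
	return 0;
}

/* Leaving a callback NULL would hide (or make read-only) its attribute. */
static const struct sparing_ops_sketch demo_ops = {
	.get_row = demo_get_row,
	.set_row = demo_set_row,
};
```

A real driver would typically validate the value against the device's geometry in the setter and consume the stored values in `do_repair`; both are omitted here.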