HWPOISON: Add soft page offline support

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


This is a simpler, gentler variant of memory_failure() for soft page
offlining controlled from user space. It doesn't kill anything; it just
tries to invalidate the page and, if that doesn't work, migrates it
away.

This is useful for predictive failure analysis, where a page has
a high rate of corrected errors, but hasn't gone bad yet. Instead
it can be offlined early and avoided.

The offlining is controlled from sysfs, including a new generic
entry point for hard page offlining for symmetry too.

We use the page isolation facility to prevent a re-allocation
race. Normally this is only used by memory hotplug. To avoid
races with memory hotplug I am using lock_system_sleep().
This avoids the situation where memory hotplug is about
to isolate a page range and then hwpoison undoes that work.
This is a big hammer, but currently the simplest solution.

When the page is neither free nor on the LRU we try to free pages
from slab and other caches. The slab freeing is currently
quite dumb and does not try to focus on the specific slab
cache which might own the page. This could potentially be
improved later.

Thanks to Fengguang Wu and Haicheng Li for some fixes.

[Added fix from Andrew Morton to adapt to new migrate_pages prototype]
Signed-off-by: Andi Kleen <ak@linux.intel.com>

Authored and committed by Andi Kleen 16 years ago (facb6011 2326c467)

+297 -7

5 changed files


Documentation/ABI/testing/sysfs-memory-page-offline
drivers/base/memory.c
include/linux/mm.h
mm/hwpoison-inject.c
mm/memory-failure.c

+44

Documentation/ABI/testing/sysfs-memory-page-offline

···
+What:		/sys/devices/system/memory/soft_offline_page
+Date:		Sep 2009
+KernelVersion:	2.6.33
+Contact:	andi@firstfloor.org
+Description:
+		Soft-offline the memory page containing the physical address
+		written into this file. Input is a hex number specifying the
+		physical address of the page. The kernel will then attempt
+		to soft-offline it, by moving the contents elsewhere or
+		dropping it if possible. The page will then be placed
+		on the bad page list and never be reused.
+
+		The offlining is done in kernel specific granularity.
+		Normally it's the base page size of the kernel, but
+		this might change.
+
+		The page must be still accessible, not poisoned. The
+		kernel will never kill anything for this, but rather
+		fail the offline. Return value is the size of the
+		number, or an error when the offlining failed. Reading
+		the file is not allowed.
+
+What:		/sys/devices/system/memory/hard_offline_page
+Date:		Sep 2009
+KernelVersion:	2.6.33
+Contact:	andi@firstfloor.org
+Description:
+		Hard-offline the memory page containing the physical
+		address written into this file. Input is a hex number
+		specifying the physical address of the page. The
+		kernel will then attempt to hard-offline the page, by
+		trying to drop the page or killing any owner or
+		triggering IO errors if needed. Note this may kill
+		any processes owning the page. The kernel will avoid
+		accessing this page, assuming it's poisoned by the
+		hardware.
+
+		The offlining is done in kernel specific granularity.
+		Normally it's the base page size of the kernel, but
+		this might change.
+
+		Return value is the size of the number, or an error when
+		the offlining failed.
+		Reading the file is not allowed.
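As an illustration of how user space might drive the interface described above, here is a minimal sketch in C. The helper name `request_page_offline` is hypothetical and not part of the patch; only the sysfs path comes from the ABI text. The address is written with a `0x` prefix on the assumption that the kernel-side parser auto-detects the number base.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Path taken from the ABI description; everything else is illustrative. */
#define SOFT_OFFLINE_PATH "/sys/devices/system/memory/soft_offline_page"

/*
 * Hypothetical helper (not part of the patch): write a physical address
 * to an offline control file. Returns 0 on success, negative errno on
 * failure.
 */
static int request_page_offline(const char *path, uint64_t phys_addr)
{
	char buf[32];
	ssize_t written;
	int len, fd, err;

	assert(path != NULL);

	/* "0x" prefix so a base-autodetecting parser reads this as hex. */
	len = snprintf(buf, sizeof(buf), "0x%llx",
		       (unsigned long long)phys_addr);
	if (len < 0 || (size_t)len >= sizeof(buf))
		return -EINVAL;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The kernel handler reports failure through the write() result. */
	written = write(fd, buf, (size_t)len);
	if (written == len)
		err = 0;
	else if (written < 0)
		err = -errno;
	else
		err = -EIO;

	close(fd);
	return err;
}
```

A caller with CAP_SYS_ADMIN on a kernel built with CONFIG_MEMORY_FAILURE would invoke `request_page_offline(SOFT_OFFLINE_PATH, addr)`; a negative return then carries the errno from the failed offline attempt.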

+61

drivers/base/memory.c

···
 }
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for offlining pages of memory
+ */
+
+/* Soft offline a page */
+static ssize_t
+store_soft_offline_page(struct class *class, const char *buf, size_t count)
+{
+	int ret;
+	u64 pfn;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (strict_strtoull(buf, 0, &pfn) < 0)
+		return -EINVAL;
+	pfn >>= PAGE_SHIFT;
+	if (!pfn_valid(pfn))
+		return -ENXIO;
+	ret = soft_offline_page(pfn_to_page(pfn), 0);
+	return ret == 0 ? count : ret;
+}
+
+/* Forcibly offline a page, including killing processes. */
+static ssize_t
+store_hard_offline_page(struct class *class, const char *buf, size_t count)
+{
+	int ret;
+	u64 pfn;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (strict_strtoull(buf, 0, &pfn) < 0)
+		return -EINVAL;
+	pfn >>= PAGE_SHIFT;
+	ret = __memory_failure(pfn, 0, 0);
+	return ret ? ret : count;
+}
+
+static CLASS_ATTR(soft_offline_page, 0644, NULL, store_soft_offline_page);
+static CLASS_ATTR(hard_offline_page, 0644, NULL, store_hard_offline_page);
+
+static __init int memory_fail_init(void)
+{
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_soft_offline_page.attr);
+	if (!err)
+		err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_hard_offline_page.attr);
+	return err;
+}
+#else
+static inline int memory_fail_init(void)
+{
+	return 0;
+}
+#endif
+
 /*
  * Note that phys_device is optional. It is here to allow for
  * differentiation between which *physical* devices each
···
 	}
 
 	err = memory_probe_init();
+	if (!ret)
+		ret = err;
+	err = memory_fail_init();
 	if (!ret)
 		ret = err;
 	err = block_size_init();
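Both store handlers parse the written string with base 0 and then shift the address right by PAGE_SHIFT to get a page frame number. The user-space sketch below approximates that step with the standard `strtoull()`. The function name `parse_phys_addr_to_pfn` and the 4 KiB page size are assumptions for illustration, and `strtoull()` is more lenient than the kernel parser (it accepts leading whitespace and a sign, for instance).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 4 KiB base pages. The kernel uses its own PAGE_SHIFT. */
#define EXAMPLE_PAGE_SHIFT 12

/*
 * Hypothetical user-space model of the parsing step in the store
 * handlers. Base 0 auto-detects the radix: "0x..." is hex, a leading
 * "0" is octal, anything else is decimal.
 * Returns 0 and fills *pfn on success, -EINVAL on malformed input.
 */
static int parse_phys_addr_to_pfn(const char *buf, uint64_t *pfn)
{
	unsigned long long addr;
	char *end;

	errno = 0;
	addr = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;

	/* Tolerate one trailing newline, as "echo" would produce. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* Physical address to page frame number. */
	*pfn = (uint64_t)addr >> EXAMPLE_PAGE_SHIFT;
	return 0;
}
```

One consequence of base auto-detection in this model is that a hex address written without the `0x` prefix is read as decimal (or rejected if it contains letters), so callers may want to always include the prefix.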

+2 -1

include/linux/mm.h

···
 extern int unpoison_memory(unsigned long pfn);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
-extern void shake_page(struct page *p);
+extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
+extern int soft_offline_page(struct page *page, int flags);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

+1 -1

mm/hwpoison-inject.c

···
 		return 0;
 
 	if (!PageLRU(p))
-		shake_page(p);
+		shake_page(p, 0);
 	/*
 	 * This implies unable to support non-LRU pages.
 	 */

+189 -5

mm/memory-failure.c

···
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/backing-dev.h>
+#include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/suspend.h>
 #include "internal.h"
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
···
  * When a unknown page type is encountered drain as many buffers as possible
  * in the hope to turn the page into a LRU or free page, which we can handle.
  */
-void shake_page(struct page *p)
+void shake_page(struct page *p, int access)
 {
 	if (!PageSlab(p)) {
 		lru_add_drain_all();
···
 		if (PageLRU(p) || is_free_buddy_page(p))
 			return;
 	}
+
 	/*
-	 * Could call shrink_slab here (which would also
-	 * shrink other caches). Unfortunately that might
-	 * also access the corrupted page, which could be fatal.
+	 * Only call shrink_slab here (which would also
+	 * shrink other caches) if access is not potentially fatal.
 	 */
+	if (access) {
+		int nr;
+		do {
+			nr = shrink_slab(1000, GFP_KERNEL, 1000);
+			if (page_count(p) == 0)
+				break;
+		} while (nr > 10);
+	}
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
···
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
 	if (!PageLRU(p))
-		shake_page(p);
+		shake_page(p, 0);
 	if (!PageLRU(p)) {
 		/*
 		 * shake_page could have turned it free.
···
 	return 0;
 }
 EXPORT_SYMBOL(unpoison_memory);
+
+static struct page *new_page(struct page *p, unsigned long private, int **x)
+{
+	return alloc_pages(GFP_HIGHUSER_MOVABLE, 0);
+}
+
+/*
+ * Safely get reference count of an arbitrary page.
+ * Returns 0 for a free page, -EIO for a zero refcount page
+ * that is not free, and 1 for any other page type.
+ * For 1 the page is returned with increased page count, otherwise not.
+ */
+static int get_any_page(struct page *p, unsigned long pfn, int flags)
+{
+	int ret;
+
+	if (flags & MF_COUNT_INCREASED)
+		return 1;
+
+	/*
+	 * The lock_system_sleep prevents a race with memory hotplug,
+	 * because the isolation assumes there's only a single user.
+	 * This is a big hammer, a better solution would be nicer.
+	 */
+	lock_system_sleep();
+
+	/*
+	 * Isolate the page, so that it doesn't get reallocated if it
+	 * was free.
+	 */
+	set_migratetype_isolate(p);
+	if (!get_page_unless_zero(compound_head(p))) {
+		if (is_free_buddy_page(p)) {
+			pr_debug("get_any_page: %#lx free buddy page\n", pfn);
+			/* Set hwpoison bit while page is still isolated */
+			SetPageHWPoison(p);
+			ret = 0;
+		} else {
+			pr_debug("get_any_page: %#lx: unknown zero refcount page type %lx\n",
+				pfn, p->flags);
+			ret = -EIO;
+		}
+	} else {
+		/* Not a free page */
+		ret = 1;
+	}
+	unset_migratetype_isolate(p);
+	unlock_system_sleep();
+	return ret;
+}
+
+/**
+ * soft_offline_page - Soft offline a page.
+ * @page: page to offline
+ * @flags: flags. Same as memory_failure().
+ *
+ * Returns 0 on success, otherwise negated errno.
+ *
+ * Soft offline a page, by migration or invalidation,
+ * without killing anything. This is for the case when
+ * a page is not corrupted yet (so it's still valid to access),
+ * but has had a number of corrected errors and is better taken
+ * out.
+ *
+ * The actual policy on when to do that is maintained by
+ * user space.
+ *
+ * This should never impact any application or cause data loss,
+ * however it might take some time.
+ *
+ * This is not a 100% solution for all memory, but tries to be
+ * ``good enough'' for the majority of memory.
+ */
+int soft_offline_page(struct page *page, int flags)
+{
+	int ret;
+	unsigned long pfn = page_to_pfn(page);
+
+	ret = get_any_page(page, pfn, flags);
+	if (ret < 0)
+		return ret;
+	if (ret == 0)
+		goto done;
+
+	/*
+	 * Page cache page we can handle?
+	 */
+	if (!PageLRU(page)) {
+		/*
+		 * Try to free it.
+		 */
+		put_page(page);
+		shake_page(page, 1);
+
+		/*
+		 * Did it turn free?
+		 */
+		ret = get_any_page(page, pfn, 0);
+		if (ret < 0)
+			return ret;
+		if (ret == 0)
+			goto done;
+	}
+	if (!PageLRU(page)) {
+		pr_debug("soft_offline: %#lx: unknown non LRU page type %lx\n",
+			pfn, page->flags);
+		return -EIO;
+	}
+
+	lock_page(page);
+	wait_on_page_writeback(page);
+
+	/*
+	 * Synchronized using the page lock with memory_failure()
+	 */
+	if (PageHWPoison(page)) {
+		unlock_page(page);
+		put_page(page);
+		pr_debug("soft offline: %#lx page already poisoned\n", pfn);
+		return -EBUSY;
+	}
+
+	/*
+	 * Try to invalidate first. This should work for
+	 * non dirty unmapped page cache pages.
+	 */
+	ret = invalidate_inode_page(page);
+	unlock_page(page);
+
+	/*
+	 * Drop count because page migration doesn't like raised
+	 * counts. The page could get re-allocated, but if it becomes
+	 * LRU the isolation will just fail.
+	 * RED-PEN would be better to keep it isolated here, but we
+	 * would need to fix isolation locking first.
+	 */
+	put_page(page);
+	if (ret == 1) {
+		ret = 0;
+		pr_debug("soft_offline: %#lx: invalidated\n", pfn);
+		goto done;
+	}
+
+	/*
+	 * Simple invalidation didn't work.
+	 * Try to migrate to a new page instead. migrate.c
+	 * handles a large number of cases for us.
+	 */
+	ret = isolate_lru_page(page);
+	if (!ret) {
+		LIST_HEAD(pagelist);
+
+		list_add(&page->lru, &pagelist);
+		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0);
+		if (ret) {
+			pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
+				pfn, ret, page->flags);
+			if (ret > 0)
+				ret = -EIO;
+		}
+	} else {
+		pr_debug("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n",
+			pfn, ret, page_count(page), page->flags);
+	}
+	if (ret)
+		return ret;
+
+done:
+	atomic_long_add(1, &mce_bad_pages);
+	SetPageHWPoison(page);
+	/* keep elevated page count for bad page */
+	return ret;
+}
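The fallback order in soft_offline_page() (free page, then shake, then invalidate, then migrate) can be easier to follow as a small user-space model. The sketch below is only an approximation under stated assumptions: `struct page_model` and both functions are hypothetical, the outcome of each kernel step is reduced to a boolean, shake_page() is modelled as either freeing the page or doing nothing, and isolation or migration failures are collapsed into -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the state of a struct page. */
struct page_model {
	int refcount;		/* 0 means nobody holds a reference */
	bool free_buddy;	/* page sits in the buddy allocator */
	bool lru;		/* page is on an LRU list */
	bool shake_frees_it;	/* assumption: shake_page() frees the page */
	bool poisoned;		/* already marked by memory_failure() */
	bool invalidate_ok;	/* clean, unmapped page cache page */
	bool migrate_ok;	/* isolation and migration both succeed */
};

/* Mirrors get_any_page(): 0 = free, 1 = in use, -EIO = unknown type. */
static int model_get_any_page(const struct page_model *p)
{
	if (p->refcount == 0)
		return p->free_buddy ? 0 : -EIO;
	return 1;
}

/* Returns 0 and sets *how on success, otherwise a negative errno. */
static int model_soft_offline(struct page_model *p, const char **how)
{
	int ret = model_get_any_page(p);

	*how = "none";
	if (ret < 0)
		return ret;

	if (ret == 1 && !p->lru) {
		/* Stand-in for put_page() followed by shake_page(page, 1). */
		if (p->shake_frees_it) {
			p->refcount = 0;
			p->free_buddy = true;
		}
		ret = model_get_any_page(p);
		if (ret < 0)
			return ret;
	}
	if (ret == 0) {
		*how = "free";
		return 0;
	}

	if (!p->lru)
		return -EIO;	/* unknown non-LRU page type */
	if (p->poisoned)
		return -EBUSY;	/* memory_failure() got there first */

	if (p->invalidate_ok) {
		*how = "invalidated";
		return 0;
	}
	if (p->migrate_ok) {
		*how = "migrated";
		return 0;
	}
	return -EIO;
}
```

In this model a clean page cache page ends up "invalidated" even when migration would also work, which reflects the cheaper step being tried first in the patch.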