
hugetlbfs: don't delete error page from pagecache

This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.

Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache. That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned. As [1] states,
this is effectively memory corruption.

The fix is to leave the page in the page cache. If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses. For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.

[1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by James Houghton, committed by Andrew Morton (8625147c 120b1162)

 fs/hugetlbfs/inode.c | 13 ++++++-------
 mm/hugetlb.c         |  4 ++++
 mm/memory-failure.c  |  5 ++++-
 3 files changed, 14 insertions(+), 8 deletions(-)

--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -328,6 +328,12 @@
 		} else {
 			unlock_page(page);
 
+			if (PageHWPoison(page)) {
+				put_page(page);
+				retval = -EIO;
+				break;
+			}
+
 			/*
 			 * We have the page, copy it to user space buffer.
 			 */
@@ -1117,13 +1123,6 @@
 static int hugetlbfs_error_remove_page(struct address_space *mapping,
 				       struct page *page)
 {
-	struct inode *inode = mapping->host;
-	pgoff_t index = page->index;
-
-	hugetlb_delete_from_page_cache(page);
-	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
-		hugetlb_fix_reserve_counts(inode);
-
 	return 0;
 }
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6111,6 +6111,10 @@
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
 
+	ret = -EIO;
+	if (PageHWPoison(page))
+		goto out_release_unlock;
+
 	/*
 	 * We allow to overwrite a pte marker: consider when both MISSING|WP
 	 * registered, we firstly wr-protect a none pte which has no page cache
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1080,6 +1080,7 @@
 	int res;
 	struct page *hpage = compound_head(p);
 	struct address_space *mapping;
+	bool extra_pins = false;
 
 	if (!PageHuge(hpage))
 		return MF_DELAYED;
@@ -1088,6 +1089,8 @@
 	mapping = page_mapping(hpage);
 	if (mapping) {
 		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
+		/* The page is kept in page cache. */
+		extra_pins = true;
 		unlock_page(hpage);
 	} else {
 		unlock_page(hpage);
@@ -1107,7 +1110,7 @@
 	}
 
-	if (has_extra_refcount(ps, p, false))
+	if (has_extra_refcount(ps, p, extra_pins))
 		res = MF_FAILED;
 
 	return res;