Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm,hwpoison: rework soft offline for in-use pages

This patch changes the way we set and handle in-use poisoned pages. Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safety net and
would skip that page.

This has proved to be wrong, as there are pfn walkers out there, like
compaction, that only care that the page is in a buddy freelist.

Although this might not be the only user, having poisoned pages in the
buddy allocator seems a bad idea as we should only have free pages that
are ready and meant to be used as such.
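
The problem described above can be modelled in user space. This is only a
sketch: toy_pfn_page and toy_first_free_pfn are hypothetical names, not kernel
API, and the walker is a simplification of what a pfn walker such as
compaction does. It shows that a walker checking only buddy membership will
happily pick up a poisoned page that sits in a freelist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-pfn state; a stand-in for struct page, not the real thing. */
struct toy_pfn_page {
	bool buddy;	/* page sits in a buddy freelist */
	bool hwpoison;	/* page is marked as poisoned */
};

/*
 * A walker that only checks buddy membership, as described above.
 * Returns the first pfn it would use, or -1 if there is none.
 */
static long toy_first_free_pfn(const struct toy_pfn_page *map, size_t n)
{
	assert(map != NULL);
	for (size_t pfn = 0; pfn < n; pfn++)
		if (map[pfn].buddy)
			return (long)pfn;
	return -1;
}
```

With a poisoned page left in the freelist the walker returns its pfn. Once the
poisoned page is kept out of the freelist, the walker skips it without needing
any HWPoison check of its own.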

Before explaining the taken approach, let us break down the kind of pages
we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

- If they are clean and unmapped page cache pages, we invalidate
them by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.

Either way, do not call put_page directly from those paths. Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.

page_handle_poison sets the HWPoison flag and does the last put_page.

Down the chain, we placed a check for HWPoison pages in
free_pages_prepare, that just skips any poisoned order-0 page, so those
pages do not end up in any pcplist/freelist.

After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.
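
The sequence above can be sketched as a small user-space model. All toy_*
names are hypothetical stand-ins, not kernel API. Only the order of operations
in toy_page_handle_poison mirrors page_handle_poison() from this patch, and
the free path is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page. */
struct toy_page {
	int refcount;
	bool hwpoison;
	bool on_freelist;
};

/* Toy stand-in for num_poisoned_pages. */
static long toy_num_poisoned;

/* Models the new check in free_pages_prepare(): poisoned pages are skipped. */
static bool toy_free_pages_prepare(const struct toy_page *p)
{
	return !p->hwpoison;
}

/* Models put_page(): dropping the last reference sends the page to the free path. */
static void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0 && toy_free_pages_prepare(p))
		p->on_freelist = true;
}

/* Same order of operations as page_handle_poison() in this patch. */
static void toy_page_handle_poison(struct toy_page *p, bool release)
{
	p->hwpoison = true;	/* flag first, so the free path sees it */
	if (release)
		toy_put_page(p);	/* last put_page, page is not freed */
	p->refcount++;		/* refcount ends up at 1 */
	toy_num_poisoned++;
}
```

Because the flag is set before the last reference is dropped, the page never
reaches the freelist in this model, and it ends up with a refcount of 1.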

If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:

- wait until the page hits buddy's freelists
- take it off, and flag it

The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has already been allocated and we cannot soft offline it.
But the user could always retry it.
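
The race in that alternative can be illustrated with a toy model. The names
toy_free_page, toy_take_off_buddy and toy_soft_offline_free are hypothetical
and this is not the kernel implementation. The page can only be flagged while
it is still in the buddy, and the caller gets -EBUSY otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_IN_BUDDY, TOY_ALLOCATED, TOY_OFFLINED };

/* Toy stand-in for a page that is expected to be free. */
struct toy_free_page {
	enum toy_state state;
	bool hwpoison;
};

/* Models take_page_off_buddy(): succeeds only while the page is still free. */
static bool toy_take_off_buddy(struct toy_free_page *p)
{
	if (p->state != TOY_IN_BUDDY)
		return false;
	p->state = TOY_OFFLINED;
	return true;
}

/* Take the page off the buddy, then flag it; fail if an allocation won the race. */
static int toy_soft_offline_free(struct toy_free_page *p)
{
	assert(p != NULL);
	if (!toy_take_off_buddy(p))
		return -EBUSY;
	p->hwpoison = true;
	return 0;
}
```

A caller that gets -EBUSY can simply try again later, which is the retry the
text refers to.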

* Hugetlb pages

- We isolate-and-migrate them

After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.

While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided not to make the code overly complicated and just fail
normally if the page was allocated in the meantime.

We can always build on top of this.
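
A rough user-space sketch of the hugetlb post-migration step follows. The
toy_* helpers are hypothetical models, not kernel API, and the -EBUSY return
of the dissolve model is an assumption made for illustration. Only the shape
of the condition is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the migration source hugepage. */
struct toy_hpage {
	bool in_use;		/* hugepage was allocated again after migration */
	bool buddy_taken;	/* page was grabbed from the buddy meanwhile */
	bool hwpoison;
};

/* Models dissolve_free_huge_page(): 0 on success, non-zero if still in use. */
static int toy_dissolve_free_huge_page(const struct toy_hpage *hp)
{
	return hp->in_use ? -EBUSY : 0;
}

/* Models take_page_off_buddy() for the dissolved page. */
static bool toy_take_page_off_buddy(const struct toy_hpage *hp)
{
	return !hp->buddy_taken;
}

/* Same condition as the post-migration branch in this patch. */
static int toy_offline_migrated_hpage(struct toy_hpage *hp)
{
	assert(hp != NULL);
	if (!toy_dissolve_free_huge_page(hp) && toy_take_page_off_buddy(hp)) {
		hp->hwpoison = true;	/* stands in for page_handle_poison(page, false) */
		return 0;
	}
	return -EBUSY;	/* no attempt to close the race, just fail */
}
```

The flag is set only when both steps succeed, so a hugepage that was
reallocated in the meantime is left untouched and the caller sees -EBUSY.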

As a bonus, because of the way we now handle in-use pages, we no longer
need the put-as-isolation-migratetype dance, that was guarding against
poisoned pages ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Oscar Salvador and committed by Linus Torvalds
79f5f8fa 06be6ff3

+28 -70
-5
include/linux/page-flags.h
···
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 extern bool take_page_off_buddy(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
+14 -29
mm/memory-failure.c
···
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release)
 {
 	SetPageHWPoison(page);
+	if (release)
+		put_page(page);
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
 }
···
 			ret = -EIO;
 	} else {
 		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * We set PG_hwpoison only when we were able to take the page
+		 * off the buddy.
 		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
+		if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
+			page_handle_poison(page, false);
+		else
+			ret = -EBUSY;
 	}
 	return ret;
 }
···
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_handle_poison(page, true);
 		return 0;
 	}
 
···
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			page_handle_poison(page, true);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
···
 static int soft_offline_in_use_page(struct page *page, int flags)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page, flags);
 	else
 		ret = __soft_offline_page(page, flags);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
···
 	int rc = -EBUSY;
 
 	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page);
+		page_handle_poison(page, false);
 		rc = 0;
 	}
 
+3 -8
mm/migrate.c
···
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We release the page in page_handle_poison.
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
+11 -28
mm/page_alloc.c
···
 
 	trace_mm_page_free(page, order);
 
+	if (unlikely(PageHWPoison(page)) && !order) {
+		/*
+		 * Do not let hwpoison pages hit pcplists/buddy
+		 * Untie memcg state and reset page's owner
+		 */
+		if (memcg_kmem_enabled() && PageKmemcg(page))
+			__memcg_kmem_uncharge_page(page, order);
+		reset_page_owner(page, order);
+		return false;
+	}
+
 	/*
 	 * Check tail pages before head page information is cleared to
 	 * avoid checking PageCompound for order-0 pages.
···
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
-}
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
-	unsigned int order;
-	bool hwpoisoned = false;
-
-	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order < MAX_ORDER; order++) {
-		struct page *page_head = page - (pfn & ((1 << order) - 1));
-
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	return hwpoisoned;
 }
 #endif