Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull remaining MM updates from Andrew Morton:
"Three patch series - two that perform cleanups and one feature:

- hugetlb_vmemmap cleanups from Muchun Song

- hardware poisoning support for 1GB hugepages, from Naoya Horiguchi

- highmem documentation fixups from Fabio De Francesco"

* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
Documentation/mm: add details about kmap_local_page() and preemption
highmem: delete a sentence from kmap_local_page() kdocs
Documentation/mm: rrefer kmap_local_page() and avoid kmap()
Documentation/mm: avoid invalid use of addresses from kmap_local_page()
Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
highmem: specify that kmap_local_page() is callable from interrupts
highmem: remove unneeded spaces in kmap_local_page() kdocs
mm, hwpoison: enable memory error handling on 1GB hugepage
mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
mm, hwpoison: make __page_handle_poison returns int
mm, hwpoison: set PG_hwpoison for busy hugetlb pages
mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
mm, hwpoison, hugetlb: support saving mechanism of raw error pages
mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
mm: hugetlb_vmemmap: replace early_param() with core_param()
mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
...

+827 -706
+4 -3
Documentation/admin-guide/kernel-parameters.txt
···
  	hugetlb_free_vmemmap=
  			[KNL]    Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
  			enabled.
+ 			Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
  			Allows heavy hugetlb users to free up some more
  			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
- 			Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
+ 			Format: { on | off (default) }

- 			[oO][Nn]/Y/y/1: enable the feature
- 			[oO][Ff]/N/n/0: disable the feature
+ 			on: enable HVO
+ 			off: disable HVO

  			Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
  			the default is on.
+2 -2
Documentation/admin-guide/mm/hugetlbpage.rst
···
  will all result in 256 2M huge pages being allocated. Valid default
  huge page size is architecture dependent.
  hugetlb_free_vmemmap
- 	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables optimizing
- 	unused vmemmap pages associated with each HugeTLB page.
+ 	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
+ 	Vmemmap Optimization (HVO).

  When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
  indicates the current number of pre-allocated huge pages of the default size.
+2 -2
Documentation/admin-guide/mm/memory-hotplug.rst
···
  - Concurrent activity that operates on the same physical memory area, such as
    allocating gigantic pages, can result in temporary offlining failures.

- - Out of memory when dissolving huge pages, especially when freeing unused
-   vmemmap pages associated with each hugetlb page is enabled.
+ - Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
+   Optimization (HVO) is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails allocating (unmovable) pages
+1 -2
Documentation/admin-guide/sysctl/vm.rst
···
  in include/linux/mm_types.h) is not power of two (an unusual system config could
  result in this).

- Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
- associated with each HugeTLB page.
+ Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).

  Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
  buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
+27 -4
Documentation/mm/highmem.rst
···
  This function should be preferred, where feasible, over all the others.

  These mappings are thread-local and CPU-local, meaning that the mapping
- can only be accessed from within this thread and the thread is bound the
- CPU while the mapping is active. Even if the thread is preempted (since
- preemption is never disabled by the function) the CPU can not be
- unplugged from the system via CPU-hotplug until the mapping is disposed.
+ can only be accessed from within this thread and the thread is bound to the
+ CPU while the mapping is active. Although preemption is never disabled by
+ this function, the CPU can not be unplugged from the system via
+ CPU-hotplug until the mapping is disposed.

  It's valid to take pagefaults in a local kmap region, unless the context
  in which the local mapping is acquired does not allow it for other reasons.

+ As said, pagefaults and preemption are never disabled. There is no need to
+ disable preemption because, when context switches to a different task, the
+ maps of the outgoing task are saved and those of the incoming one are
+ restored.
+
  kmap_local_page() always returns a valid virtual address and it is assumed
  that kunmap_local() will never fail.
+
+ On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the
+ virtual address of the direct mapping. Only real highmem pages are
+ temporarily mapped. Therefore, users may call a plain page_address()
+ for pages which are known to not come from ZONE_HIGHMEM. However, it is
+ always safe to use kmap_local_page() / kunmap_local().
+
+ While it is significantly faster than kmap(), for the highmem case it
+ comes with restrictions about the pointer validity. Contrary to kmap()
+ mappings, the local mappings are only valid in the context of the caller
+ and cannot be handed to other contexts. This implies that users must
+ be absolutely sure to keep the use of the return address local to the
+ thread which mapped it.
+
+ Most code can be designed to use thread local mappings. Users should
+ therefore try to design their code to avoid the use of kmap() by mapping
+ pages in the same thread the address will be used and prefer
+ kmap_local_page().

  Nesting kmap_local_page() and kmap_atomic() mappings is allowed to a certain
  extent (up to KMAP_TYPE_NR) but their invocations have to be strictly ordered
+49 -23
Documentation/mm/vmemmap_dedup.rst
···
  HugeTLB
  =======

- The struct page structures (page structs) are used to describe a physical
- page frame. By default, there is a one-to-one mapping from a page frame to
- it's corresponding page struct.
+ This section is to explain how HugeTLB Vmemmap Optimization (HVO) works.
+
+ The ``struct page`` structures are used to describe a physical page frame. By
+ default, there is a one-to-one mapping from a page frame to its corresponding
+ ``struct page``.

  HugeTLB pages consist of multiple base page size pages and is supported by many
  architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
  details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
  currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
  consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages.
- For each base page, there is a corresponding page struct.
+ For each base page, there is a corresponding ``struct page``.

- Within the HugeTLB subsystem, only the first 4 page structs are used to
- contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
- this upper limit. The only 'useful' information in the remaining page structs
+ Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
+ contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
+ this upper limit. The only 'useful' information in the remaining ``struct page``
  is the compound_head field, and this field is the same for all tail pages.

- By removing redundant page structs for HugeTLB pages, memory can be returned
+ By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
  to the buddy allocator for other uses.

  Different architectures support different HugeTLB pages. For example, the
···
  |              |    64KB   |    2MB    |   512MB   |    16GB   |           |
  +--------------+-----------+-----------+-----------+-----------+-----------+

- When the system boot up, every HugeTLB page has more than one struct page
+ When the system boot up, every HugeTLB page has more than one ``struct page``
  structs which size is (unit: pages)::

     struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
···
  n is (PAGE_SIZE / sizeof(pte_t)).

  This optimization only supports 64-bit system, so the value of sizeof(pte_t)
- is 8. And this optimization also applicable only when the size of struct page
- is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
+ is 8. And this optimization also applicable only when the size of ``struct page``
+ is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
  x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
- size of struct page structs of it is 8 page frames which size depends on the
+ size of ``struct page`` structs of it is 8 page frames which size depends on the
  size of the base page.

  For the HugeTLB page of the pud level mapping, then::
···
                 = PAGE_SIZE / 8 * 8 (pages)
                 = PAGE_SIZE (pages)

- Where the struct_size(pmd) is the size of the struct page structs of a
+ Where the struct_size(pmd) is the size of the ``struct page`` structs of a
  HugeTLB page of the pmd level mapping.

  E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
···

  Next, we take the pmd level mapping of the HugeTLB page as an example to
  show the internal implementation of this optimization. There are 8 pages
- struct page structs associated with a HugeTLB page which is pmd mapped.
+ ``struct page`` structs associated with a HugeTLB page which is pmd mapped.

  Here is how things look before optimization::
···
     +-----------+

  The value of page->compound_head is the same for all tail pages. The first
- page of page structs (page 0) associated with the HugeTLB page contains the 4
- page structs necessary to describe the HugeTLB. The only use of the remaining
- pages of page structs (page 1 to page 7) is to point to page->compound_head.
- Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
+ page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
+ ``struct page`` necessary to describe the HugeTLB. The only use of the remaining
+ pages of ``struct page`` (page 1 to page 7) is to point to page->compound_head.
+ Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
  will be used for each HugeTLB page. This will allow us to free the remaining
  7 pages to the buddy allocator.
···

  The contiguous bit is used to increase the mapping size at the pmd and pte
  (last) level. So this type of HugeTLB page can be optimized only when its
- size of the struct page structs is greater than 1 page.
+ size of the ``struct page`` structs is greater than **1** page.

  Notice: The head vmemmap page is not freed to the buddy allocator and all
  tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
- more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
- associated with each HugeTLB page. The compound_head() can handle this
- correctly (more details refer to the comment above compound_head()).
+ more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
+ page) associated with each HugeTLB page. The ``compound_head()`` can handle
+ this correctly. There is only **one** head ``struct page``, the tail
+ ``struct page`` with ``PG_head`` are fake head ``struct page``. We need an
+ approach to distinguish between those two different types of ``struct page`` so
+ that ``compound_head()`` can return the real head ``struct page`` when the
+ parameter is the tail ``struct page`` but with ``PG_head``. The following code
+ snippet describes how to distinguish between real and fake head ``struct page``.
+
+ .. code-block:: c
+
+    if (test_bit(PG_head, &page->flags)) {
+            unsigned long head = READ_ONCE(page[1].compound_head);
+
+            if (head & 1) {
+                    if (head == (unsigned long)page + 1)
+                            /* head struct page */
+                    else
+                            /* tail struct page */
+            } else {
+                    /* head struct page */
+            }
+    }
+
+ We can safely access the field of the **page[1]** with ``PG_head`` because the
+ page is a compound page composed with at least two contiguous pages.
+ The implementation refers to ``page_fixed_fake_head()``.

  Device DAX
  ==========
···
  The differences with HugeTLB are relatively minor.

- It only use 3 page structs for storing all information as opposed
+ It only use 3 ``struct page`` for storing all information as opposed
  to 4 on HugeTLB pages.

  There's no remapping of vmemmap given that device-dax memory is not part of
+3 -10
arch/arm64/mm/flush.c
···
  void flush_dcache_page(struct page *page)
  {
  	/*
- 	 * Only the head page's flags of HugeTLB can be cleared since the tail
- 	 * vmemmap pages associated with each HugeTLB page are mapped with
- 	 * read-only when CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is enabled (more
- 	 * details can refer to vmemmap_remap_pte()). Although
- 	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
- 	 * page struct, there is more than one page struct with PG_dcache_clean
- 	 * associated with the HugeTLB page since the head vmemmap page frame
- 	 * is reused (more details can refer to the comments above
- 	 * page_fixed_fake_head()).
+ 	 * HugeTLB pages are always fully mapped and only head page will be
+ 	 * set PG_dcache_clean (see comments in __sync_icache_dcache()).
  	 */
- 	if (hugetlb_optimize_vmemmap_enabled() && PageHuge(page))
+ 	if (PageHuge(page))
  		page = compound_head(page);

  	if (test_bit(PG_dcache_clean, &page->flags))
+7 -1
arch/x86/mm/hugetlbpage.c
···
  		(pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
  }

+ /*
+  * pud_huge() returns 1 if @pud is hugetlb related entry, that is normal
+  * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
+  * Otherwise, returns 0.
+  */
  int pud_huge(pud_t pud)
  {
- 	return !!(pud_val(pud) & _PAGE_PSE);
+ 	return !pud_none(pud) &&
+ 		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
  }

  #ifdef CONFIG_HUGETLB_PAGE
+5 -7
fs/Kconfig
···

  #
  # Select this config option from the architecture Kconfig, if it is preferred
- # to enable the feature of minimizing overhead of struct page associated with
- # each HugeTLB page.
+ # to enable the feature of HugeTLB Vmemmap Optimization (HVO).
  #
  config ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
  	bool
···
  	depends on SPARSEMEM_VMEMMAP

  config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
- 	bool "Default optimizing vmemmap pages of HugeTLB to on"
+ 	bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
  	default n
  	depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
  	help
- 	  When using HUGETLB_PAGE_OPTIMIZE_VMEMMAP, the optimizing unused vmemmap
- 	  pages associated with each HugeTLB page is default off. Say Y here
- 	  to enable optimizing vmemmap pages of HugeTLB by default. It can then
- 	  be disabled on the command line via hugetlb_free_vmemmap=off.
+ 	  The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
+ 	  enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
+ 	  (boot command line) or hugetlb_optimize_vmemmap (sysctl).

  config MEMFD_CREATE
  	def_bool TMPFS || HUGETLBFS
+3 -4
include/linux/highmem.h
···

  /**
   * kmap_local_page - Map a page for temporary usage
-  * @page:	Pointer to the page to be mapped
+  * @page: Pointer to the page to be mapped
   *
   * Returns: The virtual address of the mapping
   *
-  * Can be invoked from any context.
+  * Can be invoked from any context, including interrupts.
   *
   * Requires careful handling when nesting multiple mappings because the map
   * management is stack based. The unmap has to be in the reverse order of
···
   * temporarily mapped.
   *
   * While it is significantly faster than kmap() for the highmem case it
-  * comes with restrictions about the pointer validity. Only use when really
-  * necessary.
+  * comes with restrictions about the pointer validity.
   *
   * On HIGHMEM enabled systems mapping a highmem page has the side effect of
   * disabling migration in order to keep the virtual address stable across
+18 -6
include/linux/hugetlb.h
···
  	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
  	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
  #endif
+ #ifdef CONFIG_MEMORY_FAILURE
+ 	SUBPAGE_INDEX_HWPOISON,
+ #endif
  	__NR_USED_SUBPAGE,
  };
···
   * Synchronization: Initially set after new page allocation with no
   * locking. When examined and modified during migration processing
   * (isolate, migrate, putback) the hugetlb_lock is held.
-  * HPG_temporary - - Set on a page that is temporarily allocated from the buddy
+  * HPG_temporary - Set on a page that is temporarily allocated from the buddy
   * allocator. Typically used for migration target pages when no pages
   * are available in the pool. The hugetlb free page path will
   * immediately free pages with this flag set to the buddy allocator.
···
   * HPG_freed - Set when page is on the free lists.
   * Synchronization: hugetlb_lock held for examination and modification.
   * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
+  * HPG_raw_hwp_unreliable - Set when the hugetlb page has a hwpoison sub-page
+  * that is not tracked by raw_hwp_page list.
   */
  enum hugetlb_page_flags {
  	HPG_restore_reserve = 0,
···
  	HPG_temporary,
  	HPG_freed,
  	HPG_vmemmap_optimized,
+ 	HPG_raw_hwp_unreliable,
  	__NR_HPAGEFLAGS,
  };
···
  HPAGEFLAG(Temporary, temporary)
  HPAGEFLAG(Freed, freed)
  HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
+ HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)

  #ifdef CONFIG_HUGETLB_PAGE
···
  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
  	unsigned int free_huge_pages_node[MAX_NUMNODES];
  	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
- #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
- 	unsigned int optimize_vmemmap_pages;
- #endif
  #ifdef CONFIG_CGROUP_HUGETLB
  	/* cgroup control files */
  	struct cftype cgroup_files_dfl[8];
···
  	return hstate_file(vma->vm_file);
  }

- static inline unsigned long huge_page_size(struct hstate *h)
+ static inline unsigned long huge_page_size(const struct hstate *h)
  {
  	return (unsigned long)PAGE_SIZE << h->order;
  }
···
  	return huge_page_order(h) >= MAX_ORDER;
  }

- static inline unsigned int pages_per_huge_page(struct hstate *h)
+ static inline unsigned int pages_per_huge_page(const struct hstate *h)
  {
  	return 1 << h->order;
  }
···
  extern int dissolve_free_huge_page(struct page *page);
  extern int dissolve_free_huge_pages(unsigned long start_pfn,
  				unsigned long end_pfn);
+
+ #ifdef CONFIG_MEMORY_FAILURE
+ extern void hugetlb_clear_page_hwpoison(struct page *hpage);
+ #else
+ static inline void hugetlb_clear_page_hwpoison(struct page *hpage)
+ {
+ }
+ #endif

  #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
  #ifndef arch_hugetlb_migration_supported
+1 -8
include/linux/mm.h
···
  }
  #endif

- #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
- int vmemmap_remap_free(unsigned long start, unsigned long end,
- 		       unsigned long reuse);
- int vmemmap_remap_alloc(unsigned long start, unsigned long end,
- 			unsigned long reuse, gfp_t gfp_mask);
- #endif
-
  void *sparse_buffer_alloc(unsigned long size);
  struct page * __populate_section_memmap(unsigned long pfn,
  		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
···
  	MF_SOFT_OFFLINE = 1 << 3,
  	MF_UNPOISON = 1 << 4,
  	MF_SW_SIMULATED = 1 << 5,
+ 	MF_NO_RETRY = 1 << 6,
  };
  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  		      unsigned long count, int mf_flags);
···
  	MF_MSG_DIFFERENT_COMPOUND,
  	MF_MSG_HUGE,
  	MF_MSG_FREE_HUGE,
- 	MF_MSG_NON_PMD_HUGE,
  	MF_MSG_UNMAP_FAILED,
  	MF_MSG_DIRTY_SWAPCACHE,
  	MF_MSG_CLEAN_SWAPCACHE,
+4 -28
include/linux/page-flags.h
···
  #ifndef __GENERATING_BOUNDS_H

  #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
- DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
- 			 hugetlb_optimize_vmemmap_key);
-
- static __always_inline bool hugetlb_optimize_vmemmap_enabled(void)
- {
- 	return static_branch_maybe(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
- 				   &hugetlb_optimize_vmemmap_key);
- }
+ DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);

  /*
-  * If the feature of optimizing vmemmap pages associated with each HugeTLB
-  * page is enabled, the head vmemmap page frame is reused and all of the tail
-  * vmemmap addresses map to the head vmemmap page frame (furture details can
-  * refer to the figure at the head of the mm/hugetlb_vmemmap.c). In other
-  * words, there are more than one page struct with PG_head associated with each
-  * HugeTLB page. We __know__ that there is only one head page struct, the tail
-  * page structs with PG_head are fake head page structs. We need an approach
-  * to distinguish between those two different types of page structs so that
-  * compound_head() can return the real head page struct when the parameter is
-  * the tail page struct but with PG_head.
-  *
-  * The page_fixed_fake_head() returns the real head page struct if the @page is
-  * fake page head, otherwise, returns @page which can either be a true page
-  * head or tail.
+  * Return the real head page struct iff the @page is a fake head page, otherwise
+  * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
   */
  static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
  {
- 	if (!hugetlb_optimize_vmemmap_enabled())
+ 	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
  		return page;

  	/*
···
  static inline const struct page *page_fixed_fake_head(const struct page *page)
  {
  	return page;
- }
-
- static inline bool hugetlb_optimize_vmemmap_enabled(void)
- {
- 	return false;
  }
  #endif
+9
include/linux/swapops.h
···
  	atomic_long_dec(&num_poisoned_pages);
  }

+ static inline void num_poisoned_pages_sub(long i)
+ {
+ 	atomic_long_sub(i, &num_poisoned_pages);
+ }
+
  #else

  static inline swp_entry_t make_hwpoison_entry(struct page *page)
···
  }

  static inline void num_poisoned_pages_inc(void)
+ {
+ }
+
+ static inline void num_poisoned_pages_sub(long i)
  {
  }
  #endif
+4
include/linux/sysctl.h
···
  	return NULL;
  }

+ static inline void register_sysctl_init(const char *path, struct ctl_table *table)
+ {
+ }
+
  static inline struct ctl_table_header *register_sysctl_mount_point(const char *path)
  {
  	return NULL;
-1
include/ras/ras_event.h
···
  	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
  	EM ( MF_MSG_HUGE, "huge page" )					\
  	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
- 	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
  	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
  	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
  	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
+53 -20
mm/hugetlb.c
···
  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
  		return;

- 	if (hugetlb_vmemmap_alloc(h, page)) {
+ 	/*
+ 	 * If we don't know which subpages are hwpoisoned, we can't free
+ 	 * the hugepage, so it's leaked intentionally.
+ 	 */
+ 	if (HPageRawHwpUnreliable(page))
+ 		return;
+
+ 	if (hugetlb_vmemmap_restore(h, page)) {
  		spin_lock_irq(&hugetlb_lock);
  		/*
  		 * If we cannot allocate vmemmap pages, just refuse to free the
···
  		spin_unlock_irq(&hugetlb_lock);
  		return;
  	}
+
+ 	/*
+ 	 * Move PageHWPoison flag from head page to the raw error pages,
+ 	 * which makes any healthy subpages reusable.
+ 	 */
+ 	if (unlikely(PageHWPoison(page)))
+ 		hugetlb_clear_page_hwpoison(page);

  	for (i = 0; i < pages_per_huge_page(h);
  	     i++, subpage = mem_map_next(subpage, page, i)) {
···

  static inline void flush_free_hpage_work(struct hstate *h)
  {
- 	if (hugetlb_optimize_vmemmap_pages(h))
+ 	if (hugetlb_vmemmap_optimizable(h))
  		flush_work(&free_hpage_work);
  }
···

  static void __prep_new_huge_page(struct hstate *h, struct page *page)
  {
- 	hugetlb_vmemmap_free(h, page);
+ 	hugetlb_vmemmap_optimize(h, page);
  	INIT_LIST_HEAD(&page->lru);
  	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
  	hugetlb_set_page_subpool(page, NULL);
···
  		 * Attempt to allocate vmemmmap here so that we can take
  		 * appropriate action on failure.
  		 */
- 		rc = hugetlb_vmemmap_alloc(h, head);
+ 		rc = hugetlb_vmemmap_restore(h, head);
  		if (!rc) {
- 			/*
- 			 * Move PageHWPoison flag from head page to the raw
- 			 * error page, which makes any subpages rather than
- 			 * the error page reusable.
- 			 */
- 			if (PageHWPoison(head) && page != head) {
- 				SetPageHWPoison(page);
- 				ClearPageHWPoison(head);
- 			}
  			update_and_free_page(h, head, false);
  		} else {
  			spin_lock_irq(&hugetlb_lock);
···
  	/* Uncommit the reservation */
  	h->resv_huge_pages -= unused_resv_pages;

- 	/* Cannot return gigantic pages currently */
- 	if (hstate_is_gigantic(h))
+ 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
  		goto out;

  	/*
···
  		char buf[32];

  		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
- 		pr_info("HugeTLB registered %s page size, pre-allocated %ld pages\n",
+ 		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
  			buf, h->free_huge_pages);
+ 		pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
+ 			hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf);
  	}
  }
···
  	remove_hugetlb_page_for_demote(h, page, false);
  	spin_unlock_irq(&hugetlb_lock);

- 	rc = hugetlb_vmemmap_alloc(h, page);
+ 	rc = hugetlb_vmemmap_restore(h, page);
  	if (rc) {
  		/* Allocation of vmemmmap failed, we can not demote page */
  		spin_lock_irq(&hugetlb_lock);
···
  	h->next_nid_to_free = first_memory_node;
  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
  					huge_page_size(h)/1024);
- 	hugetlb_vmemmap_init(h);

  	parsed_hstate = h;
  }
···
  follow_huge_pud(struct mm_struct *mm, unsigned long address,
  		pud_t *pud, int flags)
  {
- 	if (flags & (FOLL_GET | FOLL_PIN))
+ 	struct page *page = NULL;
+ 	spinlock_t *ptl;
+ 	pte_t pte;
+
+ 	if (WARN_ON_ONCE(flags & FOLL_PIN))
  		return NULL;

- 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+ retry:
+ 	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
+ 	if (!pud_huge(*pud))
+ 		goto out;
+ 	pte = huge_ptep_get((pte_t *)pud);
+ 	if (pte_present(pte)) {
+ 		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+ 		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+ 			page = NULL;
+ 			goto out;
+ 		}
+ 	} else {
+ 		if (is_hugetlb_entry_migration(pte)) {
+ 			spin_unlock(ptl);
+ 			__migration_entry_wait(mm, (pte_t *)pud, ptl);
+ 			goto retry;
+ 		}
+ 		/*
+ 		 * hwpoisoned entry is treated as no_page_table in
+ 		 * follow_page_mask().
+ 		 */
+ 	}
+ out:
+ 	spin_unlock(ptl);
+ 	return page;
  }

  struct page * __weak
+456 -139
mm/hugetlb_vmemmap.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* 3 - * Optimize vmemmap pages associated with HugeTLB 3 + * HugeTLB Vmemmap Optimization (HVO) 4 4 * 5 - * Copyright (c) 2020, Bytedance. All rights reserved. 5 + * Copyright (c) 2020, ByteDance. All rights reserved. 6 6 * 7 7 * Author: Muchun Song <songmuchun@bytedance.com> 8 8 * ··· 10 10 */ 11 11 #define pr_fmt(fmt) "HugeTLB: " fmt 12 12 13 - #include <linux/memory.h> 13 + #include <linux/pgtable.h> 14 + #include <linux/bootmem_info.h> 15 + #include <asm/pgalloc.h> 16 + #include <asm/tlbflush.h> 14 17 #include "hugetlb_vmemmap.h" 15 18 16 - /* 17 - * There are a lot of struct page structures associated with each HugeTLB page. 18 - * For tail pages, the value of compound_head is the same. So we can reuse first 19 - * page of head page structures. We map the virtual addresses of all the pages 20 - * of tail page structures to the head page struct, and then free these page 21 - * frames. Therefore, we need to reserve one pages as vmemmap areas. 19 + /** 20 + * struct vmemmap_remap_walk - walk vmemmap page table 21 + * 22 + * @remap_pte: called for each lowest-level entry (PTE). 23 + * @nr_walked: the number of walked pte. 24 + * @reuse_page: the page which is reused for the tail vmemmap pages. 25 + * @reuse_addr: the virtual address of the @reuse_page page. 26 + * @vmemmap_pages: the list head of the vmemmap pages that can be freed 27 + * or is mapped from. 
22 28 */ 23 - #define RESERVE_VMEMMAP_NR 1U 24 - #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) 25 - 26 - enum vmemmap_optimize_mode { 27 - VMEMMAP_OPTIMIZE_OFF, 28 - VMEMMAP_OPTIMIZE_ON, 29 + struct vmemmap_remap_walk { 30 + void (*remap_pte)(pte_t *pte, unsigned long addr, 31 + struct vmemmap_remap_walk *walk); 32 + unsigned long nr_walked; 33 + struct page *reuse_page; 34 + unsigned long reuse_addr; 35 + struct list_head *vmemmap_pages; 29 36 }; 30 37 31 - DEFINE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON, 32 - hugetlb_optimize_vmemmap_key); 33 - EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key); 34 - 35 - static enum vmemmap_optimize_mode vmemmap_optimize_mode = 36 - IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON); 37 - 38 - static void vmemmap_optimize_mode_switch(enum vmemmap_optimize_mode to) 38 + static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start) 39 39 { 40 - if (vmemmap_optimize_mode == to) 41 - return; 40 + pmd_t __pmd; 41 + int i; 42 + unsigned long addr = start; 43 + struct page *page = pmd_page(*pmd); 44 + pte_t *pgtable = pte_alloc_one_kernel(&init_mm); 42 45 43 - if (to == VMEMMAP_OPTIMIZE_OFF) 44 - static_branch_dec(&hugetlb_optimize_vmemmap_key); 45 - else 46 - static_branch_inc(&hugetlb_optimize_vmemmap_key); 47 - WRITE_ONCE(vmemmap_optimize_mode, to); 48 - } 46 + if (!pgtable) 47 + return -ENOMEM; 49 48 50 - static int __init hugetlb_vmemmap_early_param(char *buf) 51 - { 52 - bool enable; 53 - enum vmemmap_optimize_mode mode; 49 + pmd_populate_kernel(&init_mm, &__pmd, pgtable); 54 50 55 - if (kstrtobool(buf, &enable)) 56 - return -EINVAL; 51 + for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) { 52 + pte_t entry, *pte; 53 + pgprot_t pgprot = PAGE_KERNEL; 57 54 58 - mode = enable ? 
VMEMMAP_OPTIMIZE_ON : VMEMMAP_OPTIMIZE_OFF; 59 - vmemmap_optimize_mode_switch(mode); 55 + entry = mk_pte(page + i, pgprot); 56 + pte = pte_offset_kernel(&__pmd, addr); 57 + set_pte_at(&init_mm, addr, pte, entry); 58 + } 59 + 60 + spin_lock(&init_mm.page_table_lock); 61 + if (likely(pmd_leaf(*pmd))) { 62 + /* 63 + * Higher order allocations from buddy allocator must be able to 64 + * be treated as independent small pages (as they can be freed 65 + * individually). 66 + */ 67 + if (!PageReserved(page)) 68 + split_page(page, get_order(PMD_SIZE)); 69 + 70 + /* Make pte visible before pmd. See comment in pmd_install(). */ 71 + smp_wmb(); 72 + pmd_populate_kernel(&init_mm, pmd, pgtable); 73 + flush_tlb_kernel_range(start, start + PMD_SIZE); 74 + } else { 75 + pte_free_kernel(&init_mm, pgtable); 76 + } 77 + spin_unlock(&init_mm.page_table_lock); 60 78 61 79 return 0; 62 80 } 63 - early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param); 81 + 82 + static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start) 83 + { 84 + int leaf; 85 + 86 + spin_lock(&init_mm.page_table_lock); 87 + leaf = pmd_leaf(*pmd); 88 + spin_unlock(&init_mm.page_table_lock); 89 + 90 + if (!leaf) 91 + return 0; 92 + 93 + return __split_vmemmap_huge_pmd(pmd, start); 94 + } 95 + 96 + static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, 97 + unsigned long end, 98 + struct vmemmap_remap_walk *walk) 99 + { 100 + pte_t *pte = pte_offset_kernel(pmd, addr); 101 + 102 + /* 103 + * The reuse_page is found 'first' in the table walk before we start 104 + * remapping (which calls @walk->remap_pte). 105 + */ 106 + if (!walk->reuse_page) { 107 + walk->reuse_page = pte_page(*pte); 108 + /* 109 + * Because the reuse address is part of the range that we are 110 + * walking, skip the reuse address range. 
111 + */ 112 + addr += PAGE_SIZE; 113 + pte++; 114 + walk->nr_walked++; 115 + } 116 + 117 + for (; addr != end; addr += PAGE_SIZE, pte++) { 118 + walk->remap_pte(pte, addr, walk); 119 + walk->nr_walked++; 120 + } 121 + } 122 + 123 + static int vmemmap_pmd_range(pud_t *pud, unsigned long addr, 124 + unsigned long end, 125 + struct vmemmap_remap_walk *walk) 126 + { 127 + pmd_t *pmd; 128 + unsigned long next; 129 + 130 + pmd = pmd_offset(pud, addr); 131 + do { 132 + int ret; 133 + 134 + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK); 135 + if (ret) 136 + return ret; 137 + 138 + next = pmd_addr_end(addr, end); 139 + vmemmap_pte_range(pmd, addr, next, walk); 140 + } while (pmd++, addr = next, addr != end); 141 + 142 + return 0; 143 + } 144 + 145 + static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr, 146 + unsigned long end, 147 + struct vmemmap_remap_walk *walk) 148 + { 149 + pud_t *pud; 150 + unsigned long next; 151 + 152 + pud = pud_offset(p4d, addr); 153 + do { 154 + int ret; 155 + 156 + next = pud_addr_end(addr, end); 157 + ret = vmemmap_pmd_range(pud, addr, next, walk); 158 + if (ret) 159 + return ret; 160 + } while (pud++, addr = next, addr != end); 161 + 162 + return 0; 163 + } 164 + 165 + static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr, 166 + unsigned long end, 167 + struct vmemmap_remap_walk *walk) 168 + { 169 + p4d_t *p4d; 170 + unsigned long next; 171 + 172 + p4d = p4d_offset(pgd, addr); 173 + do { 174 + int ret; 175 + 176 + next = p4d_addr_end(addr, end); 177 + ret = vmemmap_pud_range(p4d, addr, next, walk); 178 + if (ret) 179 + return ret; 180 + } while (p4d++, addr = next, addr != end); 181 + 182 + return 0; 183 + } 184 + 185 + static int vmemmap_remap_range(unsigned long start, unsigned long end, 186 + struct vmemmap_remap_walk *walk) 187 + { 188 + unsigned long addr = start; 189 + unsigned long next; 190 + pgd_t *pgd; 191 + 192 + VM_BUG_ON(!PAGE_ALIGNED(start)); 193 + VM_BUG_ON(!PAGE_ALIGNED(end)); 194 + 195 + pgd = 
pgd_offset_k(addr); 196 + do { 197 + int ret; 198 + 199 + next = pgd_addr_end(addr, end); 200 + ret = vmemmap_p4d_range(pgd, addr, next, walk); 201 + if (ret) 202 + return ret; 203 + } while (pgd++, addr = next, addr != end); 204 + 205 + /* 206 + * We only change the mapping of the vmemmap virtual address range 207 + * [@start + PAGE_SIZE, end), so we only need to flush the TLB which 208 + * belongs to the range. 209 + */ 210 + flush_tlb_kernel_range(start + PAGE_SIZE, end); 211 + 212 + return 0; 213 + } 64 214 65 215 /* 66 - * Previously discarded vmemmap pages will be allocated and remapping 67 - * after this function returns zero. 216 + * Free a vmemmap page. A vmemmap page can be allocated from the memblock 217 + * allocator or buddy allocator. If the PG_reserved flag is set, it means 218 + * that it was allocated from the memblock allocator, so free it via 219 + * free_bootmem_page(). Otherwise, use __free_page(). 68 220 */ 69 - int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head) 221 + static inline void free_vmemmap_page(struct page *page) 222 + { 223 + if (PageReserved(page)) 224 + free_bootmem_page(page); 225 + else 226 + __free_page(page); 227 + } 228 + 229 + /* Free a list of the vmemmap pages */ 230 + static void free_vmemmap_page_list(struct list_head *list) 231 + { 232 + struct page *page, *next; 233 + 234 + list_for_each_entry_safe(page, next, list, lru) { 235 + list_del(&page->lru); 236 + free_vmemmap_page(page); 237 + } 238 + } 239 + 240 + static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, 241 + struct vmemmap_remap_walk *walk) 242 + { 243 + /* 244 + * Remap the tail pages as read-only to catch illegal write operations 245 + * to the tail pages. 
246 + */ 247 + pgprot_t pgprot = PAGE_KERNEL_RO; 248 + pte_t entry = mk_pte(walk->reuse_page, pgprot); 249 + struct page *page = pte_page(*pte); 250 + 251 + list_add_tail(&page->lru, walk->vmemmap_pages); 252 + set_pte_at(&init_mm, addr, pte, entry); 253 + } 254 + 255 + /* 256 + * How many struct page structs need to be reset. When we reuse the head 257 + * struct page, the special metadata (e.g. page->flags or page->mapping) 258 + * cannot be copied to the tail struct page structs. The invalid values 259 + * will be caught by free_tail_pages_check(). To avoid its "corrupted 260 + * mapping in tail page" message, we need to reset at least 3 struct 261 + * page structs (one head struct page struct and two tail struct page 262 + * structs). 263 + */ 264 + #define NR_RESET_STRUCT_PAGE 3 265 + 266 + static inline void reset_struct_pages(struct page *start) 267 + { 268 + int i; 269 + struct page *from = start + NR_RESET_STRUCT_PAGE; 270 + 271 + for (i = 0; i < NR_RESET_STRUCT_PAGE; i++) 272 + memcpy(start + i, from, sizeof(*from)); 273 + } 274 + 275 + static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, 276 + struct vmemmap_remap_walk *walk) 277 + { 278 + pgprot_t pgprot = PAGE_KERNEL; 279 + struct page *page; 280 + void *to; 281 + 282 + BUG_ON(pte_page(*pte) != walk->reuse_page); 283 + 284 + page = list_first_entry(walk->vmemmap_pages, struct page, lru); 285 + list_del(&page->lru); 286 + to = page_to_virt(page); 287 + copy_page(to, (void *)walk->reuse_addr); 288 + reset_struct_pages(to); 289 + 290 + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); 291 + } 292 + 293 + /** 294 + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end) 295 + * to the page which @reuse is mapped to, then free the 296 + * vmemmap pages which the range was mapped to. 297 + * @start: start address of the vmemmap virtual address range that we want 298 + * to remap. 
299 + * @end: end address of the vmemmap virtual address range that we want to 300 + * remap. 301 + * @reuse: reuse address. 302 + * 303 + * Return: %0 on success, negative error code otherwise. 304 + */ 305 + static int vmemmap_remap_free(unsigned long start, unsigned long end, 306 + unsigned long reuse) 70 307 { 71 308 int ret; 72 - unsigned long vmemmap_addr = (unsigned long)head; 73 - unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages; 309 + LIST_HEAD(vmemmap_pages); 310 + struct vmemmap_remap_walk walk = { 311 + .remap_pte = vmemmap_remap_pte, 312 + .reuse_addr = reuse, 313 + .vmemmap_pages = &vmemmap_pages, 314 + }; 315 + 316 + /* 317 + * In order to make the remapping routine most efficient for huge pages, 318 + * the vmemmap page table walking routine obeys the following rules 319 + * (see vmemmap_pte_range() for more details): 320 + * 321 + * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) 322 + * should be contiguous. 323 + * - The @reuse address is part of the range [@reuse, @end) that we are 324 + * walking which is passed to vmemmap_remap_range(). 325 + * - The @reuse address is the first in the complete range. 326 + * 327 + * So we need to make sure that @start and @reuse meet the above rules. 328 + */ 329 + BUG_ON(start - reuse != PAGE_SIZE); 330 + 331 + mmap_read_lock(&init_mm); 332 + ret = vmemmap_remap_range(reuse, end, &walk); 333 + if (ret && walk.nr_walked) { 334 + end = reuse + walk.nr_walked * PAGE_SIZE; 335 + /* 336 + * vmemmap_pages contains pages from the previous 337 + * vmemmap_remap_range call which failed. These 338 + * are pages which were removed from the vmemmap. 339 + * They will be restored in the following call. 
340 + */ 341 + walk = (struct vmemmap_remap_walk) { 342 + .remap_pte = vmemmap_restore_pte, 343 + .reuse_addr = reuse, 344 + .vmemmap_pages = &vmemmap_pages, 345 + }; 346 + 347 + vmemmap_remap_range(reuse, end, &walk); 348 + } 349 + mmap_read_unlock(&init_mm); 350 + 351 + free_vmemmap_page_list(&vmemmap_pages); 352 + 353 + return ret; 354 + } 355 + 356 + static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, 357 + gfp_t gfp_mask, struct list_head *list) 358 + { 359 + unsigned long nr_pages = (end - start) >> PAGE_SHIFT; 360 + int nid = page_to_nid((struct page *)start); 361 + struct page *page, *next; 362 + 363 + while (nr_pages--) { 364 + page = alloc_pages_node(nid, gfp_mask, 0); 365 + if (!page) 366 + goto out; 367 + list_add_tail(&page->lru, list); 368 + } 369 + 370 + return 0; 371 + out: 372 + list_for_each_entry_safe(page, next, list, lru) 373 + __free_pages(page, 0); 374 + return -ENOMEM; 375 + } 376 + 377 + /** 378 + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, @end) 379 + * to pages which are taken from the 380 + * @vmemmap_pages list. 381 + * @start: start address of the vmemmap virtual address range that we want 382 + * to remap. 383 + * @end: end address of the vmemmap virtual address range that we want to 384 + * remap. 385 + * @reuse: reuse address. 386 + * @gfp_mask: GFP flag for allocating vmemmap pages. 387 + * 388 + * Return: %0 on success, negative error code otherwise. 389 + */ 390 + static int vmemmap_remap_alloc(unsigned long start, unsigned long end, 391 + unsigned long reuse, gfp_t gfp_mask) 392 + { 393 + LIST_HEAD(vmemmap_pages); 394 + struct vmemmap_remap_walk walk = { 395 + .remap_pte = vmemmap_restore_pte, 396 + .reuse_addr = reuse, 397 + .vmemmap_pages = &vmemmap_pages, 398 + }; 399 + 400 + /* See the comment in vmemmap_remap_free(). 
*/ 401 + BUG_ON(start - reuse != PAGE_SIZE); 402 + 403 + if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages)) 404 + return -ENOMEM; 405 + 406 + mmap_read_lock(&init_mm); 407 + vmemmap_remap_range(reuse, end, &walk); 408 + mmap_read_unlock(&init_mm); 409 + 410 + return 0; 411 + } 412 + 413 + DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key); 414 + EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key); 415 + 416 + static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON); 417 + core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0); 418 + 419 + /** 420 + * hugetlb_vmemmap_restore - restore previously optimized (by 421 + * hugetlb_vmemmap_optimize()) vmemmap pages which 422 + * will be reallocated and remapped. 423 + * @h: struct hstate. 424 + * @head: the head page whose vmemmap pages will be restored. 425 + * 426 + * Return: %0 if @head's vmemmap pages have been reallocated and remapped, 427 + * negative error code otherwise. 428 + */ 429 + int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head) 430 + { 431 + int ret; 432 + unsigned long vmemmap_start = (unsigned long)head, vmemmap_end; 433 + unsigned long vmemmap_reuse; 74 434 75 435 if (!HPageVmemmapOptimized(head)) 76 436 return 0; 77 437 78 - vmemmap_addr += RESERVE_VMEMMAP_SIZE; 79 - vmemmap_pages = hugetlb_optimize_vmemmap_pages(h); 80 - vmemmap_end = vmemmap_addr + (vmemmap_pages << PAGE_SHIFT); 81 - vmemmap_reuse = vmemmap_addr - PAGE_SIZE; 438 + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h); 439 + vmemmap_reuse = vmemmap_start; 440 + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE; 82 441 83 442 /* 84 - * The pages which the vmemmap virtual address range [@vmemmap_addr, 443 + * The pages which the vmemmap virtual address range [@vmemmap_start, 85 444 * @vmemmap_end) are mapped to are freed to the buddy allocator, and 86 445 * the range is mapped to the page which @vmemmap_reuse is mapped to. 
87 446 * When a HugeTLB page is freed to the buddy allocator, previously 88 447 * discarded vmemmap pages must be allocated and remapped. 89 448 */ 90 - ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse, 449 + ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, 91 450 GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE); 92 451 if (!ret) { 93 452 ClearHPageVmemmapOptimized(head); ··· 456 97 return ret; 457 98 } 458 99 459 - static unsigned int vmemmap_optimizable_pages(struct hstate *h, 460 - struct page *head) 100 + /* Return true iff the vmemmap of a HugeTLB page should and can be optimized. */ 101 + static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head) 461 102 { 462 - if (READ_ONCE(vmemmap_optimize_mode) == VMEMMAP_OPTIMIZE_OFF) 463 - return 0; 103 + if (!READ_ONCE(vmemmap_optimize_enabled)) 104 + return false; 105 + 106 + if (!hugetlb_vmemmap_optimizable(h)) 107 + return false; 464 108 465 109 if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) { 466 110 pmd_t *pmdp, pmd; ··· 506 144 * +-------------------------------------------+ 507 145 */ 508 146 if (PageVmemmapSelfHosted(vmemmap_page)) 509 - return 0; 147 + return false; 510 148 } 511 149 512 - return hugetlb_optimize_vmemmap_pages(h); 150 + return true; 513 151 } 514 152 515 - void hugetlb_vmemmap_free(struct hstate *h, struct page *head) 153 + /** 154 + * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages. 155 + * @h: struct hstate. 156 + * @head: the head page whose vmemmap pages will be optimized. 157 + * 158 + * This function only tries to optimize @head's vmemmap pages and does not 159 + * guarantee that the optimization will succeed after it returns. The caller 160 + * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages 161 + * have been optimized. 
162 + */ 163 + void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head) 516 164 { 517 - unsigned long vmemmap_addr = (unsigned long)head; 518 - unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages; 165 + unsigned long vmemmap_start = (unsigned long)head, vmemmap_end; 166 + unsigned long vmemmap_reuse; 519 167 520 - vmemmap_pages = vmemmap_optimizable_pages(h, head); 521 - if (!vmemmap_pages) 168 + if (!vmemmap_should_optimize(h, head)) 522 169 return; 523 170 524 171 static_branch_inc(&hugetlb_optimize_vmemmap_key); 525 172 526 - vmemmap_addr += RESERVE_VMEMMAP_SIZE; 527 - vmemmap_end = vmemmap_addr + (vmemmap_pages << PAGE_SHIFT); 528 - vmemmap_reuse = vmemmap_addr - PAGE_SIZE; 173 + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h); 174 + vmemmap_reuse = vmemmap_start; 175 + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE; 529 176 530 177 /* 531 - * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end) 178 + * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end) 532 179 * to the page which @vmemmap_reuse is mapped to, then free the pages 533 - * which the range [@vmemmap_addr, @vmemmap_end] is mapped to. 180 + * which the range [@vmemmap_start, @vmemmap_end] is mapped to. 534 181 */ 535 - if (vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse)) 182 + if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse)) 536 183 static_branch_dec(&hugetlb_optimize_vmemmap_key); 537 184 else 538 185 SetHPageVmemmapOptimized(head); 539 186 } 540 187 541 - void __init hugetlb_vmemmap_init(struct hstate *h) 542 - { 543 - unsigned int nr_pages = pages_per_huge_page(h); 544 - unsigned int vmemmap_pages; 545 - 546 - /* 547 - * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct 548 - * page structs that can be used when CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, 549 - * so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. 
550 - */ 551 - BUILD_BUG_ON(__NR_USED_SUBPAGE >= 552 - RESERVE_VMEMMAP_SIZE / sizeof(struct page)); 553 - 554 - if (!is_power_of_2(sizeof(struct page))) { 555 - pr_warn_once("cannot optimize vmemmap pages because \"struct page\" crosses page boundaries\n"); 556 - static_branch_disable(&hugetlb_optimize_vmemmap_key); 557 - return; 558 - } 559 - 560 - vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT; 561 - /* 562 - * The head page is not to be freed to buddy allocator, the other tail 563 - * pages will map to the head page, so they can be freed. 564 - * 565 - * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true 566 - * on some architectures (e.g. aarch64). See Documentation/arm64/ 567 - * hugetlbpage.rst for more details. 568 - */ 569 - if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR)) 570 - h->optimize_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR; 571 - 572 - pr_info("can optimize %d vmemmap pages for %s\n", 573 - h->optimize_vmemmap_pages, h->name); 574 - } 575 - 576 - #ifdef CONFIG_PROC_SYSCTL 577 - static int hugetlb_optimize_vmemmap_handler(struct ctl_table *table, int write, 578 - void *buffer, size_t *length, 579 - loff_t *ppos) 580 - { 581 - int ret; 582 - enum vmemmap_optimize_mode mode; 583 - static DEFINE_MUTEX(sysctl_mutex); 584 - 585 - if (write && !capable(CAP_SYS_ADMIN)) 586 - return -EPERM; 587 - 588 - mutex_lock(&sysctl_mutex); 589 - mode = vmemmap_optimize_mode; 590 - table->data = &mode; 591 - ret = proc_dointvec_minmax(table, write, buffer, length, ppos); 592 - if (write && !ret) 593 - vmemmap_optimize_mode_switch(mode); 594 - mutex_unlock(&sysctl_mutex); 595 - 596 - return ret; 597 - } 598 - 599 188 static struct ctl_table hugetlb_vmemmap_sysctls[] = { 600 189 { 601 190 .procname = "hugetlb_optimize_vmemmap", 602 - .maxlen = sizeof(enum vmemmap_optimize_mode), 191 + .data = &vmemmap_optimize_enabled, 192 + .maxlen = sizeof(int), 603 193 .mode = 0644, 604 - .proc_handler = hugetlb_optimize_vmemmap_handler, 605 - 
.extra1 = SYSCTL_ZERO, 606 - .extra2 = SYSCTL_ONE, 194 + .proc_handler = proc_dobool, 607 195 }, 608 196 { } 609 197 }; 610 198 611 - static __init int hugetlb_vmemmap_sysctls_init(void) 199 + static int __init hugetlb_vmemmap_init(void) 612 200 { 613 - /* 614 - * If "struct page" crosses page boundaries, the vmemmap pages cannot 615 - * be optimized. 616 - */ 617 - if (is_power_of_2(sizeof(struct page))) 618 - register_sysctl_init("vm", hugetlb_vmemmap_sysctls); 201 + /* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */ 202 + BUILD_BUG_ON(__NR_USED_SUBPAGE * sizeof(struct page) > HUGETLB_VMEMMAP_RESERVE_SIZE); 619 203 204 + if (IS_ENABLED(CONFIG_PROC_SYSCTL)) { 205 + const struct hstate *h; 206 + 207 + for_each_hstate(h) { 208 + if (hugetlb_vmemmap_optimizable(h)) { 209 + register_sysctl_init("vm", hugetlb_vmemmap_sysctls); 210 + break; 211 + } 212 + } 213 + } 620 214 return 0; 621 215 } 622 - late_initcall(hugetlb_vmemmap_sysctls_init); 623 - #endif /* CONFIG_PROC_SYSCTL */ 216 + late_initcall(hugetlb_vmemmap_init);
+31 -16
mm/hugetlb_vmemmap.h
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* 3 - * Optimize vmemmap pages associated with HugeTLB 3 + * HugeTLB Vmemmap Optimization (HVO) 4 4 * 5 - * Copyright (c) 2020, Bytedance. All rights reserved. 5 + * Copyright (c) 2020, ByteDance. All rights reserved. 6 6 * 7 7 * Author: Muchun Song <songmuchun@bytedance.com> 8 8 */ ··· 11 11 #include <linux/hugetlb.h> 12 12 13 13 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP 14 - int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head); 15 - void hugetlb_vmemmap_free(struct hstate *h, struct page *head); 16 - void hugetlb_vmemmap_init(struct hstate *h); 14 + int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head); 15 + void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head); 17 16 18 17 /* 19 - * How many vmemmap pages associated with a HugeTLB page that can be 20 - * optimized and freed to the buddy allocator. 18 + * Reserve one vmemmap page; all tail vmemmap addresses are mapped to it. See 19 + * Documentation/vm/vmemmap_dedup.rst. 21 20 */ 22 - static inline unsigned int hugetlb_optimize_vmemmap_pages(struct hstate *h) 21 + #define HUGETLB_VMEMMAP_RESERVE_SIZE PAGE_SIZE 22 + 23 + static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h) 23 24 { 24 - return h->optimize_vmemmap_pages; 25 + return pages_per_huge_page(h) * sizeof(struct page); 26 + } 27 + 28 + /* 29 + * Return how much vmemmap (in bytes) associated with a HugeTLB page can be 30 + * optimized and freed to the buddy allocator. 31 + */ 32 + static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h) 33 + { 34 + int size = hugetlb_vmemmap_size(h) - HUGETLB_VMEMMAP_RESERVE_SIZE; 35 + 36 + if (!is_power_of_2(sizeof(struct page))) 37 + return 0; 38 + return size > 0 ? 
size : 0; 25 39 } 26 40 #else 27 - static inline int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head) 41 + static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head) 28 42 { 29 43 return 0; 30 44 } 31 45 32 - static inline void hugetlb_vmemmap_free(struct hstate *h, struct page *head) 46 + static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head) 33 47 { 34 48 } 35 49 36 - static inline void hugetlb_vmemmap_init(struct hstate *h) 37 - { 38 - } 39 - 40 - static inline unsigned int hugetlb_optimize_vmemmap_pages(struct hstate *h) 50 + static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h) 41 51 { 42 52 return 0; 43 53 } 44 54 #endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */ 55 + 56 + static inline bool hugetlb_vmemmap_optimizable(const struct hstate *h) 57 + { 58 + return hugetlb_vmemmap_optimizable_size(h) != 0; 59 + } 45 60 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
+148 -31
mm/memory-failure.c
··· 74 74 75 75 static bool hw_memory_failure __read_mostly = false; 76 76 77 - static bool __page_handle_poison(struct page *page) 77 + /* 78 + * Return values: 79 + * 1: the page is dissolved (if needed) and taken off from buddy, 80 + * 0: the page is dissolved (if needed) and not taken off from buddy, 81 + * < 0: failed to dissolve. 82 + */ 83 + static int __page_handle_poison(struct page *page) 78 84 { 79 85 int ret; 80 86 ··· 90 84 ret = take_page_off_buddy(page); 91 85 zone_pcp_enable(page_zone(page)); 92 86 93 - return ret > 0; 87 + return ret; 94 88 } 95 89 96 90 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release) ··· 100 94 * Doing this check for free pages is also fine since dissolve_free_huge_page 101 95 * returns 0 for non-hugetlb pages as well. 102 96 */ 103 - if (!__page_handle_poison(page)) 97 + if (__page_handle_poison(page) <= 0) 104 98 /* 105 99 * We could fail to take off the target page from buddy 106 100 * for example due to racy page allocation, but that's ··· 768 762 [MF_MSG_DIFFERENT_COMPOUND] = "different compound page after locking", 769 763 [MF_MSG_HUGE] = "huge page", 770 764 [MF_MSG_FREE_HUGE] = "free huge page", 771 - [MF_MSG_NON_PMD_HUGE] = "non-pmd-sized huge page", 772 765 [MF_MSG_UNMAP_FAILED] = "unmapping failed page", 773 766 [MF_MSG_DIRTY_SWAPCACHE] = "dirty swapcache page", 774 767 [MF_MSG_CLEAN_SWAPCACHE] = "clean swapcache page", ··· 1083 1078 res = truncate_error_page(hpage, page_to_pfn(p), mapping); 1084 1079 unlock_page(hpage); 1085 1080 } else { 1086 - res = MF_FAILED; 1087 1081 unlock_page(hpage); 1088 1082 /* 1089 1083 * migration entry prevents later access on error hugepage, ··· 1090 1086 * subpages. 
1091 1087 */ 1092 1088 put_page(hpage); 1093 - if (__page_handle_poison(p)) { 1089 + if (__page_handle_poison(p) >= 0) { 1094 1090 page_ref_inc(p); 1095 1091 res = MF_RECOVERED; 1092 + } else { 1093 + res = MF_FAILED; 1096 1094 } 1097 1095 } 1098 1096 ··· 1668 1662 EXPORT_SYMBOL_GPL(mf_dax_kill_procs); 1669 1663 #endif /* CONFIG_FS_DAX */ 1670 1664 1665 + #ifdef CONFIG_HUGETLB_PAGE 1666 + /* 1667 + * Struct raw_hwp_page represents information about a "raw error page", 1668 + * forming a singly linked list rooted at the ->private field of the 1669 + * SUBPAGE_INDEX_HWPOISON-th tail page. 1670 + */ 1671 + struct raw_hwp_page { 1672 + struct llist_node node; 1673 + struct page *page; 1674 + }; 1675 + 1676 + static inline struct llist_head *raw_hwp_list_head(struct page *hpage) 1677 + { 1678 + return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON); 1679 + } 1680 + 1681 + static unsigned long __free_raw_hwp_pages(struct page *hpage, bool move_flag) 1682 + { 1683 + struct llist_head *head; 1684 + struct llist_node *t, *tnode; 1685 + unsigned long count = 0; 1686 + 1687 + head = raw_hwp_list_head(hpage); 1688 + llist_for_each_safe(tnode, t, head->first) { 1689 + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); 1690 + 1691 + if (move_flag) 1692 + SetPageHWPoison(p->page); 1693 + kfree(p); 1694 + count++; 1695 + } 1696 + llist_del_all(head); 1697 + return count; 1698 + } 1699 + 1700 + static int hugetlb_set_page_hwpoison(struct page *hpage, struct page *page) 1701 + { 1702 + struct llist_head *head; 1703 + struct raw_hwp_page *raw_hwp; 1704 + struct llist_node *t, *tnode; 1705 + int ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; 1706 + 1707 + /* 1708 + * Once the hwpoison hugepage has lost reliable raw error info, 1709 + * there is little point in keeping additional error info precisely, 1710 + * so skip adding additional raw error info. 
1711 + */ 1712 + if (HPageRawHwpUnreliable(hpage)) 1713 + return -EHWPOISON; 1714 + head = raw_hwp_list_head(hpage); 1715 + llist_for_each_safe(tnode, t, head->first) { 1716 + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); 1717 + 1718 + if (p->page == page) 1719 + return -EHWPOISON; 1720 + } 1721 + 1722 + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC); 1723 + if (raw_hwp) { 1724 + raw_hwp->page = page; 1725 + llist_add(&raw_hwp->node, head); 1726 + /* the first error event will be counted in action_result(). */ 1727 + if (ret) 1728 + num_poisoned_pages_inc(); 1729 + } else { 1730 + /* 1731 + * Failed to save raw error info. We no longer trace all 1732 + * hwpoisoned subpages, and we have to refuse to free/dissolve 1733 + * this hwpoisoned hugepage. 1734 + */ 1735 + SetHPageRawHwpUnreliable(hpage); 1736 + /* 1737 + * Once HPageRawHwpUnreliable is set, raw_hwp_page is not 1738 + * used any more, so free it. 1739 + */ 1740 + __free_raw_hwp_pages(hpage, false); 1741 + } 1742 + return ret; 1743 + } 1744 + 1745 + static unsigned long free_raw_hwp_pages(struct page *hpage, bool move_flag) 1746 + { 1747 + /* 1748 + * HPageVmemmapOptimized hugepages can't be freed because struct 1749 + * pages for tail pages are required but they don't exist. 1750 + */ 1751 + if (move_flag && HPageVmemmapOptimized(hpage)) 1752 + return 0; 1753 + 1754 + /* 1755 + * HPageRawHwpUnreliable hugepages shouldn't be unpoisoned by 1756 + * definition. 1757 + */ 1758 + if (HPageRawHwpUnreliable(hpage)) 1759 + return 0; 1760 + 1761 + return __free_raw_hwp_pages(hpage, move_flag); 1762 + } 1763 + 1764 + void hugetlb_clear_page_hwpoison(struct page *hpage) 1765 + { 1766 + if (HPageRawHwpUnreliable(hpage)) 1767 + return; 1768 + ClearPageHWPoison(hpage); 1769 + free_raw_hwp_pages(hpage, true); 1770 + } 1771 + 1671 1772 /* 1672 1773 * Called from hugetlb code with hugetlb_lock held. 
1673 1774 * ··· 1806 1693 count_increased = true; 1807 1694 } else { 1808 1695 ret = -EBUSY; 1809 - goto out; 1696 + if (!(flags & MF_NO_RETRY)) 1697 + goto out; 1810 1698 } 1811 1699 1812 - if (TestSetPageHWPoison(head)) { 1700 + if (hugetlb_set_page_hwpoison(head, page)) { 1813 1701 ret = -EHWPOISON; 1814 1702 goto out; 1815 1703 } ··· 1822 1708 return ret; 1823 1709 } 1824 1710 1825 - #ifdef CONFIG_HUGETLB_PAGE 1826 1711 /* 1827 1712 * Taking refcount of hugetlb pages needs extra care about race conditions 1828 1713 * with basic operations like hugepage allocation/free/demotion. ··· 1834 1721 struct page *p = pfn_to_page(pfn); 1835 1722 struct page *head; 1836 1723 unsigned long page_flags; 1837 - bool retry = true; 1838 1724 1839 1725 *hugetlb = 1; 1840 1726 retry: ··· 1849 1737 } 1850 1738 return res; 1851 1739 } else if (res == -EBUSY) { 1852 - if (retry) { 1853 - retry = false; 1740 + if (!(flags & MF_NO_RETRY)) { 1741 + flags |= MF_NO_RETRY; 1854 1742 goto retry; 1855 1743 } 1856 1744 action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); ··· 1861 1749 lock_page(head); 1862 1750 1863 1751 if (hwpoison_filter(p)) { 1864 - ClearPageHWPoison(head); 1752 + hugetlb_clear_page_hwpoison(head); 1865 1753 res = -EOPNOTSUPP; 1866 1754 goto out; 1867 1755 } ··· 1872 1760 */ 1873 1761 if (res == 0) { 1874 1762 unlock_page(head); 1875 - res = MF_FAILED; 1876 - if (__page_handle_poison(p)) { 1763 + if (__page_handle_poison(p) >= 0) { 1877 1764 page_ref_inc(p); 1878 1765 res = MF_RECOVERED; 1766 + } else { 1767 + res = MF_FAILED; 1879 1768 } 1880 1769 action_result(pfn, MF_MSG_FREE_HUGE, res); 1881 1770 return res == MF_RECOVERED ? 0 : -EBUSY; 1882 1771 } 1883 1772 1884 1773 page_flags = head->flags; 1885 - 1886 - /* 1887 - * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so 1888 - * simply disable it. 
In order to make it work properly, we need 1889 - * make sure that: 1890 - * - conversion of a pud that maps an error hugetlb into hwpoison 1891 - * entry properly works, and 1892 - * - other mm code walking over page table is aware of pud-aligned 1893 - * hwpoison entries. 1894 - */ 1895 - if (huge_page_size(page_hstate(head)) > PMD_SIZE) { 1896 - action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED); 1897 - res = -EBUSY; 1898 - goto out; 1899 - } 1900 1774 1901 1775 if (!hwpoison_user_mappings(p, pfn, flags, head)) { 1902 1776 action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED); ··· 1902 1804 return 0; 1903 1805 } 1904 1806 1807 + static inline unsigned long free_raw_hwp_pages(struct page *hpage, bool flag) 1808 + { 1809 + return 0; 1810 + } 1905 1811 #endif /* CONFIG_HUGETLB_PAGE */ 1906 1812 1907 1813 static int memory_failure_dev_pagemap(unsigned long pfn, int flags, ··· 2311 2209 struct page *p; 2312 2210 int ret = -EBUSY; 2313 2211 int freeit = 0; 2212 + unsigned long count = 1; 2314 2213 static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL, 2315 2214 DEFAULT_RATELIMIT_BURST); 2316 2215 ··· 2359 2256 2360 2257 ret = get_hwpoison_page(p, MF_UNPOISON); 2361 2258 if (!ret) { 2259 + if (PageHuge(p)) { 2260 + count = free_raw_hwp_pages(page, false); 2261 + if (count == 0) { 2262 + ret = -EBUSY; 2263 + goto unlock_mutex; 2264 + } 2265 + } 2362 2266 ret = TestClearPageHWPoison(page) ? 
0 : -EBUSY; 2363 2267 } else if (ret < 0) { 2364 2268 if (ret == -EHWPOISON) { ··· 2374 2264 unpoison_pr_info("Unpoison: failed to grab page %#lx\n", 2375 2265 pfn, &unpoison_rs); 2376 2266 } else { 2267 + if (PageHuge(p)) { 2268 + count = free_raw_hwp_pages(page, false); 2269 + if (count == 0) { 2270 + ret = -EBUSY; 2271 + goto unlock_mutex; 2272 + } 2273 + } 2377 2274 freeit = !!TestClearPageHWPoison(p); 2378 2275 2379 2276 put_page(page); ··· 2393 2276 unlock_mutex: 2394 2277 mutex_unlock(&mf_mutex); 2395 2278 if (!ret || freeit) { 2396 - num_poisoned_pages_dec(); 2279 + num_poisoned_pages_sub(count); 2397 2280 unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n", 2398 2281 page_to_pfn(p), &unpoison_rs); 2399 2282 }
-399
mm/sparse-vmemmap.c
··· 27 27 #include <linux/spinlock.h> 28 28 #include <linux/vmalloc.h> 29 29 #include <linux/sched.h> 30 - #include <linux/pgtable.h> 31 - #include <linux/bootmem_info.h> 32 30 33 31 #include <asm/dma.h> 34 32 #include <asm/pgalloc.h> 35 - #include <asm/tlbflush.h> 36 - 37 - #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP 38 - /** 39 - * struct vmemmap_remap_walk - walk vmemmap page table 40 - * 41 - * @remap_pte: called for each lowest-level entry (PTE). 42 - * @nr_walked: the number of walked pte. 43 - * @reuse_page: the page which is reused for the tail vmemmap pages. 44 - * @reuse_addr: the virtual address of the @reuse_page page. 45 - * @vmemmap_pages: the list head of the vmemmap pages that can be freed 46 - * or is mapped from. 47 - */ 48 - struct vmemmap_remap_walk { 49 - void (*remap_pte)(pte_t *pte, unsigned long addr, 50 - struct vmemmap_remap_walk *walk); 51 - unsigned long nr_walked; 52 - struct page *reuse_page; 53 - unsigned long reuse_addr; 54 - struct list_head *vmemmap_pages; 55 - }; 56 - 57 - static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start) 58 - { 59 - pmd_t __pmd; 60 - int i; 61 - unsigned long addr = start; 62 - struct page *page = pmd_page(*pmd); 63 - pte_t *pgtable = pte_alloc_one_kernel(&init_mm); 64 - 65 - if (!pgtable) 66 - return -ENOMEM; 67 - 68 - pmd_populate_kernel(&init_mm, &__pmd, pgtable); 69 - 70 - for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) { 71 - pte_t entry, *pte; 72 - pgprot_t pgprot = PAGE_KERNEL; 73 - 74 - entry = mk_pte(page + i, pgprot); 75 - pte = pte_offset_kernel(&__pmd, addr); 76 - set_pte_at(&init_mm, addr, pte, entry); 77 - } 78 - 79 - spin_lock(&init_mm.page_table_lock); 80 - if (likely(pmd_leaf(*pmd))) { 81 - /* 82 - * Higher order allocations from buddy allocator must be able to 83 - * be treated as indepdenent small pages (as they can be freed 84 - * individually). 
85 - */ 86 - if (!PageReserved(page)) 87 - split_page(page, get_order(PMD_SIZE)); 88 - 89 - /* Make pte visible before pmd. See comment in pmd_install(). */ 90 - smp_wmb(); 91 - pmd_populate_kernel(&init_mm, pmd, pgtable); 92 - flush_tlb_kernel_range(start, start + PMD_SIZE); 93 - } else { 94 - pte_free_kernel(&init_mm, pgtable); 95 - } 96 - spin_unlock(&init_mm.page_table_lock); 97 - 98 - return 0; 99 - } 100 - 101 - static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start) 102 - { 103 - int leaf; 104 - 105 - spin_lock(&init_mm.page_table_lock); 106 - leaf = pmd_leaf(*pmd); 107 - spin_unlock(&init_mm.page_table_lock); 108 - 109 - if (!leaf) 110 - return 0; 111 - 112 - return __split_vmemmap_huge_pmd(pmd, start); 113 - } 114 - 115 - static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, 116 - unsigned long end, 117 - struct vmemmap_remap_walk *walk) 118 - { 119 - pte_t *pte = pte_offset_kernel(pmd, addr); 120 - 121 - /* 122 - * The reuse_page is found 'first' in table walk before we start 123 - * remapping (which is calling @walk->remap_pte). 124 - */ 125 - if (!walk->reuse_page) { 126 - walk->reuse_page = pte_page(*pte); 127 - /* 128 - * Because the reuse address is part of the range that we are 129 - * walking, skip the reuse address range. 
130 - */ 131 - addr += PAGE_SIZE; 132 - pte++; 133 - walk->nr_walked++; 134 - } 135 - 136 - for (; addr != end; addr += PAGE_SIZE, pte++) { 137 - walk->remap_pte(pte, addr, walk); 138 - walk->nr_walked++; 139 - } 140 - } 141 - 142 - static int vmemmap_pmd_range(pud_t *pud, unsigned long addr, 143 - unsigned long end, 144 - struct vmemmap_remap_walk *walk) 145 - { 146 - pmd_t *pmd; 147 - unsigned long next; 148 - 149 - pmd = pmd_offset(pud, addr); 150 - do { 151 - int ret; 152 - 153 - ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK); 154 - if (ret) 155 - return ret; 156 - 157 - next = pmd_addr_end(addr, end); 158 - vmemmap_pte_range(pmd, addr, next, walk); 159 - } while (pmd++, addr = next, addr != end); 160 - 161 - return 0; 162 - } 163 - 164 - static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr, 165 - unsigned long end, 166 - struct vmemmap_remap_walk *walk) 167 - { 168 - pud_t *pud; 169 - unsigned long next; 170 - 171 - pud = pud_offset(p4d, addr); 172 - do { 173 - int ret; 174 - 175 - next = pud_addr_end(addr, end); 176 - ret = vmemmap_pmd_range(pud, addr, next, walk); 177 - if (ret) 178 - return ret; 179 - } while (pud++, addr = next, addr != end); 180 - 181 - return 0; 182 - } 183 - 184 - static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr, 185 - unsigned long end, 186 - struct vmemmap_remap_walk *walk) 187 - { 188 - p4d_t *p4d; 189 - unsigned long next; 190 - 191 - p4d = p4d_offset(pgd, addr); 192 - do { 193 - int ret; 194 - 195 - next = p4d_addr_end(addr, end); 196 - ret = vmemmap_pud_range(p4d, addr, next, walk); 197 - if (ret) 198 - return ret; 199 - } while (p4d++, addr = next, addr != end); 200 - 201 - return 0; 202 - } 203 - 204 - static int vmemmap_remap_range(unsigned long start, unsigned long end, 205 - struct vmemmap_remap_walk *walk) 206 - { 207 - unsigned long addr = start; 208 - unsigned long next; 209 - pgd_t *pgd; 210 - 211 - VM_BUG_ON(!PAGE_ALIGNED(start)); 212 - VM_BUG_ON(!PAGE_ALIGNED(end)); 213 - 214 - pgd = 
pgd_offset_k(addr); 215 - do { 216 - int ret; 217 - 218 - next = pgd_addr_end(addr, end); 219 - ret = vmemmap_p4d_range(pgd, addr, next, walk); 220 - if (ret) 221 - return ret; 222 - } while (pgd++, addr = next, addr != end); 223 - 224 - /* 225 - * We only change the mapping of the vmemmap virtual address range 226 - * [@start + PAGE_SIZE, end), so we only need to flush the TLB which 227 - * belongs to the range. 228 - */ 229 - flush_tlb_kernel_range(start + PAGE_SIZE, end); 230 - 231 - return 0; 232 - } 233 - 234 - /* 235 - * Free a vmemmap page. A vmemmap page can be allocated from the memblock 236 - * allocator or buddy allocator. If the PG_reserved flag is set, it means 237 - * that it allocated from the memblock allocator, just free it via the 238 - * free_bootmem_page(). Otherwise, use __free_page(). 239 - */ 240 - static inline void free_vmemmap_page(struct page *page) 241 - { 242 - if (PageReserved(page)) 243 - free_bootmem_page(page); 244 - else 245 - __free_page(page); 246 - } 247 - 248 - /* Free a list of the vmemmap pages */ 249 - static void free_vmemmap_page_list(struct list_head *list) 250 - { 251 - struct page *page, *next; 252 - 253 - list_for_each_entry_safe(page, next, list, lru) { 254 - list_del(&page->lru); 255 - free_vmemmap_page(page); 256 - } 257 - } 258 - 259 - static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, 260 - struct vmemmap_remap_walk *walk) 261 - { 262 - /* 263 - * Remap the tail pages as read-only to catch illegal write operation 264 - * to the tail pages. 265 - */ 266 - pgprot_t pgprot = PAGE_KERNEL_RO; 267 - pte_t entry = mk_pte(walk->reuse_page, pgprot); 268 - struct page *page = pte_page(*pte); 269 - 270 - list_add_tail(&page->lru, walk->vmemmap_pages); 271 - set_pte_at(&init_mm, addr, pte, entry); 272 - } 273 - 274 - /* 275 - * How many struct page structs need to be reset. When we reuse the head 276 - * struct page, the special metadata (e.g. 
page->flags or page->mapping) 277 - * cannot copy to the tail struct page structs. The invalid value will be 278 - * checked in the free_tail_pages_check(). In order to avoid the message 279 - * of "corrupted mapping in tail page". We need to reset at least 3 (one 280 - * head struct page struct and two tail struct page structs) struct page 281 - * structs. 282 - */ 283 - #define NR_RESET_STRUCT_PAGE 3 284 - 285 - static inline void reset_struct_pages(struct page *start) 286 - { 287 - int i; 288 - struct page *from = start + NR_RESET_STRUCT_PAGE; 289 - 290 - for (i = 0; i < NR_RESET_STRUCT_PAGE; i++) 291 - memcpy(start + i, from, sizeof(*from)); 292 - } 293 - 294 - static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, 295 - struct vmemmap_remap_walk *walk) 296 - { 297 - pgprot_t pgprot = PAGE_KERNEL; 298 - struct page *page; 299 - void *to; 300 - 301 - BUG_ON(pte_page(*pte) != walk->reuse_page); 302 - 303 - page = list_first_entry(walk->vmemmap_pages, struct page, lru); 304 - list_del(&page->lru); 305 - to = page_to_virt(page); 306 - copy_page(to, (void *)walk->reuse_addr); 307 - reset_struct_pages(to); 308 - 309 - set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); 310 - } 311 - 312 - /** 313 - * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end) 314 - * to the page which @reuse is mapped to, then free vmemmap 315 - * which the range are mapped to. 316 - * @start: start address of the vmemmap virtual address range that we want 317 - * to remap. 318 - * @end: end address of the vmemmap virtual address range that we want to 319 - * remap. 320 - * @reuse: reuse address. 321 - * 322 - * Return: %0 on success, negative error code otherwise. 
323 - */ 324 - int vmemmap_remap_free(unsigned long start, unsigned long end, 325 - unsigned long reuse) 326 - { 327 - int ret; 328 - LIST_HEAD(vmemmap_pages); 329 - struct vmemmap_remap_walk walk = { 330 - .remap_pte = vmemmap_remap_pte, 331 - .reuse_addr = reuse, 332 - .vmemmap_pages = &vmemmap_pages, 333 - }; 334 - 335 - /* 336 - * In order to make remapping routine most efficient for the huge pages, 337 - * the routine of vmemmap page table walking has the following rules 338 - * (see more details from the vmemmap_pte_range()): 339 - * 340 - * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) 341 - * should be continuous. 342 - * - The @reuse address is part of the range [@reuse, @end) that we are 343 - * walking which is passed to vmemmap_remap_range(). 344 - * - The @reuse address is the first in the complete range. 345 - * 346 - * So we need to make sure that @start and @reuse meet the above rules. 347 - */ 348 - BUG_ON(start - reuse != PAGE_SIZE); 349 - 350 - mmap_read_lock(&init_mm); 351 - ret = vmemmap_remap_range(reuse, end, &walk); 352 - if (ret && walk.nr_walked) { 353 - end = reuse + walk.nr_walked * PAGE_SIZE; 354 - /* 355 - * vmemmap_pages contains pages from the previous 356 - * vmemmap_remap_range call which failed. These 357 - * are pages which were removed from the vmemmap. 358 - * They will be restored in the following call. 
359 - */ 360 - walk = (struct vmemmap_remap_walk) { 361 - .remap_pte = vmemmap_restore_pte, 362 - .reuse_addr = reuse, 363 - .vmemmap_pages = &vmemmap_pages, 364 - }; 365 - 366 - vmemmap_remap_range(reuse, end, &walk); 367 - } 368 - mmap_read_unlock(&init_mm); 369 - 370 - free_vmemmap_page_list(&vmemmap_pages); 371 - 372 - return ret; 373 - } 374 - 375 - static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, 376 - gfp_t gfp_mask, struct list_head *list) 377 - { 378 - unsigned long nr_pages = (end - start) >> PAGE_SHIFT; 379 - int nid = page_to_nid((struct page *)start); 380 - struct page *page, *next; 381 - 382 - while (nr_pages--) { 383 - page = alloc_pages_node(nid, gfp_mask, 0); 384 - if (!page) 385 - goto out; 386 - list_add_tail(&page->lru, list); 387 - } 388 - 389 - return 0; 390 - out: 391 - list_for_each_entry_safe(page, next, list, lru) 392 - __free_pages(page, 0); 393 - return -ENOMEM; 394 - } 395 - 396 - /** 397 - * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end) 398 - * to the page which is from the @vmemmap_pages 399 - * respectively. 400 - * @start: start address of the vmemmap virtual address range that we want 401 - * to remap. 402 - * @end: end address of the vmemmap virtual address range that we want to 403 - * remap. 404 - * @reuse: reuse address. 405 - * @gfp_mask: GFP flag for allocating vmemmap pages. 406 - * 407 - * Return: %0 on success, negative error code otherwise. 408 - */ 409 - int vmemmap_remap_alloc(unsigned long start, unsigned long end, 410 - unsigned long reuse, gfp_t gfp_mask) 411 - { 412 - LIST_HEAD(vmemmap_pages); 413 - struct vmemmap_remap_walk walk = { 414 - .remap_pte = vmemmap_restore_pte, 415 - .reuse_addr = reuse, 416 - .vmemmap_pages = &vmemmap_pages, 417 - }; 418 - 419 - /* See the comment in the vmemmap_remap_free(). 
*/ 420 - BUG_ON(start - reuse != PAGE_SIZE); 421 - 422 - if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages)) 423 - return -ENOMEM; 424 - 425 - mmap_read_lock(&init_mm); 426 - vmemmap_remap_range(reuse, end, &walk); 427 - mmap_read_unlock(&init_mm); 428 - 429 - return 0; 430 - } 431 - #endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */ 432 33 433 34 /* 434 35 * Allocate a block of memory to be used to back the virtual memory map
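The code removed above (moved into mm/hugetlb_vmemmap.c by this series) implements the HugeTLB Vmemmap Optimization: all tail vmemmap pages backing a hugetlb page's struct pages are remapped read-only onto a single reuse page, so only one backing page has to stay allocated. A standalone sketch of the savings arithmetic, assuming 4 KiB base pages and a 64-byte struct page (x86-64-like; the names here are illustrative, not kernel API):

```c
#include <assert.h>

/* Illustrative arithmetic only, not kernel code: how many vmemmap pages
 * HVO can return to the allocator per hugetlb page, assuming 4 KiB base
 * pages and a 64-byte struct page. */

#define MODEL_PAGE_SIZE      4096UL
#define MODEL_STRUCT_PAGE_SZ 64UL

static unsigned long vmemmap_pages_freed(unsigned long hugepage_size)
{
	unsigned long nr_struct_pages = hugepage_size / MODEL_PAGE_SIZE;
	unsigned long nr_vmemmap_pages =
		nr_struct_pages * MODEL_STRUCT_PAGE_SZ / MODEL_PAGE_SIZE;

	/* One vmemmap page is kept as the reuse page; the rest are
	 * remapped read-only onto it and can be freed. */
	return nr_vmemmap_pages - 1;
}
```

For a 2 MB hugetlb page this yields 512 struct pages, 8 vmemmap pages, 7 of which are freed, matching the "7 * PAGE_SIZE for each 2MB hugetlb page" figure in the `hugetlb_free_vmemmap=` parameter documentation; a 1 GB page frees 4095 of its 4096 vmemmap pages.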