
mm/hugetlb: fix set_max_huge_pages() when there are surplus pages

In set_max_huge_pages(), min_count is computed taking surplus huge pages
into account, which in some cases prevents huge pages from being freed
and leaves them accounted as surplus instead.

One way to solve this is to subtract surplus_huge_pages directly, but we
cannot do so blindly, because some surplus pages may also be free pages,
which happens when we fail to restore the vmemmap of HVO-optimized pages.
In that case we would be subtracting the same page twice.

To work around this, first compute the number of free persistent pages,
and use that along with the surplus pages to compute min_count.

Steps to reproduce:
1) create 5 hugetlb folios in Node0
2) run a program to use all the hugetlb folios
3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Since
   they are still in use, the 5 hugetlb folios in Node0 are accounted
   as surplus.
4) create 5 hugetlb folios in Node1
5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios

The result:
       Node0  Node1
Total      5      5
Free       0      5
Surp       5      5

The result with this patch:
       Node0  Node1
Total      5      0
Free       0      0
Surp       5      0

Link: https://lkml.kernel.org/r/20250409055957.3774471-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20250407124706.2688092-1-tujinjiang@huawei.com
Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

aabf58bf 60580e0b

+18 -1
mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3825,6 +3825,7 @@
 static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		nodemask_t *nodes_allowed)
 {
+	unsigned long persistent_free_count;
 	unsigned long min_count;
 	unsigned long allocated;
 	struct folio *folio;
@@ -3960,8 +3959,24 @@
 	 * though, we'll note that we're not allowed to exceed surplus
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
+	 *
+	 * min_count is the expected number of persistent pages, we
+	 * shouldn't calculate min_count by using
+	 * resv_huge_pages + persistent_huge_pages() - free_huge_pages,
+	 * because there may exist free surplus huge pages, and this will
+	 * lead to subtracting twice. Free surplus huge pages come from HVO
+	 * failing to restore vmemmap, see comments in the callers of
+	 * hugetlb_vmemmap_restore_folio(). Thus, we should calculate
+	 * persistent free count first.
 	 */
-	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
+	persistent_free_count = h->free_huge_pages;
+	if (h->free_huge_pages > persistent_huge_pages(h)) {
+		if (h->free_huge_pages > h->surplus_huge_pages)
+			persistent_free_count -= h->surplus_huge_pages;
+		else
+			persistent_free_count = 0;
+	}
+	min_count = h->resv_huge_pages + persistent_huge_pages(h) - persistent_free_count;
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count, nodes_allowed);