hugetlb: introduce nr_overcommit_hugepages sysctl

While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:

1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.

2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.

To ease administration, and to help pave the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition

nr_overcommit_hugepages > 0

indicates the same administrative setting as

hugetlb_dynamic_pool == 1

Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
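
To make the watermark semantics concrete, here is a minimal standalone
C model (my sketch, not kernel code; the helper names are hypothetical).
It encodes the check the patch performs and the boolean equivalence
noted above:

#include <stdbool.h>

/* Toy model of the pool watermarks; counters mirror mm/hugetlb.c. */
static unsigned long nr_hugepages;            /* low watermark: static pool */
static unsigned long nr_overcommit_hugepages; /* high watermark: cap on surplus */
static unsigned long surplus_huge_pages;      /* surplus pages currently allocated */

/* May another surplus huge page be allocated? (the patch's check) */
static bool may_alloc_surplus(void)
{
	return surplus_huge_pages < nr_overcommit_hugepages;
}

/* The boolean sysctl is subsumed: hugetlb_dynamic_pool == 1 ... */
static bool dynamic_pool_enabled(void)
{
	return nr_overcommit_hugepages > 0;   /* ... iff overcommit > 0 */
}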

A few caveats, both modeled in a short sketch after this list:

1) There is a race whereby the global surplus huge page counter is
incremented before a huge page has been allocated. Another process could
then try to grow the pool, fail to convert a surplus huge page to a
normal huge page, and instead allocate a fresh huge page. I believe this
is benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.

2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.
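
Both caveats reduce to counter arithmetic. The following standalone C
program (my toy model, not kernel code) shows the optimistic increment
with rollback from caveat 1, and the surplus-exceeds-overcommit state
from caveat 2:

#include <stdio.h>
#include <stdbool.h>

static unsigned long nr_huge_pages;           /* total huge pages in the pool */
static unsigned long surplus_huge_pages;      /* dynamically allocated pages */
static unsigned long nr_overcommit_huge_pages = 2;

/*
 * Caveat 1 in miniature: the counters are bumped before the page
 * exists and rolled back if the allocation fails.
 */
static bool alloc_surplus(bool alloc_succeeds)
{
	if (surplus_huge_pages >= nr_overcommit_huge_pages)
		return false;                 /* at the high watermark */
	nr_huge_pages++;                      /* optimistic increment */
	surplus_huge_pages++;
	if (!alloc_succeeds) {                /* undo on failure */
		nr_huge_pages--;
		surplus_huge_pages--;
		return false;
	}
	return true;
}

int main(void)
{
	/* Fill the surplus pool to its cap of 2. */
	alloc_surplus(true);
	alloc_surplus(true);

	/*
	 * Caveat 2 in miniature: shrinking the static pool converts
	 * in-use pages to surplus, pushing surplus past overcommit.
	 */
	surplus_huge_pages += 4;   /* e.g. static pool shrunk by 4 in-use pages */

	printf("surplus=%lu overcommit=%lu -> further surplus allocation %s\n",
	       surplus_huge_pages, nr_overcommit_huge_pages,
	       alloc_surplus(true) ? "allowed" : "refused");
	return 0;
}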

Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

---
 include/linux/hugetlb.h |    1 +
 kernel/sysctl.c         |    8 ++++++++
 mm/hugetlb.c            |   67 +++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ ... @@
 extern unsigned long max_huge_pages;
 extern unsigned long hugepages_treat_as_movable;
 extern int hugetlb_dynamic_pool;
+extern unsigned long nr_overcommit_huge_pages;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ ... @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "nr_overcommit_hugepages",
+		.data		= &nr_overcommit_huge_pages,
+		.maxlen		= sizeof(nr_overcommit_huge_pages),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
 #endif
 	{
 		.ctl_name	= VM_LOWMEM_RESERVE_RATIO,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ ... @@
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 int hugetlb_dynamic_pool;
+unsigned long nr_overcommit_huge_pages;
 static int hugetlb_next_nid;
@@ ... @@
 				unsigned long address)
 {
 	struct page *page;
+	unsigned int nid;
 
 	/* Check if the dynamic pool is enabled */
 	if (!hugetlb_dynamic_pool)
 		return NULL;
 
+	/*
+	 * Assume we will successfully allocate the surplus page to
+	 * prevent racing processes from causing the surplus to exceed
+	 * overcommit
+	 *
+	 * This however introduces a different race, where a process B
+	 * tries to grow the static hugepage pool while alloc_pages() is
+	 * called by process A. B will only examine the per-node
+	 * counters in determining if surplus huge pages can be
+	 * converted to normal huge pages in adjust_pool_surplus(). A
+	 * won't be able to increment the per-node counter, until the
+	 * lock is dropped by B, but B doesn't drop hugetlb_lock until
+	 * no more huge pages can be converted from surplus to normal
+	 * state (and doesn't try to convert again). Thus, we have a
+	 * case where a surplus huge page exists, the pool is grown, and
+	 * the surplus huge page still exists after, even though it
+	 * should just have been converted to a normal huge page. This
+	 * does not leak memory, though, as the hugepage will be freed
+	 * once it is out of use. It also does not allow the counters to
+	 * go out of whack in adjust_pool_surplus() as we don't modify
+	 * the node values until we've gotten the hugepage and only the
+	 * per-node value is checked there.
+	 */
+	spin_lock(&hugetlb_lock);
+	if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+		spin_unlock(&hugetlb_lock);
+		return NULL;
+	} else {
+		nr_huge_pages++;
+		surplus_huge_pages++;
+	}
+	spin_unlock(&hugetlb_lock);
+
 	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
 					HUGETLB_PAGE_ORDER);
+
+	spin_lock(&hugetlb_lock);
 	if (page) {
+		nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
-		spin_lock(&hugetlb_lock);
-		nr_huge_pages++;
-		nr_huge_pages_node[page_to_nid(page)]++;
-		surplus_huge_pages++;
-		surplus_huge_pages_node[page_to_nid(page)]++;
-		spin_unlock(&hugetlb_lock);
+		/*
+		 * We incremented the global counters already
+		 */
+		nr_huge_pages_node[nid]++;
+		surplus_huge_pages_node[nid]++;
+	} else {
+		nr_huge_pages--;
+		surplus_huge_pages--;
 	}
+	spin_unlock(&hugetlb_lock);
 
 	return page;
 }
@@ ... @@
 	 * Increase the pool size
 	 * First take pages out of surplus state. Then make up the
 	 * remaining difference by allocating fresh huge pages.
+	 *
+	 * We might race with alloc_buddy_huge_page() here and be unable
+	 * to convert a surplus huge page to a normal huge page. That is
+	 * not critical, though, it just means the overall size of the
+	 * pool might be one hugepage larger than it needs to be, but
+	 * within all the constraints specified by the sysctls.
 	 */
 	spin_lock(&hugetlb_lock);
 	while (surplus_huge_pages && count > persistent_huge_pages) {
@@ ... @@
 	 * to keep enough around to satisfy reservations). Then place
 	 * pages into surplus state as needed so the pool will shrink
 	 * to the desired size as pages become free.
+	 *
+	 * By placing pages into the surplus state independent of the
+	 * overcommit value, we are allowing the surplus pool size to
+	 * exceed overcommit. There are few sane options here. Since
+	 * alloc_buddy_huge_page() is checking the global counter,
+	 * though, we'll note that we're not allowed to exceed surplus
+	 * and won't grow the pool anywhere else. Not until one of the
+	 * sysctls are changed, or the surplus pages go out of use.
+	 */
 	min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
 	min_count = max(count, min_count);
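
For completeness, a userspace smoke test along these lines (my sketch,
not part of the patch; assumes root on a patched kernel) exercises the
file created by the sysctl entry above:

#include <stdio.h>
#include <stdlib.h>

#define OVERCOMMIT "/proc/sys/vm/nr_overcommit_hugepages"

int main(void)
{
	unsigned long val;
	FILE *f = fopen(OVERCOMMIT, "w");

	if (!f) {
		perror("fopen(" OVERCOMMIT ")");
		return EXIT_FAILURE;
	}
	fprintf(f, "4\n");            /* allow up to 4 surplus huge pages */
	fclose(f);

	f = fopen(OVERCOMMIT, "r");
	if (!f || fscanf(f, "%lu", &val) != 1) {
		perror("read " OVERCOMMIT);
		return EXIT_FAILURE;
	}
	fclose(f);

	printf("nr_overcommit_hugepages = %lu\n", val);
	return 0;
}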