Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memory_hotplug: introduce "auto-movable" online policy

When onlining without specifying a zone (using "online" instead of
"online_kernel" or "online_movable"), we currently select a zone such that
existing zones are kept contiguous. This online policy made sense in the
past, when contiguous zones were required.

We'd like to implement smarter policies, however:

* User space has little insight. As one example, it has no idea which
memory blocks logically belong together (e.g., to a DIMM or to a
virtio-mem device).

* Drivers that add memory in separate memory blocks, especially
virtio-mem, want memory to get onlined right from the kernel when
adding.

So we really want to have onlining to differing zones managed in the
kernel, configured by user space.

We see more and more cases where we might eventually hotplug a lot of
memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB),
however:

* Resizing happens dynamically, in smaller steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...)

* We still want as much flexibility as possible, especially,
hotunplugging as much memory as possible later.

We can really only use "online_movable" if we know the amount of memory
we are going to hotplug upfront, and we know that it won't result
in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64
GiB could currently not use "online_movable", and instead, "online_kernel"
would have to be used, resulting in worse (no) memory hotunplug
reliability.

Let's add a new "auto-movable" online policy that considers the current
zone ratios (global, per-node) to determine whether a memory block can
be onlined to ZONE_MOVABLE:

MOVABLE : KERNEL

However, internally we'll only consider the following ratio for now:

MOVABLE : KERNEL_EARLY

For now, we don't allow for hotplugged KERNEL memory to allow for more
MOVABLE memory, because there is no coordination across memory devices.
In follow-up patches, we will allow for more KERNEL memory within a memory
device to allow for more MOVABLE memory within the same memory device --
which only makes sense for special memory device types.

We base our calculation on "present pages", see the code comments for
details. Hotplugged memory will get onlined to ZONE_MOVABLE if the
configured ratio allows for it. Depending on the setup, this can result
in fragmented zones, which can make compaction slower and dynamic
allocation of gigantic pages when not using CMA less reliable (... which
is already pretty unreliable).
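
To illustrate the effect of basing the ratio on KERNEL_EARLY, here is a
minimal user-space sketch of the check, not the kernel implementation. It
assumes 4 KiB pages; the helper count_movable_blocks() and the macro
PAGES_PER_GIB are hypothetical names made up for this example, and the
ratio is passed in as a plain parameter.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_GIB (1ULL << 18)	/* assumption: 4 KiB pages */

/* Simplified ratio test: may nr_pages more pages go to ZONE_MOVABLE? */
static bool can_online_movable(unsigned long long kernel_early_pages,
			       unsigned long long movable_pages,
			       unsigned long long nr_pages,
			       unsigned int ratio_percent)
{
	return movable_pages + nr_pages <=
	       (ratio_percent * kernel_early_pages) / 100;
}

/*
 * Hypothetical helper: hotplug nr_blocks memory blocks one at a time and
 * return how many of them end up in ZONE_MOVABLE.
 */
static unsigned int count_movable_blocks(unsigned long long kernel_early_pages,
					 unsigned long long block_pages,
					 unsigned int nr_blocks,
					 unsigned int ratio_percent)
{
	unsigned long long movable_pages = 0;
	unsigned int i, movable_blocks = 0;

	assert(block_pages > 0);

	for (i = 0; i < nr_blocks; i++) {
		if (can_online_movable(kernel_early_pages, movable_pages,
				       block_pages, ratio_percent)) {
			movable_pages += block_pages;
			movable_blocks++;
		}
		/*
		 * Otherwise the block goes to a kernel zone. Hotplugged
		 * KERNEL memory does not raise the limit, so nothing else
		 * changes here.
		 */
	}
	return movable_blocks;
}
```

With these assumptions, a VM booted with 2 GiB that hotplugs 496 blocks of
128 MiB (growing to 64 GiB) at a ratio of 301 gets 48 blocks (6 GiB) in
ZONE_MOVABLE; all later blocks land in a kernel zone.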

The old policy will be the default and called "contig-zones". In
follow-up patches, our new policy will use additional information, such as
memory groups, to make even smarter decisions across memory blocks.

Configuration:

* memory_hotplug.online_policy is used to switch between both policies
and defaults to "contig-zones".

* memory_hotplug.auto_movable_ratio defines the maximum ratio in
percent and defaults to "301" -- allowing, e.g., most 8 GiB machines to
grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The
additional percent accounts for a handful of lost present pages (e.g.,
firmware allocations). User space is expected to adjust this ratio when
enabling the new "auto-movable" policy, though.

* memory_hotplug.auto_movable_numa_aware considers NUMA node stats in
addition to global stats, and defaults to "true".
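
The arithmetic behind the default of "301" can be sketched as follows.
This is a user-space illustration under stated assumptions, not kernel
code: 4 KiB pages, and a made-up figure of 4096 present pages (16 MiB)
lost to firmware allocations. The helpers fits_ratio() and
min_ratio_percent() are hypothetical names for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_GIB (1ULL << 18)	/* assumption: 4 KiB pages */

/* Does movable_pages stay within ratio_percent of kernel_early_pages? */
static bool fits_ratio(unsigned long long kernel_early_pages,
		       unsigned long long movable_pages,
		       unsigned int ratio_percent)
{
	return movable_pages <= (ratio_percent * kernel_early_pages) / 100;
}

/* Smallest ratio (in percent) under which movable_pages still fits. */
static unsigned int min_ratio_percent(unsigned long long kernel_early_pages,
				      unsigned long long movable_pages)
{
	assert(kernel_early_pages > 0);

	/* Round up, because the limit itself is rounded down. */
	return (unsigned int)((movable_pages * 100 + kernel_early_pages - 1) /
			      kernel_early_pages);
}
```

Growing from exactly 8 GiB to 32 GiB needs a ratio of 300 (24 GiB MOVABLE
against 8 GiB KERNEL_EARLY). Once the 4096 assumed pages are missing from
the boot memory, 300 no longer suffices and 301 is the smallest ratio that
still lets all 24 GiB go to ZONE_MOVABLE.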

Note: just like the old policy, the new policy won't take things like
unmovable huge pages or memory ballooning that doesn't support balloon
compaction into account. User space has to configure onlining
accordingly.

Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by David Hildenbrand and committed by Linus Torvalds
e83a437f 4b097002

+191
mm/memory_hotplug.c
···
 MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
 #endif
 
+enum {
+	ONLINE_POLICY_CONTIG_ZONES = 0,
+	ONLINE_POLICY_AUTO_MOVABLE,
+};
+
+const char *online_policy_to_str[] = {
+	[ONLINE_POLICY_CONTIG_ZONES] = "contig-zones",
+	[ONLINE_POLICY_AUTO_MOVABLE] = "auto-movable",
+};
+
+static int set_online_policy(const char *val, const struct kernel_param *kp)
+{
+	int ret = sysfs_match_string(online_policy_to_str, val);
+
+	if (ret < 0)
+		return ret;
+	*((int *)kp->arg) = ret;
+	return 0;
+}
+
+static int get_online_policy(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%s\n", online_policy_to_str[*((int *)kp->arg)]);
+}
+
+/*
+ * memory_hotplug.online_policy: configure online behavior when onlining without
+ * specifying a zone (MMOP_ONLINE)
+ *
+ * "contig-zones": keep zone contiguous
+ * "auto-movable": online memory to ZONE_MOVABLE if the configuration
+ *                 (auto_movable_ratio, auto_movable_numa_aware) allows for it
+ */
+static int online_policy __read_mostly = ONLINE_POLICY_CONTIG_ZONES;
+static const struct kernel_param_ops online_policy_ops = {
+	.set = set_online_policy,
+	.get = get_online_policy,
+};
+module_param_cb(online_policy, &online_policy_ops, &online_policy, 0644);
+MODULE_PARM_DESC(online_policy,
+		"Set the online policy (\"contig-zones\", \"auto-movable\") "
+		"Default: \"contig-zones\"");
+
+/*
+ * memory_hotplug.auto_movable_ratio: specify maximum MOVABLE:KERNEL ratio
+ *
+ * The ratio represents an upper limit and the kernel might decide to not
+ * online some memory to ZONE_MOVABLE -- e.g., because hotplugged KERNEL memory
+ * doesn't allow for more MOVABLE memory.
+ */
+static unsigned int auto_movable_ratio __read_mostly = 301;
+module_param(auto_movable_ratio, uint, 0644);
+MODULE_PARM_DESC(auto_movable_ratio,
+		"Set the maximum ratio of MOVABLE:KERNEL memory in the system "
+		"in percent for \"auto-movable\" online policy. Default: 301");
+
+/*
+ * memory_hotplug.auto_movable_numa_aware: consider numa node stats
+ */
+#ifdef CONFIG_NUMA
+static bool auto_movable_numa_aware __read_mostly = true;
+module_param(auto_movable_numa_aware, bool, 0644);
+MODULE_PARM_DESC(auto_movable_numa_aware,
+		"Consider numa node stats in addition to global stats in "
+		"\"auto-movable\" online policy. Default: true");
+#endif /* CONFIG_NUMA */
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
···
 	set_zone_contiguous(zone);
 }
 
+struct auto_movable_stats {
+	unsigned long kernel_early_pages;
+	unsigned long movable_pages;
+};
+
+static void auto_movable_stats_account_zone(struct auto_movable_stats *stats,
+					    struct zone *zone)
+{
+	if (zone_idx(zone) == ZONE_MOVABLE) {
+		stats->movable_pages += zone->present_pages;
+	} else {
+		stats->kernel_early_pages += zone->present_early_pages;
+#ifdef CONFIG_CMA
+		/*
+		 * CMA pages (never on hotplugged memory) behave like
+		 * ZONE_MOVABLE.
+		 */
+		stats->movable_pages += zone->cma_pages;
+		stats->kernel_early_pages -= zone->cma_pages;
+#endif /* CONFIG_CMA */
+	}
+}
+
+static bool auto_movable_can_online_movable(int nid, unsigned long nr_pages)
+{
+	struct auto_movable_stats stats = {};
+	unsigned long kernel_early_pages, movable_pages;
+	pg_data_t *pgdat = NODE_DATA(nid);
+	struct zone *zone;
+	int i;
+
+	/* Walk all relevant zones and collect MOVABLE vs. KERNEL stats. */
+	if (nid == NUMA_NO_NODE) {
+		/* TODO: cache values */
+		for_each_populated_zone(zone)
+			auto_movable_stats_account_zone(&stats, zone);
+	} else {
+		for (i = 0; i < MAX_NR_ZONES; i++) {
+			zone = pgdat->node_zones + i;
+			if (populated_zone(zone))
+				auto_movable_stats_account_zone(&stats, zone);
+		}
+	}
+
+	kernel_early_pages = stats.kernel_early_pages;
+	movable_pages = stats.movable_pages;
+
+	/*
+	 * Test if we could online the given number of pages to ZONE_MOVABLE
+	 * and still stay in the configured ratio.
+	 */
+	movable_pages += nr_pages;
+	return movable_pages <= (auto_movable_ratio * kernel_early_pages) / 100;
+}
+
 /*
  * Returns a default kernel memory zone for the given pfn range.
  * If no kernel zone covers this pfn range it will automatically go
···
 	}
 
 	return &pgdat->node_zones[ZONE_NORMAL];
+}
+
+/*
+ * Determine to which zone to online memory dynamically based on user
+ * configuration and system stats. We care about the following ratio:
+ *
+ *   MOVABLE : KERNEL
+ *
+ * Whereby MOVABLE is memory in ZONE_MOVABLE and KERNEL is memory in
+ * one of the kernel zones. CMA pages inside one of the kernel zones really
+ * behave like ZONE_MOVABLE, so we treat them accordingly.
+ *
+ * We don't allow for hotplugged memory in a KERNEL zone to increase the
+ * amount of MOVABLE memory we can have, so we end up with:
+ *
+ *   MOVABLE : KERNEL_EARLY
+ *
+ * Whereby KERNEL_EARLY is memory in one of the kernel zones, available since
+ * boot. We base our calculation on KERNEL_EARLY internally, because:
+ *
+ * a) Hotplugged memory in one of the kernel zones can sometimes still get
+ *    hotunplugged, especially when hot(un)plugging individual memory blocks.
+ *    There is no coordination across memory devices, therefore "automatic"
+ *    hotunplugging, as implemented in hypervisors, could result in zone
+ *    imbalances.
+ * b) Early/boot memory in one of the kernel zones can usually not get
+ *    hotunplugged again (e.g., no firmware interface to unplug, fragmented
+ *    with unmovable allocations). While there are corner cases where it might
+ *    still work, it is barely relevant in practice.
+ *
+ * We rely on "present pages" instead of "managed pages", as the latter is
+ * highly unreliable and dynamic in virtualized environments, and does not
+ * consider boot time allocations. For example, memory ballooning adjusts the
+ * managed pages when inflating/deflating the balloon, and balloon compaction
+ * can even migrate inflated pages between zones.
+ *
+ * Using "present pages" is better but some things to keep in mind are:
+ *
+ * a) Some memblock allocations, such as for the crashkernel area, are
+ *    effectively unused by the kernel, yet they account to "present pages".
+ *    Fortunately, these allocations are comparatively small in relevant setups
+ *    (e.g., fraction of system memory).
+ * b) Some hotplugged memory blocks in virtualized environments, especially
+ *    hotplugged by virtio-mem, look like they are completely present, however,
+ *    only parts of the memory block are actually currently usable.
+ *    "present pages" is an upper limit that can get reached at runtime. As
+ *    we base our calculations on KERNEL_EARLY, this is not an issue.
+ */
+static struct zone *auto_movable_zone_for_pfn(int nid, unsigned long pfn,
+					      unsigned long nr_pages)
+{
+	if (!auto_movable_ratio)
+		goto kernel_zone;
+
+	if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
+		goto kernel_zone;
+
+#ifdef CONFIG_NUMA
+	if (auto_movable_numa_aware &&
+	    !auto_movable_can_online_movable(nid, nr_pages))
+		goto kernel_zone;
+#endif /* CONFIG_NUMA */
+
+	return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+kernel_zone:
+	return default_kernel_zone_for_pfn(nid, pfn, nr_pages);
 }
 
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
···
 
 	if (online_type == MMOP_ONLINE_MOVABLE)
 		return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+
+	if (online_policy == ONLINE_POLICY_AUTO_MOVABLE)
+		return auto_movable_zone_for_pfn(nid, start_pfn, nr_pages);
 
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
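
The two-stage decision in auto_movable_zone_for_pfn() (global stats first,
then per-node stats if NUMA awareness is enabled) can be modelled in user
space roughly as follows. This is a simplified sketch and not the kernel
code: struct zone_stats, enum target_zone and pick_zone() are hypothetical
names, the per-node numbers are passed in as a plain array, and CMA
accounting is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node summary of present pages. */
struct zone_stats {
	unsigned long long kernel_early_pages;
	unsigned long long movable_pages;
};

enum target_zone { TARGET_KERNEL, TARGET_MOVABLE };

/* Ratio test against one set of stats (global or a single node). */
static bool can_online_movable(const struct zone_stats *stats,
			       unsigned long long nr_pages,
			       unsigned int ratio_percent)
{
	return stats->movable_pages + nr_pages <=
	       (ratio_percent * stats->kernel_early_pages) / 100;
}

/* Pick the zone type for nr_pages hotplugged to node nid. */
static enum target_zone pick_zone(const struct zone_stats *nodes, int nr_nodes,
				  int nid, unsigned long long nr_pages,
				  unsigned int ratio_percent, bool numa_aware)
{
	struct zone_stats global = { 0, 0 };
	int i;

	assert(nid >= 0 && nid < nr_nodes);

	/* A ratio of 0 never allows ZONE_MOVABLE. */
	if (!ratio_percent)
		return TARGET_KERNEL;

	/* Stage 1: the system-wide ratio must hold. */
	for (i = 0; i < nr_nodes; i++) {
		global.kernel_early_pages += nodes[i].kernel_early_pages;
		global.movable_pages += nodes[i].movable_pages;
	}
	if (!can_online_movable(&global, nr_pages, ratio_percent))
		return TARGET_KERNEL;

	/* Stage 2: optionally, the ratio must also hold on the target node. */
	if (numa_aware && !can_online_movable(&nodes[nid], nr_pages, ratio_percent))
		return TARGET_KERNEL;

	return TARGET_MOVABLE;
}
```

In this model, a node without any boot memory never receives ZONE_MOVABLE
memory while NUMA awareness is on, even if the global ratio would allow
it; turning NUMA awareness off leaves only the global check.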