Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drivers/base/memory: introduce "memory groups" to logically group memory blocks

In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device. Examples of memory
devices include ACPI memory devices (in the simplest case a single DIMM)
and virtio-mem. For now, we don't have a connection between a single
memory block device and the real memory device. Each memory device
consists of 1..X memory block devices.

Let's logically group memory blocks belonging to the same memory device in
"memory groups". Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.
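The shape of such a group can be sketched in plain userspace C. This is a model mirroring the `struct memory_group` the patch adds to include/linux/memory.h (the kernel list head linking member blocks and the hotplug locking are elided; the helper names are hypothetical):

```c
#include <stdbool.h>

/*
 * Userspace model of the patch's "struct memory_group": only properties,
 * no physical ranges. The s/d union matches the static/dynamic split.
 */
struct memory_group_model {
	int nid;		/* node id for all blocks in the group */
	bool is_dynamic;	/* group type: static vs. dynamic */
	union {
		struct {
			unsigned long max_pages;  /* static groups only */
		} s;
		struct {
			unsigned long unit_pages; /* dynamic groups only */
		} d;
	};
};

/* Describe a static group, as an ACPI memory device (e.g., DIMM) would. */
struct memory_group_model make_static_group(int nid, unsigned long max_pages)
{
	struct memory_group_model g = {
		.nid = nid,
		.is_dynamic = false,
		.s = { .max_pages = max_pages },
	};
	return g;
}

/* Describe a dynamic group, as a virtio-mem device would. */
struct memory_group_model make_dynamic_group(int nid, unsigned long unit_pages)
{
	struct memory_group_model g = {
		.nid = nid,
		.is_dynamic = true,
		.d = { .unit_pages = unit_pages },
	};
	return g;
}
```

The union reflects that a group is either static or dynamic for its whole lifetime; only the property matching its type is meaningful.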

Introduce two memory group types:

1) Static memory group: E.g., a single ACPI memory device, consisting
of 1..X memory resources. A memory group consists of 1..Y memory
blocks. The whole group is added/removed in one go. If any part
cannot get offlined, the whole group cannot be removed.

2) Dynamic memory group: E.g., a single virtio-mem device. Memory is
dynamically added/removed in a fixed granularity, called a "unit",
consisting of 1..X memory blocks. A unit is added/removed in one go.
If any part of a unit cannot get offlined, the whole unit cannot be
removed.

In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
none. In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE. We want a single unit to be of the same type.
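The unit-size sanity check that `memory_group_register_dynamic()` performs in this patch can be modeled in userspace C. A sketch, where `MODEL_BLOCK_PAGES` stands in for `PHYS_PFN(memory_block_size_bytes())` (32768 pages equals 128 MiB with 4 KiB pages, a common x86-64 memory block size):

```c
#include <stdbool.h>

/* Stand-in for PHYS_PFN(memory_block_size_bytes()) on a typical x86-64 box. */
#define MODEL_BLOCK_PAGES 32768UL

/*
 * A dynamic group's unit must be non-zero, a power of two, and at least
 * one memory block, so units stay naturally aligned in physical address
 * space and can be onlined/offlined as a whole.
 */
bool unit_pages_valid(unsigned long unit_pages)
{
	/* same test as the kernel's is_power_of_2(): exactly one bit set */
	bool pow2 = unit_pages && !(unit_pages & (unit_pages - 1));

	return pow2 && unit_pages >= MODEL_BLOCK_PAGES;
}
```

A unit failing this check would either straddle memory block boundaries or be too small to map onto whole blocks, which would break the add/remove-in-one-go semantics described above.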

For now, memory groups are an internal concept that is not exposed to user
space; we might want to change that in the future, though.

add_memory() users can specify an mgid instead of a nid by passing the
MHP_NID_IS_MGID flag.
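The nid-vs-mgid resolution that add_memory_resource() gains in this patch can be modeled in userspace C. A sketch, where the small lookup table stands in for the kernel's memory_groups xarray and -1 stands in for -EINVAL:

```c
/* Userspace stand-in for the MHP_NID_IS_MGID mhp_t flag bit. */
#define MHP_NID_IS_MGID_MODEL (1UL << 2)

struct group_entry {
	int mgid;	/* memory group id */
	int nid;	/* node id implied by the group */
};

/* Pretend registry of registered memory groups: mgid -> nid. */
static const struct group_entry groups[] = {
	{ .mgid = 0, .nid = 1 },
	{ .mgid = 1, .nid = 0 },
};

/*
 * Resolve the effective node id: with the flag set, "nid" is really an
 * mgid and the group supplies the node id; returns -1 on a bad mgid.
 */
int resolve_nid(int nid, unsigned long mhp_flags)
{
	if (mhp_flags & MHP_NID_IS_MGID_MODEL) {
		unsigned int i;

		for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
			if (groups[i].mgid == nid)
				return groups[i].nid;
		return -1;	/* unknown mgid */
	}
	return nid;		/* plain node id, used unchanged */
}
```

This mirrors why the flag is safe: a driver registers the group (fixing the nid) before ever adding memory to it, so the lookup cannot race with unregistration.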

Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by David Hildenbrand, committed by Linus Torvalds
028fc57a e83a437f
+215 -6
drivers/base/memory.c (+155 -4)

···
 	 */
 static DEFINE_XARRAY(memory_blocks);
 
+/*
+ * Memory groups, indexed by memory group id (mgid).
+ */
+static DEFINE_XARRAY_FLAGS(memory_groups, XA_FLAGS_ALLOC);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
···
 }
 
 static int init_memory_block(unsigned long block_id, unsigned long state,
-			     unsigned long nr_vmemmap_pages)
+			     unsigned long nr_vmemmap_pages,
+			     struct memory_group *group)
 {
 	struct memory_block *mem;
 	int ret = 0;
···
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
 	mem->nr_vmemmap_pages = nr_vmemmap_pages;
+	INIT_LIST_HEAD(&mem->group_next);
+
+	if (group) {
+		mem->group = group;
+		list_add(&mem->group_next, &group->memory_blocks);
+	}
 
 	ret = register_memory(mem);
···
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE, 0);
+				 MEM_ONLINE, 0, NULL);
 }
 
 static void unregister_memory(struct memory_block *memory)
···
 		return;
 
 	WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+
+	if (memory->group) {
+		list_del(&memory->group_next);
+		memory->group = NULL;
+	}
 
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
···
  * Called under device_hotplug_lock.
  */
 int create_memory_block_devices(unsigned long start, unsigned long size,
-				unsigned long vmemmap_pages)
+				unsigned long vmemmap_pages,
+				struct memory_group *group)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
···
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages,
+					group);
 		if (ret)
 			break;
 	}
···
 
 	return bus_for_each_dev(&memory_subsys, NULL, &cb_data,
 				for_each_memory_block_cb);
+}
+
+/*
+ * This is an internal helper to unify allocation and initialization of
+ * memory groups. Note that the passed memory group will be copied to a
+ * dynamically allocated memory group. After this call, the passed
+ * memory group should no longer be used.
+ */
+static int memory_group_register(struct memory_group group)
+{
+	struct memory_group *new_group;
+	uint32_t mgid;
+	int ret;
+
+	if (!node_possible(group.nid))
+		return -EINVAL;
+
+	new_group = kzalloc(sizeof(group), GFP_KERNEL);
+	if (!new_group)
+		return -ENOMEM;
+	*new_group = group;
+	INIT_LIST_HEAD(&new_group->memory_blocks);
+
+	ret = xa_alloc(&memory_groups, &mgid, new_group, xa_limit_31b,
+		       GFP_KERNEL);
+	if (ret) {
+		kfree(new_group);
+		return ret;
+	}
+	return mgid;
+}
+
+/**
+ * memory_group_register_static() - Register a static memory group.
+ * @nid: The node id.
+ * @max_pages: The maximum number of pages we'll have in this static memory
+ *	       group.
+ *
+ * Register a new static memory group and return the memory group id.
+ * All memory in the group belongs to a single unit, such as a DIMM. All
+ * memory belonging to a static memory group is added in one go to be removed
+ * in one go -- it's static.
+ *
+ * Returns an error if out of memory, if the node id is invalid, if no new
+ * memory groups can be registered, or if max_pages is invalid (0). Otherwise,
+ * returns the new memory group id.
+ */
+int memory_group_register_static(int nid, unsigned long max_pages)
+{
+	struct memory_group group = {
+		.nid = nid,
+		.s = {
+			.max_pages = max_pages,
+		},
+	};
+
+	if (!max_pages)
+		return -EINVAL;
+	return memory_group_register(group);
+}
+EXPORT_SYMBOL_GPL(memory_group_register_static);
+
+/**
+ * memory_group_register_dynamic() - Register a dynamic memory group.
+ * @nid: The node id.
+ * @unit_pages: Unit in pages in which memory is added/removed in this
+ *		dynamic memory group.
+ *
+ * Register a new dynamic memory group and return the memory group id.
+ * Memory within a dynamic memory group is added/removed dynamically
+ * in unit_pages.
+ *
+ * Returns an error if out of memory, if the node id is invalid, if no new
+ * memory groups can be registered, or if unit_pages is invalid (0, not a
+ * power of two, smaller than a single memory block). Otherwise, returns the
+ * new memory group id.
+ */
+int memory_group_register_dynamic(int nid, unsigned long unit_pages)
+{
+	struct memory_group group = {
+		.nid = nid,
+		.is_dynamic = true,
+		.d = {
+			.unit_pages = unit_pages,
+		},
+	};
+
+	if (!unit_pages || !is_power_of_2(unit_pages) ||
+	    unit_pages < PHYS_PFN(memory_block_size_bytes()))
+		return -EINVAL;
+	return memory_group_register(group);
+}
+EXPORT_SYMBOL_GPL(memory_group_register_dynamic);
+
+/**
+ * memory_group_unregister() - Unregister a memory group.
+ * @mgid: the memory group id
+ *
+ * Unregister a memory group. If any memory block still belongs to this
+ * memory group, unregistering will fail.
+ *
+ * Returns -EINVAL if the memory group id is invalid, returns -EBUSY if some
+ * memory blocks still belong to this memory group and returns 0 if
+ * unregistering succeeded.
+ */
+int memory_group_unregister(int mgid)
+{
+	struct memory_group *group;
+
+	if (mgid < 0)
+		return -EINVAL;
+
+	group = xa_load(&memory_groups, mgid);
+	if (!group)
+		return -EINVAL;
+	if (!list_empty(&group->memory_blocks))
+		return -EBUSY;
+	xa_erase(&memory_groups, mgid);
+	kfree(group);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(memory_group_unregister);
+
+/*
+ * This is an internal helper only to be used in core memory hotplug code to
+ * lookup a memory group. We don't care about locking, as we don't expect a
+ * memory group to get unregistered while adding memory to it -- because
+ * the group and the memory is managed by the same driver.
+ */
+struct memory_group *memory_group_find_by_id(int mgid)
+{
+	return xa_load(&memory_groups, mgid);
 }
include/linux/memory.h (+45 -1)

···
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
+/**
+ * struct memory_group - a logical group of memory blocks
+ * @nid: The node id for all memory blocks inside the memory group.
+ * @memory_blocks: List of all memory blocks belonging to this memory group.
+ * @is_dynamic: The memory group type: static vs. dynamic
+ * @s.max_pages: Valid with &memory_group.is_dynamic == false. The maximum
+ *		 number of pages we'll have in this static memory group.
+ * @d.unit_pages: Valid with &memory_group.is_dynamic == true. Unit in pages
+ *		  in which memory is added/removed in this dynamic memory
+ *		  group. This granularity defines the alignment of a unit in
+ *		  physical address space; it has to be at least as big as a
+ *		  single memory block.
+ *
+ * A memory group logically groups memory blocks; each memory block
+ * belongs to at most one memory group. A memory group corresponds to
+ * a memory device, such as a DIMM or a NUMA node, which spans multiple
+ * memory blocks and might even span multiple non-contiguous physical memory
+ * ranges.
+ *
+ * Modification of members after registration is serialized by memory
+ * hot(un)plug code.
+ */
+struct memory_group {
+	int nid;
+	struct list_head memory_blocks;
+	bool is_dynamic;
+	union {
+		struct {
+			unsigned long max_pages;
+		} s;
+		struct {
+			unsigned long unit_pages;
+		} d;
+	};
+};
+
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long state;		/* serialized by the dev->lock */
···
 	 * lay at the beginning of the memory block.
 	 */
 	unsigned long nr_vmemmap_pages;
+	struct memory_group *group;	/* group (if any) for this block */
+	struct list_head group_next;	/* next block inside memory group */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
···
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size,
-				unsigned long vmemmap_pages);
+				unsigned long vmemmap_pages,
+				struct memory_group *group);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
···
 			  void *arg, walk_memory_blocks_func_t func);
 extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
+
+extern int memory_group_register_static(int nid, unsigned long max_pages);
+extern int memory_group_register_dynamic(int nid, unsigned long unit_pages);
+extern int memory_group_unregister(int mgid);
+struct memory_group *memory_group_find_by_id(int mgid);
 #endif	/* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
include/linux/memory_hotplug.h (+5)

···
  * Only selected architectures support it with SPARSE_VMEMMAP.
  */
 #define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+/*
+ * The nid field specifies a memory group id (mgid) instead. The memory group
+ * implies the node id (nid).
+ */
+#define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
 
 /*
  * Extended parameters for memory hotplug:
mm/memory_hotplug.c (+10 -1)

···
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
 	struct vmem_altmap mhp_altmap = {};
+	struct memory_group *group = NULL;
 	u64 start, size;
 	bool new_node = false;
 	int ret;
···
 	ret = check_hotplug_memory_range(start, size);
 	if (ret)
 		return ret;
+
+	if (mhp_flags & MHP_NID_IS_MGID) {
+		group = memory_group_find_by_id(nid);
+		if (!group)
+			return -EINVAL;
+		nid = group->nid;
+	}
 
 	if (!node_possible(nid)) {
 		WARN(1, "node %d was absent from the node_possible_map\n", nid);
···
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
+					  group);
 	if (ret) {
 		arch_remove_memory(start, size, NULL);
 		goto error;