Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

drivers/base/memory: determine and store zone for single-zone memory blocks

test_pages_in_a_zone() is just another nasty PFN walker that can easily
stumble over ZONE_DEVICE memory ranges falling into the same memory block
as ordinary system RAM: the memmap of parts of these ranges might possibly
be uninitialized. In fact, we observed (on an older kernel) with UBSAN:

UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
index 7 is out of range for type 'zone [5]'
CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
Call Trace:
dump_stack+0x9a/0xf0
ubsan_epilogue+0x9/0x7a
__ubsan_handle_out_of_bounds+0x13a/0x181
test_pages_in_a_zone+0x3c4/0x500
show_valid_zones+0x1fa/0x380
dev_attr_show+0x43/0xb0
sysfs_kf_seq_show+0x1c5/0x440
seq_read+0x49d/0x1190
vfs_read+0xff/0x300
ksys_read+0xb8/0x170
do_syscall_64+0xa5/0x4b0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf
RIP: 0033:0x7f01f4439b52

We seem to stumble over a memmap that contains a garbage zone id. While
we could try inserting pfn_to_online_page() calls, that would just make
memory offlining slower, because we use test_pages_in_a_zone() to make
sure we're offlining pages that all belong to the same zone.

Let's just get rid of this PFN walker and determine the single zone of a
memory block -- if any -- for early memory blocks during boot. For memory
onlining, we know the single zone already. Let's avoid any additional
memmap scanning and just rely on the zone information available during
boot.

For memory hot(un)plug, we only really care about memory blocks that:
* span a single zone (and, thereby, a single node)
* are completely System RAM (IOW, no holes, no ZONE_DEVICE)
If one of these conditions is not met, we reject memory offlining.
Hotplugged memory blocks (starting out offline) always meet both
conditions.

There are three scenarios to handle:

(1) Memory hot(un)plug

A memory block with zone == NULL cannot be offlined, corresponding to
our previous test_pages_in_a_zone() check.

After successful memory onlining/offlining, we simply set the zone
accordingly.
* Memory onlining: set the zone we just used for onlining
* Memory offlining: set zone = NULL

So a hotplugged memory block starts with zone = NULL. Once memory
onlining is done, we set the proper zone.

(2) Boot memory with !CONFIG_NUMA

We know that there is just a single pgdat, so we simply scan all zones
of that pgdat for an intersection with our memory block PFN range when
adding the memory block. If more than one zone intersects (e.g., DMA and
DMA32 on x86 for the first memory block) we set zone = NULL and
consequently mimic what test_pages_in_a_zone() used to do.

(3) Boot memory with CONFIG_NUMA

At the point in time we create the memory block devices during boot, we
don't know yet which nodes *actually* span a memory block. While we could
scan all zones of all nodes for intersections, overlapping nodes complicate
the situation and scanning all nodes is possibly expensive. But that
problem has already been solved by the code that sets the node of a memory
block and creates the link in the sysfs --
do_register_memory_block_under_node().

So, we hook into the code that sets the node id for a memory block. If
we already have a different node id set for the memory block, we know
that multiple nodes *actually* have PFNs falling into our memory block:
we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
to do. If there is no node id set, we do the same as (2) for the given
node.

Note that the call order in driver_init() is:
-> memory_dev_init(): create memory block devices
-> node_dev_init(): link memory block devices to the node and set the
node id

So in summary, we detect if there is a single zone responsible for this
memory block and we consequently store the zone in that case in the
memory block, updating it during memory onlining/offlining.

Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Rafael Parra <rparrazo@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael Parra <rparrazo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by David Hildenbrand, committed by Linus Torvalds
395f6081 cc651559

+125 -57
+96 -5
drivers/base/memory.c
···
 	adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 				  nr_vmemmap_pages);
 
+	mem->zone = zone;
 	return ret;
 }
···
 	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
 	int ret;
 
+	if (!mem->zone)
+		return -EINVAL;
+
 	/*
 	 * Unaccount before offlining, such that unpopulated zone and kthreads
 	 * can properly be torn down in offline_pages().
···
 					  -nr_vmemmap_pages);
 
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
-			    nr_pages - nr_vmemmap_pages, mem->group);
+			    nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
 		if (nr_vmemmap_pages)
···
 	if (nr_vmemmap_pages)
 		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
 
+	mem->zone = NULL;
 	return ret;
 }
···
 	 */
 	if (mem->state == MEM_ONLINE) {
 		/*
-		 * The block contains more than one zone can not be offlined.
-		 * This can happen e.g. for ZONE_DMA and ZONE_DMA32
+		 * If !mem->zone, the memory block spans multiple zones and
+		 * cannot get offlined.
 		 */
-		default_zone = test_pages_in_a_zone(start_pfn,
-						    start_pfn + nr_pages);
+		default_zone = mem->zone;
 		if (!default_zone)
 			return sysfs_emit(buf, "%s\n", "none");
 		len += sysfs_emit_at(buf, len, "%s", default_zone->name);
···
 	return ret;
 }
 
+static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
+						     int nid)
+{
+	const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+	const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	struct zone *zone, *matching_zone = NULL;
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int i;
+
+	/*
+	 * This logic only works for early memory, when the applicable zones
+	 * already span the memory block. We don't expect overlapping zones on
+	 * a single node for early memory. So if we're told that some PFNs
+	 * of a node fall into this memory block, we can assume that all node
+	 * zones that intersect with the memory block are actually applicable.
+	 * No need to look at the memmap.
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zone = pgdat->node_zones + i;
+		if (!populated_zone(zone))
+			continue;
+		if (!zone_intersects(zone, start_pfn, nr_pages))
+			continue;
+		if (!matching_zone) {
+			matching_zone = zone;
+			continue;
+		}
+		/* Spans multiple zones ... */
+		matching_zone = NULL;
+		break;
+	}
+	return matching_zone;
+}
+
+#ifdef CONFIG_NUMA
+/**
+ * memory_block_add_nid() - Indicate that system RAM falling into this memory
+ *			    block device (partially) belongs to the given node.
+ * @mem: The memory block device.
+ * @nid: The node id.
+ * @context: The memory initialization context.
+ *
+ * Indicate that system RAM falling into this memory block (partially) belongs
+ * to the given node. If the context indicates ("early") that we are adding the
+ * node during node device subsystem initialization, this will also properly
+ * set/adjust mem->zone based on the zone ranges of the given node.
+ */
+void memory_block_add_nid(struct memory_block *mem, int nid,
+			  enum meminit_context context)
+{
+	if (context == MEMINIT_EARLY && mem->nid != nid) {
+		/*
+		 * For early memory we have to determine the zone when setting
+		 * the node id and handle multiple nodes spanning a single
+		 * memory block by indicate via zone == NULL that we're not
+		 * dealing with a single zone. So if we're setting the node id
+		 * the first time, determine if there is a single zone. If we're
+		 * setting the node id a second time to a different node,
+		 * invalidate the single detected zone.
+		 */
+		if (mem->nid == NUMA_NO_NODE)
+			mem->zone = early_node_zone_for_memory_block(mem, nid);
+		else
+			mem->zone = NULL;
+	}
+
+	/*
+	 * If this memory block spans multiple nodes, we only indicate
+	 * the last processed node. If we span multiple nodes (not applicable
+	 * to hotplugged memory), zone == NULL will prohibit memory offlining
+	 * and consequently unplug.
+	 */
+	mem->nid = nid;
+}
+#endif
+
 static int init_memory_block(unsigned long block_id, unsigned long state,
 			     unsigned long nr_vmemmap_pages,
 			     struct memory_group *group)
···
 	mem->nid = NUMA_NO_NODE;
 	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 	INIT_LIST_HEAD(&mem->group_next);
+
+#ifndef CONFIG_NUMA
+	if (state == MEM_ONLINE)
+		/*
+		 * MEM_ONLINE at this point implies early memory. With NUMA,
+		 * we'll determine the zone when setting the node id via
+		 * memory_block_add_nid(). Memory hotplug updated the zone
+		 * manually when memory onlining/offlining succeeds.
+		 */
+		mem->zone = early_node_zone_for_memory_block(mem, NUMA_NO_NODE);
+#endif /* CONFIG_NUMA */
 
 	ret = register_memory(mem);
 	if (ret)
+5 -8
drivers/base/node.c
···
 }
 
 static void do_register_memory_block_under_node(int nid,
-						struct memory_block *mem_blk)
+						struct memory_block *mem_blk,
+						enum meminit_context context)
 {
 	int ret;
 
-	/*
-	 * If this memory block spans multiple nodes, we only indicate
-	 * the last processed node.
-	 */
-	mem_blk->nid = nid;
+	memory_block_add_nid(mem_blk, nid, context);
 
 	ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
 				       &mem_blk->dev.kobj,
···
 		if (page_nid != nid)
 			continue;
 
-		do_register_memory_block_under_node(nid, mem_blk);
+		do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
 		return 0;
 	}
 	/* mem section does not span the specified node */
···
 {
 	int nid = *(int *)arg;
 
-	do_register_memory_block_under_node(nid, mem_blk);
+	do_register_memory_block_under_node(nid, mem_blk, MEMINIT_HOTPLUG);
 	return 0;
 }
 
+12
include/linux/memory.h
···
 	unsigned long state;		/* serialized by the dev->lock */
 	int online_type;		/* for passing data to online routine */
 	int nid;			/* NID for this memory block */
+	/*
+	 * The single zone of this memory block if all PFNs of this memory block
+	 * that are System RAM (not a memory hole, not ZONE_DEVICE ranges) are
+	 * managed by a single zone. NULL if multiple zones (including nodes)
+	 * apply.
+	 */
+	struct zone *zone;
 	struct device dev;
 	/*
 	 * Number of vmemmap pages. These pages
···
 })
 #define register_hotmemory_notifier(nb)		register_memory_notifier(nb)
 #define unregister_hotmemory_notifier(nb)	unregister_memory_notifier(nb)
+
+#ifdef CONFIG_NUMA
+void memory_block_add_nid(struct memory_block *mem, int nid,
+			  enum meminit_context context);
+#endif /* CONFIG_NUMA */
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
+2 -4
include/linux/memory_hotplug.h
···
 extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
 			struct zone *zone, struct memory_group *group);
-extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
-					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
 				     unsigned long end_pfn);
···
 
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			 struct memory_group *group);
+			 struct zone *zone, struct memory_group *group);
 extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
 extern int offline_and_remove_memory(u64 start, u64 size);
···
 static inline void try_offline_node(int nid) {}
 
 static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-				struct memory_group *group)
+				struct zone *zone, struct memory_group *group)
 {
 	return -EINVAL;
 }
+10 -40
mm/memory_hotplug.c
···
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
- * Confirm all pages in a range [start, end) belong to the same zone (skipping
- * memory holes). When true, return the zone.
- */
-struct zone *test_pages_in_a_zone(unsigned long start_pfn,
-				  unsigned long end_pfn)
-{
-	unsigned long pfn, sec_end_pfn;
-	struct zone *zone = NULL;
-	struct page *page;
-
-	for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
-	     pfn < end_pfn;
-	     pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
-		/* Make sure the memory section is present first */
-		if (!present_section_nr(pfn_to_section_nr(pfn)))
-			continue;
-		for (; pfn < sec_end_pfn && pfn < end_pfn;
-		     pfn += MAX_ORDER_NR_PAGES) {
-			/* Check if we got outside of the zone */
-			if (zone && !zone_spans_pfn(zone, pfn))
-				return NULL;
-			page = pfn_to_page(pfn);
-			if (zone && page_zone(page) != zone)
-				return NULL;
-			zone = page_zone(page);
-		}
-	}
-
-	return zone;
-}
-
-/*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
  * non-lru movable pages and hugepages). Will skip over most unmovable
  * pages (esp., pages that can be skipped when offlining), but bail out on
···
 }
 
 int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			struct memory_group *group)
+			struct zone *zone, struct memory_group *group)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long pfn, system_ram_pages = 0;
+	const int node = zone_to_nid(zone);
 	unsigned long flags;
-	struct zone *zone;
 	struct memory_notify arg;
-	int ret, node;
 	char *reason;
+	int ret;
 
 	/*
 	 * {on,off}lining is constrained to full memory sections (or more
···
 		goto failed_removal;
 	}
 
-	/* This makes hotplug much easier...and readable.
-	   we assume this for now. .*/
-	zone = test_pages_in_a_zone(start_pfn, end_pfn);
-	if (!zone) {
+	/*
+	 * We only support offlining of memory blocks managed by a single zone,
+	 * checked by calling code. This is just a sanity check that we might
+	 * want to remove in the future.
+	 */
+	if (WARN_ON_ONCE(page_zone(pfn_to_page(start_pfn)) != zone ||
+			 page_zone(pfn_to_page(end_pfn - 1)) != zone)) {
 		ret = -EINVAL;
 		reason = "multizone range";
 		goto failed_removal;
 	}
-	node = zone_to_nid(zone);
 
 	/*
 	 * Disable pcplists so that page isolation cannot race with freeing