
mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory

Currently, it can happen that pages are allocated (and freed) via the
buddy before we have finished basic memory onlining.

For example, pages are exposed to the buddy and can be allocated before we
actually mark the sections online. Allocated pages could then suddenly fail
pfn_to_online_page() checks. We had similar issues with pcp handling,
where pages were allocated+freed before we reached zone_pcp_update() in
online_pages() [1].

Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
impossible. Once done with the heavy lifting, use
undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
freelist, marking them ready for allocation. Similar to offline_pages(),
we have to manually adjust zone->nr_isolate_pageblock.

[1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by David Hildenbrand, committed by Linus Torvalds
Commits: b30c5927 d882c006

Diffstat: +23 -11

mm/Kconfig (+1 -1)
@@ -152,6 +152,7 @@
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
+	select MEMORY_ISOLATION
 	depends on SPARSEMEM || X86_64_ACPI_NUMA
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
 	depends on 64BIT || BROKEN
@@ -179,7 +178,6 @@
 
 config MEMORY_HOTREMOVE
 	bool "Allow for memory hot remove"
-	select MEMORY_ISOLATION
 	select HAVE_BOOTMEM_INFO_NODE if (X86_64 || PPC64)
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
mm/memory_hotplug.c (+22 -10)
@@ -813,7 +813,7 @@
 
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_MOVABLE);
+	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
@@ -823,6 +823,14 @@
 	ret = notifier_to_errno(ret);
 	if (ret)
 		goto failed_addition;
+
+	/*
+	 * Fixup the number of isolated pageblocks before marking the sections
+	 * online, such that undo_isolate_page_range() works correctly.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
 	 * If this zone is not populated, then it is not in zonelist.
@@ -849,20 +841,24 @@
 	zone->zone_pgdat->node_present_pages += nr_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	node_states_set_node(nid, &arg);
+	if (need_zonelists_rebuild)
+		build_all_zonelists(NULL);
+	zone_pcp_update(zone);
+
+	/* Basic onlining is complete, allow allocation of onlined pages. */
+	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+
 	/*
 	 * When exposing larger, physically contiguous memory areas to the
 	 * buddy, shuffling in the buddy (when freeing onlined pages, putting
 	 * them either to the head or the tail of the freelist) is only helpful
 	 * for maintaining the shuffle, but not for creating the initial
 	 * shuffle. Shuffle the whole zone to make sure the just onlined pages
-	 * are properly distributed across the whole freelist.
+	 * are properly distributed across the whole freelist. Make sure to
+	 * shuffle once pageblocks are no longer isolated.
 	 */
 	shuffle_zone(zone);
-
-	node_states_set_node(nid, &arg);
-	if (need_zonelists_rebuild)
-		build_all_zonelists(NULL);
-	zone_pcp_update(zone);
 
 	init_per_zone_wmark_min();
@@ -1589,9 +1577,9 @@
 	pr_info("Offlined Pages %ld\n", nr_pages);
 
 	/*
-	 * Onlining will reset pagetype flags and makes migrate type
-	 * MOVABLE, so just need to decrease the number of isolated
-	 * pageblocks zone counter here.
+	 * The memory sections are marked offline, and the pageblock flags
+	 * effectively stale; nobody should be touching them. Fixup the number
+	 * of isolated pageblocks, memory onlining will properly revert this.
	 */
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;