Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

memblock: add MEMBLOCK_RSRV_KERN flag

Patch series "kexec: introduce Kexec HandOver (KHO)", v8.

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers Conference
2023 presentation for details:

https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO). As a really simple example target, we
use memblock's reserve_mem.

With this patchset applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other.
How they pass it is architecture specific. The file's format is a
Flattened Device Tree (FDT), which already has a generator and parser
included in Linux. KHO is enabled on the kernel command line with
`kho=on`. When the root user finalizes KHO through
/sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks for
every KHO user to register preserved memory regions, which contain
drivers' states.

When the actual kexec happens, the FDT is part of the image set that we
boot into. In addition, we keep "scratch regions" available for kexec:
physically contiguous memory regions that are guaranteed not to contain
any memory that KHO would preserve. The new kernel bootstraps itself
using the scratch regions and sets all handed-over memory as in use.
When drivers that support KHO initialize, they introspect the FDT,
restore preserved memory regions, and retrieve their states stored in
the preserved memory.

== Limitations ==

Currently KHO is only implemented for file based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is not yet implemented in kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter. KHO will automatically create scratch regions. If you want to
set the scratch sizes explicitly, you can use the "kho_scratch=" command
line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a
16 MiB low memory scratch area, a 512 MiB global scratch region, and
256 MiB scratch regions per NUMA node on boot.

Make sure to have a reserved memory range requested with the reserve_mem
command line option, for example, "reserve_mem=64m:4k:n1".

Then, before you invoke file based "kexec -l", finalize the KHO FDT:

# echo 1 > /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`,

# dtc /sys/kernel/debug/kho/out/fdt
# dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on Ubuntu via `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel,

# kexec -l Image --initrd=initrd -s
# kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as in the first kernel.

You can also review the FDT passed from the old kernel,

# dtc /sys/kernel/debug/kho/in/fdt
# dtc /sys/kernel/debug/kho/in/sub_fdts/memblock


This patch (of 17):

Introduce the MEMBLOCK_RSRV_KERN flag to denote areas that were reserved
for kernel use, either directly with memblock_reserve_kern() or via
memblock allocations.

Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/
Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/
Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/
Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/
Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com
Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Mike Rapoport (Microsoft), committed by Andrew Morton.
Commits: 4c78cc59 50dbe531

73 insertions, 32 deletions

include/linux/memblock.h (+18 -1)

···
  * kernel resource tree.
  * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
  * not initialized (only for reserved regions).
+ * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
+ * either explicitly with memblock_reserve_kern() or via memblock
+ * allocation APIs. All memblock allocations set this flag.
  */
 enum memblock_flags {
 	MEMBLOCK_NONE = 0x0,		/* No special request */
···
 	MEMBLOCK_NOMAP = 0x4,		/* don't add to kernel direct mapping */
 	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
 	MEMBLOCK_RSRV_NOINIT = 0x10,	/* don't initialize struct pages */
+	MEMBLOCK_RSRV_KERN = 0x20,	/* memory reserved for kernel use */
 };
 
 /**
···
 int memblock_add(phys_addr_t base, phys_addr_t size);
 int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_phys_free(phys_addr_t base, phys_addr_t size);
-int memblock_reserve(phys_addr_t base, phys_addr_t size);
+int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid,
+		       enum memblock_flags flags);
+
+static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+	return __memblock_reserve(base, size, NUMA_NO_NODE, 0);
+}
+
+static __always_inline int memblock_reserve_kern(phys_addr_t base, phys_addr_t size)
+{
+	return __memblock_reserve(base, size, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN);
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
 #endif
···
 phys_addr_t memblock_phys_mem_size(void);
 phys_addr_t memblock_reserved_size(void);
+phys_addr_t memblock_reserved_kern_size(phys_addr_t limit, int nid);
 unsigned long memblock_estimated_nr_free_pages(void);
 phys_addr_t memblock_start_of_DRAM(void);
 phys_addr_t memblock_end_of_DRAM(void);
mm/memblock.c (+32 -8)

···
 	 * needn't do it
 	 */
 	if (!use_slab)
-		BUG_ON(memblock_reserve(addr, new_alloc_size));
+		BUG_ON(memblock_reserve_kern(addr, new_alloc_size));
 
 	/* Update slab flag */
 	*in_slab = use_slab;
···
 #ifdef CONFIG_NUMA
 			WARN_ON(nid != memblock_get_region_node(rgn));
 #endif
-			WARN_ON(flags != rgn->flags);
+			WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags);
 			nr_new++;
 			if (insert) {
 				if (start_rgn == -1)
···
 	return memblock_remove_range(&memblock.reserved, base, size);
 }
 
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size,
+				       int nid, enum memblock_flags flags)
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
-		     &base, &end, (void *)_RET_IP_);
+	memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
+		     &base, &end, nid, flags, (void *)_RET_IP_);
 
-	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
+	return memblock_add_range(&memblock.reserved, base, size, nid, flags);
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
···
 again:
 	found = memblock_find_in_range_node(size, align, start, end, nid,
 					    flags);
-	if (found && !memblock_reserve(found, size))
+	if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
 		goto done;
 
 	if (numa_valid_node(nid) && !exact_nid) {
 		found = memblock_find_in_range_node(size, align, start,
 						    end, NUMA_NO_NODE,
 						    flags);
-		if (found && !memblock_reserve(found, size))
+		if (found && !memblock_reserve_kern(found, size))
 			goto done;
 	}
···
 phys_addr_t __init_memblock memblock_reserved_size(void)
 {
 	return memblock.reserved.total_size;
+}
+
+phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int nid)
+{
+	struct memblock_region *r;
+	phys_addr_t total = 0;
+
+	for_each_reserved_mem_region(r) {
+		phys_addr_t size = r->size;
+
+		if (r->base > limit)
+			break;
+
+		if (r->base + r->size > limit)
+			size = limit - r->base;
+
+		if (nid == memblock_get_region_node(r) || !numa_valid_node(nid))
+			if (r->flags & MEMBLOCK_RSRV_KERN)
+				total += size;
+	}
+
+	return total;
 }
 
 /**
···
 		[ilog2(MEMBLOCK_NOMAP)] = "NOMAP",
 		[ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
 		[ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
+		[ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
 	};
 
 static int memblock_debug_show(struct seq_file *m, void *private)
tools/testing/memblock/tests/alloc_api.c (+11 -11)

···
 	PREFIX_PUSH();
 	setup_memblock();
 
-	memblock_reserve(memblock_end_of_DRAM() - total_size, r1_size);
+	memblock_reserve_kern(memblock_end_of_DRAM() - total_size, r1_size);
 
 	allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES);
···
 	total_size = r1.size + r2_size;
 
-	memblock_reserve(r1.base, r1.size);
+	memblock_reserve_kern(r1.base, r1.size);
 
 	allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES);
···
 	total_size = r1.size + r2.size + r3_size;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES);
···
 	total_size = r1.size + r2.size + r3_size;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES);
···
 	setup_memblock();
 
 	/* Simulate almost-full memory */
-	memblock_reserve(memblock_start_of_DRAM(), reserved_size);
+	memblock_reserve_kern(memblock_start_of_DRAM(), reserved_size);
 
 	allocated_ptr = run_memblock_alloc(available_size, SMP_CACHE_BYTES);
···
 	PREFIX_PUSH();
 	setup_memblock();
 
-	memblock_reserve(memblock_start_of_DRAM() + r1_size, r2_size);
+	memblock_reserve_kern(memblock_start_of_DRAM() + r1_size, r2_size);
 
 	allocated_ptr = run_memblock_alloc(r1_size, SMP_CACHE_BYTES);
···
 	total_size = r1.size + r2_size;
 
-	memblock_reserve(r1.base, r1.size);
+	memblock_reserve_kern(r1.base, r1.size);
 
 	allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES);
···
 	total_size = r1.size + r2.size + r3_size;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES);
tools/testing/memblock/tests/alloc_helpers_api.c (+2 -2)

···
 	min_addr = memblock_end_of_DRAM() - SMP_CACHE_BYTES * 2;
 
 	/* No space above this address */
-	memblock_reserve(min_addr, r2_size);
+	memblock_reserve_kern(min_addr, r2_size);
 
 	allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr);
···
 	start_addr = (phys_addr_t)memblock_start_of_DRAM();
 	min_addr = start_addr - SMP_CACHE_BYTES * 3;
 
-	memblock_reserve(start_addr + r1_size, MEM_SIZE - r1_size);
+	memblock_reserve_kern(start_addr + r1_size, MEM_SIZE - r1_size);
 
 	allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr);
tools/testing/memblock/tests/alloc_nid_api.c (+10 -10)

···
 	min_addr = max_addr - r2_size;
 	reserved_base = min_addr - r1_size;
 
-	memblock_reserve(reserved_base, r1_size);
+	memblock_reserve_kern(reserved_base, r1_size);
 
 	allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES,
 					       min_addr, max_addr,
···
 	max_addr = memblock_end_of_DRAM() - r1_size;
 	min_addr = max_addr - r2_size;
 
-	memblock_reserve(max_addr, r1_size);
+	memblock_reserve_kern(max_addr, r1_size);
 
 	allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES,
 					       min_addr, max_addr,
···
 	min_addr = r2.base + r2.size;
 	max_addr = r1.base;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
 					       min_addr, max_addr,
···
 	min_addr = r2.base + r2.size;
 	max_addr = r1.base;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
 					       min_addr, max_addr,
···
 	min_addr = r2.base + r2.size;
 	max_addr = r1.base;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
 					       min_addr, max_addr,
···
 	min_addr = r2.base + r2.size;
 	max_addr = r1.base;
 
-	memblock_reserve(r1.base, r1.size);
-	memblock_reserve(r2.base, r2.size);
+	memblock_reserve_kern(r1.base, r1.size);
+	memblock_reserve_kern(r2.base, r2.size);
 
 	allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES,
 					       min_addr, max_addr,