
Add a new optional ",cma" suffix to the crashkernel= command line option

Patch series "kdump: crashkernel reservation from CMA", v5.

This series implements a way to reserve additional crash kernel memory
using CMA.

Currently, none of the memory reserved for the crash kernel is usable by
the 1st (production) kernel. It is also unmapped so that it cannot be
corrupted by the fault that will eventually trigger the crash. This
makes sense for
the memory actually used by the kexec-loaded crash kernel image and initrd
and the data prepared during the load (vmcoreinfo, ...). However, the
reserved space needs to be much larger than that to provide enough
run-time memory for the crash kernel and the kdump userspace. Estimating
the amount of memory to reserve is difficult: reserving too little makes
kdump likely to fail with OOM, while reserving too much takes even more
memory away from the production system. Also, only a single contiguous
block can be reserved (or two with the "low" suffix). I've seen systems
where this fails because the physical memory is fragmented.

By reserving additional crashkernel memory from CMA, the main crashkernel
reservation can be just large enough to fit the kernel and initrd image,
minimizing the memory taken away from the production system. Most of the
run-time memory for the crash kernel will be memory previously available
to userspace in the production system. As this memory is no longer
wasted, the reservation can be done with a generous margin, making kdump
more reliable. Kernel memory that we need to preserve for dumping is
normally not allocated from CMA, unless it is explicitly allocated as
movable. Currently this is only the case for memory ballooning and zswap.
Such movable memory will be missing from the vmcore. User data is
typically not dumped by makedumpfile. When dumping of user data is
intended this new CMA reservation cannot be used.

There are five patches in this series:

The first adds a new ",cma" suffix to the recently introduced generic
crashkernel parsing code. parse_crashkernel() takes one more argument to
store the CMA reservation size.

The second patch implements reserve_crashkernel_cma() which performs the
reservation. If the requested size is not available in a single range,
multiple smaller ranges will be reserved.
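The splitting idea can be illustrated with a small userspace simulation (this is not the actual reserve_crashkernel_cma() code; mock_cma_alloc() and reserve_cma_ranges() are hypothetical names, and the mock allocator simply pretends that no contiguous chunk larger than 256M exists):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANGES 8

struct range { unsigned long long base, size; };

/* Mock allocator: pretends physical memory is fragmented so that no
 * contiguous chunk larger than 256M can be found.  Purely illustrative. */
static unsigned long long mock_cma_alloc(unsigned long long want,
					 unsigned long long *base)
{
	static unsigned long long next_base = 0x100000000ULL;
	unsigned long long got = want > (256ULL << 20) ? 256ULL << 20 : want;

	*base = next_base;
	next_base += got << 1;	/* leave a gap: fragmentation */
	return got;
}

/* Sketch of the "reserve in multiple smaller ranges" fallback: keep
 * allocating until the requested size is covered or the range table is
 * full.  Returns the number of ranges used. */
static int reserve_cma_ranges(unsigned long long total,
			      struct range ranges[MAX_RANGES])
{
	int n = 0;

	while (total > 0 && n < MAX_RANGES) {
		unsigned long long base;
		unsigned long long got = mock_cma_alloc(total, &base);

		if (!got)
			break;
		ranges[n].base = base;
		ranges[n].size = got;
		total -= got;
		n++;
	}
	return n;
}
```

With a 1G request and a 256M fragmentation limit, the sketch ends up with four ranges whose sizes sum to the requested total.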

The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.

The fourth patch adds a short delay before booting the kdump kernel,
allowing pending DMA transfers to finish.

The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.

Adding other architectures is easy and I can do that as soon as this
series is merged.

With this series applied, specifying
crashkernel=100M crashkernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.

An additional 1G will be reserved from CMA, still usable by the production
system. The crash kernel will have 1.1G memory available. The 100M can
be reliably predicted based on the size of the kernel and initrd.

The new ",cma" suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.


This patch (of 5):

Add a new cma_size parameter to parse_crashkernel(). When not NULL, call
__parse_crashkernel to parse the CMA reservation size from
"crashkernel=size,cma" and store it in cma_size.

Pass NULL as cma_size in all existing calls to parse_crashkernel().
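The suffix matching can be sketched in userspace like this (parse_ck_suffix() is a hypothetical stand-in, not the kernel code; the kernel uses __parse_crashkernel() with memparse() for the size):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for suffix-aware crashkernel parsing: find a
 * "crashkernel=<size><suffix>" entry in cmdline and return the size in
 * bytes, or 0 if no such entry exists.  Only meant to illustrate how a
 * ",cma" entry coexists with the plain one. */
static unsigned long long parse_ck_suffix(const char *cmdline,
					  const char *suffix)
{
	const char *p = cmdline;

	while ((p = strstr(p, "crashkernel=")) != NULL) {
		char *end;
		unsigned long long size;

		p += strlen("crashkernel=");
		size = strtoull(p, &end, 10);
		switch (*end) {		/* tiny memparse() substitute */
		case 'G': size <<= 30; end++; break;
		case 'M': size <<= 20; end++; break;
		case 'K': size <<= 10; end++; break;
		}
		if (suffix[0] == '\0') {
			/* plain entry: must not be followed by a suffix */
			if (*end != ',')
				return size;
		} else if (strncmp(end, suffix, strlen(suffix)) == 0) {
			return size;
		}
	}
	return 0;
}
```

Given "crashkernel=100M crashkernel=1G,cma", the empty suffix matches the plain 100M entry and ",cma" matches the 1G entry, mirroring how the two reservations are parsed independently.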

Link: https://lkml.kernel.org/r/aEqnxxfLZMllMC8I@dwarf.suse.cz
Link: https://lkml.kernel.org/r/aEqoQckgoTQNULnh@dwarf.suse.cz
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Jiri Bohac and committed by Andrew Morton
35c18f29 22c2ed69

+27 -14
+1 -1
arch/arm/kernel/setup.c
 	total_mem = get_total_mem();
 	ret = parse_crashkernel(boot_command_line, total_mem,
 				&crash_size, &crash_base,
-				NULL, NULL);
+				NULL, NULL, NULL);
 	/* invalid value specified or crashkernel=0 */
 	if (ret || !crash_size)
 		return;
+1 -1
arch/arm64/mm/init.c
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, &high);
+				&low_size, NULL, &high);
 	if (ret)
 		return;
 
+1 -1
arch/loongarch/kernel/setup.c
 		return;
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
-			&crash_size, &crash_base, &low_size, &high);
+			&crash_size, &crash_base, &low_size, NULL, &high);
 	if (ret)
 		return;
 
+1 -1
arch/mips/kernel/setup.c
 	total_mem = memblock_phys_mem_size();
 	ret = parse_crashkernel(boot_command_line, total_mem,
 				&crash_size, &crash_base,
-				NULL, NULL);
+				NULL, NULL, NULL);
 	if (ret != 0 || crash_size <= 0)
 		return;
 
+1 -1
arch/powerpc/kernel/fadump.c
 	 * memory at a predefined offset.
 	 */
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
-				&size, &base, NULL, NULL);
+				&size, &base, NULL, NULL, NULL);
 	if (ret == 0 && size > 0) {
 		unsigned long max_size;
 
+1 -1
arch/powerpc/kexec/core.c
 
 	/* use common parsing */
 	ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size,
-				&crash_base, NULL, NULL);
+				&crash_base, NULL, NULL, NULL);
 
 	if (ret)
 		return;
+1 -1
arch/powerpc/mm/nohash/kaslr_booke.c
 	int ret;
 
 	ret = parse_crashkernel(boot_command_line, size, &crash_size,
-				&crash_base, NULL, NULL);
+				&crash_base, NULL, NULL, NULL);
 	if (ret != 0 || crash_size == 0)
 		return;
 	if (crash_base == 0)
+1 -1
arch/riscv/mm/init.c
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, &high);
+				&low_size, NULL, &high);
 	if (ret)
 		return;
 
+1 -1
arch/s390/kernel/setup.c
 	int rc;
 
 	rc = parse_crashkernel(boot_command_line, ident_map_size,
-			       &crash_size, &crash_base, NULL, NULL);
+			       &crash_size, &crash_base, NULL, NULL, NULL);
 
 	crash_base = ALIGN(crash_base, KEXEC_CRASH_MEM_ALIGN);
 	crash_size = ALIGN(crash_size, KEXEC_CRASH_MEM_ALIGN);
+1 -1
arch/sh/kernel/machine_kexec.c
 		return;
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
-				&crash_size, &crash_base, NULL, NULL);
+				&crash_size, &crash_base, NULL, NULL, NULL);
 	if (ret == 0 && crash_size > 0) {
 		crashk_res.start = crash_base;
 		crashk_res.end = crash_base + crash_size - 1;
+1 -1
arch/x86/kernel/setup.c
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, &high);
+				&low_size, NULL, &high);
 	if (ret)
 		return;
 
+2 -1
include/linux/crash_reserve.h
 
 int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base,
-		unsigned long long *low_size, bool *high);
+		unsigned long long *low_size, unsigned long long *cma_size,
+		bool *high);
 
 #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
 #ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
+14 -2
kernel/crash_reserve.c
 
 #define SUFFIX_HIGH	0
 #define SUFFIX_LOW	1
-#define SUFFIX_NULL	2
+#define SUFFIX_CMA	2
+#define SUFFIX_NULL	3
 static __initdata char *suffix_tbl[] = {
 	[SUFFIX_HIGH] = ",high",
 	[SUFFIX_LOW] = ",low",
+	[SUFFIX_CMA] = ",cma",
 	[SUFFIX_NULL] = NULL,
 };
 
 /*
  * That function parses "suffix" crashkernel command lines like
  *
- *	crashkernel=size,[high|low]
+ *	crashkernel=size,[high|low|cma]
  *
  * It returns 0 on success and -EINVAL on failure.
  */
···
 		unsigned long long *crash_size,
 		unsigned long long *crash_base,
 		unsigned long long *low_size,
+		unsigned long long *cma_size,
 		bool *high)
 {
 	int ret;
+	unsigned long long __always_unused cma_base;
 
 	/* crashkernel=X[@offset] */
 	ret = __parse_crashkernel(cmdline, system_ram, crash_size,
···
 
 		*high = true;
 	}
+
+	/*
+	 * optional CMA reservation
+	 * cma_base is ignored
+	 */
+	if (cma_size)
+		__parse_crashkernel(cmdline, 0, cma_size,
+				&cma_base, suffix_tbl[SUFFIX_CMA]);
 #endif
 	if (!*crash_size)
 		ret = -EINVAL;