Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc/pseries: add RTAS work area allocator

Various pseries-specific RTAS functions take a temporary "work area"
parameter - a buffer in memory accessible to RTAS. Typically such
functions are passed the statically allocated rtas_data_buf buffer as
the argument. This buffer is protected by a global spinlock, so users
of rtas_data_buf must not perform sleeping operations while accessing
the buffer.

Most RTAS functions that have a work area parameter can return a
status (-2/990x) that indicates that the caller should retry. Before
retrying, the caller may need to reschedule or sleep (see
rtas_busy_delay() for details). This combination of factors
leads to uncomfortable constructions like this:

	do {
		spin_lock(&rtas_data_buf_lock);
		rc = rtas_call(token, __pa(rtas_data_buf), ...);
		if (rc == 0) {
			/* parse or copy out rtas_data_buf contents */
		}
		spin_unlock(&rtas_data_buf_lock);
	} while (rtas_busy_delay(rc));

Another unfortunately common way of handling this is for callers to
blithely ignore the possibility of a -2/990x status and hope for the
best.

If users were allowed to perform blocking operations while owning a
work area, the programming model would become less tedious and
error-prone. Users could schedule away, sleep, or perform other
blocking operations without having to release and re-acquire
resources.

We could continue to use a single work area buffer, and convert
rtas_data_buf_lock to a mutex. But that would impose an unnecessarily
coarse serialization on all users. As awkward as the current design
is, it prevents longer running operations that need to repeatedly use
rtas_data_buf from blocking the progress of others.

There are more considerations. One is that while 4KB is fine for all
current in-kernel uses, some RTAS calls can take much smaller buffers,
and some (VPD, platform dumps) would likely benefit from larger
ones. Another is that at least one RTAS function (ibm,get-vpd)
has *two* work area parameters. And finally, we should expect the
number of work area users in the kernel to increase over time as we
introduce lockdown-compatible ABIs to replace less safe use cases
based on sys_rtas/librtas.

So a special-purpose allocator for RTAS work area buffers seems worth
trying.

Properties:

* The backing memory for the allocator is reserved early in boot in
order to satisfy RTAS addressing requirements, and then managed with
genalloc.
* Allocations can block, but they never fail (mempool-like).
* Prioritizes first-come, first-serve fairness over throughput.
* Early boot allocations before the allocator has been initialized are
served via an internal static buffer.

Intended to replace rtas_data_buf. New code that needs RTAS work area
buffers should prefer this API.
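
For illustration, the construction shown earlier might look something
like this under the new API (a sketch only; the token and the argument
layout below are placeholders, not taken from this patch):

	struct rtas_work_area *work_area;
	int rc;

	work_area = rtas_work_area_alloc(SZ_4K);	/* may sleep; never fails */

	do {
		/* placeholder token/args for some RTAS function with a work area parameter */
		rc = rtas_call(token, 2, 1, NULL,
			       rtas_work_area_phys(work_area),
			       rtas_work_area_size(work_area));
		/* rtas_busy_delay() may schedule or sleep; no lock is held */
	} while (rtas_busy_delay(rc));

	if (rc == 0) {
		/* parse or copy out contents via rtas_work_area_raw_buf() */
	}

	rtas_work_area_free(work_area);

Because no spinlock is held across the call, the retry delay and any
parsing of results can happen while the work area is still held.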

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230125-b4-powerpc-rtas-queue-v3-12-26929c8cce78@linux.ibm.com

Authored by Nathan Lynch, committed by Michael Ellerman
43033bc6 24098f58

4 files changed, 309 insertions(+), 1 deletion(-)
arch/powerpc/include/asm/rtas-work-area.h (+96)
/* SPDX-License-Identifier: GPL-2.0-only */
#ifndef _ASM_POWERPC_RTAS_WORK_AREA_H
#define _ASM_POWERPC_RTAS_WORK_AREA_H

#include <linux/build_bug.h>
#include <linux/sizes.h>
#include <linux/types.h>

#include <asm/page.h>

/**
 * struct rtas_work_area - RTAS work area descriptor.
 *
 * Descriptor for a "work area" in PAPR terminology that satisfies
 * RTAS addressing requirements.
 */
struct rtas_work_area {
	/* private: Use the APIs provided below. */
	char *buf;
	size_t size;
};

enum {
	/* Maximum allocation size, enforced at build time. */
	RTAS_WORK_AREA_MAX_ALLOC_SZ = SZ_128K,
};

/**
 * rtas_work_area_alloc() - Acquire a work area of the requested size.
 * @size_: Allocation size. Must be compile-time constant and not more
 *         than %RTAS_WORK_AREA_MAX_ALLOC_SZ.
 *
 * Allocate a buffer suitable for passing to RTAS functions that have
 * a memory address parameter, often (but not always) referred to as a
 * "work area" in PAPR. Although callers are allowed to block while
 * holding a work area, the amount of memory reserved for this purpose
 * is limited, and allocations should be short-lived. A good guideline
 * is to release any allocated work area before returning from a
 * system call.
 *
 * This function does not fail. It blocks until the allocation
 * succeeds. To prevent deadlocks, callers are discouraged from
 * allocating more than one work area simultaneously in a single task
 * context.
 *
 * Context: This function may sleep.
 * Return: A &struct rtas_work_area descriptor for the allocated work area.
 */
#define rtas_work_area_alloc(size_) ({				\
	static_assert(__builtin_constant_p(size_));		\
	static_assert((size_) > 0);				\
	static_assert((size_) <= RTAS_WORK_AREA_MAX_ALLOC_SZ);	\
	__rtas_work_area_alloc(size_);				\
})

/*
 * Do not call __rtas_work_area_alloc() directly. Use
 * rtas_work_area_alloc().
 */
struct rtas_work_area *__rtas_work_area_alloc(size_t size);

/**
 * rtas_work_area_free() - Release a work area.
 * @area: Work area descriptor as returned from rtas_work_area_alloc().
 *
 * Return a work area buffer to the pool.
 */
void rtas_work_area_free(struct rtas_work_area *area);

static inline char *rtas_work_area_raw_buf(const struct rtas_work_area *area)
{
	return area->buf;
}

static inline size_t rtas_work_area_size(const struct rtas_work_area *area)
{
	return area->size;
}

static inline phys_addr_t rtas_work_area_phys(const struct rtas_work_area *area)
{
	return __pa(area->buf);
}

/*
 * Early setup for the work area allocator. Call from
 * rtas_initialize() only.
 */

#ifdef CONFIG_PPC_PSERIES
void rtas_work_area_reserve_arena(phys_addr_t limit);
#else /* CONFIG_PPC_PSERIES */
static inline void rtas_work_area_reserve_arena(phys_addr_t limit) {}
#endif /* CONFIG_PPC_PSERIES */

#endif /* _ASM_POWERPC_RTAS_WORK_AREA_H */
arch/powerpc/kernel/rtas.c (+3)
···
 #include <asm/machdep.h>
 #include <asm/mmu.h>
 #include <asm/page.h>
+#include <asm/rtas-work-area.h>
 #include <asm/rtas.h>
 #include <asm/time.h>
 #include <asm/trace.h>
···
 #endif
 	ibm_open_errinjct_token = rtas_token("ibm,open-errinjct");
 	ibm_errinjct_token = rtas_token("ibm,errinjct");
+
+	rtas_work_area_reserve_arena(rtas_region);
 }
 
 int __init early_init_dt_scan_rtas(unsigned long node,
arch/powerpc/platforms/pseries/Makefile (+1 -1)
···
 ccflags-$(CONFIG_PPC_PSERIES_DEBUG)	+= -DDEBUG
 
 obj-y			:= lpar.o hvCall.o nvram.o reconfig.o \
-			   of_helpers.o \
+			   of_helpers.o rtas-work-area.o \
			   setup.o iommu.o event_sources.o ras.o \
			   firmware.o power.o dlpar.o mobility.o rng.o \
			   pci.o pci_dlpar.o eeh_pseries.o msi.o \
arch/powerpc/platforms/pseries/rtas-work-area.c (+209)
// SPDX-License-Identifier: GPL-2.0-only

#define pr_fmt(fmt)	"rtas-work-area: " fmt

#include <linux/genalloc.h>
#include <linux/log2.h>
#include <linux/kernel.h>
#include <linux/memblock.h>
#include <linux/mempool.h>
#include <linux/minmax.h>
#include <linux/mutex.h>
#include <linux/numa.h>
#include <linux/sizes.h>
#include <linux/wait.h>

#include <asm/machdep.h>
#include <asm/rtas-work-area.h>
#include <asm/rtas.h>

enum {
	/*
	 * Ensure the pool is page-aligned.
	 */
	RTAS_WORK_AREA_ARENA_ALIGN = PAGE_SIZE,
	/*
	 * Don't let a single allocation claim the whole arena.
	 */
	RTAS_WORK_AREA_ARENA_SZ = RTAS_WORK_AREA_MAX_ALLOC_SZ * 2,
	/*
	 * The smallest known work area size is for ibm,get-vpd's
	 * location code argument, which is limited to 79 characters
	 * plus 1 nul terminator.
	 *
	 * PAPR+ 7.3.20 ibm,get-vpd RTAS Call
	 * PAPR+ 12.3.2.4 Converged Location Code Rules - Length Restrictions
	 */
	RTAS_WORK_AREA_MIN_ALLOC_SZ = roundup_pow_of_two(80),
};

static struct {
	struct gen_pool *gen_pool;
	char *arena;
	struct mutex mutex; /* serializes allocations */
	struct wait_queue_head wqh;
	mempool_t descriptor_pool;
	bool available;
} rwa_state = {
	.mutex = __MUTEX_INITIALIZER(rwa_state.mutex),
	.wqh = __WAIT_QUEUE_HEAD_INITIALIZER(rwa_state.wqh),
};

/*
 * A single work area buffer and descriptor to serve requests early in
 * boot before the allocator is fully initialized. We know 4KB is the
 * most any boot time user needs (they all call ibm,get-system-parameter).
 */
static bool early_work_area_in_use __initdata;
static char early_work_area_buf[SZ_4K] __initdata __aligned(SZ_4K);
static struct rtas_work_area early_work_area __initdata = {
	.buf = early_work_area_buf,
	.size = sizeof(early_work_area_buf),
};


static struct rtas_work_area * __init rtas_work_area_alloc_early(size_t size)
{
	WARN_ON(size > early_work_area.size);
	WARN_ON(early_work_area_in_use);
	early_work_area_in_use = true;
	memset(early_work_area.buf, 0, early_work_area.size);
	return &early_work_area;
}

static void __init rtas_work_area_free_early(struct rtas_work_area *work_area)
{
	WARN_ON(work_area != &early_work_area);
	WARN_ON(!early_work_area_in_use);
	early_work_area_in_use = false;
}

struct rtas_work_area * __ref __rtas_work_area_alloc(size_t size)
{
	struct rtas_work_area *area;
	unsigned long addr;

	might_sleep();

	/*
	 * The rtas_work_area_alloc() wrapper enforces this at build
	 * time. Requests that exceed the arena size will block
	 * indefinitely.
	 */
	WARN_ON(size > RTAS_WORK_AREA_MAX_ALLOC_SZ);

	if (!rwa_state.available)
		return rtas_work_area_alloc_early(size);
	/*
	 * To ensure FCFS behavior and prevent a high rate of smaller
	 * requests from starving larger ones, use the mutex to queue
	 * allocations.
	 */
	mutex_lock(&rwa_state.mutex);
	wait_event(rwa_state.wqh,
		   (addr = gen_pool_alloc(rwa_state.gen_pool, size)) != 0);
	mutex_unlock(&rwa_state.mutex);

	area = mempool_alloc(&rwa_state.descriptor_pool, GFP_KERNEL);
	area->buf = (char *)addr;
	area->size = size;

	return area;
}

void __ref rtas_work_area_free(struct rtas_work_area *area)
{
	if (!rwa_state.available) {
		rtas_work_area_free_early(area);
		return;
	}

	gen_pool_free(rwa_state.gen_pool, (unsigned long)area->buf, area->size);
	mempool_free(area, &rwa_state.descriptor_pool);
	wake_up(&rwa_state.wqh);
}

/*
 * Initialization of the work area allocator happens in two parts. To
 * reliably reserve an arena that satisfies RTAS addressing
 * requirements, we must perform a memblock allocation early,
 * immediately after RTAS instantiation. Then we have to wait until
 * the slab allocator is up before setting up the descriptor mempool
 * and adding the arena to a gen_pool.
 */
static __init int rtas_work_area_allocator_init(void)
{
	const unsigned int order = ilog2(RTAS_WORK_AREA_MIN_ALLOC_SZ);
	const phys_addr_t pa_start = __pa(rwa_state.arena);
	const phys_addr_t pa_end = pa_start + RTAS_WORK_AREA_ARENA_SZ - 1;
	struct gen_pool *pool;
	const int nid = NUMA_NO_NODE;
	int err;

	err = -ENOMEM;
	if (!rwa_state.arena)
		goto err_out;

	pool = gen_pool_create(order, nid);
	if (!pool)
		goto err_out;
	/*
	 * All RTAS functions that consume work areas are OK with
	 * natural alignment, when they have alignment requirements at
	 * all.
	 */
	gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);

	err = gen_pool_add(pool, (unsigned long)rwa_state.arena,
			   RTAS_WORK_AREA_ARENA_SZ, nid);
	if (err)
		goto err_destroy;

	err = mempool_init_kmalloc_pool(&rwa_state.descriptor_pool, 1,
					sizeof(struct rtas_work_area));
	if (err)
		goto err_destroy;

	rwa_state.gen_pool = pool;
	rwa_state.available = true;

	pr_debug("arena [%pa-%pa] (%uK), min/max alloc sizes %u/%u\n",
		 &pa_start, &pa_end,
		 RTAS_WORK_AREA_ARENA_SZ / SZ_1K,
		 RTAS_WORK_AREA_MIN_ALLOC_SZ,
		 RTAS_WORK_AREA_MAX_ALLOC_SZ);

	return 0;

err_destroy:
	gen_pool_destroy(pool);
err_out:
	return err;
}
machine_arch_initcall(pseries, rtas_work_area_allocator_init);

/**
 * rtas_work_area_reserve_arena() - Reserve memory suitable for RTAS work areas.
 */
void __init rtas_work_area_reserve_arena(const phys_addr_t limit)
{
	const phys_addr_t align = RTAS_WORK_AREA_ARENA_ALIGN;
	const phys_addr_t size = RTAS_WORK_AREA_ARENA_SZ;
	const phys_addr_t min = MEMBLOCK_LOW_LIMIT;
	const int nid = NUMA_NO_NODE;

	/*
	 * Too early for a machine_is(pseries) check. But PAPR
	 * effectively mandates that ibm,get-system-parameter is
	 * present:
	 *
	 * R1-7.3.16-1. All platforms must support the System
	 * Parameters option.
	 *
	 * So set up the arena if we find that, with a fallback to
	 * ibm,configure-connector, just in case.
	 */
	if (rtas_service_present("ibm,get-system-parameter") ||
	    rtas_service_present("ibm,configure-connector"))
		rwa_state.arena = memblock_alloc_try_nid(size, align, min, limit, nid);
}