Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branches 'for-next/misc', 'for-next/tlbflush', 'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core

* arm64/for-next/perf:
: Perf updates
perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc()
perf/arm-cmn: Fix incorrect error check for devm_ioremap()
perf: add NVIDIA Tegra410 C2C PMU
perf: add NVIDIA Tegra410 CPU Memory Latency PMU
perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
perf/arm_cspmu: nvidia: Rename doc to Tegra241
perf/arm-cmn: Stop claiming entire iomem region
arm64: cpufeature: Use pmuv3_implemented() function
arm64: cpufeature: Make PMUVer and PerfMon unsigned
KVM: arm64: Read PMUVer as unsigned

* arm64/for-next/read-once:
: Fixes for __READ_ONCE() with CONFIG_LTO=y
arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
arm64: Optimize __READ_ONCE() with CONFIG_LTO=y

* for-next/misc:
: Miscellaneous cleanups/fixes
arm64: rsi: use linear-map alias for realm config buffer
arm64: Kconfig: fix duplicate word in CMDLINE help text
arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
arm64: kexec: Remove duplicate allocation for trans_pgd
arm64: mm: Use generic enum pgtable_level
arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0
arm64: remove ARCH_INLINE_*

* for-next/tlbflush:
: Refactor the arm64 TLB invalidation API and implementation
arm64: mm: __ptep_set_access_flags must hint correct TTL
arm64: mm: Provide level hint for flush_tlb_page()
arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range()
arm64: mm: More flags for __flush_tlb_range()
arm64: mm: Refactor __flush_tlb_range() to take flags
arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid()
arm64: mm: Simplify __flush_tlb_range_limit_excess()
arm64: mm: Simplify __TLBI_RANGE_NUM() macro
arm64: mm: Re-implement the __flush_tlb_range_op macro in C
arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range()
arm64: mm: Push __TLBI_VADDR() into __tlbi_level()
arm64: mm: Implicitly invalidate user ASID based on TLBI operation
arm64: mm: Introduce a C wrapper for by-range TLB invalidation
arm64: mm: Re-implement the __tlbi_level macro as a C function

* for-next/ttbr-macros-cleanup:
: Cleanups of the TTBR1_* macros
arm64/mm: Directly use TTBRx_EL1_CnP
arm64/mm: Directly use TTBRx_EL1_ASID_MASK
arm64/mm: Describe TTBR1_BADDR_4852_OFFSET

* for-next/kselftest:
: arm64 kselftest updates
selftests/arm64: Implement cmpbr_sigill() to hwcap test

* for-next/feat_lsui:
: Futex support using FEAT_LSUI instructions to avoid toggling PAN
arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present
arm64: Kconfig: Add support for LSUI
KVM: arm64: Use CAST instruction for swapping guest descriptor
arm64: futex: Support futex with FEAT_LSUI
arm64: futex: Refactor futex atomic operation
KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI
KVM: arm64: Expose FEAT_LSUI to guests
arm64: cpufeature: Add FEAT_LSUI

* for-next/mpam: (40 commits)
: Expose MPAM to user-space via resctrl:
: - Add architecture context-switch and hiding of the feature from KVM.
: - Add interface to allow MPAM to be exposed to user-space using resctrl.
: - Add errata workaround for some existing platforms.
: - Add documentation for using MPAM and what shape of platforms can use resctrl
arm64: mpam: Add initial MPAM documentation
arm_mpam: Quirk CMN-650's CSU NRDY behaviour
arm_mpam: Add workaround for T241-MPAM-6
arm_mpam: Add workaround for T241-MPAM-4
arm_mpam: Add workaround for T241-MPAM-1
arm_mpam: Add quirk framework
arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl
arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
arm_mpam: resctrl: Add empty definitions for assorted resctrl functions
arm_mpam: resctrl: Update the rmid reallocation limit
arm_mpam: resctrl: Add resctrl_arch_rmid_read()
arm_mpam: resctrl: Allow resctrl to allocate monitors
arm_mpam: resctrl: Add support for csu counters
arm_mpam: resctrl: Add monitor initialisation and domain boilerplate
arm_mpam: resctrl: Add kunit test for control format conversions
arm_mpam: resctrl: Add support for 'MB' resource
arm_mpam: resctrl: Wait for cacheinfo to be ready
arm_mpam: resctrl: Add rmid index helpers
arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats
arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT
...

* for-next/hotplug-batched-tlbi:
: arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
arm64/mm: Reject memory removal that splits a kernel leaf mapping
arm64/mm: Enable batched TLB flush in unmap_hotplug_range()

* for-next/bbml2-fixes:
: Fixes for realm guest and BBML2_NOABORT
arm64: mm: Remove pmd_sect() and pud_sect()
arm64: mm: Handle invalid large leaf mappings correctly
arm64: mm: Fix rodata=full block mapping support for realm guests

* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06

* for-next/generic-entry:
: More arm64 refactoring towards using the generic entry code
arm64: Check DAIF (and PMR) at task-switch time
arm64: entry: Use split preemption logic
arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
arm64: entry: Consistently prefix arm64-specific wrappers
arm64: entry: Don't preempt with SError or Debug masked
entry: Split preemption from irqentry_exit_to_kernel_mode()
entry: Split kernel mode logic from irqentry_{enter,exit}()
entry: Move irqentry_enter() prototype later
entry: Remove local_irq_{enable,disable}_exit_to_user()
entry: Fix stale comment for irqentry_enter()

* for-next/acpi:
: arm64 ACPI updates
ACPI: AGDI: fix missing newline in error message

+4094 -881
+1
Documentation/arch/arm64/index.rst
···
  memory
  memory-tagging-extension
  mops
+ mpam
  perf
  pointer-authentication
  ptdump
+72
Documentation/arch/arm64/mpam.rst
···
+ .. SPDX-License-Identifier: GPL-2.0
+
+ ====
+ MPAM
+ ====
+
+ What is MPAM
+ ============
+ MPAM (Memory Partitioning and Monitoring) is a feature in the CPUs and memory
+ system components such as the caches or memory controllers that allow memory
+ traffic to be labelled, partitioned and monitored.
+
+ Traffic is labelled by the CPU, based on the control or monitor group the
+ current task is assigned to using resctrl. Partitioning policy can be set
+ using the schemata file in resctrl, and monitor values read via resctrl.
+ See Documentation/filesystems/resctrl.rst for more details.
+
+ This allows tasks that share memory system resources, such as caches, to be
+ isolated from each other according to the partitioning policy (so called noisy
+ neighbours).
+
+ Supported Platforms
+ ===================
+ Use of this feature requires CPU support, support in the memory system
+ components, and a description from firmware of where the MPAM device controls
+ are in the MMIO address space. (e.g. the 'MPAM' ACPI table).
+
+ The MMIO device that provides MPAM controls/monitors for a memory system
+ component is called a memory system component. (MSC).
+
+ Because the user interface to MPAM is via resctrl, only MPAM features that are
+ compatible with resctrl can be exposed to user-space.
+
+ MSC are considered as a group based on the topology. MSC that correspond with
+ the L3 cache are considered together, it is not possible to mix MSC between L2
+ and L3 to 'cover' a resctrl schema.
+
+ The supported features are:
+
+ * Cache portion bitmap controls (CPOR) on the L2 or L3 caches. To expose
+   CPOR at L2 or L3, every CPU must have a corresponding CPU cache at this
+   level that also supports the feature. Mismatched big/little platforms are
+   not supported as resctrl's controls would then also depend on task
+   placement.
+
+ * Memory bandwidth maximum controls (MBW_MAX) on or after the L3 cache.
+   resctrl uses the L3 cache-id to identify where the memory bandwidth
+   control is applied. For this reason the platform must have an L3 cache
+   with cache-id's supplied by firmware. (It doesn't need to support MPAM.)
+
+   To be exported as the 'MB' schema, the topology of the group of MSC chosen
+   must match the topology of the L3 cache so that the cache-id's can be
+   repainted. For example: Platforms with Memory bandwidth maximum controls
+   on CPU-less NUMA nodes cannot expose the 'MB' schema to resctrl as these
+   nodes do not have a corresponding L3 cache. If the memory bandwidth
+   control is on the memory rather than the L3 then there must be a single
+   global L3 as otherwise it is unknown which L3 the traffic came from. There
+   must be no caches between the L3 and the memory so that the two ends of
+   the path have equivalent traffic.
+
+   When the MPAM driver finds multiple groups of MSC it can use for the 'MB'
+   schema, it prefers the group closest to the L3 cache.
+
+ * Cache Storage Usage (CSU) counters can expose the 'llc_occupancy' provided
+   there is at least one CSU monitor on each MSC that makes up the L3 group.
+   Exposing CSU counters from other caches or devices is not supported.
+
+ Reporting Bugs
+ ==============
+ If you are not seeing the counters or controls you expect please share the
+ debug messages produced when enabling dynamic debug and booting with:
+ dyndbg="file mpam_resctrl.c +pl"
+9
Documentation/arch/arm64/silicon-errata.rst
···
  +----------------+-----------------+-----------------+-----------------------------+
  | ARM            | SI L1           | #4311569        | ARM64_ERRATUM_4311569       |
  +----------------+-----------------+-----------------+-----------------------------+
+ | ARM            | CMN-650         | #3642720        | N/A                         |
+ +----------------+-----------------+-----------------+-----------------------------+
+ +----------------+-----------------+-----------------+-----------------------------+
  | Broadcom       | Brahma-B53      | N/A             | ARM64_ERRATUM_845719        |
  +----------------+-----------------+-----------------+-----------------------------+
  | Broadcom       | Brahma-B53      | N/A             | ARM64_ERRATUM_843419        |
···
  | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
  +----------------+-----------------+-----------------+-----------------------------+
  | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
+ +----------------+-----------------+-----------------+-----------------------------+
+ | NVIDIA         | T241 MPAM       | T241-MPAM-1     | N/A                         |
+ +----------------+-----------------+-----------------+-----------------------------+
+ | NVIDIA         | T241 MPAM       | T241-MPAM-4     | N/A                         |
+ +----------------+-----------------+-----------------+-----------------------------+
+ | NVIDIA         | T241 MPAM       | T241-MPAM-6     | N/A                         |
  +----------------+-----------------+-----------------+-----------------------------+
  +----------------+-----------------+-----------------+-----------------------------+
  | Freescale/NXP  | LS2080A/LS1043A | A-008585        | FSL_ERRATUM_A008585         |
+25 -29
arch/arm64/Kconfig
···
  	select ARCH_HAVE_ELF_PROT
  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
  	select ARCH_HAVE_TRACE_MMIO_ACCESS
- 	select ARCH_INLINE_READ_LOCK if !PREEMPTION
- 	select ARCH_INLINE_READ_LOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPTION
- 	select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPTION
- 	select ARCH_INLINE_READ_UNLOCK if !PREEMPTION
- 	select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPTION
- 	select ARCH_INLINE_READ_UNLOCK_IRQRESTORE if !PREEMPTION
- 	select ARCH_INLINE_WRITE_LOCK if !PREEMPTION
- 	select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPTION
- 	select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPTION
- 	select ARCH_INLINE_WRITE_UNLOCK if !PREEMPTION
- 	select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPTION
- 	select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE if !PREEMPTION
- 	select ARCH_INLINE_SPIN_TRYLOCK if !PREEMPTION
- 	select ARCH_INLINE_SPIN_TRYLOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_SPIN_LOCK if !PREEMPTION
- 	select ARCH_INLINE_SPIN_LOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_SPIN_LOCK_IRQ if !PREEMPTION
- 	select ARCH_INLINE_SPIN_LOCK_IRQSAVE if !PREEMPTION
- 	select ARCH_INLINE_SPIN_UNLOCK if !PREEMPTION
- 	select ARCH_INLINE_SPIN_UNLOCK_BH if !PREEMPTION
- 	select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION
- 	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
  	select ARCH_KEEP_MEMBLOCK
  	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  	select ARCH_USE_CMPXCHG_LOCKREF
···

  config ARM64_MPAM
  	bool "Enable support for MPAM"
- 	select ARM64_MPAM_DRIVER if EXPERT # does nothing yet
- 	select ACPI_MPAM if ACPI
+ 	select ARM64_MPAM_DRIVER
+ 	select ARCH_HAS_CPU_RESCTRL
  	help
  	  Memory System Resource Partitioning and Monitoring (MPAM) is an
  	  optional extension to the Arm architecture that allows each
···
  	  of where the MSCs are in the address space.

  	  MPAM is exposed to user-space via the resctrl pseudo filesystem.
+
+ 	  This option enables the extra context switch code.

  endmenu # "ARMv8.4 architectural features"

···

  endmenu # "ARMv9.4 architectural features"

+ config AS_HAS_LSUI
+ 	def_bool $(as-instr,.arch_extension lsui)
+ 	help
+ 	  Supported by LLVM 20+ and binutils 2.45+.
+
+ menu "ARMv9.6 architectural features"
+
+ config ARM64_LSUI
+ 	bool "Support Unprivileged Load Store Instructions (LSUI)"
+ 	default y
+ 	depends on AS_HAS_LSUI && !CPU_BIG_ENDIAN
+ 	help
+ 	  The Unprivileged Load Store Instructions (LSUI) provides
+ 	  variants load/store instructions that access user-space memory
+ 	  from the kernel without clearing PSTATE.PAN bit.
+
+ 	  This feature is supported by LLVM 20+ and binutils 2.45+.
+
+ endmenu # "ARMv9.6 architectural feature"
+
  config ARM64_SVE
  	bool "ARM Scalable Vector Extension support"
  	default y
···
  	default ""
  	help
  	  Provide a set of default command-line options at build time by
- 	  entering them here. As a minimum, you should specify the the
+ 	  entering them here. As a minimum, you should specify the
  	  root device (e.g. root=/dev/nfs).

  choice
+1 -1
arch/arm64/include/asm/asm-uaccess.h
···
  #ifdef CONFIG_ARM64_SW_TTBR0_PAN
  	.macro	__uaccess_ttbr0_disable, tmp1
  	mrs	\tmp1, ttbr1_el1			// swapper_pg_dir
- 	bic	\tmp1, \tmp1, #TTBR_ASID_MASK
+ 	bic	\tmp1, \tmp1, #TTBRx_EL1_ASID_MASK
  	sub	\tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET	// reserved_pg_dir
  	msr	ttbr0_el1, \tmp1			// set reserved TTBR0_EL1
  	add	\tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET
+2
arch/arm64/include/asm/cpucaps.h
···
  		return true;
  	case ARM64_HAS_PMUV3:
  		return IS_ENABLED(CONFIG_HW_PERF_EVENTS);
+ 	case ARM64_HAS_LSUI:
+ 		return IS_ENABLED(CONFIG_ARM64_LSUI);
  	}

  	return true;
+2 -1
arch/arm64/include/asm/el2_setup.h
···
  	check_override id_aa64pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT, .Linit_mpam_\@, .Lskip_mpam_\@, x1, x2

  .Linit_mpam_\@:
- 	msr_s	SYS_MPAM2_EL2, xzr		// use the default partition
+ 	mov	x0, #MPAM2_EL2_EnMPAMSM_MASK
+ 	msr_s	SYS_MPAM2_EL2, x0		// use the default partition,
  						// and disable lower traps
  	mrs_s	x0, SYS_MPAMIDR_EL1
  	tbz	x0, #MPAMIDR_EL1_HAS_HCR_SHIFT, .Lskip_mpam_\@	// skip if no MPAMHCR reg
+253 -58
arch/arm64/include/asm/futex.h
···
  #include <linux/uaccess.h>

  #include <asm/errno.h>
+ #include <asm/lsui.h>

  #define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */

- #define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg)	\
+ #define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
+ static __always_inline int						\
+ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
- do {									\
+ {									\
  	unsigned int loops = FUTEX_MAX_LOOPS;				\
+ 	int ret, oldval, newval;					\
  									\
  	uaccess_enable_privileged();					\
- 	asm volatile(							\
- "	prfm	pstl1strm, %2\n"					\
- "1:	ldxr	%w1, %2\n"						\
+ 	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
+ "	prfm	pstl1strm, %[uaddr]\n"					\
+ "1:	ldxr	%w[oldval], %[uaddr]\n"					\
  	insn "\n"							\
- "2:	stlxr	%w0, %w3, %2\n"						\
- "	cbz	%w0, 3f\n"						\
- "	sub	%w4, %w4, %w0\n"					\
- "	cbnz	%w4, 1b\n"						\
- "	mov	%w0, %w6\n"						\
+ "2:	stlxr	%w[ret], %w[newval], %[uaddr]\n"			\
+ "	cbz	%w[ret], 3f\n"						\
+ "	sub	%w[loops], %w[loops], %w[ret]\n"			\
+ "	cbnz	%w[loops], 1b\n"					\
+ "	mov	%w[ret], %w[err]\n"					\
  "3:\n"								\
  "	dmb	ish\n"							\
- 	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)				\
- 	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)				\
- 	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),	\
- 	  "+r" (loops)							\
- 	: "r" (oparg), "Ir" (-EAGAIN)					\
+ 	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w[ret])			\
+ 	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w[ret])			\
+ 	: [ret] "=&r" (ret), [oldval] "=&r" (oldval),			\
+ 	  [uaddr] "+Q" (*uaddr), [newval] "=&r" (newval),		\
+ 	  [loops] "+r" (loops)						\
+ 	: [oparg] "r" (oparg), [err] "Ir" (-EAGAIN)			\
  	: "memory");							\
  	uaccess_disable_privileged();					\
- } while (0)
+ 									\
+ 	if (!ret)							\
+ 		*oval = oldval;						\
+ 									\
+ 	return ret;							\
+ }
+
+ LLSC_FUTEX_ATOMIC_OP(add, "add	%w[newval], %w[oldval], %w[oparg]")
+ LLSC_FUTEX_ATOMIC_OP(or, "orr	%w[newval], %w[oldval], %w[oparg]")
+ LLSC_FUTEX_ATOMIC_OP(and, "and	%w[newval], %w[oldval], %w[oparg]")
+ LLSC_FUTEX_ATOMIC_OP(eor, "eor	%w[newval], %w[oldval], %w[oparg]")
+ LLSC_FUTEX_ATOMIC_OP(set, "mov	%w[newval], %w[oparg]")
+
+ static __always_inline int
+ __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+ {
+ 	int ret = 0;
+ 	unsigned int loops = FUTEX_MAX_LOOPS;
+ 	u32 val, tmp;
+
+ 	uaccess_enable_privileged();
+ 	asm volatile("//__llsc_futex_cmpxchg\n"
+ "	prfm	pstl1strm, %[uaddr]\n"
+ "1:	ldxr	%w[curval], %[uaddr]\n"
+ "	eor	%w[tmp], %w[curval], %w[oldval]\n"
+ "	cbnz	%w[tmp], 4f\n"
+ "2:	stlxr	%w[tmp], %w[newval], %[uaddr]\n"
+ "	cbz	%w[tmp], 3f\n"
+ "	sub	%w[loops], %w[loops], %w[tmp]\n"
+ "	cbnz	%w[loops], 1b\n"
+ "	mov	%w[ret], %w[err]\n"
+ "3:\n"
+ "	dmb	ish\n"
+ "4:\n"
+ 	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w[ret])
+ 	_ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w[ret])
+ 	: [ret] "+r" (ret), [curval] "=&r" (val),
+ 	  [uaddr] "+Q" (*uaddr), [tmp] "=&r" (tmp),
+ 	  [loops] "+r" (loops)
+ 	: [oldval] "r" (oldval), [newval] "r" (newval),
+ 	  [err] "Ir" (-EAGAIN)
+ 	: "memory");
+ 	uaccess_disable_privileged();
+
+ 	if (!ret)
+ 		*oval = val;
+
+ 	return ret;
+ }
+
+ #ifdef CONFIG_ARM64_LSUI
+
+ /*
+  * Wrap LSUI instructions with uaccess_ttbr0_enable()/disable(), as
+  * PAN toggling is not required.
+  */
+
+ #define LSUI_FUTEX_ATOMIC_OP(op, asm_op)				\
+ static __always_inline int						\
+ __lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
+ {									\
+ 	int ret = 0;							\
+ 	int oldval;							\
+ 									\
+ 	uaccess_ttbr0_enable();						\
+ 									\
+ 	asm volatile("// __lsui_futex_atomic_" #op "\n"			\
+ 	__LSUI_PREAMBLE							\
+ "1:	" #asm_op "al	%w[oparg], %w[oldval], %[uaddr]\n"		\
+ "2:\n"								\
+ 	_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret])			\
+ 	: [ret] "+r" (ret), [uaddr] "+Q" (*uaddr),			\
+ 	  [oldval] "=r" (oldval)					\
+ 	: [oparg] "r" (oparg)						\
+ 	: "memory");							\
+ 									\
+ 	uaccess_ttbr0_disable();					\
+ 									\
+ 	if (!ret)							\
+ 		*oval = oldval;						\
+ 	return ret;							\
+ }
+
+ LSUI_FUTEX_ATOMIC_OP(add, ldtadd)
+ LSUI_FUTEX_ATOMIC_OP(or, ldtset)
+ LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr)
+ LSUI_FUTEX_ATOMIC_OP(set, swpt)
+
+ static __always_inline int
+ __lsui_cmpxchg64(u64 __user *uaddr, u64 *oldval, u64 newval)
+ {
+ 	int ret = 0;
+
+ 	uaccess_ttbr0_enable();
+
+ 	asm volatile("// __lsui_cmpxchg64\n"
+ 	__LSUI_PREAMBLE
+ "1:	casalt	%[oldval], %[newval], %[uaddr]\n"
+ "2:\n"
+ 	_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret])
+ 	: [ret] "+r" (ret), [uaddr] "+Q" (*uaddr),
+ 	  [oldval] "+r" (*oldval)
+ 	: [newval] "r" (newval)
+ 	: "memory");
+
+ 	uaccess_ttbr0_disable();
+
+ 	return ret;
+ }
+
+ static __always_inline int
+ __lsui_cmpxchg32(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+ {
+ 	u64 __user *uaddr64;
+ 	bool futex_pos, other_pos;
+ 	u32 other, orig_other;
+ 	union {
+ 		u32 futex[2];
+ 		u64 raw;
+ 	} oval64, orig64, nval64;
+
+ 	uaddr64 = (u64 __user *)PTR_ALIGN_DOWN(uaddr, sizeof(u64));
+ 	futex_pos = !IS_ALIGNED((unsigned long)uaddr, sizeof(u64));
+ 	other_pos = !futex_pos;
+
+ 	oval64.futex[futex_pos] = oldval;
+ 	if (get_user(oval64.futex[other_pos], (u32 __user *)uaddr64 + other_pos))
+ 		return -EFAULT;
+
+ 	orig64.raw = oval64.raw;
+
+ 	nval64.futex[futex_pos] = newval;
+ 	nval64.futex[other_pos] = oval64.futex[other_pos];
+
+ 	if (__lsui_cmpxchg64(uaddr64, &oval64.raw, nval64.raw))
+ 		return -EFAULT;
+
+ 	oldval = oval64.futex[futex_pos];
+ 	other = oval64.futex[other_pos];
+ 	orig_other = orig64.futex[other_pos];
+
+ 	if (other != orig_other)
+ 		return -EAGAIN;
+
+ 	*oval = oldval;
+
+ 	return 0;
+ }
+
+ static __always_inline int
+ __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
+ {
+ 	/*
+ 	 * Undo the bitwise negation applied to the oparg passed from
+ 	 * arch_futex_atomic_op_inuser() with FUTEX_OP_ANDN.
+ 	 */
+ 	return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
+ }
+
+ static __always_inline int
+ __lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
+ {
+ 	u32 oldval, newval, val;
+ 	int ret, i;
+
+ 	if (get_user(oldval, uaddr))
+ 		return -EFAULT;
+
+ 	/*
+ 	 * there are no ldteor/stteor instructions...
+ 	 */
+ 	for (i = 0; i < FUTEX_MAX_LOOPS; i++) {
+ 		newval = oldval ^ oparg;
+
+ 		ret = __lsui_cmpxchg32(uaddr, oldval, newval, &val);
+ 		switch (ret) {
+ 		case -EFAULT:
+ 			return ret;
+ 		case -EAGAIN:
+ 			continue;
+ 		}
+
+ 		if (val == oldval) {
+ 			*oval = val;
+ 			return 0;
+ 		}
+
+ 		oldval = val;
+ 	}
+
+ 	return -EAGAIN;
+ }
+
+ static __always_inline int
+ __lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+ {
+ 	/*
+ 	 * Callers of futex_atomic_cmpxchg_inatomic() already retry on
+ 	 * -EAGAIN, no need for another loop of max retries.
+ 	 */
+ 	return __lsui_cmpxchg32(uaddr, oldval, newval, oval);
+ }
+ #endif /* CONFIG_ARM64_LSUI */
+
+
+ #define FUTEX_ATOMIC_OP(op)						\
+ static __always_inline int						\
+ __futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
+ {									\
+ 	return __lsui_llsc_body(futex_atomic_##op, oparg, uaddr, oval);	\
+ }
+
+ FUTEX_ATOMIC_OP(add)
+ FUTEX_ATOMIC_OP(or)
+ FUTEX_ATOMIC_OP(and)
+ FUTEX_ATOMIC_OP(eor)
+ FUTEX_ATOMIC_OP(set)
+
+ static __always_inline int
+ __futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+ {
+ 	return __lsui_llsc_body(futex_cmpxchg, uaddr, oldval, newval, oval);
+ }

  static inline int
  arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *_uaddr)
  {
- 	int oldval = 0, ret, tmp;
- 	u32 __user *uaddr = __uaccess_mask_ptr(_uaddr);
+ 	int ret;
+ 	u32 __user *uaddr;

  	if (!access_ok(_uaddr, sizeof(u32)))
  		return -EFAULT;

+ 	uaddr = __uaccess_mask_ptr(_uaddr);
+
  	switch (op) {
  	case FUTEX_OP_SET:
- 		__futex_atomic_op("mov	%w3, %w5",
- 				  ret, oldval, uaddr, tmp, oparg);
+ 		ret = __futex_atomic_set(oparg, uaddr, oval);
  		break;
  	case FUTEX_OP_ADD:
- 		__futex_atomic_op("add	%w3, %w1, %w5",
- 				  ret, oldval, uaddr, tmp, oparg);
+ 		ret = __futex_atomic_add(oparg, uaddr, oval);
  		break;
  	case FUTEX_OP_OR:
- 		__futex_atomic_op("orr	%w3, %w1, %w5",
- 				  ret, oldval, uaddr, tmp, oparg);
+ 		ret = __futex_atomic_or(oparg, uaddr, oval);
  		break;
  	case FUTEX_OP_ANDN:
- 		__futex_atomic_op("and	%w3, %w1, %w5",
- 				  ret, oldval, uaddr, tmp, ~oparg);
+ 		ret = __futex_atomic_and(~oparg, uaddr, oval);
  		break;
  	case FUTEX_OP_XOR:
- 		__futex_atomic_op("eor	%w3, %w1, %w5",
- 				  ret, oldval, uaddr, tmp, oparg);
+ 		ret = __futex_atomic_eor(oparg, uaddr, oval);
  		break;
  	default:
  		ret = -ENOSYS;
  	}
-
- 	if (!ret)
- 		*oval = oldval;

  	return ret;
  }
···
  futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *_uaddr,
  			      u32 oldval, u32 newval)
  {
- 	int ret = 0;
- 	unsigned int loops = FUTEX_MAX_LOOPS;
- 	u32 val, tmp;
  	u32 __user *uaddr;

  	if (!access_ok(_uaddr, sizeof(u32)))
  		return -EFAULT;

  	uaddr = __uaccess_mask_ptr(_uaddr);
- 	uaccess_enable_privileged();
- 	asm volatile("// futex_atomic_cmpxchg_inatomic\n"
- "	prfm	pstl1strm, %2\n"
- "1:	ldxr	%w1, %2\n"
- "	sub	%w3, %w1, %w5\n"
- "	cbnz	%w3, 4f\n"
- "2:	stlxr	%w3, %w6, %2\n"
- "	cbz	%w3, 3f\n"
- "	sub	%w4, %w4, %w3\n"
- "	cbnz	%w4, 1b\n"
- "	mov	%w0, %w7\n"
- "3:\n"
- "	dmb	ish\n"
- "4:\n"
- 	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
- 	_ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
- 	: "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
- 	: "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
- 	: "memory");
- 	uaccess_disable_privileged();

- 	if (!ret)
- 		*uval = val;
-
- 	return ret;
+ 	return __futex_cmpxchg(uaddr, oldval, newval, uval);
  }

  #endif /* __ASM_FUTEX_H */
+6 -6
arch/arm64/include/asm/hugetlb.h
···
  					 unsigned long start,
  					 unsigned long end,
  					 unsigned long stride,
- 					 bool last_level)
+ 					 tlbf_t flags)
  {
  	switch (stride) {
  #ifndef __PAGETABLE_PMD_FOLDED
  	case PUD_SIZE:
- 		__flush_tlb_range(vma, start, end, PUD_SIZE, last_level, 1);
+ 		__flush_tlb_range(vma, start, end, PUD_SIZE, 1, flags);
  		break;
  #endif
  	case CONT_PMD_SIZE:
  	case PMD_SIZE:
- 		__flush_tlb_range(vma, start, end, PMD_SIZE, last_level, 2);
+ 		__flush_tlb_range(vma, start, end, PMD_SIZE, 2, flags);
  		break;
  	case CONT_PTE_SIZE:
- 		__flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, 3);
+ 		__flush_tlb_range(vma, start, end, PAGE_SIZE, 3, flags);
  		break;
  	default:
- 		__flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, TLBI_TTL_UNKNOWN);
+ 		__flush_tlb_range(vma, start, end, PAGE_SIZE, TLBI_TTL_UNKNOWN, flags);
  	}
  }
···
  {
  	unsigned long stride = huge_page_size(hstate_vma(vma));

- 	__flush_hugetlb_tlb_range(vma, start, end, stride, false);
+ 	__flush_hugetlb_tlb_range(vma, start, end, stride, TLBF_NONE);
  }

  #endif /* __ASM_HUGETLB_H */
+2 -118
arch/arm64/include/asm/hwcap.h
···
   * of KERNEL_HWCAP_{feature}.
   */
  #define __khwcap_feature(x)		const_ilog2(HWCAP_ ## x)
- #define KERNEL_HWCAP_FP			__khwcap_feature(FP)
- #define KERNEL_HWCAP_ASIMD		__khwcap_feature(ASIMD)
- #define KERNEL_HWCAP_EVTSTRM		__khwcap_feature(EVTSTRM)
- #define KERNEL_HWCAP_AES		__khwcap_feature(AES)
- #define KERNEL_HWCAP_PMULL		__khwcap_feature(PMULL)
- #define KERNEL_HWCAP_SHA1		__khwcap_feature(SHA1)
- #define KERNEL_HWCAP_SHA2		__khwcap_feature(SHA2)
- #define KERNEL_HWCAP_CRC32		__khwcap_feature(CRC32)
- #define KERNEL_HWCAP_ATOMICS		__khwcap_feature(ATOMICS)
- #define KERNEL_HWCAP_FPHP		__khwcap_feature(FPHP)
- #define KERNEL_HWCAP_ASIMDHP		__khwcap_feature(ASIMDHP)
- #define KERNEL_HWCAP_CPUID		__khwcap_feature(CPUID)
- #define KERNEL_HWCAP_ASIMDRDM		__khwcap_feature(ASIMDRDM)
- #define KERNEL_HWCAP_JSCVT		__khwcap_feature(JSCVT)
- #define KERNEL_HWCAP_FCMA		__khwcap_feature(FCMA)
- #define KERNEL_HWCAP_LRCPC		__khwcap_feature(LRCPC)
- #define KERNEL_HWCAP_DCPOP		__khwcap_feature(DCPOP)
- #define KERNEL_HWCAP_SHA3		__khwcap_feature(SHA3)
- #define KERNEL_HWCAP_SM3		__khwcap_feature(SM3)
- #define KERNEL_HWCAP_SM4		__khwcap_feature(SM4)
- #define KERNEL_HWCAP_ASIMDDP		__khwcap_feature(ASIMDDP)
- #define KERNEL_HWCAP_SHA512		__khwcap_feature(SHA512)
- #define KERNEL_HWCAP_SVE		__khwcap_feature(SVE)
- #define KERNEL_HWCAP_ASIMDFHM		__khwcap_feature(ASIMDFHM)
- #define KERNEL_HWCAP_DIT		__khwcap_feature(DIT)
- #define KERNEL_HWCAP_USCAT		__khwcap_feature(USCAT)
- #define KERNEL_HWCAP_ILRCPC		__khwcap_feature(ILRCPC)
- #define KERNEL_HWCAP_FLAGM		__khwcap_feature(FLAGM)
- #define KERNEL_HWCAP_SSBS		__khwcap_feature(SSBS)
- #define KERNEL_HWCAP_SB			__khwcap_feature(SB)
- #define KERNEL_HWCAP_PACA		__khwcap_feature(PACA)
- #define KERNEL_HWCAP_PACG		__khwcap_feature(PACG)
- #define KERNEL_HWCAP_GCS		__khwcap_feature(GCS)
- #define KERNEL_HWCAP_CMPBR		__khwcap_feature(CMPBR)
- #define KERNEL_HWCAP_FPRCVT		__khwcap_feature(FPRCVT)
- #define KERNEL_HWCAP_F8MM8		__khwcap_feature(F8MM8)
- #define KERNEL_HWCAP_F8MM4		__khwcap_feature(F8MM4)
- #define KERNEL_HWCAP_SVE_F16MM		__khwcap_feature(SVE_F16MM)
- #define KERNEL_HWCAP_SVE_ELTPERM	__khwcap_feature(SVE_ELTPERM)
- #define KERNEL_HWCAP_SVE_AES2		__khwcap_feature(SVE_AES2)
- #define KERNEL_HWCAP_SVE_BFSCALE	__khwcap_feature(SVE_BFSCALE)
- #define KERNEL_HWCAP_SVE2P2		__khwcap_feature(SVE2P2)
- #define KERNEL_HWCAP_SME2P2		__khwcap_feature(SME2P2)
- #define KERNEL_HWCAP_SME_SBITPERM	__khwcap_feature(SME_SBITPERM)
- #define KERNEL_HWCAP_SME_AES		__khwcap_feature(SME_AES)
- #define KERNEL_HWCAP_SME_SFEXPA		__khwcap_feature(SME_SFEXPA)
- #define KERNEL_HWCAP_SME_STMOP		__khwcap_feature(SME_STMOP)
- #define KERNEL_HWCAP_SME_SMOP4		__khwcap_feature(SME_SMOP4)
-
  #define __khwcap2_feature(x)		(const_ilog2(HWCAP2_ ## x) + 64)
- #define KERNEL_HWCAP_DCPODP		__khwcap2_feature(DCPODP)
- #define KERNEL_HWCAP_SVE2		__khwcap2_feature(SVE2)
- #define KERNEL_HWCAP_SVEAES		__khwcap2_feature(SVEAES)
- #define KERNEL_HWCAP_SVEPMULL		__khwcap2_feature(SVEPMULL)
- #define KERNEL_HWCAP_SVEBITPERM		__khwcap2_feature(SVEBITPERM)
- #define KERNEL_HWCAP_SVESHA3		__khwcap2_feature(SVESHA3)
- #define KERNEL_HWCAP_SVESM4		__khwcap2_feature(SVESM4)
- #define KERNEL_HWCAP_FLAGM2		__khwcap2_feature(FLAGM2)
- #define KERNEL_HWCAP_FRINT		__khwcap2_feature(FRINT)
- #define KERNEL_HWCAP_SVEI8MM		__khwcap2_feature(SVEI8MM)
- #define KERNEL_HWCAP_SVEF32MM		__khwcap2_feature(SVEF32MM)
- #define KERNEL_HWCAP_SVEF64MM		__khwcap2_feature(SVEF64MM)
- #define KERNEL_HWCAP_SVEBF16		__khwcap2_feature(SVEBF16)
- #define KERNEL_HWCAP_I8MM		__khwcap2_feature(I8MM)
- #define KERNEL_HWCAP_BF16		__khwcap2_feature(BF16)
- #define KERNEL_HWCAP_DGH		__khwcap2_feature(DGH)
- #define KERNEL_HWCAP_RNG		__khwcap2_feature(RNG)
- #define KERNEL_HWCAP_BTI		__khwcap2_feature(BTI)
- #define KERNEL_HWCAP_MTE		__khwcap2_feature(MTE)
- #define KERNEL_HWCAP_ECV		__khwcap2_feature(ECV)
- #define KERNEL_HWCAP_AFP		__khwcap2_feature(AFP)
- #define KERNEL_HWCAP_RPRES		__khwcap2_feature(RPRES)
- #define KERNEL_HWCAP_MTE3		__khwcap2_feature(MTE3)
- #define KERNEL_HWCAP_SME		__khwcap2_feature(SME)
- #define KERNEL_HWCAP_SME_I16I64		__khwcap2_feature(SME_I16I64)
- #define KERNEL_HWCAP_SME_F64F64		__khwcap2_feature(SME_F64F64)
- #define KERNEL_HWCAP_SME_I8I32		__khwcap2_feature(SME_I8I32)
- #define KERNEL_HWCAP_SME_F16F32		__khwcap2_feature(SME_F16F32)
- #define KERNEL_HWCAP_SME_B16F32		__khwcap2_feature(SME_B16F32)
- #define KERNEL_HWCAP_SME_F32F32		__khwcap2_feature(SME_F32F32)
- #define KERNEL_HWCAP_SME_FA64		__khwcap2_feature(SME_FA64)
- #define KERNEL_HWCAP_WFXT		__khwcap2_feature(WFXT)
- #define KERNEL_HWCAP_EBF16		__khwcap2_feature(EBF16)
- #define KERNEL_HWCAP_SVE_EBF16		__khwcap2_feature(SVE_EBF16)
- #define KERNEL_HWCAP_CSSC		__khwcap2_feature(CSSC)
- #define KERNEL_HWCAP_RPRFM		__khwcap2_feature(RPRFM)
- #define KERNEL_HWCAP_SVE2P1		__khwcap2_feature(SVE2P1)
- #define KERNEL_HWCAP_SME2		__khwcap2_feature(SME2)
- #define KERNEL_HWCAP_SME2P1		__khwcap2_feature(SME2P1)
- #define KERNEL_HWCAP_SME_I16I32		__khwcap2_feature(SME_I16I32)
- #define KERNEL_HWCAP_SME_BI32I32	__khwcap2_feature(SME_BI32I32)
- #define KERNEL_HWCAP_SME_B16B16		__khwcap2_feature(SME_B16B16)
- #define KERNEL_HWCAP_SME_F16F16		__khwcap2_feature(SME_F16F16)
- #define KERNEL_HWCAP_MOPS		__khwcap2_feature(MOPS)
- #define KERNEL_HWCAP_HBC		__khwcap2_feature(HBC)
- #define KERNEL_HWCAP_SVE_B16B16		__khwcap2_feature(SVE_B16B16)
- #define KERNEL_HWCAP_LRCPC3		__khwcap2_feature(LRCPC3)
- #define KERNEL_HWCAP_LSE128		__khwcap2_feature(LSE128)
- #define KERNEL_HWCAP_FPMR		__khwcap2_feature(FPMR)
- #define KERNEL_HWCAP_LUT		__khwcap2_feature(LUT)
- #define KERNEL_HWCAP_FAMINMAX		__khwcap2_feature(FAMINMAX)
- #define KERNEL_HWCAP_F8CVT		__khwcap2_feature(F8CVT)
- #define KERNEL_HWCAP_F8FMA		__khwcap2_feature(F8FMA)
- #define KERNEL_HWCAP_F8DP4		__khwcap2_feature(F8DP4)
- #define KERNEL_HWCAP_F8DP2		__khwcap2_feature(F8DP2)
- #define KERNEL_HWCAP_F8E4M3		__khwcap2_feature(F8E4M3)
- #define KERNEL_HWCAP_F8E5M2		__khwcap2_feature(F8E5M2)
- #define KERNEL_HWCAP_SME_LUTV2		__khwcap2_feature(SME_LUTV2)
- #define KERNEL_HWCAP_SME_F8F16		__khwcap2_feature(SME_F8F16)
- #define KERNEL_HWCAP_SME_F8F32		__khwcap2_feature(SME_F8F32)
- #define KERNEL_HWCAP_SME_SF8FMA		__khwcap2_feature(SME_SF8FMA)
- #define KERNEL_HWCAP_SME_SF8DP4		__khwcap2_feature(SME_SF8DP4)
- #define KERNEL_HWCAP_SME_SF8DP2		__khwcap2_feature(SME_SF8DP2)
- #define KERNEL_HWCAP_POE		__khwcap2_feature(POE)
-
  #define __khwcap3_feature(x)		(const_ilog2(HWCAP3_ ## x) + 128)
- #define KERNEL_HWCAP_MTE_FAR		__khwcap3_feature(MTE_FAR)
- #define KERNEL_HWCAP_MTE_STORE_ONLY	__khwcap3_feature(MTE_STORE_ONLY)
- #define KERNEL_HWCAP_LSFE		__khwcap3_feature(LSFE)
- #define KERNEL_HWCAP_LS64		__khwcap3_feature(LS64)
+
+ #include "asm/kernel-hwcap.h"

  /*
   * This yields a mask that user programs can use to figure out what
+27
arch/arm64/include/asm/lsui.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef __ASM_LSUI_H 3 + #define __ASM_LSUI_H 4 + 5 + #include <linux/compiler_types.h> 6 + #include <linux/stringify.h> 7 + #include <asm/alternative.h> 8 + #include <asm/alternative-macros.h> 9 + #include <asm/cpucaps.h> 10 + 11 + #define __LSUI_PREAMBLE ".arch_extension lsui\n" 12 + 13 + #ifdef CONFIG_ARM64_LSUI 14 + 15 + #define __lsui_llsc_body(op, ...) \ 16 + ({ \ 17 + alternative_has_cap_unlikely(ARM64_HAS_LSUI) ? \ 18 + __lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__); \ 19 + }) 20 + 21 + #else /* CONFIG_ARM64_LSUI */ 22 + 23 + #define __lsui_llsc_body(op, ...) __llsc_##op(__VA_ARGS__) 24 + 25 + #endif /* CONFIG_ARM64_LSUI */ 26 + 27 + #endif /* __ASM_LSUI_H */
+2 -8
arch/arm64/include/asm/mmu.h
··· 10 10 #define MMCF_AARCH32 0x1 /* mm context flag for AArch32 executables */ 11 11 #define USER_ASID_BIT 48 12 12 #define USER_ASID_FLAG (UL(1) << USER_ASID_BIT) 13 - #define TTBR_ASID_MASK (UL(0xffff) << 48) 14 13 15 14 #ifndef __ASSEMBLER__ 16 15 17 16 #include <linux/refcount.h> 18 17 #include <asm/cpufeature.h> 19 - 20 - enum pgtable_type { 21 - TABLE_PTE, 22 - TABLE_PMD, 23 - TABLE_PUD, 24 - TABLE_P4D, 25 - }; 26 18 27 19 typedef struct { 28 20 atomic64_t id; ··· 103 111 #else 104 112 static inline void kpti_install_ng_mappings(void) {} 105 113 #endif 114 + 115 + extern bool page_alloc_available; 106 116 107 117 #endif /* !__ASSEMBLER__ */ 108 118 #endif
+2 -1
arch/arm64/include/asm/mmu_context.h
··· 210 210 if (mm == &init_mm) 211 211 ttbr = phys_to_ttbr(__pa_symbol(reserved_pg_dir)); 212 212 else 213 - ttbr = phys_to_ttbr(virt_to_phys(mm->pgd)) | ASID(mm) << 48; 213 + ttbr = phys_to_ttbr(virt_to_phys(mm->pgd)) | 214 + FIELD_PREP(TTBRx_EL1_ASID_MASK, ASID(mm)); 214 215 215 216 WRITE_ONCE(task_thread_info(tsk)->ttbr0, ttbr); 216 217 }
+96
arch/arm64/include/asm/mpam.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* Copyright (C) 2025 Arm Ltd. */ 3 + 4 + #ifndef __ASM__MPAM_H 5 + #define __ASM__MPAM_H 6 + 7 + #include <linux/arm_mpam.h> 8 + #include <linux/bitfield.h> 9 + #include <linux/jump_label.h> 10 + #include <linux/percpu.h> 11 + #include <linux/sched.h> 12 + 13 + #include <asm/sysreg.h> 14 + 15 + DECLARE_STATIC_KEY_FALSE(mpam_enabled); 16 + DECLARE_PER_CPU(u64, arm64_mpam_default); 17 + DECLARE_PER_CPU(u64, arm64_mpam_current); 18 + 19 + /* 20 + * The value of the MPAM0_EL1 sysreg when a task is in resctrl's default group. 21 + * This is used by the context switch code to use the resctrl CPU property 22 + * instead. The value is modified when CDP is enabled/disabled by mounting 23 + * the resctrl filesystem. 24 + */ 25 + extern u64 arm64_mpam_global_default; 26 + 27 + #ifdef CONFIG_ARM64_MPAM 28 + static inline u64 __mpam_regval(u16 partid_d, u16 partid_i, u8 pmg_d, u8 pmg_i) 29 + { 30 + return FIELD_PREP(MPAM0_EL1_PARTID_D, partid_d) | 31 + FIELD_PREP(MPAM0_EL1_PARTID_I, partid_i) | 32 + FIELD_PREP(MPAM0_EL1_PMG_D, pmg_d) | 33 + FIELD_PREP(MPAM0_EL1_PMG_I, pmg_i); 34 + } 35 + 36 + static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i, 37 + u8 pmg_d, u8 pmg_i) 38 + { 39 + u64 default_val = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i); 40 + 41 + WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val); 42 + } 43 + 44 + /* 45 + * The resctrl filesystem writes to the partid/pmg values for threads and CPUs, 46 + * which may race with reads in mpam_thread_switch(). Ensure only one of the old 47 + * or new values are used. Particular care should be taken with the pmg field as 48 + * mpam_thread_switch() may read a partid and pmg that don't match, causing this 49 + * value to be stored with cache allocations, despite being considered 'free' by 50 + * resctrl. 
51 + */ 52 + static inline u64 mpam_get_regval(struct task_struct *tsk) 53 + { 54 + return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg); 55 + } 56 + 57 + static inline void mpam_set_task_partid_pmg(struct task_struct *tsk, 58 + u16 partid_d, u16 partid_i, 59 + u8 pmg_d, u8 pmg_i) 60 + { 61 + u64 regval = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i); 62 + 63 + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); 64 + } 65 + 66 + static inline void mpam_thread_switch(struct task_struct *tsk) 67 + { 68 + u64 oldregval; 69 + int cpu = smp_processor_id(); 70 + u64 regval = mpam_get_regval(tsk); 71 + 72 + if (!static_branch_likely(&mpam_enabled)) 73 + return; 74 + 75 + if (regval == READ_ONCE(arm64_mpam_global_default)) 76 + regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu)); 77 + 78 + oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); 79 + if (oldregval == regval) 80 + return; 81 + 82 + write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1); 83 + if (system_supports_sme()) 84 + write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), SYS_MPAMSM_EL1); 85 + isb(); 86 + 87 + /* Synchronising the EL0 write is left until the ERET to EL0 */ 88 + write_sysreg_s(regval, SYS_MPAM0_EL1); 89 + 90 + WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval); 91 + } 92 + #else 93 + static inline void mpam_thread_switch(struct task_struct *tsk) {} 94 + #endif /* CONFIG_ARM64_MPAM */ 95 + 96 + #endif /* __ASM__MPAM_H */
+6
arch/arm64/include/asm/mte.h
··· 252 252 if (!kasan_hw_tags_enabled()) 253 253 return; 254 254 255 + if (!system_uses_mte_async_or_asymm_mode()) 256 + return; 257 + 255 258 mte_check_tfsr_el1(); 256 259 } 257 260 258 261 static inline void mte_check_tfsr_exit(void) 259 262 { 260 263 if (!kasan_hw_tags_enabled()) 264 + return; 265 + 266 + if (!system_uses_mte_async_or_asymm_mode()) 261 267 return; 262 268 263 269 /*
+5 -4
arch/arm64/include/asm/pgtable-hwdef.h
··· 223 223 */ 224 224 #define S1_TABLE_AP (_AT(pmdval_t, 3) << 61) 225 225 226 - #define TTBR_CNP_BIT (UL(1) << 0) 227 - 228 226 /* 229 227 * TCR flags. 230 228 */ ··· 285 287 #endif 286 288 287 289 #ifdef CONFIG_ARM64_VA_BITS_52 290 + #define PTRS_PER_PGD_52_VA (UL(1) << (52 - PGDIR_SHIFT)) 291 + #define PTRS_PER_PGD_48_VA (UL(1) << (48 - PGDIR_SHIFT)) 292 + #define PTRS_PER_PGD_EXTRA (PTRS_PER_PGD_52_VA - PTRS_PER_PGD_48_VA) 293 + 288 294 /* Must be at least 64-byte aligned to prevent corruption of the TTBR */ 289 - #define TTBR1_BADDR_4852_OFFSET (((UL(1) << (52 - PGDIR_SHIFT)) - \ 290 - (UL(1) << (48 - PGDIR_SHIFT))) * 8) 295 + #define TTBR1_BADDR_4852_OFFSET (PTRS_PER_PGD_EXTRA << PTDESC_ORDER) 291 296 #endif 292 297 293 298 #endif
+2
arch/arm64/include/asm/pgtable-prot.h
··· 25 25 */ 26 26 #define PTE_PRESENT_INVALID (PTE_NG) /* only when !PTE_VALID */ 27 27 28 + #define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG) 29 + 28 30 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP 29 31 #define PTE_UFFD_WP (_AT(pteval_t, 1) << 58) /* uffd-wp tracking */ 30 32 #define PTE_SWP_UFFD_WP (_AT(pteval_t, 1) << 3) /* only for swp ptes */
+40 -20
arch/arm64/include/asm/pgtable.h
··· 89 89 90 90 /* Set stride and tlb_level in flush_*_tlb_range */ 91 91 #define flush_pmd_tlb_range(vma, addr, end) \ 92 - __flush_tlb_range(vma, addr, end, PMD_SIZE, false, 2) 92 + __flush_tlb_range(vma, addr, end, PMD_SIZE, 2, TLBF_NONE) 93 93 #define flush_pud_tlb_range(vma, addr, end) \ 94 - __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) 94 + __flush_tlb_range(vma, addr, end, PUD_SIZE, 1, TLBF_NONE) 95 95 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 96 96 97 97 /* ··· 101 101 * entries exist. 102 102 */ 103 103 #define flush_tlb_fix_spurious_fault(vma, address, ptep) \ 104 - local_flush_tlb_page_nonotify(vma, address) 104 + __flush_tlb_page(vma, address, TLBF_NOBROADCAST | TLBF_NONOTIFY) 105 105 106 - #define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ 107 - local_flush_tlb_page_nonotify(vma, address) 106 + #define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ 107 + __flush_tlb_range(vma, address, address + PMD_SIZE, PMD_SIZE, 2, \ 108 + TLBF_NOBROADCAST | TLBF_NONOTIFY | TLBF_NOWALKCACHE) 108 109 109 110 /* 110 111 * ZERO_PAGE is a global shared page that is always zero: used ··· 323 322 return clear_pte_bit(pte, __pgprot(PTE_CONT)); 324 323 } 325 324 326 - static inline pte_t pte_mkvalid(pte_t pte) 325 + static inline pte_t pte_mkvalid_k(pte_t pte) 327 326 { 328 - return set_pte_bit(pte, __pgprot(PTE_VALID)); 327 + pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID)); 328 + pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL)); 329 + return pte; 329 330 } 330 331 331 332 static inline pte_t pte_mkinvalid(pte_t pte) ··· 597 594 #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd))) 598 595 #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd))) 599 596 #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd))) 597 + #define pmd_mkvalid_k(pmd) pte_pmd(pte_mkvalid_k(pmd_pte(pmd))) 600 598 #define pmd_mkinvalid(pmd) pte_pmd(pte_mkinvalid(pmd_pte(pmd))) 601 599 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP 602 600 #define 
pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd)) ··· 639 635 640 636 #define pud_young(pud) pte_young(pud_pte(pud)) 641 637 #define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) 638 + #define pud_mkwrite_novma(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud))) 639 + #define pud_mkvalid_k(pud) pte_pud(pte_mkvalid_k(pud_pte(pud))) 642 640 #define pud_write(pud) pte_write(pud_pte(pud)) 643 641 644 642 static inline pud_t pud_mkhuge(pud_t pud) ··· 785 779 786 780 #define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \ 787 781 PMD_TYPE_TABLE) 788 - #define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \ 789 - PMD_TYPE_SECT) 790 - #define pmd_leaf(pmd) (pmd_present(pmd) && !pmd_table(pmd)) 782 + 783 + #define pmd_leaf pmd_leaf 784 + static inline bool pmd_leaf(pmd_t pmd) 785 + { 786 + return pmd_present(pmd) && !pmd_table(pmd); 787 + } 788 + 791 789 #define pmd_bad(pmd) (!pmd_table(pmd)) 792 790 793 791 #define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE) ··· 809 799 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 810 800 811 801 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3 812 - static inline bool pud_sect(pud_t pud) { return false; } 813 802 static inline bool pud_table(pud_t pud) { return true; } 814 803 #else 815 - #define pud_sect(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \ 816 - PUD_TYPE_SECT) 817 804 #define pud_table(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \ 818 805 PUD_TYPE_TABLE) 819 806 #endif ··· 880 873 PUD_TYPE_TABLE) 881 874 #define pud_present(pud) pte_present(pud_pte(pud)) 882 875 #ifndef __PAGETABLE_PMD_FOLDED 883 - #define pud_leaf(pud) (pud_present(pud) && !pud_table(pud)) 876 + #define pud_leaf pud_leaf 877 + static inline bool pud_leaf(pud_t pud) 878 + { 879 + return pud_present(pud) && !pud_table(pud); 880 + } 884 881 #else 885 882 #define pud_leaf(pud) false 886 883 #endif ··· 1258 1247 return pte_pmd(pte_modify(pmd_pte(pmd), newprot)); 1259 1248 } 1260 1249 1261 - extern int __ptep_set_access_flags(struct 
vm_area_struct *vma, 1262 - unsigned long address, pte_t *ptep, 1263 - pte_t entry, int dirty); 1250 + extern int __ptep_set_access_flags_anysz(struct vm_area_struct *vma, 1251 + unsigned long address, pte_t *ptep, 1252 + pte_t entry, int dirty, 1253 + unsigned long pgsize); 1254 + 1255 + static inline int __ptep_set_access_flags(struct vm_area_struct *vma, 1256 + unsigned long address, pte_t *ptep, 1257 + pte_t entry, int dirty) 1258 + { 1259 + return __ptep_set_access_flags_anysz(vma, address, ptep, entry, dirty, 1260 + PAGE_SIZE); 1261 + } 1264 1262 1265 1263 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1266 1264 #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS ··· 1277 1257 unsigned long address, pmd_t *pmdp, 1278 1258 pmd_t entry, int dirty) 1279 1259 { 1280 - return __ptep_set_access_flags(vma, address, (pte_t *)pmdp, 1281 - pmd_pte(entry), dirty); 1260 + return __ptep_set_access_flags_anysz(vma, address, (pte_t *)pmdp, 1261 + pmd_pte(entry), dirty, PMD_SIZE); 1282 1262 } 1283 1263 #endif 1284 1264 ··· 1340 1320 * context-switch, which provides a DSB to complete the TLB 1341 1321 * invalidation. 1342 1322 */ 1343 - flush_tlb_page_nosync(vma, address); 1323 + __flush_tlb_page(vma, address, TLBF_NOSYNC); 1344 1324 } 1345 1325 1346 1326 return young;
+2
arch/arm64/include/asm/resctrl.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #include <linux/arm_mpam.h>
+19 -5
arch/arm64/include/asm/rwonce.h
··· 20 20 ARM64_HAS_LDAPR) 21 21 22 22 /* 23 + * Replace this with typeof_unqual() when minimum compiler versions are 24 + * increased to GCC 14 and Clang 19. For the time being, we need this 25 + * workaround, which relies on function return values dropping qualifiers. 26 + */ 27 + #define __rwonce_typeof_unqual(x) typeof(({ \ 28 + __diag_push() \ 29 + __diag_ignore_all("-Wignored-qualifiers", "") \ 30 + ((typeof(x)(*)(void))0)(); \ 31 + __diag_pop() })) 32 + 33 + /* 23 34 * When building with LTO, there is an increased risk of the compiler 24 35 * converting an address dependency headed by a READ_ONCE() invocation 25 36 * into a control dependency and consequently allowing for harmful ··· 42 31 */ 43 32 #define __READ_ONCE(x) \ 44 33 ({ \ 45 - typeof(&(x)) __x = &(x); \ 46 - int atomic = 1; \ 47 - union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \ 34 + auto __x = &(x); \ 35 + auto __ret = (__rwonce_typeof_unqual(*__x) *)__x; \ 36 + /* Hides alias reassignment from Clang's -Wthread-safety. */ \ 37 + auto __retp = &__ret; \ 38 + union { typeof(*__ret) __val; char __c[1]; } __u; \ 39 + *__retp = &__u.__val; \ 48 40 switch (sizeof(x)) { \ 49 41 case 1: \ 50 42 asm volatile(__LOAD_RCPC(b, %w0, %1) \ ··· 70 56 : "Q" (*__x) : "memory"); \ 71 57 break; \ 72 58 default: \ 73 - atomic = 0; \ 59 + __u.__val = *(volatile typeof(*__x) *)__x; \ 74 60 } \ 75 - atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(*__x) *)__x);\ 61 + *__ret; \ 76 62 }) 77 63 78 64 #endif /* !BUILD_VDSO */
+8
arch/arm64/include/asm/scs.h
··· 10 10 #ifdef CONFIG_SHADOW_CALL_STACK 11 11 scs_sp .req x18 12 12 13 + .macro scs_load_current_base 14 + get_current_task scs_sp 15 + ldr scs_sp, [scs_sp, #TSK_TI_SCS_BASE] 16 + .endm 17 + 13 18 .macro scs_load_current 14 19 get_current_task scs_sp 15 20 ldr scs_sp, [scs_sp, #TSK_TI_SCS_SP] ··· 24 19 str scs_sp, [\tsk, #TSK_TI_SCS_SP] 25 20 .endm 26 21 #else 22 + .macro scs_load_current_base 23 + .endm 24 + 27 25 .macro scs_load_current 28 26 .endm 29 27
+3
arch/arm64/include/asm/thread_info.h
··· 42 42 void *scs_base; 43 43 void *scs_sp; 44 44 #endif 45 + #ifdef CONFIG_ARM64_MPAM 46 + u64 mpam_partid_pmg; 47 + #endif 45 48 u32 cpu; 46 49 }; 47 50
+3 -3
arch/arm64/include/asm/tlb.h
··· 53 53 static inline void tlb_flush(struct mmu_gather *tlb) 54 54 { 55 55 struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0); 56 - bool last_level = !tlb->freed_tables; 56 + tlbf_t flags = tlb->freed_tables ? TLBF_NONE : TLBF_NOWALKCACHE; 57 57 unsigned long stride = tlb_get_unmap_size(tlb); 58 58 int tlb_level = tlb_get_level(tlb); 59 59 ··· 63 63 * reallocate our ASID without invalidating the entire TLB. 64 64 */ 65 65 if (tlb->fullmm) { 66 - if (!last_level) 66 + if (tlb->freed_tables) 67 67 flush_tlb_mm(tlb->mm); 68 68 return; 69 69 } 70 70 71 71 __flush_tlb_range(&vma, tlb->start, tlb->end, stride, 72 - last_level, tlb_level); 72 + tlb_level, flags); 73 73 } 74 74 75 75 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
+263 -208
arch/arm64/include/asm/tlbflush.h
··· 97 97 98 98 #define TLBI_TTL_UNKNOWN INT_MAX 99 99 100 - #define __tlbi_level(op, addr, level) do { \ 101 - u64 arg = addr; \ 102 - \ 103 - if (alternative_has_cap_unlikely(ARM64_HAS_ARMv8_4_TTL) && \ 104 - level >= 0 && level <= 3) { \ 105 - u64 ttl = level & 3; \ 106 - ttl |= get_trans_granule() << 2; \ 107 - arg &= ~TLBI_TTL_MASK; \ 108 - arg |= FIELD_PREP(TLBI_TTL_MASK, ttl); \ 109 - } \ 110 - \ 111 - __tlbi(op, arg); \ 112 - } while(0) 100 + typedef void (*tlbi_op)(u64 arg); 113 101 114 - #define __tlbi_user_level(op, arg, level) do { \ 115 - if (arm64_kernel_unmapped_at_el0()) \ 116 - __tlbi_level(op, (arg | USER_ASID_FLAG), level); \ 117 - } while (0) 102 + static __always_inline void vae1is(u64 arg) 103 + { 104 + __tlbi(vae1is, arg); 105 + __tlbi_user(vae1is, arg); 106 + } 107 + 108 + static __always_inline void vae2is(u64 arg) 109 + { 110 + __tlbi(vae2is, arg); 111 + } 112 + 113 + static __always_inline void vale1(u64 arg) 114 + { 115 + __tlbi(vale1, arg); 116 + __tlbi_user(vale1, arg); 117 + } 118 + 119 + static __always_inline void vale1is(u64 arg) 120 + { 121 + __tlbi(vale1is, arg); 122 + __tlbi_user(vale1is, arg); 123 + } 124 + 125 + static __always_inline void vale2is(u64 arg) 126 + { 127 + __tlbi(vale2is, arg); 128 + } 129 + 130 + static __always_inline void vaale1is(u64 arg) 131 + { 132 + __tlbi(vaale1is, arg); 133 + } 134 + 135 + static __always_inline void ipas2e1(u64 arg) 136 + { 137 + __tlbi(ipas2e1, arg); 138 + } 139 + 140 + static __always_inline void ipas2e1is(u64 arg) 141 + { 142 + __tlbi(ipas2e1is, arg); 143 + } 144 + 145 + static __always_inline void __tlbi_level_asid(tlbi_op op, u64 addr, u32 level, 146 + u16 asid) 147 + { 148 + u64 arg = __TLBI_VADDR(addr, asid); 149 + 150 + if (alternative_has_cap_unlikely(ARM64_HAS_ARMv8_4_TTL) && level <= 3) { 151 + u64 ttl = level | (get_trans_granule() << 2); 152 + 153 + FIELD_MODIFY(TLBI_TTL_MASK, &arg, ttl); 154 + } 155 + 156 + op(arg); 157 + } 158 + 159 + static inline void 
__tlbi_level(tlbi_op op, u64 addr, u32 level) 160 + { 161 + __tlbi_level_asid(op, addr, level, 0); 162 + } 118 163 119 164 /* 120 165 * This macro creates a properly formatted VA operand for the TLB RANGE. The ··· 186 141 #define TLBIR_TTL_MASK GENMASK_ULL(38, 37) 187 142 #define TLBIR_BADDR_MASK GENMASK_ULL(36, 0) 188 143 189 - #define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \ 190 - ({ \ 191 - unsigned long __ta = 0; \ 192 - unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \ 193 - __ta |= FIELD_PREP(TLBIR_BADDR_MASK, baddr); \ 194 - __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \ 195 - __ta |= FIELD_PREP(TLBIR_NUM_MASK, num); \ 196 - __ta |= FIELD_PREP(TLBIR_SCALE_MASK, scale); \ 197 - __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \ 198 - __ta |= FIELD_PREP(TLBIR_ASID_MASK, asid); \ 199 - __ta; \ 200 - }) 201 - 202 144 /* These macros are used by the TLBI RANGE feature. */ 203 145 #define __TLBI_RANGE_PAGES(num, scale) \ 204 146 ((unsigned long)((num) + 1) << (5 * (scale) + 1)) ··· 199 167 * range. 200 168 */ 201 169 #define __TLBI_RANGE_NUM(pages, scale) \ 202 - ({ \ 203 - int __pages = min((pages), \ 204 - __TLBI_RANGE_PAGES(31, (scale))); \ 205 - (__pages >> (5 * (scale) + 1)) - 1; \ 206 - }) 170 + (((pages) >> (5 * (scale) + 1)) - 1) 207 171 208 172 #define __repeat_tlbi_sync(op, arg...) \ 209 173 do { \ ··· 269 241 * unmapping pages from vmalloc/io space. 270 242 * 271 243 * flush_tlb_page(vma, addr) 272 - * Invalidate a single user mapping for address 'addr' in the 273 - * address space corresponding to 'vma->mm'. Note that this 274 - * operation only invalidates a single, last-level page-table 275 - * entry and therefore does not affect any walk-caches. 244 + * Equivalent to __flush_tlb_page(..., flags=TLBF_NONE) 276 245 * 277 246 * 278 247 * Next, we have some undocumented invalidation routines that you probably ··· 283 258 * CPUs, ensuring that any walk-cache entries associated with the 284 259 * translation are also invalidated. 
285 260 * 286 - * __flush_tlb_range(vma, start, end, stride, last_level, tlb_level) 261 + * __flush_tlb_range(vma, start, end, stride, tlb_level, flags) 287 262 * Invalidate the virtual-address range '[start, end)' on all 288 263 * CPUs for the user address space corresponding to 'vma->mm'. 289 264 * The invalidation operations are issued at a granularity 290 - * determined by 'stride' and only affect any walk-cache entries 291 - * if 'last_level' is equal to false. tlb_level is the level at 265 + * determined by 'stride'. tlb_level is the level at 292 266 * which the invalidation must take place. If the level is wrong, 293 267 * no invalidation may take place. In the case where the level 294 268 * cannot be easily determined, the value TLBI_TTL_UNKNOWN will 295 - * perform a non-hinted invalidation. 269 + * perform a non-hinted invalidation. flags may be TLBF_NONE (0) or 270 + * any combination of TLBF_NOWALKCACHE (elide eviction of walk 271 + * cache entries), TLBF_NONOTIFY (don't call mmu notifiers), 272 + * TLBF_NOSYNC (don't issue trailing dsb) and TLBF_NOBROADCAST 273 + * (only perform the invalidation for the local cpu). 296 274 * 297 - * local_flush_tlb_page(vma, addr) 298 - * Local variant of flush_tlb_page(). Stale TLB entries may 299 - * remain in remote CPUs. 300 - * 301 - * local_flush_tlb_page_nonotify(vma, addr) 302 - * Same as local_flush_tlb_page() except MMU notifier will not be 303 - * called. 304 - * 305 - * local_flush_tlb_contpte(vma, addr) 306 - * Invalidate the virtual-address range 307 - * '[addr, addr+CONT_PTE_SIZE)' mapped with contpte on local CPU 308 - * for the user address space corresponding to 'vma->mm'. Stale 309 - * TLB entries may remain in remote CPUs. 275 + * __flush_tlb_page(vma, addr, flags) 276 + * Invalidate a single user mapping for address 'addr' in the 277 + * address space corresponding to 'vma->mm'. 
Note that this 278 + * operation only invalidates a single level 3 page-table entry 279 + * and therefore does not affect any walk-caches. flags may contain 280 + * any combination of TLBF_NONOTIFY (don't call mmu notifiers), 281 + * TLBF_NOSYNC (don't issue trailing dsb) and TLBF_NOBROADCAST 282 + * (only perform the invalidation for the local cpu). 310 283 * 311 284 * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented 312 285 * on top of these routines, since that is our interface to the mmu_gather ··· 338 315 mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); 339 316 } 340 317 341 - static inline void __local_flush_tlb_page_nonotify_nosync(struct mm_struct *mm, 342 - unsigned long uaddr) 343 - { 344 - unsigned long addr; 345 - 346 - dsb(nshst); 347 - addr = __TLBI_VADDR(uaddr, ASID(mm)); 348 - __tlbi(vale1, addr); 349 - __tlbi_user(vale1, addr); 350 - } 351 - 352 - static inline void local_flush_tlb_page_nonotify(struct vm_area_struct *vma, 353 - unsigned long uaddr) 354 - { 355 - __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); 356 - dsb(nsh); 357 - } 358 - 359 - static inline void local_flush_tlb_page(struct vm_area_struct *vma, 360 - unsigned long uaddr) 361 - { 362 - __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); 363 - mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, 364 - (uaddr & PAGE_MASK) + PAGE_SIZE); 365 - dsb(nsh); 366 - } 367 - 368 - static inline void __flush_tlb_page_nosync(struct mm_struct *mm, 369 - unsigned long uaddr) 370 - { 371 - unsigned long addr; 372 - 373 - dsb(ishst); 374 - addr = __TLBI_VADDR(uaddr, ASID(mm)); 375 - __tlbi(vale1is, addr); 376 - __tlbi_user(vale1is, addr); 377 - mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, 378 - (uaddr & PAGE_MASK) + PAGE_SIZE); 379 - } 380 - 381 - static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, 382 - unsigned long uaddr) 383 - { 384 - return __flush_tlb_page_nosync(vma->vm_mm, uaddr); 
385 - } 386 - 387 - static inline void flush_tlb_page(struct vm_area_struct *vma, 388 - unsigned long uaddr) 389 - { 390 - flush_tlb_page_nosync(vma, uaddr); 391 - __tlbi_sync_s1ish(); 392 - } 393 - 394 318 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) 395 319 { 396 320 return true; ··· 367 397 /* 368 398 * __flush_tlb_range_op - Perform TLBI operation upon a range 369 399 * 370 - * @op: TLBI instruction that operates on a range (has 'r' prefix) 400 + * @lop: TLBI level operation to perform 401 + * @rop: TLBI range operation to perform 371 402 * @start: The start address of the range 372 403 * @pages: Range as the number of pages from 'start' 373 404 * @stride: Flush granularity 374 405 * @asid: The ASID of the task (0 for IPA instructions) 375 - * @tlb_level: Translation Table level hint, if known 376 - * @tlbi_user: If 'true', call an additional __tlbi_user() 377 - * (typically for user ASIDs). 'flase' for IPA instructions 406 + * @level: Translation Table level hint, if known 378 407 * @lpa2: If 'true', the lpa2 scheme is used as set out below 379 408 * 380 409 * When the CPU does not support TLB range operations, flush the TLB ··· 396 427 * operations can only span an even number of pages. We save this for last to 397 428 * ensure 64KB start alignment is maintained for the LPA2 case. 398 429 */ 399 - #define __flush_tlb_range_op(op, start, pages, stride, \ 400 - asid, tlb_level, tlbi_user, lpa2) \ 401 - do { \ 402 - typeof(start) __flush_start = start; \ 403 - typeof(pages) __flush_pages = pages; \ 404 - int num = 0; \ 405 - int scale = 3; \ 406 - int shift = lpa2 ? 
16 : PAGE_SHIFT; \ 407 - unsigned long addr; \ 408 - \ 409 - while (__flush_pages > 0) { \ 410 - if (!system_supports_tlb_range() || \ 411 - __flush_pages == 1 || \ 412 - (lpa2 && __flush_start != ALIGN(__flush_start, SZ_64K))) { \ 413 - addr = __TLBI_VADDR(__flush_start, asid); \ 414 - __tlbi_level(op, addr, tlb_level); \ 415 - if (tlbi_user) \ 416 - __tlbi_user_level(op, addr, tlb_level); \ 417 - __flush_start += stride; \ 418 - __flush_pages -= stride >> PAGE_SHIFT; \ 419 - continue; \ 420 - } \ 421 - \ 422 - num = __TLBI_RANGE_NUM(__flush_pages, scale); \ 423 - if (num >= 0) { \ 424 - addr = __TLBI_VADDR_RANGE(__flush_start >> shift, asid, \ 425 - scale, num, tlb_level); \ 426 - __tlbi(r##op, addr); \ 427 - if (tlbi_user) \ 428 - __tlbi_user(r##op, addr); \ 429 - __flush_start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \ 430 - __flush_pages -= __TLBI_RANGE_PAGES(num, scale);\ 431 - } \ 432 - scale--; \ 433 - } \ 434 - } while (0) 435 - 436 - #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \ 437 - __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, kvm_lpa2_is_enabled()); 438 - 439 - static inline bool __flush_tlb_range_limit_excess(unsigned long start, 440 - unsigned long end, unsigned long pages, unsigned long stride) 430 + static __always_inline void rvae1is(u64 arg) 441 431 { 442 - /* 443 - * When the system does not support TLB range based flush 444 - * operation, (MAX_DVM_OPS - 1) pages can be handled. But 445 - * with TLB range based operation, MAX_TLBI_RANGE_PAGES 446 - * pages can be handled. 
- */
-	if ((!system_supports_tlb_range() &&
-	     (end - start) >= (MAX_DVM_OPS * stride)) ||
-	    pages > MAX_TLBI_RANGE_PAGES)
-		return true;
-
-	return false;
+	__tlbi(rvae1is, arg);
+	__tlbi_user(rvae1is, arg);
 }
 
-static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
-					    unsigned long start, unsigned long end,
-					    unsigned long stride, bool last_level,
-					    int tlb_level)
+static __always_inline void rvale1(u64 arg)
 {
+	__tlbi(rvale1, arg);
+	__tlbi_user(rvale1, arg);
+}
+
+static __always_inline void rvale1is(u64 arg)
+{
+	__tlbi(rvale1is, arg);
+	__tlbi_user(rvale1is, arg);
+}
+
+static __always_inline void rvaale1is(u64 arg)
+{
+	__tlbi(rvaale1is, arg);
+}
+
+static __always_inline void ripas2e1is(u64 arg)
+{
+	__tlbi(ripas2e1is, arg);
+}
+
+static __always_inline void __tlbi_range(tlbi_op op, u64 addr,
+					 u16 asid, int scale, int num,
+					 u32 level, bool lpa2)
+{
+	u64 arg = 0;
+
+	arg |= FIELD_PREP(TLBIR_BADDR_MASK, addr >> (lpa2 ? 16 : PAGE_SHIFT));
+	arg |= FIELD_PREP(TLBIR_TTL_MASK, level > 3 ? 0 : level);
+	arg |= FIELD_PREP(TLBIR_NUM_MASK, num);
+	arg |= FIELD_PREP(TLBIR_SCALE_MASK, scale);
+	arg |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule());
+	arg |= FIELD_PREP(TLBIR_ASID_MASK, asid);
+
+	op(arg);
+}
+
+static __always_inline void __flush_tlb_range_op(tlbi_op lop, tlbi_op rop,
+						 u64 start, size_t pages,
+						 u64 stride, u16 asid,
+						 u32 level, bool lpa2)
+{
+	u64 addr = start, end = start + pages * PAGE_SIZE;
+	int scale = 3;
+
+	while (addr != end) {
+		int num;
+
+		pages = (end - addr) >> PAGE_SHIFT;
+
+		if (!system_supports_tlb_range() || pages == 1)
+			goto invalidate_one;
+
+		if (lpa2 && !IS_ALIGNED(addr, SZ_64K))
+			goto invalidate_one;
+
+		num = __TLBI_RANGE_NUM(pages, scale);
+		if (num >= 0) {
+			__tlbi_range(rop, addr, asid, scale, num, level, lpa2);
+			addr += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
+		}
+
+		scale--;
+		continue;
+invalidate_one:
+		__tlbi_level_asid(lop, addr, level, asid);
+		addr += stride;
+	}
+}
+
+#define __flush_s1_tlb_range_op(op, start, pages, stride, asid, tlb_level) \
+	__flush_tlb_range_op(op, r##op, start, pages, stride, asid, tlb_level, lpa2_is_enabled())
+
+#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
+	__flush_tlb_range_op(op, r##op, start, pages, stride, 0, tlb_level, kvm_lpa2_is_enabled())
+
+static inline bool __flush_tlb_range_limit_excess(unsigned long pages,
+						  unsigned long stride)
+{
+	/*
+	 * Assume that the worst case number of DVM ops required to flush a
+	 * given range on a system that supports tlb-range is 20 (4 scales, 1
+	 * final page, 15 for alignment on LPA2 systems), which is much smaller
+	 * than MAX_DVM_OPS.
+	 */
+	if (system_supports_tlb_range())
+		return pages > MAX_TLBI_RANGE_PAGES;
+
+	return pages >= (MAX_DVM_OPS * stride) >> PAGE_SHIFT;
+}
+
+typedef unsigned __bitwise tlbf_t;
+
+/* No special behaviour. */
+#define TLBF_NONE		((__force tlbf_t)0)
+
+/* Invalidate tlb entries only, leaving the page table walk cache intact. */
+#define TLBF_NOWALKCACHE	((__force tlbf_t)BIT(0))
+
+/* Skip the trailing dsb after issuing tlbi. */
+#define TLBF_NOSYNC		((__force tlbf_t)BIT(1))
+
+/* Suppress tlb notifier callbacks for this flush operation. */
+#define TLBF_NONOTIFY		((__force tlbf_t)BIT(2))
+
+/* Perform the tlbi locally without broadcasting to other CPUs. */
+#define TLBF_NOBROADCAST	((__force tlbf_t)BIT(3))
+
+static __always_inline void __do_flush_tlb_range(struct vm_area_struct *vma,
+						 unsigned long start, unsigned long end,
+						 unsigned long stride, int tlb_level,
+						 tlbf_t flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
 	unsigned long asid, pages;
 
-	start = round_down(start, stride);
-	end = round_up(end, stride);
 	pages = (end - start) >> PAGE_SHIFT;
 
-	if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
+	if (__flush_tlb_range_limit_excess(pages, stride)) {
 		flush_tlb_mm(mm);
 		return;
 	}
 
-	dsb(ishst);
+	if (!(flags & TLBF_NOBROADCAST))
+		dsb(ishst);
+	else
+		dsb(nshst);
+
 	asid = ASID(mm);
 
-	if (last_level)
-		__flush_tlb_range_op(vale1is, start, pages, stride, asid,
-				     tlb_level, true, lpa2_is_enabled());
-	else
-		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
-				     tlb_level, true, lpa2_is_enabled());
+	switch (flags & (TLBF_NOWALKCACHE | TLBF_NOBROADCAST)) {
+	case TLBF_NONE:
+		__flush_s1_tlb_range_op(vae1is, start, pages, stride,
+					asid, tlb_level);
+		break;
+	case TLBF_NOWALKCACHE:
+		__flush_s1_tlb_range_op(vale1is, start, pages, stride,
+					asid, tlb_level);
+		break;
+	case TLBF_NOBROADCAST:
+		/* Combination unused */
+		BUG();
+		break;
+	case TLBF_NOWALKCACHE | TLBF_NOBROADCAST:
+		__flush_s1_tlb_range_op(vale1, start, pages, stride,
+					asid, tlb_level);
+		break;
+	}
 
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
+	if (!(flags & TLBF_NONOTIFY))
+		mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
+
+	if (!(flags & TLBF_NOSYNC)) {
+		if (!(flags & TLBF_NOBROADCAST))
+			__tlbi_sync_s1ish();
+		else
+			dsb(nsh);
+	}
 }
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
-				     unsigned long stride, bool last_level,
-				     int tlb_level)
+				     unsigned long stride, int tlb_level,
+				     tlbf_t flags)
 {
-	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
-				 last_level, tlb_level);
-	__tlbi_sync_s1ish();
-}
-
-static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
-					   unsigned long addr)
-{
-	unsigned long asid;
-
-	addr = round_down(addr, CONT_PTE_SIZE);
-
-	dsb(nshst);
-	asid = ASID(vma->vm_mm);
-	__flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid,
-			     3, true, lpa2_is_enabled());
-	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr,
-						    addr + CONT_PTE_SIZE);
-	dsb(nsh);
+	start = round_down(start, stride);
+	end = round_up(end, stride);
+	__do_flush_tlb_range(vma, start, end, stride, tlb_level, flags);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
···
	 * Set the tlb_level to TLBI_TTL_UNKNOWN because we can not get enough
	 * information here.
	 */
-	__flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
+	__flush_tlb_range(vma, start, end, PAGE_SIZE, TLBI_TTL_UNKNOWN, TLBF_NONE);
+}
+
+static inline void __flush_tlb_page(struct vm_area_struct *vma,
+				    unsigned long uaddr, tlbf_t flags)
+{
+	unsigned long start = round_down(uaddr, PAGE_SIZE);
+	unsigned long end = start + PAGE_SIZE;
+
+	__do_flush_tlb_range(vma, start, end, PAGE_SIZE, 3,
+			     TLBF_NOWALKCACHE | flags);
+}
+
+static inline void flush_tlb_page(struct vm_area_struct *vma,
+				  unsigned long uaddr)
+{
+	__flush_tlb_page(vma, uaddr, TLBF_NONE);
 }
 
 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
···
 	end = round_up(end, stride);
 	pages = (end - start) >> PAGE_SHIFT;
 
-	if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
+	if (__flush_tlb_range_limit_excess(pages, stride)) {
 		flush_tlb_all();
 		return;
 	}
 
 	dsb(ishst);
-	__flush_tlb_range_op(vaale1is, start, pages, stride, 0,
-			     TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
+	__flush_s1_tlb_range_op(vaale1is, start, pages, stride, 0,
+				TLBI_TTL_UNKNOWN);
 	__tlbi_sync_s1ish();
 	isb();
 }
···
 static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 					     struct mm_struct *mm, unsigned long start, unsigned long end)
 {
-	__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
+	struct vm_area_struct vma = { .vm_mm = mm, .vm_flags = 0 };
+
+	__flush_tlb_range(&vma, start, end, PAGE_SIZE, 3,
+			  TLBF_NOWALKCACHE | TLBF_NOSYNC);
 }
 
 static inline bool __pte_flags_need_flush(ptdesc_t oldval, ptdesc_t newval)
···
 }
 #define huge_pmd_needs_flush huge_pmd_needs_flush
 
+#undef __tlbi_user
+#undef __TLBI_VADDR
 #endif
 
 #endif
+3 -3
arch/arm64/include/asm/uaccess.h
···
 
 	local_irq_save(flags);
 	ttbr = read_sysreg(ttbr1_el1);
-	ttbr &= ~TTBR_ASID_MASK;
+	ttbr &= ~TTBRx_EL1_ASID_MASK;
 	/* reserved_pg_dir placed before swapper_pg_dir */
 	write_sysreg(ttbr - RESERVED_SWAPPER_OFFSET, ttbr0_el1);
 	/* Set reserved ASID */
···
 
 	/* Restore active ASID */
 	ttbr1 = read_sysreg(ttbr1_el1);
-	ttbr1 &= ~TTBR_ASID_MASK;	/* safety measure */
-	ttbr1 |= ttbr0 & TTBR_ASID_MASK;
+	ttbr1 &= ~TTBRx_EL1_ASID_MASK;	/* safety measure */
+	ttbr1 |= ttbr0 & TTBRx_EL1_ASID_MASK;
 	write_sysreg(ttbr1, ttbr1_el1);
 
 	/* Restore user page table */
+1
arch/arm64/kernel/Makefile
···
 obj-$(CONFIG_VMCORE_INFO)		+= vmcore_info.o
 obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
 obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
+obj-$(CONFIG_ARM64_MPAM)		+= mpam.o
 obj-$(CONFIG_ARM64_MTE)			+= mte.o
 obj-y					+= vdso-wrap.o
 obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
+14
arch/arm64/kernel/armv8_deprecated.c
···
 }
 
 #endif
+
+#ifdef CONFIG_SWP_EMULATION
+	/*
+	 * The purpose of supporting LSUI is to eliminate PAN toggling. CPUs
+	 * that support LSUI are unlikely to support a 32-bit runtime. Rather
+	 * than emulating the SWP instruction using LSUI instructions, simply
+	 * disable SWP emulation.
+	 */
+	if (cpus_have_final_cap(ARM64_HAS_LSUI)) {
+		insn_swp.status = INSN_UNAVAILABLE;
+		pr_info("swp/swpb instruction emulation is not supported on this system\n");
+	}
+#endif
+
 	for (int i = 0; i < ARRAY_SIZE(insn_emulations); i++) {
 		struct insn_emulation *ie = insn_emulations[i];
 
+24 -7
arch/arm64/kernel/cpufeature.c
···
 #include <asm/kvm_host.h>
 #include <asm/mmu.h>
 #include <asm/mmu_context.h>
+#include <asm/mpam.h>
 #include <asm/mte.h>
 #include <asm/hypervisor.h>
 #include <asm/processor.h>
···
 
 static const struct arm64_ftr_bits ftr_id_aa64isar3[] = {
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FPRCVT_SHIFT, 4, 0),
+	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSUI_SHIFT, 4, ID_AA64ISAR3_EL1_LSUI_NI),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSFE_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FAMINMAX_SHIFT, 4, 0),
 	ARM64_FTR_END,
···
 static void
 cpu_enable_mpam(const struct arm64_cpu_capabilities *entry)
 {
-	/*
-	 * Access by the kernel (at EL1) should use the reserved PARTID
-	 * which is configured unrestricted. This avoids priority-inversion
-	 * where latency sensitive tasks have to wait for a task that has
-	 * been throttled to release the lock.
-	 */
-	write_sysreg_s(0, SYS_MPAM1_EL1);
+	int cpu = smp_processor_id();
+	u64 regval = 0;
+
+	if (IS_ENABLED(CONFIG_ARM64_MPAM) && static_branch_likely(&mpam_enabled))
+		regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
+
+	write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
+	if (cpus_have_cap(ARM64_SME))
+		write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D), SYS_MPAMSM_EL1);
+	isb();
+
+	/* Synchronising the EL0 write is left until the ERET to EL0 */
+	write_sysreg_s(regval, SYS_MPAM0_EL1);
 }
 
 static bool
···
 		.cpu_enable = cpu_enable_ls64_v,
 		ARM64_CPUID_FIELDS(ID_AA64ISAR1_EL1, LS64, LS64_V)
 	},
+#ifdef CONFIG_ARM64_LSUI
+	{
+		.desc = "Unprivileged Load Store Instructions (LSUI)",
+		.capability = ARM64_HAS_LSUI,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_cpuid_feature,
+		ARM64_CPUID_FIELDS(ID_AA64ISAR3_EL1, LSUI, IMP)
+	},
+#endif
 	{},
 };
+24 -28
arch/arm64/kernel/entry-common.c
···
  * Before this function is called it is not safe to call regular kernel code,
  * instrumentable code, or any code which may trigger an exception.
  */
-static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
+static noinstr irqentry_state_t arm64_enter_from_kernel_mode(struct pt_regs *regs)
 {
 	irqentry_state_t state;
 
-	state = irqentry_enter(regs);
+	state = irqentry_enter_from_kernel_mode(regs);
 	mte_check_tfsr_entry();
 	mte_disable_tco_entry(current);
 
···
  * After this function returns it is not safe to call regular kernel code,
  * instrumentable code, or any code which may trigger an exception.
  */
-static void noinstr exit_to_kernel_mode(struct pt_regs *regs,
-					irqentry_state_t state)
+static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs,
+					      irqentry_state_t state)
 {
+	local_irq_disable();
+	irqentry_exit_to_kernel_mode_preempt(regs, state);
+	local_daif_mask();
 	mte_check_tfsr_exit();
-	irqentry_exit(regs, state);
+	irqentry_exit_to_kernel_mode_after_preempt(regs, state);
 }
 
 /*
···
 	unsigned long far = read_sysreg(far_el1);
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_mem_abort(far, esr, regs);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 static void noinstr el1_pc(struct pt_regs *regs, unsigned long esr)
···
 	unsigned long far = read_sysreg(far_el1);
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_sp_pc_abort(far, esr, regs);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 static void noinstr el1_undef(struct pt_regs *regs, unsigned long esr)
 {
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_el1_undef(regs, esr);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 static void noinstr el1_bti(struct pt_regs *regs, unsigned long esr)
 {
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_el1_bti(regs, esr);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 static void noinstr el1_gcs(struct pt_regs *regs, unsigned long esr)
 {
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_el1_gcs(regs, esr);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 static void noinstr el1_mops(struct pt_regs *regs, unsigned long esr)
 {
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_el1_mops(regs, esr);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 static void noinstr el1_breakpt(struct pt_regs *regs, unsigned long esr)
···
 {
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 	local_daif_inherit(regs);
 	do_el1_fpac(regs, esr);
-	local_daif_mask();
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 
 asmlinkage void noinstr el1h_64_sync_handler(struct pt_regs *regs)
···
 {
 	irqentry_state_t state;
 
-	state = enter_from_kernel_mode(regs);
+	state = arm64_enter_from_kernel_mode(regs);
 
 	irq_enter_rcu();
 	do_interrupt_handler(regs, handler);
 	irq_exit_rcu();
 
-	exit_to_kernel_mode(regs, state);
+	arm64_exit_to_kernel_mode(regs, state);
 }
 static void noinstr el1_interrupt(struct pt_regs *regs,
 				  void (*handler)(struct pt_regs *))
+2 -4
arch/arm64/kernel/entry.S
···
 alternative_else_nop_endif
 1:
 
-	scs_load_current
 	.else
 	add	x21, sp, #PT_REGS_SIZE
 	get_current_task tsk
···
 alternative_else_nop_endif
 #endif
 3:
-	scs_save tsk
-
 	/* Ignore asynchronous tag check faults in the uaccess routines */
 	ldr	x0, [tsk, THREAD_SCTLR_USER]
 	clear_mte_async_tcf x0
···
  */
 SYM_CODE_START_LOCAL(__swpan_entry_el1)
 	mrs	x21, ttbr0_el1
-	tst	x21, #TTBR_ASID_MASK		// Check for the reserved ASID
+	tst	x21, #TTBRx_EL1_ASID_MASK	// Check for the reserved ASID
 	orr	x23, x23, #PSR_PAN_BIT		// Set the emulated PAN in the saved SPSR
 	b.eq	1f				// TTBR0 access already disabled
 	and	x23, x23, #~PSR_PAN_BIT		// Clear the emulated PAN in the saved SPSR
-3
arch/arm64/kernel/machine_kexec.c
···
 	}
 
 	/* Create a copy of the linear map */
-	trans_pgd = kexec_page_alloc(kimage);
-	if (!trans_pgd)
-		return -ENOMEM;
 	rc = trans_pgd_create_copy(&info, &trans_pgd, PAGE_OFFSET, PAGE_END);
 	if (rc)
 		return rc;
+62
arch/arm64/kernel/mpam.c
···
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025 Arm Ltd. */
+
+#include <asm/mpam.h>
+
+#include <linux/arm_mpam.h>
+#include <linux/cpu_pm.h>
+#include <linux/jump_label.h>
+#include <linux/percpu.h>
+
+DEFINE_STATIC_KEY_FALSE(mpam_enabled);
+DEFINE_PER_CPU(u64, arm64_mpam_default);
+DEFINE_PER_CPU(u64, arm64_mpam_current);
+
+u64 arm64_mpam_global_default;
+
+static int mpam_pm_notifier(struct notifier_block *self,
+			    unsigned long cmd, void *v)
+{
+	u64 regval;
+	int cpu = smp_processor_id();
+
+	switch (cmd) {
+	case CPU_PM_EXIT:
+		/*
+		 * Don't use mpam_thread_switch() as the system register
+		 * value has changed under our feet.
+		 */
+		regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
+		write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
+		if (system_supports_sme()) {
+			write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D),
+				       SYS_MPAMSM_EL1);
+		}
+		isb();
+
+		write_sysreg_s(regval, SYS_MPAM0_EL1);
+
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block mpam_pm_nb = {
+	.notifier_call = mpam_pm_notifier,
+};
+
+static int __init arm64_mpam_register_cpus(void)
+{
+	u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
+	u16 partid_max = FIELD_GET(MPAMIDR_EL1_PARTID_MAX, mpamidr);
+	u8 pmg_max = FIELD_GET(MPAMIDR_EL1_PMG_MAX, mpamidr);
+
+	if (!system_supports_mpam())
+		return 0;
+
+	cpu_pm_register_notifier(&mpam_pm_nb);
+	return mpam_register_requestor(partid_max, pmg_max);
+}
+/* Must occur before mpam_msc_driver_init() from subsys_initcall() */
+arch_initcall(arm64_mpam_register_cpus)
+8 -2
arch/arm64/kernel/mte.c
···
 	/* TCO may not have been disabled on exception entry for the current task. */
 	mte_disable_tco_entry(next);
 
+	if (!system_uses_mte_async_or_asymm_mode())
+		return;
+
 	/*
 	 * Check if an async tag exception occurred at EL1.
 	 *
···
 	 * CnP is not a boot feature so MTE gets enabled before CnP, but let's
 	 * make sure that is the case.
 	 */
-	BUG_ON(read_sysreg(ttbr0_el1) & TTBR_CNP_BIT);
-	BUG_ON(read_sysreg(ttbr1_el1) & TTBR_CNP_BIT);
+	BUG_ON(read_sysreg(ttbr0_el1) & TTBRx_EL1_CnP);
+	BUG_ON(read_sysreg(ttbr1_el1) & TTBRx_EL1_CnP);
 
 	/* Normal Tagged memory type at the corresponding MAIR index */
 	sysreg_clear_set(mair_el1,
···
 void mte_suspend_enter(void)
 {
 	if (!system_supports_mte())
+		return;
+
+	if (!system_uses_mte_async_or_asymm_mode())
 		return;
 
 	/*
+32
arch/arm64/kernel/process.c
···
 #include <asm/fpsimd.h>
 #include <asm/gcs.h>
 #include <asm/mmu_context.h>
+#include <asm/mpam.h>
 #include <asm/mte.h>
 #include <asm/processor.h>
 #include <asm/pointer_auth.h>
···
 	isb();
 }
 
+static inline void debug_switch_state(void)
+{
+	if (system_uses_irq_prio_masking()) {
+		unsigned long daif_expected = 0;
+		unsigned long daif_actual = read_sysreg(daif);
+		unsigned long pmr_expected = GIC_PRIO_IRQOFF;
+		unsigned long pmr_actual = read_sysreg_s(SYS_ICC_PMR_EL1);
+
+		WARN_ONCE(daif_actual != daif_expected ||
+			  pmr_actual != pmr_expected,
+			  "Unexpected DAIF + PMR: 0x%lx + 0x%lx (expected 0x%lx + 0x%lx)\n",
+			  daif_actual, pmr_actual,
+			  daif_expected, pmr_expected);
+	} else {
+		unsigned long daif_expected = DAIF_PROCCTX_NOIRQ;
+		unsigned long daif_actual = read_sysreg(daif);
+
+		WARN_ONCE(daif_actual != daif_expected,
+			  "Unexpected DAIF value: 0x%lx (expected 0x%lx)\n",
+			  daif_actual, daif_expected);
+	}
+}
+
 /*
  * Thread switching.
  */
···
 			      struct task_struct *next)
 {
 	struct task_struct *last;
+
+	debug_switch_state();
 
 	fpsimd_thread_switch(next);
 	tls_thread_switch(next);
···
 	/* avoid expensive SCTLR_EL1 accesses if no change */
 	if (prev->thread.sctlr_user != next->thread.sctlr_user)
 		update_sctlr_el1(next->thread.sctlr_user);
+
+	/*
+	 * MPAM thread switch happens after the DSB to ensure prev's accesses
+	 * use prev's MPAM settings.
+	 */
+	mpam_thread_switch(next);
 
 	/* the actual thread switch */
 	last = cpu_switch_to(prev, next);
+1 -1
arch/arm64/kernel/rsi.c
···
 		return;
 	if (!rsi_version_matches())
 		return;
-	if (WARN_ON(rsi_get_realm_config(&config)))
+	if (WARN_ON(rsi_get_realm_config(lm_alias(&config))))
 		return;
 	prot_ns_shared = BIT(config.ipa_bits - 1);
 
+1 -1
arch/arm64/kernel/sys_compat.c
···
 	 * The workaround requires an inner-shareable tlbi.
 	 * We pick the reserved-ASID to minimise the impact.
 	 */
-	__tlbi(aside1is, __TLBI_VADDR(0, 0));
+	__tlbi(aside1is, 0UL);
 	__tlbi_sync_s1ish();
 }
 
+33 -1
arch/arm64/kvm/at.c
···
 #include <asm/esr.h>
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
+#include <asm/lsui.h>
 
 static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool s1ptw)
 {
···
 	}
 }
 
+static int __lsui_swap_desc(u64 __user *ptep, u64 old, u64 new)
+{
+	u64 tmp = old;
+	int ret = 0;
+
+	/*
+	 * Wrap LSUI instructions with uaccess_ttbr0_enable()/disable(),
+	 * as PAN toggling is not required.
+	 */
+	uaccess_ttbr0_enable();
+
+	asm volatile(__LSUI_PREAMBLE
+		     "1:	cast	%[old], %[new], %[addr]\n"
+		     "2:\n"
+		     _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret])
+		     : [old] "+r" (old), [addr] "+Q" (*ptep), [ret] "+r" (ret)
+		     : [new] "r" (new)
+		     : "memory");
+
+	uaccess_ttbr0_disable();
+
+	if (ret)
+		return ret;
+	if (tmp != old)
+		return -EAGAIN;
+
+	return ret;
+}
+
 static int __lse_swap_desc(u64 __user *ptep, u64 old, u64 new)
 {
 	u64 tmp = old;
···
 		return -EPERM;
 
 	ptep = (u64 __user *)hva + offset;
-	if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS))
+	if (cpus_have_final_cap(ARM64_HAS_LSUI))
+		r = __lsui_swap_desc(ptep, old, new);
+	else if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS))
 		r = __lse_swap_desc(ptep, old, new);
 	else
 		r = __llsc_swap_desc(ptep, old, new);
+8 -4
arch/arm64/kvm/hyp/include/hyp/switch.h
···
 
 static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
 {
-	u64 r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
+	u64 clr = MPAM2_EL2_EnMPAMSM;
+	u64 set = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
 
 	if (!system_supports_mpam())
 		return;
···
 		write_sysreg_s(MPAMHCR_EL2_TRAP_MPAMIDR_EL1, SYS_MPAMHCR_EL2);
 	} else {
 		/* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */
-		r |= MPAM2_EL2_TIDR;
+		set |= MPAM2_EL2_TIDR;
 	}
 
-	write_sysreg_s(r, SYS_MPAM2_EL2);
+	sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set);
 }
 
 static inline void __deactivate_traps_mpam(void)
 {
+	u64 clr = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1 | MPAM2_EL2_TIDR;
+	u64 set = MPAM2_EL2_EnMPAMSM;
+
 	if (!system_supports_mpam())
 		return;
 
-	write_sysreg_s(0, SYS_MPAM2_EL2);
+	sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set);
 
 	if (system_supports_mpam_hcr())
 		write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2);
+2 -2
arch/arm64/kvm/hyp/nvhe/hyp-init.S
···
 	ldr	x1, [x0, #NVHE_INIT_PGD_PA]
 	phys_to_ttbr x2, x1
 alternative_if ARM64_HAS_CNP
-	orr	x2, x2, #TTBR_CNP_BIT
+	orr	x2, x2, #TTBRx_EL1_CnP
 alternative_else_nop_endif
 	msr	ttbr0_el2, x2
 
···
 	/* Install the new pgtables */
 	phys_to_ttbr x5, x0
 alternative_if ARM64_HAS_CNP
-	orr	x5, x5, #TTBR_CNP_BIT
+	orr	x5, x5, #TTBRx_EL1_CnP
 alternative_else_nop_endif
 	msr	ttbr0_el2, x5
 
+1 -1
arch/arm64/kvm/hyp/nvhe/mm.c
···
 	 * https://lore.kernel.org/kvm/20221017115209.2099-1-will@kernel.org/T/#mf10dfbaf1eaef9274c581b81c53758918c1d0f03
 	 */
 	dsb(ishst);
-	__tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level);
+	__tlbi_level(vale2is, addr, level);
 	__tlbi_sync_s1ish_hyp();
 	isb();
 }
-2
arch/arm64/kvm/hyp/nvhe/tlb.c
···
 	 * Instead, we invalidate Stage-2 for this IPA, and the
 	 * whole of Stage-1. Weep...
 	 */
-	ipa >>= 12;
 	__tlbi_level(ipas2e1is, ipa, level);
 
 	/*
···
 	 * Instead, we invalidate Stage-2 for this IPA, and the
 	 * whole of Stage-1. Weep...
 	 */
-	ipa >>= 12;
 	__tlbi_level(ipas2e1, ipa, level);
 
 	/*
+2 -2
arch/arm64/kvm/hyp/pgtable.c
···
 
 		kvm_clear_pte(ctx->ptep);
 		dsb(ishst);
-		__tlbi_level(vae2is, __TLBI_VADDR(ctx->addr, 0), TLBI_TTL_UNKNOWN);
+		__tlbi_level(vae2is, ctx->addr, TLBI_TTL_UNKNOWN);
 	} else {
 		if (ctx->end - ctx->addr < granule)
 			return -EINVAL;
 
 		kvm_clear_pte(ctx->ptep);
 		dsb(ishst);
-		__tlbi_level(vale2is, __TLBI_VADDR(ctx->addr, 0), ctx->level);
+		__tlbi_level(vale2is, ctx->addr, ctx->level);
 		*unmapped += granule;
 	}
 
+16
arch/arm64/kvm/hyp/vhe/sysreg-sr.c
···
 }
 NOKPROBE_SYMBOL(sysreg_restore_guest_state_vhe);
 
+/*
+ * The _EL0 value was written by the host's context switch and belongs to the
+ * VMM. Copy this into the guest's _EL1 register.
+ */
+static inline void __mpam_guest_load(void)
+{
+	u64 mask = MPAM0_EL1_PARTID_D | MPAM0_EL1_PARTID_I | MPAM0_EL1_PMG_D | MPAM0_EL1_PMG_I;
+
+	if (system_supports_mpam()) {
+		u64 val = (read_sysreg_s(SYS_MPAM0_EL1) & mask) | MPAM1_EL1_MPAMEN;
+
+		write_sysreg_el1(val, SYS_MPAM1);
+	}
+}
+
 /**
  * __vcpu_load_switch_sysregs - Load guest system registers to the physical CPU
  *
···
 	 */
 	__sysreg32_restore_state(vcpu);
 	__sysreg_restore_user_state(guest_ctxt);
+	__mpam_guest_load();
 
 	if (unlikely(is_hyp_ctxt(vcpu))) {
 		__sysreg_restore_vel2_state(vcpu);
-2
arch/arm64/kvm/hyp/vhe/tlb.c
···
 	 * Instead, we invalidate Stage-2 for this IPA, and the
 	 * whole of Stage-1. Weep...
 	 */
-	ipa >>= 12;
 	__tlbi_level(ipas2e1is, ipa, level);
 
 	/*
···
 	 * Instead, we invalidate Stage-2 for this IPA, and the
 	 * whole of Stage-1. Weep...
 	 */
-	ipa >>= 12;
 	__tlbi_level(ipas2e1, ipa, level);
 
 	/*
+4 -1
arch/arm64/kvm/sys_regs.c
···
 		break;
 	case SYS_ID_AA64ISAR3_EL1:
 		val &= ID_AA64ISAR3_EL1_FPRCVT | ID_AA64ISAR3_EL1_LSFE |
-		       ID_AA64ISAR3_EL1_FAMINMAX;
+		       ID_AA64ISAR3_EL1_FAMINMAX | ID_AA64ISAR3_EL1_LSUI;
 		break;
 	case SYS_ID_AA64MMFR2_EL1:
 		val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK;
···
 				       ID_AA64ISAR2_EL1_GPA3)),
 	ID_WRITABLE(ID_AA64ISAR3_EL1, (ID_AA64ISAR3_EL1_FPRCVT |
 				       ID_AA64ISAR3_EL1_LSFE |
+				       ID_AA64ISAR3_EL1_LSUI |
 				       ID_AA64ISAR3_EL1_FAMINMAX)),
 	ID_UNALLOCATED(6,4),
 	ID_UNALLOCATED(6,5),
···
 
 	{ SYS_DESC(SYS_MPAM1_EL1), undef_access },
 	{ SYS_DESC(SYS_MPAM0_EL1), undef_access },
+	{ SYS_DESC(SYS_MPAMSM_EL1), undef_access },
+
 	{ SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 },
 	{ SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 },
 
+4 -4
arch/arm64/mm/context.c
···
 
 	/* Skip CNP for the reserved ASID */
 	if (system_supports_cnp() && asid)
-		ttbr0 |= TTBR_CNP_BIT;
+		ttbr0 |= TTBRx_EL1_CnP;
 
 	/* SW PAN needs a copy of the ASID in TTBR0 for entry */
 	if (IS_ENABLED(CONFIG_ARM64_SW_TTBR0_PAN))
-		ttbr0 |= FIELD_PREP(TTBR_ASID_MASK, asid);
+		ttbr0 |= FIELD_PREP(TTBRx_EL1_ASID_MASK, asid);
 
 	/* Set ASID in TTBR1 since TCR.A1 is set */
-	ttbr1 &= ~TTBR_ASID_MASK;
-	ttbr1 |= FIELD_PREP(TTBR_ASID_MASK, asid);
+	ttbr1 &= ~TTBRx_EL1_ASID_MASK;
+	ttbr1 |= FIELD_PREP(TTBRx_EL1_ASID_MASK, asid);
 
 	cpu_set_reserved_ttbr0_nosync();
 	write_sysreg(ttbr1, ttbr1_el1);
+8 -4
arch/arm64/mm/contpte.c
···
 	 */
 
 	if (!system_supports_bbml2_noabort())
-		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, 3,
+				  TLBF_NOWALKCACHE);
 
 	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
 }
···
 		 * See comment in __ptep_clear_flush_young(); same rationale for
 		 * eliding the trailing DSB applies here.
 		 */
-		__flush_tlb_range_nosync(vma->vm_mm, addr, end,
-					 PAGE_SIZE, true, 3);
+		__flush_tlb_range(vma, addr, end, PAGE_SIZE, 3,
+				  TLBF_NOWALKCACHE | TLBF_NOSYNC);
 	}
 
 	return young;
···
 		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
 
 		if (dirty)
-			local_flush_tlb_contpte(vma, start_addr);
+			__flush_tlb_range(vma, start_addr,
+					  start_addr + CONT_PTE_SIZE,
+					  PAGE_SIZE, 3,
+					  TLBF_NOWALKCACHE | TLBF_NOBROADCAST);
 	} else {
 		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
 		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+25 -5
arch/arm64/mm/fault.c
···
 *
 * Returns whether or not the PTE actually changed.
 */
-int __ptep_set_access_flags(struct vm_area_struct *vma,
-			    unsigned long address, pte_t *ptep,
-			    pte_t entry, int dirty)
+int __ptep_set_access_flags_anysz(struct vm_area_struct *vma,
+				  unsigned long address, pte_t *ptep,
+				  pte_t entry, int dirty, unsigned long pgsize)
 {
 	pteval_t old_pteval, pteval;
 	pte_t pte = __ptep_get(ptep);
+	int level;
 
 	if (pte_same(pte, entry))
 		return 0;
···
 	 * may still cause page faults and be invalidated via
 	 * flush_tlb_fix_spurious_fault().
 	 */
-	if (dirty)
-		local_flush_tlb_page(vma, address);
+	if (dirty) {
+		switch (pgsize) {
+		case PAGE_SIZE:
+			level = 3;
+			break;
+		case PMD_SIZE:
+			level = 2;
+			break;
+#ifndef __PAGETABLE_PMD_FOLDED
+		case PUD_SIZE:
+			level = 1;
+			break;
+#endif
+		default:
+			level = TLBI_TTL_UNKNOWN;
+			WARN_ON(1);
+		}
+
+		__flush_tlb_range(vma, address, address + pgsize, pgsize, level,
+				  TLBF_NOWALKCACHE | TLBF_NOBROADCAST);
+	}
 	return 1;
 }
+5 -5
arch/arm64/mm/hugetlbpage.c
···
 	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
 	unsigned long end = addr + (pgsize * ncontig);
 
-	__flush_hugetlb_tlb_range(&vma, addr, end, pgsize, true);
+	__flush_hugetlb_tlb_range(&vma, addr, end, pgsize, TLBF_NOWALKCACHE);
 	return orig_pte;
 }
 
···
 	if (mm == &init_mm)
 		flush_tlb_kernel_range(saddr, addr);
 	else
-		__flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
+		__flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, TLBF_NOWALKCACHE);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
···
 	pte_t orig_pte;
 
 	VM_WARN_ON(!pte_present(pte));
+	ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
 
 	if (!pte_cont(pte))
-		return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
-
-	ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
+		return __ptep_set_access_flags_anysz(vma, addr, ptep, pte,
+						     dirty, pgsize);
 
 	if (!__cont_access_flags_changed(ptep, pte, ncontig))
 		return 0;
+8 -1
arch/arm64/mm/init.c
··· 350 350 } 351 351 352 352 swiotlb_init(swiotlb, flags); 353 - swiotlb_update_mem_attributes(); 354 353 355 354 /* 356 355 * Check boundaries twice: Some fundamental inconsistencies can be ··· 374 375 */ 375 376 sysctl_overcommit_memory = OVERCOMMIT_ALWAYS; 376 377 } 378 + } 379 + 380 + bool page_alloc_available __ro_after_init; 381 + 382 + void __init mem_init(void) 383 + { 384 + page_alloc_available = true; 385 + swiotlb_update_mem_attributes(); 377 386 } 378 387 379 388 void free_initmem(void)
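The init.c hunk introduces `page_alloc_available` so later code can tell whether mem_init() has run before attempting a page-table allocation. A toy model of that boot-ordering gate (the names and the -EBUSY value here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define EX_EBUSY 16

static bool ex_page_alloc_available;

/* Callers that would need to allocate page tables bail out until the
 * page allocator is up, mirroring the WARN_ON(!page_alloc_available)
 * check added to split_kernel_leaf_mapping(). */
static int ex_try_pgtable_alloc(void)
{
	return ex_page_alloc_available ? 0 : -EX_EBUSY;
}

/* Analogue of mem_init() flipping the flag once, at boot. */
static void ex_mem_init(void)
{
	ex_page_alloc_available = true;
}
```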
+212 -74
arch/arm64/mm/mmu.c
··· 112 112 } 113 113 EXPORT_SYMBOL(phys_mem_access_prot); 114 114 115 - static phys_addr_t __init early_pgtable_alloc(enum pgtable_type pgtable_type) 115 + static phys_addr_t __init early_pgtable_alloc(enum pgtable_level pgtable_level) 116 116 { 117 117 phys_addr_t phys; 118 118 ··· 197 197 static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, 198 198 unsigned long end, phys_addr_t phys, 199 199 pgprot_t prot, 200 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 200 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 201 201 int flags) 202 202 { 203 203 unsigned long next; 204 204 pmd_t pmd = READ_ONCE(*pmdp); 205 205 pte_t *ptep; 206 206 207 - BUG_ON(pmd_sect(pmd)); 207 + BUG_ON(pmd_leaf(pmd)); 208 208 if (pmd_none(pmd)) { 209 209 pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF; 210 210 phys_addr_t pte_phys; ··· 212 212 if (flags & NO_EXEC_MAPPINGS) 213 213 pmdval |= PMD_TABLE_PXN; 214 214 BUG_ON(!pgtable_alloc); 215 - pte_phys = pgtable_alloc(TABLE_PTE); 215 + pte_phys = pgtable_alloc(PGTABLE_LEVEL_PTE); 216 216 if (pte_phys == INVALID_PHYS_ADDR) 217 217 return -ENOMEM; 218 218 ptep = pte_set_fixmap(pte_phys); ··· 252 252 253 253 static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end, 254 254 phys_addr_t phys, pgprot_t prot, 255 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags) 255 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), int flags) 256 256 { 257 257 unsigned long next; 258 258 ··· 292 292 static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, 293 293 unsigned long end, phys_addr_t phys, 294 294 pgprot_t prot, 295 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 295 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 296 296 int flags) 297 297 { 298 298 int ret; ··· 303 303 /* 304 304 * Check for initial section mappings in the pgd/pud. 
305 305 */ 306 - BUG_ON(pud_sect(pud)); 306 + BUG_ON(pud_leaf(pud)); 307 307 if (pud_none(pud)) { 308 308 pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF; 309 309 phys_addr_t pmd_phys; ··· 311 311 if (flags & NO_EXEC_MAPPINGS) 312 312 pudval |= PUD_TABLE_PXN; 313 313 BUG_ON(!pgtable_alloc); 314 - pmd_phys = pgtable_alloc(TABLE_PMD); 314 + pmd_phys = pgtable_alloc(PGTABLE_LEVEL_PMD); 315 315 if (pmd_phys == INVALID_PHYS_ADDR) 316 316 return -ENOMEM; 317 317 pmdp = pmd_set_fixmap(pmd_phys); ··· 349 349 350 350 static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end, 351 351 phys_addr_t phys, pgprot_t prot, 352 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 352 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 353 353 int flags) 354 354 { 355 355 int ret = 0; ··· 364 364 if (flags & NO_EXEC_MAPPINGS) 365 365 p4dval |= P4D_TABLE_PXN; 366 366 BUG_ON(!pgtable_alloc); 367 - pud_phys = pgtable_alloc(TABLE_PUD); 367 + pud_phys = pgtable_alloc(PGTABLE_LEVEL_PUD); 368 368 if (pud_phys == INVALID_PHYS_ADDR) 369 369 return -ENOMEM; 370 370 pudp = pud_set_fixmap(pud_phys); ··· 415 415 416 416 static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end, 417 417 phys_addr_t phys, pgprot_t prot, 418 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 418 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 419 419 int flags) 420 420 { 421 421 int ret; ··· 430 430 if (flags & NO_EXEC_MAPPINGS) 431 431 pgdval |= PGD_TABLE_PXN; 432 432 BUG_ON(!pgtable_alloc); 433 - p4d_phys = pgtable_alloc(TABLE_P4D); 433 + p4d_phys = pgtable_alloc(PGTABLE_LEVEL_P4D); 434 434 if (p4d_phys == INVALID_PHYS_ADDR) 435 435 return -ENOMEM; 436 436 p4dp = p4d_set_fixmap(p4d_phys); ··· 467 467 static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys, 468 468 unsigned long virt, phys_addr_t size, 469 469 pgprot_t prot, 470 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 470 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 471 471 int 
flags) 472 472 { 473 473 int ret; ··· 500 500 static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys, 501 501 unsigned long virt, phys_addr_t size, 502 502 pgprot_t prot, 503 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 503 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 504 504 int flags) 505 505 { 506 506 int ret; ··· 516 516 static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys, 517 517 unsigned long virt, phys_addr_t size, 518 518 pgprot_t prot, 519 - phys_addr_t (*pgtable_alloc)(enum pgtable_type), 519 + phys_addr_t (*pgtable_alloc)(enum pgtable_level), 520 520 int flags) 521 521 { 522 522 int ret; ··· 528 528 } 529 529 530 530 static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp, 531 - enum pgtable_type pgtable_type) 531 + enum pgtable_level pgtable_level) 532 532 { 533 533 /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */ 534 534 struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0); ··· 539 539 540 540 pa = page_to_phys(ptdesc_page(ptdesc)); 541 541 542 - switch (pgtable_type) { 543 - case TABLE_PTE: 542 + switch (pgtable_level) { 543 + case PGTABLE_LEVEL_PTE: 544 544 BUG_ON(!pagetable_pte_ctor(mm, ptdesc)); 545 545 break; 546 - case TABLE_PMD: 546 + case PGTABLE_LEVEL_PMD: 547 547 BUG_ON(!pagetable_pmd_ctor(mm, ptdesc)); 548 548 break; 549 - case TABLE_PUD: 549 + case PGTABLE_LEVEL_PUD: 550 550 pagetable_pud_ctor(ptdesc); 551 551 break; 552 - case TABLE_P4D: 552 + case PGTABLE_LEVEL_P4D: 553 553 pagetable_p4d_ctor(ptdesc); 554 + break; 555 + case PGTABLE_LEVEL_PGD: 556 + VM_WARN_ON(1); 554 557 break; 555 558 } 556 559 ··· 561 558 } 562 559 563 560 static phys_addr_t 564 - pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp) 561 + pgd_pgtable_alloc_init_mm_gfp(enum pgtable_level pgtable_level, gfp_t gfp) 565 562 { 566 - return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type); 563 + return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_level); 567 564 } 568 565 
569 566 static phys_addr_t __maybe_unused 570 - pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type) 567 + pgd_pgtable_alloc_init_mm(enum pgtable_level pgtable_level) 571 568 { 572 - return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL); 569 + return pgd_pgtable_alloc_init_mm_gfp(pgtable_level, GFP_PGTABLE_KERNEL); 573 570 } 574 571 575 572 static phys_addr_t 576 - pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type) 573 + pgd_pgtable_alloc_special_mm(enum pgtable_level pgtable_level) 577 574 { 578 - return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type); 575 + return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_level); 579 576 } 580 577 581 578 static void split_contpte(pte_t *ptep) ··· 596 593 pte_t *ptep; 597 594 int i; 598 595 599 - pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp); 596 + pte_phys = pgd_pgtable_alloc_init_mm_gfp(PGTABLE_LEVEL_PTE, gfp); 600 597 if (pte_phys == INVALID_PHYS_ADDR) 601 598 return -ENOMEM; 602 599 ptep = (pte_t *)phys_to_virt(pte_phys); ··· 605 602 tableprot |= PMD_TABLE_PXN; 606 603 607 604 prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE); 605 + if (!pmd_valid(pmd)) 606 + prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot))); 608 607 prot = __pgprot(pgprot_val(prot) & ~PTE_CONT); 609 608 if (to_cont) 610 609 prot = __pgprot(pgprot_val(prot) | PTE_CONT); ··· 643 638 pmd_t *pmdp; 644 639 int i; 645 640 646 - pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp); 641 + pmd_phys = pgd_pgtable_alloc_init_mm_gfp(PGTABLE_LEVEL_PMD, gfp); 647 642 if (pmd_phys == INVALID_PHYS_ADDR) 648 643 return -ENOMEM; 649 644 pmdp = (pmd_t *)phys_to_virt(pmd_phys); ··· 652 647 tableprot |= PUD_TABLE_PXN; 653 648 654 649 prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT); 650 + if (!pud_valid(pud)) 651 + prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot))); 655 652 prot = __pgprot(pgprot_val(prot) & ~PTE_CONT); 656 653 if (to_cont) 657 654 prot = 
__pgprot(pgprot_val(prot) | PTE_CONT); ··· 775 768 } 776 769 777 770 static DEFINE_MUTEX(pgtable_split_lock); 771 + static bool linear_map_requires_bbml2; 778 772 779 773 int split_kernel_leaf_mapping(unsigned long start, unsigned long end) 780 774 { 781 775 int ret; 782 776 783 777 /* 784 - * !BBML2_NOABORT systems should not be trying to change permissions on 785 - * anything that is not pte-mapped in the first place. Just return early 786 - * and let the permission change code raise a warning if not already 787 - * pte-mapped. 788 - */ 789 - if (!system_supports_bbml2_noabort()) 790 - return 0; 791 - 792 - /* 793 778 * If the region is within a pte-mapped area, there is no need to try to 794 779 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may 795 780 * change permissions from atomic context so for those cases (which are 796 781 * always pte-mapped), we must not go any further because taking the 797 - * mutex below may sleep. 782 + * mutex below may sleep. Do not call force_pte_mapping() here because 783 + * it could return a confusing result if called from a secondary cpu 784 + * prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us 785 + * what we need. 798 786 */ 799 - if (force_pte_mapping() || is_kfence_address((void *)start)) 787 + if (!linear_map_requires_bbml2 || is_kfence_address((void *)start)) 800 788 return 0; 789 + 790 + if (!system_supports_bbml2_noabort()) { 791 + /* 792 + * !BBML2_NOABORT systems should not be trying to change 793 + * permissions on anything that is not pte-mapped in the first 794 + * place. Just return early and let the permission change code 795 + * raise a warning if not already pte-mapped. 796 + */ 797 + if (system_capabilities_finalized()) 798 + return 0; 799 + 800 + /* 801 + * Boot-time: split_kernel_leaf_mapping_locked() allocates from 802 + * page allocator. Can't split until it's available. 
803 + */ 804 + if (WARN_ON(!page_alloc_available)) 805 + return -EBUSY; 806 + 807 + /* 808 + * Boot-time: Started secondary cpus but don't know if they 809 + * support BBML2_NOABORT yet. Can't allow splitting in this 810 + * window in case they don't. 811 + */ 812 + if (WARN_ON(num_online_cpus() > 1)) 813 + return -EBUSY; 814 + } 801 815 802 816 /* 803 817 * Ensure start and end are at least page-aligned since this is the ··· 918 890 919 891 return ret; 920 892 } 921 - 922 - static bool linear_map_requires_bbml2 __initdata; 923 893 924 894 u32 idmap_kpti_bbml2_flag; 925 895 ··· 1252 1226 1253 1227 static phys_addr_t kpti_ng_temp_alloc __initdata; 1254 1228 1255 - static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type) 1229 + static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_level pgtable_level) 1256 1230 { 1257 1231 kpti_ng_temp_alloc -= PAGE_SIZE; 1258 1232 return kpti_ng_temp_alloc; ··· 1484 1458 1485 1459 WARN_ON(!pte_present(pte)); 1486 1460 __pte_clear(&init_mm, addr, ptep); 1487 - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1488 - if (free_mapped) 1461 + if (free_mapped) { 1462 + /* CONT blocks are not supported in the vmemmap */ 1463 + WARN_ON(pte_cont(pte)); 1464 + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1489 1465 free_hotplug_page_range(pte_page(pte), 1490 1466 PAGE_SIZE, altmap); 1467 + } 1468 + /* unmap_hotplug_range() flushes TLB for !free_mapped */ 1491 1469 } while (addr += PAGE_SIZE, addr < end); 1492 1470 } 1493 1471 ··· 1510 1480 continue; 1511 1481 1512 1482 WARN_ON(!pmd_present(pmd)); 1513 - if (pmd_sect(pmd)) { 1483 + if (pmd_leaf(pmd)) { 1514 1484 pmd_clear(pmdp); 1515 - 1516 - /* 1517 - * One TLBI should be sufficient here as the PMD_SIZE 1518 - * range is mapped with a single block entry. 
1519 - */ 1520 - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1521 - if (free_mapped) 1485 + if (free_mapped) { 1486 + /* CONT blocks are not supported in the vmemmap */ 1487 + WARN_ON(pmd_cont(pmd)); 1488 + flush_tlb_kernel_range(addr, addr + PMD_SIZE); 1522 1489 free_hotplug_page_range(pmd_page(pmd), 1523 1490 PMD_SIZE, altmap); 1491 + } 1492 + /* unmap_hotplug_range() flushes TLB for !free_mapped */ 1524 1493 continue; 1525 1494 } 1526 1495 WARN_ON(!pmd_table(pmd)); ··· 1542 1513 continue; 1543 1514 1544 1515 WARN_ON(!pud_present(pud)); 1545 - if (pud_sect(pud)) { 1516 + if (pud_leaf(pud)) { 1546 1517 pud_clear(pudp); 1547 - 1548 - /* 1549 - * One TLBI should be sufficient here as the PUD_SIZE 1550 - * range is mapped with a single block entry. 1551 - */ 1552 - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 1553 - if (free_mapped) 1518 + if (free_mapped) { 1519 + flush_tlb_kernel_range(addr, addr + PUD_SIZE); 1554 1520 free_hotplug_page_range(pud_page(pud), 1555 1521 PUD_SIZE, altmap); 1522 + } 1523 + /* unmap_hotplug_range() flushes TLB for !free_mapped */ 1556 1524 continue; 1557 1525 } 1558 1526 WARN_ON(!pud_table(pud)); ··· 1579 1553 static void unmap_hotplug_range(unsigned long addr, unsigned long end, 1580 1554 bool free_mapped, struct vmem_altmap *altmap) 1581 1555 { 1556 + unsigned long start = addr; 1582 1557 unsigned long next; 1583 1558 pgd_t *pgdp, pgd; 1584 1559 ··· 1601 1574 WARN_ON(!pgd_present(pgd)); 1602 1575 unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap); 1603 1576 } while (addr = next, addr < end); 1577 + 1578 + if (!free_mapped) 1579 + flush_tlb_kernel_range(start, end); 1604 1580 } 1605 1581 1606 1582 static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, ··· 1657 1627 if (pmd_none(pmd)) 1658 1628 continue; 1659 1629 1660 - WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); 1630 + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd)); 1661 1631 free_empty_pte_table(pmdp, addr, next, floor, ceiling); 1662 
1632 } while (addr = next, addr < end); 1663 1633 ··· 1697 1667 if (pud_none(pud)) 1698 1668 continue; 1699 1669 1700 - WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud)); 1670 + WARN_ON(!pud_present(pud) || !pud_table(pud)); 1701 1671 free_empty_pmd_table(pudp, addr, next, floor, ceiling); 1702 1672 } while (addr = next, addr < end); 1703 1673 ··· 1793 1763 { 1794 1764 vmemmap_verify((pte_t *)pmdp, node, addr, next); 1795 1765 1796 - return pmd_sect(READ_ONCE(*pmdp)); 1766 + return pmd_leaf(READ_ONCE(*pmdp)); 1797 1767 } 1798 1768 1799 1769 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, ··· 1857 1827 1858 1828 int pud_clear_huge(pud_t *pudp) 1859 1829 { 1860 - if (!pud_sect(READ_ONCE(*pudp))) 1830 + if (!pud_leaf(READ_ONCE(*pudp))) 1861 1831 return 0; 1862 1832 pud_clear(pudp); 1863 1833 return 1; ··· 1865 1835 1866 1836 int pmd_clear_huge(pmd_t *pmdp) 1867 1837 { 1868 - if (!pmd_sect(READ_ONCE(*pmdp))) 1838 + if (!pmd_leaf(READ_ONCE(*pmdp))) 1869 1839 return 0; 1870 1840 pmd_clear(pmdp); 1871 1841 return 1; ··· 2040 2010 __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); 2041 2011 } 2042 2012 2013 + 2014 + static bool addr_splits_kernel_leaf(unsigned long addr) 2015 + { 2016 + pgd_t *pgdp, pgd; 2017 + p4d_t *p4dp, p4d; 2018 + pud_t *pudp, pud; 2019 + pmd_t *pmdp, pmd; 2020 + pte_t *ptep, pte; 2021 + 2022 + /* 2023 + * If the given address points at the start address of 2024 + * a possible leaf, we certainly won't split. Otherwise, 2025 + * check if we would actually split a leaf by traversing 2026 + * the page tables further.
2027 + */ 2028 + if (IS_ALIGNED(addr, PGDIR_SIZE)) 2029 + return false; 2030 + 2031 + pgdp = pgd_offset_k(addr); 2032 + pgd = pgdp_get(pgdp); 2033 + if (!pgd_present(pgd)) 2034 + return false; 2035 + 2036 + if (IS_ALIGNED(addr, P4D_SIZE)) 2037 + return false; 2038 + 2039 + p4dp = p4d_offset(pgdp, addr); 2040 + p4d = p4dp_get(p4dp); 2041 + if (!p4d_present(p4d)) 2042 + return false; 2043 + 2044 + if (IS_ALIGNED(addr, PUD_SIZE)) 2045 + return false; 2046 + 2047 + pudp = pud_offset(p4dp, addr); 2048 + pud = pudp_get(pudp); 2049 + if (!pud_present(pud)) 2050 + return false; 2051 + 2052 + if (pud_leaf(pud)) 2053 + return true; 2054 + 2055 + if (IS_ALIGNED(addr, CONT_PMD_SIZE)) 2056 + return false; 2057 + 2058 + pmdp = pmd_offset(pudp, addr); 2059 + pmd = pmdp_get(pmdp); 2060 + if (!pmd_present(pmd)) 2061 + return false; 2062 + 2063 + if (pmd_cont(pmd)) 2064 + return true; 2065 + 2066 + if (IS_ALIGNED(addr, PMD_SIZE)) 2067 + return false; 2068 + 2069 + if (pmd_leaf(pmd)) 2070 + return true; 2071 + 2072 + if (IS_ALIGNED(addr, CONT_PTE_SIZE)) 2073 + return false; 2074 + 2075 + ptep = pte_offset_kernel(pmdp, addr); 2076 + pte = __ptep_get(ptep); 2077 + if (!pte_present(pte)) 2078 + return false; 2079 + 2080 + if (pte_cont(pte)) 2081 + return true; 2082 + 2083 + return !IS_ALIGNED(addr, PAGE_SIZE); 2084 + } 2085 + 2086 + static bool can_unmap_without_split(unsigned long pfn, unsigned long nr_pages) 2087 + { 2088 + unsigned long phys_start, phys_end, start, end; 2089 + 2090 + phys_start = PFN_PHYS(pfn); 2091 + phys_end = phys_start + nr_pages * PAGE_SIZE; 2092 + 2093 + /* PFN range's linear map edges are leaf entry aligned */ 2094 + start = __phys_to_virt(phys_start); 2095 + end = __phys_to_virt(phys_end); 2096 + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) { 2097 + pr_warn("[%lx %lx] splits a leaf entry in linear map\n", 2098 + phys_start, phys_end); 2099 + return false; 2100 + } 2101 + 2102 + /* PFN range's vmemmap edges are leaf entry aligned */ 2103 
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)); 2104 + start = (unsigned long)pfn_to_page(pfn); 2105 + end = (unsigned long)pfn_to_page(pfn + nr_pages); 2106 + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) { 2107 + pr_warn("[%lx %lx] splits a leaf entry in vmemmap\n", 2108 + phys_start, phys_end); 2109 + return false; 2110 + } 2111 + return true; 2112 + } 2113 + 2043 2114 /* 2044 2115 * This memory hotplug notifier helps prevent boot memory from being 2045 2116 * inadvertently removed as it blocks pfn range offlining process in ··· 2149 2018 * In future if and when boot memory could be removed, this notifier 2150 2019 * should be dropped and free_hotplug_page_range() should handle any 2151 2020 * reserved pages allocated during boot. 2021 + * 2022 + * This also blocks any memory remove that would have caused a split 2023 + * in leaf entry in kernel linear or vmemmap mapping. 2152 2024 */ 2153 - static int prevent_bootmem_remove_notifier(struct notifier_block *nb, 2025 + static int prevent_memory_remove_notifier(struct notifier_block *nb, 2154 2026 unsigned long action, void *data) 2155 2027 { 2156 2028 struct mem_section *ms; ··· 2199 2065 return NOTIFY_DONE; 2200 2066 } 2201 2067 } 2068 + 2069 + if (!can_unmap_without_split(pfn, arg->nr_pages)) 2070 + return NOTIFY_BAD; 2071 + 2202 2072 return NOTIFY_OK; 2203 2073 } 2204 2074 2205 - static struct notifier_block prevent_bootmem_remove_nb = { 2206 - .notifier_call = prevent_bootmem_remove_notifier, 2075 + static struct notifier_block prevent_memory_remove_nb = { 2076 + .notifier_call = prevent_memory_remove_notifier, 2207 2077 }; 2208 2078 2209 2079 /* ··· 2257 2119 } 2258 2120 } 2259 2121 2260 - static int __init prevent_bootmem_remove_init(void) 2122 + static int __init prevent_memory_remove_init(void) 2261 2123 { 2262 2124 int ret = 0; 2263 2125 ··· 2265 2127 return ret; 2266 2128 2267 2129 validate_bootmem_online(); 2268 - ret = register_memory_notifier(&prevent_bootmem_remove_nb); 
2130 + ret = register_memory_notifier(&prevent_memory_remove_nb); 2269 2131 if (ret) 2270 2132 pr_err("%s: Notifier registration failed %d\n", __func__, ret); 2271 2133 2272 2134 return ret; 2273 2135 } 2274 - early_initcall(prevent_bootmem_remove_init); 2136 + early_initcall(prevent_memory_remove_init); 2275 2137 #endif 2276 2138 2277 2139 pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr, ··· 2287 2149 */ 2288 2150 if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte)) 2289 2151 __flush_tlb_range(vma, addr, nr * PAGE_SIZE, 2290 - PAGE_SIZE, true, 3); 2152 + PAGE_SIZE, 3, TLBF_NOWALKCACHE); 2291 2153 } 2292 2154 2293 2155 return pte; ··· 2326 2188 phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(pgdp)); 2327 2189 2328 2190 if (cnp) 2329 - ttbr1 |= TTBR_CNP_BIT; 2191 + ttbr1 |= TTBRx_EL1_CnP; 2330 2192 2331 2193 replace_phys = (void *)__pa_symbol(idmap_cpu_replace_ttbr1); 2332 2194
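The addr_splits_kernel_leaf() routine added in mmu.c short-circuits on alignment: an address aligned to a level's mapping size cannot fall strictly inside a leaf at that level, so only misaligned addresses need the page-table walk to continue. The alignment test itself reduces to a mask check; a minimal sketch with 4K-granule sizes (all names and values illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define EX_PMD_SIZE	 (2UL << 20)	/* 2M block */
#define EX_CONT_PTE_SIZE (16 * 4096UL)	/* 64K contiguous PTE run */

/* Power-of-two alignment check, as IS_ALIGNED() does in the kernel. */
static bool ex_is_aligned(unsigned long addr, unsigned long size)
{
	return (addr & (size - 1)) == 0;
}

/* An edge address can only split a (hypothetical) 2M block mapping if
 * it falls strictly inside it. */
static bool ex_splits_pmd_block(unsigned long addr)
{
	return !ex_is_aligned(addr, EX_PMD_SIZE);
}
```

The real function applies the same idea level by level (PGD, P4D, PUD, contiguous PMD, PMD, contiguous PTE, PTE), returning early as soon as alignment rules a split out.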
+28 -22
arch/arm64/mm/pageattr.c
··· 25 25 { 26 26 struct page_change_data *masks = walk->private; 27 27 28 + /* 29 + * Some users clear and set bits which alias each other (e.g. PTE_NG and 30 + * PTE_PRESENT_INVALID). It is therefore important that we always clear 31 + * first then set. 32 + */ 28 33 val &= ~(pgprot_val(masks->clear_mask)); 29 34 val |= (pgprot_val(masks->set_mask)); 30 35 ··· 41 36 { 42 37 pud_t val = pudp_get(pud); 43 38 44 - if (pud_sect(val)) { 39 + if (pud_leaf(val)) { 45 40 if (WARN_ON_ONCE((next - addr) != PUD_SIZE)) 46 41 return -EINVAL; 47 42 val = __pud(set_pageattr_masks(pud_val(val), walk)); ··· 57 52 { 58 53 pmd_t val = pmdp_get(pmd); 59 54 60 - if (pmd_sect(val)) { 55 + if (pmd_leaf(val)) { 61 56 if (WARN_ON_ONCE((next - addr) != PMD_SIZE)) 62 57 return -EINVAL; 63 58 val = __pmd(set_pageattr_masks(pmd_val(val), walk)); ··· 137 132 ret = update_range_prot(start, size, set_mask, clear_mask); 138 133 139 134 /* 140 - * If the memory is being made valid without changing any other bits 141 - * then a TLBI isn't required as a non-valid entry cannot be cached in 142 - * the TLB. 135 + * If the memory is being switched from present-invalid to valid without 136 + * changing any other bits then a TLBI isn't required as a non-valid 137 + * entry cannot be cached in the TLB. 
143 138 */ 144 - if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask)) 139 + if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL || 140 + pgprot_val(clear_mask) != PTE_PRESENT_INVALID) 145 141 flush_tlb_kernel_range(start, start + size); 146 142 return ret; 147 143 } ··· 243 237 { 244 238 if (enable) 245 239 return __change_memory_common(addr, PAGE_SIZE * numpages, 246 - __pgprot(PTE_VALID), 247 - __pgprot(0)); 240 + __pgprot(PTE_PRESENT_VALID_KERNEL), 241 + __pgprot(PTE_PRESENT_INVALID)); 248 242 else 249 243 return __change_memory_common(addr, PAGE_SIZE * numpages, 250 - __pgprot(0), 251 - __pgprot(PTE_VALID)); 244 + __pgprot(PTE_PRESENT_INVALID), 245 + __pgprot(PTE_PRESENT_VALID_KERNEL)); 252 246 } 253 247 254 248 int set_direct_map_invalid_noflush(struct page *page) 255 249 { 256 - pgprot_t clear_mask = __pgprot(PTE_VALID); 257 - pgprot_t set_mask = __pgprot(0); 250 + pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL); 251 + pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID); 258 252 259 253 if (!can_set_direct_map()) 260 254 return 0; ··· 265 259 266 260 int set_direct_map_default_noflush(struct page *page) 267 261 { 268 - pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE); 269 - pgprot_t clear_mask = __pgprot(PTE_RDONLY); 262 + pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE); 263 + pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY); 270 264 271 265 if (!can_set_direct_map()) 272 266 return 0; ··· 302 296 * entries or Synchronous External Aborts caused by RIPAS_EMPTY 303 297 */ 304 298 ret = __change_memory_common(addr, PAGE_SIZE * numpages, 305 - __pgprot(set_prot), 306 - __pgprot(clear_prot | PTE_VALID)); 299 + __pgprot(set_prot | PTE_PRESENT_INVALID), 300 + __pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL)); 307 301 308 302 if (ret) 309 303 return ret; ··· 317 311 return ret; 318 312 319 313 return __change_memory_common(addr, PAGE_SIZE * numpages, 320 - __pgprot(PTE_VALID), 321 - __pgprot(0)); 314 + 
__pgprot(PTE_PRESENT_VALID_KERNEL), 315 + __pgprot(PTE_PRESENT_INVALID)); 322 316 } 323 317 324 318 static int realm_set_memory_encrypted(unsigned long addr, int numpages) ··· 410 404 pud = READ_ONCE(*pudp); 411 405 if (pud_none(pud)) 412 406 return false; 413 - if (pud_sect(pud)) 414 - return true; 407 + if (pud_leaf(pud)) 408 + return pud_valid(pud); 415 409 416 410 pmdp = pmd_offset(pudp, addr); 417 411 pmd = READ_ONCE(*pmdp); 418 412 if (pmd_none(pmd)) 419 413 return false; 420 - if (pmd_sect(pmd)) 421 - return true; 414 + if (pmd_leaf(pmd)) 415 + return pmd_valid(pmd); 422 416 423 417 ptep = pte_offset_kernel(pmdp, addr); 424 418 return pte_valid(__ptep_get(ptep));
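The comment added at the top of pageattr.c warns that clear and set masks may alias (e.g. PTE_NG and PTE_PRESENT_INVALID), so the clear must be applied before the set. A sketch showing why the order matters (the bit position is made up; only the ordering is the point):

```c
#include <assert.h>

/* Made-up aliasing bit: imagine two pgprot flags sharing bit 11. */
#define EX_ALIAS_BIT	(1UL << 11)

/* Mirrors set_pageattr_masks(): clear first, then set, so a bit named
 * in both masks ends up set rather than cleared. */
static unsigned long ex_apply_masks(unsigned long val,
				    unsigned long clear, unsigned long set)
{
	val &= ~clear;	/* clear first ... */
	val |= set;	/* ... so an aliased bit in 'set' survives */
	return val;
}
```

With the opposite order, a bit present in both masks would silently end up cleared.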
+7 -35
arch/arm64/mm/trans_pgd.c
··· 31 31 return info->trans_alloc_page(info->trans_alloc_arg); 32 32 } 33 33 34 - static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr) 35 - { 36 - pte_t pte = __ptep_get(src_ptep); 37 - 38 - if (pte_valid(pte)) { 39 - /* 40 - * Resume will overwrite areas that may be marked 41 - * read only (code, rodata). Clear the RDONLY bit from 42 - * the temporary mappings we use during restore. 43 - */ 44 - __set_pte(dst_ptep, pte_mkwrite_novma(pte)); 45 - } else if (!pte_none(pte)) { 46 - /* 47 - * debug_pagealloc will removed the PTE_VALID bit if 48 - * the page isn't in use by the resume kernel. It may have 49 - * been in use by the original kernel, in which case we need 50 - * to put it back in our copy to do the restore. 51 - * 52 - * Other cases include kfence / vmalloc / memfd_secret which 53 - * may call `set_direct_map_invalid_noflush()`. 54 - * 55 - * Before marking this entry valid, check the pfn should 56 - * be mapped. 57 - */ 58 - BUG_ON(!pfn_valid(pte_pfn(pte))); 59 - 60 - __set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte))); 61 - } 62 - } 63 - 64 34 static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp, 65 35 pmd_t *src_pmdp, unsigned long start, unsigned long end) 66 36 { ··· 46 76 47 77 src_ptep = pte_offset_kernel(src_pmdp, start); 48 78 do { 49 - _copy_pte(dst_ptep, src_ptep, addr); 79 + pte_t pte = __ptep_get(src_ptep); 80 + 81 + if (pte_none(pte)) 82 + continue; 83 + __set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte))); 50 84 } while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end); 51 85 52 86 return 0; ··· 83 109 if (copy_pte(info, dst_pmdp, src_pmdp, addr, next)) 84 110 return -ENOMEM; 85 111 } else { 86 - set_pmd(dst_pmdp, 87 - __pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY)); 112 + set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd))); 88 113 } 89 114 } while (dst_pmdp++, src_pmdp++, addr = next, addr != end); 90 115 ··· 118 145 if (copy_pmd(info, dst_pudp, src_pudp, addr, next)) 119 146 return -ENOMEM; 
120 147 } else { 121 - set_pud(dst_pudp, 122 - __pud(pud_val(pud) & ~PUD_SECT_RDONLY)); 148 + set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud))); 123 149 } 124 150 } while (dst_pudp++, src_pudp++, addr = next, addr != end); 125 151
+7 -1
arch/arm64/tools/Makefile
··· 3 3 gen := arch/$(ARCH)/include/generated 4 4 kapi := $(gen)/asm 5 5 6 - kapisyshdr-y := cpucap-defs.h sysreg-defs.h 6 + kapisyshdr-y := cpucap-defs.h kernel-hwcap.h sysreg-defs.h 7 7 8 8 kapi-hdrs-y := $(addprefix $(kapi)/, $(kapisyshdr-y)) 9 9 ··· 18 18 quiet_cmd_gen_cpucaps = GEN $@ 19 19 cmd_gen_cpucaps = mkdir -p $(dir $@); $(AWK) -f $(real-prereqs) > $@ 20 20 21 + quiet_cmd_gen_kernel_hwcap = GEN $@ 22 + cmd_gen_kernel_hwcap = mkdir -p $(dir $@); /bin/sh -e $(real-prereqs) > $@ 23 + 21 24 quiet_cmd_gen_sysreg = GEN $@ 22 25 cmd_gen_sysreg = mkdir -p $(dir $@); $(AWK) -f $(real-prereqs) > $@ 23 26 24 27 $(kapi)/cpucap-defs.h: $(src)/gen-cpucaps.awk $(src)/cpucaps FORCE 25 28 $(call if_changed,gen_cpucaps) 29 + 30 + $(kapi)/kernel-hwcap.h: $(src)/gen-kernel-hwcaps.sh $(srctree)/arch/arm64/include/uapi/asm/hwcap.h FORCE 31 + $(call if_changed,gen_kernel_hwcap) 26 32 27 33 $(kapi)/sysreg-defs.h: $(src)/gen-sysreg.awk $(src)/sysreg FORCE 28 34 $(call if_changed,gen_sysreg)
+1
arch/arm64/tools/cpucaps
··· 48 48 HAS_LSE_ATOMICS 49 49 HAS_LS64 50 50 HAS_LS64_V 51 + HAS_LSUI 51 52 HAS_MOPS 52 53 HAS_NESTED_VIRT 53 54 HAS_BBML2_NOABORT
+23
arch/arm64/tools/gen-kernel-hwcaps.sh
··· 1 + #!/bin/sh -e 2 + # SPDX-License-Identifier: GPL-2.0 3 + # 4 + # gen-kernel-hwcaps.sh - Generate kernel internal hwcap.h definitions 5 + # 6 + # Copyright 2026 Arm, Ltd. 7 + 8 + if [ "$1" = "" ]; then 9 + echo "$0: no filename specified" 10 + exit 1 11 + fi 12 + 13 + echo "#ifndef __ASM_KERNEL_HWCAPS_H" 14 + echo "#define __ASM_KERNEL_HWCAPS_H" 15 + echo "" 16 + echo "/* Generated file - do not edit */" 17 + echo "" 18 + 19 + grep -E '^#define HWCAP[0-9]*_[A-Z0-9_]+' $1 | \ 20 + sed 's/.*HWCAP\([0-9]*\)_\([A-Z0-9_]\+\).*/#define KERNEL_HWCAP_\2\t__khwcap\1_feature(\2)/' 21 + 22 + echo "" 23 + echo "#endif /* __ASM_KERNEL_HWCAPS_H */"
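The grep|sed pipeline above maps each uapi `HWCAP` define onto a `KERNEL_HWCAP_` define, carrying the HWCAP word number through to a `__khwcapN_feature()` wrapper (the bare `HWCAP_` defines get `__khwcap_feature()`). A rough C re-implementation of the transform, purely for illustration (the kernel build uses the shell script as-is):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Transform "#define HWCAP2_SME (1 << 23)" into
 * "#define KERNEL_HWCAP_SME\t__khwcap2_feature(SME)". Returns 0 on
 * success, -1 if the line is not a HWCAP define. */
static int ex_gen_kernel_hwcap(const char *in, char *out, size_t outsz)
{
	const char *p = in;
	char num[8], name[64];
	size_t i = 0;

	if (strncmp(p, "#define HWCAP", 13) != 0)
		return -1;
	p += 13;
	/* Optional word number, e.g. the "2" in HWCAP2_SME. */
	while (*p >= '0' && *p <= '9' && i < sizeof(num) - 1)
		num[i++] = *p++;
	num[i] = '\0';
	if (*p++ != '_')
		return -1;
	/* The capability name: [A-Z0-9_]+ as in the sed expression. */
	for (i = 0; i < sizeof(name) - 1 &&
		    (*p == '_' || (*p >= 'A' && *p <= 'Z') ||
		     (*p >= '0' && *p <= '9')); )
		name[i++] = *p++;
	name[i] = '\0';
	if (!name[0])
		return -1;
	snprintf(out, outsz, "#define KERNEL_HWCAP_%s\t__khwcap%s_feature(%s)",
		 name, num, name);
	return 0;
}
```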
+32 -4
arch/arm64/tools/sysreg
··· 1496 1496 0b0000 NI 1497 1497 0b0001 IMP 1498 1498 0b0010 BFSCALE 1499 + 0b0011 B16MM 1499 1500 EndEnum 1500 1501 UnsignedEnum 23:20 BF16 1501 1502 0b0000 NI ··· 1523 1522 0b0001 SVE2 1524 1523 0b0010 SVE2p1 1525 1524 0b0011 SVE2p2 1525 + 0b0100 SVE2p3 1526 1526 EndEnum 1527 1527 EndSysreg 1528 1528 ··· 1532 1530 0b0 NI 1533 1531 0b1 IMP 1534 1532 EndEnum 1535 - Res0 62:61 1533 + Res0 62 1534 + UnsignedEnum 61 LUT6 1535 + 0b0 NI 1536 + 0b1 IMP 1537 + EndEnum 1536 1538 UnsignedEnum 60 LUTv2 1537 1539 0b0 NI 1538 1540 0b1 IMP ··· 1546 1540 0b0001 SME2 1547 1541 0b0010 SME2p1 1548 1542 0b0011 SME2p2 1543 + 0b0100 SME2p3 1549 1544 EndEnum 1550 1545 UnsignedEnum 55:52 I16I64 1551 1546 0b0000 NI ··· 1661 1654 0b0 NI 1662 1655 0b1 IMP 1663 1656 EndEnum 1664 - Res0 25:2 1657 + Res0 25:16 1658 + UnsignedEnum 15 F16MM2 1659 + 0b0 NI 1660 + 0b1 IMP 1661 + EndEnum 1662 + Res0 14:8 1663 + Raz 7:2 1665 1664 UnsignedEnum 1 F8E4M3 1666 1665 0b0 NI 1667 1666 0b1 IMP ··· 1848 1835 UnsignedEnum 51:48 FHM 1849 1836 0b0000 NI 1850 1837 0b0001 IMP 1838 + 0b0010 F16F32DOT 1839 + 0b0011 F16F32MM 1851 1840 EndEnum 1852 1841 UnsignedEnum 47:44 DP 1853 1842 0b0000 NI ··· 1991 1976 UnsignedEnum 59:56 LUT 1992 1977 0b0000 NI 1993 1978 0b0001 IMP 1979 + 0b0010 LUT6 1994 1980 EndEnum 1995 1981 UnsignedEnum 55:52 CSSC 1996 1982 0b0000 NI ··· 3671 3655 EndSysreg 3672 3656 3673 3657 Sysreg SMIDR_EL1 3 1 0 0 6 3674 - Res0 63:32 3658 + Res0 63:60 3659 + Field 59:56 NSMC 3660 + Field 55:52 HIP 3661 + Field 51:32 AFFINITY2 3675 3662 Field 31:24 IMPLEMENTER 3676 3663 Field 23:16 REVISION 3677 3664 Field 15 SMPS 3678 - Res0 14:12 3665 + Field 14:13 SH 3666 + Res0 12 3679 3667 Field 11:0 AFFINITY 3680 3668 EndSysreg 3681 3669 ··· 5190 5170 Field 39:32 PMG_I 5191 5171 Field 31:16 PARTID_D 5192 5172 Field 15:0 PARTID_I 5173 + EndSysreg 5174 + 5175 + Sysreg MPAMSM_EL1 3 0 10 5 3 5176 + Res0 63:48 5177 + Field 47:40 PMG_D 5178 + Res0 39:32 5179 + Field 31:16 PARTID_D 5180 + Res0 15:0 5193 5181 EndSysreg 
5194 5182 5195 5183 Sysreg ISR_EL1 3 0 12 1 0
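The sysreg file is a data description of register layouts; gen-sysreg.awk turns each field into mask and shift definitions that C code then reads with FIELD_GET(). A stand-alone sketch of that extraction for the new ID_AA64ZFR0_EL1.SVEver value 0b0100 (SVE2p3) - the `EX_` macros are simplified stand-ins for the kernel's GENMASK_ULL()/FIELD_GET():

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for GENMASK_ULL() and FIELD_GET(). */
#define EX_GENMASK(h, l) \
	((~0ULL >> (63 - (h))) & (~0ULL << (l)))
/* Dividing by the mask's lowest set bit shifts the field down. */
#define EX_FIELD_GET(mask, reg) \
	(((reg) & (mask)) / ((mask) & (0 - (mask))))

/* SVEver lives in bits [3:0] of ID_AA64ZFR0_EL1 per the table above. */
#define EX_ZFR0_SVEVER	EX_GENMASK(3, 0)

static unsigned int ex_sve_ver(uint64_t zfr0)
{
	return (unsigned int)EX_FIELD_GET(EX_ZFR0_SVEVER, zfr0);
}
```

Generating the masks from the table keeps the C side free of hand-maintained bit positions, which is why new fields (LUT6, F16MM2, the SMIDR_EL1 affinity fields) only need table edits.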
+1 -1
drivers/acpi/arm64/agdi.c
··· 36 36 37 37 err = sdei_event_register(adata->sdei_event, agdi_sdei_handler, pdev); 38 38 if (err) { 39 - dev_err(&pdev->dev, "Failed to register for SDEI event %d", 39 + dev_err(&pdev->dev, "Failed to register for SDEI event %d\n", 40 40 adata->sdei_event); 41 41 return err; 42 42 }
+8 -1
drivers/resctrl/Kconfig
··· 1 1 menuconfig ARM64_MPAM_DRIVER 2 2 bool "MPAM driver" 3 - depends on ARM64 && ARM64_MPAM && EXPERT 3 + depends on ARM64 && ARM64_MPAM 4 + select ACPI_MPAM if ACPI 4 5 help 5 6 Memory System Resource Partitioning and Monitoring (MPAM) driver for 6 7 System IP, e.g. caches and memory controllers. ··· 23 22 If unsure, say N. 24 23 25 24 endif 25 + 26 + config ARM64_MPAM_RESCTRL_FS 27 + bool 28 + default y if ARM64_MPAM_DRIVER && RESCTRL_FS 29 + select RESCTRL_RMID_DEPENDS_ON_CLOSID 30 + select RESCTRL_ASSIGN_FIXED
+1
drivers/resctrl/Makefile
··· 1 1 obj-$(CONFIG_ARM64_MPAM_DRIVER) += mpam.o 2 2 mpam-y += mpam_devices.o 3 + mpam-$(CONFIG_ARM64_MPAM_RESCTRL_FS) += mpam_resctrl.o 3 4 4 5 ccflags-$(CONFIG_ARM64_MPAM_DRIVER_DEBUG) += -DDEBUG
+267 -38
drivers/resctrl/mpam_devices.c
··· 29 29 30 30 #include "mpam_internal.h" 31 31 32 - DEFINE_STATIC_KEY_FALSE(mpam_enabled); /* This moves to arch code */ 32 + /* Values for the T241 errata workaround */ 33 + #define T241_CHIPS_MAX 4 34 + #define T241_CHIP_NSLICES 12 35 + #define T241_SPARE_REG0_OFF 0x1b0000 36 + #define T241_SPARE_REG1_OFF 0x1c0000 37 + #define T241_CHIP_ID(phys) FIELD_GET(GENMASK_ULL(44, 43), phys) 38 + #define T241_SHADOW_REG_OFF(sidx, pid) (0x360048 + (sidx) * 0x10000 + (pid) * 8) 39 + #define SMCCC_SOC_ID_T241 0x036b0241 40 + static void __iomem *t241_scratch_regs[T241_CHIPS_MAX]; 33 41 34 42 /* 35 43 * mpam_list_lock protects the SRCU lists when writing. Once the ··· 82 74 83 75 /* When mpam is disabled, the printed reason to aid debugging */ 84 76 static char *mpam_disable_reason; 77 + 78 + /* 79 + * Whether resctrl has been setup. Used by cpuhp in preference to 80 + * mpam_is_enabled(). The disable call after an error interrupt makes 81 + * mpam_is_enabled() false before the cpuhp callbacks are made. 82 + * Reads/writes should hold mpam_cpuhp_state_lock, (or be cpuhp callbacks). 
83 + */ 84 + static bool mpam_resctrl_enabled; 85 85 86 86 /* 87 87 * An MSC is a physical container for controls and monitors, each identified by ··· 640 624 return ERR_PTR(-ENOENT); 641 625 } 642 626 627 + static int mpam_enable_quirk_nvidia_t241_1(struct mpam_msc *msc, 628 + const struct mpam_quirk *quirk) 629 + { 630 + s32 soc_id = arm_smccc_get_soc_id_version(); 631 + struct resource *r; 632 + phys_addr_t phys; 633 + 634 + /* 635 + * A mapping to a device other than the MSC is needed; check that 636 + * the SOC_ID is the NVIDIA T241 chip (036b:0241). 637 + */ 638 + if (soc_id < 0 || soc_id != SMCCC_SOC_ID_T241) 639 + return -EINVAL; 640 + 641 + r = platform_get_resource(msc->pdev, IORESOURCE_MEM, 0); 642 + if (!r) 643 + return -EINVAL; 644 + 645 + /* Find the internal registers base address from the CHIP ID */ 646 + msc->t241_id = T241_CHIP_ID(r->start); 647 + phys = FIELD_PREP(GENMASK_ULL(45, 44), msc->t241_id) | 0x19000000ULL; 648 + 649 + t241_scratch_regs[msc->t241_id] = ioremap(phys, SZ_8M); 650 + if (WARN_ON_ONCE(!t241_scratch_regs[msc->t241_id])) 651 + return -EINVAL; 652 + 653 + pr_info_once("Enabled workaround for NVIDIA T241 erratum T241-MPAM-1\n"); 654 + 655 + return 0; 656 + } 657 + 658 + static const struct mpam_quirk mpam_quirks[] = { 659 + { 660 + /* NVIDIA t241 erratum T241-MPAM-1 */ 661 + .init = mpam_enable_quirk_nvidia_t241_1, 662 + .iidr = MPAM_IIDR_NVIDIA_T241, 663 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 664 + .workaround = T241_SCRUB_SHADOW_REGS, 665 + }, 666 + { 667 + /* NVIDIA t241 erratum T241-MPAM-4 */ 668 + .iidr = MPAM_IIDR_NVIDIA_T241, 669 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 670 + .workaround = T241_FORCE_MBW_MIN_TO_ONE, 671 + }, 672 + { 673 + /* NVIDIA t241 erratum T241-MPAM-6 */ 674 + .iidr = MPAM_IIDR_NVIDIA_T241, 675 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 676 + .workaround = T241_MBW_COUNTER_SCALE_64, 677 + }, 678 + { 679 + /* ARM CMN-650 CSU erratum 3642720 */ 680 + .iidr = MPAM_IIDR_ARM_CMN_650, 681 + .iidr_mask = MPAM_IIDR_MATCH_ONE, 682 + 
.workaround = IGNORE_CSU_NRDY, 683 + }, 684 + { NULL } /* Sentinel */ 685 + }; 686 + 687 + static void mpam_enable_quirks(struct mpam_msc *msc) 688 + { 689 + const struct mpam_quirk *quirk; 690 + 691 + for (quirk = &mpam_quirks[0]; quirk->iidr_mask; quirk++) { 692 + int err = 0; 693 + 694 + if (quirk->iidr != (msc->iidr & quirk->iidr_mask)) 695 + continue; 696 + 697 + if (quirk->init) 698 + err = quirk->init(msc, quirk); 699 + 700 + if (err) 701 + continue; 702 + 703 + mpam_set_quirk(quirk->workaround, msc); 704 + } 705 + } 706 + 643 707 /* 644 708 * IHI009A.a has this nugget: "If a monitor does not support automatic behaviour 645 709 * of NRDY, software can use this bit for any purpose" - so hardware might not ··· 811 715 mpam_set_feature(mpam_feat_mbw_part, props); 812 716 813 717 props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); 718 + 719 + /* 720 + * The BWA_WD field can represent 0-63, but the control fields it 721 + * describes have a maximum of 16 bits. 722 + */ 723 + props->bwa_wd = min(props->bwa_wd, 16); 724 + 814 725 if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) 815 726 mpam_set_feature(mpam_feat_mbw_max, props); 816 727 ··· 954 851 /* Grab an IDR value to find out how many RIS there are */ 955 852 mutex_lock(&msc->part_sel_lock); 956 853 idr = mpam_msc_read_idr(msc); 854 + msc->iidr = mpam_read_partsel_reg(msc, IIDR); 957 855 mutex_unlock(&msc->part_sel_lock); 856 + 857 + mpam_enable_quirks(msc); 958 858 959 859 msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr); 960 860 ··· 1009 903 enum mpam_device_features type; 1010 904 u64 *val; 1011 905 int err; 906 + bool waited_timeout; 1012 907 }; 1013 908 1014 909 static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris) ··· 1159 1052 } 1160 1053 } 1161 1054 1162 - static u64 mpam_msmon_overflow_val(enum mpam_device_features type) 1055 + static u64 __mpam_msmon_overflow_val(enum mpam_device_features type) 1163 1056 { 1164 1057 /* TODO: implement scaling counters */ 
1165 1058 switch (type) { ··· 1172 1065 default: 1173 1066 return 0; 1174 1067 } 1068 + } 1069 + 1070 + static u64 mpam_msmon_overflow_val(enum mpam_device_features type, 1071 + struct mpam_msc *msc) 1072 + { 1073 + u64 overflow_val = __mpam_msmon_overflow_val(type); 1074 + 1075 + if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) && 1076 + type != mpam_feat_msmon_mbwu_63counter) 1077 + overflow_val *= 64; 1078 + 1079 + return overflow_val; 1175 1080 } 1176 1081 1177 1082 static void __ris_msmon_read(void *arg) ··· 1256 1137 if (mpam_has_feature(mpam_feat_msmon_csu_hw_nrdy, rprops)) 1257 1138 nrdy = now & MSMON___NRDY; 1258 1139 now = FIELD_GET(MSMON___VALUE, now); 1140 + 1141 + if (mpam_has_quirk(IGNORE_CSU_NRDY, msc) && m->waited_timeout) 1142 + nrdy = false; 1143 + 1259 1144 break; 1260 1145 case mpam_feat_msmon_mbwu_31counter: 1261 1146 case mpam_feat_msmon_mbwu_44counter: ··· 1280 1157 now = FIELD_GET(MSMON___VALUE, now); 1281 1158 } 1282 1159 1160 + if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) && 1161 + m->type != mpam_feat_msmon_mbwu_63counter) 1162 + now *= 64; 1163 + 1283 1164 if (nrdy) 1284 1165 break; 1285 1166 1286 1167 mbwu_state = &ris->mbwu_state[ctx->mon]; 1287 1168 1288 1169 if (overflow) 1289 - mbwu_state->correction += mpam_msmon_overflow_val(m->type); 1170 + mbwu_state->correction += mpam_msmon_overflow_val(m->type, msc); 1290 1171 1291 1172 /* 1292 1173 * Include bandwidth consumed before the last hardware reset and ··· 1397 1270 .ctx = ctx, 1398 1271 .type = type, 1399 1272 .val = val, 1273 + .waited_timeout = true, 1400 1274 }; 1401 1275 *val = 0; 1402 1276 ··· 1466 1338 __mpam_write_reg(msc, reg, bm); 1467 1339 } 1468 1340 1341 + static void mpam_apply_t241_erratum(struct mpam_msc_ris *ris, u16 partid) 1342 + { 1343 + int sidx, i, lcount = 1000; 1344 + void __iomem *regs; 1345 + u64 val0, val; 1346 + 1347 + regs = t241_scratch_regs[ris->vmsc->msc->t241_id]; 1348 + 1349 + for (i = 0; i < lcount; i++) { 1350 + /* Read the shadow register 
at index 0 */ 1351 + val0 = readq_relaxed(regs + T241_SHADOW_REG_OFF(0, partid)); 1352 + 1353 + /* Check if all the shadow registers have the same value */ 1354 + for (sidx = 1; sidx < T241_CHIP_NSLICES; sidx++) { 1355 + val = readq_relaxed(regs + 1356 + T241_SHADOW_REG_OFF(sidx, partid)); 1357 + if (val != val0) 1358 + break; 1359 + } 1360 + if (sidx == T241_CHIP_NSLICES) 1361 + break; 1362 + } 1363 + 1364 + if (i == lcount) 1365 + pr_warn_once("t241: inconsistent values in shadow regs\n"); 1366 + 1367 + /* Write zero to the spare registers so the MBW configuration takes effect */ 1368 + writeq_relaxed(0, regs + T241_SPARE_REG0_OFF); 1369 + writeq_relaxed(0, regs + T241_SPARE_REG1_OFF); 1370 + } 1371 + 1372 + static void mpam_quirk_post_config_change(struct mpam_msc_ris *ris, u16 partid, 1373 + struct mpam_config *cfg) 1374 + { 1375 + if (mpam_has_quirk(T241_SCRUB_SHADOW_REGS, ris->vmsc->msc)) 1376 + mpam_apply_t241_erratum(ris, partid); 1377 + } 1378 + 1379 + static u16 mpam_wa_t241_force_mbw_min_to_one(struct mpam_props *props) 1380 + { 1381 + u16 max_hw_value, min_hw_granule, res0_bits; 1382 + 1383 + res0_bits = 16 - props->bwa_wd; 1384 + max_hw_value = ((1 << props->bwa_wd) - 1) << res0_bits; 1385 + min_hw_granule = ~max_hw_value; 1386 + 1387 + return min_hw_granule + 1; 1388 + } 1389 + 1390 + static u16 mpam_wa_t241_calc_min_from_max(struct mpam_props *props, 1391 + struct mpam_config *cfg) 1392 + { 1393 + u16 val = 0; 1394 + u16 max; 1395 + u16 delta = ((5 * MPAMCFG_MBW_MAX_MAX) / 100) - 1; 1396 + 1397 + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) { 1398 + max = cfg->mbw_max; 1399 + } else { 1400 + /* Resetting, so use the RIS-specific default. */ 1401 + max = GENMASK(15, 16 - props->bwa_wd); 1402 + } 1403 + 1404 + if (max > delta) 1405 + val = max - delta; 1406 + 1407 + return val; 1408 + } 1409 + 1469 1410 /* Called via IPI. 
Call while holding an SRCU reference */ 1470 1411 static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, 1471 1412 struct mpam_config *cfg) ··· 1561 1364 __mpam_intpart_sel(ris->ris_idx, partid, msc); 1562 1365 } 1563 1366 1564 - if (mpam_has_feature(mpam_feat_cpor_part, rprops) && 1565 - mpam_has_feature(mpam_feat_cpor_part, cfg)) { 1566 - if (cfg->reset_cpbm) 1567 - mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); 1568 - else 1367 + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { 1368 + if (mpam_has_feature(mpam_feat_cpor_part, cfg)) 1569 1369 mpam_write_partsel_reg(msc, CPBM, cfg->cpbm); 1370 + else 1371 + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); 1570 1372 } 1571 1373 1572 - if (mpam_has_feature(mpam_feat_mbw_part, rprops) && 1573 - mpam_has_feature(mpam_feat_mbw_part, cfg)) { 1574 - if (cfg->reset_mbw_pbm) 1374 + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { 1375 + if (mpam_has_feature(mpam_feat_mbw_part, cfg)) 1575 1376 mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits); 1576 1377 else 1577 1378 mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); 1578 1379 } 1579 1380 1580 - if (mpam_has_feature(mpam_feat_mbw_min, rprops) && 1581 - mpam_has_feature(mpam_feat_mbw_min, cfg)) 1582 - mpam_write_partsel_reg(msc, MBW_MIN, 0); 1381 + if (mpam_has_feature(mpam_feat_mbw_min, rprops)) { 1382 + u16 val = 0; 1583 1383 1584 - if (mpam_has_feature(mpam_feat_mbw_max, rprops) && 1585 - mpam_has_feature(mpam_feat_mbw_max, cfg)) { 1586 - if (cfg->reset_mbw_max) 1587 - mpam_write_partsel_reg(msc, MBW_MAX, MPAMCFG_MBW_MAX_MAX); 1588 - else 1589 - mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); 1384 + if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, msc)) { 1385 + u16 min = mpam_wa_t241_force_mbw_min_to_one(rprops); 1386 + 1387 + val = mpam_wa_t241_calc_min_from_max(rprops, cfg); 1388 + val = max(val, min); 1389 + } 1390 + 1391 + mpam_write_partsel_reg(msc, MBW_MIN, val); 1590 1392 } 1591 1393 1592 - if 
(mpam_has_feature(mpam_feat_mbw_prop, rprops) && 1593 - mpam_has_feature(mpam_feat_mbw_prop, cfg)) 1394 + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { 1395 + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) 1396 + mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); 1397 + else 1398 + mpam_write_partsel_reg(msc, MBW_MAX, MPAMCFG_MBW_MAX_MAX); 1399 + } 1400 + 1401 + if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) 1594 1402 mpam_write_partsel_reg(msc, MBW_PROP, 0); 1595 1403 1596 1404 if (mpam_has_feature(mpam_feat_cmax_cmax, rprops)) ··· 1622 1420 1623 1421 mpam_write_partsel_reg(msc, PRI, pri_val); 1624 1422 } 1423 + 1424 + mpam_quirk_post_config_change(ris, partid, cfg); 1625 1425 1626 1426 mutex_unlock(&msc->part_sel_lock); 1627 1427 } ··· 1695 1491 return 0; 1696 1492 } 1697 1493 1698 - static void mpam_init_reset_cfg(struct mpam_config *reset_cfg) 1699 - { 1700 - *reset_cfg = (struct mpam_config) { 1701 - .reset_cpbm = true, 1702 - .reset_mbw_pbm = true, 1703 - .reset_mbw_max = true, 1704 - }; 1705 - bitmap_fill(reset_cfg->features, MPAM_FEATURE_LAST); 1706 - } 1707 - 1708 1494 /* 1709 1495 * Called via smp_call_on_cpu() to prevent migration, while still being 1710 1496 * pre-emptible. Caller must hold mpam_srcu. 
··· 1702 1508 static int mpam_reset_ris(void *arg) 1703 1509 { 1704 1510 u16 partid, partid_max; 1705 - struct mpam_config reset_cfg; 1511 + struct mpam_config reset_cfg = {}; 1706 1512 struct mpam_msc_ris *ris = arg; 1707 1513 1708 1514 if (ris->in_reset_state) 1709 1515 return 0; 1710 - 1711 - mpam_init_reset_cfg(&reset_cfg); 1712 1516 1713 1517 spin_lock(&partid_max_lock); 1714 1518 partid_max = mpam_partid_max; ··· 1822 1630 mpam_reprogram_msc(msc); 1823 1631 } 1824 1632 1633 + if (mpam_resctrl_enabled) 1634 + return mpam_resctrl_online_cpu(cpu); 1635 + 1825 1636 return 0; 1826 1637 } 1827 1638 ··· 1867 1672 static int mpam_cpu_offline(unsigned int cpu) 1868 1673 { 1869 1674 struct mpam_msc *msc; 1675 + 1676 + if (mpam_resctrl_enabled) 1677 + mpam_resctrl_offline_cpu(cpu); 1870 1678 1871 1679 guard(srcu)(&mpam_srcu); 1872 1680 list_for_each_entry_srcu(msc, &mpam_all_msc, all_msc_list, ··· 2167 1969 * resulting safe value must be compatible with both. When merging values in 2168 1970 * the tree, all the aliasing resources must be handled first. 2169 1971 * On mismatch, parent is modified. 1972 + * Quirks on an MSC will apply to all MSCs in that class. 2170 1973 */ 2171 1974 static void __props_mismatch(struct mpam_props *parent, 2172 1975 struct mpam_props *child, bool alias) ··· 2287 2088 * nobble the class feature, as we can't configure all the resources. 2288 2089 * e.g. The L3 cache is composed of two resources with 13 and 17 portion 2289 2090 * bitmaps respectively. 2091 + * Quirks on an MSC will apply to all MSCs in that class. 
2290 2092 */ 2291 2093 static void 2292 2094 __class_props_mismatch(struct mpam_class *class, struct mpam_vmsc *vmsc) ··· 2300 2100 2301 2101 dev_dbg(dev, "Merging features for class:0x%lx &= vmsc:0x%lx\n", 2302 2102 (long)cprops->features, (long)vprops->features); 2103 + 2104 + /* Merge quirks */ 2105 + class->quirks |= vmsc->msc->quirks; 2303 2106 2304 2107 /* Take the safe value for any common features */ 2305 2108 __props_mismatch(cprops, vprops, false); ··· 2368 2165 2369 2166 list_for_each_entry(vmsc, &comp->vmsc, comp_list) 2370 2167 __class_props_mismatch(class, vmsc); 2168 + 2169 + if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, class)) 2170 + mpam_clear_feature(mpam_feat_mbw_min, &class->props); 2371 2171 } 2372 2172 2373 2173 /* ··· 2724 2518 mutex_unlock(&mpam_list_lock); 2725 2519 cpus_read_unlock(); 2726 2520 2521 + if (!err) { 2522 + err = mpam_resctrl_setup(); 2523 + if (err) 2524 + pr_err("Failed to initialise resctrl: %d\n", err); 2525 + } 2526 + 2727 2527 if (err) { 2728 2528 mpam_disable_reason = "Failed to enable."; 2729 2529 schedule_work(&mpam_broken_work); ··· 2737 2525 } 2738 2526 2739 2527 static_branch_enable(&mpam_enabled); 2528 + mpam_resctrl_enabled = true; 2740 2529 mpam_register_cpuhp_callbacks(mpam_cpu_online, mpam_cpu_offline, 2741 2530 "mpam:online"); 2742 2531 ··· 2770 2557 } 2771 2558 } 2772 2559 2773 - static void mpam_reset_class_locked(struct mpam_class *class) 2560 + void mpam_reset_class_locked(struct mpam_class *class) 2774 2561 { 2775 2562 struct mpam_component *comp; 2776 2563 ··· 2797 2584 void mpam_disable(struct work_struct *ignored) 2798 2585 { 2799 2586 int idx; 2587 + bool do_resctrl_exit; 2800 2588 struct mpam_class *class; 2801 2589 struct mpam_msc *msc, *tmp; 2590 + 2591 + if (mpam_is_enabled()) 2592 + static_branch_disable(&mpam_enabled); 2802 2593 2803 2594 mutex_lock(&mpam_cpuhp_state_lock); 2804 2595 if (mpam_cpuhp_state) { 2805 2596 cpuhp_remove_state(mpam_cpuhp_state); 2806 2597 mpam_cpuhp_state = 0; 2807 
2598 } 2599 + 2600 + /* 2601 + * Removing the cpuhp state called mpam_cpu_offline() and told resctrl 2602 + * all the CPUs are offline. 2603 + */ 2604 + do_resctrl_exit = mpam_resctrl_enabled; 2605 + mpam_resctrl_enabled = false; 2808 2606 mutex_unlock(&mpam_cpuhp_state_lock); 2809 2607 2810 - static_branch_disable(&mpam_enabled); 2608 + if (do_resctrl_exit) 2609 + mpam_resctrl_exit(); 2811 2610 2812 2611 mpam_unregister_irqs(); 2813 2612 2814 2613 idx = srcu_read_lock(&mpam_srcu); 2815 2614 list_for_each_entry_srcu(class, &mpam_classes, classes_list, 2816 - srcu_read_lock_held(&mpam_srcu)) 2615 + srcu_read_lock_held(&mpam_srcu)) { 2817 2616 mpam_reset_class(class); 2617 + if (do_resctrl_exit) 2618 + mpam_resctrl_teardown_class(class); 2619 + } 2818 2620 srcu_read_unlock(&mpam_srcu, idx); 2819 2621 2820 2622 mutex_lock(&mpam_list_lock); ··· 2920 2692 srcu_read_lock_held(&mpam_srcu)) { 2921 2693 arg.ris = ris; 2922 2694 mpam_touch_msc(msc, __write_config, &arg); 2695 + ris->in_reset_state = false; 2923 2696 } 2924 2697 mutex_unlock(&msc->cfg_lock); 2925 2698 }
+101 -7
drivers/resctrl/mpam_internal.h
··· 12 12 #include <linux/jump_label.h> 13 13 #include <linux/llist.h> 14 14 #include <linux/mutex.h> 15 + #include <linux/resctrl.h> 15 16 #include <linux/spinlock.h> 16 17 #include <linux/srcu.h> 17 18 #include <linux/types.h> 18 19 20 + #include <asm/mpam.h> 21 + 19 22 #define MPAM_MSC_MAX_NUM_RIS 16 20 23 21 24 struct platform_device; 22 - 23 - DECLARE_STATIC_KEY_FALSE(mpam_enabled); 24 25 25 26 #ifdef CONFIG_MPAM_KUNIT_TEST 26 27 #define PACKED_FOR_KUNIT __packed 27 28 #else 28 29 #define PACKED_FOR_KUNIT 29 30 #endif 31 + 32 + /* 33 + * These 'mon' values must not alias an actual monitor, so must be larger than 34 + * U16_MAX, but must not be confused with an errno value, so must be smaller 35 + * than (u32)-SZ_4K. 36 + * USE_PRE_ALLOCATED is one such out-of-range value. 37 + */ 38 + #define USE_PRE_ALLOCATED (U16_MAX + 1) 30 39 31 40 static inline bool mpam_is_enabled(void) 32 41 { ··· 85 76 u8 pmg_max; 86 77 unsigned long ris_idxs; 87 78 u32 ris_max; 79 + u32 iidr; 80 + u16 quirks; 88 81 89 82 /* 90 83 * error_irq_lock is taken when registering/unregistering the error ··· 129 118 130 119 void __iomem *mapped_hwpage; 131 120 size_t mapped_hwpage_sz; 121 + 122 + /* Values only used on some platforms for quirks */ 123 + u32 t241_id; 132 124 133 125 struct mpam_garbage garbage; 134 126 }; ··· 221 207 #define mpam_set_feature(_feat, x) __set_bit(_feat, (x)->features) 222 208 #define mpam_clear_feature(_feat, x) __clear_bit(_feat, (x)->features) 223 209 210 + /* Workaround bits for msc->quirks */ 211 + enum mpam_device_quirks { 212 + T241_SCRUB_SHADOW_REGS, 213 + T241_FORCE_MBW_MIN_TO_ONE, 214 + T241_MBW_COUNTER_SCALE_64, 215 + IGNORE_CSU_NRDY, 216 + MPAM_QUIRK_LAST 217 + }; 218 + 219 + #define mpam_has_quirk(_quirk, x) (((1 << (_quirk)) & (x)->quirks)) 220 + #define mpam_set_quirk(_quirk, x) ((x)->quirks |= (1 << (_quirk))) 221 + 222 + struct mpam_quirk { 223 + int (*init)(struct mpam_msc *msc, const struct mpam_quirk *quirk); 224 + 225 + u32 iidr; 226 + u32 
iidr_mask; 227 + 228 + enum mpam_device_quirks workaround; 229 + }; 230 + 231 + #define MPAM_IIDR_MATCH_ONE (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0xfff) | \ 232 + FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0xf) | \ 233 + FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0xf) | \ 234 + FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0xfff)) 235 + 236 + #define MPAM_IIDR_NVIDIA_T241 (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0x241) | \ 237 + FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) | \ 238 + FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) | \ 239 + FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x36b)) 240 + 241 + #define MPAM_IIDR_ARM_CMN_650 (FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0) | \ 242 + FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) | \ 243 + FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) | \ 244 + FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x43b)) 245 + 224 246 /* The values for MSMON_CFG_MBWU_FLT.RWBW */ 225 247 enum mon_filter_options { 226 248 COUNT_BOTH = 0, ··· 265 215 }; 266 216 267 217 struct mon_cfg { 268 - u16 mon; 218 + /* 219 + * mon must be large enough to hold out of range values like 220 + * USE_PRE_ALLOCATED 221 + */ 222 + u32 mon; 269 223 u8 pmg; 270 224 bool match_pmg; 271 225 bool csu_exclude_clean; ··· 300 246 301 247 struct mpam_props props; 302 248 u32 nrdy_usec; 249 + u16 quirks; 303 250 u8 level; 304 251 enum mpam_class_types type; 305 252 ··· 320 265 u32 cpbm; 321 266 u32 mbw_pbm; 322 267 u16 mbw_max; 323 - 324 - bool reset_cpbm; 325 - bool reset_mbw_pbm; 326 - bool reset_mbw_max; 327 268 328 269 struct mpam_garbage garbage; 329 270 }; ··· 388 337 struct mpam_garbage garbage; 389 338 }; 390 339 340 + struct mpam_resctrl_dom { 341 + struct mpam_component *ctrl_comp; 342 + 343 + /* 344 + * There is no single mon_comp because different events may be backed 345 + * by different class/components. mon_comp is indexed by the event 346 + * number. 
347 + */ 348 + struct mpam_component *mon_comp[QOS_NUM_EVENTS]; 349 + 350 + struct rdt_ctrl_domain resctrl_ctrl_dom; 351 + struct rdt_l3_mon_domain resctrl_mon_dom; 352 + }; 353 + 354 + struct mpam_resctrl_res { 355 + struct mpam_class *class; 356 + struct rdt_resource resctrl_res; 357 + bool cdp_enabled; 358 + }; 359 + 360 + struct mpam_resctrl_mon { 361 + struct mpam_class *class; 362 + 363 + /* per-class data that resctrl needs will live here */ 364 + }; 365 + 391 366 static inline int mpam_alloc_csu_mon(struct mpam_class *class) 392 367 { 393 368 struct mpam_props *cprops = &class->props; ··· 458 381 void mpam_enable(struct work_struct *work); 459 382 void mpam_disable(struct work_struct *work); 460 383 384 + /* Reset all the RIS in a class under cpus_read_lock() */ 385 + void mpam_reset_class_locked(struct mpam_class *class); 386 + 461 387 int mpam_apply_config(struct mpam_component *comp, u16 partid, 462 388 struct mpam_config *cfg); 463 389 ··· 470 390 471 391 int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level, 472 392 cpumask_t *affinity); 393 + 394 + #ifdef CONFIG_RESCTRL_FS 395 + int mpam_resctrl_setup(void); 396 + void mpam_resctrl_exit(void); 397 + int mpam_resctrl_online_cpu(unsigned int cpu); 398 + void mpam_resctrl_offline_cpu(unsigned int cpu); 399 + void mpam_resctrl_teardown_class(struct mpam_class *class); 400 + #else 401 + static inline int mpam_resctrl_setup(void) { return 0; } 402 + static inline void mpam_resctrl_exit(void) { } 403 + static inline int mpam_resctrl_online_cpu(unsigned int cpu) { return 0; } 404 + static inline void mpam_resctrl_offline_cpu(unsigned int cpu) { } 405 + static inline void mpam_resctrl_teardown_class(struct mpam_class *class) { } 406 + #endif /* CONFIG_RESCTRL_FS */ 473 407 474 408 /* 475 409 * MPAM MSCs have the following register layout. See:
+1704
drivers/resctrl/mpam_resctrl.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + // Copyright (C) 2025 Arm Ltd. 3 + 4 + #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__ 5 + 6 + #include <linux/arm_mpam.h> 7 + #include <linux/cacheinfo.h> 8 + #include <linux/cpu.h> 9 + #include <linux/cpumask.h> 10 + #include <linux/errno.h> 11 + #include <linux/limits.h> 12 + #include <linux/list.h> 13 + #include <linux/math.h> 14 + #include <linux/printk.h> 15 + #include <linux/rculist.h> 16 + #include <linux/resctrl.h> 17 + #include <linux/slab.h> 18 + #include <linux/types.h> 19 + #include <linux/wait.h> 20 + 21 + #include <asm/mpam.h> 22 + 23 + #include "mpam_internal.h" 24 + 25 + DECLARE_WAIT_QUEUE_HEAD(resctrl_mon_ctx_waiters); 26 + 27 + /* 28 + * The classes we've picked to map to resctrl resources, wrapped up 29 + * with their resctrl structure. 30 + * Class pointer may be NULL. 31 + */ 32 + static struct mpam_resctrl_res mpam_resctrl_controls[RDT_NUM_RESOURCES]; 33 + 34 + #define for_each_mpam_resctrl_control(res, rid) \ 35 + for (rid = 0, res = &mpam_resctrl_controls[rid]; \ 36 + rid < RDT_NUM_RESOURCES; \ 37 + rid++, res = &mpam_resctrl_controls[rid]) 38 + 39 + /* 40 + * The classes we've picked to map to resctrl events. 41 + * Resctrl believes all the world's a Xeon, and these are all on the L3. This 42 + * array lets us find the actual class backing the event counters. e.g. 43 + * the only memory bandwidth counters may be on the memory controller, but to 44 + * make use of them, we pretend they are on L3. Restrict the events considered 45 + * to those supported by MPAM. 46 + * Class pointer may be NULL. 
47 + */ 48 + #define MPAM_MAX_EVENT QOS_L3_MBM_TOTAL_EVENT_ID 49 + static struct mpam_resctrl_mon mpam_resctrl_counters[MPAM_MAX_EVENT + 1]; 50 + 51 + #define for_each_mpam_resctrl_mon(mon, eventid) \ 52 + for (eventid = QOS_FIRST_EVENT, mon = &mpam_resctrl_counters[eventid]; \ 53 + eventid <= MPAM_MAX_EVENT; \ 54 + eventid++, mon = &mpam_resctrl_counters[eventid]) 55 + 56 + /* The lock for modifying resctrl's domain lists from cpuhp callbacks. */ 57 + static DEFINE_MUTEX(domain_list_lock); 58 + 59 + /* 60 + * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM0_EL1. 61 + * This applies globally to all traffic the CPU generates. 62 + */ 63 + static bool cdp_enabled; 64 + 65 + /* 66 + * We use cacheinfo to discover the size of the caches and their id. cacheinfo 67 + * populates this from a device_initcall(). mpam_resctrl_setup() must wait. 68 + */ 69 + static bool cacheinfo_ready; 70 + static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready); 71 + 72 + /* 73 + * If resctrl_init() succeeded, resctrl_exit() can be used to remove support 74 + * for the filesystem in the event of an error. 
75 + */ 76 + static bool resctrl_enabled; 77 + 78 + bool resctrl_arch_alloc_capable(void) 79 + { 80 + struct mpam_resctrl_res *res; 81 + enum resctrl_res_level rid; 82 + 83 + for_each_mpam_resctrl_control(res, rid) { 84 + if (res->resctrl_res.alloc_capable) 85 + return true; 86 + } 87 + 88 + return false; 89 + } 90 + 91 + bool resctrl_arch_mon_capable(void) 92 + { 93 + struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 94 + struct rdt_resource *l3 = &res->resctrl_res; 95 + 96 + /* All monitors are presented as being on the L3 cache */ 97 + return l3->mon_capable; 98 + } 99 + 100 + bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) 101 + { 102 + return false; 103 + } 104 + 105 + void resctrl_arch_mon_event_config_read(void *info) 106 + { 107 + } 108 + 109 + void resctrl_arch_mon_event_config_write(void *info) 110 + { 111 + } 112 + 113 + void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d) 114 + { 115 + } 116 + 117 + void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 118 + u32 closid, u32 rmid, enum resctrl_event_id eventid) 119 + { 120 + } 121 + 122 + void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 123 + u32 closid, u32 rmid, int cntr_id, 124 + enum resctrl_event_id eventid) 125 + { 126 + } 127 + 128 + void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 129 + enum resctrl_event_id evtid, u32 rmid, u32 closid, 130 + u32 cntr_id, bool assign) 131 + { 132 + } 133 + 134 + int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d, 135 + u32 unused, u32 rmid, int cntr_id, 136 + enum resctrl_event_id eventid, u64 *val) 137 + { 138 + return -EOPNOTSUPP; 139 + } 140 + 141 + bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r) 142 + { 143 + return false; 144 + } 145 + 146 + int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable) 147 + { 148 + return -EINVAL; 149 + 
} 150 + 151 + int resctrl_arch_io_alloc_enable(struct rdt_resource *r, bool enable) 152 + { 153 + return -EOPNOTSUPP; 154 + } 155 + 156 + bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r) 157 + { 158 + return false; 159 + } 160 + 161 + void resctrl_arch_pre_mount(void) 162 + { 163 + } 164 + 165 + bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) 166 + { 167 + return mpam_resctrl_controls[rid].cdp_enabled; 168 + } 169 + 170 + /** 171 + * resctrl_reset_task_closids() - Reset the PARTID/PMG values for all tasks. 172 + * 173 + * At boot, all existing tasks use partid zero for D and I. 174 + * To enable/disable CDP emulation, all these tasks need relabelling. 175 + */ 176 + static void resctrl_reset_task_closids(void) 177 + { 178 + struct task_struct *p, *t; 179 + 180 + read_lock(&tasklist_lock); 181 + for_each_process_thread(p, t) { 182 + resctrl_arch_set_closid_rmid(t, RESCTRL_RESERVED_CLOSID, 183 + RESCTRL_RESERVED_RMID); 184 + } 185 + read_unlock(&tasklist_lock); 186 + } 187 + 188 + int resctrl_arch_set_cdp_enabled(enum resctrl_res_level rid, bool enable) 189 + { 190 + u32 partid_i = RESCTRL_RESERVED_CLOSID, partid_d = RESCTRL_RESERVED_CLOSID; 191 + struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 192 + struct rdt_resource *l3 = &res->resctrl_res; 193 + int cpu; 194 + 195 + if (!IS_ENABLED(CONFIG_EXPERT) && enable) { 196 + /* 197 + * If the resctrl fs is mounted more than once, sequentially, 198 + * then CDP can lead to the use of out of range PARTIDs. 199 + */ 200 + pr_warn("CDP not supported\n"); 201 + return -EOPNOTSUPP; 202 + } 203 + 204 + if (enable) 205 + pr_warn("CDP is an expert feature and may cause MPAM to malfunction.\n"); 206 + 207 + /* 208 + * resctrl_arch_set_cdp_enabled() is only called with enable set to 209 + * false on error and unmount. 
210 + */ 211 + cdp_enabled = enable; 212 + mpam_resctrl_controls[rid].cdp_enabled = enable; 213 + 214 + if (enable) 215 + l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx() / 2; 216 + else 217 + l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx(); 218 + 219 + /* The mbw_max feature can't hide cdp as it's a per-partid maximum. */ 220 + if (cdp_enabled && !mpam_resctrl_controls[RDT_RESOURCE_MBA].cdp_enabled) 221 + mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = false; 222 + 223 + if (mpam_resctrl_controls[RDT_RESOURCE_MBA].cdp_enabled && 224 + mpam_resctrl_controls[RDT_RESOURCE_MBA].class) 225 + mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = true; 226 + 227 + if (enable) { 228 + if (mpam_partid_max < 1) 229 + return -EINVAL; 230 + 231 + partid_d = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_DATA); 232 + partid_i = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_CODE); 233 + } 234 + 235 + mpam_set_task_partid_pmg(current, partid_d, partid_i, 0, 0); 236 + WRITE_ONCE(arm64_mpam_global_default, mpam_get_regval(current)); 237 + 238 + resctrl_reset_task_closids(); 239 + 240 + for_each_possible_cpu(cpu) 241 + mpam_set_cpu_defaults(cpu, partid_d, partid_i, 0, 0); 242 + on_each_cpu(resctrl_arch_sync_cpu_closid_rmid, NULL, 1); 243 + 244 + return 0; 245 + } 246 + 247 + static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) 248 + { 249 + return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); 250 + } 251 + 252 + /* 253 + * An MSC may raise an error interrupt if it sees an out of range partid/pmg, 254 + * and go on to truncate the value. Regardless of what the hardware supports, 255 + * only the system-wide safe value is safe to use. 
256 + */ 257 + u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) 258 + { 259 + return mpam_partid_max + 1; 260 + } 261 + 262 + u32 resctrl_arch_system_num_rmid_idx(void) 263 + { 264 + return (mpam_pmg_max + 1) * (mpam_partid_max + 1); 265 + } 266 + 267 + u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) 268 + { 269 + return closid * (mpam_pmg_max + 1) + rmid; 270 + } 271 + 272 + void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) 273 + { 274 + *closid = idx / (mpam_pmg_max + 1); 275 + *rmid = idx % (mpam_pmg_max + 1); 276 + } 277 + 278 + void resctrl_arch_sched_in(struct task_struct *tsk) 279 + { 280 + lockdep_assert_preemption_disabled(); 281 + 282 + mpam_thread_switch(tsk); 283 + } 284 + 285 + void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid) 286 + { 287 + WARN_ON_ONCE(closid > U16_MAX); 288 + WARN_ON_ONCE(rmid > U8_MAX); 289 + 290 + if (!cdp_enabled) { 291 + mpam_set_cpu_defaults(cpu, closid, closid, rmid, rmid); 292 + } else { 293 + /* 294 + * When CDP is enabled, resctrl halves the closid range and we 295 + * use odd/even partid for one closid. 
296 + */ 297 + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); 298 + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); 299 + 300 + mpam_set_cpu_defaults(cpu, partid_d, partid_i, rmid, rmid); 301 + } 302 + } 303 + 304 + void resctrl_arch_sync_cpu_closid_rmid(void *info) 305 + { 306 + struct resctrl_cpu_defaults *r = info; 307 + 308 + lockdep_assert_preemption_disabled(); 309 + 310 + if (r) { 311 + resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(), 312 + r->closid, r->rmid); 313 + } 314 + 315 + resctrl_arch_sched_in(current); 316 + } 317 + 318 + void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) 319 + { 320 + WARN_ON_ONCE(closid > U16_MAX); 321 + WARN_ON_ONCE(rmid > U8_MAX); 322 + 323 + if (!cdp_enabled) { 324 + mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); 325 + } else { 326 + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); 327 + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); 328 + 329 + mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); 330 + } 331 + } 332 + 333 + bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) 334 + { 335 + u64 regval = mpam_get_regval(tsk); 336 + u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); 337 + 338 + if (cdp_enabled) 339 + tsk_closid >>= 1; 340 + 341 + return tsk_closid == closid; 342 + } 343 + 344 + /* The task's pmg is not unique, the partid must be considered too */ 345 + bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) 346 + { 347 + u64 regval = mpam_get_regval(tsk); 348 + u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); 349 + u32 tsk_rmid = FIELD_GET(MPAM0_EL1_PMG_D, regval); 350 + 351 + if (cdp_enabled) 352 + tsk_closid >>= 1; 353 + 354 + return (tsk_closid == closid) && (tsk_rmid == rmid); 355 + } 356 + 357 + struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) 358 + { 359 + if (l >= RDT_NUM_RESOURCES) 360 + return NULL; 361 + 362 + return 
		&mpam_resctrl_controls[l].resctrl_res;
}

static int resctrl_arch_mon_ctx_alloc_no_wait(enum resctrl_event_id evtid)
{
	struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evtid];

	if (!mpam_is_enabled())
		return -EINVAL;

	if (!mon->class)
		return -EINVAL;

	switch (evtid) {
	case QOS_L3_OCCUP_EVENT_ID:
		/* With CDP, one monitor gets used for both code/data reads */
		return mpam_alloc_csu_mon(mon->class);
	case QOS_L3_MBM_LOCAL_EVENT_ID:
	case QOS_L3_MBM_TOTAL_EVENT_ID:
		return USE_PRE_ALLOCATED;
	default:
		return -EOPNOTSUPP;
	}
}

void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r,
				 enum resctrl_event_id evtid)
{
	DEFINE_WAIT(wait);
	int *ret;

	ret = kmalloc_obj(*ret);
	if (!ret)
		return ERR_PTR(-ENOMEM);

	do {
		prepare_to_wait(&resctrl_mon_ctx_waiters, &wait,
				TASK_INTERRUPTIBLE);
		*ret = resctrl_arch_mon_ctx_alloc_no_wait(evtid);
		if (*ret == -ENOSPC)
			schedule();
	} while (*ret == -ENOSPC && !signal_pending(current));
	finish_wait(&resctrl_mon_ctx_waiters, &wait);

	return ret;
}

static void resctrl_arch_mon_ctx_free_no_wait(enum resctrl_event_id evtid,
					      u32 mon_idx)
{
	struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evtid];

	if (!mpam_is_enabled())
		return;

	if (!mon->class)
		return;

	if (evtid == QOS_L3_OCCUP_EVENT_ID)
		mpam_free_csu_mon(mon->class, mon_idx);

	wake_up(&resctrl_mon_ctx_waiters);
}

void resctrl_arch_mon_ctx_free(struct rdt_resource *r,
			       enum resctrl_event_id evtid, void *arch_mon_ctx)
{
	u32 mon_idx = *(u32 *)arch_mon_ctx;

	kfree(arch_mon_ctx);

	resctrl_arch_mon_ctx_free_no_wait(evtid, mon_idx);
}

static int __read_mon(struct
		      mpam_resctrl_mon *mon, struct mpam_component *mon_comp,
		      enum mpam_device_features mon_type, int mon_idx,
		      enum resctrl_conf_type cdp_type, u32 closid, u32 rmid,
		      u64 *val)
{
	struct mon_cfg cfg;

	if (!mpam_is_enabled())
		return -EINVAL;

	/* Shift closid to account for CDP */
	closid = resctrl_get_config_index(closid, cdp_type);

	if (irqs_disabled()) {
		/* Check if we can access this domain without an IPI */
		return -EIO;
	}

	cfg = (struct mon_cfg) {
		.mon = mon_idx,
		.match_pmg = true,
		.partid = closid,
		.pmg = rmid,
	};

	return mpam_msmon_read(mon_comp, &cfg, mon_type, val);
}

static int read_mon_cdp_safe(struct mpam_resctrl_mon *mon,
			     struct mpam_component *mon_comp,
			     enum mpam_device_features mon_type,
			     int mon_idx, u32 closid, u32 rmid, u64 *val)
{
	if (cdp_enabled) {
		u64 code_val = 0, data_val = 0;
		int err;

		err = __read_mon(mon, mon_comp, mon_type, mon_idx,
				 CDP_CODE, closid, rmid, &code_val);
		if (err)
			return err;

		err = __read_mon(mon, mon_comp, mon_type, mon_idx,
				 CDP_DATA, closid, rmid, &data_val);
		if (err)
			return err;

		*val += code_val + data_val;
		return 0;
	}

	return __read_mon(mon, mon_comp, mon_type, mon_idx,
			  CDP_NONE, closid, rmid, val);
}

/* MBWU when not in ABMC mode (not supported), and CSU counters.
 */
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
			   void *arch_priv, u64 *val, void *arch_mon_ctx)
{
	struct mpam_resctrl_dom *l3_dom;
	struct mpam_component *mon_comp;
	u32 mon_idx = *(u32 *)arch_mon_ctx;
	enum mpam_device_features mon_type;
	struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[eventid];

	resctrl_arch_rmid_read_context_check();

	if (!mpam_is_enabled())
		return -EINVAL;

	if (eventid >= QOS_NUM_EVENTS || !mon->class)
		return -EINVAL;

	l3_dom = container_of(hdr, struct mpam_resctrl_dom, resctrl_mon_dom.hdr);
	mon_comp = l3_dom->mon_comp[eventid];

	if (eventid != QOS_L3_OCCUP_EVENT_ID)
		return -EINVAL;

	mon_type = mpam_feat_msmon_csu;

	return read_mon_cdp_safe(mon, mon_comp, mon_type, mon_idx,
				 closid, rmid, val);
}

/*
 * The rmid realloc threshold should be for the smallest cache exposed to
 * resctrl.
 */
static int update_rmid_limits(struct mpam_class *class)
{
	u32 num_unique_pmg = resctrl_arch_system_num_rmid_idx();
	struct mpam_props *cprops = &class->props;
	struct cacheinfo *ci;

	lockdep_assert_cpus_held();

	if (!mpam_has_feature(mpam_feat_msmon_csu, cprops))
		return 0;

	/*
	 * Assume cache levels are the same size for all CPUs...
	 * The check just requires any online CPU and it can't go offline as we
	 * hold the cpu lock.
	 */
	ci = get_cpu_cacheinfo_level(raw_smp_processor_id(), class->level);
	if (!ci || ci->size == 0) {
		pr_debug("Could not read cache size for class %u\n",
			 class->level);
		return -EINVAL;
	}

	if (!resctrl_rmid_realloc_limit ||
	    ci->size < resctrl_rmid_realloc_limit) {
		resctrl_rmid_realloc_limit = ci->size;
		resctrl_rmid_realloc_threshold = ci->size / num_unique_pmg;
	}

	return 0;
}

static bool cache_has_usable_cpor(struct mpam_class *class)
{
	struct mpam_props *cprops = &class->props;

	if (!mpam_has_feature(mpam_feat_cpor_part, cprops))
		return false;

	/* resctrl uses u32 for all bitmap configurations */
	return class->props.cpbm_wd <= 32;
}

static bool mba_class_use_mbw_max(struct mpam_props *cprops)
{
	return (mpam_has_feature(mpam_feat_mbw_max, cprops) &&
		cprops->bwa_wd);
}

static bool class_has_usable_mba(struct mpam_props *cprops)
{
	return mba_class_use_mbw_max(cprops);
}

static bool cache_has_usable_csu(struct mpam_class *class)
{
	struct mpam_props *cprops;

	if (!class)
		return false;

	cprops = &class->props;

	if (!mpam_has_feature(mpam_feat_msmon_csu, cprops))
		return false;

	/*
	 * CSU counters settle on the value, so we can get away with
	 * having only one.
	 */
	if (!cprops->num_csu_mon)
		return false;

	return true;
}

/*
 * Calculate the worst-case percentage change from each implemented step
 * in the control.
 */
static u32 get_mba_granularity(struct mpam_props *cprops)
{
	if (!mba_class_use_mbw_max(cprops))
		return 0;

	/*
	 * bwa_wd is the number of bits implemented in the 0.xxx
	 * fixed point fraction. 1 bit is 50%, 2 is 25% etc.
	 */
	return DIV_ROUND_UP(MAX_MBA_BW, 1 << cprops->bwa_wd);
}

/*
 * Each fixed-point hardware value architecturally represents a range
 * of values: the full range 0% - 100% is split contiguously into
 * (1 << cprops->bwa_wd) equal bands.
 *
 * Although the bwa_wd fields have 6 bits, the maximum valid value is 16
 * as it reports the width of fields that are at most 16 bits. When
 * fewer than 16 bits are valid, the least significant bits are
 * ignored. The implied binary point is kept between bits 15 and 16, and
 * so the valid bits are leftmost.
 *
 * See ARM IHI0099B.a "MPAM system component specification", Section 9.3,
 * "The fixed-point fractional format" for more information.
 *
 * Find the nearest percentage value to the upper bound of the selected band:
 */
static u32 mbw_max_to_percent(u16 mbw_max, struct mpam_props *cprops)
{
	u32 val = mbw_max;

	val >>= 16 - cprops->bwa_wd;
	val += 1;
	val *= MAX_MBA_BW;
	val = DIV_ROUND_CLOSEST(val, 1 << cprops->bwa_wd);

	return val;
}

/*
 * Find the band whose upper bound is closest to the specified percentage.
 *
 * A round-to-nearest policy is followed here as a balanced compromise
 * between unexpected under-commit of the resource (where the total of
 * a set of resource allocations after conversion is less than the
 * expected total, due to rounding of the individual converted
 * percentages) and over-commit (where the total of the converted
 * allocations is greater than expected).
 */
static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops)
{
	u32 val = pc;

	val <<= cprops->bwa_wd;
	val = DIV_ROUND_CLOSEST(val, MAX_MBA_BW);
	val = max(val, 1) - 1;
	val <<= 16 - cprops->bwa_wd;

	return val;
}

static u32 get_mba_min(struct mpam_props *cprops)
{
	if (!mba_class_use_mbw_max(cprops)) {
		WARN_ON_ONCE(1);
		return 0;
	}

	return mbw_max_to_percent(0, cprops);
}

/* Find the L3 cache that has affinity with this CPU */
static int find_l3_equivalent_bitmask(int cpu, cpumask_var_t tmp_cpumask)
{
	u32 cache_id = get_cpu_cacheinfo_id(cpu, 3);

	lockdep_assert_cpus_held();

	return mpam_get_cpumask_from_cache_id(cache_id, 3, tmp_cpumask);
}

/*
 * topology_matches_l3() - Is the provided class the same shape as L3?
 * @victim: The class we'd like to pretend is L3.
 *
 * resctrl expects all the world's a Xeon, and all counters are on the
 * L3. We allow mapping some counters on other classes. This requires
 * that the CPU->domain mapping is the same kind of shape.
 *
 * Using cacheinfo directly would make this work even if resctrl can't
 * use the L3 - but cacheinfo can't tell us anything about offline CPUs.
 * Using the L3 resctrl domain list also depends on CPUs being online.
 * Using the mpam_class we picked for L3 so we can use its domain list
 * assumes that there are MPAM controls on the L3.
 * Instead, this path eventually uses the mpam_get_cpumask_from_cache_id()
 * helper which can tell us about offline CPUs ... but getting the cache_id
 * to start with relies on at least one CPU per L3 cache being online at
 * boot.
 *
 * Walk the victim component list and compare the affinity mask with the
 * corresponding L3.
 * The topology matches if each victim component's affinity
 * mask is the same as the CPU's corresponding L3's. These lists/masks are
 * computed from firmware tables so don't change at runtime.
 */
static bool topology_matches_l3(struct mpam_class *victim)
{
	int cpu, err;
	struct mpam_component *victim_iter;

	lockdep_assert_cpus_held();

	cpumask_var_t __free(free_cpumask_var) tmp_cpumask = CPUMASK_VAR_NULL;
	if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL))
		return false;

	guard(srcu)(&mpam_srcu);
	list_for_each_entry_srcu(victim_iter, &victim->components, class_list,
				 srcu_read_lock_held(&mpam_srcu)) {
		if (cpumask_empty(&victim_iter->affinity)) {
			pr_debug("class %u has CPU-less component %u - can't match L3!\n",
				 victim->level, victim_iter->comp_id);
			return false;
		}

		cpu = cpumask_any_and(&victim_iter->affinity, cpu_online_mask);
		if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
			return false;

		cpumask_clear(tmp_cpumask);
		err = find_l3_equivalent_bitmask(cpu, tmp_cpumask);
		if (err) {
			pr_debug("Failed to find L3's equivalent component to class %u component %u\n",
				 victim->level, victim_iter->comp_id);
			return false;
		}

		/* Any differing bits in the affinity mask? */
		if (!cpumask_equal(tmp_cpumask, &victim_iter->affinity)) {
			pr_debug("class %u component %u has mismatched CPU mask with L3 equivalent\n"
				 "L3:%*pbl != victim:%*pbl\n",
				 victim->level, victim_iter->comp_id,
				 cpumask_pr_args(tmp_cpumask),
				 cpumask_pr_args(&victim_iter->affinity));

			return false;
		}
	}

	return true;
}

/*
 * Test if the traffic for a class matches that at egress from the L3.
 * For MSC at memory controllers this is only possible if there is a single
 * L3, as otherwise the counters at the memory can include bandwidth from
 * the non-local L3.
 */
static bool traffic_matches_l3(struct mpam_class *class)
{
	int err, cpu;

	lockdep_assert_cpus_held();

	if (class->type == MPAM_CLASS_CACHE && class->level == 3)
		return true;

	if (class->type == MPAM_CLASS_CACHE && class->level != 3) {
		pr_debug("class %u is a different cache from L3\n", class->level);
		return false;
	}

	if (class->type != MPAM_CLASS_MEMORY) {
		pr_debug("class %u is neither of type cache nor memory\n", class->level);
		return false;
	}

	cpumask_var_t __free(free_cpumask_var) tmp_cpumask = CPUMASK_VAR_NULL;
	if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL)) {
		pr_debug("cpumask allocation failed\n");
		return false;
	}

	cpu = cpumask_any_and(&class->affinity, cpu_online_mask);
	err = find_l3_equivalent_bitmask(cpu, tmp_cpumask);
	if (err) {
		pr_debug("Failed to find L3 downstream to cpu %d\n", cpu);
		return false;
	}

	if (!cpumask_equal(tmp_cpumask, cpu_possible_mask)) {
		pr_debug("There is more than one L3\n");
		return false;
	}

	/* Be strict; the traffic might stop in the intermediate cache. */
	if (get_cpu_cacheinfo_id(cpu, 4) != -1) {
		pr_debug("L3 isn't the last level of cache\n");
		return false;
	}

	if (num_possible_nodes() > 1) {
		pr_debug("There is more than one numa node\n");
		return false;
	}

#ifdef CONFIG_HMEM_REPORTING
	if (node_devices[cpu_to_node(cpu)]->cache_dev) {
		pr_debug("There is a memory side cache\n");
		return false;
	}
#endif

	return true;
}

/* Test whether we can export MPAM_CLASS_CACHE:{2,3}.
 */
static void mpam_resctrl_pick_caches(void)
{
	struct mpam_class *class;
	struct mpam_resctrl_res *res;

	lockdep_assert_cpus_held();

	guard(srcu)(&mpam_srcu);
	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
				 srcu_read_lock_held(&mpam_srcu)) {
		if (class->type != MPAM_CLASS_CACHE) {
			pr_debug("class %u is not a cache\n", class->level);
			continue;
		}

		if (class->level != 2 && class->level != 3) {
			pr_debug("class %u is not L2 or L3\n", class->level);
			continue;
		}

		if (!cache_has_usable_cpor(class)) {
			pr_debug("class %u cache lacks CPOR\n", class->level);
			continue;
		}

		if (!cpumask_equal(&class->affinity, cpu_possible_mask)) {
			pr_debug("class %u has missing CPUs, mask %*pb != %*pb\n",
				 class->level,
				 cpumask_pr_args(&class->affinity),
				 cpumask_pr_args(cpu_possible_mask));
			continue;
		}

		if (class->level == 2)
			res = &mpam_resctrl_controls[RDT_RESOURCE_L2];
		else
			res = &mpam_resctrl_controls[RDT_RESOURCE_L3];
		res->class = class;
	}
}

static void mpam_resctrl_pick_mba(void)
{
	struct mpam_class *class, *candidate_class = NULL;
	struct mpam_resctrl_res *res;

	lockdep_assert_cpus_held();

	guard(srcu)(&mpam_srcu);
	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
				 srcu_read_lock_held(&mpam_srcu)) {
		struct mpam_props *cprops = &class->props;

		if (class->level != 3 && class->type == MPAM_CLASS_CACHE) {
			pr_debug("class %u is a cache but not the L3\n", class->level);
			continue;
		}

		if (!class_has_usable_mba(cprops)) {
			pr_debug("class %u has no bandwidth control\n",
				 class->level);
			continue;
		}

		if (!cpumask_equal(&class->affinity, cpu_possible_mask)) {
			pr_debug("class %u has missing CPUs\n",
				 class->level);
			continue;
		}

		if (!topology_matches_l3(class)) {
			pr_debug("class %u topology doesn't match L3\n",
				 class->level);
			continue;
		}

		if (!traffic_matches_l3(class)) {
			pr_debug("class %u traffic doesn't match L3 egress\n",
				 class->level);
			continue;
		}

		/*
		 * Pick a resource to be MBA that is as close as possible to
		 * the L3. mbm_total counts the bandwidth leaving the L3
		 * cache, and MBA should correspond as closely as possible
		 * for proper operation of mba_sc.
		 */
		if (!candidate_class || class->level < candidate_class->level)
			candidate_class = class;
	}

	if (candidate_class) {
		pr_debug("selected class %u to back MBA\n",
			 candidate_class->level);
		res = &mpam_resctrl_controls[RDT_RESOURCE_MBA];
		res->class = candidate_class;
	}
}

static void counter_update_class(enum resctrl_event_id evt_id,
				 struct mpam_class *class)
{
	struct mpam_class *existing_class = mpam_resctrl_counters[evt_id].class;

	if (existing_class) {
		if (class->level == 3) {
			pr_debug("Existing class is L3 - L3 wins\n");
			return;
		}

		if (existing_class->level < class->level) {
			pr_debug("Existing class is closer to L3, %u versus %u - closer is better\n",
				 existing_class->level, class->level);
			return;
		}
	}

	mpam_resctrl_counters[evt_id].class = class;
}

static void mpam_resctrl_pick_counters(void)
{
	struct mpam_class *class;

	lockdep_assert_cpus_held();

	guard(srcu)(&mpam_srcu);
	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
				 srcu_read_lock_held(&mpam_srcu)) {
		/* The name of the resource is L3...
		 */
		if (class->type == MPAM_CLASS_CACHE && class->level != 3) {
			pr_debug("class %u is a cache but not the L3", class->level);
			continue;
		}

		if (!cpumask_equal(&class->affinity, cpu_possible_mask)) {
			pr_debug("class %u does not cover all CPUs",
				 class->level);
			continue;
		}

		if (cache_has_usable_csu(class)) {
			pr_debug("class %u has usable CSU",
				 class->level);

			/* CSU counters only make sense on a cache. */
			switch (class->type) {
			case MPAM_CLASS_CACHE:
				if (update_rmid_limits(class))
					break;

				counter_update_class(QOS_L3_OCCUP_EVENT_ID, class);
				break;
			default:
				break;
			}
		}
	}
}

static int mpam_resctrl_control_init(struct mpam_resctrl_res *res)
{
	struct mpam_class *class = res->class;
	struct mpam_props *cprops = &class->props;
	struct rdt_resource *r = &res->resctrl_res;

	switch (r->rid) {
	case RDT_RESOURCE_L2:
	case RDT_RESOURCE_L3:
		r->schema_fmt = RESCTRL_SCHEMA_BITMAP;
		r->cache.arch_has_sparse_bitmasks = true;

		r->cache.cbm_len = class->props.cpbm_wd;
		/* mpam_devices will reject empty bitmaps */
		r->cache.min_cbm_bits = 1;

		if (r->rid == RDT_RESOURCE_L2) {
			r->name = "L2";
			r->ctrl_scope = RESCTRL_L2_CACHE;
			r->cdp_capable = true;
		} else {
			r->name = "L3";
			r->ctrl_scope = RESCTRL_L3_CACHE;
			r->cdp_capable = true;
		}

		/*
		 * Which bits are shared with other ...things... Unknown
		 * devices use partid-0 which uses all the bitmap fields. Until
		 * we have configured the SMMU and GIC not to do this, 'all the
		 * bits' is the correct answer here.
		 */
		r->cache.shareable_bits = resctrl_get_default_ctrl(r);
		r->alloc_capable = true;
		break;
	case RDT_RESOURCE_MBA:
		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
		r->ctrl_scope = RESCTRL_L3_CACHE;

		r->membw.delay_linear = true;
		r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED;
		r->membw.min_bw = get_mba_min(cprops);
		r->membw.max_bw = MAX_MBA_BW;
		r->membw.bw_gran = get_mba_granularity(cprops);

		r->name = "MB";
		r->alloc_capable = true;
		break;
	default:
		return -EINVAL;
	}

	return 0;
}

static int mpam_resctrl_pick_domain_id(int cpu, struct mpam_component *comp)
{
	struct mpam_class *class = comp->class;

	if (class->type == MPAM_CLASS_CACHE)
		return comp->comp_id;

	if (topology_matches_l3(class)) {
		/* Use the corresponding L3 component ID as the domain ID */
		int id = get_cpu_cacheinfo_id(cpu, 3);

		/* Implies topology_matches_l3() made a mistake */
		if (WARN_ON_ONCE(id == -1))
			return comp->comp_id;

		return id;
	}

	/* Otherwise, expose the ID used by the firmware table code. */
	return comp->comp_id;
}

static int mpam_resctrl_monitor_init(struct mpam_resctrl_mon *mon,
				     enum resctrl_event_id type)
{
	struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3];
	struct rdt_resource *l3 = &res->resctrl_res;

	lockdep_assert_cpus_held();

	/*
	 * There also needs to be an L3 cache present.
	 * The check just requires any online CPU and it can't go offline as we
	 * hold the cpu lock.
	 */
	if (get_cpu_cacheinfo_id(raw_smp_processor_id(), 3) == -1)
		return 0;

	/*
	 * If there are no MPAM resources on L3, force it into existence.
	 * topology_matches_l3() already ensures this looks like the L3.
	 * The domain-ids will be fixed up by mpam_resctrl_domain_hdr_init().
	 */
	if (!res->class) {
		pr_warn_once("Faking L3 MSC to enable counters.\n");
		res->class = mpam_resctrl_counters[type].class;
	}

	/*
	 * Called multiple times, once per event type that has a
	 * monitoring class.
	 * Setting the name is necessary on monitor-only platforms.
	 */
	l3->name = "L3";
	l3->mon_scope = RESCTRL_L3_CACHE;

	/*
	 * num-rmid is the upper bound for the number of monitoring groups that
	 * can exist simultaneously, including the default monitoring group for
	 * each control group. Hence, advertise the whole rmid_idx space even
	 * though each control group has its own pmg/rmid space. Unfortunately,
	 * this does mean userspace needs to know the architecture to correctly
	 * interpret this value.
	 */
	l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx();

	if (resctrl_enable_mon_event(type, false, 0, NULL))
		l3->mon_capable = true;

	return 0;
}

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
			    u32 closid, enum resctrl_conf_type type)
{
	u32 partid;
	struct mpam_config *cfg;
	struct mpam_props *cprops;
	struct mpam_resctrl_res *res;
	struct mpam_resctrl_dom *dom;
	enum mpam_device_features configured_by;

	lockdep_assert_cpus_held();

	if (!mpam_is_enabled())
		return resctrl_get_default_ctrl(r);

	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
	dom = container_of(d, struct mpam_resctrl_dom, resctrl_ctrl_dom);
	cprops = &res->class->props;

	/*
	 * When CDP is enabled but the resource doesn't support it,
	 * the control is cloned across both partids.
	 * Pick one at random to read:
	 */
	if (mpam_resctrl_hide_cdp(r->rid))
		type = CDP_DATA;

	partid = resctrl_get_config_index(closid, type);
	cfg = &dom->ctrl_comp->cfg[partid];

	switch (r->rid) {
	case RDT_RESOURCE_L2:
	case RDT_RESOURCE_L3:
		configured_by = mpam_feat_cpor_part;
		break;
	case RDT_RESOURCE_MBA:
		if (mpam_has_feature(mpam_feat_mbw_max, cprops)) {
			configured_by = mpam_feat_mbw_max;
			break;
		}
		fallthrough;
	default:
		return resctrl_get_default_ctrl(r);
	}

	if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r) ||
	    !mpam_has_feature(configured_by, cfg))
		return resctrl_get_default_ctrl(r);

	switch (configured_by) {
	case mpam_feat_cpor_part:
		return cfg->cpbm;
	case mpam_feat_mbw_max:
		return mbw_max_to_percent(cfg->mbw_max, cprops);
	default:
		return resctrl_get_default_ctrl(r);
	}
}

int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
	int err;
	u32 partid;
	struct mpam_config cfg;
	struct mpam_props *cprops;
	struct mpam_resctrl_res *res;
	struct mpam_resctrl_dom *dom;

	lockdep_assert_cpus_held();
	lockdep_assert_irqs_enabled();

	if (!mpam_is_enabled())
		return -EINVAL;

	/*
	 * No need to check the CPU as mpam_apply_config() doesn't care, and
	 * resctrl_arch_update_domains() relies on this.
	 */
	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
	dom = container_of(d, struct mpam_resctrl_dom, resctrl_ctrl_dom);
	cprops = &res->class->props;

	if (mpam_resctrl_hide_cdp(r->rid))
		t = CDP_DATA;

	partid = resctrl_get_config_index(closid, t);
	if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) {
		pr_debug("Not alloc capable or computed PARTID out of range\n");
		return -EINVAL;
	}

	/*
	 * Copy the current config to avoid clearing other resources when the
	 * same component is exposed multiple times through resctrl.
	 */
	cfg = dom->ctrl_comp->cfg[partid];

	switch (r->rid) {
	case RDT_RESOURCE_L2:
	case RDT_RESOURCE_L3:
		cfg.cpbm = cfg_val;
		mpam_set_feature(mpam_feat_cpor_part, &cfg);
		break;
	case RDT_RESOURCE_MBA:
		if (mpam_has_feature(mpam_feat_mbw_max, cprops)) {
			cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops);
			mpam_set_feature(mpam_feat_mbw_max, &cfg);
			break;
		}
		fallthrough;
	default:
		return -EINVAL;
	}

	/*
	 * When CDP is enabled but the resource doesn't support it, we need to
	 * apply the same configuration to the other partid.
	 */
	if (mpam_resctrl_hide_cdp(r->rid)) {
		partid = resctrl_get_config_index(closid, CDP_CODE);
		err = mpam_apply_config(dom->ctrl_comp, partid, &cfg);
		if (err)
			return err;

		partid = resctrl_get_config_index(closid, CDP_DATA);
		return mpam_apply_config(dom->ctrl_comp, partid, &cfg);
	}

	return mpam_apply_config(dom->ctrl_comp, partid, &cfg);
}

int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
	int err;
	struct rdt_ctrl_domain *d;

	lockdep_assert_cpus_held();
	lockdep_assert_irqs_enabled();

	if (!mpam_is_enabled())
		return -EINVAL;

	list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) {
		for (enum resctrl_conf_type t = 0; t < CDP_NUM_TYPES; t++) {
			struct resctrl_staged_config *cfg = &d->staged_config[t];

			if (!cfg->have_new_ctrl)
				continue;

			err = resctrl_arch_update_one(r, d, closid, t,
						      cfg->new_ctrl);
			if (err)
				return err;
		}
	}

	return 0;
}

void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
{
	struct mpam_resctrl_res *res;

	lockdep_assert_cpus_held();

	if (!mpam_is_enabled())
		return;

	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
	mpam_reset_class_locked(res->class);
}

static void mpam_resctrl_domain_hdr_init(int cpu, struct mpam_component *comp,
					 enum resctrl_res_level rid,
					 struct rdt_domain_hdr *hdr)
{
	lockdep_assert_cpus_held();

	INIT_LIST_HEAD(&hdr->list);
	hdr->id = mpam_resctrl_pick_domain_id(cpu, comp);
	hdr->rid = rid;
	cpumask_set_cpu(cpu, &hdr->cpu_mask);
}

static void mpam_resctrl_online_domain_hdr(unsigned int cpu,
					   struct rdt_domain_hdr *hdr)
{
	lockdep_assert_cpus_held();

	cpumask_set_cpu(cpu, &hdr->cpu_mask);
}

/**
 * mpam_resctrl_offline_domain_hdr() - Update the domain header to remove a CPU.
 * @cpu: The CPU to remove from the domain.
 * @hdr: The domain's header.
 *
 * Removes @cpu from the header mask. If this was the last CPU in the domain,
 * the domain header is removed from its parent list and true is returned,
 * indicating the parent structure can be freed.
 * If there are other CPUs in the domain, returns false.
 */
static bool mpam_resctrl_offline_domain_hdr(unsigned int cpu,
					    struct rdt_domain_hdr *hdr)
{
	lockdep_assert_held(&domain_list_lock);

	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
	if (cpumask_empty(&hdr->cpu_mask)) {
		list_del_rcu(&hdr->list);
		synchronize_rcu();
		return true;
	}

	return false;
}

static void mpam_resctrl_domain_insert(struct list_head *list,
				       struct rdt_domain_hdr *new)
{
	struct rdt_domain_hdr *err;
	struct list_head *pos = NULL;

	lockdep_assert_held(&domain_list_lock);

	err = resctrl_find_domain(list, new->id, &pos);
	if (WARN_ON_ONCE(err))
		return;

	list_add_tail_rcu(&new->list, pos);
}

static struct mpam_component *find_component(struct mpam_class *class, int cpu)
{
	struct mpam_component *comp;

	guard(srcu)(&mpam_srcu);
	list_for_each_entry_srcu(comp, &class->components, class_list,
				 srcu_read_lock_held(&mpam_srcu)) {
		if (cpumask_test_cpu(cpu, &comp->affinity))
			return comp;
	}

	return NULL;
}

static struct mpam_resctrl_dom *
mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
{
	int err;
	struct mpam_resctrl_dom *dom;
	struct
	    rdt_l3_mon_domain *mon_d;
	struct rdt_ctrl_domain *ctrl_d;
	struct mpam_class *class = res->class;
	struct mpam_component *comp_iter, *ctrl_comp;
	struct rdt_resource *r = &res->resctrl_res;

	lockdep_assert_held(&domain_list_lock);

	ctrl_comp = NULL;
	guard(srcu)(&mpam_srcu);
	list_for_each_entry_srcu(comp_iter, &class->components, class_list,
				 srcu_read_lock_held(&mpam_srcu)) {
		if (cpumask_test_cpu(cpu, &comp_iter->affinity)) {
			ctrl_comp = comp_iter;
			break;
		}
	}

	/* class has no component for this CPU */
	if (WARN_ON_ONCE(!ctrl_comp))
		return ERR_PTR(-EINVAL);

	dom = kzalloc_node(sizeof(*dom), GFP_KERNEL, cpu_to_node(cpu));
	if (!dom)
		return ERR_PTR(-ENOMEM);

	if (r->alloc_capable) {
		dom->ctrl_comp = ctrl_comp;

		ctrl_d = &dom->resctrl_ctrl_dom;
		mpam_resctrl_domain_hdr_init(cpu, ctrl_comp, r->rid, &ctrl_d->hdr);
		ctrl_d->hdr.type = RESCTRL_CTRL_DOMAIN;
		err = resctrl_online_ctrl_domain(r, ctrl_d);
		if (err)
			goto free_domain;

		mpam_resctrl_domain_insert(&r->ctrl_domains, &ctrl_d->hdr);
	} else {
		pr_debug("Skipped control domain online - no controls\n");
	}

	if (r->mon_capable) {
		struct mpam_component *any_mon_comp = NULL;
		struct mpam_resctrl_mon *mon;
		enum resctrl_event_id eventid;

		/*
		 * Even if the monitor domain is backed by a different
		 * component, the L3 component IDs need to be used... only
		 * there may be no ctrl_comp for the L3.
		 * Search each event's class list for a component with
		 * overlapping CPUs and set up the dom->mon_comp array.
1412 + */ 1413 + 1414 + for_each_mpam_resctrl_mon(mon, eventid) { 1415 + struct mpam_component *mon_comp; 1416 + 1417 + if (!mon->class) 1418 + continue; // dummy resource 1419 + 1420 + mon_comp = find_component(mon->class, cpu); 1421 + dom->mon_comp[eventid] = mon_comp; 1422 + if (mon_comp) 1423 + any_mon_comp = mon_comp; 1424 + } 1425 + if (!any_mon_comp) { 1426 + WARN_ON_ONCE(0); 1427 + err = -EFAULT; 1428 + goto offline_ctrl_domain; 1429 + } 1430 + 1431 + mon_d = &dom->resctrl_mon_dom; 1432 + mpam_resctrl_domain_hdr_init(cpu, any_mon_comp, r->rid, &mon_d->hdr); 1433 + mon_d->hdr.type = RESCTRL_MON_DOMAIN; 1434 + err = resctrl_online_mon_domain(r, &mon_d->hdr); 1435 + if (err) 1436 + goto offline_ctrl_domain; 1437 + 1438 + mpam_resctrl_domain_insert(&r->mon_domains, &mon_d->hdr); 1439 + } else { 1440 + pr_debug("Skipped monitor domain online - no monitors\n"); 1441 + } 1442 + 1443 + return dom; 1444 + 1445 + offline_ctrl_domain: 1446 + if (r->alloc_capable) { 1447 + mpam_resctrl_offline_domain_hdr(cpu, &ctrl_d->hdr); 1448 + resctrl_offline_ctrl_domain(r, ctrl_d); 1449 + } 1450 + free_domain: 1451 + kfree(dom); 1452 + dom = ERR_PTR(err); 1453 + 1454 + return dom; 1455 + } 1456 + 1457 + /* 1458 + * We know all the monitors are associated with the L3, even if there are no 1459 + * controls and therefore no control component. Find the cache-id for the CPU 1460 + * and use that to search for existing resctrl domains. 1461 + * This relies on mpam_resctrl_pick_domain_id() using the L3 cache-id 1462 + * for anything that is not a cache. 
1463 + */ 1464 + static struct mpam_resctrl_dom *mpam_resctrl_get_mon_domain_from_cpu(int cpu) 1465 + { 1466 + int cache_id; 1467 + struct mpam_resctrl_dom *dom; 1468 + struct mpam_resctrl_res *l3 = &mpam_resctrl_controls[RDT_RESOURCE_L3]; 1469 + 1470 + lockdep_assert_cpus_held(); 1471 + 1472 + if (!l3->class) 1473 + return NULL; 1474 + cache_id = get_cpu_cacheinfo_id(cpu, 3); 1475 + if (cache_id < 0) 1476 + return NULL; 1477 + 1478 + list_for_each_entry_rcu(dom, &l3->resctrl_res.mon_domains, resctrl_mon_dom.hdr.list) { 1479 + if (dom->resctrl_mon_dom.hdr.id == cache_id) 1480 + return dom; 1481 + } 1482 + 1483 + return NULL; 1484 + } 1485 + 1486 + static struct mpam_resctrl_dom * 1487 + mpam_resctrl_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res) 1488 + { 1489 + struct mpam_resctrl_dom *dom; 1490 + struct rdt_resource *r = &res->resctrl_res; 1491 + 1492 + lockdep_assert_cpus_held(); 1493 + 1494 + list_for_each_entry_rcu(dom, &r->ctrl_domains, resctrl_ctrl_dom.hdr.list) { 1495 + if (cpumask_test_cpu(cpu, &dom->ctrl_comp->affinity)) 1496 + return dom; 1497 + } 1498 + 1499 + if (r->rid != RDT_RESOURCE_L3) 1500 + return NULL; 1501 + 1502 + /* Search the mon domain list too - needed on monitor only platforms. 
*/ 1503 + return mpam_resctrl_get_mon_domain_from_cpu(cpu); 1504 + } 1505 + 1506 + int mpam_resctrl_online_cpu(unsigned int cpu) 1507 + { 1508 + struct mpam_resctrl_res *res; 1509 + enum resctrl_res_level rid; 1510 + 1511 + guard(mutex)(&domain_list_lock); 1512 + for_each_mpam_resctrl_control(res, rid) { 1513 + struct mpam_resctrl_dom *dom; 1514 + struct rdt_resource *r = &res->resctrl_res; 1515 + 1516 + if (!res->class) 1517 + continue; // dummy_resource; 1518 + 1519 + dom = mpam_resctrl_get_domain_from_cpu(cpu, res); 1520 + if (!dom) { 1521 + dom = mpam_resctrl_alloc_domain(cpu, res); 1522 + if (IS_ERR(dom)) 1523 + return PTR_ERR(dom); 1524 + } else { 1525 + if (r->alloc_capable) { 1526 + struct rdt_ctrl_domain *ctrl_d = &dom->resctrl_ctrl_dom; 1527 + 1528 + mpam_resctrl_online_domain_hdr(cpu, &ctrl_d->hdr); 1529 + } 1530 + if (r->mon_capable) { 1531 + struct rdt_l3_mon_domain *mon_d = &dom->resctrl_mon_dom; 1532 + 1533 + mpam_resctrl_online_domain_hdr(cpu, &mon_d->hdr); 1534 + } 1535 + } 1536 + } 1537 + 1538 + resctrl_online_cpu(cpu); 1539 + 1540 + return 0; 1541 + } 1542 + 1543 + void mpam_resctrl_offline_cpu(unsigned int cpu) 1544 + { 1545 + struct mpam_resctrl_res *res; 1546 + enum resctrl_res_level rid; 1547 + 1548 + resctrl_offline_cpu(cpu); 1549 + 1550 + guard(mutex)(&domain_list_lock); 1551 + for_each_mpam_resctrl_control(res, rid) { 1552 + struct mpam_resctrl_dom *dom; 1553 + struct rdt_l3_mon_domain *mon_d; 1554 + struct rdt_ctrl_domain *ctrl_d; 1555 + bool ctrl_dom_empty, mon_dom_empty; 1556 + struct rdt_resource *r = &res->resctrl_res; 1557 + 1558 + if (!res->class) 1559 + continue; // dummy resource 1560 + 1561 + dom = mpam_resctrl_get_domain_from_cpu(cpu, res); 1562 + if (WARN_ON_ONCE(!dom)) 1563 + continue; 1564 + 1565 + if (r->alloc_capable) { 1566 + ctrl_d = &dom->resctrl_ctrl_dom; 1567 + ctrl_dom_empty = mpam_resctrl_offline_domain_hdr(cpu, &ctrl_d->hdr); 1568 + if (ctrl_dom_empty) 1569 + resctrl_offline_ctrl_domain(&res->resctrl_res, ctrl_d); 
1570 + } else { 1571 + ctrl_dom_empty = true; 1572 + } 1573 + 1574 + if (r->mon_capable) { 1575 + mon_d = &dom->resctrl_mon_dom; 1576 + mon_dom_empty = mpam_resctrl_offline_domain_hdr(cpu, &mon_d->hdr); 1577 + if (mon_dom_empty) 1578 + resctrl_offline_mon_domain(&res->resctrl_res, &mon_d->hdr); 1579 + } else { 1580 + mon_dom_empty = true; 1581 + } 1582 + 1583 + if (ctrl_dom_empty && mon_dom_empty) 1584 + kfree(dom); 1585 + } 1586 + } 1587 + 1588 + int mpam_resctrl_setup(void) 1589 + { 1590 + int err = 0; 1591 + struct mpam_resctrl_res *res; 1592 + enum resctrl_res_level rid; 1593 + struct mpam_resctrl_mon *mon; 1594 + enum resctrl_event_id eventid; 1595 + 1596 + wait_event(wait_cacheinfo_ready, cacheinfo_ready); 1597 + 1598 + cpus_read_lock(); 1599 + for_each_mpam_resctrl_control(res, rid) { 1600 + INIT_LIST_HEAD_RCU(&res->resctrl_res.ctrl_domains); 1601 + INIT_LIST_HEAD_RCU(&res->resctrl_res.mon_domains); 1602 + res->resctrl_res.rid = rid; 1603 + } 1604 + 1605 + /* Find some classes to use for controls */ 1606 + mpam_resctrl_pick_caches(); 1607 + mpam_resctrl_pick_mba(); 1608 + 1609 + /* Initialise the resctrl structures from the classes */ 1610 + for_each_mpam_resctrl_control(res, rid) { 1611 + if (!res->class) 1612 + continue; // dummy resource 1613 + 1614 + err = mpam_resctrl_control_init(res); 1615 + if (err) { 1616 + pr_debug("Failed to initialise rid %u\n", rid); 1617 + goto internal_error; 1618 + } 1619 + } 1620 + 1621 + /* Find some classes to use for monitors */ 1622 + mpam_resctrl_pick_counters(); 1623 + 1624 + for_each_mpam_resctrl_mon(mon, eventid) { 1625 + if (!mon->class) 1626 + continue; // dummy resource 1627 + 1628 + err = mpam_resctrl_monitor_init(mon, eventid); 1629 + if (err) { 1630 + pr_debug("Failed to initialise event %u\n", eventid); 1631 + goto internal_error; 1632 + } 1633 + } 1634 + 1635 + cpus_read_unlock(); 1636 + 1637 + if (!resctrl_arch_alloc_capable() && !resctrl_arch_mon_capable()) { 1638 + pr_debug("No alloc(%u) or monitor(%u) 
found - resctrl not supported\n", 1639 + resctrl_arch_alloc_capable(), resctrl_arch_mon_capable()); 1640 + return -EOPNOTSUPP; 1641 + } 1642 + 1643 + err = resctrl_init(); 1644 + if (err) 1645 + return err; 1646 + 1647 + WRITE_ONCE(resctrl_enabled, true); 1648 + 1649 + return 0; 1650 + 1651 + internal_error: 1652 + cpus_read_unlock(); 1653 + pr_debug("Internal error %d - resctrl not supported\n", err); 1654 + return err; 1655 + } 1656 + 1657 + void mpam_resctrl_exit(void) 1658 + { 1659 + if (!READ_ONCE(resctrl_enabled)) 1660 + return; 1661 + 1662 + WRITE_ONCE(resctrl_enabled, false); 1663 + resctrl_exit(); 1664 + } 1665 + 1666 + /* 1667 + * The driver is detaching an MSC from this class, if resctrl was using it, 1668 + * pull on resctrl_exit(). 1669 + */ 1670 + void mpam_resctrl_teardown_class(struct mpam_class *class) 1671 + { 1672 + struct mpam_resctrl_res *res; 1673 + enum resctrl_res_level rid; 1674 + struct mpam_resctrl_mon *mon; 1675 + enum resctrl_event_id eventid; 1676 + 1677 + might_sleep(); 1678 + 1679 + for_each_mpam_resctrl_control(res, rid) { 1680 + if (res->class == class) { 1681 + res->class = NULL; 1682 + break; 1683 + } 1684 + } 1685 + for_each_mpam_resctrl_mon(mon, eventid) { 1686 + if (mon->class == class) { 1687 + mon->class = NULL; 1688 + break; 1689 + } 1690 + } 1691 + } 1692 + 1693 + static int __init __cacheinfo_ready(void) 1694 + { 1695 + cacheinfo_ready = true; 1696 + wake_up(&wait_cacheinfo_ready); 1697 + 1698 + return 0; 1699 + } 1700 + device_initcall_sync(__cacheinfo_ready); 1701 + 1702 + #ifdef CONFIG_MPAM_KUNIT_TEST 1703 + #include "test_mpam_resctrl.c" 1704 + #endif
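The online/offline helpers above implement a simple lifecycle: a domain's cpu_mask doubles as a reference count, and the caller frees the domain once the last CPU has left. A minimal userspace sketch of that pattern — toy types only, with a plain `uint64_t` bitmask standing in for the kernel's cpumask; none of these names are the driver's API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the kernel's cpumask-based rdt_domain_hdr. */
struct toy_domain {
	uint64_t cpu_mask;	/* one bit per CPU, like hdr->cpu_mask */
};

static void toy_online_cpu(struct toy_domain *d, unsigned int cpu)
{
	d->cpu_mask |= 1ULL << cpu;		/* cpumask_set_cpu() */
}

/* Returns true when the last CPU leaves: the caller may free the domain. */
static bool toy_offline_cpu(struct toy_domain *d, unsigned int cpu)
{
	d->cpu_mask &= ~(1ULL << cpu);		/* cpumask_clear_cpu() */
	return d->cpu_mask == 0;		/* cpumask_empty() */
}
```

The kernel version additionally unlinks the header from an RCU list and calls synchronize_rcu() before reporting the domain as freeable.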
+315
drivers/resctrl/test_mpam_resctrl.c
···
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2025 Arm Ltd.
+/* This file is intended to be included into mpam_resctrl.c */
+
+#include <kunit/test.h>
+#include <linux/array_size.h>
+#include <linux/bits.h>
+#include <linux/math.h>
+#include <linux/sprintf.h>
+
+struct percent_value_case {
+	u8 pc;
+	u8 width;
+	u16 value;
+};
+
+/*
+ * Mysterious inscriptions taken from the union of ARM DDI 0598D.b,
+ * "Arm Architecture Reference Manual Supplement - Memory System
+ * Resource Partitioning and Monitoring (MPAM), for A-profile
+ * architecture", Section 9.8, "About the fixed-point fractional
+ * format" (exact percentage entries only) and ARM IHI 0099B.a,
+ * "MPAM System Component Specification", Section 9.3,
+ * "The fixed-point fractional format":
+ */
+static const struct percent_value_case percent_value_cases[] = {
+	/* Architectural cases: */
+	{   1,  8, 1 },    {   1, 12, 0x27 },  {   1, 16, 0x28e },
+	{  25,  8, 0x3f }, {  25, 12, 0x3ff }, {  25, 16, 0x3fff },
+	{  33,  8, 0x53 }, {  33, 12, 0x546 }, {  33, 16, 0x5479 },
+	{  35,  8, 0x58 }, {  35, 12, 0x598 }, {  35, 16, 0x5998 },
+	{  45,  8, 0x72 }, {  45, 12, 0x732 }, {  45, 16, 0x7332 },
+	{  50,  8, 0x7f }, {  50, 12, 0x7ff }, {  50, 16, 0x7fff },
+	{  52,  8, 0x84 }, {  52, 12, 0x850 }, {  52, 16, 0x851d },
+	{  55,  8, 0x8b }, {  55, 12, 0x8cb }, {  55, 16, 0x8ccb },
+	{  58,  8, 0x93 }, {  58, 12, 0x946 }, {  58, 16, 0x9479 },
+	{  75,  8, 0xbf }, {  75, 12, 0xbff }, {  75, 16, 0xbfff },
+	{  80,  8, 0xcb }, {  80, 12, 0xccb }, {  80, 16, 0xcccb },
+	{  88,  8, 0xe0 }, {  88, 12, 0xe13 }, {  88, 16, 0xe146 },
+	{  95,  8, 0xf2 }, {  95, 12, 0xf32 }, {  95, 16, 0xf332 },
+	{ 100,  8, 0xff }, { 100, 12, 0xfff }, { 100, 16, 0xffff },
+};
+
+static void test_percent_value_desc(const struct percent_value_case *param,
+				    char *desc)
+{
+	snprintf(desc, KUNIT_PARAM_DESC_SIZE,
+		 "pc=%d, width=%d, value=0x%.*x",
+		 param->pc, param->width,
+		 DIV_ROUND_UP(param->width, 4), param->value);
+}
+
+KUNIT_ARRAY_PARAM(test_percent_value, percent_value_cases,
+		  test_percent_value_desc);
+
+struct percent_value_test_info {
+	u32 pc;			/* result of value-to-percent conversion */
+	u32 value;		/* result of percent-to-value conversion */
+	u32 max_value;		/* maximum raw value allowed by test params */
+	unsigned int shift;	/* promotes raw testcase value to 16 bits */
+};
+
+/*
+ * Convert a reference percentage to a fixed-point MAX value and
+ * vice-versa, based on param (not test->param_value!)
+ */
+static void __prepare_percent_value_test(struct kunit *test,
+					 struct percent_value_test_info *res,
+					 const struct percent_value_case *param)
+{
+	struct mpam_props fake_props = { };
+
+	/* Reject bogus test parameters that would break the tests: */
+	KUNIT_ASSERT_GE(test, param->width, 1);
+	KUNIT_ASSERT_LE(test, param->width, 16);
+	KUNIT_ASSERT_LT(test, param->value, 1 << param->width);
+
+	mpam_set_feature(mpam_feat_mbw_max, &fake_props);
+	fake_props.bwa_wd = param->width;
+
+	res->shift = 16 - param->width;
+	res->max_value = GENMASK_U32(param->width - 1, 0);
+	res->value = percent_to_mbw_max(param->pc, &fake_props);
+	res->pc = mbw_max_to_percent(param->value << res->shift, &fake_props);
+}
+
+static void test_get_mba_granularity(struct kunit *test)
+{
+	int ret;
+	struct mpam_props fake_props = { };
+
+	/* Use MBW_MAX */
+	mpam_set_feature(mpam_feat_mbw_max, &fake_props);
+
+	fake_props.bwa_wd = 0;
+	KUNIT_EXPECT_FALSE(test, mba_class_use_mbw_max(&fake_props));
+
+	fake_props.bwa_wd = 1;
+	KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props));
+
+	/* Architectural maximum: */
+	fake_props.bwa_wd = 16;
+	KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props));
+
+	/* No usable control... */
+	fake_props.bwa_wd = 0;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 0);
+
+	fake_props.bwa_wd = 1;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 50);	/* DIV_ROUND_UP(100, 1 << 1)% = 50% */
+
+	fake_props.bwa_wd = 2;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 25);	/* DIV_ROUND_UP(100, 1 << 2)% = 25% */
+
+	fake_props.bwa_wd = 3;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 13);	/* DIV_ROUND_UP(100, 1 << 3)% = 13% */
+
+	fake_props.bwa_wd = 6;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 2);	/* DIV_ROUND_UP(100, 1 << 6)% = 2% */
+
+	fake_props.bwa_wd = 7;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 1);	/* DIV_ROUND_UP(100, 1 << 7)% = 1% */
+
+	/* Granularity saturates at 1% */
+	fake_props.bwa_wd = 16;	/* architectural maximum */
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 1);	/* DIV_ROUND_UP(100, 1 << 16)% = 1% */
+}
+
+static void test_mbw_max_to_percent(struct kunit *test)
+{
+	const struct percent_value_case *param = test->param_value;
+	struct percent_value_test_info res;
+
+	/*
+	 * Since the reference values in percent_value_cases[] all
+	 * correspond to exact percentages, round-to-nearest will
+	 * always give the exact percentage back when the MPAM max
+	 * value has precision of 0.5% or finer. (Always true for the
+	 * reference data, since they all specify 8 bits or more of
+	 * precision.)
+	 *
+	 * So, keep it simple and demand an exact match:
+	 */
+	__prepare_percent_value_test(test, &res, param);
+	KUNIT_EXPECT_EQ(test, res.pc, param->pc);
+}
+
+static void test_percent_to_mbw_max(struct kunit *test)
+{
+	const struct percent_value_case *param = test->param_value;
+	struct percent_value_test_info res;
+
+	__prepare_percent_value_test(test, &res, param);
+
+	KUNIT_EXPECT_GE(test, res.value, param->value << res.shift);
+	KUNIT_EXPECT_LE(test, res.value, (param->value + 1) << res.shift);
+	KUNIT_EXPECT_LE(test, res.value, res.max_value << res.shift);
+
+	/* No flexibility allowed for 0% and 100%! */
+
+	if (param->pc == 0)
+		KUNIT_EXPECT_EQ(test, res.value, 0);
+
+	if (param->pc == 100)
+		KUNIT_EXPECT_EQ(test, res.value, res.max_value << res.shift);
+}
+
+static const void *test_all_bwa_wd_gen_params(struct kunit *test, const void *prev,
+					      char *desc)
+{
+	uintptr_t param = (uintptr_t)prev;
+
+	if (param > 15)
+		return NULL;
+
+	param++;
+
+	snprintf(desc, KUNIT_PARAM_DESC_SIZE, "wd=%u", (unsigned int)param);
+
+	return (void *)param;
+}
+
+static unsigned int test_get_bwa_wd(struct kunit *test)
+{
+	uintptr_t param = (uintptr_t)test->param_value;
+
+	KUNIT_ASSERT_GE(test, param, 1);
+	KUNIT_ASSERT_LE(test, param, 16);
+
+	return param;
+}
+
+static void test_mbw_max_to_percent_limits(struct kunit *test)
+{
+	struct mpam_props fake_props = { };
+	u32 max_value;
+
+	mpam_set_feature(mpam_feat_mbw_max, &fake_props);
+	fake_props.bwa_wd = test_get_bwa_wd(test);
+	max_value = GENMASK(15, 16 - fake_props.bwa_wd);
+
+	KUNIT_EXPECT_EQ(test, mbw_max_to_percent(max_value, &fake_props),
+			MAX_MBA_BW);
+	KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props),
+			get_mba_min(&fake_props));
+
+	/*
+	 * Rounding policy dependent 0% sanity-check:
+	 * With round-to-nearest, the minimum mbw_max value really
+	 * should map to 0% if there are at least 200 steps.
+	 * (100 steps may be enough for some other rounding policies.)
+	 */
+	if (fake_props.bwa_wd >= 8)
+		KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props), 0);
+
+	if (fake_props.bwa_wd < 8 &&
+	    mbw_max_to_percent(0, &fake_props) == 0)
+		kunit_warn(test, "wd=%d: Testsuite/driver rounding policy mismatch?",
+			   fake_props.bwa_wd);
+}
+
+/*
+ * Check that converting a percentage to mbw_max and back again (or, as
+ * appropriate, vice-versa) always restores the original value:
+ */
+static void test_percent_max_roundtrip_stability(struct kunit *test)
+{
+	struct mpam_props fake_props = { };
+	unsigned int shift;
+	u32 pc, max, pc2, max2;
+
+	mpam_set_feature(mpam_feat_mbw_max, &fake_props);
+	fake_props.bwa_wd = test_get_bwa_wd(test);
+	shift = 16 - fake_props.bwa_wd;
+
+	/*
+	 * Converting a valid value from the coarser scale to the finer
+	 * scale and back again must yield the original value:
+	 */
+	if (fake_props.bwa_wd >= 7) {
+		/* More than 100 steps: only test exact pc values: */
+		for (pc = get_mba_min(&fake_props); pc <= MAX_MBA_BW; pc++) {
+			max = percent_to_mbw_max(pc, &fake_props);
+			pc2 = mbw_max_to_percent(max, &fake_props);
+			KUNIT_EXPECT_EQ(test, pc2, pc);
+		}
+	} else {
+		/* Fewer than 100 steps: only test exact mbw_max values: */
+		for (max = 0; max < 1 << 16; max += 1 << shift) {
+			pc = mbw_max_to_percent(max, &fake_props);
+			max2 = percent_to_mbw_max(pc, &fake_props);
+			KUNIT_EXPECT_EQ(test, max2, max);
+		}
+	}
+}
+
+static void test_percent_to_max_rounding(struct kunit *test)
+{
+	const struct percent_value_case *param;
+	unsigned int num_rounded_up = 0, total = 0;
+	struct percent_value_test_info res;
+
+	for (param = percent_value_cases, total = 0;
+	     param < &percent_value_cases[ARRAY_SIZE(percent_value_cases)];
+	     param++, total++) {
+		__prepare_percent_value_test(test, &res, param);
+		if (res.value > param->value << res.shift)
+			num_rounded_up++;
+	}
+
+	/*
+	 * The MPAM driver applies a round-to-nearest policy, whereas a
+	 * round-down policy seems to have been applied in the
+	 * reference table from which the test vectors were selected.
+	 *
+	 * For a large and well-distributed suite of test vectors,
+	 * about half should be rounded up and half down compared with
+	 * the reference table. The actual test vectors are few in
+	 * number and probably not very well distributed, however, so
+	 * tolerate a round-up rate of between 1/4 and 3/4 before
+	 * crying foul:
+	 */
+
+	kunit_info(test, "Round-up rate: %u%% (%u/%u)\n",
+		   DIV_ROUND_CLOSEST(num_rounded_up * 100, total),
+		   num_rounded_up, total);
+
+	KUNIT_EXPECT_GE(test, 4 * num_rounded_up, 1 * total);
+	KUNIT_EXPECT_LE(test, 4 * num_rounded_up, 3 * total);
+}
+
+static struct kunit_case mpam_resctrl_test_cases[] = {
+	KUNIT_CASE(test_get_mba_granularity),
+	KUNIT_CASE_PARAM(test_mbw_max_to_percent, test_percent_value_gen_params),
+	KUNIT_CASE_PARAM(test_percent_to_mbw_max, test_percent_value_gen_params),
+	KUNIT_CASE_PARAM(test_mbw_max_to_percent_limits, test_all_bwa_wd_gen_params),
+	KUNIT_CASE(test_percent_to_max_rounding),
+	KUNIT_CASE_PARAM(test_percent_max_roundtrip_stability,
+			 test_all_bwa_wd_gen_params),
+	{}
+};
+
+static struct kunit_suite mpam_resctrl_test_suite = {
+	.name = "mpam_resctrl_test_suite",
+	.test_cases = mpam_resctrl_test_cases,
+};
+
+kunit_test_suites(&mpam_resctrl_test_suite);
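For reference, the architectural table above is consistent with a round-down fixed-point encoding in which a MAX value v of width wd represents the fraction (v + 1) / 2^wd. A hypothetical userspace sketch of that reference conversion follows; `ref_percent_to_value()` is an invented name, and this is not the driver's `percent_to_mbw_max()`, which rounds to nearest:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reconstructed reference (round-down) conversion matching the
 * architectural table: a MAX register value v of width wd encodes
 * the fraction (v + 1) / 2^wd, so the largest v whose fraction does
 * not exceed pc% is floor(pc * 2^wd / 100) - 1.
 */
static uint16_t ref_percent_to_value(unsigned int pc, unsigned int wd)
{
	uint32_t v = (pc * (1u << wd)) / 100;	/* floor(pc% of 2^wd) */

	return v ? v - 1 : 0;			/* (v + 1)/2^wd <= pc% */
}
```

Spot-checking against the table: for pc=25, wd=8 this gives floor(64) - 1 = 63 = 0x3f, and for pc=1, wd=16 it gives floor(655.36) - 1 = 654 = 0x28e, matching the architectural entries.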
+32
include/linux/arm_mpam.h
···
 #define __LINUX_ARM_MPAM_H

 #include <linux/acpi.h>
+#include <linux/resctrl_types.h>
 #include <linux/types.h>

 struct mpam_msc;
···
 	return -EINVAL;
 }
 #endif
+
+bool resctrl_arch_alloc_capable(void);
+bool resctrl_arch_mon_capable(void);
+
+void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid);
+void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid);
+void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid);
+void resctrl_arch_sched_in(struct task_struct *tsk);
+bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid);
+bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid);
+u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid);
+void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid);
+u32 resctrl_arch_system_num_rmid_idx(void);
+
+struct rdt_resource;
+void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, enum resctrl_event_id evtid);
+void resctrl_arch_mon_ctx_free(struct rdt_resource *r, enum resctrl_event_id evtid, void *ctx);
+
+/*
+ * The CPU configuration for MPAM is cheap to write, and is only written if it
+ * has changed. No need for fine-grained enables.
+ */
+static inline void resctrl_arch_enable_mon(void) { }
+static inline void resctrl_arch_disable_mon(void) { }
+static inline void resctrl_arch_enable_alloc(void) { }
+static inline void resctrl_arch_disable_alloc(void) { }
+
+static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
+{
+	return val;
+}

 /**
  * mpam_register_requestor() - Register a requestor with the MPAM driver
+1 -1
include/linux/entry-common.h
···
 {
 	instrumentation_begin();
 	syscall_exit_to_user_mode_work(regs);
-	local_irq_disable_exit_to_user();
+	local_irq_disable();
 	syscall_exit_to_user_mode_prepare(regs);
 	instrumentation_end();
 	exit_to_user_mode();
+202 -54
include/linux/irq-entry-common.h
···
 }

 /**
- * local_irq_enable_exit_to_user - Exit to user variant of local_irq_enable()
- * @ti_work: Cached TIF flags gathered with interrupts disabled
- *
- * Defaults to local_irq_enable(). Can be supplied by architecture specific
- * code.
- */
-static inline void local_irq_enable_exit_to_user(unsigned long ti_work);
-
-#ifndef local_irq_enable_exit_to_user
-static __always_inline void local_irq_enable_exit_to_user(unsigned long ti_work)
-{
-	local_irq_enable();
-}
-#endif
-
-/**
- * local_irq_disable_exit_to_user - Exit to user variant of local_irq_disable()
- *
- * Defaults to local_irq_disable(). Can be supplied by architecture specific
- * code.
- */
-static inline void local_irq_disable_exit_to_user(void);
-
-#ifndef local_irq_disable_exit_to_user
-static __always_inline void local_irq_disable_exit_to_user(void)
-{
-	local_irq_disable();
-}
-#endif
-
-/**
  * arch_exit_to_user_mode_work - Architecture specific TIF work for exit
  * to user mode.
  * @regs: Pointer to current's pt_regs
···
  */
 static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
+	lockdep_assert_irqs_disabled();
+
 	instrumentation_begin();
 	irqentry_exit_to_user_mode_prepare(regs);
 	instrumentation_end();
···
 #endif

 /**
+ * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ *
+ * Conditional reschedule with additional sanity checks.
+ */
+void raw_irqentry_exit_cond_resched(void);
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+#define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
+#define irqentry_exit_cond_resched_dynamic_disabled	NULL
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
+#define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
+#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+void dynamic_irqentry_exit_cond_resched(void);
+#define irqentry_exit_cond_resched()	dynamic_irqentry_exit_cond_resched()
+#endif
+#else /* CONFIG_PREEMPT_DYNAMIC */
+#define irqentry_exit_cond_resched()	raw_irqentry_exit_cond_resched()
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
+/**
+ * irqentry_enter_from_kernel_mode - Establish state before invoking the irq handler
+ * @regs: Pointer to current's pt_regs
+ *
+ * Invoked from architecture specific entry code with interrupts disabled.
+ * Can only be called when the interrupt entry came from kernel mode. The
+ * calling code must be non-instrumentable. When the function returns all
+ * state is correct and the subsequent functions can be instrumented.
+ *
+ * The function establishes state (lockdep, RCU (context tracking), tracing) and
+ * is provided for architectures which require a strict split between entry from
+ * kernel and user mode and therefore cannot use irqentry_enter(), which handles
+ * both entry modes.
+ *
+ * Returns: An opaque object that must be passed to irqentry_exit_to_kernel_mode().
+ */
+static __always_inline irqentry_state_t irqentry_enter_from_kernel_mode(struct pt_regs *regs)
+{
+	irqentry_state_t ret = {
+		.exit_rcu = false,
+	};
+
+	/*
+	 * If this entry hit the idle task, invoke ct_irq_enter() whether
+	 * RCU is watching or not.
+	 *
+	 * Interrupts can nest when the first interrupt invokes softirq
+	 * processing on return, which enables interrupts.
+	 *
+	 * Scheduler ticks in the idle task can mark quiescent state and
+	 * terminate a grace period, if and only if the timer interrupt is
+	 * not nested into another interrupt.
+	 *
+	 * Checking for rcu_is_watching() here would prevent the nesting
+	 * interrupt from invoking ct_irq_enter(). If that nested interrupt is
+	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
+	 * assume that it is the first interrupt and eventually claim
+	 * quiescent state and end grace periods prematurely.
+	 *
+	 * Unconditionally invoke ct_irq_enter() so RCU state stays
+	 * consistent.
+	 *
+	 * TINY_RCU does not support EQS, so let the compiler eliminate
+	 * this part when enabled.
+	 */
+	if (!IS_ENABLED(CONFIG_TINY_RCU) &&
+	    (is_idle_task(current) || arch_in_rcu_eqs())) {
+		/*
+		 * If RCU is not watching then the same careful
+		 * sequence vs. lockdep and tracing is required
+		 * as in irqentry_enter_from_user_mode().
+		 */
+		lockdep_hardirqs_off(CALLER_ADDR0);
+		ct_irq_enter();
+		instrumentation_begin();
+		kmsan_unpoison_entry_regs(regs);
+		trace_hardirqs_off_finish();
+		instrumentation_end();
+
+		ret.exit_rcu = true;
+		return ret;
+	}
+
+	/*
+	 * If RCU is watching then RCU only wants to check whether it needs
+	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
+	 * already contains a warning when RCU is not watching, so no point
+	 * in having another one here.
+	 */
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	instrumentation_begin();
+	kmsan_unpoison_entry_regs(regs);
+	rcu_irq_enter_check_tick();
+	trace_hardirqs_off_finish();
+	instrumentation_end();
+
+	return ret;
+}
+
+/**
+ * irqentry_exit_to_kernel_mode_preempt - Run preempt checks on return to kernel mode
+ * @regs: Pointer to current's pt_regs
+ * @state: Return value from matching call to irqentry_enter_from_kernel_mode()
+ *
+ * This is to be invoked before irqentry_exit_to_kernel_mode_after_preempt() to
+ * allow kernel preemption on return from interrupt.
+ *
+ * Must be invoked with interrupts disabled and CPU state which allows kernel
+ * preemption.
+ *
+ * After returning from this function, the caller can modify CPU state before
+ * invoking irqentry_exit_to_kernel_mode_after_preempt(), which is required to
+ * re-establish the tracing, lockdep and RCU state for returning to the
+ * interrupted context.
+ */
+static inline void irqentry_exit_to_kernel_mode_preempt(struct pt_regs *regs,
+							irqentry_state_t state)
+{
+	if (regs_irqs_disabled(regs) || state.exit_rcu)
+		return;
+
+	if (IS_ENABLED(CONFIG_PREEMPTION))
+		irqentry_exit_cond_resched();
+}
+
+/**
+ * irqentry_exit_to_kernel_mode_after_preempt - Establish trace, lockdep and RCU state
+ * @regs: Pointer to current's pt_regs
+ * @state: Return value from matching call to irqentry_enter_from_kernel_mode()
+ *
+ * This is to be invoked after irqentry_exit_to_kernel_mode_preempt() and before
+ * actually returning to the interrupted context.
+ *
+ * There are no requirements for the CPU state other than being able to complete
+ * the tracing, lockdep and RCU state transitions. After this function returns
+ * the caller must return directly to the interrupted context.
+ */
+static __always_inline void
+irqentry_exit_to_kernel_mode_after_preempt(struct pt_regs *regs, irqentry_state_t state)
+{
+	if (!regs_irqs_disabled(regs)) {
+		/*
+		 * If RCU was not watching on entry this needs to be done
+		 * carefully and needs the same ordering of lockdep/tracing
+		 * and RCU as the return to user mode path.
+		 */
+		if (state.exit_rcu) {
+			instrumentation_begin();
+			/* Tell the tracer that IRET will enable interrupts */
+			trace_hardirqs_on_prepare();
+			lockdep_hardirqs_on_prepare();
+			instrumentation_end();
+			ct_irq_exit();
+			lockdep_hardirqs_on(CALLER_ADDR0);
+			return;
+		}
+
+		instrumentation_begin();
+		/* Covers both tracing and lockdep */
+		trace_hardirqs_on();
+		instrumentation_end();
+	} else {
+		/*
+		 * IRQ flags state is correct already. Just tell RCU if it
+		 * was not watching on entry.
+		 */
+		if (state.exit_rcu)
+			ct_irq_exit();
+	}
+}
+
+/**
+ * irqentry_exit_to_kernel_mode - Run preempt checks and establish state after
+ *				  invoking the interrupt handler
+ * @regs: Pointer to current's pt_regs
+ * @state: Return value from matching call to irqentry_enter_from_kernel_mode()
+ *
+ * This is the counterpart of irqentry_enter_from_kernel_mode() and combines
+ * the calls to irqentry_exit_to_kernel_mode_preempt() and
+ * irqentry_exit_to_kernel_mode_after_preempt().
+ *
+ * The requirement for the CPU state is that it can schedule. After the function
+ * returns the tracing, lockdep and RCU state transitions are completed and the
+ * caller must return directly to the interrupted context.
+ */
+static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
+							 irqentry_state_t state)
+{
+	lockdep_assert_irqs_disabled();
+
+	instrumentation_begin();
+	irqentry_exit_to_kernel_mode_preempt(regs, state);
+	instrumentation_end();
+
+	irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+}
+
+/**
  * irqentry_enter - Handle state tracking on ordinary interrupt entries
  * @regs: Pointer to pt_regs of interrupted context
  *
···
  * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
  * would not be possible.
  *
- * Returns: An opaque object that must be passed to idtentry_exit()
+ * Returns: An opaque object that must be passed to irqentry_exit()
  */
 irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
-
-/**
- * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
- *
- * Conditional reschedule with additional sanity checks.
- */
-void raw_irqentry_exit_cond_resched(void);
-
-#ifdef CONFIG_PREEMPT_DYNAMIC
-#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
-#define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
-#define irqentry_exit_cond_resched_dynamic_disabled	NULL
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
-#define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
-#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
-void dynamic_irqentry_exit_cond_resched(void);
-#define irqentry_exit_cond_resched()	dynamic_irqentry_exit_cond_resched()
-#endif
-#else /* CONFIG_PREEMPT_DYNAMIC */
-#define irqentry_exit_cond_resched()	raw_irqentry_exit_cond_resched()
-#endif /* CONFIG_PREEMPT_DYNAMIC */

 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
+10 -97
kernel/entry/common.c
···
 	 */
 	while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {

-		local_irq_enable_exit_to_user(ti_work);
+		local_irq_enable();

 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
 			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
···
 	 * might have changed while interrupts and preemption was
 	 * enabled above.
 	 */
-	local_irq_disable_exit_to_user();
+	local_irq_disable();

 	/* Check if any of the above work has queued a deferred wakeup */
 	tick_nohz_user_enter_prepare();
···

 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 {
-	irqentry_state_t ret = {
-		.exit_rcu = false,
-	};
-
 	if (user_mode(regs)) {
+		irqentry_state_t ret = {
+			.exit_rcu = false,
+		};
+
 		irqentry_enter_from_user_mode(regs);
 		return ret;
 	}

-	/*
-	 * If this entry hit the idle task invoke ct_irq_enter() whether
-	 * RCU is watching or not.
-	 *
-	 * Interrupts can nest when the first interrupt invokes softirq
-	 * processing on return which enables interrupts.
-	 *
-	 * Scheduler ticks in the idle task can mark quiescent state and
-	 * terminate a grace period, if and only if the timer interrupt is
-	 * not nested into another interrupt.
-	 *
-	 * Checking for rcu_is_watching() here would prevent the nesting
-	 * interrupt to invoke ct_irq_enter(). If that nested interrupt is
-	 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
-	 * assume that it is the first interrupt and eventually claim
-	 * quiescent state and end grace periods prematurely.
-	 *
-	 * Unconditionally invoke ct_irq_enter() so RCU state stays
-	 * consistent.
-	 *
-	 * TINY_RCU does not support EQS, so let the compiler eliminate
-	 * this part when enabled.
-	 */
-	if (!IS_ENABLED(CONFIG_TINY_RCU) &&
-	    (is_idle_task(current) || arch_in_rcu_eqs())) {
-		/*
-		 * If RCU is not watching then the same careful
-		 * sequence vs. lockdep and tracing is required
-		 * as in irqentry_enter_from_user_mode().
-		 */
-		lockdep_hardirqs_off(CALLER_ADDR0);
-		ct_irq_enter();
-		instrumentation_begin();
-		kmsan_unpoison_entry_regs(regs);
-		trace_hardirqs_off_finish();
-		instrumentation_end();
-
-		ret.exit_rcu = true;
-		return ret;
-	}
-
-	/*
-	 * If RCU is watching then RCU only wants to check whether it needs
-	 * to restart the tick in NOHZ mode. rcu_irq_enter_check_tick()
-	 * already contains a warning when RCU is not watching, so no point
-	 * in having another one here.
-	 */
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	instrumentation_begin();
-	kmsan_unpoison_entry_regs(regs);
-	rcu_irq_enter_check_tick();
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-
-	return ret;
+	return irqentry_enter_from_kernel_mode(regs);
 }

 /**
···

 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
-	lockdep_assert_irqs_disabled();
-
-	/* Check whether this returns to user mode */
-	if (user_mode(regs)) {
+	if (user_mode(regs))
 		irqentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
-		/*
-		 * If RCU was not watching on entry this needs to be done
-		 * carefully and needs the same ordering of lockdep/tracing
-		 * and RCU as the return to user mode path.
-		 */
-		if (state.exit_rcu) {
-			instrumentation_begin();
-			/* Tell the tracer that IRET will enable interrupts */
-			trace_hardirqs_on_prepare();
-			lockdep_hardirqs_on_prepare();
-			instrumentation_end();
-			ct_irq_exit();
-			lockdep_hardirqs_on(CALLER_ADDR0);
-			return;
-		}
-
-		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION))
-			irqentry_exit_cond_resched();
-
-		/* Covers both tracing and lockdep */
-		trace_hardirqs_on();
-		instrumentation_end();
-	} else {
-		/*
-		 * IRQ flags state is correct already. Just tell RCU if it
-		 * was not watching on entry.
-		 */
-		if (state.exit_rcu)
-			ct_irq_exit();
-	}
+	else
+		irqentry_exit_to_kernel_mode(regs, state);
 }

 irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+2 -1
tools/testing/selftests/arm64/abi/hwcap.c
···

 static void cmpbr_sigill(void)
 {
-	/* Not implemented, too complicated and unreliable anyway */
+	asm volatile(".inst 0x74C00040\n"	/* CBEQ w0, w0, +8 */
+		     "udf #0" : : : "cc");	/* UDF #0 */
 }

 static void crc32_sigill(void)
+1
tools/testing/selftests/kvm/arm64/set_id_regs.c
···

 static const struct reg_ftr_bits ftr_id_aa64isar3_el1[] = {
 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, FPRCVT, 0),
+	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, LSUI, 0),
 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, LSFE, 0),
 	REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64ISAR3_EL1, FAMINMAX, 0),
 	REG_FTR_END,