Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'kvmarm-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.2

- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.

- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.

- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on.

- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.

- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
not-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.

- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the upcoming support for 4kB and 16kB pages.

- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
stage-2 faults and access tracking. You name it, we got it, we
probably broke it.

- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.

As a side effect, this tag also drags:

- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
series

- A shared branch with the arm64 tree that repaints all the system
registers to match the ARM ARM's naming, resulting in some
interesting conflicts

+6095 -1788
+31 -10
Documentation/virt/kvm/api.rst
··· 7418 7418 tags as appropriate if the VM is migrated. 7419 7419 7420 7420 When this capability is enabled all memory in memslots must be mapped as 7421 - not-shareable (no MAP_SHARED), attempts to create a memslot with a 7422 - MAP_SHARED mmap will result in an -EINVAL return. 7421 + ``MAP_ANONYMOUS`` or with a RAM-based file mapping (``tmpfs``, ``memfd``), 7422 + attempts to create a memslot with an invalid mmap will result in an 7423 + -EINVAL return. 7423 7424 7424 7425 When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to 7425 7426 perform a bulk copy of tags to/from the guest. ··· 7955 7954 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL 7956 7955 ---------------------------------------------------------- 7957 7956 7958 - :Architectures: x86 7957 + :Architectures: x86, arm64 7959 7958 :Parameters: args[0] - size of the dirty log ring 7960 7959 7961 7960 KVM is capable of tracking dirty memory using ring buffers that are ··· 8037 8036 needs to kick the vcpu out of KVM_RUN using a signal. The resulting 8038 8037 vmexit ensures that all dirty GFNs are flushed to the dirty rings. 8039 8038 8040 - NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding 8041 - ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls 8042 - KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling 8043 - KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual 8044 - machine will switch to ring-buffer dirty page tracking and further 8045 - KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail. 
8046 - 8047 8039 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that 8048 8040 should be exposed by weakly ordered architecture, in order to indicate 8049 8041 the additional memory ordering requirements imposed on userspace when ··· 8044 8050 Architecture with TSO-like ordering (such as x86) are allowed to 8045 8051 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL 8046 8052 to userspace. 8053 + 8054 + After enabling the dirty rings, the userspace needs to detect the 8055 + capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the 8056 + ring structures can be backed by per-slot bitmaps. With this capability 8057 + advertised, it means the architecture can dirty guest pages without 8058 + vcpu/ring context, so that some of the dirty information will still be 8059 + maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 8060 + can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL 8061 + hasn't been enabled, or any memslot has been existing. 8062 + 8063 + Note that the bitmap here is only a backup of the ring structure. The 8064 + use of the ring and bitmap combination is only beneficial if there is 8065 + only a very small amount of memory that is dirtied out of vcpu/ring 8066 + context. Otherwise, the stand-alone per-slot bitmap mechanism needs to 8067 + be considered. 8068 + 8069 + To collect dirty bits in the backup bitmap, userspace can use the same 8070 + KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all 8071 + the generation of the dirty bits is done in a single pass. Collecting 8072 + the dirty bitmap should be the very last thing that the VMM does before 8073 + considering the state as complete. VMM needs to ensure that the dirty 8074 + state is final and avoid missing dirty pages from another ioctl ordered 8075 + after the bitmap collection. 
8076 + 8077 + NOTE: One example of using the backup bitmap is saving arm64 vgic/its 8078 + tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on 8079 + KVM device "kvm-arm-vgic-its" when dirty ring is enabled. 8047 8080 8048 8081 8.30 KVM_CAP_XEN_HVM 8049 8082 --------------------
+8 -6
Documentation/virt/kvm/arm/pvtime.rst
··· 23 23 ARCH_FEATURES mechanism before calling it. 24 24 25 25 PV_TIME_FEATURES 26 - ============= ======== ========== 26 + 27 + ============= ======== ================================================= 27 28 Function ID: (uint32) 0xC5000020 28 29 PV_call_id: (uint32) The function to query for support. 29 30 Currently only PV_TIME_ST is supported. 30 31 Return value: (int64) NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant 31 32 PV-time feature is supported by the hypervisor. 32 - ============= ======== ========== 33 + ============= ======== ================================================= 33 34 34 35 PV_TIME_ST 35 - ============= ======== ========== 36 + 37 + ============= ======== ============================================== 36 38 Function ID: (uint32) 0xC5000021 37 39 Return value: (int64) IPA of the stolen time data structure for this 38 40 VCPU. On failure: 39 41 NOT_SUPPORTED (-1) 40 - ============= ======== ========== 42 + ============= ======== ============================================== 41 43 42 44 The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory 43 45 with inner and outer write back caching attributes, in the inner shareable ··· 78 76 these structures and not used for other purposes, this enables the guest to map 79 77 the region using 64k pages and avoids conflicting attributes with other memory. 80 78 81 - For the user space interface see Documentation/virt/kvm/devices/vcpu.rst 82 - section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL". 79 + For the user space interface see 80 + :ref:`Documentation/virt/kvm/devices/vcpu.rst <kvm_arm_vcpu_pvtime_ctrl>`.
+4 -1
Documentation/virt/kvm/devices/arm-vgic-its.rst
··· 52 52 53 53 KVM_DEV_ARM_ITS_SAVE_TABLES 54 54 save the ITS table data into guest RAM, at the location provisioned 55 - by the guest in corresponding registers/table entries. 55 + by the guest in corresponding registers/table entries. Should userspace 56 + require a form of dirty tracking to identify which pages are modified 57 + by the saving process, it should use a bitmap even if using another 58 + mechanism to track the memory dirtied by the vCPUs. 56 59 57 60 The layout of the tables in guest memory defines an ABI. The entries 58 61 are laid out in little endian format as described in the last paragraph.
+2
Documentation/virt/kvm/devices/vcpu.rst
··· 171 171 numbers on at least one VCPU after creating all VCPUs and before running any 172 172 VCPUs. 173 173 174 + .. _kvm_arm_vcpu_pvtime_ctrl: 175 + 174 176 3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL 175 177 ================================== 176 178
+1
arch/arm64/Kconfig
··· 1965 1965 depends on ARM64_PAN 1966 1966 select ARCH_HAS_SUBPAGE_FAULTS 1967 1967 select ARCH_USES_HIGH_VMA_FLAGS 1968 + select ARCH_USES_PG_ARCH_X 1968 1969 help 1969 1970 Memory Tagging (part of the ARMv8.5 Extensions) provides 1970 1971 architectural support for run-time, always-on detection of
+6 -2
arch/arm64/include/asm/kvm_arm.h
··· 135 135 * 40 bits wide (T0SZ = 24). Systems with a PARange smaller than 40 bits are 136 136 * not known to exist and will break with this configuration. 137 137 * 138 - * The VTCR_EL2 is configured per VM and is initialised in kvm_arm_setup_stage2(). 138 + * The VTCR_EL2 is configured per VM and is initialised in kvm_init_stage2_mmu. 139 139 * 140 140 * Note that when using 4K pages, we concatenate two first level page tables 141 141 * together. With 16K pages, we concatenate 16 first level page tables. ··· 340 340 * We have 341 341 * PAR [PA_Shift - 1 : 12] = PA [PA_Shift - 1 : 12] 342 342 * HPFAR [PA_Shift - 9 : 4] = FIPA [PA_Shift - 1 : 12] 343 + * 344 + * Always assume 52 bit PA since at this point, we don't know how many PA bits 345 + * the page table has been set up for. This should be safe since unused address 346 + * bits in PAR are res0. 343 347 */ 344 348 #define PAR_TO_HPFAR(par) \ 345 - (((par) & GENMASK_ULL(PHYS_MASK_SHIFT - 1, 12)) >> 8) 349 + (((par) & GENMASK_ULL(52 - 1, 12)) >> 8) 346 350 347 351 #define ECN(x) { ESR_ELx_EC_##x, #x } 348 352
+5 -2
arch/arm64/include/asm/kvm_asm.h
··· 76 76 __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs, 77 77 __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs, 78 78 __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_init_traps, 79 + __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, 80 + __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu, 81 + __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm, 79 82 }; 80 83 81 84 #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] ··· 109 106 #define per_cpu_ptr_nvhe_sym(sym, cpu) \ 110 107 ({ \ 111 108 unsigned long base, off; \ 112 - base = kvm_arm_hyp_percpu_base[cpu]; \ 109 + base = kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu]; \ 113 110 off = (unsigned long)&CHOOSE_NVHE_SYM(sym) - \ 114 111 (unsigned long)&CHOOSE_NVHE_SYM(__per_cpu_start); \ 115 112 base ? (typeof(CHOOSE_NVHE_SYM(sym))*)(base + off) : NULL; \ ··· 214 211 #define __kvm_hyp_init CHOOSE_NVHE_SYM(__kvm_hyp_init) 215 212 #define __kvm_hyp_vector CHOOSE_HYP_SYM(__kvm_hyp_vector) 216 213 217 - extern unsigned long kvm_arm_hyp_percpu_base[NR_CPUS]; 214 + extern unsigned long kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[]; 218 215 DECLARE_KVM_NVHE_SYM(__per_cpu_start); 219 216 DECLARE_KVM_NVHE_SYM(__per_cpu_end); 220 217
+74 -2
arch/arm64/include/asm/kvm_host.h
··· 73 73 int kvm_reset_vcpu(struct kvm_vcpu *vcpu); 74 74 void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu); 75 75 76 + struct kvm_hyp_memcache { 77 + phys_addr_t head; 78 + unsigned long nr_pages; 79 + }; 80 + 81 + static inline void push_hyp_memcache(struct kvm_hyp_memcache *mc, 82 + phys_addr_t *p, 83 + phys_addr_t (*to_pa)(void *virt)) 84 + { 85 + *p = mc->head; 86 + mc->head = to_pa(p); 87 + mc->nr_pages++; 88 + } 89 + 90 + static inline void *pop_hyp_memcache(struct kvm_hyp_memcache *mc, 91 + void *(*to_va)(phys_addr_t phys)) 92 + { 93 + phys_addr_t *p = to_va(mc->head); 94 + 95 + if (!mc->nr_pages) 96 + return NULL; 97 + 98 + mc->head = *p; 99 + mc->nr_pages--; 100 + 101 + return p; 102 + } 103 + 104 + static inline int __topup_hyp_memcache(struct kvm_hyp_memcache *mc, 105 + unsigned long min_pages, 106 + void *(*alloc_fn)(void *arg), 107 + phys_addr_t (*to_pa)(void *virt), 108 + void *arg) 109 + { 110 + while (mc->nr_pages < min_pages) { 111 + phys_addr_t *p = alloc_fn(arg); 112 + 113 + if (!p) 114 + return -ENOMEM; 115 + push_hyp_memcache(mc, p, to_pa); 116 + } 117 + 118 + return 0; 119 + } 120 + 121 + static inline void __free_hyp_memcache(struct kvm_hyp_memcache *mc, 122 + void (*free_fn)(void *virt, void *arg), 123 + void *(*to_va)(phys_addr_t phys), 124 + void *arg) 125 + { 126 + while (mc->nr_pages) 127 + free_fn(pop_hyp_memcache(mc, to_va), arg); 128 + } 129 + 130 + void free_hyp_memcache(struct kvm_hyp_memcache *mc); 131 + int topup_hyp_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages); 132 + 76 133 struct kvm_vmid { 77 134 atomic64_t id; 78 135 }; ··· 170 113 unsigned long std_bmap; 171 114 unsigned long std_hyp_bmap; 172 115 unsigned long vendor_hyp_bmap; 116 + }; 117 + 118 + typedef unsigned int pkvm_handle_t; 119 + 120 + struct kvm_protected_vm { 121 + pkvm_handle_t handle; 122 + struct kvm_hyp_memcache teardown_mc; 173 123 }; 174 124 175 125 struct kvm_arch { ··· 227 163 228 164 u8 pfr0_csv2; 229 165 u8 pfr0_csv3; 166 + struct { 
167 + u8 imp:4; 168 + u8 unimp:4; 169 + } dfr0_pmuver; 230 170 231 171 /* Hypercall features firmware registers' descriptor */ 232 172 struct kvm_smccc_features smccc_feat; 173 + 174 + /* 175 + * For an untrusted host VM, 'pkvm.handle' is used to lookup 176 + * the associated pKVM instance in the hypervisor. 177 + */ 178 + struct kvm_protected_vm pkvm; 233 179 }; 234 180 235 181 struct kvm_vcpu_fault_info { ··· 988 914 989 915 #define __KVM_HAVE_ARCH_VM_ALLOC 990 916 struct kvm *kvm_arch_alloc_vm(void); 991 - 992 - int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type); 993 917 994 918 static inline bool kvm_vm_is_protected(struct kvm *kvm) 995 919 {
+3
arch/arm64/include/asm/kvm_hyp.h
··· 123 123 extern u64 kvm_nvhe_sym(id_aa64mmfr1_el1_sys_val); 124 124 extern u64 kvm_nvhe_sym(id_aa64mmfr2_el1_sys_val); 125 125 126 + extern unsigned long kvm_nvhe_sym(__icache_flags); 127 + extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits); 128 + 126 129 #endif /* __ARM64_KVM_HYP_H__ */
+1 -1
arch/arm64/include/asm/kvm_mmu.h
··· 166 166 void free_hyp_pgds(void); 167 167 168 168 void stage2_unmap_vm(struct kvm *kvm); 169 - int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu); 169 + int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type); 170 170 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu); 171 171 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, 172 172 phys_addr_t pa, unsigned long size, bool writable);
+154 -35
arch/arm64/include/asm/kvm_pgtable.h
··· 42 42 #define KVM_PTE_ADDR_MASK GENMASK(47, PAGE_SHIFT) 43 43 #define KVM_PTE_ADDR_51_48 GENMASK(15, 12) 44 44 45 + #define KVM_PHYS_INVALID (-1ULL) 46 + 45 47 static inline bool kvm_pte_valid(kvm_pte_t pte) 46 48 { 47 49 return pte & KVM_PTE_VALID; ··· 57 55 pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48; 58 56 59 57 return pa; 58 + } 59 + 60 + static inline kvm_pte_t kvm_phys_to_pte(u64 pa) 61 + { 62 + kvm_pte_t pte = pa & KVM_PTE_ADDR_MASK; 63 + 64 + if (PAGE_SHIFT == 16) { 65 + pa &= GENMASK(51, 48); 66 + pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48); 67 + } 68 + 69 + return pte; 60 70 } 61 71 62 72 static inline u64 kvm_granule_shift(u32 level) ··· 99 85 * allocation is physically contiguous. 100 86 * @free_pages_exact: Free an exact number of memory pages previously 101 87 * allocated by zalloc_pages_exact. 88 + * @free_removed_table: Free a removed paging structure by unlinking and 89 + * dropping references. 102 90 * @get_page: Increment the refcount on a page. 103 91 * @put_page: Decrement the refcount on a page. When the 104 92 * refcount reaches 0 the page is automatically ··· 119 103 void* (*zalloc_page)(void *arg); 120 104 void* (*zalloc_pages_exact)(size_t size); 121 105 void (*free_pages_exact)(void *addr, size_t size); 106 + void (*free_removed_table)(void *addr, u32 level); 122 107 void (*get_page)(void *addr); 123 108 void (*put_page)(void *addr); 124 109 int (*page_count)(void *addr); ··· 179 162 enum kvm_pgtable_prot prot); 180 163 181 164 /** 165 + * enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk. 166 + * @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid 167 + * entries. 168 + * @KVM_PGTABLE_WALK_TABLE_PRE: Visit table entries before their 169 + * children. 170 + * @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their 171 + * children. 172 + * @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared 173 + * with other software walkers. 
174 + */ 175 + enum kvm_pgtable_walk_flags { 176 + KVM_PGTABLE_WALK_LEAF = BIT(0), 177 + KVM_PGTABLE_WALK_TABLE_PRE = BIT(1), 178 + KVM_PGTABLE_WALK_TABLE_POST = BIT(2), 179 + KVM_PGTABLE_WALK_SHARED = BIT(3), 180 + }; 181 + 182 + struct kvm_pgtable_visit_ctx { 183 + kvm_pte_t *ptep; 184 + kvm_pte_t old; 185 + void *arg; 186 + struct kvm_pgtable_mm_ops *mm_ops; 187 + u64 addr; 188 + u64 end; 189 + u32 level; 190 + enum kvm_pgtable_walk_flags flags; 191 + }; 192 + 193 + typedef int (*kvm_pgtable_visitor_fn_t)(const struct kvm_pgtable_visit_ctx *ctx, 194 + enum kvm_pgtable_walk_flags visit); 195 + 196 + static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *ctx) 197 + { 198 + return ctx->flags & KVM_PGTABLE_WALK_SHARED; 199 + } 200 + 201 + /** 202 + * struct kvm_pgtable_walker - Hook into a page-table walk. 203 + * @cb: Callback function to invoke during the walk. 204 + * @arg: Argument passed to the callback function. 205 + * @flags: Bitwise-OR of flags to identify the entry types on which to 206 + * invoke the callback function. 207 + */ 208 + struct kvm_pgtable_walker { 209 + const kvm_pgtable_visitor_fn_t cb; 210 + void * const arg; 211 + const enum kvm_pgtable_walk_flags flags; 212 + }; 213 + 214 + /* 215 + * RCU cannot be used in a non-kernel context such as the hyp. As such, page 216 + * table walkers used in hyp do not call into RCU and instead use other 217 + * synchronization mechanisms (such as a spinlock). 218 + */ 219 + #if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__) 220 + 221 + typedef kvm_pte_t *kvm_pteref_t; 222 + 223 + static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker, 224 + kvm_pteref_t pteref) 225 + { 226 + return pteref; 227 + } 228 + 229 + static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker) 230 + { 231 + /* 232 + * Due to the lack of RCU (or a similar protection scheme), only 233 + * non-shared table walkers are allowed in the hypervisor. 
234 + */ 235 + if (walker->flags & KVM_PGTABLE_WALK_SHARED) 236 + return -EPERM; 237 + 238 + return 0; 239 + } 240 + 241 + static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker) {} 242 + 243 + static inline bool kvm_pgtable_walk_lock_held(void) 244 + { 245 + return true; 246 + } 247 + 248 + #else 249 + 250 + typedef kvm_pte_t __rcu *kvm_pteref_t; 251 + 252 + static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker, 253 + kvm_pteref_t pteref) 254 + { 255 + return rcu_dereference_check(pteref, !(walker->flags & KVM_PGTABLE_WALK_SHARED)); 256 + } 257 + 258 + static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker) 259 + { 260 + if (walker->flags & KVM_PGTABLE_WALK_SHARED) 261 + rcu_read_lock(); 262 + 263 + return 0; 264 + } 265 + 266 + static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker) 267 + { 268 + if (walker->flags & KVM_PGTABLE_WALK_SHARED) 269 + rcu_read_unlock(); 270 + } 271 + 272 + static inline bool kvm_pgtable_walk_lock_held(void) 273 + { 274 + return rcu_read_lock_held(); 275 + } 276 + 277 + #endif 278 + 279 + /** 182 280 * struct kvm_pgtable - KVM page-table. 183 281 * @ia_bits: Maximum input address size, in bits. 184 282 * @start_level: Level at which the page-table walk starts. ··· 307 175 struct kvm_pgtable { 308 176 u32 ia_bits; 309 177 u32 start_level; 310 - kvm_pte_t *pgd; 178 + kvm_pteref_t pgd; 311 179 struct kvm_pgtable_mm_ops *mm_ops; 312 180 313 181 /* Stage-2 only */ 314 182 struct kvm_s2_mmu *mmu; 315 183 enum kvm_pgtable_stage2_flags flags; 316 184 kvm_pgtable_force_pte_cb_t force_pte_cb; 317 - }; 318 - 319 - /** 320 - * enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk. 321 - * @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid 322 - * entries. 323 - * @KVM_PGTABLE_WALK_TABLE_PRE: Visit table entries before their 324 - * children. 325 - * @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their 326 - * children. 
327 - */ 328 - enum kvm_pgtable_walk_flags { 329 - KVM_PGTABLE_WALK_LEAF = BIT(0), 330 - KVM_PGTABLE_WALK_TABLE_PRE = BIT(1), 331 - KVM_PGTABLE_WALK_TABLE_POST = BIT(2), 332 - }; 333 - 334 - typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level, 335 - kvm_pte_t *ptep, 336 - enum kvm_pgtable_walk_flags flag, 337 - void * const arg); 338 - 339 - /** 340 - * struct kvm_pgtable_walker - Hook into a page-table walk. 341 - * @cb: Callback function to invoke during the walk. 342 - * @arg: Argument passed to the callback function. 343 - * @flags: Bitwise-OR of flags to identify the entry types on which to 344 - * invoke the callback function. 345 - */ 346 - struct kvm_pgtable_walker { 347 - const kvm_pgtable_visitor_fn_t cb; 348 - void * const arg; 349 - const enum kvm_pgtable_walk_flags flags; 350 185 }; 351 186 352 187 /** ··· 396 297 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift); 397 298 398 299 /** 300 + * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD 301 + * @vtcr: Content of the VTCR register. 302 + * 303 + * Return: the size (in bytes) of the stage-2 PGD 304 + */ 305 + size_t kvm_pgtable_stage2_pgd_size(u64 vtcr); 306 + 307 + /** 399 308 * __kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table. 400 309 * @pgt: Uninitialised page-table structure to initialise. 401 310 * @mmu: S2 MMU context for this S2 translation ··· 432 325 void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt); 433 326 434 327 /** 328 + * kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure. 329 + * @mm_ops: Memory management callbacks. 330 + * @pgtable: Unlinked stage-2 paging structure to be freed. 331 + * @level: Level of the stage-2 paging structure to be freed. 332 + * 333 + * The page-table is assumed to be unreachable by any hardware walkers prior to 334 + * freeing and therefore no TLB invalidation is performed. 
335 + */ 336 + void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level); 337 + 338 + /** 435 339 * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table. 436 340 * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). 437 341 * @addr: Intermediate physical address at which to place the mapping. ··· 451 333 * @prot: Permissions and attributes for the mapping. 452 334 * @mc: Cache of pre-allocated and zeroed memory from which to allocate 453 335 * page-table pages. 336 + * @flags: Flags to control the page-table walk (ex. a shared walk) 454 337 * 455 338 * The offset of @addr within a page is ignored, @size is rounded-up to 456 339 * the next page boundary and @phys is rounded-down to the previous page ··· 473 354 */ 474 355 int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, 475 356 u64 phys, enum kvm_pgtable_prot prot, 476 - void *mc); 357 + void *mc, enum kvm_pgtable_walk_flags flags); 477 358 478 359 /** 479 360 * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to
+38
arch/arm64/include/asm/kvm_pkvm.h
··· 9 9 #include <linux/memblock.h> 10 10 #include <asm/kvm_pgtable.h> 11 11 12 + /* Maximum number of VMs that can co-exist under pKVM. */ 13 + #define KVM_MAX_PVMS 255 14 + 12 15 #define HYP_MEMBLOCK_REGIONS 128 16 + 17 + int pkvm_init_host_vm(struct kvm *kvm); 18 + int pkvm_create_hyp_vm(struct kvm *kvm); 19 + void pkvm_destroy_hyp_vm(struct kvm *kvm); 13 20 14 21 extern struct memblock_region kvm_nvhe_sym(hyp_memory)[]; 15 22 extern unsigned int kvm_nvhe_sym(hyp_memblock_nr); 23 + 24 + static inline unsigned long 25 + hyp_vmemmap_memblock_size(struct memblock_region *reg, size_t vmemmap_entry_size) 26 + { 27 + unsigned long nr_pages = reg->size >> PAGE_SHIFT; 28 + unsigned long start, end; 29 + 30 + start = (reg->base >> PAGE_SHIFT) * vmemmap_entry_size; 31 + end = start + nr_pages * vmemmap_entry_size; 32 + start = ALIGN_DOWN(start, PAGE_SIZE); 33 + end = ALIGN(end, PAGE_SIZE); 34 + 35 + return end - start; 36 + } 37 + 38 + static inline unsigned long hyp_vmemmap_pages(size_t vmemmap_entry_size) 39 + { 40 + unsigned long res = 0, i; 41 + 42 + for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) { 43 + res += hyp_vmemmap_memblock_size(&kvm_nvhe_sym(hyp_memory)[i], 44 + vmemmap_entry_size); 45 + } 46 + 47 + return res >> PAGE_SHIFT; 48 + } 49 + 50 + static inline unsigned long hyp_vm_table_pages(void) 51 + { 52 + return PAGE_ALIGN(KVM_MAX_PVMS * sizeof(void *)) >> PAGE_SHIFT; 53 + } 16 54 17 55 static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages) 18 56 {
+64 -1
arch/arm64/include/asm/mte.h
··· 25 25 unsigned long n); 26 26 int mte_save_tags(struct page *page); 27 27 void mte_save_page_tags(const void *page_addr, void *tag_storage); 28 - bool mte_restore_tags(swp_entry_t entry, struct page *page); 28 + void mte_restore_tags(swp_entry_t entry, struct page *page); 29 29 void mte_restore_page_tags(void *page_addr, const void *tag_storage); 30 30 void mte_invalidate_tags(int type, pgoff_t offset); 31 31 void mte_invalidate_tags_area(int type); ··· 36 36 37 37 /* track which pages have valid allocation tags */ 38 38 #define PG_mte_tagged PG_arch_2 39 + /* simple lock to avoid multiple threads tagging the same page */ 40 + #define PG_mte_lock PG_arch_3 41 + 42 + static inline void set_page_mte_tagged(struct page *page) 43 + { 44 + /* 45 + * Ensure that the tags written prior to this function are visible 46 + * before the page flags update. 47 + */ 48 + smp_wmb(); 49 + set_bit(PG_mte_tagged, &page->flags); 50 + } 51 + 52 + static inline bool page_mte_tagged(struct page *page) 53 + { 54 + bool ret = test_bit(PG_mte_tagged, &page->flags); 55 + 56 + /* 57 + * If the page is tagged, ensure ordering with a likely subsequent 58 + * read of the tags. 59 + */ 60 + if (ret) 61 + smp_rmb(); 62 + return ret; 63 + } 64 + 65 + /* 66 + * Lock the page for tagging and return 'true' if the page can be tagged, 67 + * 'false' if already tagged. PG_mte_tagged is never cleared and therefore the 68 + * locking only happens once for page initialisation. 69 + * 70 + * The page MTE lock state: 71 + * 72 + * Locked: PG_mte_lock && !PG_mte_tagged 73 + * Unlocked: !PG_mte_lock || PG_mte_tagged 74 + * 75 + * Acquire semantics only if the page is tagged (returning 'false'). 76 + */ 77 + static inline bool try_page_mte_tagging(struct page *page) 78 + { 79 + if (!test_and_set_bit(PG_mte_lock, &page->flags)) 80 + return true; 81 + 82 + /* 83 + * The tags are either being initialised or may have been initialised 84 + * already. 
Check if the PG_mte_tagged flag has been set or wait 85 + * otherwise. 86 + */ 87 + smp_cond_load_acquire(&page->flags, VAL & (1UL << PG_mte_tagged)); 88 + 89 + return false; 90 + } 39 91 40 92 void mte_zero_clear_page_tags(void *addr); 41 93 void mte_sync_tags(pte_t old_pte, pte_t pte); ··· 108 56 /* unused if !CONFIG_ARM64_MTE, silence the compiler */ 109 57 #define PG_mte_tagged 0 110 58 59 + static inline void set_page_mte_tagged(struct page *page) 60 + { 61 + } 62 + static inline bool page_mte_tagged(struct page *page) 63 + { 64 + return false; 65 + } 66 + static inline bool try_page_mte_tagging(struct page *page) 67 + { 68 + return false; 69 + } 111 70 static inline void mte_zero_clear_page_tags(void *addr) 112 71 { 113 72 }
+2 -2
arch/arm64/include/asm/pgtable.h
··· 1049 1049 #define __HAVE_ARCH_SWAP_RESTORE 1050 1050 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) 1051 1051 { 1052 - if (system_supports_mte() && mte_restore_tags(entry, &folio->page)) 1053 - set_bit(PG_mte_tagged, &folio->flags); 1052 + if (system_supports_mte()) 1053 + mte_restore_tags(entry, &folio->page); 1054 1054 } 1055 1055 1056 1056 #endif /* CONFIG_ARM64_MTE */
-138
arch/arm64/include/asm/sysreg.h
··· 165 165 #define SYS_MPIDR_EL1 sys_reg(3, 0, 0, 0, 5) 166 166 #define SYS_REVIDR_EL1 sys_reg(3, 0, 0, 0, 6) 167 167 168 - #define SYS_ID_PFR0_EL1 sys_reg(3, 0, 0, 1, 0) 169 - #define SYS_ID_PFR1_EL1 sys_reg(3, 0, 0, 1, 1) 170 - #define SYS_ID_PFR2_EL1 sys_reg(3, 0, 0, 3, 4) 171 - #define SYS_ID_DFR0_EL1 sys_reg(3, 0, 0, 1, 2) 172 - #define SYS_ID_DFR1_EL1 sys_reg(3, 0, 0, 3, 5) 173 - #define SYS_ID_AFR0_EL1 sys_reg(3, 0, 0, 1, 3) 174 - #define SYS_ID_MMFR0_EL1 sys_reg(3, 0, 0, 1, 4) 175 - #define SYS_ID_MMFR1_EL1 sys_reg(3, 0, 0, 1, 5) 176 - #define SYS_ID_MMFR2_EL1 sys_reg(3, 0, 0, 1, 6) 177 - #define SYS_ID_MMFR3_EL1 sys_reg(3, 0, 0, 1, 7) 178 - #define SYS_ID_MMFR4_EL1 sys_reg(3, 0, 0, 2, 6) 179 - #define SYS_ID_MMFR5_EL1 sys_reg(3, 0, 0, 3, 6) 180 - 181 - #define SYS_ID_ISAR0_EL1 sys_reg(3, 0, 0, 2, 0) 182 - #define SYS_ID_ISAR1_EL1 sys_reg(3, 0, 0, 2, 1) 183 - #define SYS_ID_ISAR2_EL1 sys_reg(3, 0, 0, 2, 2) 184 - #define SYS_ID_ISAR3_EL1 sys_reg(3, 0, 0, 2, 3) 185 - #define SYS_ID_ISAR4_EL1 sys_reg(3, 0, 0, 2, 4) 186 - #define SYS_ID_ISAR5_EL1 sys_reg(3, 0, 0, 2, 5) 187 - #define SYS_ID_ISAR6_EL1 sys_reg(3, 0, 0, 2, 7) 188 - 189 - #define SYS_MVFR0_EL1 sys_reg(3, 0, 0, 3, 0) 190 - #define SYS_MVFR1_EL1 sys_reg(3, 0, 0, 3, 1) 191 - #define SYS_MVFR2_EL1 sys_reg(3, 0, 0, 3, 2) 192 - 193 168 #define SYS_ACTLR_EL1 sys_reg(3, 0, 1, 0, 1) 194 169 #define SYS_RGSR_EL1 sys_reg(3, 0, 1, 0, 5) 195 170 #define SYS_GCR_EL1 sys_reg(3, 0, 1, 0, 6) ··· 667 692 #define ID_AA64MMFR0_EL1_PARANGE_MAX ID_AA64MMFR0_EL1_PARANGE_48 668 693 #endif 669 694 670 - #define ID_DFR0_PERFMON_SHIFT 24 671 - 672 - #define ID_DFR0_PERFMON_8_0 0x3 673 - #define ID_DFR0_PERFMON_8_1 0x4 674 - #define ID_DFR0_PERFMON_8_4 0x5 675 - #define ID_DFR0_PERFMON_8_5 0x6 676 - 677 - #define ID_ISAR4_SWP_FRAC_SHIFT 28 678 - #define ID_ISAR4_PSR_M_SHIFT 24 679 - #define ID_ISAR4_SYNCH_PRIM_FRAC_SHIFT 20 680 - #define ID_ISAR4_BARRIER_SHIFT 16 681 - #define ID_ISAR4_SMC_SHIFT 12 682 - #define 
ID_ISAR4_WRITEBACK_SHIFT 8 683 - #define ID_ISAR4_WITHSHIFTS_SHIFT 4 684 - #define ID_ISAR4_UNPRIV_SHIFT 0 685 - 686 - #define ID_DFR1_MTPMU_SHIFT 0 687 - 688 - #define ID_ISAR0_DIVIDE_SHIFT 24 689 - #define ID_ISAR0_DEBUG_SHIFT 20 690 - #define ID_ISAR0_COPROC_SHIFT 16 691 - #define ID_ISAR0_CMPBRANCH_SHIFT 12 692 - #define ID_ISAR0_BITFIELD_SHIFT 8 693 - #define ID_ISAR0_BITCOUNT_SHIFT 4 694 - #define ID_ISAR0_SWAP_SHIFT 0 695 - 696 - #define ID_ISAR5_RDM_SHIFT 24 697 - #define ID_ISAR5_CRC32_SHIFT 16 698 - #define ID_ISAR5_SHA2_SHIFT 12 699 - #define ID_ISAR5_SHA1_SHIFT 8 700 - #define ID_ISAR5_AES_SHIFT 4 701 - #define ID_ISAR5_SEVL_SHIFT 0 702 - 703 - #define ID_ISAR6_I8MM_SHIFT 24 704 - #define ID_ISAR6_BF16_SHIFT 20 705 - #define ID_ISAR6_SPECRES_SHIFT 16 706 - #define ID_ISAR6_SB_SHIFT 12 707 - #define ID_ISAR6_FHM_SHIFT 8 708 - #define ID_ISAR6_DP_SHIFT 4 709 - #define ID_ISAR6_JSCVT_SHIFT 0 710 - 711 - #define ID_MMFR0_INNERSHR_SHIFT 28 712 - #define ID_MMFR0_FCSE_SHIFT 24 713 - #define ID_MMFR0_AUXREG_SHIFT 20 714 - #define ID_MMFR0_TCM_SHIFT 16 715 - #define ID_MMFR0_SHARELVL_SHIFT 12 716 - #define ID_MMFR0_OUTERSHR_SHIFT 8 717 - #define ID_MMFR0_PMSA_SHIFT 4 718 - #define ID_MMFR0_VMSA_SHIFT 0 719 - 720 - #define ID_MMFR4_EVT_SHIFT 28 721 - #define ID_MMFR4_CCIDX_SHIFT 24 722 - #define ID_MMFR4_LSM_SHIFT 20 723 - #define ID_MMFR4_HPDS_SHIFT 16 724 - #define ID_MMFR4_CNP_SHIFT 12 725 - #define ID_MMFR4_XNX_SHIFT 8 726 - #define ID_MMFR4_AC2_SHIFT 4 727 - #define ID_MMFR4_SPECSEI_SHIFT 0 728 - 729 - #define ID_MMFR5_ETS_SHIFT 0 730 - 731 - #define ID_PFR0_DIT_SHIFT 24 732 - #define ID_PFR0_CSV2_SHIFT 16 733 - #define ID_PFR0_STATE3_SHIFT 12 734 - #define ID_PFR0_STATE2_SHIFT 8 735 - #define ID_PFR0_STATE1_SHIFT 4 736 - #define ID_PFR0_STATE0_SHIFT 0 737 - 738 - #define ID_DFR0_PERFMON_SHIFT 24 739 - #define ID_DFR0_MPROFDBG_SHIFT 20 740 - #define ID_DFR0_MMAPTRC_SHIFT 16 741 - #define ID_DFR0_COPTRC_SHIFT 12 742 - #define ID_DFR0_MMAPDBG_SHIFT 8 743 - 
#define ID_DFR0_COPSDBG_SHIFT 4
744 - #define ID_DFR0_COPDBG_SHIFT 0
745 - 
746 - #define ID_PFR2_SSBS_SHIFT 4
747 - #define ID_PFR2_CSV3_SHIFT 0
748 - 
749 - #define MVFR0_FPROUND_SHIFT 28
750 - #define MVFR0_FPSHVEC_SHIFT 24
751 - #define MVFR0_FPSQRT_SHIFT 20
752 - #define MVFR0_FPDIVIDE_SHIFT 16
753 - #define MVFR0_FPTRAP_SHIFT 12
754 - #define MVFR0_FPDP_SHIFT 8
755 - #define MVFR0_FPSP_SHIFT 4
756 - #define MVFR0_SIMD_SHIFT 0
757 - 
758 - #define MVFR1_SIMDFMAC_SHIFT 28
759 - #define MVFR1_FPHP_SHIFT 24
760 - #define MVFR1_SIMDHP_SHIFT 20
761 - #define MVFR1_SIMDSP_SHIFT 16
762 - #define MVFR1_SIMDINT_SHIFT 12
763 - #define MVFR1_SIMDLS_SHIFT 8
764 - #define MVFR1_FPDNAN_SHIFT 4
765 - #define MVFR1_FPFTZ_SHIFT 0
766 - 
767 - #define ID_PFR1_GIC_SHIFT 28
768 - #define ID_PFR1_VIRT_FRAC_SHIFT 24
769 - #define ID_PFR1_SEC_FRAC_SHIFT 20
770 - #define ID_PFR1_GENTIMER_SHIFT 16
771 - #define ID_PFR1_VIRTUALIZATION_SHIFT 12
772 - #define ID_PFR1_MPROGMOD_SHIFT 8
773 - #define ID_PFR1_SECURITY_SHIFT 4
774 - #define ID_PFR1_PROGMOD_SHIFT 0
775 - 
776 695 #if defined(CONFIG_ARM64_4K_PAGES)
777 696 #define ID_AA64MMFR0_EL1_TGRAN_SHIFT ID_AA64MMFR0_EL1_TGRAN4_SHIFT
778 697 #define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN
···
683 814 #define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MAX
684 815 #define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT ID_AA64MMFR0_EL1_TGRAN64_2_SHIFT
685 816 #endif
686 - 
687 - #define MVFR2_FPMISC_SHIFT 4
688 - #define MVFR2_SIMDMISC_SHIFT 0
689 817 
690 818 #define CPACR_EL1_FPEN_EL1EN (BIT(20)) /* enable EL1 access */
691 819 #define CPACR_EL1_FPEN_EL0EN (BIT(21)) /* enable EL0 access, if EL1EN set */
···
716 850 #define SYS_RGSR_EL1_TAG_MASK 0xfUL
717 851 #define SYS_RGSR_EL1_SEED_SHIFT 8
718 852 #define SYS_RGSR_EL1_SEED_MASK 0xffffUL
719 - 
720 - /* GMID_EL1 field definitions */
721 - #define GMID_EL1_BS_SHIFT 0
722 - #define GMID_EL1_BS_SIZE 4
723 853 
724 854 /* TFSR{,E0}_EL1 bit definitions */
725 855 #define SYS_TFSR_EL1_TF0_SHIFT 0
+1
arch/arm64/include/uapi/asm/kvm.h
···
43 43 #define __KVM_HAVE_VCPU_EVENTS
44 44 
45 45 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
46 + #define KVM_DIRTY_LOG_PAGE_OFFSET 64
46 47 
47 48 #define KVM_REG_SIZE(id) \
48 49 (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
+107 -105
arch/arm64/kernel/cpufeature.c
···
402 402 };
403 403 
404 404 static const struct arm64_ftr_bits ftr_id_mmfr0[] = {
405 - S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_INNERSHR_SHIFT, 4, 0xf),
406 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_FCSE_SHIFT, 4, 0),
407 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_MMFR0_AUXREG_SHIFT, 4, 0),
408 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_TCM_SHIFT, 4, 0),
409 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_SHARELVL_SHIFT, 4, 0),
410 - S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_OUTERSHR_SHIFT, 4, 0xf),
411 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_PMSA_SHIFT, 4, 0),
412 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_VMSA_SHIFT, 4, 0),
405 + S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_InnerShr_SHIFT, 4, 0xf),
406 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_FCSE_SHIFT, 4, 0),
407 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_AuxReg_SHIFT, 4, 0),
408 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_TCM_SHIFT, 4, 0),
409 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_ShareLvl_SHIFT, 4, 0),
410 + S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_OuterShr_SHIFT, 4, 0xf),
411 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_PMSA_SHIFT, 4, 0),
412 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR0_EL1_VMSA_SHIFT, 4, 0),
413 413 ARM64_FTR_END,
414 414 };
415 415 
···
429 429 };
430 430 
431 431 static const struct arm64_ftr_bits ftr_mvfr0[] = {
432 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPROUND_SHIFT, 4, 0),
433 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPSHVEC_SHIFT, 4, 0),
434 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPSQRT_SHIFT, 4, 0),
435 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPDIVIDE_SHIFT, 4, 0),
436 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPTRAP_SHIFT, 4, 0),
437 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPDP_SHIFT, 4, 0),
438 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_FPSP_SHIFT, 4, 0),
439 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_SIMD_SHIFT, 4, 0),
432 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPRound_SHIFT, 4, 0),
433 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPShVec_SHIFT, 4, 0),
434 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPSqrt_SHIFT, 4, 0),
435 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPDivide_SHIFT, 4, 0),
436 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPTrap_SHIFT, 4, 0),
437 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPDP_SHIFT, 4, 0),
438 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_FPSP_SHIFT, 4, 0),
439 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR0_EL1_SIMDReg_SHIFT, 4, 0),
440 440 ARM64_FTR_END,
441 441 };
442 442 
443 443 static const struct arm64_ftr_bits ftr_mvfr1[] = {
444 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_SIMDFMAC_SHIFT, 4, 0),
445 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_FPHP_SHIFT, 4, 0),
446 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_SIMDHP_SHIFT, 4, 0),
447 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_SIMDSP_SHIFT, 4, 0),
448 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_SIMDINT_SHIFT, 4, 0),
449 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_SIMDLS_SHIFT, 4, 0),
450 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_FPDNAN_SHIFT, 4, 0),
451 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_FPFTZ_SHIFT, 4, 0),
444 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_SIMDFMAC_SHIFT, 4, 0),
445 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_FPHP_SHIFT, 4, 0),
446 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_SIMDHP_SHIFT, 4, 0),
447 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_SIMDSP_SHIFT, 4, 0),
448 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_SIMDInt_SHIFT, 4, 0),
449 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_SIMDLS_SHIFT, 4, 0),
450 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_FPDNaN_SHIFT, 4, 0),
451 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR1_EL1_FPFtZ_SHIFT, 4, 0),
452 452 ARM64_FTR_END,
453 453 };
454 454 
455 455 static const struct arm64_ftr_bits ftr_mvfr2[] = {
456 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR2_FPMISC_SHIFT, 4, 0),
457 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR2_SIMDMISC_SHIFT, 4, 0),
456 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR2_EL1_FPMisc_SHIFT, 4, 0),
457 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MVFR2_EL1_SIMDMisc_SHIFT, 4, 0),
458 458 ARM64_FTR_END,
459 459 };
460 460 
···
470 470 };
471 471 
472 472 static const struct arm64_ftr_bits ftr_id_isar0[] = {
473 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_DIVIDE_SHIFT, 4, 0),
474 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_DEBUG_SHIFT, 4, 0),
475 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_COPROC_SHIFT, 4, 0),
476 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_CMPBRANCH_SHIFT, 4, 0),
477 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_BITFIELD_SHIFT, 4, 0),
478 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_BITCOUNT_SHIFT, 4, 0),
479 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_SWAP_SHIFT, 4, 0),
473 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_Divide_SHIFT, 4, 0),
474 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_Debug_SHIFT, 4, 0),
475 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_Coproc_SHIFT, 4, 0),
476 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_CmpBranch_SHIFT, 4, 0),
477 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_BitField_SHIFT, 4, 0),
478 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_BitCount_SHIFT, 4, 0),
479 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR0_EL1_Swap_SHIFT, 4, 0),
480 480 ARM64_FTR_END,
481 481 };
482 482 
483 483 static const struct arm64_ftr_bits ftr_id_isar5[] = {
484 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_RDM_SHIFT, 4, 0),
485 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_CRC32_SHIFT, 4, 0),
486 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_SHA2_SHIFT, 4, 0),
487 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_SHA1_SHIFT, 4, 0),
488 - ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_AES_SHIFT, 4, 0),
489 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_SEVL_SHIFT, 4, 0),
484 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_EL1_RDM_SHIFT, 4, 0),
485 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_EL1_CRC32_SHIFT, 4, 0),
486 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_EL1_SHA2_SHIFT, 4, 0),
487 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_EL1_SHA1_SHIFT, 4, 0),
488 + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_EL1_AES_SHIFT, 4, 0),
489 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR5_EL1_SEVL_SHIFT, 4, 0),
490 490 ARM64_FTR_END,
491 491 };
492 492 
493 493 static const struct arm64_ftr_bits ftr_id_mmfr4[] = {
494 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EVT_SHIFT, 4, 0),
495 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_CCIDX_SHIFT, 4, 0),
496 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_LSM_SHIFT, 4, 0),
497 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_HPDS_SHIFT, 4, 0),
498 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_CNP_SHIFT, 4, 0),
499 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_XNX_SHIFT, 4, 0),
500 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_AC2_SHIFT, 4, 0),
494 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_EVT_SHIFT, 4, 0),
495 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_CCIDX_SHIFT, 4, 0),
496 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_LSM_SHIFT, 4, 0),
497 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_HPDS_SHIFT, 4, 0),
498 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_CnP_SHIFT, 4, 0),
499 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_XNX_SHIFT, 4, 0),
500 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR4_EL1_AC2_SHIFT, 4, 0),
501 501 
502 502 /*
503 503 * SpecSEI = 1 indicates that the PE might generate an SError on an
···
505 505 * SError might be generated than it will not be. Hence it has been
506 506 * classified as FTR_HIGHER_SAFE.
507 507 */
508 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_HIGHER_SAFE, ID_MMFR4_SPECSEI_SHIFT, 4, 0),
508 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_HIGHER_SAFE, ID_MMFR4_EL1_SpecSEI_SHIFT, 4, 0),
509 509 ARM64_FTR_END,
510 510 };
511 511 
512 512 static const struct arm64_ftr_bits ftr_id_isar4[] = {
513 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_SWP_FRAC_SHIFT, 4, 0),
514 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_PSR_M_SHIFT, 4, 0),
515 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_SYNCH_PRIM_FRAC_SHIFT, 4, 0),
516 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_BARRIER_SHIFT, 4, 0),
517 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_SMC_SHIFT, 4, 0),
518 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_WRITEBACK_SHIFT, 4, 0),
519 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_WITHSHIFTS_SHIFT, 4, 0),
520 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_UNPRIV_SHIFT, 4, 0),
513 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_SWP_frac_SHIFT, 4, 0),
514 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_PSR_M_SHIFT, 4, 0),
515 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_SynchPrim_frac_SHIFT, 4, 0),
516 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_Barrier_SHIFT, 4, 0),
517 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_SMC_SHIFT, 4, 0),
518 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_Writeback_SHIFT, 4, 0),
519 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_WithShifts_SHIFT, 4, 0),
520 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR4_EL1_Unpriv_SHIFT, 4, 0),
521 521 ARM64_FTR_END,
522 522 };
523 523 
524 524 static const struct arm64_ftr_bits ftr_id_mmfr5[] = {
525 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR5_ETS_SHIFT, 4, 0),
525 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_MMFR5_EL1_ETS_SHIFT, 4, 0),
526 526 ARM64_FTR_END,
527 527 };
528 528 
529 529 static const struct arm64_ftr_bits ftr_id_isar6[] = {
530 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_I8MM_SHIFT, 4, 0),
531 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_BF16_SHIFT, 4, 0),
532 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_SPECRES_SHIFT, 4, 0),
533 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_SB_SHIFT, 4, 0),
534 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_FHM_SHIFT, 4, 0),
535 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_DP_SHIFT, 4, 0),
536 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_JSCVT_SHIFT, 4, 0),
530 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_I8MM_SHIFT, 4, 0),
531 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_BF16_SHIFT, 4, 0),
532 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_SPECRES_SHIFT, 4, 0),
533 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_SB_SHIFT, 4, 0),
534 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_FHM_SHIFT, 4, 0),
535 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_DP_SHIFT, 4, 0),
536 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_ISAR6_EL1_JSCVT_SHIFT, 4, 0),
537 537 ARM64_FTR_END,
538 538 };
539 539 
540 540 static const struct arm64_ftr_bits ftr_id_pfr0[] = {
541 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_DIT_SHIFT, 4, 0),
542 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_PFR0_CSV2_SHIFT, 4, 0),
543 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_STATE3_SHIFT, 4, 0),
544 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_STATE2_SHIFT, 4, 0),
545 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_STATE1_SHIFT, 4, 0),
546 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_STATE0_SHIFT, 4, 0),
541 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_EL1_DIT_SHIFT, 4, 0),
542 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_PFR0_EL1_CSV2_SHIFT, 4, 0),
543 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_EL1_State3_SHIFT, 4, 0),
544 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_EL1_State2_SHIFT, 4, 0),
545 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_EL1_State1_SHIFT, 4, 0),
546 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR0_EL1_State0_SHIFT, 4, 0),
547 547 ARM64_FTR_END,
548 548 };
549 549 
550 550 static const struct arm64_ftr_bits ftr_id_pfr1[] = {
551 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_GIC_SHIFT, 4, 0),
552 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_VIRT_FRAC_SHIFT, 4, 0),
553 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_SEC_FRAC_SHIFT, 4, 0),
554 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_GENTIMER_SHIFT, 4, 0),
555 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_VIRTUALIZATION_SHIFT, 4, 0),
556 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_MPROGMOD_SHIFT, 4, 0),
557 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_SECURITY_SHIFT, 4, 0),
558 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_PROGMOD_SHIFT, 4, 0),
551 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_GIC_SHIFT, 4, 0),
552 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_Virt_frac_SHIFT, 4, 0),
553 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_Sec_frac_SHIFT, 4, 0),
554 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_GenTimer_SHIFT, 4, 0),
555 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_Virtualization_SHIFT, 4, 0),
556 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_MProgMod_SHIFT, 4, 0),
557 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_Security_SHIFT, 4, 0),
558 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_PFR1_EL1_ProgMod_SHIFT, 4, 0),
559 559 ARM64_FTR_END,
560 560 };
561 561 
562 562 static const struct arm64_ftr_bits ftr_id_pfr2[] = {
563 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_PFR2_SSBS_SHIFT, 4, 0),
564 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_PFR2_CSV3_SHIFT, 4, 0),
563 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_PFR2_EL1_SSBS_SHIFT, 4, 0),
564 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_PFR2_EL1_CSV3_SHIFT, 4, 0),
565 565 ARM64_FTR_END,
566 566 };
567 567 
568 568 static const struct arm64_ftr_bits ftr_id_dfr0[] = {
569 569 /* [31:28] TraceFilt */
570 - S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_DFR0_PERFMON_SHIFT, 4, 0),
571 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_MPROFDBG_SHIFT, 4, 0),
572 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_MMAPTRC_SHIFT, 4, 0),
573 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_COPTRC_SHIFT, 4, 0),
574 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_MMAPDBG_SHIFT, 4, 0),
575 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_COPSDBG_SHIFT, 4, 0),
576 - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_COPDBG_SHIFT, 4, 0),
570 + S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_DFR0_EL1_PerfMon_SHIFT, 4, 0),
571 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MProfDbg_SHIFT, 4, 0),
572 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MMapTrc_SHIFT, 4, 0),
573 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_CopTrc_SHIFT, 4, 0),
574 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_MMapDbg_SHIFT, 4, 0),
575 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_CopSDbg_SHIFT, 4, 0),
576 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR0_EL1_CopDbg_SHIFT, 4, 0),
577 577 ARM64_FTR_END,
578 578 };
579 579 
580 580 static const struct arm64_ftr_bits ftr_id_dfr1[] = {
581 - S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR1_MTPMU_SHIFT, 4, 0),
581 + S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_DFR1_EL1_MTPMU_SHIFT, 4, 0),
582 582 ARM64_FTR_END,
583 583 };
584 584 
···
1119 1119 * EL1-dependent register fields to avoid spurious sanity check fails.
1120 1120 */
1121 1121 if (!id_aa64pfr0_32bit_el1(pfr0)) {
1122 - relax_cpu_ftr_reg(SYS_ID_ISAR4_EL1, ID_ISAR4_SMC_SHIFT);
1123 - relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_VIRT_FRAC_SHIFT);
1124 - relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_SEC_FRAC_SHIFT);
1125 - relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_VIRTUALIZATION_SHIFT);
1126 - relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_SECURITY_SHIFT);
1127 - relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_PROGMOD_SHIFT);
1122 + relax_cpu_ftr_reg(SYS_ID_ISAR4_EL1, ID_ISAR4_EL1_SMC_SHIFT);
1123 + relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_EL1_Virt_frac_SHIFT);
1124 + relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_EL1_Sec_frac_SHIFT);
1125 + relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_EL1_Virtualization_SHIFT);
1126 + relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_EL1_Security_SHIFT);
1127 + relax_cpu_ftr_reg(SYS_ID_PFR1_EL1, ID_PFR1_EL1_ProgMod_SHIFT);
1128 1128 }
1129 1129 
1130 1130 taint |= check_update_ftr_reg(SYS_ID_DFR0_EL1, cpu,
···
2074 2074 * Clear the tags in the zero page. This needs to be done via the
2075 2075 * linear map which has the Tagged attribute.
2076 2076 */
2077 - if (!test_and_set_bit(PG_mte_tagged, &ZERO_PAGE(0)->flags))
2077 + if (try_page_mte_tagging(ZERO_PAGE(0))) {
2078 2078 mte_clear_page_tags(lm_alias(empty_zero_page));
2079 + set_page_mte_tagged(ZERO_PAGE(0));
2080 + }
2079 2081 
2080 2082 kasan_init_hw_tags_cpu();
2081 2083 }
···
2831 2829 else
2832 2830 mvfr1 = read_sysreg_s(SYS_MVFR1_EL1);
2833 2831 
2834 - return cpuid_feature_extract_unsigned_field(mvfr1, MVFR1_SIMDSP_SHIFT) &&
2835 - cpuid_feature_extract_unsigned_field(mvfr1, MVFR1_SIMDINT_SHIFT) &&
2836 - cpuid_feature_extract_unsigned_field(mvfr1, MVFR1_SIMDLS_SHIFT);
2832 + return cpuid_feature_extract_unsigned_field(mvfr1, MVFR1_EL1_SIMDSP_SHIFT) &&
2833 + cpuid_feature_extract_unsigned_field(mvfr1, MVFR1_EL1_SIMDInt_SHIFT) &&
2834 + cpuid_feature_extract_unsigned_field(mvfr1, MVFR1_EL1_SIMDLS_SHIFT);
2837 2835 }
2838 2836 #endif
2839 2837 
2840 2838 static const struct arm64_cpu_capabilities compat_elf_hwcaps[] = {
2841 2839 #ifdef CONFIG_COMPAT
2842 2840 HWCAP_CAP_MATCH(compat_has_neon, CAP_COMPAT_HWCAP, COMPAT_HWCAP_NEON),
2843 - HWCAP_CAP(SYS_MVFR1_EL1, MVFR1_SIMDFMAC_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP, COMPAT_HWCAP_VFPv4),
2841 + HWCAP_CAP(SYS_MVFR1_EL1, MVFR1_EL1_SIMDFMAC_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP, COMPAT_HWCAP_VFPv4),
2844 2842 /* Arm v8 mandates MVFR0.FPDP == {0, 2}. So, piggy back on this for the presence of VFP support */
2845 - HWCAP_CAP(SYS_MVFR0_EL1, MVFR0_FPDP_SHIFT, 4, FTR_UNSIGNED, 2, CAP_COMPAT_HWCAP, COMPAT_HWCAP_VFP),
2846 - HWCAP_CAP(SYS_MVFR0_EL1, MVFR0_FPDP_SHIFT, 4, FTR_UNSIGNED, 2, CAP_COMPAT_HWCAP, COMPAT_HWCAP_VFPv3),
2847 - HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_AES_SHIFT, 4, FTR_UNSIGNED, 2, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_PMULL),
2848 - HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_AES_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_AES),
2849 - HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_SHA1_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_SHA1),
2850 - HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_SHA2_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_SHA2),
2851 - HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_CRC32_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_CRC32),
2843 + HWCAP_CAP(SYS_MVFR0_EL1, MVFR0_EL1_FPDP_SHIFT, 4, FTR_UNSIGNED, 2, CAP_COMPAT_HWCAP, COMPAT_HWCAP_VFP),
2844 + HWCAP_CAP(SYS_MVFR0_EL1, MVFR0_EL1_FPDP_SHIFT, 4, FTR_UNSIGNED, 2, CAP_COMPAT_HWCAP, COMPAT_HWCAP_VFPv3),
2845 + HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_EL1_AES_SHIFT, 4, FTR_UNSIGNED, 2, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_PMULL),
2846 + HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_EL1_AES_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_AES),
2847 + HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_EL1_SHA1_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_SHA1),
2848 + HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_EL1_SHA2_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_SHA2),
2849 + HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_EL1_CRC32_SHIFT, 4, FTR_UNSIGNED, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_CRC32),
2852 2850 #endif
2853 2851 {},
2854 2852 };
+1 -1
arch/arm64/kernel/elfcore.c
···
47 47 * Pages mapped in user space as !pte_access_permitted() (e.g.
48 48 * PROT_EXEC only) may not have the PG_mte_tagged flag set.
49 49 */
50 - if (!test_bit(PG_mte_tagged, &page->flags)) {
50 + if (!page_mte_tagged(page)) {
51 51 put_page(page);
52 52 dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
53 53 continue;
+1 -1
arch/arm64/kernel/hibernate.c
···
271 271 if (!page)
272 272 continue;
273 273 
274 - if (!test_bit(PG_mte_tagged, &page->flags))
274 + if (!page_mte_tagged(page))
275 275 continue;
276 276 
277 277 ret = save_tags(page, pfn);
-15
arch/arm64/kernel/image-vars.h
···
71 71 /* Vectors installed by hyp-init on reset HVC. */
72 72 KVM_NVHE_ALIAS(__hyp_stub_vectors);
73 73 
74 - /* Kernel symbol used by icache_is_vpipt(). */
75 - KVM_NVHE_ALIAS(__icache_flags);
76 - 
77 - /* VMID bits set by the KVM VMID allocator */
78 - KVM_NVHE_ALIAS(kvm_arm_vmid_bits);
79 - 
80 74 /* Static keys which are set if a vGIC trap should be handled in hyp. */
81 75 KVM_NVHE_ALIAS(vgic_v2_cpuif_trap);
82 76 KVM_NVHE_ALIAS(vgic_v3_cpuif_trap);
···
85 91 /* EL2 exception handling */
86 92 KVM_NVHE_ALIAS(__start___kvm_ex_table);
87 93 KVM_NVHE_ALIAS(__stop___kvm_ex_table);
88 - 
89 - /* Array containing bases of nVHE per-CPU memory regions. */
90 - KVM_NVHE_ALIAS(kvm_arm_hyp_percpu_base);
91 94 
92 95 /* PMU available static key */
93 96 #ifdef CONFIG_HW_PERF_EVENTS
···
101 110 KVM_NVHE_ALIAS_HYP(__memcpy, __pi_memcpy);
102 111 KVM_NVHE_ALIAS_HYP(__memset, __pi_memset);
103 112 #endif
104 - 
105 - /* Kernel memory sections */
106 - KVM_NVHE_ALIAS(__start_rodata);
107 - KVM_NVHE_ALIAS(__end_rodata);
108 - KVM_NVHE_ALIAS(__bss_start);
109 - KVM_NVHE_ALIAS(__bss_stop);
110 113 
111 114 /* Hyp memory sections */
112 115 KVM_NVHE_ALIAS(__hyp_idmap_text_start);
+10 -11
arch/arm64/kernel/mte.c
···
41 41 if (check_swap && is_swap_pte(old_pte)) {
42 42 swp_entry_t entry = pte_to_swp_entry(old_pte);
43 43 
44 - if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
45 - return;
44 + if (!non_swap_entry(entry))
45 + mte_restore_tags(entry, page);
46 46 }
47 47 
48 48 if (!pte_is_tagged)
49 49 return;
50 50 
51 - /*
52 - * Test PG_mte_tagged again in case it was racing with another
53 - * set_pte_at().
54 - */
55 - if (!test_and_set_bit(PG_mte_tagged, &page->flags))
51 + if (try_page_mte_tagging(page)) {
56 52 mte_clear_page_tags(page_address(page));
53 + set_page_mte_tagged(page);
54 + }
57 55 }
58 56 
59 57 void mte_sync_tags(pte_t old_pte, pte_t pte)
···
67 69 
68 70 /* if PG_mte_tagged is set, tags have already been initialised */
69 71 for (i = 0; i < nr_pages; i++, page++) {
70 - if (!test_bit(PG_mte_tagged, &page->flags))
72 + if (!page_mte_tagged(page)) {
71 73 mte_sync_page_tags(page, old_pte, check_swap,
72 74 pte_is_tagged);
75 + set_page_mte_tagged(page);
76 + }
73 77 }
74 78 
75 79 /* ensure the tags are visible before the PTE is set */
···
96 96 * pages is tagged, set_pte_at() may zero or change the tags of the
97 97 * other page via mte_sync_tags().
98 98 */
99 - if (test_bit(PG_mte_tagged, &page1->flags) ||
100 - test_bit(PG_mte_tagged, &page2->flags))
99 + if (page_mte_tagged(page1) || page_mte_tagged(page2))
101 100 return addr1 != addr2;
102 101 
103 102 return ret;
···
453 454 put_page(page);
454 455 break;
455 456 }
456 - WARN_ON_ONCE(!test_bit(PG_mte_tagged, &page->flags));
457 + WARN_ON_ONCE(!page_mte_tagged(page));
457 458 
458 459 /* limit access to the end of the page */
459 460 offset = offset_in_page(addr);
+2
arch/arm64/kvm/Kconfig
···
32 32 select KVM_VFIO
33 33 select HAVE_KVM_EVENTFD
34 34 select HAVE_KVM_IRQFD
35 + select HAVE_KVM_DIRTY_RING_ACQ_REL
36 + select NEED_KVM_DIRTY_RING_WITH_BITMAP
35 37 select HAVE_KVM_MSI
36 38 select HAVE_KVM_IRQCHIP
37 39 select HAVE_KVM_IRQ_ROUTING
+53 -41
arch/arm64/kvm/arm.c
··· 37 37 #include <asm/kvm_arm.h> 38 38 #include <asm/kvm_asm.h> 39 39 #include <asm/kvm_mmu.h> 40 + #include <asm/kvm_pkvm.h> 40 41 #include <asm/kvm_emulate.h> 41 42 #include <asm/sections.h> 42 43 ··· 51 50 DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector); 52 51 53 52 DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); 54 - unsigned long kvm_arm_hyp_percpu_base[NR_CPUS]; 55 53 DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params); 56 54 57 55 static bool vgic_present; ··· 138 138 { 139 139 int ret; 140 140 141 - ret = kvm_arm_setup_stage2(kvm, type); 142 - if (ret) 143 - return ret; 144 - 145 - ret = kvm_init_stage2_mmu(kvm, &kvm->arch.mmu); 146 - if (ret) 147 - return ret; 148 - 149 141 ret = kvm_share_hyp(kvm, kvm + 1); 150 142 if (ret) 151 - goto out_free_stage2_pgd; 143 + return ret; 144 + 145 + ret = pkvm_init_host_vm(kvm); 146 + if (ret) 147 + goto err_unshare_kvm; 152 148 153 149 if (!zalloc_cpumask_var(&kvm->arch.supported_cpus, GFP_KERNEL)) { 154 150 ret = -ENOMEM; 155 - goto out_free_stage2_pgd; 151 + goto err_unshare_kvm; 156 152 } 157 153 cpumask_copy(kvm->arch.supported_cpus, cpu_possible_mask); 154 + 155 + ret = kvm_init_stage2_mmu(kvm, &kvm->arch.mmu, type); 156 + if (ret) 157 + goto err_free_cpumask; 158 158 159 159 kvm_vgic_early_init(kvm); 160 160 ··· 164 164 set_default_spectre(kvm); 165 165 kvm_arm_init_hypercalls(kvm); 166 166 167 - return ret; 168 - out_free_stage2_pgd: 169 - kvm_free_stage2_pgd(&kvm->arch.mmu); 167 + /* 168 + * Initialise the default PMUver before there is a chance to 169 + * create an actual PMU. 
170 + */ 171 + kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit(); 172 + 173 + return 0; 174 + 175 + err_free_cpumask: 176 + free_cpumask_var(kvm->arch.supported_cpus); 177 + err_unshare_kvm: 178 + kvm_unshare_hyp(kvm, kvm + 1); 170 179 return ret; 171 180 } 172 181 ··· 195 186 free_cpumask_var(kvm->arch.supported_cpus); 196 187 197 188 kvm_vgic_destroy(kvm); 189 + 190 + if (is_protected_kvm_enabled()) 191 + pkvm_destroy_hyp_vm(kvm); 198 192 199 193 kvm_destroy_vcpus(kvm); 200 194 ··· 581 569 if (ret) 582 570 return ret; 583 571 572 + if (is_protected_kvm_enabled()) { 573 + ret = pkvm_create_hyp_vm(kvm); 574 + if (ret) 575 + return ret; 576 + } 577 + 584 578 if (!irqchip_in_kernel(kvm)) { 585 579 /* 586 580 * Tell the rest of the code that there are userspace irqchip ··· 764 746 765 747 if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) 766 748 return kvm_vcpu_suspend(vcpu); 749 + 750 + if (kvm_dirty_ring_check_request(vcpu)) 751 + return 0; 767 752 } 768 753 769 754 return 1; ··· 1539 1518 return 0; 1540 1519 } 1541 1520 1542 - static void cpu_prepare_hyp_mode(int cpu) 1521 + static void cpu_prepare_hyp_mode(int cpu, u32 hyp_va_bits) 1543 1522 { 1544 1523 struct kvm_nvhe_init_params *params = per_cpu_ptr_nvhe_sym(kvm_init_params, cpu); 1545 1524 unsigned long tcr; ··· 1555 1534 1556 1535 params->mair_el2 = read_sysreg(mair_el1); 1557 1536 1558 - /* 1559 - * The ID map may be configured to use an extended virtual address 1560 - * range. This is only the case if system RAM is out of range for the 1561 - * currently configured page size and VA_BITS, in which case we will 1562 - * also need the extended virtual range for the HYP ID map, or we won't 1563 - * be able to enable the EL2 MMU. 1564 - * 1565 - * However, at EL2, there is only one TTBR register, and we can't switch 1566 - * between translation tables *and* update TCR_EL2.T0SZ at the same 1567 - * time. Bottom line: we need to use the extended range with *both* our 1568 - * translation tables. 
1569 - * 1570 - * So use the same T0SZ value we use for the ID map. 1571 - */ 1572 1537 tcr = (read_sysreg(tcr_el1) & TCR_EL2_MASK) | TCR_EL2_RES1; 1573 1538 tcr &= ~TCR_T0SZ_MASK; 1574 - tcr |= (idmap_t0sz & GENMASK(TCR_TxSZ_WIDTH - 1, 0)) << TCR_T0SZ_OFFSET; 1539 + tcr |= TCR_T0SZ(hyp_va_bits); 1575 1540 params->tcr_el2 = tcr; 1576 1541 1577 1542 params->pgd_pa = kvm_mmu_get_httbr(); ··· 1851 1844 free_hyp_pgds(); 1852 1845 for_each_possible_cpu(cpu) { 1853 1846 free_page(per_cpu(kvm_arm_hyp_stack_page, cpu)); 1854 - free_pages(kvm_arm_hyp_percpu_base[cpu], nvhe_percpu_order()); 1847 + free_pages(kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu], nvhe_percpu_order()); 1855 1848 } 1856 1849 } 1857 1850 1858 1851 static int do_pkvm_init(u32 hyp_va_bits) 1859 1852 { 1860 - void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base); 1853 + void *per_cpu_base = kvm_ksym_ref(kvm_nvhe_sym(kvm_arm_hyp_percpu_base)); 1861 1854 int ret; 1862 1855 1863 1856 preempt_disable(); ··· 1877 1870 return ret; 1878 1871 } 1879 1872 1880 - static int kvm_hyp_init_protection(u32 hyp_va_bits) 1873 + static void kvm_hyp_init_symbols(void) 1881 1874 { 1882 - void *addr = phys_to_virt(hyp_mem_base); 1883 - int ret; 1884 - 1885 1875 kvm_nvhe_sym(id_aa64pfr0_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); 1886 1876 kvm_nvhe_sym(id_aa64pfr1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); 1887 1877 kvm_nvhe_sym(id_aa64isar0_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR0_EL1); ··· 1887 1883 kvm_nvhe_sym(id_aa64mmfr0_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1); 1888 1884 kvm_nvhe_sym(id_aa64mmfr1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1); 1889 1885 kvm_nvhe_sym(id_aa64mmfr2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64MMFR2_EL1); 1886 + kvm_nvhe_sym(__icache_flags) = __icache_flags; 1887 + kvm_nvhe_sym(kvm_arm_vmid_bits) = kvm_arm_vmid_bits; 1888 + } 1889 + 1890 + static int kvm_hyp_init_protection(u32 hyp_va_bits) 1891 + { 1892 + 
void *addr = phys_to_virt(hyp_mem_base); 1893 + int ret; 1890 1894 1891 1895 ret = create_hyp_mappings(addr, addr + hyp_mem_size, PAGE_HYP); 1892 1896 if (ret) ··· 1962 1950 1963 1951 page_addr = page_address(page); 1964 1952 memcpy(page_addr, CHOOSE_NVHE_SYM(__per_cpu_start), nvhe_percpu_size()); 1965 - kvm_arm_hyp_percpu_base[cpu] = (unsigned long)page_addr; 1953 + kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu] = (unsigned long)page_addr; 1966 1954 } 1967 1955 1968 1956 /* ··· 2055 2043 } 2056 2044 2057 2045 for_each_possible_cpu(cpu) { 2058 - char *percpu_begin = (char *)kvm_arm_hyp_percpu_base[cpu]; 2046 + char *percpu_begin = (char *)kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu]; 2059 2047 char *percpu_end = percpu_begin + nvhe_percpu_size(); 2060 2048 2061 2049 /* Map Hyp percpu pages */ ··· 2066 2054 } 2067 2055 2068 2056 /* Prepare the CPU initialization parameters */ 2069 - cpu_prepare_hyp_mode(cpu); 2057 + cpu_prepare_hyp_mode(cpu, hyp_va_bits); 2070 2058 } 2059 + 2060 + kvm_hyp_init_symbols(); 2071 2061 2072 2062 if (is_protected_kvm_enabled()) { 2073 2063 init_cpu_logical_map(); ··· 2078 2064 err = -ENODEV; 2079 2065 goto out_err; 2080 2066 } 2081 - } 2082 2067 2083 - if (is_protected_kvm_enabled()) { 2084 2068 err = kvm_hyp_init_protection(hyp_va_bits); 2085 2069 if (err) { 2086 2070 kvm_err("Failed to init hyp memory protection\n");
+11 -7
arch/arm64/kvm/guest.c
··· 1059 1059 maddr = page_address(page); 1060 1060 1061 1061 if (!write) { 1062 - if (test_bit(PG_mte_tagged, &page->flags)) 1062 + if (page_mte_tagged(page)) 1063 1063 num_tags = mte_copy_tags_to_user(tags, maddr, 1064 1064 MTE_GRANULES_PER_PAGE); 1065 1065 else ··· 1068 1068 clear_user(tags, MTE_GRANULES_PER_PAGE); 1069 1069 kvm_release_pfn_clean(pfn); 1070 1070 } else { 1071 + /* 1072 + * Only locking to serialise with a concurrent 1073 + * set_pte_at() in the VMM but still overriding the 1074 + * tags, hence ignoring the return value. 1075 + */ 1076 + try_page_mte_tagging(page); 1071 1077 num_tags = mte_copy_tags_from_user(maddr, tags, 1072 1078 MTE_GRANULES_PER_PAGE); 1073 1079 1074 - /* 1075 - * Set the flag after checking the write 1076 - * completed fully 1077 - */ 1078 - if (num_tags == MTE_GRANULES_PER_PAGE) 1079 - set_bit(PG_mte_tagged, &page->flags); 1080 + /* uaccess failed, don't leave stale tags */ 1081 + if (num_tags != MTE_GRANULES_PER_PAGE) 1082 + mte_clear_page_tags(page); 1083 + set_page_mte_tagged(page); 1080 1084 1081 1085 kvm_release_pfn_dirty(pfn); 1082 1086 }
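The guest.c hunk above changes the tag-write path so that a partial `mte_copy_tags_from_user()` no longer leaves stale tags behind: on an incomplete copy the page tags are wiped, and the page is marked tagged either way. A standalone sketch of that discipline, with hypothetical stand-ins (`page_tags`, `copy_tags_from_user`, `set_tags`) in place of the real MTE helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GRANULES_PER_PAGE 16

/* Hypothetical stand-in for a page's tag storage. */
static unsigned char page_tags[GRANULES_PER_PAGE];

/* Copies up to a page of tags; returns how many granules were copied,
 * modelling a uaccess that can fail partway through. */
static size_t copy_tags_from_user(const unsigned char *src, size_t avail)
{
    size_t n = avail < GRANULES_PER_PAGE ? avail : GRANULES_PER_PAGE;

    memcpy(page_tags, src, n);
    return n;
}

static void clear_page_tags(void)
{
    memset(page_tags, 0, sizeof(page_tags));
}

/*
 * The pattern from the hunk: if the copy did not cover the whole page,
 * wipe the tags so nothing stale survives, then mark the page tagged
 * unconditionally.
 */
static int set_tags(const unsigned char *src, size_t avail, int *tagged)
{
    size_t n = copy_tags_from_user(src, avail);

    if (n != GRANULES_PER_PAGE)
        clear_page_tags();
    *tagged = 1;
    return n == GRANULES_PER_PAGE ? 0 : -1;
}
```

The real code additionally serialises against a concurrent `set_pte_at()` via `try_page_mte_tagging()`; this sketch keeps only the clear-on-partial-failure rule.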
+3
arch/arm64/kvm/hyp/hyp-constants.c
··· 2 2 3 3 #include <linux/kbuild.h> 4 4 #include <nvhe/memory.h> 5 + #include <nvhe/pkvm.h> 5 6 6 7 int main(void) 7 8 { 8 9 DEFINE(STRUCT_HYP_PAGE_SIZE, sizeof(struct hyp_page)); 10 + DEFINE(PKVM_HYP_VM_SIZE, sizeof(struct pkvm_hyp_vm)); 11 + DEFINE(PKVM_HYP_VCPU_SIZE, sizeof(struct pkvm_hyp_vcpu)); 9 12 return 0; 10 13 }
+21 -4
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
··· 8 8 #define __KVM_NVHE_MEM_PROTECT__ 9 9 #include <linux/kvm_host.h> 10 10 #include <asm/kvm_hyp.h> 11 + #include <asm/kvm_mmu.h> 11 12 #include <asm/kvm_pgtable.h> 12 13 #include <asm/virt.h> 14 + #include <nvhe/pkvm.h> 13 15 #include <nvhe/spinlock.h> 14 16 15 17 /* ··· 45 43 return prot & PKVM_PAGE_STATE_PROT_MASK; 46 44 } 47 45 48 - struct host_kvm { 46 + struct host_mmu { 49 47 struct kvm_arch arch; 50 48 struct kvm_pgtable pgt; 51 49 struct kvm_pgtable_mm_ops mm_ops; 52 50 hyp_spinlock_t lock; 53 51 }; 54 - extern struct host_kvm host_kvm; 52 + extern struct host_mmu host_mmu; 55 53 56 - extern const u8 pkvm_hyp_id; 54 + /* This corresponds to page-table locking order */ 55 + enum pkvm_component_id { 56 + PKVM_ID_HOST, 57 + PKVM_ID_HYP, 58 + }; 59 + 60 + extern unsigned long hyp_nr_cpus; 57 61 58 62 int __pkvm_prot_finalize(void); 59 63 int __pkvm_host_share_hyp(u64 pfn); 60 64 int __pkvm_host_unshare_hyp(u64 pfn); 65 + int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages); 66 + int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); 61 67 62 68 bool addr_is_memory(phys_addr_t phys); 63 69 int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot); 64 70 int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id); 65 71 int kvm_host_prepare_stage2(void *pgt_pool_base); 72 + int kvm_guest_prepare_stage2(struct pkvm_hyp_vm *vm, void *pgd); 66 73 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt); 74 + 75 + int hyp_pin_shared_mem(void *from, void *to); 76 + void hyp_unpin_shared_mem(void *from, void *to); 77 + void reclaim_guest_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc); 78 + int refill_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages, 79 + struct kvm_hyp_memcache *host_mc); 67 80 68 81 static __always_inline void __load_host_stage2(void) 69 82 { 70 83 if (static_branch_likely(&kvm_protected_mode_initialized)) 71 - __load_stage2(&host_kvm.arch.mmu, &host_kvm.arch); 84 + 
__load_stage2(&host_mmu.arch.mmu, &host_mmu.arch); 72 85 else 73 86 write_sysreg(0, vttbr_el2); 74 87 }
+27
arch/arm64/kvm/hyp/include/nvhe/memory.h
··· 38 38 #define hyp_page_to_virt(page) __hyp_va(hyp_page_to_phys(page)) 39 39 #define hyp_page_to_pool(page) (((struct hyp_page *)page)->pool) 40 40 41 + /* 42 + * Refcounting for 'struct hyp_page'. 43 + * hyp_pool::lock must be held if atomic access to the refcount is required. 44 + */ 41 45 static inline int hyp_page_count(void *addr) 42 46 { 43 47 struct hyp_page *p = hyp_virt_to_page(addr); ··· 49 45 return p->refcount; 50 46 } 51 47 48 + static inline void hyp_page_ref_inc(struct hyp_page *p) 49 + { 50 + BUG_ON(p->refcount == USHRT_MAX); 51 + p->refcount++; 52 + } 53 + 54 + static inline void hyp_page_ref_dec(struct hyp_page *p) 55 + { 56 + BUG_ON(!p->refcount); 57 + p->refcount--; 58 + } 59 + 60 + static inline int hyp_page_ref_dec_and_test(struct hyp_page *p) 61 + { 62 + hyp_page_ref_dec(p); 63 + return (p->refcount == 0); 64 + } 65 + 66 + static inline void hyp_set_page_refcounted(struct hyp_page *p) 67 + { 68 + BUG_ON(p->refcount); 69 + p->refcount = 1; 70 + } 52 71 #endif /* __KVM_HYP_MEMORY_H */
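The new `hyp_page` refcount helpers above encode a strict discipline: overflow, underflow, and double-initialisation are bugs (`BUG_ON`), not recoverable errors, and `dec_and_test` reports when the last reference drops. A minimal userspace sketch of the same helpers, with `assert()` standing in for `BUG_ON()` and a hypothetical `struct page_ref` in place of `struct hyp_page`:

```c
#include <assert.h>
#include <limits.h>

/* Minimal stand-in for 'struct hyp_page'; only the refcount matters here. */
struct page_ref {
    unsigned short refcount;
};

static void ref_inc(struct page_ref *p)
{
    assert(p->refcount != USHRT_MAX);   /* BUG_ON() in the original */
    p->refcount++;
}

static void ref_dec(struct page_ref *p)
{
    assert(p->refcount);                /* underflow is a bug, not an error */
    p->refcount--;
}

static int ref_dec_and_test(struct page_ref *p)
{
    ref_dec(p);
    return p->refcount == 0;
}

static void set_refcounted(struct page_ref *p)
{
    assert(!p->refcount);               /* double-init would hide a leak */
    p->refcount = 1;
}
```

Note the original comment's caveat carries over: the count itself is plain, so the pool lock must be held whenever atomic access is required.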
+5 -13
arch/arm64/kvm/hyp/include/nvhe/mm.h
··· 13 13 extern struct kvm_pgtable pkvm_pgtable; 14 14 extern hyp_spinlock_t pkvm_pgd_lock; 15 15 16 + int hyp_create_pcpu_fixmap(void); 17 + void *hyp_fixmap_map(phys_addr_t phys); 18 + void hyp_fixmap_unmap(void); 19 + 16 20 int hyp_create_idmap(u32 hyp_va_bits); 17 21 int hyp_map_vectors(void); 18 - int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back); 22 + int hyp_back_vmemmap(phys_addr_t back); 19 23 int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot); 20 24 int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot); 21 25 int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot); ··· 27 23 enum kvm_pgtable_prot prot, 28 24 unsigned long *haddr); 29 25 int pkvm_alloc_private_va_range(size_t size, unsigned long *haddr); 30 - 31 - static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size, 32 - unsigned long *start, unsigned long *end) 33 - { 34 - unsigned long nr_pages = size >> PAGE_SHIFT; 35 - struct hyp_page *p = hyp_phys_to_page(phys); 36 - 37 - *start = (unsigned long)p; 38 - *end = *start + nr_pages * sizeof(struct hyp_page); 39 - *start = ALIGN_DOWN(*start, PAGE_SIZE); 40 - *end = ALIGN(*end, PAGE_SIZE); 41 - } 42 26 43 27 #endif /* __KVM_HYP_MM_H */
+68
arch/arm64/kvm/hyp/include/nvhe/pkvm.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * Copyright (C) 2021 Google LLC 4 + * Author: Fuad Tabba <tabba@google.com> 5 + */ 6 + 7 + #ifndef __ARM64_KVM_NVHE_PKVM_H__ 8 + #define __ARM64_KVM_NVHE_PKVM_H__ 9 + 10 + #include <asm/kvm_pkvm.h> 11 + 12 + #include <nvhe/gfp.h> 13 + #include <nvhe/spinlock.h> 14 + 15 + /* 16 + * Holds the relevant data for maintaining the vcpu state completely at hyp. 17 + */ 18 + struct pkvm_hyp_vcpu { 19 + struct kvm_vcpu vcpu; 20 + 21 + /* Backpointer to the host's (untrusted) vCPU instance. */ 22 + struct kvm_vcpu *host_vcpu; 23 + }; 24 + 25 + /* 26 + * Holds the relevant data for running a protected vm. 27 + */ 28 + struct pkvm_hyp_vm { 29 + struct kvm kvm; 30 + 31 + /* Backpointer to the host's (untrusted) KVM instance. */ 32 + struct kvm *host_kvm; 33 + 34 + /* The guest's stage-2 page-table managed by the hypervisor. */ 35 + struct kvm_pgtable pgt; 36 + struct kvm_pgtable_mm_ops mm_ops; 37 + struct hyp_pool pool; 38 + hyp_spinlock_t lock; 39 + 40 + /* 41 + * The number of vcpus initialized and ready to run. 42 + * Modifying this is protected by 'vm_table_lock'. 43 + */ 44 + unsigned int nr_vcpus; 45 + 46 + /* Array of the hyp vCPU structures for this VM. */ 47 + struct pkvm_hyp_vcpu *vcpus[]; 48 + }; 49 + 50 + static inline struct pkvm_hyp_vm * 51 + pkvm_hyp_vcpu_to_hyp_vm(struct pkvm_hyp_vcpu *hyp_vcpu) 52 + { 53 + return container_of(hyp_vcpu->vcpu.kvm, struct pkvm_hyp_vm, kvm); 54 + } 55 + 56 + void pkvm_hyp_vm_table_init(void *tbl); 57 + 58 + int __pkvm_init_vm(struct kvm *host_kvm, unsigned long vm_hva, 59 + unsigned long pgd_hva); 60 + int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu, 61 + unsigned long vcpu_hva); 62 + int __pkvm_teardown_vm(pkvm_handle_t handle); 63 + 64 + struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, 65 + unsigned int vcpu_idx); 66 + void pkvm_put_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu); 67 + 68 + #endif /* __ARM64_KVM_NVHE_PKVM_H__ */
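The new `pkvm.h` keeps a `struct kvm` embedded inside `struct pkvm_hyp_vm`, and `pkvm_hyp_vcpu_to_hyp_vm()` recovers the outer structure from the inner pointer with `container_of()`. A compilable sketch of that embedding pattern, using hypothetical stub types (`kvm_stub`, `hyp_vm_stub`) rather than the real kernel structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the embedding used by struct pkvm_hyp_vm. */
struct kvm_stub    { int nr_memslots; };
struct hyp_vm_stub {
    struct kvm_stub kvm;    /* embedded first member, as in pkvm_hyp_vm */
    int nr_vcpus;
};

/* container_of(): recover the outer struct from a pointer to a member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct hyp_vm_stub *to_hyp_vm(struct kvm_stub *kvm)
{
    return container_of(kvm, struct hyp_vm_stub, kvm);
}
```

Because the hypervisor hands out `&hyp_vcpu->vcpu.kvm`-style inner pointers, this back-pointer arithmetic lets it reach its private per-VM state without storing a second pointer.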
+9 -1
arch/arm64/kvm/hyp/include/nvhe/spinlock.h
··· 28 28 }; 29 29 } hyp_spinlock_t; 30 30 31 + #define __HYP_SPIN_LOCK_INITIALIZER \ 32 + { .__val = 0 } 33 + 34 + #define __HYP_SPIN_LOCK_UNLOCKED \ 35 + ((hyp_spinlock_t) __HYP_SPIN_LOCK_INITIALIZER) 36 + 37 + #define DEFINE_HYP_SPINLOCK(x) hyp_spinlock_t x = __HYP_SPIN_LOCK_UNLOCKED 38 + 31 39 #define hyp_spin_lock_init(l) \ 32 40 do { \ 33 - *(l) = (hyp_spinlock_t){ .__val = 0 }; \ 41 + *(l) = __HYP_SPIN_LOCK_UNLOCKED; \ 34 42 } while (0) 35 43 36 44 static inline void hyp_spin_lock(hyp_spinlock_t *lock)
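The spinlock hunk above layers three macros so the same initializer works both as a file-scope static definition (`DEFINE_HYP_SPINLOCK`) and as a runtime compound-literal assignment (`hyp_spin_lock_init`). A small sketch of that macro layering with a hypothetical `tiny_lock_t`:

```c
#include <assert.h>

typedef struct { unsigned int val; } tiny_lock_t;

/* Brace form: a constant expression, usable in static initializers. */
#define __TINY_LOCK_INITIALIZER   { .val = 0 }
/* Compound-literal form: an rvalue, usable for runtime re-init. */
#define __TINY_LOCK_UNLOCKED      ((tiny_lock_t) __TINY_LOCK_INITIALIZER)
#define DEFINE_TINY_LOCK(x)       tiny_lock_t x = __TINY_LOCK_INITIALIZER

static DEFINE_TINY_LOCK(file_scope_lock);

static void tiny_lock_init(tiny_lock_t *l)
{
    *l = __TINY_LOCK_UNLOCKED;  /* mirrors hyp_spin_lock_init() */
}
```

Keeping one `__..._INITIALIZER` as the single source of truth means the unlocked bit pattern cannot drift between the static and dynamic paths.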
+11
arch/arm64/kvm/hyp/nvhe/cache.S
··· 12 12 ret 13 13 SYM_FUNC_END(__pi_dcache_clean_inval_poc) 14 14 SYM_FUNC_ALIAS(dcache_clean_inval_poc, __pi_dcache_clean_inval_poc) 15 + 16 + SYM_FUNC_START(__pi_icache_inval_pou) 17 + alternative_if ARM64_HAS_CACHE_DIC 18 + isb 19 + ret 20 + alternative_else_nop_endif 21 + 22 + invalidate_icache_by_line x0, x1, x2, x3 23 + ret 24 + SYM_FUNC_END(__pi_icache_inval_pou) 25 + SYM_FUNC_ALIAS(icache_inval_pou, __pi_icache_inval_pou)
+108 -2
arch/arm64/kvm/hyp/nvhe/hyp-main.c
··· 15 15 16 16 #include <nvhe/mem_protect.h> 17 17 #include <nvhe/mm.h> 18 + #include <nvhe/pkvm.h> 18 19 #include <nvhe/trap_handler.h> 19 20 20 21 DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params); 21 22 22 23 void __kvm_hyp_host_forward_smc(struct kvm_cpu_context *host_ctxt); 23 24 25 + static void flush_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu) 26 + { 27 + struct kvm_vcpu *host_vcpu = hyp_vcpu->host_vcpu; 28 + 29 + hyp_vcpu->vcpu.arch.ctxt = host_vcpu->arch.ctxt; 30 + 31 + hyp_vcpu->vcpu.arch.sve_state = kern_hyp_va(host_vcpu->arch.sve_state); 32 + hyp_vcpu->vcpu.arch.sve_max_vl = host_vcpu->arch.sve_max_vl; 33 + 34 + hyp_vcpu->vcpu.arch.hw_mmu = host_vcpu->arch.hw_mmu; 35 + 36 + hyp_vcpu->vcpu.arch.hcr_el2 = host_vcpu->arch.hcr_el2; 37 + hyp_vcpu->vcpu.arch.mdcr_el2 = host_vcpu->arch.mdcr_el2; 38 + hyp_vcpu->vcpu.arch.cptr_el2 = host_vcpu->arch.cptr_el2; 39 + 40 + hyp_vcpu->vcpu.arch.iflags = host_vcpu->arch.iflags; 41 + hyp_vcpu->vcpu.arch.fp_state = host_vcpu->arch.fp_state; 42 + 43 + hyp_vcpu->vcpu.arch.debug_ptr = kern_hyp_va(host_vcpu->arch.debug_ptr); 44 + hyp_vcpu->vcpu.arch.host_fpsimd_state = host_vcpu->arch.host_fpsimd_state; 45 + 46 + hyp_vcpu->vcpu.arch.vsesr_el2 = host_vcpu->arch.vsesr_el2; 47 + 48 + hyp_vcpu->vcpu.arch.vgic_cpu.vgic_v3 = host_vcpu->arch.vgic_cpu.vgic_v3; 49 + } 50 + 51 + static void sync_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu) 52 + { 53 + struct kvm_vcpu *host_vcpu = hyp_vcpu->host_vcpu; 54 + struct vgic_v3_cpu_if *hyp_cpu_if = &hyp_vcpu->vcpu.arch.vgic_cpu.vgic_v3; 55 + struct vgic_v3_cpu_if *host_cpu_if = &host_vcpu->arch.vgic_cpu.vgic_v3; 56 + unsigned int i; 57 + 58 + host_vcpu->arch.ctxt = hyp_vcpu->vcpu.arch.ctxt; 59 + 60 + host_vcpu->arch.hcr_el2 = hyp_vcpu->vcpu.arch.hcr_el2; 61 + host_vcpu->arch.cptr_el2 = hyp_vcpu->vcpu.arch.cptr_el2; 62 + 63 + host_vcpu->arch.fault = hyp_vcpu->vcpu.arch.fault; 64 + 65 + host_vcpu->arch.iflags = hyp_vcpu->vcpu.arch.iflags; 66 + host_vcpu->arch.fp_state = 
hyp_vcpu->vcpu.arch.fp_state; 67 + 68 + host_cpu_if->vgic_hcr = hyp_cpu_if->vgic_hcr; 69 + for (i = 0; i < hyp_cpu_if->used_lrs; ++i) 70 + host_cpu_if->vgic_lr[i] = hyp_cpu_if->vgic_lr[i]; 71 + } 72 + 24 73 static void handle___kvm_vcpu_run(struct kvm_cpu_context *host_ctxt) 25 74 { 26 - DECLARE_REG(struct kvm_vcpu *, vcpu, host_ctxt, 1); 75 + DECLARE_REG(struct kvm_vcpu *, host_vcpu, host_ctxt, 1); 76 + int ret; 27 77 28 - cpu_reg(host_ctxt, 1) = __kvm_vcpu_run(kern_hyp_va(vcpu)); 78 + host_vcpu = kern_hyp_va(host_vcpu); 79 + 80 + if (unlikely(is_protected_kvm_enabled())) { 81 + struct pkvm_hyp_vcpu *hyp_vcpu; 82 + struct kvm *host_kvm; 83 + 84 + host_kvm = kern_hyp_va(host_vcpu->kvm); 85 + hyp_vcpu = pkvm_load_hyp_vcpu(host_kvm->arch.pkvm.handle, 86 + host_vcpu->vcpu_idx); 87 + if (!hyp_vcpu) { 88 + ret = -EINVAL; 89 + goto out; 90 + } 91 + 92 + flush_hyp_vcpu(hyp_vcpu); 93 + 94 + ret = __kvm_vcpu_run(&hyp_vcpu->vcpu); 95 + 96 + sync_hyp_vcpu(hyp_vcpu); 97 + pkvm_put_hyp_vcpu(hyp_vcpu); 98 + } else { 99 + /* The host is fully trusted, run its vCPU directly. 
*/ 100 + ret = __kvm_vcpu_run(host_vcpu); 101 + } 102 + 103 + out: 104 + cpu_reg(host_ctxt, 1) = ret; 29 105 } 30 106 31 107 static void handle___kvm_adjust_pc(struct kvm_cpu_context *host_ctxt) ··· 267 191 __pkvm_vcpu_init_traps(kern_hyp_va(vcpu)); 268 192 } 269 193 194 + static void handle___pkvm_init_vm(struct kvm_cpu_context *host_ctxt) 195 + { 196 + DECLARE_REG(struct kvm *, host_kvm, host_ctxt, 1); 197 + DECLARE_REG(unsigned long, vm_hva, host_ctxt, 2); 198 + DECLARE_REG(unsigned long, pgd_hva, host_ctxt, 3); 199 + 200 + host_kvm = kern_hyp_va(host_kvm); 201 + cpu_reg(host_ctxt, 1) = __pkvm_init_vm(host_kvm, vm_hva, pgd_hva); 202 + } 203 + 204 + static void handle___pkvm_init_vcpu(struct kvm_cpu_context *host_ctxt) 205 + { 206 + DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); 207 + DECLARE_REG(struct kvm_vcpu *, host_vcpu, host_ctxt, 2); 208 + DECLARE_REG(unsigned long, vcpu_hva, host_ctxt, 3); 209 + 210 + host_vcpu = kern_hyp_va(host_vcpu); 211 + cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva); 212 + } 213 + 214 + static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt) 215 + { 216 + DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); 217 + 218 + cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle); 219 + } 220 + 270 221 typedef void (*hcall_t)(struct kvm_cpu_context *); 271 222 272 223 #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x ··· 323 220 HANDLE_FUNC(__vgic_v3_save_aprs), 324 221 HANDLE_FUNC(__vgic_v3_restore_aprs), 325 222 HANDLE_FUNC(__pkvm_vcpu_init_traps), 223 + HANDLE_FUNC(__pkvm_init_vm), 224 + HANDLE_FUNC(__pkvm_init_vcpu), 225 + HANDLE_FUNC(__pkvm_teardown_vm), 326 226 }; 327 227 328 228 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
+2
arch/arm64/kvm/hyp/nvhe/hyp-smp.c
··· 23 23 return hyp_cpu_logical_map[cpu]; 24 24 } 25 25 26 + unsigned long __ro_after_init kvm_arm_hyp_percpu_base[NR_CPUS]; 27 + 26 28 unsigned long __hyp_per_cpu_offset(unsigned int cpu) 27 29 { 28 30 unsigned long *cpu_base_array;
+476 -45
arch/arm64/kvm/hyp/nvhe/mem_protect.c
··· 21 21 22 22 #define KVM_HOST_S2_FLAGS (KVM_PGTABLE_S2_NOFWB | KVM_PGTABLE_S2_IDMAP) 23 23 24 - extern unsigned long hyp_nr_cpus; 25 - struct host_kvm host_kvm; 24 + struct host_mmu host_mmu; 26 25 27 26 static struct hyp_pool host_s2_pool; 28 27 29 - const u8 pkvm_hyp_id = 1; 28 + static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm); 29 + #define current_vm (*this_cpu_ptr(&__current_vm)) 30 + 31 + static void guest_lock_component(struct pkvm_hyp_vm *vm) 32 + { 33 + hyp_spin_lock(&vm->lock); 34 + current_vm = vm; 35 + } 36 + 37 + static void guest_unlock_component(struct pkvm_hyp_vm *vm) 38 + { 39 + current_vm = NULL; 40 + hyp_spin_unlock(&vm->lock); 41 + } 30 42 31 43 static void host_lock_component(void) 32 44 { 33 - hyp_spin_lock(&host_kvm.lock); 45 + hyp_spin_lock(&host_mmu.lock); 34 46 } 35 47 36 48 static void host_unlock_component(void) 37 49 { 38 - hyp_spin_unlock(&host_kvm.lock); 50 + hyp_spin_unlock(&host_mmu.lock); 39 51 } 40 52 41 53 static void hyp_lock_component(void) ··· 91 79 hyp_put_page(&host_s2_pool, addr); 92 80 } 93 81 82 + static void host_s2_free_removed_table(void *addr, u32 level) 83 + { 84 + kvm_pgtable_stage2_free_removed(&host_mmu.mm_ops, addr, level); 85 + } 86 + 94 87 static int prepare_s2_pool(void *pgt_pool_base) 95 88 { 96 89 unsigned long nr_pages, pfn; ··· 107 90 if (ret) 108 91 return ret; 109 92 110 - host_kvm.mm_ops = (struct kvm_pgtable_mm_ops) { 93 + host_mmu.mm_ops = (struct kvm_pgtable_mm_ops) { 111 94 .zalloc_pages_exact = host_s2_zalloc_pages_exact, 112 95 .zalloc_page = host_s2_zalloc_page, 96 + .free_removed_table = host_s2_free_removed_table, 113 97 .phys_to_virt = hyp_phys_to_virt, 114 98 .virt_to_phys = hyp_virt_to_phys, 115 99 .page_count = hyp_page_count, ··· 129 111 parange = kvm_get_parange(id_aa64mmfr0_el1_sys_val); 130 112 phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange); 131 113 132 - host_kvm.arch.vtcr = kvm_get_vtcr(id_aa64mmfr0_el1_sys_val, 114 + host_mmu.arch.vtcr = 
kvm_get_vtcr(id_aa64mmfr0_el1_sys_val, 133 115 id_aa64mmfr1_el1_sys_val, phys_shift); 134 116 } 135 117 ··· 137 119 138 120 int kvm_host_prepare_stage2(void *pgt_pool_base) 139 121 { 140 - struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu; 122 + struct kvm_s2_mmu *mmu = &host_mmu.arch.mmu; 141 123 int ret; 142 124 143 125 prepare_host_vtcr(); 144 - hyp_spin_lock_init(&host_kvm.lock); 145 - mmu->arch = &host_kvm.arch; 126 + hyp_spin_lock_init(&host_mmu.lock); 127 + mmu->arch = &host_mmu.arch; 146 128 147 129 ret = prepare_s2_pool(pgt_pool_base); 148 130 if (ret) 149 131 return ret; 150 132 151 - ret = __kvm_pgtable_stage2_init(&host_kvm.pgt, mmu, 152 - &host_kvm.mm_ops, KVM_HOST_S2_FLAGS, 133 + ret = __kvm_pgtable_stage2_init(&host_mmu.pgt, mmu, 134 + &host_mmu.mm_ops, KVM_HOST_S2_FLAGS, 153 135 host_stage2_force_pte_cb); 154 136 if (ret) 155 137 return ret; 156 138 157 - mmu->pgd_phys = __hyp_pa(host_kvm.pgt.pgd); 158 - mmu->pgt = &host_kvm.pgt; 139 + mmu->pgd_phys = __hyp_pa(host_mmu.pgt.pgd); 140 + mmu->pgt = &host_mmu.pgt; 159 141 atomic64_set(&mmu->vmid.id, 0); 160 142 161 143 return 0; 162 144 } 163 145 146 + static bool guest_stage2_force_pte_cb(u64 addr, u64 end, 147 + enum kvm_pgtable_prot prot) 148 + { 149 + return true; 150 + } 151 + 152 + static void *guest_s2_zalloc_pages_exact(size_t size) 153 + { 154 + void *addr = hyp_alloc_pages(&current_vm->pool, get_order(size)); 155 + 156 + WARN_ON(size != (PAGE_SIZE << get_order(size))); 157 + hyp_split_page(hyp_virt_to_page(addr)); 158 + 159 + return addr; 160 + } 161 + 162 + static void guest_s2_free_pages_exact(void *addr, unsigned long size) 163 + { 164 + u8 order = get_order(size); 165 + unsigned int i; 166 + 167 + for (i = 0; i < (1 << order); i++) 168 + hyp_put_page(&current_vm->pool, addr + (i * PAGE_SIZE)); 169 + } 170 + 171 + static void *guest_s2_zalloc_page(void *mc) 172 + { 173 + struct hyp_page *p; 174 + void *addr; 175 + 176 + addr = hyp_alloc_pages(&current_vm->pool, 0); 177 + if (addr) 178 + return 
addr; 179 + 180 + addr = pop_hyp_memcache(mc, hyp_phys_to_virt); 181 + if (!addr) 182 + return addr; 183 + 184 + memset(addr, 0, PAGE_SIZE); 185 + p = hyp_virt_to_page(addr); 186 + memset(p, 0, sizeof(*p)); 187 + p->refcount = 1; 188 + 189 + return addr; 190 + } 191 + 192 + static void guest_s2_get_page(void *addr) 193 + { 194 + hyp_get_page(&current_vm->pool, addr); 195 + } 196 + 197 + static void guest_s2_put_page(void *addr) 198 + { 199 + hyp_put_page(&current_vm->pool, addr); 200 + } 201 + 202 + static void clean_dcache_guest_page(void *va, size_t size) 203 + { 204 + __clean_dcache_guest_page(hyp_fixmap_map(__hyp_pa(va)), size); 205 + hyp_fixmap_unmap(); 206 + } 207 + 208 + static void invalidate_icache_guest_page(void *va, size_t size) 209 + { 210 + __invalidate_icache_guest_page(hyp_fixmap_map(__hyp_pa(va)), size); 211 + hyp_fixmap_unmap(); 212 + } 213 + 214 + int kvm_guest_prepare_stage2(struct pkvm_hyp_vm *vm, void *pgd) 215 + { 216 + struct kvm_s2_mmu *mmu = &vm->kvm.arch.mmu; 217 + unsigned long nr_pages; 218 + int ret; 219 + 220 + nr_pages = kvm_pgtable_stage2_pgd_size(vm->kvm.arch.vtcr) >> PAGE_SHIFT; 221 + ret = hyp_pool_init(&vm->pool, hyp_virt_to_pfn(pgd), nr_pages, 0); 222 + if (ret) 223 + return ret; 224 + 225 + hyp_spin_lock_init(&vm->lock); 226 + vm->mm_ops = (struct kvm_pgtable_mm_ops) { 227 + .zalloc_pages_exact = guest_s2_zalloc_pages_exact, 228 + .free_pages_exact = guest_s2_free_pages_exact, 229 + .zalloc_page = guest_s2_zalloc_page, 230 + .phys_to_virt = hyp_phys_to_virt, 231 + .virt_to_phys = hyp_virt_to_phys, 232 + .page_count = hyp_page_count, 233 + .get_page = guest_s2_get_page, 234 + .put_page = guest_s2_put_page, 235 + .dcache_clean_inval_poc = clean_dcache_guest_page, 236 + .icache_inval_pou = invalidate_icache_guest_page, 237 + }; 238 + 239 + guest_lock_component(vm); 240 + ret = __kvm_pgtable_stage2_init(mmu->pgt, mmu, &vm->mm_ops, 0, 241 + guest_stage2_force_pte_cb); 242 + guest_unlock_component(vm); 243 + if (ret) 244 + return 
ret; 245 + 246 + vm->kvm.arch.mmu.pgd_phys = __hyp_pa(vm->pgt.pgd); 247 + 248 + return 0; 249 + } 250 + 251 + void reclaim_guest_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc) 252 + { 253 + void *addr; 254 + 255 + /* Dump all pgtable pages in the hyp_pool */ 256 + guest_lock_component(vm); 257 + kvm_pgtable_stage2_destroy(&vm->pgt); 258 + vm->kvm.arch.mmu.pgd_phys = 0ULL; 259 + guest_unlock_component(vm); 260 + 261 + /* Drain the hyp_pool into the memcache */ 262 + addr = hyp_alloc_pages(&vm->pool, 0); 263 + while (addr) { 264 + memset(hyp_virt_to_page(addr), 0, sizeof(struct hyp_page)); 265 + push_hyp_memcache(mc, addr, hyp_virt_to_phys); 266 + WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(addr), 1)); 267 + addr = hyp_alloc_pages(&vm->pool, 0); 268 + } 269 + } 270 + 164 271 int __pkvm_prot_finalize(void) 165 272 { 166 - struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu; 273 + struct kvm_s2_mmu *mmu = &host_mmu.arch.mmu; 167 274 struct kvm_nvhe_init_params *params = this_cpu_ptr(&kvm_init_params); 168 275 169 276 if (params->hcr_el2 & HCR_VM) 170 277 return -EPERM; 171 278 172 279 params->vttbr = kvm_get_vttbr(mmu); 173 - params->vtcr = host_kvm.arch.vtcr; 280 + params->vtcr = host_mmu.arch.vtcr; 174 281 params->hcr_el2 |= HCR_VM; 175 282 kvm_flush_dcache_to_poc(params, sizeof(*params)); 176 283 177 284 write_sysreg(params->hcr_el2, hcr_el2); 178 - __load_stage2(&host_kvm.arch.mmu, &host_kvm.arch); 285 + __load_stage2(&host_mmu.arch.mmu, &host_mmu.arch); 179 286 180 287 /* 181 288 * Make sure to have an ISB before the TLB maintenance below but only ··· 318 175 319 176 static int host_stage2_unmap_dev_all(void) 320 177 { 321 - struct kvm_pgtable *pgt = &host_kvm.pgt; 178 + struct kvm_pgtable *pgt = &host_mmu.pgt; 322 179 struct memblock_region *reg; 323 180 u64 addr = 0; 324 181 int i, ret; ··· 338 195 u64 end; 339 196 }; 340 197 341 - static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range) 198 + static struct memblock_region 
*find_mem_range(phys_addr_t addr, struct kvm_mem_range *range) 342 199 { 343 200 int cur, left = 0, right = hyp_memblock_nr; 344 201 struct memblock_region *reg; ··· 361 218 } else { 362 219 range->start = reg->base; 363 220 range->end = end; 364 - return true; 221 + return reg; 365 222 } 366 223 } 367 224 368 - return false; 225 + return NULL; 369 226 } 370 227 371 228 bool addr_is_memory(phys_addr_t phys) 372 229 { 373 230 struct kvm_mem_range range; 374 231 375 - return find_mem_range(phys, &range); 232 + return !!find_mem_range(phys, &range); 233 + } 234 + 235 + static bool addr_is_allowed_memory(phys_addr_t phys) 236 + { 237 + struct memblock_region *reg; 238 + struct kvm_mem_range range; 239 + 240 + reg = find_mem_range(phys, &range); 241 + 242 + return reg && !(reg->flags & MEMBLOCK_NOMAP); 376 243 } 377 244 378 245 static bool is_in_mem_range(u64 addr, struct kvm_mem_range *range) ··· 403 250 static inline int __host_stage2_idmap(u64 start, u64 end, 404 251 enum kvm_pgtable_prot prot) 405 252 { 406 - return kvm_pgtable_stage2_map(&host_kvm.pgt, start, end - start, start, 407 - prot, &host_s2_pool); 253 + return kvm_pgtable_stage2_map(&host_mmu.pgt, start, end - start, start, 254 + prot, &host_s2_pool, 0); 408 255 } 409 256 410 257 /* ··· 416 263 #define host_stage2_try(fn, ...) 
\ 417 264 ({ \ 418 265 int __ret; \ 419 - hyp_assert_lock_held(&host_kvm.lock); \ 266 + hyp_assert_lock_held(&host_mmu.lock); \ 420 267 __ret = fn(__VA_ARGS__); \ 421 268 if (__ret == -ENOMEM) { \ 422 269 __ret = host_stage2_unmap_dev_all(); \ ··· 439 286 u32 level; 440 287 int ret; 441 288 442 - hyp_assert_lock_held(&host_kvm.lock); 443 - ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level); 289 + hyp_assert_lock_held(&host_mmu.lock); 290 + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, addr, &pte, &level); 444 291 if (ret) 445 292 return ret; 446 293 ··· 472 319 473 320 int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) 474 321 { 475 - return host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt, 322 + return host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, 476 323 addr, size, &host_s2_pool, owner_id); 477 324 } 478 325 ··· 501 348 static int host_stage2_idmap(u64 addr) 502 349 { 503 350 struct kvm_mem_range range; 504 - bool is_memory = find_mem_range(addr, &range); 351 + bool is_memory = !!find_mem_range(addr, &range); 505 352 enum kvm_pgtable_prot prot; 506 353 int ret; 507 354 ··· 533 380 BUG_ON(ret && ret != -EAGAIN); 534 381 } 535 382 536 - /* This corresponds to locking order */ 537 - enum pkvm_component_id { 538 - PKVM_ID_HOST, 539 - PKVM_ID_HYP, 540 - }; 541 - 542 383 struct pkvm_mem_transition { 543 384 u64 nr_pages; 544 385 ··· 546 399 /* Address in the completer's address space */ 547 400 u64 completer_addr; 548 401 } host; 402 + struct { 403 + u64 completer_addr; 404 + } hyp; 549 405 }; 550 406 } initiator; 551 407 ··· 562 412 const enum kvm_pgtable_prot completer_prot; 563 413 }; 564 414 415 + struct pkvm_mem_donation { 416 + const struct pkvm_mem_transition tx; 417 + }; 418 + 565 419 struct check_walk_data { 566 420 enum pkvm_page_state desired; 567 421 enum pkvm_page_state (*get_page_state)(kvm_pte_t pte); 568 422 }; 569 423 570 - static int __check_page_state_visitor(u64 addr, u64 end, u32 level, 571 - 
kvm_pte_t *ptep, 572 - enum kvm_pgtable_walk_flags flag, 573 - void * const arg) 424 + static int __check_page_state_visitor(const struct kvm_pgtable_visit_ctx *ctx, 425 + enum kvm_pgtable_walk_flags visit) 574 426 { 575 - struct check_walk_data *d = arg; 576 - kvm_pte_t pte = *ptep; 427 + struct check_walk_data *d = ctx->arg; 577 428 578 - if (kvm_pte_valid(pte) && !addr_is_memory(kvm_pte_to_phys(pte))) 429 + if (kvm_pte_valid(ctx->old) && !addr_is_allowed_memory(kvm_pte_to_phys(ctx->old))) 579 430 return -EINVAL; 580 431 581 - return d->get_page_state(pte) == d->desired ? 0 : -EPERM; 432 + return d->get_page_state(ctx->old) == d->desired ? 0 : -EPERM; 582 433 } 583 434 584 435 static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size, ··· 610 459 .get_page_state = host_get_page_state, 611 460 }; 612 461 613 - hyp_assert_lock_held(&host_kvm.lock); 614 - return check_page_state_range(&host_kvm.pgt, addr, size, &d); 462 + hyp_assert_lock_held(&host_mmu.lock); 463 + return check_page_state_range(&host_mmu.pgt, addr, size, &d); 615 464 } 616 465 617 466 static int __host_set_page_state_range(u64 addr, u64 size, ··· 662 511 return __host_set_page_state_range(addr, size, PKVM_PAGE_OWNED); 663 512 } 664 513 514 + static int host_initiate_donation(u64 *completer_addr, 515 + const struct pkvm_mem_transition *tx) 516 + { 517 + u8 owner_id = tx->completer.id; 518 + u64 size = tx->nr_pages * PAGE_SIZE; 519 + 520 + *completer_addr = tx->initiator.host.completer_addr; 521 + return host_stage2_set_owner_locked(tx->initiator.addr, size, owner_id); 522 + } 523 + 524 + static bool __host_ack_skip_pgtable_check(const struct pkvm_mem_transition *tx) 525 + { 526 + return !(IS_ENABLED(CONFIG_NVHE_EL2_DEBUG) || 527 + tx->initiator.id != PKVM_ID_HYP); 528 + } 529 + 530 + static int __host_ack_transition(u64 addr, const struct pkvm_mem_transition *tx, 531 + enum pkvm_page_state state) 532 + { 533 + u64 size = tx->nr_pages * PAGE_SIZE; 534 + 535 + if 
(__host_ack_skip_pgtable_check(tx)) 536 + return 0; 537 + 538 + return __host_check_page_state_range(addr, size, state); 539 + } 540 + 541 + static int host_ack_donation(u64 addr, const struct pkvm_mem_transition *tx) 542 + { 543 + return __host_ack_transition(addr, tx, PKVM_NOPAGE); 544 + } 545 + 546 + static int host_complete_donation(u64 addr, const struct pkvm_mem_transition *tx) 547 + { 548 + u64 size = tx->nr_pages * PAGE_SIZE; 549 + u8 host_id = tx->completer.id; 550 + 551 + return host_stage2_set_owner_locked(addr, size, host_id); 552 + } 553 + 665 554 static enum pkvm_page_state hyp_get_page_state(kvm_pte_t pte) 666 555 { 667 556 if (!kvm_pte_valid(pte)) ··· 720 529 721 530 hyp_assert_lock_held(&pkvm_pgd_lock); 722 531 return check_page_state_range(&pkvm_pgtable, addr, size, &d); 532 + } 533 + 534 + static int hyp_request_donation(u64 *completer_addr, 535 + const struct pkvm_mem_transition *tx) 536 + { 537 + u64 size = tx->nr_pages * PAGE_SIZE; 538 + u64 addr = tx->initiator.addr; 539 + 540 + *completer_addr = tx->initiator.hyp.completer_addr; 541 + return __hyp_check_page_state_range(addr, size, PKVM_PAGE_OWNED); 542 + } 543 + 544 + static int hyp_initiate_donation(u64 *completer_addr, 545 + const struct pkvm_mem_transition *tx) 546 + { 547 + u64 size = tx->nr_pages * PAGE_SIZE; 548 + int ret; 549 + 550 + *completer_addr = tx->initiator.hyp.completer_addr; 551 + ret = kvm_pgtable_hyp_unmap(&pkvm_pgtable, tx->initiator.addr, size); 552 + return (ret != size) ? 
-EFAULT : 0; 723 553 } 724 554 725 555 static bool __hyp_ack_skip_pgtable_check(const struct pkvm_mem_transition *tx) ··· 767 555 { 768 556 u64 size = tx->nr_pages * PAGE_SIZE; 769 557 558 + if (tx->initiator.id == PKVM_ID_HOST && hyp_page_count((void *)addr)) 559 + return -EBUSY; 560 + 770 561 if (__hyp_ack_skip_pgtable_check(tx)) 771 562 return 0; 772 563 773 564 return __hyp_check_page_state_range(addr, size, 774 565 PKVM_PAGE_SHARED_BORROWED); 566 + } 567 + 568 + static int hyp_ack_donation(u64 addr, const struct pkvm_mem_transition *tx) 569 + { 570 + u64 size = tx->nr_pages * PAGE_SIZE; 571 + 572 + if (__hyp_ack_skip_pgtable_check(tx)) 573 + return 0; 574 + 575 + return __hyp_check_page_state_range(addr, size, PKVM_NOPAGE); 775 576 } 776 577 777 578 static int hyp_complete_share(u64 addr, const struct pkvm_mem_transition *tx, ··· 803 578 int ret = kvm_pgtable_hyp_unmap(&pkvm_pgtable, addr, size); 804 579 805 580 return (ret != size) ? -EFAULT : 0; 581 + } 582 + 583 + static int hyp_complete_donation(u64 addr, 584 + const struct pkvm_mem_transition *tx) 585 + { 586 + void *start = (void *)addr, *end = start + (tx->nr_pages * PAGE_SIZE); 587 + enum kvm_pgtable_prot prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_OWNED); 588 + 589 + return pkvm_create_mappings_locked(start, end, prot); 806 590 } 807 591 808 592 static int check_share(struct pkvm_mem_share *share) ··· 966 732 return WARN_ON(__do_unshare(share)); 967 733 } 968 734 735 + static int check_donation(struct pkvm_mem_donation *donation) 736 + { 737 + const struct pkvm_mem_transition *tx = &donation->tx; 738 + u64 completer_addr; 739 + int ret; 740 + 741 + switch (tx->initiator.id) { 742 + case PKVM_ID_HOST: 743 + ret = host_request_owned_transition(&completer_addr, tx); 744 + break; 745 + case PKVM_ID_HYP: 746 + ret = hyp_request_donation(&completer_addr, tx); 747 + break; 748 + default: 749 + ret = -EINVAL; 750 + } 751 + 752 + if (ret) 753 + return ret; 754 + 755 + switch (tx->completer.id) { 756 + case 
PKVM_ID_HOST: 757 + ret = host_ack_donation(completer_addr, tx); 758 + break; 759 + case PKVM_ID_HYP: 760 + ret = hyp_ack_donation(completer_addr, tx); 761 + break; 762 + default: 763 + ret = -EINVAL; 764 + } 765 + 766 + return ret; 767 + } 768 + 769 + static int __do_donate(struct pkvm_mem_donation *donation) 770 + { 771 + const struct pkvm_mem_transition *tx = &donation->tx; 772 + u64 completer_addr; 773 + int ret; 774 + 775 + switch (tx->initiator.id) { 776 + case PKVM_ID_HOST: 777 + ret = host_initiate_donation(&completer_addr, tx); 778 + break; 779 + case PKVM_ID_HYP: 780 + ret = hyp_initiate_donation(&completer_addr, tx); 781 + break; 782 + default: 783 + ret = -EINVAL; 784 + } 785 + 786 + if (ret) 787 + return ret; 788 + 789 + switch (tx->completer.id) { 790 + case PKVM_ID_HOST: 791 + ret = host_complete_donation(completer_addr, tx); 792 + break; 793 + case PKVM_ID_HYP: 794 + ret = hyp_complete_donation(completer_addr, tx); 795 + break; 796 + default: 797 + ret = -EINVAL; 798 + } 799 + 800 + return ret; 801 + } 802 + 803 + /* 804 + * do_donate(): 805 + * 806 + * The page owner transfers ownership to another component, losing access 807 + * as a consequence. 
808 + * 809 + * Initiator: OWNED => NOPAGE 810 + * Completer: NOPAGE => OWNED 811 + */ 812 + static int do_donate(struct pkvm_mem_donation *donation) 813 + { 814 + int ret; 815 + 816 + ret = check_donation(donation); 817 + if (ret) 818 + return ret; 819 + 820 + return WARN_ON(__do_donate(donation)); 821 + } 822 + 969 823 int __pkvm_host_share_hyp(u64 pfn) 970 824 { 971 825 int ret; ··· 1118 796 host_unlock_component(); 1119 797 1120 798 return ret; 799 + } 800 + 801 + int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages) 802 + { 803 + int ret; 804 + u64 host_addr = hyp_pfn_to_phys(pfn); 805 + u64 hyp_addr = (u64)__hyp_va(host_addr); 806 + struct pkvm_mem_donation donation = { 807 + .tx = { 808 + .nr_pages = nr_pages, 809 + .initiator = { 810 + .id = PKVM_ID_HOST, 811 + .addr = host_addr, 812 + .host = { 813 + .completer_addr = hyp_addr, 814 + }, 815 + }, 816 + .completer = { 817 + .id = PKVM_ID_HYP, 818 + }, 819 + }, 820 + }; 821 + 822 + host_lock_component(); 823 + hyp_lock_component(); 824 + 825 + ret = do_donate(&donation); 826 + 827 + hyp_unlock_component(); 828 + host_unlock_component(); 829 + 830 + return ret; 831 + } 832 + 833 + int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages) 834 + { 835 + int ret; 836 + u64 host_addr = hyp_pfn_to_phys(pfn); 837 + u64 hyp_addr = (u64)__hyp_va(host_addr); 838 + struct pkvm_mem_donation donation = { 839 + .tx = { 840 + .nr_pages = nr_pages, 841 + .initiator = { 842 + .id = PKVM_ID_HYP, 843 + .addr = hyp_addr, 844 + .hyp = { 845 + .completer_addr = host_addr, 846 + }, 847 + }, 848 + .completer = { 849 + .id = PKVM_ID_HOST, 850 + }, 851 + }, 852 + }; 853 + 854 + host_lock_component(); 855 + hyp_lock_component(); 856 + 857 + ret = do_donate(&donation); 858 + 859 + hyp_unlock_component(); 860 + host_unlock_component(); 861 + 862 + return ret; 863 + } 864 + 865 + int hyp_pin_shared_mem(void *from, void *to) 866 + { 867 + u64 cur, start = ALIGN_DOWN((u64)from, PAGE_SIZE); 868 + u64 end = PAGE_ALIGN((u64)to); 869 + u64 size = end - 
start; 870 + int ret; 871 + 872 + host_lock_component(); 873 + hyp_lock_component(); 874 + 875 + ret = __host_check_page_state_range(__hyp_pa(start), size, 876 + PKVM_PAGE_SHARED_OWNED); 877 + if (ret) 878 + goto unlock; 879 + 880 + ret = __hyp_check_page_state_range(start, size, 881 + PKVM_PAGE_SHARED_BORROWED); 882 + if (ret) 883 + goto unlock; 884 + 885 + for (cur = start; cur < end; cur += PAGE_SIZE) 886 + hyp_page_ref_inc(hyp_virt_to_page(cur)); 887 + 888 + unlock: 889 + hyp_unlock_component(); 890 + host_unlock_component(); 891 + 892 + return ret; 893 + } 894 + 895 + void hyp_unpin_shared_mem(void *from, void *to) 896 + { 897 + u64 cur, start = ALIGN_DOWN((u64)from, PAGE_SIZE); 898 + u64 end = PAGE_ALIGN((u64)to); 899 + 900 + host_lock_component(); 901 + hyp_lock_component(); 902 + 903 + for (cur = start; cur < end; cur += PAGE_SIZE) 904 + hyp_page_ref_dec(hyp_virt_to_page(cur)); 905 + 906 + hyp_unlock_component(); 907 + host_unlock_component(); 1121 908 }
+163 -4
arch/arm64/kvm/hyp/nvhe/mm.c
··· 14 14 #include <nvhe/early_alloc.h> 15 15 #include <nvhe/gfp.h> 16 16 #include <nvhe/memory.h> 17 + #include <nvhe/mem_protect.h> 17 18 #include <nvhe/mm.h> 18 19 #include <nvhe/spinlock.h> 19 20 ··· 25 24 unsigned int hyp_memblock_nr; 26 25 27 26 static u64 __io_map_base; 27 + 28 + struct hyp_fixmap_slot { 29 + u64 addr; 30 + kvm_pte_t *ptep; 31 + }; 32 + static DEFINE_PER_CPU(struct hyp_fixmap_slot, fixmap_slots); 28 33 29 34 static int __pkvm_create_mappings(unsigned long start, unsigned long size, 30 35 unsigned long phys, enum kvm_pgtable_prot prot) ··· 136 129 return ret; 137 130 } 138 131 139 - int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back) 132 + int hyp_back_vmemmap(phys_addr_t back) 140 133 { 141 - unsigned long start, end; 134 + unsigned long i, start, size, end = 0; 135 + int ret; 142 136 143 - hyp_vmemmap_range(phys, size, &start, &end); 137 + for (i = 0; i < hyp_memblock_nr; i++) { 138 + start = hyp_memory[i].base; 139 + start = ALIGN_DOWN((u64)hyp_phys_to_page(start), PAGE_SIZE); 140 + /* 141 + * The beginning of the hyp_vmemmap region for the current 142 + * memblock may already be backed by the page backing the end 143 + * of the previous region, so avoid mapping it twice. 
144 + */ 145 + start = max(start, end); 144 146 
 145 - return __pkvm_create_mappings(start, end - start, back, PAGE_HYP); 147 + end = hyp_memory[i].base + hyp_memory[i].size; 148 + end = PAGE_ALIGN((u64)hyp_phys_to_page(end)); 149 + if (start >= end) 150 + continue; 151 + 152 + size = end - start; 153 + ret = __pkvm_create_mappings(start, size, back, PAGE_HYP); 154 + if (ret) 155 + return ret; 156 + 157 + memset(hyp_phys_to_virt(back), 0, size); 158 + back += size; 159 + } 160 + 161 + return 0; 146 162 } 147 163 148 164 static void *__hyp_bp_vect_base; ··· 219 189 return 0; 220 190 } 221 191 192 + void *hyp_fixmap_map(phys_addr_t phys) 193 + { 194 + struct hyp_fixmap_slot *slot = this_cpu_ptr(&fixmap_slots); 195 + kvm_pte_t pte, *ptep = slot->ptep; 196 + 197 + pte = *ptep; 198 + pte &= ~kvm_phys_to_pte(KVM_PHYS_INVALID); 199 + pte |= kvm_phys_to_pte(phys) | KVM_PTE_VALID; 200 + WRITE_ONCE(*ptep, pte); 201 + dsb(ishst); 202 + 203 + return (void *)slot->addr; 204 + } 205 + 206 + static void fixmap_clear_slot(struct hyp_fixmap_slot *slot) 207 + { 208 + kvm_pte_t *ptep = slot->ptep; 209 + u64 addr = slot->addr; 210 + 211 + WRITE_ONCE(*ptep, *ptep & ~KVM_PTE_VALID); 212 + 213 + /* 214 + * Irritatingly, the architecture requires that we use inner-shareable 215 + * broadcast TLB invalidation here in case another CPU speculates 216 + * through our fixmap and decides to create an "amalgamation of the 217 + * values held in the TLB" due to the apparent lack of a 218 + * break-before-make sequence. 
219 + * 220 + * https://lore.kernel.org/kvm/20221017115209.2099-1-will@kernel.org/T/#mf10dfbaf1eaef9274c581b81c53758918c1d0f03 221 + */ 222 + dsb(ishst); 223 + __tlbi_level(vale2is, __TLBI_VADDR(addr, 0), (KVM_PGTABLE_MAX_LEVELS - 1)); 224 + dsb(ish); 225 + isb(); 226 + } 227 + 228 + void hyp_fixmap_unmap(void) 229 + { 230 + fixmap_clear_slot(this_cpu_ptr(&fixmap_slots)); 231 + } 232 + 233 + static int __create_fixmap_slot_cb(const struct kvm_pgtable_visit_ctx *ctx, 234 + enum kvm_pgtable_walk_flags visit) 235 + { 236 + struct hyp_fixmap_slot *slot = per_cpu_ptr(&fixmap_slots, (u64)ctx->arg); 237 + 238 + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MAX_LEVELS - 1) 239 + return -EINVAL; 240 + 241 + slot->addr = ctx->addr; 242 + slot->ptep = ctx->ptep; 243 + 244 + /* 245 + * Clear the PTE, but keep the page-table page refcount elevated to 246 + * prevent it from ever being freed. This lets us manipulate the PTEs 247 + * by hand safely without ever needing to allocate memory. 248 + */ 249 + fixmap_clear_slot(slot); 250 + 251 + return 0; 252 + } 253 + 254 + static int create_fixmap_slot(u64 addr, u64 cpu) 255 + { 256 + struct kvm_pgtable_walker walker = { 257 + .cb = __create_fixmap_slot_cb, 258 + .flags = KVM_PGTABLE_WALK_LEAF, 259 + .arg = (void *)cpu, 260 + }; 261 + 262 + return kvm_pgtable_walk(&pkvm_pgtable, addr, PAGE_SIZE, &walker); 263 + } 264 + 265 + int hyp_create_pcpu_fixmap(void) 266 + { 267 + unsigned long addr, i; 268 + int ret; 269 + 270 + for (i = 0; i < hyp_nr_cpus; i++) { 271 + ret = pkvm_alloc_private_va_range(PAGE_SIZE, &addr); 272 + if (ret) 273 + return ret; 274 + 275 + ret = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, PAGE_SIZE, 276 + __hyp_pa(__hyp_bss_start), PAGE_HYP); 277 + if (ret) 278 + return ret; 279 + 280 + ret = create_fixmap_slot(addr, i); 281 + if (ret) 282 + return ret; 283 + } 284 + 285 + return 0; 286 + } 287 + 222 288 int hyp_create_idmap(u32 hyp_va_bits) 223 289 { 224 290 unsigned long start, end; ··· 338 212 
__hyp_vmemmap = __io_map_base | BIT(hyp_va_bits - 3); 339 213 340 214 return __pkvm_create_mappings(start, end - start, start, PAGE_HYP_EXEC); 215 } 216 + 217 + static void *admit_host_page(void *arg) 218 + { 219 + struct kvm_hyp_memcache *host_mc = arg; 220 + 221 + if (!host_mc->nr_pages) 222 + return NULL; 223 + 224 + /* 225 + * The host still owns the pages in its memcache, so we need to go 226 + * through a full host-to-hyp donation cycle to change it. Fortunately, 227 + * __pkvm_host_donate_hyp() takes care of races for us, so if it 228 + * succeeds we're good to go. 229 + */ 230 + if (__pkvm_host_donate_hyp(hyp_phys_to_pfn(host_mc->head), 1)) 231 + return NULL; 232 + 233 + return pop_hyp_memcache(host_mc, hyp_phys_to_virt); 234 + } 235 + 236 + /* Refill our local memcache by popping pages from the one provided by the host. */ 237 + int refill_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages, 238 + struct kvm_hyp_memcache *host_mc) 239 + { 240 + struct kvm_hyp_memcache tmp = *host_mc; 241 + int ret; 242 + 243 + ret = __topup_hyp_memcache(mc, min_pages, admit_host_page, 244 + hyp_virt_to_phys, &tmp); 245 + *host_mc = tmp; 246 + 247 + return ret; 341 248 }
+7 -22
arch/arm64/kvm/hyp/nvhe/page_alloc.c
··· 93 93 static void __hyp_attach_page(struct hyp_pool *pool, 94 94 struct hyp_page *p) 95 95 { 96 + phys_addr_t phys = hyp_page_to_phys(p); 96 97 unsigned short order = p->order; 97 98 struct hyp_page *buddy; 98 99 99 100 memset(hyp_page_to_virt(p), 0, PAGE_SIZE << p->order); 101 + 102 + /* Skip coalescing for 'external' pages being freed into the pool. */ 103 + if (phys < pool->range_start || phys >= pool->range_end) 104 + goto insert; 100 105 101 106 /* 102 107 * Only the first struct hyp_page of a high-order page (otherwise known ··· 121 116 p = min(p, buddy); 122 117 } 123 118 119 + insert: 124 120 /* Mark the new head, and insert it */ 125 121 p->order = order; 126 122 page_add_to_list(p, &pool->free_area[order]); ··· 148 142 } 149 143 150 144 return p; 151 - } 152 - 153 - static inline void hyp_page_ref_inc(struct hyp_page *p) 154 - { 155 - BUG_ON(p->refcount == USHRT_MAX); 156 - p->refcount++; 157 - } 158 - 159 - static inline int hyp_page_ref_dec_and_test(struct hyp_page *p) 160 - { 161 - BUG_ON(!p->refcount); 162 - p->refcount--; 163 - return (p->refcount == 0); 164 - } 165 - 166 - static inline void hyp_set_page_refcounted(struct hyp_page *p) 167 - { 168 - BUG_ON(p->refcount); 169 - p->refcount = 1; 170 145 } 171 146 172 147 static void __hyp_put_page(struct hyp_pool *pool, struct hyp_page *p) ··· 236 249 237 250 /* Init the vmemmap portion */ 238 251 p = hyp_phys_to_page(phys); 239 - for (i = 0; i < nr_pages; i++) { 240 - p[i].order = 0; 252 + for (i = 0; i < nr_pages; i++) 241 253 hyp_set_page_refcounted(&p[i]); 242 - } 243 254 244 255 /* Attach the unused pages to the buddy tree */ 245 256 for (i = reserved_pages; i < nr_pages; i++)
+436
arch/arm64/kvm/hyp/nvhe/pkvm.c
··· 7 7 #include <linux/kvm_host.h> 8 8 #include <linux/mm.h> 9 9 #include <nvhe/fixed_config.h> 10 + #include <nvhe/mem_protect.h> 11 + #include <nvhe/memory.h> 12 + #include <nvhe/pkvm.h> 10 13 #include <nvhe/trap_handler.h> 14 + 15 + /* Used by icache_is_vpipt(). */ 16 + unsigned long __icache_flags; 17 + 18 + /* Used by kvm_get_vttbr(). */ 19 + unsigned int kvm_arm_vmid_bits; 11 20 12 21 /* 13 22 * Set trap register values based on features in ID_AA64PFR0. ··· 191 182 pvm_init_traps_aa64dfr0(vcpu); 192 183 pvm_init_traps_aa64mmfr0(vcpu); 193 184 pvm_init_traps_aa64mmfr1(vcpu); 185 + } 186 + 187 + /* 188 + * Start the VM table handle at the offset defined instead of at 0. 189 + * Mainly for sanity checking and debugging. 190 + */ 191 + #define HANDLE_OFFSET 0x1000 192 + 193 + static unsigned int vm_handle_to_idx(pkvm_handle_t handle) 194 + { 195 + return handle - HANDLE_OFFSET; 196 + } 197 + 198 + static pkvm_handle_t idx_to_vm_handle(unsigned int idx) 199 + { 200 + return idx + HANDLE_OFFSET; 201 + } 202 + 203 + /* 204 + * Spinlock for protecting state related to the VM table. Protects writes 205 + * to 'vm_table' and 'nr_table_entries' as well as reads and writes to 206 + * 'last_hyp_vcpu_lookup'. 207 + */ 208 + static DEFINE_HYP_SPINLOCK(vm_table_lock); 209 + 210 + /* 211 + * The table of VM entries for protected VMs in hyp. 212 + * Allocated at hyp initialization and setup. 213 + */ 214 + static struct pkvm_hyp_vm **vm_table; 215 + 216 + void pkvm_hyp_vm_table_init(void *tbl) 217 + { 218 + WARN_ON(vm_table); 219 + vm_table = tbl; 220 + } 221 + 222 + /* 223 + * Return the hyp vm structure corresponding to the handle. 
224 + */ 225 + static struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) 226 + { 227 + unsigned int idx = vm_handle_to_idx(handle); 228 + 229 + if (unlikely(idx >= KVM_MAX_PVMS)) 230 + return NULL; 231 + 232 + return vm_table[idx]; 233 + } 234 + 235 + struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, 236 + unsigned int vcpu_idx) 237 + { 238 + struct pkvm_hyp_vcpu *hyp_vcpu = NULL; 239 + struct pkvm_hyp_vm *hyp_vm; 240 + 241 + hyp_spin_lock(&vm_table_lock); 242 + hyp_vm = get_vm_by_handle(handle); 243 + if (!hyp_vm || hyp_vm->nr_vcpus <= vcpu_idx) 244 + goto unlock; 245 + 246 + hyp_vcpu = hyp_vm->vcpus[vcpu_idx]; 247 + hyp_page_ref_inc(hyp_virt_to_page(hyp_vm)); 248 + unlock: 249 + hyp_spin_unlock(&vm_table_lock); 250 + return hyp_vcpu; 251 + } 252 + 253 + void pkvm_put_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu) 254 + { 255 + struct pkvm_hyp_vm *hyp_vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu); 256 + 257 + hyp_spin_lock(&vm_table_lock); 258 + hyp_page_ref_dec(hyp_virt_to_page(hyp_vm)); 259 + hyp_spin_unlock(&vm_table_lock); 260 + } 261 + 262 + static void unpin_host_vcpu(struct kvm_vcpu *host_vcpu) 263 + { 264 + if (host_vcpu) 265 + hyp_unpin_shared_mem(host_vcpu, host_vcpu + 1); 266 + } 267 + 268 + static void unpin_host_vcpus(struct pkvm_hyp_vcpu *hyp_vcpus[], 269 + unsigned int nr_vcpus) 270 + { 271 + int i; 272 + 273 + for (i = 0; i < nr_vcpus; i++) 274 + unpin_host_vcpu(hyp_vcpus[i]->host_vcpu); 275 + } 276 + 277 + static void init_pkvm_hyp_vm(struct kvm *host_kvm, struct pkvm_hyp_vm *hyp_vm, 278 + unsigned int nr_vcpus) 279 + { 280 + hyp_vm->host_kvm = host_kvm; 281 + hyp_vm->kvm.created_vcpus = nr_vcpus; 282 + hyp_vm->kvm.arch.vtcr = host_mmu.arch.vtcr; 283 + } 284 + 285 + static int init_pkvm_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu, 286 + struct pkvm_hyp_vm *hyp_vm, 287 + struct kvm_vcpu *host_vcpu, 288 + unsigned int vcpu_idx) 289 + { 290 + int ret = 0; 291 + 292 + if (hyp_pin_shared_mem(host_vcpu, host_vcpu + 1)) 293 + return -EBUSY; 294 + 295 
+ if (host_vcpu->vcpu_idx != vcpu_idx) { 296 + ret = -EINVAL; 297 + goto done; 298 + } 299 + 300 + hyp_vcpu->host_vcpu = host_vcpu; 301 + 302 + hyp_vcpu->vcpu.kvm = &hyp_vm->kvm; 303 + hyp_vcpu->vcpu.vcpu_id = READ_ONCE(host_vcpu->vcpu_id); 304 + hyp_vcpu->vcpu.vcpu_idx = vcpu_idx; 305 + 306 + hyp_vcpu->vcpu.arch.hw_mmu = &hyp_vm->kvm.arch.mmu; 307 + hyp_vcpu->vcpu.arch.cflags = READ_ONCE(host_vcpu->arch.cflags); 308 + done: 309 + if (ret) 310 + unpin_host_vcpu(host_vcpu); 311 + return ret; 312 + } 313 + 314 + static int find_free_vm_table_entry(struct kvm *host_kvm) 315 + { 316 + int i; 317 + 318 + for (i = 0; i < KVM_MAX_PVMS; ++i) { 319 + if (!vm_table[i]) 320 + return i; 321 + } 322 + 323 + return -ENOMEM; 324 + } 325 + 326 + /* 327 + * Allocate a VM table entry and insert a pointer to the new vm. 328 + * 329 + * Return a unique handle to the protected VM on success, 330 + * negative error code on failure. 331 + */ 332 + static pkvm_handle_t insert_vm_table_entry(struct kvm *host_kvm, 333 + struct pkvm_hyp_vm *hyp_vm) 334 + { 335 + struct kvm_s2_mmu *mmu = &hyp_vm->kvm.arch.mmu; 336 + int idx; 337 + 338 + hyp_assert_lock_held(&vm_table_lock); 339 + 340 + /* 341 + * Initializing protected state might have failed, yet a malicious 342 + * host could trigger this function. Thus, ensure that 'vm_table' 343 + * exists. 344 + */ 345 + if (unlikely(!vm_table)) 346 + return -EINVAL; 347 + 348 + idx = find_free_vm_table_entry(host_kvm); 349 + if (idx < 0) 350 + return idx; 351 + 352 + hyp_vm->kvm.arch.pkvm.handle = idx_to_vm_handle(idx); 353 + 354 + /* VMID 0 is reserved for the host */ 355 + atomic64_set(&mmu->vmid.id, idx + 1); 356 + 357 + mmu->arch = &hyp_vm->kvm.arch; 358 + mmu->pgt = &hyp_vm->pgt; 359 + 360 + vm_table[idx] = hyp_vm; 361 + return hyp_vm->kvm.arch.pkvm.handle; 362 + } 363 + 364 + /* 365 + * Deallocate and remove the VM table entry corresponding to the handle. 
366 + */ 367 + static void remove_vm_table_entry(pkvm_handle_t handle) 368 + { 369 + hyp_assert_lock_held(&vm_table_lock); 370 + vm_table[vm_handle_to_idx(handle)] = NULL; 371 + } 372 + 373 + static size_t pkvm_get_hyp_vm_size(unsigned int nr_vcpus) 374 + { 375 + return size_add(sizeof(struct pkvm_hyp_vm), 376 + size_mul(sizeof(struct pkvm_hyp_vcpu *), nr_vcpus)); 377 + } 378 + 379 + static void *map_donated_memory_noclear(unsigned long host_va, size_t size) 380 + { 381 + void *va = (void *)kern_hyp_va(host_va); 382 + 383 + if (!PAGE_ALIGNED(va)) 384 + return NULL; 385 + 386 + if (__pkvm_host_donate_hyp(hyp_virt_to_pfn(va), 387 + PAGE_ALIGN(size) >> PAGE_SHIFT)) 388 + return NULL; 389 + 390 + return va; 391 + } 392 + 393 + static void *map_donated_memory(unsigned long host_va, size_t size) 394 + { 395 + void *va = map_donated_memory_noclear(host_va, size); 396 + 397 + if (va) 398 + memset(va, 0, size); 399 + 400 + return va; 401 + } 402 + 403 + static void __unmap_donated_memory(void *va, size_t size) 404 + { 405 + WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(va), 406 + PAGE_ALIGN(size) >> PAGE_SHIFT)); 407 + } 408 + 409 + static void unmap_donated_memory(void *va, size_t size) 410 + { 411 + if (!va) 412 + return; 413 + 414 + memset(va, 0, size); 415 + __unmap_donated_memory(va, size); 416 + } 417 + 418 + static void unmap_donated_memory_noclear(void *va, size_t size) 419 + { 420 + if (!va) 421 + return; 422 + 423 + __unmap_donated_memory(va, size); 424 + } 425 + 426 + /* 427 + * Initialize the hypervisor copy of the protected VM state using the 428 + * memory donated by the host. 429 + * 430 + * Unmaps the donated memory from the host at stage 2. 431 + * 432 + * host_kvm: A pointer to the host's struct kvm. 433 + * vm_hva: The host va of the area being donated for the VM state. 434 + * Must be page aligned. 435 + * pgd_hva: The host va of the area being donated for the stage-2 PGD for 436 + * the VM. Must be page aligned. 
Its size is implied by the VM's 437 + * VTCR. 438 + * 439 + * Return a unique handle to the protected VM on success, 440 + * negative error code on failure. 441 + */ 442 + int __pkvm_init_vm(struct kvm *host_kvm, unsigned long vm_hva, 443 + unsigned long pgd_hva) 444 + { 445 + struct pkvm_hyp_vm *hyp_vm = NULL; 446 + size_t vm_size, pgd_size; 447 + unsigned int nr_vcpus; 448 + void *pgd = NULL; 449 + int ret; 450 + 451 + ret = hyp_pin_shared_mem(host_kvm, host_kvm + 1); 452 + if (ret) 453 + return ret; 454 + 455 + nr_vcpus = READ_ONCE(host_kvm->created_vcpus); 456 + if (nr_vcpus < 1) { 457 + ret = -EINVAL; 458 + goto err_unpin_kvm; 459 + } 460 + 461 + vm_size = pkvm_get_hyp_vm_size(nr_vcpus); 462 + pgd_size = kvm_pgtable_stage2_pgd_size(host_mmu.arch.vtcr); 463 + 464 + ret = -ENOMEM; 465 + 466 + hyp_vm = map_donated_memory(vm_hva, vm_size); 467 + if (!hyp_vm) 468 + goto err_remove_mappings; 469 + 470 + pgd = map_donated_memory_noclear(pgd_hva, pgd_size); 471 + if (!pgd) 472 + goto err_remove_mappings; 473 + 474 + init_pkvm_hyp_vm(host_kvm, hyp_vm, nr_vcpus); 475 + 476 + hyp_spin_lock(&vm_table_lock); 477 + ret = insert_vm_table_entry(host_kvm, hyp_vm); 478 + if (ret < 0) 479 + goto err_unlock; 480 + 481 + ret = kvm_guest_prepare_stage2(hyp_vm, pgd); 482 + if (ret) 483 + goto err_remove_vm_table_entry; 484 + hyp_spin_unlock(&vm_table_lock); 485 + 486 + return hyp_vm->kvm.arch.pkvm.handle; 487 + 488 + err_remove_vm_table_entry: 489 + remove_vm_table_entry(hyp_vm->kvm.arch.pkvm.handle); 490 + err_unlock: 491 + hyp_spin_unlock(&vm_table_lock); 492 + err_remove_mappings: 493 + unmap_donated_memory(hyp_vm, vm_size); 494 + unmap_donated_memory(pgd, pgd_size); 495 + err_unpin_kvm: 496 + hyp_unpin_shared_mem(host_kvm, host_kvm + 1); 497 + return ret; 498 + } 499 + 500 + /* 501 + * Initialize the hypervisor copy of the protected vCPU state using the 502 + * memory donated by the host. 503 + * 504 + * handle: The handle for the protected vm. 
505 + * host_vcpu: A pointer to the corresponding host vcpu. 506 + * vcpu_hva: The host va of the area being donated for the vcpu state. 507 + * Must be page aligned. The size of the area must be equal to 508 + * the page-aligned size of 'struct pkvm_hyp_vcpu'. 509 + * Return 0 on success, negative error code on failure. 510 + */ 511 + int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu, 512 + unsigned long vcpu_hva) 513 + { 514 + struct pkvm_hyp_vcpu *hyp_vcpu; 515 + struct pkvm_hyp_vm *hyp_vm; 516 + unsigned int idx; 517 + int ret; 518 + 519 + hyp_vcpu = map_donated_memory(vcpu_hva, sizeof(*hyp_vcpu)); 520 + if (!hyp_vcpu) 521 + return -ENOMEM; 522 + 523 + hyp_spin_lock(&vm_table_lock); 524 + 525 + hyp_vm = get_vm_by_handle(handle); 526 + if (!hyp_vm) { 527 + ret = -ENOENT; 528 + goto unlock; 529 + } 530 + 531 + idx = hyp_vm->nr_vcpus; 532 + if (idx >= hyp_vm->kvm.created_vcpus) { 533 + ret = -EINVAL; 534 + goto unlock; 535 + } 536 + 537 + ret = init_pkvm_hyp_vcpu(hyp_vcpu, hyp_vm, host_vcpu, idx); 538 + if (ret) 539 + goto unlock; 540 + 541 + hyp_vm->vcpus[idx] = hyp_vcpu; 542 + hyp_vm->nr_vcpus++; 543 + unlock: 544 + hyp_spin_unlock(&vm_table_lock); 545 + 546 + if (ret) 547 + unmap_donated_memory(hyp_vcpu, sizeof(*hyp_vcpu)); 548 + 549 + return ret; 550 + } 551 + 552 + static void 553 + teardown_donated_memory(struct kvm_hyp_memcache *mc, void *addr, size_t size) 554 + { 555 + size = PAGE_ALIGN(size); 556 + memset(addr, 0, size); 557 + 558 + for (void *start = addr; start < addr + size; start += PAGE_SIZE) 559 + push_hyp_memcache(mc, start, hyp_virt_to_phys); 560 + 561 + unmap_donated_memory_noclear(addr, size); 562 + } 563 + 564 + int __pkvm_teardown_vm(pkvm_handle_t handle) 565 + { 566 + struct kvm_hyp_memcache *mc; 567 + struct pkvm_hyp_vm *hyp_vm; 568 + struct kvm *host_kvm; 569 + unsigned int idx; 570 + size_t vm_size; 571 + int err; 572 + 573 + hyp_spin_lock(&vm_table_lock); 574 + hyp_vm = get_vm_by_handle(handle); 575 + if (!hyp_vm) { 
576 + err = -ENOENT; 577 + goto err_unlock; 578 + } 579 + 580 + if (WARN_ON(hyp_page_count(hyp_vm))) { 581 + err = -EBUSY; 582 + goto err_unlock; 583 + } 584 + 585 + host_kvm = hyp_vm->host_kvm; 586 + 587 + /* Ensure the VMID is clean before it can be reallocated */ 588 + __kvm_tlb_flush_vmid(&hyp_vm->kvm.arch.mmu); 589 + remove_vm_table_entry(handle); 590 + hyp_spin_unlock(&vm_table_lock); 591 + 592 + /* Reclaim guest pages (including page-table pages) */ 593 + mc = &host_kvm->arch.pkvm.teardown_mc; 594 + reclaim_guest_pages(hyp_vm, mc); 595 + unpin_host_vcpus(hyp_vm->vcpus, hyp_vm->nr_vcpus); 596 + 597 + /* Push the metadata pages to the teardown memcache */ 598 + for (idx = 0; idx < hyp_vm->nr_vcpus; ++idx) { 599 + struct pkvm_hyp_vcpu *hyp_vcpu = hyp_vm->vcpus[idx]; 600 + 601 + teardown_donated_memory(mc, hyp_vcpu, sizeof(*hyp_vcpu)); 602 + } 603 + 604 + vm_size = pkvm_get_hyp_vm_size(hyp_vm->kvm.created_vcpus); 605 + teardown_donated_memory(mc, hyp_vm, vm_size); 606 + hyp_unpin_shared_mem(host_kvm, host_kvm + 1); 607 + return 0; 608 + 609 + err_unlock: 610 + hyp_spin_unlock(&vm_table_lock); 611 + return err; 194 612 }
+61 -37
arch/arm64/kvm/hyp/nvhe/setup.c
··· 16 16 #include <nvhe/memory.h> 17 17 #include <nvhe/mem_protect.h> 18 18 #include <nvhe/mm.h> 19 + #include <nvhe/pkvm.h> 19 20 #include <nvhe/trap_handler.h> 20 21 21 22 unsigned long hyp_nr_cpus; ··· 25 24 (unsigned long)__per_cpu_start) 26 25 27 26 static void *vmemmap_base; 27 + static void *vm_table_base; 28 28 static void *hyp_pgt_base; 29 29 static void *host_s2_pgt_base; 30 30 static struct kvm_pgtable_mm_ops pkvm_pgtable_mm_ops; ··· 33 31 34 32 static int divide_memory_pool(void *virt, unsigned long size) 35 33 { 36 - unsigned long vstart, vend, nr_pages; 34 + unsigned long nr_pages; 37 35 38 36 hyp_early_alloc_init(virt, size); 39 37 40 - hyp_vmemmap_range(__hyp_pa(virt), size, &vstart, &vend); 41 - nr_pages = (vend - vstart) >> PAGE_SHIFT; 38 + nr_pages = hyp_vmemmap_pages(sizeof(struct hyp_page)); 42 39 vmemmap_base = hyp_early_alloc_contig(nr_pages); 43 40 if (!vmemmap_base) 41 + return -ENOMEM; 42 + 43 + nr_pages = hyp_vm_table_pages(); 44 + vm_table_base = hyp_early_alloc_contig(nr_pages); 45 + if (!vm_table_base) 44 46 return -ENOMEM; 45 47 46 48 nr_pages = hyp_s1_pgtable_pages(); ··· 84 78 if (ret) 85 79 return ret; 86 80 87 - ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base)); 81 + ret = hyp_back_vmemmap(hyp_virt_to_phys(vmemmap_base)); 88 82 if (ret) 89 83 return ret; 90 84 ··· 144 138 } 145 139 146 140 /* 147 - * Map the host's .bss and .rodata sections RO in the hypervisor, but 148 - * transfer the ownership from the host to the hypervisor itself to 149 - * make sure it can't be donated or shared with another entity. 141 + * Map the host sections RO in the hypervisor, but transfer the 142 + * ownership from the host to the hypervisor itself to make sure they 143 + * can't be donated or shared with another entity. 150 144 * 151 145 * The ownership transition requires matching changes in the host 152 146 * stage-2. This will be done later (see finalize_host_mappings()) once 153 147 * the hyp_vmemmap is addressable. 
154 148 */ 155 149 prot = pkvm_mkstate(PAGE_HYP_RO, PKVM_PAGE_SHARED_OWNED); 156 - ret = pkvm_create_mappings(__start_rodata, __end_rodata, prot); 157 - if (ret) 158 - return ret; 159 - 160 - ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, prot); 150 + ret = pkvm_create_mappings(&kvm_vgic_global_state, 151 + &kvm_vgic_global_state + 1, prot); 161 152 if (ret) 162 153 return ret; 163 154 ··· 189 186 hyp_put_page(&hpool, addr); 190 187 } 191 188 192 - static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level, 193 - kvm_pte_t *ptep, 194 - enum kvm_pgtable_walk_flags flag, 195 - void * const arg) 189 + static int fix_host_ownership_walker(const struct kvm_pgtable_visit_ctx *ctx, 190 + enum kvm_pgtable_walk_flags visit) 196 191 { 197 - struct kvm_pgtable_mm_ops *mm_ops = arg; 198 192 enum kvm_pgtable_prot prot; 199 193 enum pkvm_page_state state; 200 - kvm_pte_t pte = *ptep; 201 194 phys_addr_t phys; 202 195 203 - if (!kvm_pte_valid(pte)) 196 + if (!kvm_pte_valid(ctx->old)) 204 197 return 0; 205 198 206 - /* 207 - * Fix-up the refcount for the page-table pages as the early allocator 208 - * was unable to access the hyp_vmemmap and so the buddy allocator has 209 - * initialised the refcount to '1'. 210 - */ 211 - mm_ops->get_page(ptep); 212 - if (flag != KVM_PGTABLE_WALK_LEAF) 213 - return 0; 214 - 215 - if (level != (KVM_PGTABLE_MAX_LEVELS - 1)) 199 + if (ctx->level != (KVM_PGTABLE_MAX_LEVELS - 1)) 216 200 return -EINVAL; 217 201 218 - phys = kvm_pte_to_phys(pte); 202 + phys = kvm_pte_to_phys(ctx->old); 219 203 if (!addr_is_memory(phys)) 220 204 return -EINVAL; 221 205 ··· 210 220 * Adjust the host stage-2 mappings to match the ownership attributes 211 221 * configured in the hypervisor stage-1. 
212 222 */ 213 - state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(pte)); 223 + state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(ctx->old)); 214 224 switch (state) { 215 225 case PKVM_PAGE_OWNED: 216 - return host_stage2_set_owner_locked(phys, PAGE_SIZE, pkvm_hyp_id); 226 + return host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HYP); 217 227 case PKVM_PAGE_SHARED_OWNED: 218 228 prot = pkvm_mkstate(PKVM_HOST_MEM_PROT, PKVM_PAGE_SHARED_BORROWED); 219 229 break; ··· 227 237 return host_stage2_idmap_locked(phys, PAGE_SIZE, prot); 228 238 } 229 239 230 - static int finalize_host_mappings(void) 240 + static int fix_hyp_pgtable_refcnt_walker(const struct kvm_pgtable_visit_ctx *ctx, 241 + enum kvm_pgtable_walk_flags visit) 242 + { 243 + /* 244 + * Fix-up the refcount for the page-table pages as the early allocator 245 + * was unable to access the hyp_vmemmap and so the buddy allocator has 246 + * initialised the refcount to '1'. 247 + */ 248 + if (kvm_pte_valid(ctx->old)) 249 + ctx->mm_ops->get_page(ctx->ptep); 250 + 251 + return 0; 252 + } 253 + 254 + static int fix_host_ownership(void) 231 255 { 232 256 struct kvm_pgtable_walker walker = { 233 - .cb = finalize_host_mappings_walker, 234 - .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST, 235 - .arg = pkvm_pgtable.mm_ops, 257 + .cb = fix_host_ownership_walker, 258 + .flags = KVM_PGTABLE_WALK_LEAF, 236 259 }; 237 260 int i, ret; 238 261 ··· 259 256 } 260 257 261 258 return 0; 259 + } 260 + 261 + static int fix_hyp_pgtable_refcnt(void) 262 + { 263 + struct kvm_pgtable_walker walker = { 264 + .cb = fix_hyp_pgtable_refcnt_walker, 265 + .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST, 266 + .arg = pkvm_pgtable.mm_ops, 267 + }; 268 + 269 + return kvm_pgtable_walk(&pkvm_pgtable, 0, BIT(pkvm_pgtable.ia_bits), 270 + &walker); 262 271 } 263 272 264 273 void __noreturn __pkvm_init_finalise(void) ··· 302 287 }; 303 288 pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops; 304 289 305 - ret = finalize_host_mappings(); 
290 + ret = fix_host_ownership(); 306 291 if (ret) 307 292 goto out; 308 293 294 + ret = fix_hyp_pgtable_refcnt(); 295 + if (ret) 296 + goto out; 297 + 298 + ret = hyp_create_pcpu_fixmap(); 299 + if (ret) 300 + goto out; 301 + 302 + pkvm_hyp_vm_table_init(vm_table_base); 309 303 out: 310 304 /* 311 305 * We tail-called to here from handle___pkvm_init() and will not return,
+344 -310
arch/arm64/kvm/hyp/pgtable.c
··· 49 49 #define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) 50 50 #define KVM_MAX_OWNER_ID 1 51 51 52 + /* 53 + * Used to indicate a pte for which a 'break-before-make' sequence is in 54 + * progress. 55 + */ 56 + #define KVM_INVALID_PTE_LOCKED BIT(10) 57 + 52 58 struct kvm_pgtable_walk_data { 53 - struct kvm_pgtable *pgt; 54 59 struct kvm_pgtable_walker *walker; 55 60 56 61 u64 addr; 57 62 u64 end; 58 63 }; 59 64 60 - #define KVM_PHYS_INVALID (-1ULL) 61 - 62 65 static bool kvm_phys_is_valid(u64 phys) 63 66 { 64 67 return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_EL1_PARANGE_MAX)); 65 68 } 66 69 67 - static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level) 70 + static bool kvm_block_mapping_supported(const struct kvm_pgtable_visit_ctx *ctx, u64 phys) 68 71 { 69 - u64 granule = kvm_granule_size(level); 72 + u64 granule = kvm_granule_size(ctx->level); 70 73 71 - if (!kvm_level_supports_block_mapping(level)) 74 + if (!kvm_level_supports_block_mapping(ctx->level)) 72 75 return false; 73 76 74 - if (granule > (end - addr)) 77 + if (granule > (ctx->end - ctx->addr)) 75 78 return false; 76 79 77 80 if (kvm_phys_is_valid(phys) && !IS_ALIGNED(phys, granule)) 78 81 return false; 79 82 80 - return IS_ALIGNED(addr, granule); 83 + return IS_ALIGNED(ctx->addr, granule); 81 84 } 82 85 83 86 static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, u32 level) ··· 91 88 return (data->addr >> shift) & mask; 92 89 } 93 90 94 - static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr) 91 + static u32 kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr) 95 92 { 96 93 u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */ 97 94 u64 mask = BIT(pgt->ia_bits) - 1; 98 95 99 96 return (addr & mask) >> shift; 100 - } 101 - 102 - static u32 kvm_pgd_page_idx(struct kvm_pgtable_walk_data *data) 103 - { 104 - return __kvm_pgd_page_idx(data->pgt, data->addr); 105 97 } 106 98 107 99 static u32 kvm_pgd_pages(u32 ia_bits, u32 
start_level) ··· 106 108 .start_level = start_level, 107 109 }; 108 110 109 - return __kvm_pgd_page_idx(&pgt, -1ULL) + 1; 111 + return kvm_pgd_page_idx(&pgt, -1ULL) + 1; 110 112 } 111 113 112 114 static bool kvm_pte_table(kvm_pte_t pte, u32 level) ··· 120 122 return FIELD_GET(KVM_PTE_TYPE, pte) == KVM_PTE_TYPE_TABLE; 121 123 } 122 124 123 - static kvm_pte_t kvm_phys_to_pte(u64 pa) 124 - { 125 - kvm_pte_t pte = pa & KVM_PTE_ADDR_MASK; 126 - 127 - if (PAGE_SHIFT == 16) 128 - pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48); 129 - 130 - return pte; 131 - } 132 - 133 125 static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops) 134 126 { 135 127 return mm_ops->phys_to_virt(kvm_pte_to_phys(pte)); ··· 130 142 WRITE_ONCE(*ptep, 0); 131 143 } 132 144 133 - static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp, 134 - struct kvm_pgtable_mm_ops *mm_ops) 145 + static kvm_pte_t kvm_init_table_pte(kvm_pte_t *childp, struct kvm_pgtable_mm_ops *mm_ops) 135 146 { 136 - kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp)); 147 + kvm_pte_t pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp)); 137 148 138 149 pte |= FIELD_PREP(KVM_PTE_TYPE, KVM_PTE_TYPE_TABLE); 139 150 pte |= KVM_PTE_VALID; 140 - 141 - WARN_ON(kvm_pte_valid(old)); 142 - smp_store_release(ptep, pte); 151 + return pte; 143 152 } 144 153 145 154 static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level) ··· 157 172 return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id); 158 173 } 159 174 160 - static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr, 161 - u32 level, kvm_pte_t *ptep, 162 - enum kvm_pgtable_walk_flags flag) 175 + static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, 176 + const struct kvm_pgtable_visit_ctx *ctx, 177 + enum kvm_pgtable_walk_flags visit) 163 178 { 164 179 struct kvm_pgtable_walker *walker = data->walker; 165 - return walker->cb(addr, data->end, level, ptep, flag, walker->arg); 180 
+ 181 + /* Ensure the appropriate lock is held (e.g. RCU lock for stage-2 MMU) */ 182 + WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx) && !kvm_pgtable_walk_lock_held()); 183 + return walker->cb(ctx, visit); 166 184 } 167 185 168 186 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data, 169 - kvm_pte_t *pgtable, u32 level); 187 + struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, u32 level); 170 188 171 189 static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data, 172 - kvm_pte_t *ptep, u32 level) 190 + struct kvm_pgtable_mm_ops *mm_ops, 191 + kvm_pteref_t pteref, u32 level) 173 192 { 174 - int ret = 0; 175 - u64 addr = data->addr; 176 - kvm_pte_t *childp, pte = *ptep; 177 - bool table = kvm_pte_table(pte, level); 178 193 enum kvm_pgtable_walk_flags flags = data->walker->flags; 194 + kvm_pte_t *ptep = kvm_dereference_pteref(data->walker, pteref); 195 + struct kvm_pgtable_visit_ctx ctx = { 196 + .ptep = ptep, 197 + .old = READ_ONCE(*ptep), 198 + .arg = data->walker->arg, 199 + .mm_ops = mm_ops, 200 + .addr = data->addr, 201 + .end = data->end, 202 + .level = level, 203 + .flags = flags, 204 + }; 205 + int ret = 0; 206 + kvm_pteref_t childp; 207 + bool table = kvm_pte_table(ctx.old, level); 179 208 180 - if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) { 181 - ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, 182 - KVM_PGTABLE_WALK_TABLE_PRE); 183 - } 209 + if (table && (ctx.flags & KVM_PGTABLE_WALK_TABLE_PRE)) 210 + ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_PRE); 184 211 185 - if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) { 186 - ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, 187 - KVM_PGTABLE_WALK_LEAF); 188 - pte = *ptep; 189 - table = kvm_pte_table(pte, level); 212 + if (!table && (ctx.flags & KVM_PGTABLE_WALK_LEAF)) { 213 + ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_LEAF); 214 + ctx.old = READ_ONCE(*ptep); 215 + table = kvm_pte_table(ctx.old, level); 190 216 } 191 217 192 218 if (ret) 
··· 209 213 goto out; 210 214 } 211 215 212 - childp = kvm_pte_follow(pte, data->pgt->mm_ops); 213 - ret = __kvm_pgtable_walk(data, childp, level + 1); 216 + childp = (kvm_pteref_t)kvm_pte_follow(ctx.old, mm_ops); 217 + ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1); 214 218 if (ret) 215 219 goto out; 216 220 217 - if (flags & KVM_PGTABLE_WALK_TABLE_POST) { 218 - ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, 219 - KVM_PGTABLE_WALK_TABLE_POST); 220 - } 221 + if (ctx.flags & KVM_PGTABLE_WALK_TABLE_POST) 222 + ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_POST); 221 223 222 224 out: 223 225 return ret; 224 226 } 225 227 226 228 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data, 227 - kvm_pte_t *pgtable, u32 level) 229 + struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, u32 level) 228 230 { 229 231 u32 idx; 230 232 int ret = 0; ··· 231 237 return -EINVAL; 232 238 233 239 for (idx = kvm_pgtable_idx(data, level); idx < PTRS_PER_PTE; ++idx) { 234 - kvm_pte_t *ptep = &pgtable[idx]; 240 + kvm_pteref_t pteref = &pgtable[idx]; 235 241 236 242 if (data->addr >= data->end) 237 243 break; 238 244 239 - ret = __kvm_pgtable_visit(data, ptep, level); 245 + ret = __kvm_pgtable_visit(data, mm_ops, pteref, level); 240 246 if (ret) 241 247 break; 242 248 } ··· 244 250 return ret; 245 251 } 246 252 247 - static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data) 253 + static int _kvm_pgtable_walk(struct kvm_pgtable *pgt, struct kvm_pgtable_walk_data *data) 248 254 { 249 255 u32 idx; 250 256 int ret = 0; 251 - struct kvm_pgtable *pgt = data->pgt; 252 257 u64 limit = BIT(pgt->ia_bits); 253 258 254 259 if (data->addr > limit || data->end > limit) ··· 256 263 if (!pgt->pgd) 257 264 return -EINVAL; 258 265 259 - for (idx = kvm_pgd_page_idx(data); data->addr < data->end; ++idx) { 260 - kvm_pte_t *ptep = &pgt->pgd[idx * PTRS_PER_PTE]; 266 + for (idx = kvm_pgd_page_idx(pgt, data->addr); data->addr < data->end; ++idx) { 267 + 
kvm_pteref_t pteref = &pgt->pgd[idx * PTRS_PER_PTE]; 261 268 262 - ret = __kvm_pgtable_walk(data, ptep, pgt->start_level); 269 + ret = __kvm_pgtable_walk(data, pgt->mm_ops, pteref, pgt->start_level); 263 270 if (ret) 264 271 break; 265 272 } ··· 271 278 struct kvm_pgtable_walker *walker) 272 279 { 273 280 struct kvm_pgtable_walk_data walk_data = { 274 - .pgt = pgt, 275 281 .addr = ALIGN_DOWN(addr, PAGE_SIZE), 276 282 .end = PAGE_ALIGN(walk_data.addr + size), 277 283 .walker = walker, 278 284 }; 285 + int r; 279 286 280 - return _kvm_pgtable_walk(&walk_data); 287 + r = kvm_pgtable_walk_begin(walker); 288 + if (r) 289 + return r; 290 + 291 + r = _kvm_pgtable_walk(pgt, &walk_data); 292 + kvm_pgtable_walk_end(walker); 293 + 294 + return r; 281 295 } 282 296 283 297 struct leaf_walk_data { ··· 292 292 u32 level; 293 293 }; 294 294 295 - static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 296 - enum kvm_pgtable_walk_flags flag, void * const arg) 295 + static int leaf_walker(const struct kvm_pgtable_visit_ctx *ctx, 296 + enum kvm_pgtable_walk_flags visit) 297 297 { 298 - struct leaf_walk_data *data = arg; 298 + struct leaf_walk_data *data = ctx->arg; 299 299 300 - data->pte = *ptep; 301 - data->level = level; 300 + data->pte = ctx->old; 301 + data->level = ctx->level; 302 302 303 303 return 0; 304 304 } ··· 329 329 struct hyp_map_data { 330 330 u64 phys; 331 331 kvm_pte_t attr; 332 - struct kvm_pgtable_mm_ops *mm_ops; 333 332 }; 334 333 335 334 static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep) ··· 382 383 return prot; 383 384 } 384 385 385 - static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, 386 - kvm_pte_t *ptep, struct hyp_map_data *data) 386 + static bool hyp_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx, 387 + struct hyp_map_data *data) 387 388 { 388 - kvm_pte_t new, old = *ptep; 389 - u64 granule = kvm_granule_size(level), phys = data->phys; 389 + kvm_pte_t new; 390 + u64 granule = 
kvm_granule_size(ctx->level), phys = data->phys; 390 391 391 - if (!kvm_block_mapping_supported(addr, end, phys, level)) 392 + if (!kvm_block_mapping_supported(ctx, phys)) 392 393 return false; 393 394 394 395 data->phys += granule; 395 - new = kvm_init_valid_leaf_pte(phys, data->attr, level); 396 - if (old == new) 396 + new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level); 397 + if (ctx->old == new) 397 398 return true; 398 - if (!kvm_pte_valid(old)) 399 - data->mm_ops->get_page(ptep); 400 - else if (WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW)) 399 + if (!kvm_pte_valid(ctx->old)) 400 + ctx->mm_ops->get_page(ctx->ptep); 401 + else if (WARN_ON((ctx->old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW)) 401 402 return false; 402 403 403 - smp_store_release(ptep, new); 404 + smp_store_release(ctx->ptep, new); 404 405 return true; 405 406 } 406 407 407 - static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 408 - enum kvm_pgtable_walk_flags flag, void * const arg) 408 + static int hyp_map_walker(const struct kvm_pgtable_visit_ctx *ctx, 409 + enum kvm_pgtable_walk_flags visit) 409 410 { 410 - kvm_pte_t *childp; 411 - struct hyp_map_data *data = arg; 412 - struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 411 + kvm_pte_t *childp, new; 412 + struct hyp_map_data *data = ctx->arg; 413 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 413 414 414 - if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg)) 415 + if (hyp_map_walker_try_leaf(ctx, data)) 415 416 return 0; 416 417 417 - if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1)) 418 + if (WARN_ON(ctx->level == KVM_PGTABLE_MAX_LEVELS - 1)) 418 419 return -EINVAL; 419 420 420 421 childp = (kvm_pte_t *)mm_ops->zalloc_page(NULL); 421 422 if (!childp) 422 423 return -ENOMEM; 423 424 424 - kvm_set_table_pte(ptep, childp, mm_ops); 425 - mm_ops->get_page(ptep); 425 + new = kvm_init_table_pte(childp, mm_ops); 426 + mm_ops->get_page(ctx->ptep); 427 + smp_store_release(ctx->ptep, new); 428 + 426 429 return 0; 427 430 
} 428 431 ··· 434 433 int ret; 435 434 struct hyp_map_data map_data = { 436 435 .phys = ALIGN_DOWN(phys, PAGE_SIZE), 437 - .mm_ops = pgt->mm_ops, 438 436 }; 439 437 struct kvm_pgtable_walker walker = { 440 438 .cb = hyp_map_walker, ··· 451 451 return ret; 452 452 } 453 453 454 - struct hyp_unmap_data { 455 - u64 unmapped; 456 - struct kvm_pgtable_mm_ops *mm_ops; 457 - }; 458 - 459 - static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 460 - enum kvm_pgtable_walk_flags flag, void * const arg) 454 + static int hyp_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx, 455 + enum kvm_pgtable_walk_flags visit) 461 456 { 462 - kvm_pte_t pte = *ptep, *childp = NULL; 463 - u64 granule = kvm_granule_size(level); 464 - struct hyp_unmap_data *data = arg; 465 - struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 457 + kvm_pte_t *childp = NULL; 458 + u64 granule = kvm_granule_size(ctx->level); 459 + u64 *unmapped = ctx->arg; 460 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 466 461 467 - if (!kvm_pte_valid(pte)) 462 + if (!kvm_pte_valid(ctx->old)) 468 463 return -EINVAL; 469 464 470 - if (kvm_pte_table(pte, level)) { 471 - childp = kvm_pte_follow(pte, mm_ops); 465 + if (kvm_pte_table(ctx->old, ctx->level)) { 466 + childp = kvm_pte_follow(ctx->old, mm_ops); 472 467 473 468 if (mm_ops->page_count(childp) != 1) 474 469 return 0; 475 470 476 - kvm_clear_pte(ptep); 471 + kvm_clear_pte(ctx->ptep); 477 472 dsb(ishst); 478 - __tlbi_level(vae2is, __TLBI_VADDR(addr, 0), level); 473 + __tlbi_level(vae2is, __TLBI_VADDR(ctx->addr, 0), ctx->level); 479 474 } else { 480 - if (end - addr < granule) 475 + if (ctx->end - ctx->addr < granule) 481 476 return -EINVAL; 482 477 483 - kvm_clear_pte(ptep); 478 + kvm_clear_pte(ctx->ptep); 484 479 dsb(ishst); 485 - __tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level); 486 - data->unmapped += granule; 480 + __tlbi_level(vale2is, __TLBI_VADDR(ctx->addr, 0), ctx->level); 481 + *unmapped += granule; 487 482 } 488 483 489 484 dsb(ish); 
490 485 isb(); 491 - mm_ops->put_page(ptep); 486 + mm_ops->put_page(ctx->ptep); 492 487 493 488 if (childp) 494 489 mm_ops->put_page(childp); ··· 493 498 494 499 u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size) 495 500 { 496 - struct hyp_unmap_data unmap_data = { 497 - .mm_ops = pgt->mm_ops, 498 - }; 501 + u64 unmapped = 0; 499 502 struct kvm_pgtable_walker walker = { 500 503 .cb = hyp_unmap_walker, 501 - .arg = &unmap_data, 504 + .arg = &unmapped, 502 505 .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST, 503 506 }; 504 507 ··· 504 511 return 0; 505 512 506 513 kvm_pgtable_walk(pgt, addr, size, &walker); 507 - return unmap_data.unmapped; 514 + return unmapped; 508 515 } 509 516 510 517 int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits, ··· 512 519 { 513 520 u64 levels = ARM64_HW_PGTABLE_LEVELS(va_bits); 514 521 515 - pgt->pgd = (kvm_pte_t *)mm_ops->zalloc_page(NULL); 522 + pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_page(NULL); 516 523 if (!pgt->pgd) 517 524 return -ENOMEM; 518 525 ··· 525 532 return 0; 526 533 } 527 534 528 - static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 529 - enum kvm_pgtable_walk_flags flag, void * const arg) 535 + static int hyp_free_walker(const struct kvm_pgtable_visit_ctx *ctx, 536 + enum kvm_pgtable_walk_flags visit) 530 537 { 531 - struct kvm_pgtable_mm_ops *mm_ops = arg; 532 - kvm_pte_t pte = *ptep; 538 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 533 539 534 - if (!kvm_pte_valid(pte)) 540 + if (!kvm_pte_valid(ctx->old)) 535 541 return 0; 536 542 537 - mm_ops->put_page(ptep); 543 + mm_ops->put_page(ctx->ptep); 538 544 539 - if (kvm_pte_table(pte, level)) 540 - mm_ops->put_page(kvm_pte_follow(pte, mm_ops)); 545 + if (kvm_pte_table(ctx->old, ctx->level)) 546 + mm_ops->put_page(kvm_pte_follow(ctx->old, mm_ops)); 541 547 542 548 return 0; 543 549 } ··· 546 554 struct kvm_pgtable_walker walker = { 547 555 .cb = hyp_free_walker, 548 556 .flags = KVM_PGTABLE_WALK_LEAF | 
KVM_PGTABLE_WALK_TABLE_POST, 549 - .arg = pgt->mm_ops, 550 557 }; 551 558 552 559 WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker)); 553 - pgt->mm_ops->put_page(pgt->pgd); 560 + pgt->mm_ops->put_page(kvm_dereference_pteref(&walker, pgt->pgd)); 554 561 pgt->pgd = NULL; 555 562 } 556 563 ··· 563 572 564 573 struct kvm_s2_mmu *mmu; 565 574 void *memcache; 566 - 567 - struct kvm_pgtable_mm_ops *mm_ops; 568 575 569 576 /* Force mappings to page granularity */ 570 577 bool force_pte; ··· 671 682 return !!pte; 672 683 } 673 684 674 - static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr, 675 - u32 level, struct kvm_pgtable_mm_ops *mm_ops) 685 + static bool stage2_pte_is_locked(kvm_pte_t pte) 686 + { 687 + return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED); 688 + } 689 + 690 + static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new) 691 + { 692 + if (!kvm_pgtable_walk_shared(ctx)) { 693 + WRITE_ONCE(*ctx->ptep, new); 694 + return true; 695 + } 696 + 697 + return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old; 698 + } 699 + 700 + /** 701 + * stage2_try_break_pte() - Invalidates a pte according to the 702 + * 'break-before-make' requirements of the 703 + * architecture. 704 + * 705 + * @ctx: context of the visited pte. 706 + * @mmu: stage-2 mmu 707 + * 708 + * Returns: true if the pte was successfully broken. 709 + * 710 + * If the removed pte was valid, performs the necessary serialization and TLB 711 + * invalidation for the old value. For counted ptes, drops the reference count 712 + * on the containing table page. 713 + */ 714 + static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx, 715 + struct kvm_s2_mmu *mmu) 716 + { 717 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 718 + 719 + if (stage2_pte_is_locked(ctx->old)) { 720 + /* 721 + * Should never occur if this walker has exclusive access to the 722 + * page tables. 
723 + */ 724 + WARN_ON(!kvm_pgtable_walk_shared(ctx)); 725 + return false; 726 + } 727 + 728 + if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED)) 729 + return false; 730 + 731 + /* 732 + * Perform the appropriate TLB invalidation based on the evicted pte 733 + * value (if any). 734 + */ 735 + if (kvm_pte_table(ctx->old, ctx->level)) 736 + kvm_call_hyp(__kvm_tlb_flush_vmid, mmu); 737 + else if (kvm_pte_valid(ctx->old)) 738 + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level); 739 + 740 + if (stage2_pte_is_counted(ctx->old)) 741 + mm_ops->put_page(ctx->ptep); 742 + 743 + return true; 744 + } 745 + 746 + static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new) 747 + { 748 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 749 + 750 + WARN_ON(!stage2_pte_is_locked(*ctx->ptep)); 751 + 752 + if (stage2_pte_is_counted(new)) 753 + mm_ops->get_page(ctx->ptep); 754 + 755 + smp_store_release(ctx->ptep, new); 756 + } 757 + 758 + static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu, 759 + struct kvm_pgtable_mm_ops *mm_ops) 676 760 { 677 761 /* 678 762 * Clear the existing PTE, and perform break-before-make with 679 763 * TLB maintenance if it was valid. 
680 764 */ 681 - if (kvm_pte_valid(*ptep)) { 682 - kvm_clear_pte(ptep); 683 - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level); 765 + if (kvm_pte_valid(ctx->old)) { 766 + kvm_clear_pte(ctx->ptep); 767 + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level); 684 768 } 685 769 686 - mm_ops->put_page(ptep); 770 + mm_ops->put_page(ctx->ptep); 687 771 } 688 772 689 773 static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte) ··· 770 708 return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN); 771 709 } 772 710 773 - static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level, 711 + static bool stage2_leaf_mapping_allowed(const struct kvm_pgtable_visit_ctx *ctx, 774 712 struct stage2_map_data *data) 775 713 { 776 - if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1))) 714 + if (data->force_pte && (ctx->level < (KVM_PGTABLE_MAX_LEVELS - 1))) 777 715 return false; 778 716 779 - return kvm_block_mapping_supported(addr, end, data->phys, level); 717 + return kvm_block_mapping_supported(ctx, data->phys); 780 718 } 781 719 782 - static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level, 783 - kvm_pte_t *ptep, 720 + static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx, 784 721 struct stage2_map_data *data) 785 722 { 786 - kvm_pte_t new, old = *ptep; 787 - u64 granule = kvm_granule_size(level), phys = data->phys; 723 + kvm_pte_t new; 724 + u64 granule = kvm_granule_size(ctx->level), phys = data->phys; 788 725 struct kvm_pgtable *pgt = data->mmu->pgt; 789 - struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 726 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 790 727 791 - if (!stage2_leaf_mapping_allowed(addr, end, level, data)) 728 + if (!stage2_leaf_mapping_allowed(ctx, data)) 792 729 return -E2BIG; 793 730 794 731 if (kvm_phys_is_valid(phys)) 795 - new = kvm_init_valid_leaf_pte(phys, data->attr, level); 732 + new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level); 796 733 else 797 734 new = 
kvm_init_invalid_leaf_owner(data->owner_id); 798 735 799 - if (stage2_pte_is_counted(old)) { 800 - /* 801 - * Skip updating the PTE if we are trying to recreate the exact 802 - * same mapping or only change the access permissions. Instead, 803 - * the vCPU will exit one more time from guest if still needed 804 - * and then go through the path of relaxing permissions. 805 - */ 806 - if (!stage2_pte_needs_update(old, new)) 807 - return -EAGAIN; 736 + /* 737 + * Skip updating the PTE if we are trying to recreate the exact 738 + * same mapping or only change the access permissions. Instead, 739 + * the vCPU will exit one more time from guest if still needed 740 + * and then go through the path of relaxing permissions. 741 + */ 742 + if (!stage2_pte_needs_update(ctx->old, new)) 743 + return -EAGAIN; 808 744 809 - stage2_put_pte(ptep, data->mmu, addr, level, mm_ops); 810 - } 745 + if (!stage2_try_break_pte(ctx, data->mmu)) 746 + return -EAGAIN; 811 747 812 748 /* Perform CMOs before installation of the guest stage-2 PTE */ 813 749 if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new)) ··· 815 755 if (mm_ops->icache_inval_pou && stage2_pte_executable(new)) 816 756 mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule); 817 757 818 - smp_store_release(ptep, new); 819 - if (stage2_pte_is_counted(new)) 820 - mm_ops->get_page(ptep); 758 + stage2_make_pte(ctx, new); 759 + 821 760 if (kvm_phys_is_valid(phys)) 822 761 data->phys += granule; 823 762 return 0; 824 763 } 825 764 826 - static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level, 827 - kvm_pte_t *ptep, 765 + static int stage2_map_walk_table_pre(const struct kvm_pgtable_visit_ctx *ctx, 828 766 struct stage2_map_data *data) 829 767 { 830 - if (data->anchor) 768 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 769 + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops); 770 + int ret; 771 + 772 + if (!stage2_leaf_mapping_allowed(ctx, data)) 831 773 return 0; 832 774 833 - if 
(!stage2_leaf_mapping_allowed(addr, end, level, data)) 834 - return 0; 775 + ret = stage2_map_walker_try_leaf(ctx, data); 776 + if (ret) 777 + return ret; 835 778 836 - data->childp = kvm_pte_follow(*ptep, data->mm_ops); 837 - kvm_clear_pte(ptep); 838 - 839 - /* 840 - * Invalidate the whole stage-2, as we may have numerous leaf 841 - * entries below us which would otherwise need invalidating 842 - * individually. 843 - */ 844 - kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu); 845 - data->anchor = ptep; 779 + mm_ops->free_removed_table(childp, ctx->level); 846 780 return 0; 847 781 } 848 782 849 - static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 783 + static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx, 850 784 struct stage2_map_data *data) 851 785 { 852 - struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 853 - kvm_pte_t *childp, pte = *ptep; 786 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 787 + kvm_pte_t *childp, new; 854 788 int ret; 855 789 856 - if (data->anchor) { 857 - if (stage2_pte_is_counted(pte)) 858 - mm_ops->put_page(ptep); 859 - 860 - return 0; 861 - } 862 - 863 - ret = stage2_map_walker_try_leaf(addr, end, level, ptep, data); 790 + ret = stage2_map_walker_try_leaf(ctx, data); 864 791 if (ret != -E2BIG) 865 792 return ret; 866 793 867 - if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1)) 794 + if (WARN_ON(ctx->level == KVM_PGTABLE_MAX_LEVELS - 1)) 868 795 return -EINVAL; 869 796 870 797 if (!data->memcache) ··· 861 814 if (!childp) 862 815 return -ENOMEM; 863 816 817 + if (!stage2_try_break_pte(ctx, data->mmu)) { 818 + mm_ops->put_page(childp); 819 + return -EAGAIN; 820 + } 821 + 864 822 /* 865 823 * If we've run into an existing block mapping then replace it with 866 824 * a table. Accesses beyond 'end' that fall within the new table 867 825 * will be mapped lazily. 
868 826 */ 869 - if (stage2_pte_is_counted(pte)) 870 - stage2_put_pte(ptep, data->mmu, addr, level, mm_ops); 871 - 872 - kvm_set_table_pte(ptep, childp, mm_ops); 873 - mm_ops->get_page(ptep); 827 + new = kvm_init_table_pte(childp, mm_ops); 828 + stage2_make_pte(ctx, new); 874 829 875 830 return 0; 876 831 } 877 832 878 - static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level, 879 - kvm_pte_t *ptep, 880 - struct stage2_map_data *data) 881 - { 882 - struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 883 - kvm_pte_t *childp; 884 - int ret = 0; 885 - 886 - if (!data->anchor) 887 - return 0; 888 - 889 - if (data->anchor == ptep) { 890 - childp = data->childp; 891 - data->anchor = NULL; 892 - data->childp = NULL; 893 - ret = stage2_map_walk_leaf(addr, end, level, ptep, data); 894 - } else { 895 - childp = kvm_pte_follow(*ptep, mm_ops); 896 - } 897 - 898 - mm_ops->put_page(childp); 899 - mm_ops->put_page(ptep); 900 - 901 - return ret; 902 - } 903 - 904 833 /* 905 - * This is a little fiddly, as we use all three of the walk flags. The idea 906 - * is that the TABLE_PRE callback runs for table entries on the way down, 907 - * looking for table entries which we could conceivably replace with a 908 - * block entry for this mapping. If it finds one, then it sets the 'anchor' 909 - * field in 'struct stage2_map_data' to point at the table entry, before 910 - * clearing the entry to zero and descending into the now detached table. 834 + * The TABLE_PRE callback runs for table entries on the way down, looking 835 + * for table entries which we could conceivably replace with a block entry 836 + * for this mapping. If it finds one it replaces the entry and calls 837 + * kvm_pgtable_mm_ops::free_removed_table() to tear down the detached table. 911 838 * 912 - * The behaviour of the LEAF callback then depends on whether or not the 913 - * anchor has been set. 
If not, then we're not using a block mapping higher 914 - * up the table and we perform the mapping at the existing leaves instead. 915 - * If, on the other hand, the anchor _is_ set, then we drop references to 916 - * all valid leaves so that the pages beneath the anchor can be freed. 917 - * 918 - * Finally, the TABLE_POST callback does nothing if the anchor has not 919 - * been set, but otherwise frees the page-table pages while walking back up 920 - * the page-table, installing the block entry when it revisits the anchor 921 - * pointer and clearing the anchor to NULL. 839 + * Otherwise, the LEAF callback performs the mapping at the existing leaves 840 + * instead. 922 841 */ 923 - static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 924 - enum kvm_pgtable_walk_flags flag, void * const arg) 842 + static int stage2_map_walker(const struct kvm_pgtable_visit_ctx *ctx, 843 + enum kvm_pgtable_walk_flags visit) 925 844 { 926 - struct stage2_map_data *data = arg; 845 + struct stage2_map_data *data = ctx->arg; 927 846 928 - switch (flag) { 847 + switch (visit) { 929 848 case KVM_PGTABLE_WALK_TABLE_PRE: 930 - return stage2_map_walk_table_pre(addr, end, level, ptep, data); 849 + return stage2_map_walk_table_pre(ctx, data); 931 850 case KVM_PGTABLE_WALK_LEAF: 932 - return stage2_map_walk_leaf(addr, end, level, ptep, data); 933 - case KVM_PGTABLE_WALK_TABLE_POST: 934 - return stage2_map_walk_table_post(addr, end, level, ptep, data); 851 + return stage2_map_walk_leaf(ctx, data); 852 + default: 853 + return -EINVAL; 935 854 } 936 - 937 - return -EINVAL; 938 855 } 939 856 940 857 int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, 941 858 u64 phys, enum kvm_pgtable_prot prot, 942 - void *mc) 859 + void *mc, enum kvm_pgtable_walk_flags flags) 943 860 { 944 861 int ret; 945 862 struct stage2_map_data map_data = { 946 863 .phys = ALIGN_DOWN(phys, PAGE_SIZE), 947 864 .mmu = pgt->mmu, 948 865 .memcache = mc, 949 - .mm_ops = pgt->mm_ops, 
950 866 .force_pte = pgt->force_pte_cb && pgt->force_pte_cb(addr, addr + size, prot), 951 867 }; 952 868 struct kvm_pgtable_walker walker = { 953 869 .cb = stage2_map_walker, 954 - .flags = KVM_PGTABLE_WALK_TABLE_PRE | 955 - KVM_PGTABLE_WALK_LEAF | 956 - KVM_PGTABLE_WALK_TABLE_POST, 870 + .flags = flags | 871 + KVM_PGTABLE_WALK_TABLE_PRE | 872 + KVM_PGTABLE_WALK_LEAF, 957 873 .arg = &map_data, 958 874 }; 959 875 ··· 940 930 .phys = KVM_PHYS_INVALID, 941 931 .mmu = pgt->mmu, 942 932 .memcache = mc, 943 - .mm_ops = pgt->mm_ops, 944 933 .owner_id = owner_id, 945 934 .force_pte = true, 946 935 }; 947 936 struct kvm_pgtable_walker walker = { 948 937 .cb = stage2_map_walker, 949 938 .flags = KVM_PGTABLE_WALK_TABLE_PRE | 950 - KVM_PGTABLE_WALK_LEAF | 951 - KVM_PGTABLE_WALK_TABLE_POST, 939 + KVM_PGTABLE_WALK_LEAF, 952 940 .arg = &map_data, 953 941 }; 954 942 ··· 957 949 return ret; 958 950 } 959 951 960 - static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 961 - enum kvm_pgtable_walk_flags flag, 962 - void * const arg) 952 + static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx, 953 + enum kvm_pgtable_walk_flags visit) 963 954 { 964 - struct kvm_pgtable *pgt = arg; 955 + struct kvm_pgtable *pgt = ctx->arg; 965 956 struct kvm_s2_mmu *mmu = pgt->mmu; 966 - struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops; 967 - kvm_pte_t pte = *ptep, *childp = NULL; 957 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 958 + kvm_pte_t *childp = NULL; 968 959 bool need_flush = false; 969 960 970 - if (!kvm_pte_valid(pte)) { 971 - if (stage2_pte_is_counted(pte)) { 972 - kvm_clear_pte(ptep); 973 - mm_ops->put_page(ptep); 961 + if (!kvm_pte_valid(ctx->old)) { 962 + if (stage2_pte_is_counted(ctx->old)) { 963 + kvm_clear_pte(ctx->ptep); 964 + mm_ops->put_page(ctx->ptep); 974 965 } 975 966 return 0; 976 967 } 977 968 978 - if (kvm_pte_table(pte, level)) { 979 - childp = kvm_pte_follow(pte, mm_ops); 969 + if (kvm_pte_table(ctx->old, ctx->level)) { 970 + 
childp = kvm_pte_follow(ctx->old, mm_ops); 980 971 981 972 if (mm_ops->page_count(childp) != 1) 982 973 return 0; 983 - } else if (stage2_pte_cacheable(pgt, pte)) { 974 + } else if (stage2_pte_cacheable(pgt, ctx->old)) { 984 975 need_flush = !stage2_has_fwb(pgt); 985 976 } 986 977 ··· 988 981 * block entry and rely on the remaining portions being faulted 989 982 * back lazily. 990 983 */ 991 - stage2_put_pte(ptep, mmu, addr, level, mm_ops); 984 + stage2_put_pte(ctx, mmu, mm_ops); 992 985 993 986 if (need_flush && mm_ops->dcache_clean_inval_poc) 994 - mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops), 995 - kvm_granule_size(level)); 987 + mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops), 988 + kvm_granule_size(ctx->level)); 996 989 997 990 if (childp) 998 991 mm_ops->put_page(childp); ··· 1016 1009 kvm_pte_t attr_clr; 1017 1010 kvm_pte_t pte; 1018 1011 u32 level; 1019 - struct kvm_pgtable_mm_ops *mm_ops; 1020 1012 }; 1021 1013 1022 - static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 1023 - enum kvm_pgtable_walk_flags flag, 1024 - void * const arg) 1014 + static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx, 1015 + enum kvm_pgtable_walk_flags visit) 1025 1016 { 1026 - kvm_pte_t pte = *ptep; 1027 - struct stage2_attr_data *data = arg; 1028 - struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 1017 + kvm_pte_t pte = ctx->old; 1018 + struct stage2_attr_data *data = ctx->arg; 1019 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 1029 1020 1030 - if (!kvm_pte_valid(pte)) 1021 + if (!kvm_pte_valid(ctx->old)) 1031 1022 return 0; 1032 1023 1033 - data->level = level; 1024 + data->level = ctx->level; 1034 1025 data->pte = pte; 1035 1026 pte &= ~data->attr_clr; 1036 1027 pte |= data->attr_set; ··· 1044 1039 * stage-2 PTE if we are going to add executable permission. 
1045 1040 */ 1046 1041 if (mm_ops->icache_inval_pou && 1047 - stage2_pte_executable(pte) && !stage2_pte_executable(*ptep)) 1042 + stage2_pte_executable(pte) && !stage2_pte_executable(ctx->old)) 1048 1043 mm_ops->icache_inval_pou(kvm_pte_follow(pte, mm_ops), 1049 - kvm_granule_size(level)); 1050 - WRITE_ONCE(*ptep, pte); 1044 + kvm_granule_size(ctx->level)); 1045 + 1046 + if (!stage2_try_set_pte(ctx, pte)) 1047 + return -EAGAIN; 1051 1048 } 1052 1049 1053 1050 return 0; ··· 1058 1051 static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr, 1059 1052 u64 size, kvm_pte_t attr_set, 1060 1053 kvm_pte_t attr_clr, kvm_pte_t *orig_pte, 1061 - u32 *level) 1054 + u32 *level, enum kvm_pgtable_walk_flags flags) 1062 1055 { 1063 1056 int ret; 1064 1057 kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI; 1065 1058 struct stage2_attr_data data = { 1066 1059 .attr_set = attr_set & attr_mask, 1067 1060 .attr_clr = attr_clr & attr_mask, 1068 - .mm_ops = pgt->mm_ops, 1069 1061 }; 1070 1062 struct kvm_pgtable_walker walker = { 1071 1063 .cb = stage2_attr_walker, 1072 1064 .arg = &data, 1073 - .flags = KVM_PGTABLE_WALK_LEAF, 1065 + .flags = flags | KVM_PGTABLE_WALK_LEAF, 1074 1066 }; 1075 1067 1076 1068 ret = kvm_pgtable_walk(pgt, addr, size, &walker); ··· 1088 1082 { 1089 1083 return stage2_update_leaf_attrs(pgt, addr, size, 0, 1090 1084 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W, 1091 - NULL, NULL); 1085 + NULL, NULL, 0); 1092 1086 } 1093 1087 1094 1088 kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr) 1095 1089 { 1096 1090 kvm_pte_t pte = 0; 1097 1091 stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0, 1098 - &pte, NULL); 1092 + &pte, NULL, 0); 1099 1093 dsb(ishst); 1100 1094 return pte; 1101 1095 } ··· 1104 1098 { 1105 1099 kvm_pte_t pte = 0; 1106 1100 stage2_update_leaf_attrs(pgt, addr, 1, 0, KVM_PTE_LEAF_ATTR_LO_S2_AF, 1107 - &pte, NULL); 1101 + &pte, NULL, 0); 1108 1102 /* 1109 1103 * "But where's the TLBI?!", you scream. 
1110 1104 * "Over in the core code", I sigh. ··· 1117 1111 bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr) 1118 1112 { 1119 1113 kvm_pte_t pte = 0; 1120 - stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL); 1114 + stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL, 0); 1121 1115 return pte & KVM_PTE_LEAF_ATTR_LO_S2_AF; 1122 1116 } 1123 1117 ··· 1140 1134 if (prot & KVM_PGTABLE_PROT_X) 1141 1135 clr |= KVM_PTE_LEAF_ATTR_HI_S2_XN; 1142 1136 1143 - ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level); 1137 + ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level, 1138 + KVM_PGTABLE_WALK_SHARED); 1144 1139 if (!ret) 1145 1140 kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level); 1146 1141 return ret; 1147 1142 } 1148 1143 1149 - static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 1150 - enum kvm_pgtable_walk_flags flag, 1151 - void * const arg) 1144 + static int stage2_flush_walker(const struct kvm_pgtable_visit_ctx *ctx, 1145 + enum kvm_pgtable_walk_flags visit) 1152 1146 { 1153 - struct kvm_pgtable *pgt = arg; 1147 + struct kvm_pgtable *pgt = ctx->arg; 1154 1148 struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops; 1155 - kvm_pte_t pte = *ptep; 1156 1149 1157 - if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pgt, pte)) 1150 + if (!kvm_pte_valid(ctx->old) || !stage2_pte_cacheable(pgt, ctx->old)) 1158 1151 return 0; 1159 1152 1160 1153 if (mm_ops->dcache_clean_inval_poc) 1161 - mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops), 1162 - kvm_granule_size(level)); 1154 + mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops), 1155 + kvm_granule_size(ctx->level)); 1163 1156 return 0; 1164 1157 } 1165 1158 ··· 1189 1184 u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0; 1190 1185 1191 1186 pgd_sz = kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE; 1192 - pgt->pgd = mm_ops->zalloc_pages_exact(pgd_sz); 1187 + pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_pages_exact(pgd_sz); 
1193 1188 if (!pgt->pgd) 1194 1189 return -ENOMEM; 1195 1190 ··· 1205 1200 return 0; 1206 1201 } 1207 1202 1208 - static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 1209 - enum kvm_pgtable_walk_flags flag, 1210 - void * const arg) 1203 + size_t kvm_pgtable_stage2_pgd_size(u64 vtcr) 1211 1204 { 1212 - struct kvm_pgtable_mm_ops *mm_ops = arg; 1213 - kvm_pte_t pte = *ptep; 1205 + u32 ia_bits = VTCR_EL2_IPA(vtcr); 1206 + u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr); 1207 + u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0; 1214 1208 1215 - if (!stage2_pte_is_counted(pte)) 1209 + return kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE; 1210 + } 1211 + 1212 + static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx, 1213 + enum kvm_pgtable_walk_flags visit) 1214 + { 1215 + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 1216 + 1217 + if (!stage2_pte_is_counted(ctx->old)) 1216 1218 return 0; 1217 1219 1218 - mm_ops->put_page(ptep); 1220 + mm_ops->put_page(ctx->ptep); 1219 1221 1220 - if (kvm_pte_table(pte, level)) 1221 - mm_ops->put_page(kvm_pte_follow(pte, mm_ops)); 1222 + if (kvm_pte_table(ctx->old, ctx->level)) 1223 + mm_ops->put_page(kvm_pte_follow(ctx->old, mm_ops)); 1222 1224 1223 1225 return 0; 1224 1226 } ··· 1237 1225 .cb = stage2_free_walker, 1238 1226 .flags = KVM_PGTABLE_WALK_LEAF | 1239 1227 KVM_PGTABLE_WALK_TABLE_POST, 1240 - .arg = pgt->mm_ops, 1241 1228 }; 1242 1229 1243 1230 WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker)); 1244 1231 pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE; 1245 - pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz); 1232 + pgt->mm_ops->free_pages_exact(kvm_dereference_pteref(&walker, pgt->pgd), pgd_sz); 1246 1233 pgt->pgd = NULL; 1234 + } 1235 + 1236 + void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level) 1237 + { 1238 + kvm_pteref_t ptep = (kvm_pteref_t)pgtable; 1239 + struct kvm_pgtable_walker walker = { 1240 + .cb = 
stage2_free_walker, 1241 + .flags = KVM_PGTABLE_WALK_LEAF | 1242 + KVM_PGTABLE_WALK_TABLE_POST, 1243 + }; 1244 + struct kvm_pgtable_walk_data data = { 1245 + .walker = &walker, 1246 + 1247 + /* 1248 + * At this point the IPA really doesn't matter, as the page 1249 + * table being traversed has already been removed from the stage 1250 + * 2. Set an appropriate range to cover the entire page table. 1251 + */ 1252 + .addr = 0, 1253 + .end = kvm_granule_size(level), 1254 + }; 1255 + 1256 + WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1)); 1247 1257 }
+1 -1
arch/arm64/kvm/hyp/vhe/Makefile
··· 1 1 # SPDX-License-Identifier: GPL-2.0 2 2 # 3 - # Makefile for Kernel-based Virtual Machine module, HYP/nVHE part 3 + # Makefile for Kernel-based Virtual Machine module, HYP/VHE part 4 4 # 5 5 6 6 asflags-y := -D__KVM_VHE_HYPERVISOR__
+132 -61
arch/arm64/kvm/mmu.c
··· 128 128 free_pages_exact(virt, size); 129 129 } 130 130 131 + static struct kvm_pgtable_mm_ops kvm_s2_mm_ops; 132 + 133 + static void stage2_free_removed_table_rcu_cb(struct rcu_head *head) 134 + { 135 + struct page *page = container_of(head, struct page, rcu_head); 136 + void *pgtable = page_to_virt(page); 137 + u32 level = page_private(page); 138 + 139 + kvm_pgtable_stage2_free_removed(&kvm_s2_mm_ops, pgtable, level); 140 + } 141 + 142 + static void stage2_free_removed_table(void *addr, u32 level) 143 + { 144 + struct page *page = virt_to_page(addr); 145 + 146 + set_page_private(page, (unsigned long)level); 147 + call_rcu(&page->rcu_head, stage2_free_removed_table_rcu_cb); 148 + } 149 + 131 150 static void kvm_host_get_page(void *addr) 132 151 { 133 152 get_page(virt_to_page(addr)); ··· 659 640 static int get_user_mapping_size(struct kvm *kvm, u64 addr) 660 641 { 661 642 struct kvm_pgtable pgt = { 662 - .pgd = (kvm_pte_t *)kvm->mm->pgd, 663 - .ia_bits = VA_BITS, 643 + .pgd = (kvm_pteref_t)kvm->mm->pgd, 644 + .ia_bits = vabits_actual, 664 645 .start_level = (KVM_PGTABLE_MAX_LEVELS - 665 646 CONFIG_PGTABLE_LEVELS), 666 647 .mm_ops = &kvm_user_mm_ops, ··· 681 662 .zalloc_page = stage2_memcache_zalloc_page, 682 663 .zalloc_pages_exact = kvm_s2_zalloc_pages_exact, 683 664 .free_pages_exact = kvm_s2_free_pages_exact, 665 + .free_removed_table = stage2_free_removed_table, 684 666 .get_page = kvm_host_get_page, 685 667 .put_page = kvm_s2_put_page, 686 668 .page_count = kvm_host_page_count, ··· 695 675 * kvm_init_stage2_mmu - Initialise a S2 MMU structure 696 676 * @kvm: The pointer to the KVM structure 697 677 * @mmu: The pointer to the s2 MMU structure 678 + * @type: The machine type of the virtual machine 698 679 * 699 680 * Allocates only the stage-2 HW PGD level table(s). 700 681 * Note we don't need locking here as this is only called when the VM is 701 682 * created, which can only be done once. 
702 683 */ 703 - int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu) 684 + int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type) 704 685 { 686 + u32 kvm_ipa_limit = get_kvm_ipa_limit(); 705 687 int cpu, err; 706 688 struct kvm_pgtable *pgt; 689 + u64 mmfr0, mmfr1; 690 + u32 phys_shift; 691 + 692 + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) 693 + return -EINVAL; 694 + 695 + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); 696 + if (is_protected_kvm_enabled()) { 697 + phys_shift = kvm_ipa_limit; 698 + } else if (phys_shift) { 699 + if (phys_shift > kvm_ipa_limit || 700 + phys_shift < ARM64_MIN_PARANGE_BITS) 701 + return -EINVAL; 702 + } else { 703 + phys_shift = KVM_PHYS_SHIFT; 704 + if (phys_shift > kvm_ipa_limit) { 705 + pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n", 706 + current->comm); 707 + return -EINVAL; 708 + } 709 + } 710 + 711 + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1); 712 + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1); 713 + kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift); 707 714 708 715 if (mmu->pgt != NULL) { 709 716 kvm_err("kvm_arch already initialized?\n"); ··· 854 807 } 855 808 } 856 809 810 + static void hyp_mc_free_fn(void *addr, void *unused) 811 + { 812 + free_page((unsigned long)addr); 813 + } 814 + 815 + static void *hyp_mc_alloc_fn(void *unused) 816 + { 817 + return (void *)__get_free_page(GFP_KERNEL_ACCOUNT); 818 + } 819 + 820 + void free_hyp_memcache(struct kvm_hyp_memcache *mc) 821 + { 822 + if (is_protected_kvm_enabled()) 823 + __free_hyp_memcache(mc, hyp_mc_free_fn, 824 + kvm_host_va, NULL); 825 + } 826 + 827 + int topup_hyp_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages) 828 + { 829 + if (!is_protected_kvm_enabled()) 830 + return 0; 831 + 832 + return __topup_hyp_memcache(mc, min_pages, hyp_mc_alloc_fn, 833 + kvm_host_pa, NULL); 834 + } 835 + 857 836 /** 858 837 * kvm_phys_addr_ioremap - map a device range to guest IPA 859 
838 * ··· 914 841 915 842 write_lock(&kvm->mmu_lock); 916 843 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot, 917 - &cache); 844 + &cache, 0); 918 845 write_unlock(&kvm->mmu_lock); 919 846 if (ret) 920 847 break; ··· 1164 1091 * - mmap_lock protects between a VM faulting a page in and the VMM performing 1165 1092 * an mprotect() to add VM_MTE 1166 1093 */ 1167 - static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, 1168 - unsigned long size) 1094 + static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, 1095 + unsigned long size) 1169 1096 { 1170 1097 unsigned long i, nr_pages = size >> PAGE_SHIFT; 1171 - struct page *page; 1098 + struct page *page = pfn_to_page(pfn); 1172 1099 1173 1100 if (!kvm_has_mte(kvm)) 1174 - return 0; 1175 - 1176 - /* 1177 - * pfn_to_online_page() is used to reject ZONE_DEVICE pages 1178 - * that may not support tags. 1179 - */ 1180 - page = pfn_to_online_page(pfn); 1181 - 1182 - if (!page) 1183 - return -EFAULT; 1101 + return; 1184 1102 1185 1103 for (i = 0; i < nr_pages; i++, page++) { 1186 - if (!test_bit(PG_mte_tagged, &page->flags)) { 1104 + if (try_page_mte_tagging(page)) { 1187 1105 mte_clear_page_tags(page_address(page)); 1188 - set_bit(PG_mte_tagged, &page->flags); 1106 + set_page_mte_tagged(page); 1189 1107 } 1190 1108 } 1109 + } 1191 1110 1192 - return 0; 1111 + static bool kvm_vma_mte_allowed(struct vm_area_struct *vma) 1112 + { 1113 + return vma->vm_flags & VM_MTE_ALLOWED; 1193 1114 } 1194 1115 1195 1116 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, ··· 1194 1127 bool write_fault, writable, force_pte = false; 1195 1128 bool exec_fault; 1196 1129 bool device = false; 1197 - bool shared; 1198 1130 unsigned long mmu_seq; 1199 1131 struct kvm *kvm = vcpu->kvm; 1200 1132 struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; ··· 1202 1136 gfn_t gfn; 1203 1137 kvm_pfn_t pfn; 1204 1138 bool logging_active = memslot_is_logging(memslot); 1205 - bool use_read_lock = false; 
1206 1139 unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu); 1207 1140 unsigned long vma_pagesize, fault_granule; 1208 1141 enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; ··· 1236 1171 if (logging_active) { 1237 1172 force_pte = true; 1238 1173 vma_shift = PAGE_SHIFT; 1239 - use_read_lock = (fault_status == FSC_PERM && write_fault && 1240 - fault_granule == PAGE_SIZE); 1241 1174 } else { 1242 1175 vma_shift = get_vma_page_shift(vma, hva); 1243 1176 } 1244 - 1245 - shared = (vma->vm_flags & VM_SHARED); 1246 1177 1247 1178 switch (vma_shift) { 1248 1179 #ifndef __PAGETABLE_PMD_FOLDED ··· 1332 1271 if (exec_fault && device) 1333 1272 return -ENOEXEC; 1334 1273 1335 - /* 1336 - * To reduce MMU contentions and enhance concurrency during dirty 1337 - * logging dirty logging, only acquire read lock for permission 1338 - * relaxation. 1339 - */ 1340 - if (use_read_lock) 1341 - read_lock(&kvm->mmu_lock); 1342 - else 1343 - write_lock(&kvm->mmu_lock); 1274 + read_lock(&kvm->mmu_lock); 1344 1275 pgt = vcpu->arch.hw_mmu->pgt; 1345 1276 if (mmu_invalidate_retry(kvm, mmu_seq)) 1346 1277 goto out_unlock; ··· 1351 1298 } 1352 1299 1353 1300 if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) { 1354 - /* Check the VMM hasn't introduced a new VM_SHARED VMA */ 1355 - if (!shared) 1356 - ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); 1357 - else 1301 + /* Check the VMM hasn't introduced a new disallowed VMA */ 1302 + if (kvm_vma_mte_allowed(vma)) { 1303 + sanitise_mte_tags(kvm, pfn, vma_pagesize); 1304 + } else { 1358 1305 ret = -EFAULT; 1359 - if (ret) 1360 1306 goto out_unlock; 1307 + } 1361 1308 } 1362 1309 1363 1310 if (writable) ··· 1376 1323 * permissions only if vma_pagesize equals fault_granule. Otherwise, 1377 1324 * kvm_pgtable_stage2_map() should be called to change block size. 
1378 1325 */ 1379 - if (fault_status == FSC_PERM && vma_pagesize == fault_granule) { 1326 + if (fault_status == FSC_PERM && vma_pagesize == fault_granule) 1380 1327 ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot); 1381 - } else { 1382 - WARN_ONCE(use_read_lock, "Attempted stage-2 map outside of write lock\n"); 1383 - 1328 + else 1384 1329 ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize, 1385 1330 __pfn_to_phys(pfn), prot, 1386 - memcache); 1387 - } 1331 + memcache, KVM_PGTABLE_WALK_SHARED); 1388 1332 1389 1333 /* Mark the page dirty only if the fault is handled successfully */ 1390 1334 if (writable && !ret) { ··· 1390 1340 } 1391 1341 1392 1342 out_unlock: 1393 - if (use_read_lock) 1394 - read_unlock(&kvm->mmu_lock); 1395 - else 1396 - write_unlock(&kvm->mmu_lock); 1343 + read_unlock(&kvm->mmu_lock); 1397 1344 kvm_set_pfn_accessed(pfn); 1398 1345 kvm_release_pfn_clean(pfn); 1399 1346 return ret != -EAGAIN ? ret : 0; ··· 1573 1526 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) 1574 1527 { 1575 1528 kvm_pfn_t pfn = pte_pfn(range->pte); 1576 - int ret; 1577 1529 1578 1530 if (!kvm->arch.mmu.pgt) 1579 1531 return false; 1580 1532 1581 1533 WARN_ON(range->end - range->start != 1); 1582 1534 1583 - ret = sanitise_mte_tags(kvm, pfn, PAGE_SIZE); 1584 - if (ret) 1535 + /* 1536 + * If the page isn't tagged, defer to user_mem_abort() for sanitising 1537 + * the MTE tags. The S2 pte should have been unmapped by 1538 + * mmu_notifier_invalidate_range_end(). 
1539 + */ 1540 + if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) 1585 1541 return false; 1586 1542 1587 1543 /* ··· 1599 1549 */ 1600 1550 kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, 1601 1551 PAGE_SIZE, __pfn_to_phys(pfn), 1602 - KVM_PGTABLE_PROT_R, NULL); 1552 + KVM_PGTABLE_PROT_R, NULL, 0); 1603 1553 1604 1554 return false; 1605 1555 } ··· 1668 1618 int kvm_mmu_init(u32 *hyp_va_bits) 1669 1619 { 1670 1620 int err; 1621 + u32 idmap_bits; 1622 + u32 kernel_bits; 1671 1623 1672 1624 hyp_idmap_start = __pa_symbol(__hyp_idmap_text_start); 1673 1625 hyp_idmap_start = ALIGN_DOWN(hyp_idmap_start, PAGE_SIZE); ··· 1683 1631 */ 1684 1632 BUG_ON((hyp_idmap_start ^ (hyp_idmap_end - 1)) & PAGE_MASK); 1685 1633 1686 - *hyp_va_bits = 64 - ((idmap_t0sz & TCR_T0SZ_MASK) >> TCR_T0SZ_OFFSET); 1634 + /* 1635 + * The ID map may be configured to use an extended virtual address 1636 + * range. This is only the case if system RAM is out of range for the 1637 + * currently configured page size and VA_BITS_MIN, in which case we will 1638 + * also need the extended virtual range for the HYP ID map, or we won't 1639 + * be able to enable the EL2 MMU. 1640 + * 1641 + * However, in some cases the ID map may be configured for fewer than 1642 + * the number of VA bits used by the regular kernel stage 1. This 1643 + * happens when VA_BITS=52 and the kernel image is placed in PA space 1644 + * below 48 bits. 1645 + * 1646 + * At EL2, there is only one TTBR register, and we can't switch between 1647 + * translation tables *and* update TCR_EL2.T0SZ at the same time. Bottom 1648 + * line: we need to use the extended range with *both* our translation 1649 + * tables. 1650 + * 1651 + * So use the maximum of the idmap VA bits and the regular kernel stage 1652 + * 1 VA bits to assure that the hypervisor can both ID map its code page 1653 + * and map any kernel memory. 
1654 + */ 1655 + idmap_bits = 64 - ((idmap_t0sz & TCR_T0SZ_MASK) >> TCR_T0SZ_OFFSET); 1656 + kernel_bits = vabits_actual; 1657 + *hyp_va_bits = max(idmap_bits, kernel_bits); 1658 + 1687 1659 kvm_debug("Using %u-bit virtual addresses at EL2\n", *hyp_va_bits); 1688 1660 kvm_debug("IDMAP page: %lx\n", hyp_idmap_start); 1689 1661 kvm_debug("HYP VA range: %lx:%lx\n", ··· 1816 1740 if (!vma) 1817 1741 break; 1818 1742 1819 - /* 1820 - * VM_SHARED mappings are not allowed with MTE to avoid races 1821 - * when updating the PG_mte_tagged page flag, see 1822 - * sanitise_mte_tags for more details. 1823 - */ 1824 - if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED) { 1743 + if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) { 1825 1744 ret = -EINVAL; 1826 1745 break; 1827 1746 }
+122 -16
arch/arm64/kvm/pkvm.c
··· 6 6 7 7 #include <linux/kvm_host.h> 8 8 #include <linux/memblock.h> 9 + #include <linux/mutex.h> 9 10 #include <linux/sort.h> 10 11 11 12 #include <asm/kvm_pkvm.h> ··· 54 53 55 54 void __init kvm_hyp_reserve(void) 56 55 { 57 - u64 nr_pages, prev, hyp_mem_pages = 0; 56 + u64 hyp_mem_pages = 0; 58 57 int ret; 59 58 60 59 if (!is_hyp_mode_available() || is_kernel_in_hyp_mode()) ··· 72 71 73 72 hyp_mem_pages += hyp_s1_pgtable_pages(); 74 73 hyp_mem_pages += host_s2_pgtable_pages(); 75 - 76 - /* 77 - * The hyp_vmemmap needs to be backed by pages, but these pages 78 - * themselves need to be present in the vmemmap, so compute the number 79 - * of pages needed by looking for a fixed point. 80 - */ 81 - nr_pages = 0; 82 - do { 83 - prev = nr_pages; 84 - nr_pages = hyp_mem_pages + prev; 85 - nr_pages = DIV_ROUND_UP(nr_pages * STRUCT_HYP_PAGE_SIZE, 86 - PAGE_SIZE); 87 - nr_pages += __hyp_pgtable_max_pages(nr_pages); 88 - } while (nr_pages != prev); 89 - hyp_mem_pages += nr_pages; 74 + hyp_mem_pages += hyp_vm_table_pages(); 75 + hyp_mem_pages += hyp_vmemmap_pages(STRUCT_HYP_PAGE_SIZE); 90 76 91 77 /* 92 78 * Try to allocate a PMD-aligned region to reduce TLB pressure once ··· 94 106 95 107 kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20, 96 108 hyp_mem_base); 109 + } 110 + 111 + /* 112 + * Allocates and donates memory for hypervisor VM structs at EL2. 113 + * 114 + * Allocates space for the VM state, which includes the hyp vm as well as 115 + * the hyp vcpus. 116 + * 117 + * Stores an opaque handler in the kvm struct for future reference. 118 + * 119 + * Return 0 on success, negative error code on failure. 
120 + */ 121 + static int __pkvm_create_hyp_vm(struct kvm *host_kvm) 122 + { 123 + size_t pgd_sz, hyp_vm_sz, hyp_vcpu_sz; 124 + struct kvm_vcpu *host_vcpu; 125 + pkvm_handle_t handle; 126 + void *pgd, *hyp_vm; 127 + unsigned long idx; 128 + int ret; 129 + 130 + if (host_kvm->created_vcpus < 1) 131 + return -EINVAL; 132 + 133 + pgd_sz = kvm_pgtable_stage2_pgd_size(host_kvm->arch.vtcr); 134 + 135 + /* 136 + * The PGD pages will be reclaimed using a hyp_memcache which implies 137 + * page granularity. So, use alloc_pages_exact() to get individual 138 + * refcounts. 139 + */ 140 + pgd = alloc_pages_exact(pgd_sz, GFP_KERNEL_ACCOUNT); 141 + if (!pgd) 142 + return -ENOMEM; 143 + 144 + /* Allocate memory to donate to hyp for vm and vcpu pointers. */ 145 + hyp_vm_sz = PAGE_ALIGN(size_add(PKVM_HYP_VM_SIZE, 146 + size_mul(sizeof(void *), 147 + host_kvm->created_vcpus))); 148 + hyp_vm = alloc_pages_exact(hyp_vm_sz, GFP_KERNEL_ACCOUNT); 149 + if (!hyp_vm) { 150 + ret = -ENOMEM; 151 + goto free_pgd; 152 + } 153 + 154 + /* Donate the VM memory to hyp and let hyp initialize it. */ 155 + ret = kvm_call_hyp_nvhe(__pkvm_init_vm, host_kvm, hyp_vm, pgd); 156 + if (ret < 0) 157 + goto free_vm; 158 + 159 + handle = ret; 160 + 161 + host_kvm->arch.pkvm.handle = handle; 162 + 163 + /* Donate memory for the vcpus at hyp and initialize it. */ 164 + hyp_vcpu_sz = PAGE_ALIGN(PKVM_HYP_VCPU_SIZE); 165 + kvm_for_each_vcpu(idx, host_vcpu, host_kvm) { 166 + void *hyp_vcpu; 167 + 168 + /* Indexing of the vcpus to be sequential starting at 0. 
*/ 169 + if (WARN_ON(host_vcpu->vcpu_idx != idx)) { 170 + ret = -EINVAL; 171 + goto destroy_vm; 172 + } 173 + 174 + hyp_vcpu = alloc_pages_exact(hyp_vcpu_sz, GFP_KERNEL_ACCOUNT); 175 + if (!hyp_vcpu) { 176 + ret = -ENOMEM; 177 + goto destroy_vm; 178 + } 179 + 180 + ret = kvm_call_hyp_nvhe(__pkvm_init_vcpu, handle, host_vcpu, 181 + hyp_vcpu); 182 + if (ret) { 183 + free_pages_exact(hyp_vcpu, hyp_vcpu_sz); 184 + goto destroy_vm; 185 + } 186 + } 187 + 188 + return 0; 189 + 190 + destroy_vm: 191 + pkvm_destroy_hyp_vm(host_kvm); 192 + return ret; 193 + free_vm: 194 + free_pages_exact(hyp_vm, hyp_vm_sz); 195 + free_pgd: 196 + free_pages_exact(pgd, pgd_sz); 197 + return ret; 198 + } 199 + 200 + int pkvm_create_hyp_vm(struct kvm *host_kvm) 201 + { 202 + int ret = 0; 203 + 204 + mutex_lock(&host_kvm->lock); 205 + if (!host_kvm->arch.pkvm.handle) 206 + ret = __pkvm_create_hyp_vm(host_kvm); 207 + mutex_unlock(&host_kvm->lock); 208 + 209 + return ret; 210 + } 211 + 212 + void pkvm_destroy_hyp_vm(struct kvm *host_kvm) 213 + { 214 + if (host_kvm->arch.pkvm.handle) { 215 + WARN_ON(kvm_call_hyp_nvhe(__pkvm_teardown_vm, 216 + host_kvm->arch.pkvm.handle)); 217 + } 218 + 219 + host_kvm->arch.pkvm.handle = 0; 220 + free_hyp_memcache(&host_kvm->arch.pkvm.teardown_mc); 221 + } 222 + 223 + int pkvm_init_host_vm(struct kvm *host_kvm) 224 + { 225 + mutex_init(&host_kvm->lock); 226 + return 0; 97 227 }
+204 -294
arch/arm64/kvm/pmu-emul.c
··· 15 15 #include <kvm/arm_pmu.h> 16 16 #include <kvm/arm_vgic.h> 17 17 18 + #define PERF_ATTR_CFG1_COUNTER_64BIT BIT(0) 19 + 18 20 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available); 19 21 20 22 static LIST_HEAD(arm_pmus); 21 23 static DEFINE_MUTEX(arm_pmus_lock); 22 24 23 - static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx); 24 - static void kvm_pmu_update_pmc_chained(struct kvm_vcpu *vcpu, u64 select_idx); 25 - static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc); 25 + static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc); 26 + static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc); 26 27 27 - #define PERF_ATTR_CFG1_KVM_PMU_CHAINED 0x1 28 + static struct kvm_vcpu *kvm_pmc_to_vcpu(const struct kvm_pmc *pmc) 29 + { 30 + return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[pmc->idx]); 31 + } 32 + 33 + static struct kvm_pmc *kvm_vcpu_idx_to_pmc(struct kvm_vcpu *vcpu, int cnt_idx) 34 + { 35 + return &vcpu->arch.pmu.pmc[cnt_idx]; 36 + } 28 37 29 38 static u32 kvm_pmu_event_mask(struct kvm *kvm) 30 39 { ··· 56 47 } 57 48 58 49 /** 59 - * kvm_pmu_idx_is_64bit - determine if select_idx is a 64bit counter 60 - * @vcpu: The vcpu pointer 61 - * @select_idx: The counter index 50 + * kvm_pmc_is_64bit - determine if counter is 64bit 51 + * @pmc: counter context 62 52 */ 63 - static bool kvm_pmu_idx_is_64bit(struct kvm_vcpu *vcpu, u64 select_idx) 53 + static bool kvm_pmc_is_64bit(struct kvm_pmc *pmc) 64 54 { 65 - return (select_idx == ARMV8_PMU_CYCLE_IDX && 66 - __vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_LC); 55 + return (pmc->idx == ARMV8_PMU_CYCLE_IDX || 56 + kvm_pmu_is_3p5(kvm_pmc_to_vcpu(pmc))); 67 57 } 68 58 69 - static struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc *pmc) 59 + static bool kvm_pmc_has_64bit_overflow(struct kvm_pmc *pmc) 70 60 { 71 - struct kvm_pmu *pmu; 72 - struct kvm_vcpu_arch *vcpu_arch; 61 + u64 val = __vcpu_sys_reg(kvm_pmc_to_vcpu(pmc), PMCR_EL0); 73 62 74 - pmc -= pmc->idx; 75 - pmu = 
container_of(pmc, struct kvm_pmu, pmc[0]); 76 - vcpu_arch = container_of(pmu, struct kvm_vcpu_arch, pmu); 77 - return container_of(vcpu_arch, struct kvm_vcpu, arch); 63 + return (pmc->idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) || 64 + (pmc->idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC)); 78 65 } 79 66 80 - /** 81 - * kvm_pmu_pmc_is_chained - determine if the pmc is chained 82 - * @pmc: The PMU counter pointer 83 - */ 84 - static bool kvm_pmu_pmc_is_chained(struct kvm_pmc *pmc) 67 + static bool kvm_pmu_counter_can_chain(struct kvm_pmc *pmc) 68 + { 69 + return (!(pmc->idx & 1) && (pmc->idx + 1) < ARMV8_PMU_CYCLE_IDX && 70 + !kvm_pmc_has_64bit_overflow(pmc)); 71 + } 72 + 73 + static u32 counter_index_to_reg(u64 idx) 74 + { 75 + return (idx == ARMV8_PMU_CYCLE_IDX) ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + idx; 76 + } 77 + 78 + static u32 counter_index_to_evtreg(u64 idx) 79 + { 80 + return (idx == ARMV8_PMU_CYCLE_IDX) ? PMCCFILTR_EL0 : PMEVTYPER0_EL0 + idx; 81 + } 82 + 83 + static u64 kvm_pmu_get_pmc_value(struct kvm_pmc *pmc) 85 84 { 86 85 struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 86 + u64 counter, reg, enabled, running; 87 87 88 - return test_bit(pmc->idx >> 1, vcpu->arch.pmu.chained); 89 - } 90 - 91 - /** 92 - * kvm_pmu_idx_is_high_counter - determine if select_idx is a high/low counter 93 - * @select_idx: The counter index 94 - */ 95 - static bool kvm_pmu_idx_is_high_counter(u64 select_idx) 96 - { 97 - return select_idx & 0x1; 98 - } 99 - 100 - /** 101 - * kvm_pmu_get_canonical_pmc - obtain the canonical pmc 102 - * @pmc: The PMU counter pointer 103 - * 104 - * When a pair of PMCs are chained together we use the low counter (canonical) 105 - * to hold the underlying perf event. 
106 - */ 107 - static struct kvm_pmc *kvm_pmu_get_canonical_pmc(struct kvm_pmc *pmc) 108 - { 109 - if (kvm_pmu_pmc_is_chained(pmc) && 110 - kvm_pmu_idx_is_high_counter(pmc->idx)) 111 - return pmc - 1; 112 - 113 - return pmc; 114 - } 115 - static struct kvm_pmc *kvm_pmu_get_alternate_pmc(struct kvm_pmc *pmc) 116 - { 117 - if (kvm_pmu_idx_is_high_counter(pmc->idx)) 118 - return pmc - 1; 119 - else 120 - return pmc + 1; 121 - } 122 - 123 - /** 124 - * kvm_pmu_idx_has_chain_evtype - determine if the event type is chain 125 - * @vcpu: The vcpu pointer 126 - * @select_idx: The counter index 127 - */ 128 - static bool kvm_pmu_idx_has_chain_evtype(struct kvm_vcpu *vcpu, u64 select_idx) 129 - { 130 - u64 eventsel, reg; 131 - 132 - select_idx |= 0x1; 133 - 134 - if (select_idx == ARMV8_PMU_CYCLE_IDX) 135 - return false; 136 - 137 - reg = PMEVTYPER0_EL0 + select_idx; 138 - eventsel = __vcpu_sys_reg(vcpu, reg) & kvm_pmu_event_mask(vcpu->kvm); 139 - 140 - return eventsel == ARMV8_PMUV3_PERFCTR_CHAIN; 141 - } 142 - 143 - /** 144 - * kvm_pmu_get_pair_counter_value - get PMU counter value 145 - * @vcpu: The vcpu pointer 146 - * @pmc: The PMU counter pointer 147 - */ 148 - static u64 kvm_pmu_get_pair_counter_value(struct kvm_vcpu *vcpu, 149 - struct kvm_pmc *pmc) 150 - { 151 - u64 counter, counter_high, reg, enabled, running; 152 - 153 - if (kvm_pmu_pmc_is_chained(pmc)) { 154 - pmc = kvm_pmu_get_canonical_pmc(pmc); 155 - reg = PMEVCNTR0_EL0 + pmc->idx; 156 - 157 - counter = __vcpu_sys_reg(vcpu, reg); 158 - counter_high = __vcpu_sys_reg(vcpu, reg + 1); 159 - 160 - counter = lower_32_bits(counter) | (counter_high << 32); 161 - } else { 162 - reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX) 163 - ? 
PMCCNTR_EL0 : PMEVCNTR0_EL0 + pmc->idx; 164 - counter = __vcpu_sys_reg(vcpu, reg); 165 - } 88 + reg = counter_index_to_reg(pmc->idx); 89 + counter = __vcpu_sys_reg(vcpu, reg); 166 90 167 91 /* 168 92 * The real counter value is equal to the value of counter register plus ··· 104 162 if (pmc->perf_event) 105 163 counter += perf_event_read_value(pmc->perf_event, &enabled, 106 164 &running); 165 + 166 + if (!kvm_pmc_is_64bit(pmc)) 167 + counter = lower_32_bits(counter); 107 168 108 169 return counter; 109 170 } ··· 118 173 */ 119 174 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx) 120 175 { 121 - u64 counter; 122 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 123 - struct kvm_pmc *pmc = &pmu->pmc[select_idx]; 124 - 125 176 if (!kvm_vcpu_has_pmu(vcpu)) 126 177 return 0; 127 178 128 - counter = kvm_pmu_get_pair_counter_value(vcpu, pmc); 179 + return kvm_pmu_get_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, select_idx)); 180 + } 129 181 130 - if (kvm_pmu_pmc_is_chained(pmc) && 131 - kvm_pmu_idx_is_high_counter(select_idx)) 132 - counter = upper_32_bits(counter); 133 - else if (select_idx != ARMV8_PMU_CYCLE_IDX) 134 - counter = lower_32_bits(counter); 182 + static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) 183 + { 184 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 185 + u64 reg; 135 186 136 - return counter; 187 + kvm_pmu_release_perf_event(pmc); 188 + 189 + reg = counter_index_to_reg(pmc->idx); 190 + 191 + if (vcpu_mode_is_32bit(vcpu) && pmc->idx != ARMV8_PMU_CYCLE_IDX && 192 + !force) { 193 + /* 194 + * Even with PMUv3p5, AArch32 cannot write to the top 195 + * 32bit of the counters. The only possible course of 196 + * action is to use PMCR.P, which will reset them to 197 + * 0 (the only use of the 'force' parameter). 
198 + */ 199 + val = __vcpu_sys_reg(vcpu, reg) & GENMASK(63, 32); 200 + val |= lower_32_bits(val); 201 + } 202 + 203 + __vcpu_sys_reg(vcpu, reg) = val; 204 + 205 + /* Recreate the perf event to reflect the updated sample_period */ 206 + kvm_pmu_create_perf_event(pmc); 137 207 } 138 208 139 209 /** ··· 159 199 */ 160 200 void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) 161 201 { 162 - u64 reg; 163 - 164 202 if (!kvm_vcpu_has_pmu(vcpu)) 165 203 return; 166 204 167 - reg = (select_idx == ARMV8_PMU_CYCLE_IDX) 168 - ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + select_idx; 169 - __vcpu_sys_reg(vcpu, reg) += (s64)val - kvm_pmu_get_counter_value(vcpu, select_idx); 170 - 171 - /* Recreate the perf event to reflect the updated sample_period */ 172 - kvm_pmu_create_perf_event(vcpu, select_idx); 205 + kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, select_idx), val, false); 173 206 } 174 207 175 208 /** ··· 171 218 */ 172 219 static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc) 173 220 { 174 - pmc = kvm_pmu_get_canonical_pmc(pmc); 175 221 if (pmc->perf_event) { 176 222 perf_event_disable(pmc->perf_event); 177 223 perf_event_release_kernel(pmc->perf_event); ··· 184 232 * 185 233 * If this counter has been configured to monitor some event, release it here. 
186 234 */ 187 - static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc) 235 + static void kvm_pmu_stop_counter(struct kvm_pmc *pmc) 188 236 { 189 - u64 counter, reg, val; 237 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 238 + u64 reg, val; 190 239 191 - pmc = kvm_pmu_get_canonical_pmc(pmc); 192 240 if (!pmc->perf_event) 193 241 return; 194 242 195 - counter = kvm_pmu_get_pair_counter_value(vcpu, pmc); 243 + val = kvm_pmu_get_pmc_value(pmc); 196 244 197 - if (pmc->idx == ARMV8_PMU_CYCLE_IDX) { 198 - reg = PMCCNTR_EL0; 199 - val = counter; 200 - } else { 201 - reg = PMEVCNTR0_EL0 + pmc->idx; 202 - val = lower_32_bits(counter); 203 - } 245 + reg = counter_index_to_reg(pmc->idx); 204 246 205 247 __vcpu_sys_reg(vcpu, reg) = val; 206 - 207 - if (kvm_pmu_pmc_is_chained(pmc)) 208 - __vcpu_sys_reg(vcpu, reg + 1) = upper_32_bits(counter); 209 248 210 249 kvm_pmu_release_perf_event(pmc); 211 250 } ··· 223 280 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) 224 281 { 225 282 unsigned long mask = kvm_pmu_valid_counter_mask(vcpu); 226 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 227 283 int i; 228 284 229 285 for_each_set_bit(i, &mask, 32) 230 - kvm_pmu_stop_counter(vcpu, &pmu->pmc[i]); 231 - 232 - bitmap_zero(vcpu->arch.pmu.chained, ARMV8_PMU_MAX_COUNTER_PAIRS); 286 + kvm_pmu_stop_counter(kvm_vcpu_idx_to_pmc(vcpu, i)); 233 287 } 234 288 235 289 /** ··· 237 297 void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) 238 298 { 239 299 int i; 240 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 241 300 242 301 for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) 243 - kvm_pmu_release_perf_event(&pmu->pmc[i]); 302 + kvm_pmu_release_perf_event(kvm_vcpu_idx_to_pmc(vcpu, i)); 244 303 irq_work_sync(&vcpu->arch.pmu.overflow_work); 245 304 } 246 305 ··· 264 325 void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, u64 val) 265 326 { 266 327 int i; 267 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 268 - struct kvm_pmc *pmc; 269 - 270 328 if (!kvm_vcpu_has_pmu(vcpu)) 271 329 return; 272 330 ··· 
271 335 return; 272 336 273 337 for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) { 338 + struct kvm_pmc *pmc; 339 + 274 340 if (!(val & BIT(i))) 275 341 continue; 276 342 277 - pmc = &pmu->pmc[i]; 343 + pmc = kvm_vcpu_idx_to_pmc(vcpu, i); 278 344 279 - /* A change in the enable state may affect the chain state */ 280 - kvm_pmu_update_pmc_chained(vcpu, i); 281 - kvm_pmu_create_perf_event(vcpu, i); 282 - 283 - /* At this point, pmc must be the canonical */ 284 - if (pmc->perf_event) { 345 + if (!pmc->perf_event) { 346 + kvm_pmu_create_perf_event(pmc); 347 + } else { 285 348 perf_event_enable(pmc->perf_event); 286 349 if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE) 287 350 kvm_debug("fail to enable perf event\n"); ··· 298 363 void kvm_pmu_disable_counter_mask(struct kvm_vcpu *vcpu, u64 val) 299 364 { 300 365 int i; 301 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 302 - struct kvm_pmc *pmc; 303 366 304 367 if (!kvm_vcpu_has_pmu(vcpu) || !val) 305 368 return; 306 369 307 370 for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) { 371 + struct kvm_pmc *pmc; 372 + 308 373 if (!(val & BIT(i))) 309 374 continue; 310 375 311 - pmc = &pmu->pmc[i]; 376 + pmc = kvm_vcpu_idx_to_pmc(vcpu, i); 312 377 313 - /* A change in the enable state may affect the chain state */ 314 - kvm_pmu_update_pmc_chained(vcpu, i); 315 - kvm_pmu_create_perf_event(vcpu, i); 316 - 317 - /* At this point, pmc must be the canonical */ 318 378 if (pmc->perf_event) 319 379 perf_event_disable(pmc->perf_event); 320 380 } ··· 406 476 static void kvm_pmu_perf_overflow_notify_vcpu(struct irq_work *work) 407 477 { 408 478 struct kvm_vcpu *vcpu; 409 - struct kvm_pmu *pmu; 410 479 411 - pmu = container_of(work, struct kvm_pmu, overflow_work); 412 - vcpu = kvm_pmc_to_vcpu(pmu->pmc); 413 - 480 + vcpu = container_of(work, struct kvm_vcpu, arch.pmu.overflow_work); 414 481 kvm_vcpu_kick(vcpu); 482 + } 483 + 484 + /* 485 + * Perform an increment on any of the counters described in @mask, 486 + * generating the overflow if required, 
and propagate it as a chained 487 + * event if possible. 488 + */ 489 + static void kvm_pmu_counter_increment(struct kvm_vcpu *vcpu, 490 + unsigned long mask, u32 event) 491 + { 492 + int i; 493 + 494 + if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E)) 495 + return; 496 + 497 + /* Weed out disabled counters */ 498 + mask &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); 499 + 500 + for_each_set_bit(i, &mask, ARMV8_PMU_CYCLE_IDX) { 501 + struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); 502 + u64 type, reg; 503 + 504 + /* Filter on event type */ 505 + type = __vcpu_sys_reg(vcpu, counter_index_to_evtreg(i)); 506 + type &= kvm_pmu_event_mask(vcpu->kvm); 507 + if (type != event) 508 + continue; 509 + 510 + /* Increment this counter */ 511 + reg = __vcpu_sys_reg(vcpu, counter_index_to_reg(i)) + 1; 512 + if (!kvm_pmc_is_64bit(pmc)) 513 + reg = lower_32_bits(reg); 514 + __vcpu_sys_reg(vcpu, counter_index_to_reg(i)) = reg; 515 + 516 + /* No overflow? move on */ 517 + if (kvm_pmc_has_64bit_overflow(pmc) ? reg : lower_32_bits(reg)) 518 + continue; 519 + 520 + /* Mark overflow */ 521 + __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i); 522 + 523 + if (kvm_pmu_counter_can_chain(pmc)) 524 + kvm_pmu_counter_increment(vcpu, BIT(i + 1), 525 + ARMV8_PMUV3_PERFCTR_CHAIN); 526 + } 527 + } 528 + 529 + /* Compute the sample period for a given counter value */ 530 + static u64 compute_period(struct kvm_pmc *pmc, u64 counter) 531 + { 532 + u64 val; 533 + 534 + if (kvm_pmc_is_64bit(pmc) && kvm_pmc_has_64bit_overflow(pmc)) 535 + val = (-counter) & GENMASK(63, 0); 536 + else 537 + val = (-counter) & GENMASK(31, 0); 538 + 539 + return val; 415 540 } 416 541 417 542 /** ··· 488 503 * Reset the sample period to the architectural limit, 489 504 * i.e. the point where the counter overflows. 
490 505 */ 491 - period = -(local64_read(&perf_event->count)); 492 - 493 - if (!kvm_pmu_idx_is_64bit(vcpu, pmc->idx)) 494 - period &= GENMASK(31, 0); 506 + period = compute_period(pmc, local64_read(&perf_event->count)); 495 507 496 508 local64_set(&perf_event->hw.period_left, 0); 497 509 perf_event->attr.sample_period = period; 498 510 perf_event->hw.sample_period = period; 499 511 500 512 __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(idx); 513 + 514 + if (kvm_pmu_counter_can_chain(pmc)) 515 + kvm_pmu_counter_increment(vcpu, BIT(idx + 1), 516 + ARMV8_PMUV3_PERFCTR_CHAIN); 501 517 502 518 if (kvm_pmu_overflow_status(vcpu)) { 503 519 kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu); ··· 519 533 */ 520 534 void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val) 521 535 { 522 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 523 - int i; 524 - 525 - if (!kvm_vcpu_has_pmu(vcpu)) 526 - return; 527 - 528 - if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E)) 529 - return; 530 - 531 - /* Weed out disabled counters */ 532 - val &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); 533 - 534 - for (i = 0; i < ARMV8_PMU_CYCLE_IDX; i++) { 535 - u64 type, reg; 536 - 537 - if (!(val & BIT(i))) 538 - continue; 539 - 540 - /* PMSWINC only applies to ... SW_INC! 
*/ 541 - type = __vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + i); 542 - type &= kvm_pmu_event_mask(vcpu->kvm); 543 - if (type != ARMV8_PMUV3_PERFCTR_SW_INCR) 544 - continue; 545 - 546 - /* increment this even SW_INC counter */ 547 - reg = __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) + 1; 548 - reg = lower_32_bits(reg); 549 - __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) = reg; 550 - 551 - if (reg) /* no overflow on the low part */ 552 - continue; 553 - 554 - if (kvm_pmu_pmc_is_chained(&pmu->pmc[i])) { 555 - /* increment the high counter */ 556 - reg = __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i + 1) + 1; 557 - reg = lower_32_bits(reg); 558 - __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i + 1) = reg; 559 - if (!reg) /* mark overflow on the high counter */ 560 - __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i + 1); 561 - } else { 562 - /* mark overflow on low counter */ 563 - __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i); 564 - } 565 - } 536 + kvm_pmu_counter_increment(vcpu, val, ARMV8_PMUV3_PERFCTR_SW_INCR); 566 537 } 567 538 568 539 /** ··· 533 590 534 591 if (!kvm_vcpu_has_pmu(vcpu)) 535 592 return; 593 + 594 + /* Fixup PMCR_EL0 to reconcile the PMU version and the LP bit */ 595 + if (!kvm_pmu_is_3p5(vcpu)) 596 + val &= ~ARMV8_PMU_PMCR_LP; 597 + 598 + __vcpu_sys_reg(vcpu, PMCR_EL0) = val; 536 599 537 600 if (val & ARMV8_PMU_PMCR_E) { 538 601 kvm_pmu_enable_counter_mask(vcpu, ··· 555 606 unsigned long mask = kvm_pmu_valid_counter_mask(vcpu); 556 607 mask &= ~BIT(ARMV8_PMU_CYCLE_IDX); 557 608 for_each_set_bit(i, &mask, 32) 558 - kvm_pmu_set_counter_value(vcpu, i, 0); 609 + kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, i), 0, true); 559 610 } 560 611 } 561 612 562 - static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx) 613 + static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc) 563 614 { 615 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 564 616 return (__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E) && 565 - (__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(select_idx)); 
617 + (__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(pmc->idx)); 566 618 } 567 619 568 620 /** 569 621 * kvm_pmu_create_perf_event - create a perf event for a counter 570 - * @vcpu: The vcpu pointer 571 - * @select_idx: The number of selected counter 622 + * @pmc: Counter context 572 623 */ 573 - static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx) 624 + static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) 574 625 { 626 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 575 627 struct arm_pmu *arm_pmu = vcpu->kvm->arch.arm_pmu; 576 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 577 - struct kvm_pmc *pmc; 578 628 struct perf_event *event; 579 629 struct perf_event_attr attr; 580 - u64 eventsel, counter, reg, data; 630 + u64 eventsel, reg, data; 581 631 582 - /* 583 - * For chained counters the event type and filtering attributes are 584 - * obtained from the low/even counter. We also use this counter to 585 - * determine if the event is enabled/disabled. 586 - */ 587 - pmc = kvm_pmu_get_canonical_pmc(&pmu->pmc[select_idx]); 588 - 589 - reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX) 590 - ? PMCCFILTR_EL0 : PMEVTYPER0_EL0 + pmc->idx; 632 + reg = counter_index_to_evtreg(pmc->idx); 591 633 data = __vcpu_sys_reg(vcpu, reg); 592 634 593 - kvm_pmu_stop_counter(vcpu, pmc); 635 + kvm_pmu_stop_counter(pmc); 594 636 if (pmc->idx == ARMV8_PMU_CYCLE_IDX) 595 637 eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; 596 638 else 597 639 eventsel = data & kvm_pmu_event_mask(vcpu->kvm); 598 640 599 - /* Software increment event doesn't need to be backed by a perf event */ 600 - if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR) 641 + /* 642 + * Neither SW increment nor chained events need to be backed 643 + * by a perf event. 
644 + */ 645 + if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR || 646 + eventsel == ARMV8_PMUV3_PERFCTR_CHAIN) 601 647 return; 602 648 603 649 /* ··· 607 663 attr.type = arm_pmu->pmu.type; 608 664 attr.size = sizeof(attr); 609 665 attr.pinned = 1; 610 - attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx); 666 + attr.disabled = !kvm_pmu_counter_is_enabled(pmc); 611 667 attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0; 612 668 attr.exclude_kernel = data & ARMV8_PMU_EXCLUDE_EL1 ? 1 : 0; 613 669 attr.exclude_hv = 1; /* Don't count EL2 events */ 614 670 attr.exclude_host = 1; /* Don't count host events */ 615 671 attr.config = eventsel; 616 672 617 - counter = kvm_pmu_get_pair_counter_value(vcpu, pmc); 673 + /* 674 + * If counting with a 64bit counter, advertise it to the perf 675 + * code, carefully dealing with the initial sample period 676 + * which also depends on the overflow. 677 + */ 678 + if (kvm_pmc_is_64bit(pmc)) 679 + attr.config1 |= PERF_ATTR_CFG1_COUNTER_64BIT; 618 680 619 - if (kvm_pmu_pmc_is_chained(pmc)) { 620 - /** 621 - * The initial sample period (overflow count) of an event. For 622 - * chained counters we only support overflow interrupts on the 623 - * high counter. 624 - */ 625 - attr.sample_period = (-counter) & GENMASK(63, 0); 626 - attr.config1 |= PERF_ATTR_CFG1_KVM_PMU_CHAINED; 681 + attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(pmc)); 627 682 628 - event = perf_event_create_kernel_counter(&attr, -1, current, 629 - kvm_pmu_perf_overflow, 630 - pmc + 1); 631 - } else { 632 - /* The initial sample period (overflow count) of an event. 
*/ 633 - if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx)) 634 - attr.sample_period = (-counter) & GENMASK(63, 0); 635 - else 636 - attr.sample_period = (-counter) & GENMASK(31, 0); 637 - 638 - event = perf_event_create_kernel_counter(&attr, -1, current, 683 + event = perf_event_create_kernel_counter(&attr, -1, current, 639 684 kvm_pmu_perf_overflow, pmc); 640 - } 641 685 642 686 if (IS_ERR(event)) { 643 687 pr_err_once("kvm: pmu event creation failed %ld\n", ··· 634 702 } 635 703 636 704 pmc->perf_event = event; 637 - } 638 - 639 - /** 640 - * kvm_pmu_update_pmc_chained - update chained bitmap 641 - * @vcpu: The vcpu pointer 642 - * @select_idx: The number of selected counter 643 - * 644 - * Update the chained bitmap based on the event type written in the 645 - * typer register and the enable state of the odd register. 646 - */ 647 - static void kvm_pmu_update_pmc_chained(struct kvm_vcpu *vcpu, u64 select_idx) 648 - { 649 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 650 - struct kvm_pmc *pmc = &pmu->pmc[select_idx], *canonical_pmc; 651 - bool new_state, old_state; 652 - 653 - old_state = kvm_pmu_pmc_is_chained(pmc); 654 - new_state = kvm_pmu_idx_has_chain_evtype(vcpu, pmc->idx) && 655 - kvm_pmu_counter_is_enabled(vcpu, pmc->idx | 0x1); 656 - 657 - if (old_state == new_state) 658 - return; 659 - 660 - canonical_pmc = kvm_pmu_get_canonical_pmc(pmc); 661 - kvm_pmu_stop_counter(vcpu, canonical_pmc); 662 - if (new_state) { 663 - /* 664 - * During promotion from !chained to chained we must ensure 665 - * the adjacent counter is stopped and its event destroyed 666 - */ 667 - kvm_pmu_stop_counter(vcpu, kvm_pmu_get_alternate_pmc(pmc)); 668 - set_bit(pmc->idx >> 1, vcpu->arch.pmu.chained); 669 - return; 670 - } 671 - clear_bit(pmc->idx >> 1, vcpu->arch.pmu.chained); 672 705 } 673 706 674 707 /** ··· 649 752 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data, 650 753 u64 select_idx) 651 754 { 755 + struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, select_idx); 652 756 u64 
reg, mask; 653 757 654 758 if (!kvm_vcpu_has_pmu(vcpu)) ··· 659 761 mask &= ~ARMV8_PMU_EVTYPE_EVENT; 660 762 mask |= kvm_pmu_event_mask(vcpu->kvm); 661 763 662 - reg = (select_idx == ARMV8_PMU_CYCLE_IDX) 663 - ? PMCCFILTR_EL0 : PMEVTYPER0_EL0 + select_idx; 764 + reg = counter_index_to_evtreg(pmc->idx); 664 765 665 766 __vcpu_sys_reg(vcpu, reg) = data & mask; 666 767 667 - kvm_pmu_update_pmc_chained(vcpu, select_idx); 668 - kvm_pmu_create_perf_event(vcpu, select_idx); 768 + kvm_pmu_create_perf_event(pmc); 669 769 } 670 770 671 771 void kvm_host_pmu_init(struct arm_pmu *pmu) 672 772 { 673 773 struct arm_pmu_entry *entry; 674 774 675 - if (pmu->pmuver == 0 || pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) 775 + if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI || 776 + pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) 676 777 return; 677 778 678 779 mutex_lock(&arm_pmus_lock); ··· 724 827 725 828 if (event->pmu) { 726 829 pmu = to_arm_pmu(event->pmu); 727 - if (pmu->pmuver == 0 || 830 + if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI || 728 831 pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) 729 832 pmu = NULL; 730 833 } ··· 746 849 747 850 if (!pmceid1) { 748 851 val = read_sysreg(pmceid0_el0); 852 + /* always support CHAIN */ 853 + val |= BIT(ARMV8_PMUV3_PERFCTR_CHAIN); 749 854 base = 0; 750 855 } else { 751 856 val = read_sysreg(pmceid1_el0); ··· 1048 1149 } 1049 1150 1050 1151 return -ENXIO; 1152 + } 1153 + 1154 + u8 kvm_arm_pmu_get_pmuver_limit(void) 1155 + { 1156 + u64 tmp; 1157 + 1158 + tmp = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1); 1159 + tmp = cpuid_feature_cap_perfmon_field(tmp, 1160 + ID_AA64DFR0_EL1_PMUVer_SHIFT, 1161 + ID_AA64DFR0_EL1_PMUVer_V3P5); 1162 + return FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), tmp); 1051 1163 }
-29
arch/arm64/kvm/reset.c
··· 395 395 396 396 return 0; 397 397 } 398 - 399 - int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type) 400 - { 401 - u64 mmfr0, mmfr1; 402 - u32 phys_shift; 403 - 404 - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) 405 - return -EINVAL; 406 - 407 - phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); 408 - if (phys_shift) { 409 - if (phys_shift > kvm_ipa_limit || 410 - phys_shift < ARM64_MIN_PARANGE_BITS) 411 - return -EINVAL; 412 - } else { 413 - phys_shift = KVM_PHYS_SHIFT; 414 - if (phys_shift > kvm_ipa_limit) { 415 - pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n", 416 - current->comm); 417 - return -EINVAL; 418 - } 419 - } 420 - 421 - mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1); 422 - mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1); 423 - kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift); 424 - 425 - return 0; 426 - }
+135 -22
arch/arm64/kvm/sys_regs.c
··· 639 639 640 640 static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 641 641 { 642 - u64 pmcr, val; 642 + u64 pmcr; 643 643 644 644 /* No PMU available, PMCR_EL0 may UNDEF... */ 645 645 if (!kvm_arm_support_pmu_v3()) 646 646 return; 647 647 648 - pmcr = read_sysreg(pmcr_el0); 649 - /* 650 - * Writable bits of PMCR_EL0 (ARMV8_PMU_PMCR_MASK) are reset to UNKNOWN 651 - * except PMCR.E resetting to zero. 652 - */ 653 - val = ((pmcr & ~ARMV8_PMU_PMCR_MASK) 654 - | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E); 648 + /* Only preserve PMCR_EL0.N, and reset the rest to 0 */ 649 + pmcr = read_sysreg(pmcr_el0) & ARMV8_PMU_PMCR_N_MASK; 655 650 if (!kvm_supports_32bit_el0()) 656 - val |= ARMV8_PMU_PMCR_LC; 657 - __vcpu_sys_reg(vcpu, r->reg) = val; 651 + pmcr |= ARMV8_PMU_PMCR_LC; 652 + 653 + __vcpu_sys_reg(vcpu, r->reg) = pmcr; 658 654 } 659 655 660 656 static bool check_pmu_access_disabled(struct kvm_vcpu *vcpu, u64 flags) ··· 693 697 return false; 694 698 695 699 if (p->is_write) { 696 - /* Only update writeable bits of PMCR */ 700 + /* 701 + * Only update writeable bits of PMCR (continuing into 702 + * kvm_pmu_handle_pmcr() as well) 703 + */ 697 704 val = __vcpu_sys_reg(vcpu, PMCR_EL0); 698 705 val &= ~ARMV8_PMU_PMCR_MASK; 699 706 val |= p->regval & ARMV8_PMU_PMCR_MASK; 700 707 if (!kvm_supports_32bit_el0()) 701 708 val |= ARMV8_PMU_PMCR_LC; 702 - __vcpu_sys_reg(vcpu, PMCR_EL0) = val; 703 709 kvm_pmu_handle_pmcr(vcpu, val); 704 710 kvm_vcpu_pmu_restore_guest(vcpu); 705 711 } else { ··· 1060 1062 return true; 1061 1063 } 1062 1064 1065 + static u8 vcpu_pmuver(const struct kvm_vcpu *vcpu) 1066 + { 1067 + if (kvm_vcpu_has_pmu(vcpu)) 1068 + return vcpu->kvm->arch.dfr0_pmuver.imp; 1069 + 1070 + return vcpu->kvm->arch.dfr0_pmuver.unimp; 1071 + } 1072 + 1073 + static u8 perfmon_to_pmuver(u8 perfmon) 1074 + { 1075 + switch (perfmon) { 1076 + case ID_DFR0_EL1_PerfMon_PMUv3: 1077 + return ID_AA64DFR0_EL1_PMUVer_IMP; 1078 + case 
ID_DFR0_EL1_PerfMon_IMPDEF: 1079 + return ID_AA64DFR0_EL1_PMUVer_IMP_DEF; 1080 + default: 1081 + /* Anything ARMv8.1+ and NI have the same value. For now. */ 1082 + return perfmon; 1083 + } 1084 + } 1085 + 1086 + static u8 pmuver_to_perfmon(u8 pmuver) 1087 + { 1088 + switch (pmuver) { 1089 + case ID_AA64DFR0_EL1_PMUVer_IMP: 1090 + return ID_DFR0_EL1_PerfMon_PMUv3; 1091 + case ID_AA64DFR0_EL1_PMUVer_IMP_DEF: 1092 + return ID_DFR0_EL1_PerfMon_IMPDEF; 1093 + default: 1094 + /* Anything ARMv8.1+ and NI have the same value. For now. */ 1095 + return pmuver; 1096 + } 1097 + } 1098 + 1063 1099 /* Read a sanitised cpufeature ID register by sys_reg_desc */ 1064 1100 static u64 read_id_reg(const struct kvm_vcpu *vcpu, struct sys_reg_desc const *r) 1065 1101 { ··· 1143 1111 /* Limit debug to ARMv8.0 */ 1144 1112 val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_DebugVer); 1145 1113 val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_DebugVer), 6); 1146 - /* Limit guests to PMUv3 for ARMv8.4 */ 1147 - val = cpuid_feature_cap_perfmon_field(val, 1148 - ID_AA64DFR0_EL1_PMUVer_SHIFT, 1149 - kvm_vcpu_has_pmu(vcpu) ? ID_AA64DFR0_EL1_PMUVer_V3P4 : 0); 1114 + /* Set PMUver to the required version */ 1115 + val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer); 1116 + val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), 1117 + vcpu_pmuver(vcpu)); 1150 1118 /* Hide SPE from guests */ 1151 1119 val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMSVer); 1152 1120 break; 1153 1121 case SYS_ID_DFR0_EL1: 1154 - /* Limit guests to PMUv3 for ARMv8.4 */ 1155 - val = cpuid_feature_cap_perfmon_field(val, 1156 - ID_DFR0_PERFMON_SHIFT, 1157 - kvm_vcpu_has_pmu(vcpu) ? 
ID_DFR0_PERFMON_8_4 : 0); 1122 + val &= ~ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon); 1123 + val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon), 1124 + pmuver_to_perfmon(vcpu_pmuver(vcpu))); 1158 1125 break; 1159 1126 } 1160 1127 ··· 1249 1218 1250 1219 vcpu->kvm->arch.pfr0_csv2 = csv2; 1251 1220 vcpu->kvm->arch.pfr0_csv3 = csv3; 1221 + 1222 + return 0; 1223 + } 1224 + 1225 + static int set_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, 1226 + const struct sys_reg_desc *rd, 1227 + u64 val) 1228 + { 1229 + u8 pmuver, host_pmuver; 1230 + bool valid_pmu; 1231 + 1232 + host_pmuver = kvm_arm_pmu_get_pmuver_limit(); 1233 + 1234 + /* 1235 + * Allow AA64DFR0_EL1.PMUver to be set from userspace as long 1236 + * as it doesn't promise more than what the HW gives us. We 1237 + * allow an IMPDEF PMU though, only if no PMU is supported 1238 + * (KVM backward compatibility handling). 1239 + */ 1240 + pmuver = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), val); 1241 + if ((pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF && pmuver > host_pmuver)) 1242 + return -EINVAL; 1243 + 1244 + valid_pmu = (pmuver != 0 && pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF); 1245 + 1246 + /* Make sure view register and PMU support do match */ 1247 + if (kvm_vcpu_has_pmu(vcpu) != valid_pmu) 1248 + return -EINVAL; 1249 + 1250 + /* We can only differ with PMUver, and anything else is an error */ 1251 + val ^= read_id_reg(vcpu, rd); 1252 + val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer); 1253 + if (val) 1254 + return -EINVAL; 1255 + 1256 + if (valid_pmu) 1257 + vcpu->kvm->arch.dfr0_pmuver.imp = pmuver; 1258 + else 1259 + vcpu->kvm->arch.dfr0_pmuver.unimp = pmuver; 1260 + 1261 + return 0; 1262 + } 1263 + 1264 + static int set_id_dfr0_el1(struct kvm_vcpu *vcpu, 1265 + const struct sys_reg_desc *rd, 1266 + u64 val) 1267 + { 1268 + u8 perfmon, host_perfmon; 1269 + bool valid_pmu; 1270 + 1271 + host_perfmon = pmuver_to_perfmon(kvm_arm_pmu_get_pmuver_limit()); 1272 + 1273 + /* 1274 + * Allow DFR0_EL1.PerfMon to be 
set from userspace as long as 1275 + * it doesn't promise more than what the HW gives us on the 1276 + * AArch64 side (as everything is emulated with that), and 1277 + * that this is a PMUv3. 1278 + */ 1279 + perfmon = FIELD_GET(ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon), val); 1280 + if ((perfmon != ID_DFR0_EL1_PerfMon_IMPDEF && perfmon > host_perfmon) || 1281 + (perfmon != 0 && perfmon < ID_DFR0_EL1_PerfMon_PMUv3)) 1282 + return -EINVAL; 1283 + 1284 + valid_pmu = (perfmon != 0 && perfmon != ID_DFR0_EL1_PerfMon_IMPDEF); 1285 + 1286 + /* Make sure view register and PMU support do match */ 1287 + if (kvm_vcpu_has_pmu(vcpu) != valid_pmu) 1288 + return -EINVAL; 1289 + 1290 + /* We can only differ with PerfMon, and anything else is an error */ 1291 + val ^= read_id_reg(vcpu, rd); 1292 + val &= ~ARM64_FEATURE_MASK(ID_DFR0_EL1_PerfMon); 1293 + if (val) 1294 + return -EINVAL; 1295 + 1296 + if (valid_pmu) 1297 + vcpu->kvm->arch.dfr0_pmuver.imp = perfmon_to_pmuver(perfmon); 1298 + else 1299 + vcpu->kvm->arch.dfr0_pmuver.unimp = perfmon_to_pmuver(perfmon); 1252 1300 1253 1301 return 0; 1254 1302 } ··· 1553 1443 /* CRm=1 */ 1554 1444 AA32_ID_SANITISED(ID_PFR0_EL1), 1555 1445 AA32_ID_SANITISED(ID_PFR1_EL1), 1556 - AA32_ID_SANITISED(ID_DFR0_EL1), 1446 + { SYS_DESC(SYS_ID_DFR0_EL1), .access = access_id_reg, 1447 + .get_user = get_id_reg, .set_user = set_id_dfr0_el1, 1448 + .visibility = aa32_id_visibility, }, 1557 1449 ID_HIDDEN(ID_AFR0_EL1), 1558 1450 AA32_ID_SANITISED(ID_MMFR0_EL1), 1559 1451 AA32_ID_SANITISED(ID_MMFR1_EL1), ··· 1595 1483 ID_UNALLOCATED(4,7), 1596 1484 1597 1485 /* CRm=5 */ 1598 - ID_SANITISED(ID_AA64DFR0_EL1), 1486 + { SYS_DESC(SYS_ID_AA64DFR0_EL1), .access = access_id_reg, 1487 + .get_user = get_id_reg, .set_user = set_id_aa64dfr0_el1, }, 1599 1488 ID_SANITISED(ID_AA64DFR1_EL1), 1600 1489 ID_UNALLOCATED(5,2), 1601 1490 ID_UNALLOCATED(5,3),
+20
arch/arm64/kvm/vgic/vgic-its.c
··· 2743 2743 static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its *its, u64 attr) 2744 2744 { 2745 2745 const struct vgic_its_abi *abi = vgic_its_get_abi(its); 2746 + struct vgic_dist *dist = &kvm->arch.vgic; 2746 2747 int ret = 0; 2747 2748 2748 2749 if (attr == KVM_DEV_ARM_VGIC_CTRL_INIT) /* Nothing to do */ ··· 2763 2762 vgic_its_reset(kvm, its); 2764 2763 break; 2765 2764 case KVM_DEV_ARM_ITS_SAVE_TABLES: 2765 + dist->save_its_tables_in_progress = true; 2766 2766 ret = abi->save_tables(its); 2767 + dist->save_its_tables_in_progress = false; 2767 2768 break; 2768 2769 case KVM_DEV_ARM_ITS_RESTORE_TABLES: 2769 2770 ret = abi->restore_tables(its); ··· 2776 2773 mutex_unlock(&its->its_lock); 2777 2774 mutex_unlock(&kvm->lock); 2778 2775 return ret; 2776 + } 2777 + 2778 + /* 2779 + * kvm_arch_allow_write_without_running_vcpu - allow writing guest memory 2780 + * without the running VCPU when dirty ring is enabled. 2781 + * 2782 + * The running VCPU is required to track dirty guest pages when dirty ring 2783 + * is enabled. Otherwise, the backup bitmap should be used to track the 2784 + * dirty guest pages. When vgic/its tables are being saved, the backup 2785 + * bitmap is used to track the dirty guest pages due to the missed running 2786 + * VCPU in the period. 2787 + */ 2788 + bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm) 2789 + { 2790 + struct vgic_dist *dist = &kvm->arch.vgic; 2791 + 2792 + return dist->save_its_tables_in_progress; 2779 2793 } 2780 2794 2781 2795 static int vgic_its_set_attr(struct kvm_device *dev,
+1 -1
arch/arm64/lib/mte.S
··· 18 18 */ 19 19 .macro multitag_transfer_size, reg, tmp 20 20 mrs_s \reg, SYS_GMID_EL1 21 - ubfx \reg, \reg, #GMID_EL1_BS_SHIFT, #GMID_EL1_BS_SIZE 21 + ubfx \reg, \reg, #GMID_EL1_BS_SHIFT, #GMID_EL1_BS_WIDTH 22 22 mov \tmp, #4 23 23 lsl \reg, \tmp, \reg 24 24 .endm
+5 -2
arch/arm64/mm/copypage.c
··· 21 21 22 22 copy_page(kto, kfrom); 23 23 24 - if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { 25 - set_bit(PG_mte_tagged, &to->flags); 24 + if (system_supports_mte() && page_mte_tagged(from)) { 25 + page_kasan_tag_reset(to); 26 + /* It's a new page, shouldn't have been tagged yet */ 27 + WARN_ON_ONCE(!try_page_mte_tagging(to)); 26 28 mte_copy_page_tags(kto, kfrom); 29 + set_page_mte_tagged(to); 27 30 } 28 31 } 29 32 EXPORT_SYMBOL(copy_highpage);
+3 -1
arch/arm64/mm/fault.c
··· 937 937 938 938 void tag_clear_highpage(struct page *page) 939 939 { 940 + /* Newly allocated page, shouldn't have been tagged yet */ 941 + WARN_ON_ONCE(!try_page_mte_tagging(page)); 940 942 mte_zero_clear_page_tags(page_address(page)); 941 - set_bit(PG_mte_tagged, &page->flags); 943 + set_page_mte_tagged(page); 942 944 }
+6 -10
arch/arm64/mm/mteswap.c
··· 24 24 { 25 25 void *tag_storage, *ret; 26 26 27 - if (!test_bit(PG_mte_tagged, &page->flags)) 27 + if (!page_mte_tagged(page)) 28 28 return 0; 29 29 30 30 tag_storage = mte_allocate_tag_storage(); ··· 46 46 return 0; 47 47 } 48 48 49 - bool mte_restore_tags(swp_entry_t entry, struct page *page) 49 + void mte_restore_tags(swp_entry_t entry, struct page *page) 50 50 { 51 51 void *tags = xa_load(&mte_pages, entry.val); 52 52 53 53 if (!tags) 54 - return false; 54 + return; 55 55 56 - /* 57 - * Test PG_mte_tagged again in case it was racing with another 58 - * set_pte_at(). 59 - */ 60 - if (!test_and_set_bit(PG_mte_tagged, &page->flags)) 56 + if (try_page_mte_tagging(page)) { 61 57 mte_restore_page_tags(page_address(page), tags); 62 - 63 - return true; 58 + set_page_mte_tagged(page); 59 + } 64 60 } 65 61 66 62 void mte_invalidate_tags(int type, pgoff_t offset)
+1 -1
arch/arm64/tools/gen-sysreg.awk
··· 33 33 # Print a CPP macro definition, padded with spaces so that the macro bodies 34 34 # line up in a column 35 35 function define(name, val) { 36 - printf "%-48s%s\n", "#define " name, val 36 + printf "%-56s%s\n", "#define " name, val 37 37 } 38 38 39 39 # Print standard BITMASK/SHIFT/WIDTH CPP definitions for a field
+754
arch/arm64/tools/sysreg
··· 46 46 # feature that introduces them (eg, FEAT_LS64_ACCDATA introduces enumeration 47 47 # item ACCDATA) though it may be more taseful to do something else. 48 48 49 + Sysreg ID_PFR0_EL1 3 0 0 1 0 50 + Res0 63:32 51 + Enum 31:28 RAS 52 + 0b0000 NI 53 + 0b0001 RAS 54 + 0b0010 RASv1p1 55 + EndEnum 56 + Enum 27:24 DIT 57 + 0b0000 NI 58 + 0b0001 IMP 59 + EndEnum 60 + Enum 23:20 AMU 61 + 0b0000 NI 62 + 0b0001 AMUv1 63 + 0b0010 AMUv1p1 64 + EndEnum 65 + Enum 19:16 CSV2 66 + 0b0000 UNDISCLOSED 67 + 0b0001 IMP 68 + 0b0010 CSV2p1 69 + EndEnum 70 + Enum 15:12 State3 71 + 0b0000 NI 72 + 0b0001 IMP 73 + EndEnum 74 + Enum 11:8 State2 75 + 0b0000 NI 76 + 0b0001 NO_CV 77 + 0b0010 CV 78 + EndEnum 79 + Enum 7:4 State1 80 + 0b0000 NI 81 + 0b0001 THUMB 82 + 0b0010 THUMB2 83 + EndEnum 84 + Enum 3:0 State0 85 + 0b0000 NI 86 + 0b0001 IMP 87 + EndEnum 88 + EndSysreg 89 + 90 + Sysreg ID_PFR1_EL1 3 0 0 1 1 91 + Res0 63:32 92 + Enum 31:28 GIC 93 + 0b0000 NI 94 + 0b0001 GICv3 95 + 0b0010 GICv4p1 96 + EndEnum 97 + Enum 27:24 Virt_frac 98 + 0b0000 NI 99 + 0b0001 IMP 100 + EndEnum 101 + Enum 23:20 Sec_frac 102 + 0b0000 NI 103 + 0b0001 WALK_DISABLE 104 + 0b0010 SECURE_MEMORY 105 + EndEnum 106 + Enum 19:16 GenTimer 107 + 0b0000 NI 108 + 0b0001 IMP 109 + 0b0010 ECV 110 + EndEnum 111 + Enum 15:12 Virtualization 112 + 0b0000 NI 113 + 0b0001 IMP 114 + EndEnum 115 + Enum 11:8 MProgMod 116 + 0b0000 NI 117 + 0b0001 IMP 118 + EndEnum 119 + Enum 7:4 Security 120 + 0b0000 NI 121 + 0b0001 EL3 122 + 0b0001 NSACR_RFR 123 + EndEnum 124 + Enum 3:0 ProgMod 125 + 0b0000 NI 126 + 0b0001 IMP 127 + EndEnum 128 + EndSysreg 129 + 130 + Sysreg ID_DFR0_EL1 3 0 0 1 2 131 + Res0 63:32 132 + Enum 31:28 TraceFilt 133 + 0b0000 NI 134 + 0b0001 IMP 135 + EndEnum 136 + Enum 27:24 PerfMon 137 + 0b0000 NI 138 + 0b0001 PMUv1 139 + 0b0010 PMUv2 140 + 0b0011 PMUv3 141 + 0b0100 PMUv3p1 142 + 0b0101 PMUv3p4 143 + 0b0110 PMUv3p5 144 + 0b0111 PMUv3p7 145 + 0b1000 PMUv3p8 146 + 0b1111 IMPDEF 147 + EndEnum 148 + Enum 23:20 MProfDbg 
149 + 0b0000 NI 150 + 0b0001 IMP 151 + EndEnum 152 + Enum 19:16 MMapTrc 153 + 0b0000 NI 154 + 0b0001 IMP 155 + EndEnum 156 + Enum 15:12 CopTrc 157 + 0b0000 NI 158 + 0b0001 IMP 159 + EndEnum 160 + Enum 11:8 MMapDbg 161 + 0b0000 NI 162 + 0b0100 Armv7 163 + 0b0101 Armv7p1 164 + EndEnum 165 + Field 7:4 CopSDbg 166 + Enum 3:0 CopDbg 167 + 0b0000 NI 168 + 0b0010 Armv6 169 + 0b0011 Armv6p1 170 + 0b0100 Armv7 171 + 0b0101 Armv7p1 172 + 0b0110 Armv8 173 + 0b0111 VHE 174 + 0b1000 Debugv8p2 175 + 0b1001 Debugv8p4 176 + 0b1010 Debugv8p8 177 + EndEnum 178 + EndSysreg 179 + 180 + Sysreg ID_AFR0_EL1 3 0 0 1 3 181 + Res0 63:16 182 + Field 15:12 IMPDEF3 183 + Field 11:8 IMPDEF2 184 + Field 7:4 IMPDEF1 185 + Field 3:0 IMPDEF0 186 + EndSysreg 187 + 188 + Sysreg ID_MMFR0_EL1 3 0 0 1 4 189 + Res0 63:32 190 + Enum 31:28 InnerShr 191 + 0b0000 NC 192 + 0b0001 HW 193 + 0b1111 IGNORED 194 + EndEnum 195 + Enum 27:24 FCSE 196 + 0b0000 NI 197 + 0b0001 IMP 198 + EndEnum 199 + Enum 23:20 AuxReg 200 + 0b0000 NI 201 + 0b0001 ACTLR 202 + 0b0010 AIFSR 203 + EndEnum 204 + Enum 19:16 TCM 205 + 0b0000 NI 206 + 0b0001 IMPDEF 207 + 0b0010 TCM 208 + 0b0011 TCM_DMA 209 + EndEnum 210 + Enum 15:12 ShareLvl 211 + 0b0000 ONE 212 + 0b0001 TWO 213 + EndEnum 214 + Enum 11:8 OuterShr 215 + 0b0000 NC 216 + 0b0001 HW 217 + 0b1111 IGNORED 218 + EndEnum 219 + Enum 7:4 PMSA 220 + 0b0000 NI 221 + 0b0001 IMPDEF 222 + 0b0010 PMSAv6 223 + 0b0011 PMSAv7 224 + EndEnum 225 + Enum 3:0 VMSA 226 + 0b0000 NI 227 + 0b0001 IMPDEF 228 + 0b0010 VMSAv6 229 + 0b0011 VMSAv7 230 + 0b0100 VMSAv7_PXN 231 + 0b0101 VMSAv7_LONG 232 + EndEnum 233 + EndSysreg 234 + 235 + Sysreg ID_MMFR1_EL1 3 0 0 1 5 236 + Res0 63:32 237 + Enum 31:28 BPred 238 + 0b0000 NI 239 + 0b0001 BP_SW_MANGED 240 + 0b0010 BP_ASID_AWARE 241 + 0b0011 BP_NOSNOOP 242 + 0b0100 BP_INVISIBLE 243 + EndEnum 244 + Enum 27:24 L1TstCln 245 + 0b0000 NI 246 + 0b0001 NOINVALIDATE 247 + 0b0010 INVALIDATE 248 + EndEnum 249 + Enum 23:20 L1Uni 250 + 0b0000 NI 251 + 0b0001 INVALIDATE 252 + 
0b0010 CLEAN_AND_INVALIDATE 253 + EndEnum 254 + Enum 19:16 L1Hvd 255 + 0b0000 NI 256 + 0b0001 INVALIDATE_ISIDE_ONLY 257 + 0b0010 INVALIDATE 258 + 0b0011 CLEAN_AND_INVALIDATE 259 + EndEnum 260 + Enum 15:12 L1UniSW 261 + 0b0000 NI 262 + 0b0001 CLEAN 263 + 0b0010 CLEAN_AND_INVALIDATE 264 + 0b0011 INVALIDATE 265 + EndEnum 266 + Enum 11:8 L1HvdSW 267 + 0b0000 NI 268 + 0b0001 CLEAN_AND_INVALIDATE 269 + 0b0010 INVALIDATE_DSIDE_ONLY 270 + 0b0011 INVALIDATE 271 + EndEnum 272 + Enum 7:4 L1UniVA 273 + 0b0000 NI 274 + 0b0001 CLEAN_AND_INVALIDATE 275 + 0b0010 INVALIDATE_BP 276 + EndEnum 277 + Enum 3:0 L1HvdVA 278 + 0b0000 NI 279 + 0b0001 CLEAN_AND_INVALIDATE 280 + 0b0010 INVALIDATE_BP 281 + EndEnum 282 + EndSysreg 283 + 284 + Sysreg ID_MMFR2_EL1 3 0 0 1 6 285 + Res0 63:32 286 + Enum 31:28 HWAccFlg 287 + 0b0000 NI 288 + 0b0001 IMP 289 + EndEnum 290 + Enum 27:24 WFIStall 291 + 0b0000 NI 292 + 0b0001 IMP 293 + EndEnum 294 + Enum 23:20 MemBarr 295 + 0b0000 NI 296 + 0b0001 DSB_ONLY 297 + 0b0010 IMP 298 + EndEnum 299 + Enum 19:16 UniTLB 300 + 0b0000 NI 301 + 0b0001 BY_VA 302 + 0b0010 BY_MATCH_ASID 303 + 0b0011 BY_ALL_ASID 304 + 0b0100 OTHER_TLBS 305 + 0b0101 BROADCAST 306 + 0b0110 BY_IPA 307 + EndEnum 308 + Enum 15:12 HvdTLB 309 + 0b0000 NI 310 + EndEnum 311 + Enum 11:8 L1HvdRng 312 + 0b0000 NI 313 + 0b0001 IMP 314 + EndEnum 315 + Enum 7:4 L1HvdBG 316 + 0b0000 NI 317 + 0b0001 IMP 318 + EndEnum 319 + Enum 3:0 L1HvdFG 320 + 0b0000 NI 321 + 0b0001 IMP 322 + EndEnum 323 + EndSysreg 324 + 325 + Sysreg ID_MMFR3_EL1 3 0 0 1 7 326 + Res0 63:32 327 + Enum 31:28 Supersec 328 + 0b0000 IMP 329 + 0b1111 NI 330 + EndEnum 331 + Enum 27:24 CMemSz 332 + 0b0000 4GB 333 + 0b0001 64GB 334 + 0b0010 1TB 335 + EndEnum 336 + Enum 23:20 CohWalk 337 + 0b0000 NI 338 + 0b0001 IMP 339 + EndEnum 340 + Enum 19:16 PAN 341 + 0b0000 NI 342 + 0b0001 PAN 343 + 0b0010 PAN2 344 + EndEnum 345 + Enum 15:12 MaintBcst 346 + 0b0000 NI 347 + 0b0001 NO_TLB 348 + 0b0010 ALL 349 + EndEnum 350 + Enum 11:8 BPMaint 351 + 0b0000 NI 
352 + 0b0001 ALL 353 + 0b0010 BY_VA 354 + EndEnum 355 + Enum 7:4 CMaintSW 356 + 0b0000 NI 357 + 0b0001 IMP 358 + EndEnum 359 + Enum 3:0 CMaintVA 360 + 0b0000 NI 361 + 0b0001 IMP 362 + EndEnum 363 + EndSysreg 364 + 365 + Sysreg ID_ISAR0_EL1 3 0 0 2 0 366 + Res0 63:28 367 + Enum 27:24 Divide 368 + 0b0000 NI 369 + 0b0001 xDIV_T32 370 + 0b0010 xDIV_A32 371 + EndEnum 372 + Enum 23:20 Debug 373 + 0b0000 NI 374 + 0b0001 IMP 375 + EndEnum 376 + Enum 19:16 Coproc 377 + 0b0000 NI 378 + 0b0001 MRC 379 + 0b0010 MRC2 380 + 0b0011 MRRC 381 + 0b0100 MRRC2 382 + EndEnum 383 + Enum 15:12 CmpBranch 384 + 0b0000 NI 385 + 0b0001 IMP 386 + EndEnum 387 + Enum 11:8 BitField 388 + 0b0000 NI 389 + 0b0001 IMP 390 + EndEnum 391 + Enum 7:4 BitCount 392 + 0b0000 NI 393 + 0b0001 IMP 394 + EndEnum 395 + Enum 3:0 Swap 396 + 0b0000 NI 397 + 0b0001 IMP 398 + EndEnum 399 + EndSysreg 400 + 401 + Sysreg ID_ISAR1_EL1 3 0 0 2 1 402 + Res0 63:32 403 + Enum 31:28 Jazelle 404 + 0b0000 NI 405 + 0b0001 IMP 406 + EndEnum 407 + Enum 27:24 Interwork 408 + 0b0000 NI 409 + 0b0001 BX 410 + 0b0010 BLX 411 + 0b0011 A32_BX 412 + EndEnum 413 + Enum 23:20 Immediate 414 + 0b0000 NI 415 + 0b0001 IMP 416 + EndEnum 417 + Enum 19:16 IfThen 418 + 0b0000 NI 419 + 0b0001 IMP 420 + EndEnum 421 + Enum 15:12 Extend 422 + 0b0000 NI 423 + 0b0001 SXTB 424 + 0b0010 SXTB16 425 + EndEnum 426 + Enum 11:8 Except_AR 427 + 0b0000 NI 428 + 0b0001 IMP 429 + EndEnum 430 + Enum 7:4 Except 431 + 0b0000 NI 432 + 0b0001 IMP 433 + EndEnum 434 + Enum 3:0 Endian 435 + 0b0000 NI 436 + 0b0001 IMP 437 + EndEnum 438 + EndSysreg 439 + 440 + Sysreg ID_ISAR2_EL1 3 0 0 2 2 441 + Res0 63:32 442 + Enum 31:28 Reversal 443 + 0b0000 NI 444 + 0b0001 REV 445 + 0b0010 RBIT 446 + EndEnum 447 + Enum 27:24 PSR_AR 448 + 0b0000 NI 449 + 0b0001 IMP 450 + EndEnum 451 + Enum 23:20 MultU 452 + 0b0000 NI 453 + 0b0001 UMULL 454 + 0b0010 UMAAL 455 + EndEnum 456 + Enum 19:16 MultS 457 + 0b0000 NI 458 + 0b0001 SMULL 459 + 0b0010 SMLABB 460 + 0b0011 SMLAD 461 + EndEnum 462 + Enum 
+Enum 15:12 Mult
+0b0000 NI
+0b0001 MLA
+0b0010 MLS
+EndEnum
+Enum 11:8 MultiAccessInt
+0b0000 NI
+0b0001 RESTARTABLE
+0b0010 CONTINUABLE
+EndEnum
+Enum 7:4 MemHint
+0b0000 NI
+0b0001 PLD
+0b0010 PLD2
+0b0011 PLI
+0b0100 PLDW
+EndEnum
+Enum 3:0 LoadStore
+0b0000 NI
+0b0001 DOUBLE
+0b0010 ACQUIRE
+EndEnum
+EndSysreg
+
+Sysreg ID_ISAR3_EL1 3 0 0 2 3
+Res0 63:32
+Enum 31:28 T32EE
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 27:24 TrueNOP
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 23:20 T32Copy
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 19:16 TabBranch
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 15:12 SynchPrim
+0b0000 NI
+0b0001 EXCLUSIVE
+0b0010 DOUBLE
+EndEnum
+Enum 11:8 SVC
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 7:4 SIMD
+0b0000 NI
+0b0001 SSAT
+0b0011 PKHBT
+EndEnum
+Enum 3:0 Saturate
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
+Sysreg ID_ISAR4_EL1 3 0 0 2 4
+Res0 63:32
+Enum 31:28 SWP_frac
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 27:24 PSR_M
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 23:20 SynchPrim_frac
+0b0000 NI
+0b0011 IMP
+EndEnum
+Enum 19:16 Barrier
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 15:12 SMC
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 11:8 Writeback
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 7:4 WithShifts
+0b0000 NI
+0b0001 LSL3
+0b0011 LS
+0b0100 REG
+EndEnum
+Enum 3:0 Unpriv
+0b0000 NI
+0b0001 REG_BYTE
+0b0010 SIGNED_HALFWORD
+EndEnum
+EndSysreg
+
+Sysreg ID_ISAR5_EL1 3 0 0 2 5
+Res0 63:32
+Enum 31:28 VCMA
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 27:24 RDM
+0b0000 NI
+0b0001 IMP
+EndEnum
+Res0 23:20
+Enum 19:16 CRC32
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 15:12 SHA2
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 11:8 SHA1
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 7:4 AES
+0b0000 NI
+0b0001 IMP
+0b0010 VMULL
+EndEnum
+Enum 3:0 SEVL
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
+Sysreg ID_ISAR6_EL1 3 0 0 2 7
+Res0 63:28
+Enum 27:24 I8MM
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 23:20 BF16
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 19:16 SPECRES
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 15:12 SB
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 11:8 FHM
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 7:4 DP
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 3:0 JSCVT
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
+Sysreg ID_MMFR4_EL1 3 0 0 2 6
+Res0 63:32
+Enum 31:28 EVT
+0b0000 NI
+0b0001 NO_TLBIS
+0b0010 TLBIS
+EndEnum
+Enum 27:24 CCIDX
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 23:20 LSM
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 19:16 HPDS
+0b0000 NI
+0b0001 AA32HPD
+0b0010 HPDS2
+EndEnum
+Enum 15:12 CnP
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 11:8 XNX
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 7:4 AC2
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 3:0 SpecSEI
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
+Sysreg MVFR0_EL1 3 0 0 3 0
+Res0 63:32
+Enum 31:28 FPRound
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 27:24 FPShVec
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 23:20 FPSqrt
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 19:16 FPDivide
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 15:12 FPTrap
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 11:8 FPDP
+0b0000 NI
+0b0001 VFPv2
+0b0010 VFPv3
+EndEnum
+Enum 7:4 FPSP
+0b0000 NI
+0b0001 VFPv2
+0b0010 VFPv3
+EndEnum
+Enum 3:0 SIMDReg
+0b0000 NI
+0b0001 IMP_16x64
+0b0010 IMP_32x64
+EndEnum
+EndSysreg
+
+Sysreg MVFR1_EL1 3 0 0 3 1
+Res0 63:32
+Enum 31:28 SIMDFMAC
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 27:24 FPHP
+0b0000 NI
+0b0001 FPHP
+0b0010 FPHP_CONV
+0b0011 FP16
+EndEnum
+Enum 23:20 SIMDHP
+0b0000 NI
+0b0001 SIMDHP
+0b0010 SIMDHP_FLOAT
+EndEnum
+Enum 19:16 SIMDSP
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 15:12 SIMDInt
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 11:8 SIMDLS
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 7:4 FPDNaN
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 3:0 FPFtZ
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
+Sysreg MVFR2_EL1 3 0 0 3 2
+Res0 63:8
+Enum 7:4 FPMisc
+0b0000 NI
+0b0001 FP
+0b0010 FP_DIRECTED_ROUNDING
+0b0011 FP_ROUNDING
+0b0100 FP_MAX_MIN
+EndEnum
+Enum 3:0 SIMDMisc
+0b0000 NI
+0b0001 SIMD_DIRECTED_ROUNDING
+0b0010 SIMD_ROUNDING
+0b0011 SIMD_MAX_MIN
+EndEnum
+EndSysreg
+
+Sysreg ID_PFR2_EL1 3 0 0 3 4
+Res0 63:12
+Enum 11:8 RAS_frac
+0b0000 NI
+0b0001 RASv1p1
+EndEnum
+Enum 7:4 SSBS
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 3:0 CSV3
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
+Sysreg ID_DFR1_EL1 3 0 0 3 5
+Res0 63:8
+Enum 7:4 HPMN0
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 3:0 MTPMU
+0b0000 IMPDEF
+0b0001 IMP
+0b1111 NI
+EndEnum
+EndSysreg
+
+Sysreg ID_MMFR5_EL1 3 0 0 3 6
+Res0 63:8
+Enum 7:4 nTLBPA
+0b0000 NI
+0b0001 IMP
+EndEnum
+Enum 3:0 ETS
+0b0000 NI
+0b0001 IMP
+EndEnum
+EndSysreg
+
 Sysreg ID_AA64PFR0_EL1 3 0 0 4 0
 Enum 63:60 CSV3
 0b0000 NI
-2
arch/x86/include/asm/kvm_host.h
···
 #endif
 }

-int kvm_cpu_dirty_log_size(void);
-
 int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);

 #define KVM_CLOCK_VALID_FLAGS \
+6 -9
arch/x86/kvm/x86.c
···
	bool req_immediate_exit = false;

-	/* Forbid vmenter if vcpu dirty ring is soft-full */
-	if (unlikely(vcpu->kvm->dirty_ring_size &&
-		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
-		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
-		trace_kvm_dirty_ring_exit(vcpu);
-		r = 0;
-		goto out;
-	}
-
	if (kvm_request_pending(vcpu)) {
		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
			r = -EIO;
			goto out;
		}
+
+		if (kvm_dirty_ring_check_request(vcpu)) {
+			r = 0;
+			goto out;
+		}
+
		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
				r = 0;
+2 -1
fs/proc/page.c
···
	u |= kpf_copy_bit(k, KPF_PRIVATE_2,	PG_private_2);
	u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE,	PG_owner_priv_1);
	u |= kpf_copy_bit(k, KPF_ARCH,		PG_arch_1);
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
+	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
 #endif

	return u;
+13 -2
include/kvm/arm_pmu.h
···
 #include <asm/perf_event.h>

 #define ARMV8_PMU_CYCLE_IDX		(ARMV8_PMU_MAX_COUNTERS - 1)
-#define ARMV8_PMU_MAX_COUNTER_PAIRS	((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)

 #ifdef CONFIG_HW_PERF_EVENTS
···
	struct irq_work overflow_work;
	struct kvm_pmu_events events;
	struct kvm_pmc pmc[ARMV8_PMU_MAX_COUNTERS];
-	DECLARE_BITMAP(chained, ARMV8_PMU_MAX_COUNTER_PAIRS);
	int irq_num;
	bool created;
	bool irq_level;
···
	if (!has_vhe() && kvm_vcpu_has_pmu(vcpu))			\
		vcpu->arch.pmu.events = *kvm_get_pmu_events();		\
 } while (0)
+
+/*
+ * Evaluates as true when emulating PMUv3p5, and false otherwise.
+ */
+#define kvm_pmu_is_3p5(vcpu)						\
+	(vcpu->kvm->arch.dfr0_pmuver.imp >= ID_AA64DFR0_EL1_PMUVer_V3P5)
+
+u8 kvm_arm_pmu_get_pmuver_limit(void);

 #else
 struct kvm_pmu {
···
 }

 #define kvm_vcpu_has_pmu(vcpu)		({ false; })
+#define kvm_pmu_is_3p5(vcpu)		({ false; })
 static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {}
 static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {}
 static inline void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) {}
+static inline u8 kvm_arm_pmu_get_pmuver_limit(void)
+{
+	return 0;
+}

 #endif
+1
include/kvm/arm_vgic.h
···
	struct vgic_io_device	dist_iodev;

	bool			has_its;
+	bool			save_its_tables_in_progress;

	/*
	 * Contains the attributes and gpa of the LPI configuration table.
+1
include/linux/kernel-page-flags.h
···
 #define KPF_UNCACHED		39
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41
+#define KPF_ARCH_3		42

 #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
+12 -8
include/linux/kvm_dirty_ring.h
···
	return 0;
 }

+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return true;
+}
+
 static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
				       int index, u32 size)
 {
···
	return 0;
 }

-static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
				       u32 slot, u64 offset)
 {
 }
···
 {
 }

-static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
-{
-	return true;
-}
-
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */

+int kvm_cpu_dirty_log_size(void);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
···
 * returns =0: successfully pushed
 *         <0: unable to push, need to wait
 */
-void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset);
+
+bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu);

 /* for use in vm_operations_struct */
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);

 void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
-bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);

 #endif /* CONFIG_HAVE_KVM_DIRTY_RING */
+6 -4
include/linux/kvm_host.h
···
 * Architecture-independent vcpu->requests bit members
 * Bits 3-7 are reserved for more arch-independent bits.
 */
-#define KVM_REQ_TLB_FLUSH	(0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_VM_DEAD		(1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_UNBLOCK		2
-#define KVM_REQUEST_ARCH_BASE	8
+#define KVM_REQ_TLB_FLUSH		(0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_VM_DEAD			(1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_UNBLOCK			2
+#define KVM_REQ_DIRTY_RING_SOFT_FULL	3
+#define KVM_REQUEST_ARCH_BASE		8

 /*
 * KVM_REQ_OUTSIDE_GUEST_MODE exists is purely as way to force the vCPU to
···
	pid_t userspace_pid;
	unsigned int max_halt_poll_ns;
	u32 dirty_ring_size;
+	bool dirty_ring_with_bitmap;
	bool vm_bugged;
	bool vm_dead;
+2 -1
include/linux/page-flags.h
···
	PG_young,
	PG_idle,
 #endif
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
	PG_arch_2,
+	PG_arch_3,
 #endif
 #ifdef CONFIG_KASAN_HW_TAGS
	PG_skip_kasan_poison,
+5 -4
include/trace/events/mmflags.h
···
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif

-#ifdef CONFIG_64BIT
-#define IF_HAVE_PG_ARCH_2(flag,string) ,{1UL << flag, string}
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
+#define IF_HAVE_PG_ARCH_X(flag,string) ,{1UL << flag, string}
 #else
-#define IF_HAVE_PG_ARCH_2(flag,string)
+#define IF_HAVE_PG_ARCH_X(flag,string)
 #endif

 #ifdef CONFIG_KASAN_HW_TAGS
···
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
 IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
-IF_HAVE_PG_ARCH_2(PG_arch_2,		"arch_2"	)		\
+IF_HAVE_PG_ARCH_X(PG_arch_2,		"arch_2"	)		\
+IF_HAVE_PG_ARCH_X(PG_arch_3,		"arch_3"	)		\
 IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")

 #define show_page_flags(flags)						\
+1
include/uapi/linux/kvm.h
···
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225

 #ifdef KVM_CAP_IRQ_ROUTING
+8
mm/Kconfig
···
 config ARCH_HAS_PKEYS
	bool

+config ARCH_USES_PG_ARCH_X
+	bool
+	help
+	  Enable the definition of PG_arch_x page flags with x > 1. Only
+	  suitable for 64-bit architectures with CONFIG_FLATMEM or
+	  CONFIG_SPARSEMEM_VMEMMAP enabled, otherwise there may not be
+	  enough room for additional bits in page->flags.
+
 config VM_EVENT_COUNTERS
	default y
	bool "Enable VM event counters for /proc/vmstat" if EXPERT
+2 -1
mm/huge_memory.c
···
			 (1L << PG_workingset) |
			 (1L << PG_locked) |
			 (1L << PG_unevictable) |
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
			 (1L << PG_arch_2) |
+			 (1L << PG_arch_3) |
 #endif
			 (1L << PG_dirty) |
			 LRU_GEN_MASK | LRU_REFS_MASK));
+176
tools/include/linux/bitfield.h
···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2014 Felix Fietkau <nbd@nbd.name>
+ * Copyright (C) 2004 - 2009 Ivo van Doorn <IvDoorn@gmail.com>
+ */
+
+#ifndef _LINUX_BITFIELD_H
+#define _LINUX_BITFIELD_H
+
+#include <linux/build_bug.h>
+#include <asm/byteorder.h>
+
+/*
+ * Bitfield access macros
+ *
+ * FIELD_{GET,PREP} macros take as first parameter shifted mask
+ * from which they extract the base mask and shift amount.
+ * Mask must be a compilation time constant.
+ *
+ * Example:
+ *
+ *  #define REG_FIELD_A  GENMASK(6, 0)
+ *  #define REG_FIELD_B  BIT(7)
+ *  #define REG_FIELD_C  GENMASK(15, 8)
+ *  #define REG_FIELD_D  GENMASK(31, 16)
+ *
+ * Get:
+ *  a = FIELD_GET(REG_FIELD_A, reg);
+ *  b = FIELD_GET(REG_FIELD_B, reg);
+ *
+ * Set:
+ *  reg = FIELD_PREP(REG_FIELD_A, 1) |
+ *	  FIELD_PREP(REG_FIELD_B, 0) |
+ *	  FIELD_PREP(REG_FIELD_C, c) |
+ *	  FIELD_PREP(REG_FIELD_D, 0x40);
+ *
+ * Modify:
+ *  reg &= ~REG_FIELD_C;
+ *  reg |= FIELD_PREP(REG_FIELD_C, c);
+ */
+
+#define __bf_shf(x) (__builtin_ffsll(x) - 1)
+
+#define __scalar_type_to_unsigned_cases(type)				\
+		unsigned type:	(unsigned type)0,			\
+		signed type:	(unsigned type)0
+
+#define __unsigned_scalar_typeof(x) typeof(				\
+		_Generic((x),						\
+			char:	(unsigned char)0,			\
+			__scalar_type_to_unsigned_cases(char),		\
+			__scalar_type_to_unsigned_cases(short),		\
+			__scalar_type_to_unsigned_cases(int),		\
+			__scalar_type_to_unsigned_cases(long),		\
+			__scalar_type_to_unsigned_cases(long long),	\
+			default: (x)))
+
+#define __bf_cast_unsigned(type, x)	((__unsigned_scalar_typeof(type))(x))
+
+#define __BF_FIELD_CHECK(_mask, _reg, _val, _pfx)			\
+	({								\
+		BUILD_BUG_ON_MSG(!__builtin_constant_p(_mask),		\
+				 _pfx "mask is not constant");		\
+		BUILD_BUG_ON_MSG((_mask) == 0, _pfx "mask is zero");	\
+		BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ?		\
+				 ~((_mask) >> __bf_shf(_mask)) & (_val) : 0, \
+				 _pfx "value too large for the field"); \
+		BUILD_BUG_ON_MSG(__bf_cast_unsigned(_mask, _mask) >	\
+				 __bf_cast_unsigned(_reg, ~0ull),	\
+				 _pfx "type of reg too small for mask"); \
+		__BUILD_BUG_ON_NOT_POWER_OF_2((_mask) +			\
+					      (1ULL << __bf_shf(_mask))); \
+	})
+
+/**
+ * FIELD_MAX() - produce the maximum value representable by a field
+ * @_mask: shifted mask defining the field's length and position
+ *
+ * FIELD_MAX() returns the maximum value that can be held in the field
+ * specified by @_mask.
+ */
+#define FIELD_MAX(_mask)						\
+	({								\
+		__BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_MAX: ");	\
+		(typeof(_mask))((_mask) >> __bf_shf(_mask));		\
+	})
+
+/**
+ * FIELD_FIT() - check if value fits in the field
+ * @_mask: shifted mask defining the field's length and position
+ * @_val:  value to test against the field
+ *
+ * Return: true if @_val can fit inside @_mask, false if @_val is too big.
+ */
+#define FIELD_FIT(_mask, _val)						\
+	({								\
+		__BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: ");	\
+		!((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \
+	})
+
+/**
+ * FIELD_PREP() - prepare a bitfield element
+ * @_mask: shifted mask defining the field's length and position
+ * @_val:  value to put in the field
+ *
+ * FIELD_PREP() masks and shifts up the value.  The result should
+ * be combined with other fields of the bitfield using logical OR.
+ */
+#define FIELD_PREP(_mask, _val)						\
+	({								\
+		__BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_PREP: ");	\
+		((typeof(_mask))(_val) << __bf_shf(_mask)) & (_mask);	\
+	})
+
+/**
+ * FIELD_GET() - extract a bitfield element
+ * @_mask: shifted mask defining the field's length and position
+ * @_reg:  value of entire bitfield
+ *
+ * FIELD_GET() extracts the field specified by @_mask from the
+ * bitfield passed in as @_reg by masking and shifting it down.
+ */
+#define FIELD_GET(_mask, _reg)						\
+	({								\
+		__BF_FIELD_CHECK(_mask, _reg, 0U, "FIELD_GET: ");	\
+		(typeof(_mask))(((_reg) & (_mask)) >> __bf_shf(_mask));	\
+	})
+
+extern void __compiletime_error("value doesn't fit into mask")
+__field_overflow(void);
+extern void __compiletime_error("bad bitfield mask")
+__bad_mask(void);
+static __always_inline u64 field_multiplier(u64 field)
+{
+	if ((field | (field - 1)) & ((field | (field - 1)) + 1))
+		__bad_mask();
+	return field & -field;
+}
+static __always_inline u64 field_mask(u64 field)
+{
+	return field / field_multiplier(field);
+}
+#define field_max(field)	((typeof(field))field_mask(field))
+#define ____MAKE_OP(type,base,to,from)					\
+static __always_inline __##type type##_encode_bits(base v, base field)	\
+{									\
+	if (__builtin_constant_p(v) && (v & ~field_mask(field)))	\
+		__field_overflow();					\
+	return to((v & field_mask(field)) * field_multiplier(field));	\
+}									\
+static __always_inline __##type type##_replace_bits(__##type old,	\
+					base val, base field)		\
+{									\
+	return (old & ~to(field)) | type##_encode_bits(val, field);	\
+}									\
+static __always_inline void type##p_replace_bits(__##type *p,		\
+					base val, base field)		\
+{									\
+	*p = (*p & ~to(field)) | type##_encode_bits(val, field);	\
+}									\
+static __always_inline base type##_get_bits(__##type v, base field)	\
+{									\
+	return (from(v) & field)/field_multiplier(field);		\
+}
+#define __MAKE_OP(size)							\
+	____MAKE_OP(le##size,u##size,cpu_to_le##size,le##size##_to_cpu)	\
+	____MAKE_OP(be##size,u##size,cpu_to_be##size,be##size##_to_cpu)	\
+	____MAKE_OP(u##size,u##size,,)
+____MAKE_OP(u8,u8,,)
+__MAKE_OP(16)
+__MAKE_OP(32)
+__MAKE_OP(64)
+#undef __MAKE_OP
+#undef ____MAKE_OP
+
+#endif
+1
tools/testing/selftests/kvm/.gitignore
···
 /aarch64/debug-exceptions
 /aarch64/get-reg-list
 /aarch64/hypercalls
+/aarch64/page_fault_test
 /aarch64/psci_test
 /aarch64/vcpu_width_config
 /aarch64/vgic_init
+3
tools/testing/selftests/kvm/Makefile
···
 LIBKVM += lib/sparsebit.c
 LIBKVM += lib/test_util.c
 LIBKVM += lib/ucall_common.c
+LIBKVM += lib/userfaultfd_util.c

 LIBKVM_STRING += lib/string_override.c
···
 TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
 TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
 TEST_GEN_PROGS_aarch64 += aarch64/hypercalls
+TEST_GEN_PROGS_aarch64 += aarch64/page_fault_test
 TEST_GEN_PROGS_aarch64 += aarch64/psci_test
 TEST_GEN_PROGS_aarch64 += aarch64/vcpu_width_config
 TEST_GEN_PROGS_aarch64 += aarch64/vgic_init
 TEST_GEN_PROGS_aarch64 += aarch64/vgic_irq
+TEST_GEN_PROGS_aarch64 += access_tracking_perf_test
 TEST_GEN_PROGS_aarch64 += demand_paging_test
 TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+2 -1
tools/testing/selftests/kvm/aarch64/aarch32_id_regs.c
···
 #include "kvm_util.h"
 #include "processor.h"
 #include "test_util.h"
+#include <linux/bitfield.h>

 #define BAD_ID_REG_VAL 0x1badc0deul
···

	vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1), &val);

-	el0 = (val & ARM64_FEATURE_MASK(ID_AA64PFR0_EL0)) >> ID_AA64PFR0_EL0_SHIFT;
+	el0 = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL0), val);
	return el0 == ID_AA64PFR0_ELx_64BIT_ONLY;
 }
+238 -71
tools/testing/selftests/kvm/aarch64/debug-exceptions.c
···
 #include <test_util.h>
 #include <kvm_util.h>
 #include <processor.h>
+#include <linux/bitfield.h>

 #define MDSCR_KDE	(1 << 13)
 #define MDSCR_MDE	(1 << 15)
···
 #define DBGBCR_EXEC	(0x0 << 3)
 #define DBGBCR_EL1	(0x1 << 1)
 #define DBGBCR_E	(0x1 << 0)
+#define DBGBCR_LBN_SHIFT	16
+#define DBGBCR_BT_SHIFT		20
+#define DBGBCR_BT_ADDR_LINK_CTX	(0x1 << DBGBCR_BT_SHIFT)
+#define DBGBCR_BT_CTX_LINK	(0x3 << DBGBCR_BT_SHIFT)

 #define DBGWCR_LEN8	(0xff << 5)
 #define DBGWCR_RD	(0x1 << 3)
 #define DBGWCR_WR	(0x2 << 3)
 #define DBGWCR_EL1	(0x1 << 1)
 #define DBGWCR_E	(0x1 << 0)
+#define DBGWCR_LBN_SHIFT	16
+#define DBGWCR_WT_SHIFT		20
+#define DBGWCR_WT_LINK		(0x1 << DBGWCR_WT_SHIFT)

 #define SPSR_D		(1 << 9)
 #define SPSR_SS		(1 << 21)

-extern unsigned char sw_bp, sw_bp2, hw_bp, hw_bp2, bp_svc, bp_brk, hw_wp, ss_start;
+extern unsigned char sw_bp, sw_bp2, hw_bp, hw_bp2, bp_svc, bp_brk, hw_wp, ss_start, hw_bp_ctx;
 extern unsigned char iter_ss_begin, iter_ss_end;
 static volatile uint64_t sw_bp_addr, hw_bp_addr;
 static volatile uint64_t wp_addr, wp_data_addr;
···
 static volatile uint64_t ss_addr[4], ss_idx;
 #define  PC(v)  ((uint64_t)&(v))

+#define GEN_DEBUG_WRITE_REG(reg_name)			\
+static void write_##reg_name(int num, uint64_t val)	\
+{							\
+	switch (num) {					\
+	case 0:						\
+		write_sysreg(val, reg_name##0_el1);	\
+		break;					\
+	case 1:						\
+		write_sysreg(val, reg_name##1_el1);	\
+		break;					\
+	case 2:						\
+		write_sysreg(val, reg_name##2_el1);	\
+		break;					\
+	case 3:						\
+		write_sysreg(val, reg_name##3_el1);	\
+		break;					\
+	case 4:						\
+		write_sysreg(val, reg_name##4_el1);	\
+		break;					\
+	case 5:						\
+		write_sysreg(val, reg_name##5_el1);	\
+		break;					\
+	case 6:						\
+		write_sysreg(val, reg_name##6_el1);	\
+		break;					\
+	case 7:						\
+		write_sysreg(val, reg_name##7_el1);	\
+		break;					\
+	case 8:						\
+		write_sysreg(val, reg_name##8_el1);	\
+		break;					\
+	case 9:						\
+		write_sysreg(val, reg_name##9_el1);	\
+		break;					\
+	case 10:					\
+		write_sysreg(val, reg_name##10_el1);	\
+		break;					\
+	case 11:					\
+		write_sysreg(val, reg_name##11_el1);	\
+		break;					\
+	case 12:					\
+		write_sysreg(val, reg_name##12_el1);	\
+		break;					\
+	case 13:					\
+		write_sysreg(val, reg_name##13_el1);	\
+		break;					\
+	case 14:					\
+		write_sysreg(val, reg_name##14_el1);	\
+		break;					\
+	case 15:					\
+		write_sysreg(val, reg_name##15_el1);	\
+		break;					\
+	default:					\
+		GUEST_ASSERT(0);			\
+	}						\
+}

+/* Define write_dbgbcr()/write_dbgbvr()/write_dbgwcr()/write_dbgwvr() */
+GEN_DEBUG_WRITE_REG(dbgbcr)
+GEN_DEBUG_WRITE_REG(dbgbvr)
+GEN_DEBUG_WRITE_REG(dbgwcr)
+GEN_DEBUG_WRITE_REG(dbgwvr)
+
 static void reset_debug_state(void)
 {
+	uint8_t brps, wrps, i;
+	uint64_t dfr0;
+
	asm volatile("msr daifset, #8");

	write_sysreg(0, osdlr_el1);
···
	isb();

	write_sysreg(0, mdscr_el1);
-	/* This test only uses the first bp and wp slot. */
-	write_sysreg(0, dbgbvr0_el1);
-	write_sysreg(0, dbgbcr0_el1);
-	write_sysreg(0, dbgwcr0_el1);
-	write_sysreg(0, dbgwvr0_el1);
+	write_sysreg(0, contextidr_el1);
+
+	/* Reset all bcr/bvr/wcr/wvr registers */
+	dfr0 = read_sysreg(id_aa64dfr0_el1);
+	brps = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_BRPS), dfr0);
+	for (i = 0; i <= brps; i++) {
+		write_dbgbcr(i, 0);
+		write_dbgbvr(i, 0);
+	}
+	wrps = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_WRPS), dfr0);
+	for (i = 0; i <= wrps; i++) {
+		write_dbgwcr(i, 0);
+		write_dbgwvr(i, 0);
+	}
+
	isb();
 }
···
	GUEST_ASSERT(read_sysreg(oslsr_el1) & 2);
 }

-static void install_wp(uint64_t addr)
+static void enable_monitor_debug_exceptions(void)
 {
-	uint32_t wcr;
	uint32_t mdscr;
-
-	wcr = DBGWCR_LEN8 | DBGWCR_RD | DBGWCR_WR | DBGWCR_EL1 | DBGWCR_E;
-	write_sysreg(wcr, dbgwcr0_el1);
-	write_sysreg(addr, dbgwvr0_el1);
-	isb();

	asm volatile("msr daifclr, #8");
···
	isb();
 }

-static void install_hw_bp(uint64_t addr)
+static void install_wp(uint8_t wpn, uint64_t addr)
+{
+	uint32_t wcr;
+
+	wcr = DBGWCR_LEN8 | DBGWCR_RD | DBGWCR_WR | DBGWCR_EL1 | DBGWCR_E;
+	write_dbgwcr(wpn, wcr);
+	write_dbgwvr(wpn, addr);
+
+	isb();
+
+	enable_monitor_debug_exceptions();
+}
+
+static void install_hw_bp(uint8_t bpn, uint64_t addr)
 {
	uint32_t bcr;
-	uint32_t mdscr;

	bcr = DBGBCR_LEN8 | DBGBCR_EXEC | DBGBCR_EL1 | DBGBCR_E;
-	write_sysreg(bcr, dbgbcr0_el1);
-	write_sysreg(addr, dbgbvr0_el1);
+	write_dbgbcr(bpn, bcr);
+	write_dbgbvr(bpn, addr);
	isb();

-	asm volatile("msr daifclr, #8");
+	enable_monitor_debug_exceptions();
+}

-	mdscr = read_sysreg(mdscr_el1) | MDSCR_KDE | MDSCR_MDE;
-	write_sysreg(mdscr, mdscr_el1);
+static void install_wp_ctx(uint8_t addr_wp, uint8_t ctx_bp, uint64_t addr,
+			   uint64_t ctx)
+{
+	uint32_t wcr;
+	uint64_t ctx_bcr;
+
+	/* Setup a context-aware breakpoint for Linked Context ID Match */
+	ctx_bcr = DBGBCR_LEN8 | DBGBCR_EXEC | DBGBCR_EL1 | DBGBCR_E |
+		  DBGBCR_BT_CTX_LINK;
+	write_dbgbcr(ctx_bp, ctx_bcr);
+	write_dbgbvr(ctx_bp, ctx);
+
+	/* Setup a linked watchpoint (linked to the context-aware breakpoint) */
+	wcr = DBGWCR_LEN8 | DBGWCR_RD | DBGWCR_WR | DBGWCR_EL1 | DBGWCR_E |
+	      DBGWCR_WT_LINK | ((uint32_t)ctx_bp << DBGWCR_LBN_SHIFT);
+	write_dbgwcr(addr_wp, wcr);
+	write_dbgwvr(addr_wp, addr);
	isb();
+
+	enable_monitor_debug_exceptions();
+}
+
+void install_hw_bp_ctx(uint8_t addr_bp, uint8_t ctx_bp, uint64_t addr,
+		       uint64_t ctx)
+{
+	uint32_t addr_bcr, ctx_bcr;
+
+	/* Setup a context-aware breakpoint for Linked Context ID Match */
+	ctx_bcr = DBGBCR_LEN8 | DBGBCR_EXEC | DBGBCR_EL1 | DBGBCR_E |
+		  DBGBCR_BT_CTX_LINK;
+	write_dbgbcr(ctx_bp, ctx_bcr);
+	write_dbgbvr(ctx_bp, ctx);
+
+	/*
+	 * Setup a normal breakpoint for Linked Address Match, and link it
+	 * to the context-aware breakpoint.
+	 */
+	addr_bcr = DBGBCR_LEN8 | DBGBCR_EXEC | DBGBCR_EL1 | DBGBCR_E |
+		   DBGBCR_BT_ADDR_LINK_CTX |
+		   ((uint32_t)ctx_bp << DBGBCR_LBN_SHIFT);
+	write_dbgbcr(addr_bp, addr_bcr);
+	write_dbgbvr(addr_bp, addr);
+	isb();
+
+	enable_monitor_debug_exceptions();
 }

 static void install_ss(void)
···

 static volatile char write_data;

-static void guest_code(void)
+static void guest_code(uint8_t bpn, uint8_t wpn, uint8_t ctx_bpn)
 {
-	GUEST_SYNC(0);
+	uint64_t ctx = 0xabcdef;	/* a random context number */

	/* Software-breakpoint */
	reset_debug_state();
	asm volatile("sw_bp: brk #0");
	GUEST_ASSERT_EQ(sw_bp_addr, PC(sw_bp));

-	GUEST_SYNC(1);
-
	/* Hardware-breakpoint */
	reset_debug_state();
-	install_hw_bp(PC(hw_bp));
+	install_hw_bp(bpn, PC(hw_bp));
	asm volatile("hw_bp: nop");
	GUEST_ASSERT_EQ(hw_bp_addr, PC(hw_bp));

-	GUEST_SYNC(2);
-
	/* Hardware-breakpoint + svc */
	reset_debug_state();
-	install_hw_bp(PC(bp_svc));
+	install_hw_bp(bpn, PC(bp_svc));
	asm volatile("bp_svc: svc #0");
	GUEST_ASSERT_EQ(hw_bp_addr, PC(bp_svc));
	GUEST_ASSERT_EQ(svc_addr, PC(bp_svc) + 4);

-	GUEST_SYNC(3);
-
	/* Hardware-breakpoint + software-breakpoint */
	reset_debug_state();
-	install_hw_bp(PC(bp_brk));
+	install_hw_bp(bpn, PC(bp_brk));
	asm volatile("bp_brk: brk #0");
	GUEST_ASSERT_EQ(sw_bp_addr, PC(bp_brk));
	GUEST_ASSERT_EQ(hw_bp_addr, PC(bp_brk));

-	GUEST_SYNC(4);
-
	/* Watchpoint */
	reset_debug_state();
-	install_wp(PC(write_data));
+	install_wp(wpn, PC(write_data));
	write_data = 'x';
	GUEST_ASSERT_EQ(write_data, 'x');
	GUEST_ASSERT_EQ(wp_data_addr, PC(write_data));
-
-	GUEST_SYNC(5);

	/* Single-step */
	reset_debug_state();
···
	GUEST_ASSERT_EQ(ss_addr[1], PC(ss_start) + 4);
	GUEST_ASSERT_EQ(ss_addr[2], PC(ss_start) + 8);

-	GUEST_SYNC(6);
-
	/* OS Lock does not block software-breakpoint */
	reset_debug_state();
	enable_os_lock();
···
	asm volatile("sw_bp2: brk #0");
	GUEST_ASSERT_EQ(sw_bp_addr, PC(sw_bp2));

-	GUEST_SYNC(7);
-
	/* OS Lock blocking hardware-breakpoint */
	reset_debug_state();
	enable_os_lock();
-	install_hw_bp(PC(hw_bp2));
+	install_hw_bp(bpn, PC(hw_bp2));
	hw_bp_addr = 0;
	asm volatile("hw_bp2: nop");
	GUEST_ASSERT_EQ(hw_bp_addr, 0);

-	GUEST_SYNC(8);
-
	/* OS Lock blocking watchpoint */
	reset_debug_state();
	enable_os_lock();
	write_data = '\0';
	wp_data_addr = 0;
-	install_wp(PC(write_data));
+	install_wp(wpn, PC(write_data));
	write_data = 'x';
	GUEST_ASSERT_EQ(write_data, 'x');
	GUEST_ASSERT_EQ(wp_data_addr, 0);

-	GUEST_SYNC(9);
-
	/* OS Lock blocking single-step */
	reset_debug_state();
···
	      "msr daifset, #8\n\t"
	      : : : "x0");
	GUEST_ASSERT_EQ(ss_addr[0], 0);
+
+	/* Linked hardware-breakpoint */
+	hw_bp_addr = 0;
+	reset_debug_state();
+	install_hw_bp_ctx(bpn, ctx_bpn, PC(hw_bp_ctx), ctx);
+	/* Set context id */
+	write_sysreg(ctx, contextidr_el1);
+	isb();
+	asm volatile("hw_bp_ctx: nop");
+	write_sysreg(0, contextidr_el1);
+	GUEST_ASSERT_EQ(hw_bp_addr, PC(hw_bp_ctx));
+
+	/* Linked watchpoint */
+	reset_debug_state();
+	install_wp_ctx(wpn, ctx_bpn, PC(write_data), ctx);
+	/* Set context id */
+	write_sysreg(ctx, contextidr_el1);
+	isb();
+	write_data = 'x';
+	GUEST_ASSERT_EQ(write_data, 'x');
+	GUEST_ASSERT_EQ(wp_data_addr, PC(write_data));

	GUEST_DONE();
 }
···
	GUEST_DONE();
 }

-static int debug_version(struct kvm_vcpu *vcpu)
+static int debug_version(uint64_t id_aa64dfr0)
 {
-	uint64_t id_aa64dfr0;
-
-	vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64DFR0_EL1), &id_aa64dfr0);
-	return id_aa64dfr0 & 0xf;
+	return FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_DEBUGVER), id_aa64dfr0);
 }

-static void test_guest_debug_exceptions(void)
+static void test_guest_debug_exceptions(uint8_t bpn, uint8_t wpn, uint8_t ctx_bpn)
 {
	struct kvm_vcpu *vcpu;
	struct kvm_vm *vm;
	struct ucall uc;
-	int stage;

	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
···
	vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
				ESR_EC_SVC64, guest_svc_handler);

-	for (stage = 0; stage < 11; stage++) {
-		vcpu_run(vcpu);
+	/* Specify bpn/wpn/ctx_bpn to be tested */
+	vcpu_args_set(vcpu, 3, bpn, wpn, ctx_bpn);
+	pr_debug("Use bpn#%d, wpn#%d and ctx_bpn#%d\n", bpn, wpn, ctx_bpn);

-		switch (get_ucall(vcpu, &uc)) {
-		case UCALL_SYNC:
-			TEST_ASSERT(uc.args[1] == stage,
-				    "Stage %d: Unexpected sync ucall, got %lx",
-				    stage, (ulong)uc.args[1]);
-			break;
-		case UCALL_ABORT:
-			REPORT_GUEST_ASSERT_2(uc, "values: %#lx, %#lx");
-			break;
-		case UCALL_DONE:
-			goto done;
-		default:
-			TEST_FAIL("Unknown ucall %lu", uc.cmd);
-		}
+	vcpu_run(vcpu);
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT_2(uc, "values: %#lx, %#lx");
+		break;
+	case UCALL_DONE:
+		goto done;
+	default:
+		TEST_FAIL("Unknown ucall %lu", uc.cmd);
	}

 done:
···
	kvm_vm_free(vm);
 }

+/*
+ * Run debug testing using the various breakpoint#, watchpoint# and
+ * context-aware breakpoint# with the given ID_AA64DFR0_EL1 configuration.
+ */
+void test_guest_debug_exceptions_all(uint64_t aa64dfr0)
+{
+	uint8_t brp_num, wrp_num, ctx_brp_num, normal_brp_num, ctx_brp_base;
+	int b, w, c;
+
+	/* Number of breakpoints */
+	brp_num = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_BRPS), aa64dfr0) + 1;
+	__TEST_REQUIRE(brp_num >= 2, "At least two breakpoints are required");
+
+	/* Number of watchpoints */
+	wrp_num = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_WRPS), aa64dfr0) + 1;
+
+	/* Number of context aware breakpoints */
+	ctx_brp_num = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_CTX_CMPS), aa64dfr0) + 1;
+
+	pr_debug("%s brp_num:%d, wrp_num:%d, ctx_brp_num:%d\n", __func__,
+		 brp_num, wrp_num, ctx_brp_num);
+
+	/* Number of normal (non-context aware) breakpoints */
+	normal_brp_num = brp_num - ctx_brp_num;
+
+	/* Lowest context aware breakpoint number */
+	ctx_brp_base = normal_brp_num;
+
+	/* Run tests with all supported breakpoints/watchpoints */
+	for (c = ctx_brp_base; c < ctx_brp_base + ctx_brp_num; c++) {
+		for (b = 0; b < normal_brp_num; b++) {
+			for (w = 0; w < wrp_num; w++)
+				test_guest_debug_exceptions(b, w, c);
+		}
+	}
+}
+
 static void help(char *name)
 {
	puts("");
···
	struct kvm_vm *vm;
	int opt;
	int ss_iteration = 10000;
+	uint64_t aa64dfr0;

	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
-	__TEST_REQUIRE(debug_version(vcpu) >= 6,
+	vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64DFR0_EL1), &aa64dfr0);
+	__TEST_REQUIRE(debug_version(aa64dfr0) >= 6,
		       "Armv8 debug architecture not supported.");
	kvm_vm_free(vm);
···
	}

-	test_guest_debug_exceptions();
+	test_guest_debug_exceptions_all(aa64dfr0);
	test_single_step_from_userspace(ss_iteration);

	return 0;
+1117
tools/testing/selftests/kvm/aarch64/page_fault_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * page_fault_test.c - Test stage 2 faults. 4 + * 5 + * This test tries different combinations of guest accesses (e.g., write, 6 + * S1PTW), backing source type (e.g., anon) and types of faults (e.g., read on 7 + * hugetlbfs with a hole). It checks that the expected handling method is 8 + * called (e.g., uffd faults with the right address and write/read flag). 9 + */ 10 + 11 + #define _GNU_SOURCE 12 + #include <linux/bitmap.h> 13 + #include <fcntl.h> 14 + #include <test_util.h> 15 + #include <kvm_util.h> 16 + #include <processor.h> 17 + #include <asm/sysreg.h> 18 + #include <linux/bitfield.h> 19 + #include "guest_modes.h" 20 + #include "userfaultfd_util.h" 21 + 22 + /* Guest virtual addresses that point to the test page and its PTE. */ 23 + #define TEST_GVA 0xc0000000 24 + #define TEST_EXEC_GVA (TEST_GVA + 0x8) 25 + #define TEST_PTE_GVA 0xb0000000 26 + #define TEST_DATA 0x0123456789ABCDEF 27 + 28 + static uint64_t *guest_test_memory = (uint64_t *)TEST_GVA; 29 + 30 + #define CMD_NONE (0) 31 + #define CMD_SKIP_TEST (1ULL << 1) 32 + #define CMD_HOLE_PT (1ULL << 2) 33 + #define CMD_HOLE_DATA (1ULL << 3) 34 + #define CMD_CHECK_WRITE_IN_DIRTY_LOG (1ULL << 4) 35 + #define CMD_CHECK_S1PTW_WR_IN_DIRTY_LOG (1ULL << 5) 36 + #define CMD_CHECK_NO_WRITE_IN_DIRTY_LOG (1ULL << 6) 37 + #define CMD_CHECK_NO_S1PTW_WR_IN_DIRTY_LOG (1ULL << 7) 38 + #define CMD_SET_PTE_AF (1ULL << 8) 39 + 40 + #define PREPARE_FN_NR 10 41 + #define CHECK_FN_NR 10 42 + 43 + static struct event_cnt { 44 + int mmio_exits; 45 + int fail_vcpu_runs; 46 + int uffd_faults; 47 + /* uffd_faults is incremented from multiple threads. 
*/ 48 + pthread_mutex_t uffd_faults_mutex; 49 + } events; 50 + 51 + struct test_desc { 52 + const char *name; 53 + uint64_t mem_mark_cmd; 54 + /* Skip the test if any prepare function returns false */ 55 + bool (*guest_prepare[PREPARE_FN_NR])(void); 56 + void (*guest_test)(void); 57 + void (*guest_test_check[CHECK_FN_NR])(void); 58 + uffd_handler_t uffd_pt_handler; 59 + uffd_handler_t uffd_data_handler; 60 + void (*dabt_handler)(struct ex_regs *regs); 61 + void (*iabt_handler)(struct ex_regs *regs); 62 + void (*mmio_handler)(struct kvm_vm *vm, struct kvm_run *run); 63 + void (*fail_vcpu_run_handler)(int ret); 64 + uint32_t pt_memslot_flags; 65 + uint32_t data_memslot_flags; 66 + bool skip; 67 + struct event_cnt expected_events; 68 + }; 69 + 70 + struct test_params { 71 + enum vm_mem_backing_src_type src_type; 72 + struct test_desc *test_desc; 73 + }; 74 + 75 + static inline void flush_tlb_page(uint64_t vaddr) 76 + { 77 + uint64_t page = vaddr >> 12; 78 + 79 + dsb(ishst); 80 + asm volatile("tlbi vaae1is, %0" :: "r" (page)); 81 + dsb(ish); 82 + isb(); 83 + } 84 + 85 + static void guest_write64(void) 86 + { 87 + uint64_t val; 88 + 89 + WRITE_ONCE(*guest_test_memory, TEST_DATA); 90 + val = READ_ONCE(*guest_test_memory); 91 + GUEST_ASSERT_EQ(val, TEST_DATA); 92 + } 93 + 94 + /* Check the system for atomic instructions. */ 95 + static bool guest_check_lse(void) 96 + { 97 + uint64_t isar0 = read_sysreg(id_aa64isar0_el1); 98 + uint64_t atomic; 99 + 100 + atomic = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64ISAR0_ATOMICS), isar0); 101 + return atomic >= 2; 102 + } 103 + 104 + static bool guest_check_dc_zva(void) 105 + { 106 + uint64_t dczid = read_sysreg(dczid_el0); 107 + uint64_t dzp = FIELD_GET(ARM64_FEATURE_MASK(DCZID_DZP), dczid); 108 + 109 + return dzp == 0; 110 + } 111 + 112 + /* Compare and swap instruction. 
*/ 113 + static void guest_cas(void) 114 + { 115 + uint64_t val; 116 + 117 + GUEST_ASSERT(guest_check_lse()); 118 + asm volatile(".arch_extension lse\n" 119 + "casal %0, %1, [%2]\n" 120 + :: "r" (0), "r" (TEST_DATA), "r" (guest_test_memory)); 121 + val = READ_ONCE(*guest_test_memory); 122 + GUEST_ASSERT_EQ(val, TEST_DATA); 123 + } 124 + 125 + static void guest_read64(void) 126 + { 127 + uint64_t val; 128 + 129 + val = READ_ONCE(*guest_test_memory); 130 + GUEST_ASSERT_EQ(val, 0); 131 + } 132 + 133 + /* Address translation instruction */ 134 + static void guest_at(void) 135 + { 136 + uint64_t par; 137 + 138 + asm volatile("at s1e1r, %0" :: "r" (guest_test_memory)); 139 + par = read_sysreg(par_el1); 140 + isb(); 141 + 142 + /* Bit 1 indicates whether the AT was successful */ 143 + GUEST_ASSERT_EQ(par & 1, 0); 144 + } 145 + 146 + /* 147 + * The size of the block written by "dc zva" is guaranteed to be between (2 << 148 + * 0) and (2 << 9), which is safe in our case as we need the write to happen 149 + * for at least a word, and not more than a page. 150 + */ 151 + static void guest_dc_zva(void) 152 + { 153 + uint16_t val; 154 + 155 + asm volatile("dc zva, %0" :: "r" (guest_test_memory)); 156 + dsb(ish); 157 + val = READ_ONCE(*guest_test_memory); 158 + GUEST_ASSERT_EQ(val, 0); 159 + } 160 + 161 + /* 162 + * Pre-indexing loads and stores don't have a valid syndrome (ESR_EL2.ISV==0). 163 + * And that's special because KVM must take special care with those: they 164 + * should still count as accesses for dirty logging or user-faulting, but 165 + * should be handled differently on mmio. 166 + */ 167 + static void guest_ld_preidx(void) 168 + { 169 + uint64_t val; 170 + uint64_t addr = TEST_GVA - 8; 171 + 172 + /* 173 + * This ends up accessing "TEST_GVA + 8 - 8", where "TEST_GVA - 8" is 174 + * in a gap between memslots not backing by anything. 175 + */ 176 + asm volatile("ldr %0, [%1, #8]!" 
177 + : "=r" (val), "+r" (addr)); 178 + GUEST_ASSERT_EQ(val, 0); 179 + GUEST_ASSERT_EQ(addr, TEST_GVA); 180 + } 181 + 182 + static void guest_st_preidx(void) 183 + { 184 + uint64_t val = TEST_DATA; 185 + uint64_t addr = TEST_GVA - 8; 186 + 187 + asm volatile("str %0, [%1, #8]!" 188 + : "+r" (val), "+r" (addr)); 189 + 190 + GUEST_ASSERT_EQ(addr, TEST_GVA); 191 + val = READ_ONCE(*guest_test_memory); 192 + } 193 + 194 + static bool guest_set_ha(void) 195 + { 196 + uint64_t mmfr1 = read_sysreg(id_aa64mmfr1_el1); 197 + uint64_t hadbs, tcr; 198 + 199 + /* Skip if HA is not supported. */ 200 + hadbs = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR1_HADBS), mmfr1); 201 + if (hadbs == 0) 202 + return false; 203 + 204 + tcr = read_sysreg(tcr_el1) | TCR_EL1_HA; 205 + write_sysreg(tcr, tcr_el1); 206 + isb(); 207 + 208 + return true; 209 + } 210 + 211 + static bool guest_clear_pte_af(void) 212 + { 213 + *((uint64_t *)TEST_PTE_GVA) &= ~PTE_AF; 214 + flush_tlb_page(TEST_GVA); 215 + 216 + return true; 217 + } 218 + 219 + static void guest_check_pte_af(void) 220 + { 221 + dsb(ish); 222 + GUEST_ASSERT_EQ(*((uint64_t *)TEST_PTE_GVA) & PTE_AF, PTE_AF); 223 + } 224 + 225 + static void guest_check_write_in_dirty_log(void) 226 + { 227 + GUEST_SYNC(CMD_CHECK_WRITE_IN_DIRTY_LOG); 228 + } 229 + 230 + static void guest_check_no_write_in_dirty_log(void) 231 + { 232 + GUEST_SYNC(CMD_CHECK_NO_WRITE_IN_DIRTY_LOG); 233 + } 234 + 235 + static void guest_check_s1ptw_wr_in_dirty_log(void) 236 + { 237 + GUEST_SYNC(CMD_CHECK_S1PTW_WR_IN_DIRTY_LOG); 238 + } 239 + 240 + static void guest_exec(void) 241 + { 242 + int (*code)(void) = (int (*)(void))TEST_EXEC_GVA; 243 + int ret; 244 + 245 + ret = code(); 246 + GUEST_ASSERT_EQ(ret, 0x77); 247 + } 248 + 249 + static bool guest_prepare(struct test_desc *test) 250 + { 251 + bool (*prepare_fn)(void); 252 + int i; 253 + 254 + for (i = 0; i < PREPARE_FN_NR; i++) { 255 + prepare_fn = test->guest_prepare[i]; 256 + if (prepare_fn && !prepare_fn()) 257 + return false; 258 
+ } 259 + 260 + return true; 261 + } 262 + 263 + static void guest_test_check(struct test_desc *test) 264 + { 265 + void (*check_fn)(void); 266 + int i; 267 + 268 + for (i = 0; i < CHECK_FN_NR; i++) { 269 + check_fn = test->guest_test_check[i]; 270 + if (check_fn) 271 + check_fn(); 272 + } 273 + } 274 + 275 + static void guest_code(struct test_desc *test) 276 + { 277 + if (!guest_prepare(test)) 278 + GUEST_SYNC(CMD_SKIP_TEST); 279 + 280 + GUEST_SYNC(test->mem_mark_cmd); 281 + 282 + if (test->guest_test) 283 + test->guest_test(); 284 + 285 + guest_test_check(test); 286 + GUEST_DONE(); 287 + } 288 + 289 + static void no_dabt_handler(struct ex_regs *regs) 290 + { 291 + GUEST_ASSERT_1(false, read_sysreg(far_el1)); 292 + } 293 + 294 + static void no_iabt_handler(struct ex_regs *regs) 295 + { 296 + GUEST_ASSERT_1(false, regs->pc); 297 + } 298 + 299 + static struct uffd_args { 300 + char *copy; 301 + void *hva; 302 + uint64_t paging_size; 303 + } pt_args, data_args; 304 + 305 + /* Returns true to continue the test, and false if it should be skipped. 
*/ 306 + static int uffd_generic_handler(int uffd_mode, int uffd, struct uffd_msg *msg, 307 + struct uffd_args *args, bool expect_write) 308 + { 309 + uint64_t addr = msg->arg.pagefault.address; 310 + uint64_t flags = msg->arg.pagefault.flags; 311 + struct uffdio_copy copy; 312 + int ret; 313 + 314 + TEST_ASSERT(uffd_mode == UFFDIO_REGISTER_MODE_MISSING, 315 + "The only expected UFFD mode is MISSING"); 316 + ASSERT_EQ(!!(flags & UFFD_PAGEFAULT_FLAG_WRITE), expect_write); 317 + ASSERT_EQ(addr, (uint64_t)args->hva); 318 + 319 + pr_debug("uffd fault: addr=%p write=%d\n", 320 + (void *)addr, !!(flags & UFFD_PAGEFAULT_FLAG_WRITE)); 321 + 322 + copy.src = (uint64_t)args->copy; 323 + copy.dst = addr; 324 + copy.len = args->paging_size; 325 + copy.mode = 0; 326 + 327 + ret = ioctl(uffd, UFFDIO_COPY, &copy); 328 + if (ret == -1) { 329 + pr_info("Failed UFFDIO_COPY in 0x%lx with errno: %d\n", 330 + addr, errno); 331 + return ret; 332 + } 333 + 334 + pthread_mutex_lock(&events.uffd_faults_mutex); 335 + events.uffd_faults += 1; 336 + pthread_mutex_unlock(&events.uffd_faults_mutex); 337 + return 0; 338 + } 339 + 340 + static int uffd_pt_write_handler(int mode, int uffd, struct uffd_msg *msg) 341 + { 342 + return uffd_generic_handler(mode, uffd, msg, &pt_args, true); 343 + } 344 + 345 + static int uffd_data_write_handler(int mode, int uffd, struct uffd_msg *msg) 346 + { 347 + return uffd_generic_handler(mode, uffd, msg, &data_args, true); 348 + } 349 + 350 + static int uffd_data_read_handler(int mode, int uffd, struct uffd_msg *msg) 351 + { 352 + return uffd_generic_handler(mode, uffd, msg, &data_args, false); 353 + } 354 + 355 + static void setup_uffd_args(struct userspace_mem_region *region, 356 + struct uffd_args *args) 357 + { 358 + args->hva = (void *)region->region.userspace_addr; 359 + args->paging_size = region->region.memory_size; 360 + 361 + args->copy = malloc(args->paging_size); 362 + TEST_ASSERT(args->copy, "Failed to allocate data copy."); 363 + memcpy(args->copy, 
args->hva, args->paging_size); 364 + } 365 + 366 + static void setup_uffd(struct kvm_vm *vm, struct test_params *p, 367 + struct uffd_desc **pt_uffd, struct uffd_desc **data_uffd) 368 + { 369 + struct test_desc *test = p->test_desc; 370 + int uffd_mode = UFFDIO_REGISTER_MODE_MISSING; 371 + 372 + setup_uffd_args(vm_get_mem_region(vm, MEM_REGION_PT), &pt_args); 373 + setup_uffd_args(vm_get_mem_region(vm, MEM_REGION_TEST_DATA), &data_args); 374 + 375 + *pt_uffd = NULL; 376 + if (test->uffd_pt_handler) 377 + *pt_uffd = uffd_setup_demand_paging(uffd_mode, 0, 378 + pt_args.hva, 379 + pt_args.paging_size, 380 + test->uffd_pt_handler); 381 + 382 + *data_uffd = NULL; 383 + if (test->uffd_data_handler) 384 + *data_uffd = uffd_setup_demand_paging(uffd_mode, 0, 385 + data_args.hva, 386 + data_args.paging_size, 387 + test->uffd_data_handler); 388 + } 389 + 390 + static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd, 391 + struct uffd_desc *data_uffd) 392 + { 393 + if (test->uffd_pt_handler) 394 + uffd_stop_demand_paging(pt_uffd); 395 + if (test->uffd_data_handler) 396 + uffd_stop_demand_paging(data_uffd); 397 + 398 + free(pt_args.copy); 399 + free(data_args.copy); 400 + } 401 + 402 + static int uffd_no_handler(int mode, int uffd, struct uffd_msg *msg) 403 + { 404 + TEST_FAIL("There was no UFFD fault expected."); 405 + return -1; 406 + } 407 + 408 + /* Returns false if the test should be skipped. 
*/ 409 + static bool punch_hole_in_backing_store(struct kvm_vm *vm, 410 + struct userspace_mem_region *region) 411 + { 412 + void *hva = (void *)region->region.userspace_addr; 413 + uint64_t paging_size = region->region.memory_size; 414 + int ret, fd = region->fd; 415 + 416 + if (fd != -1) { 417 + ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 418 + 0, paging_size); 419 + TEST_ASSERT(ret == 0, "fallocate failed\n"); 420 + } else { 421 + ret = madvise(hva, paging_size, MADV_DONTNEED); 422 + TEST_ASSERT(ret == 0, "madvise failed\n"); 423 + } 424 + 425 + return true; 426 + } 427 + 428 + static void mmio_on_test_gpa_handler(struct kvm_vm *vm, struct kvm_run *run) 429 + { 430 + struct userspace_mem_region *region; 431 + void *hva; 432 + 433 + region = vm_get_mem_region(vm, MEM_REGION_TEST_DATA); 434 + hva = (void *)region->region.userspace_addr; 435 + 436 + ASSERT_EQ(run->mmio.phys_addr, region->region.guest_phys_addr); 437 + 438 + memcpy(hva, run->mmio.data, run->mmio.len); 439 + events.mmio_exits += 1; 440 + } 441 + 442 + static void mmio_no_handler(struct kvm_vm *vm, struct kvm_run *run) 443 + { 444 + uint64_t data; 445 + 446 + memcpy(&data, run->mmio.data, sizeof(data)); 447 + pr_debug("addr=%lld len=%d w=%d data=%lx\n", 448 + run->mmio.phys_addr, run->mmio.len, 449 + run->mmio.is_write, data); 450 + TEST_FAIL("There was no MMIO exit expected."); 451 + } 452 + 453 + static bool check_write_in_dirty_log(struct kvm_vm *vm, 454 + struct userspace_mem_region *region, 455 + uint64_t host_pg_nr) 456 + { 457 + unsigned long *bmap; 458 + bool first_page_dirty; 459 + uint64_t size = region->region.memory_size; 460 + 461 + /* getpage_size() is not always equal to vm->page_size */ 462 + bmap = bitmap_zalloc(size / getpagesize()); 463 + kvm_vm_get_dirty_log(vm, region->region.slot, bmap); 464 + first_page_dirty = test_bit(host_pg_nr, bmap); 465 + free(bmap); 466 + return first_page_dirty; 467 + } 468 + 469 + /* Returns true to continue the test, and false if it 
should be skipped. */ 470 + static bool handle_cmd(struct kvm_vm *vm, int cmd) 471 + { 472 + struct userspace_mem_region *data_region, *pt_region; 473 + bool continue_test = true; 474 + 475 + data_region = vm_get_mem_region(vm, MEM_REGION_TEST_DATA); 476 + pt_region = vm_get_mem_region(vm, MEM_REGION_PT); 477 + 478 + if (cmd == CMD_SKIP_TEST) 479 + continue_test = false; 480 + 481 + if (cmd & CMD_HOLE_PT) 482 + continue_test = punch_hole_in_backing_store(vm, pt_region); 483 + if (cmd & CMD_HOLE_DATA) 484 + continue_test = punch_hole_in_backing_store(vm, data_region); 485 + if (cmd & CMD_CHECK_WRITE_IN_DIRTY_LOG) 486 + TEST_ASSERT(check_write_in_dirty_log(vm, data_region, 0), 487 + "Missing write in dirty log"); 488 + if (cmd & CMD_CHECK_S1PTW_WR_IN_DIRTY_LOG) 489 + TEST_ASSERT(check_write_in_dirty_log(vm, pt_region, 0), 490 + "Missing s1ptw write in dirty log"); 491 + if (cmd & CMD_CHECK_NO_WRITE_IN_DIRTY_LOG) 492 + TEST_ASSERT(!check_write_in_dirty_log(vm, data_region, 0), 493 + "Unexpected write in dirty log"); 494 + if (cmd & CMD_CHECK_NO_S1PTW_WR_IN_DIRTY_LOG) 495 + TEST_ASSERT(!check_write_in_dirty_log(vm, pt_region, 0), 496 + "Unexpected s1ptw write in dirty log"); 497 + 498 + return continue_test; 499 + } 500 + 501 + void fail_vcpu_run_no_handler(int ret) 502 + { 503 + TEST_FAIL("Unexpected vcpu run failure\n"); 504 + } 505 + 506 + void fail_vcpu_run_mmio_no_syndrome_handler(int ret) 507 + { 508 + TEST_ASSERT(errno == ENOSYS, 509 + "The mmio handler should have returned not implemented."); 510 + events.fail_vcpu_runs += 1; 511 + } 512 + 513 + typedef uint32_t aarch64_insn_t; 514 + extern aarch64_insn_t __exec_test[2]; 515 + 516 + noinline void __return_0x77(void) 517 + { 518 + asm volatile("__exec_test: mov x0, #0x77\n" 519 + "ret\n"); 520 + } 521 + 522 + /* 523 + * Note that this function runs on the host before the test VM starts: there's 524 + * no need to sync the D$ and I$ caches. 
525 + */ 526 + static void load_exec_code_for_test(struct kvm_vm *vm) 527 + { 528 + uint64_t *code; 529 + struct userspace_mem_region *region; 530 + void *hva; 531 + 532 + region = vm_get_mem_region(vm, MEM_REGION_TEST_DATA); 533 + hva = (void *)region->region.userspace_addr; 534 + 535 + assert(TEST_EXEC_GVA > TEST_GVA); 536 + code = hva + TEST_EXEC_GVA - TEST_GVA; 537 + memcpy(code, __exec_test, sizeof(__exec_test)); 538 + } 539 + 540 + static void setup_abort_handlers(struct kvm_vm *vm, struct kvm_vcpu *vcpu, 541 + struct test_desc *test) 542 + { 543 + vm_init_descriptor_tables(vm); 544 + vcpu_init_descriptor_tables(vcpu); 545 + 546 + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, 547 + ESR_EC_DABT, no_dabt_handler); 548 + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, 549 + ESR_EC_IABT, no_iabt_handler); 550 + } 551 + 552 + static void setup_gva_maps(struct kvm_vm *vm) 553 + { 554 + struct userspace_mem_region *region; 555 + uint64_t pte_gpa; 556 + 557 + region = vm_get_mem_region(vm, MEM_REGION_TEST_DATA); 558 + /* Map TEST_GVA first. This will install a new PTE. */ 559 + virt_pg_map(vm, TEST_GVA, region->region.guest_phys_addr); 560 + /* Then map TEST_PTE_GVA to the above PTE. */ 561 + pte_gpa = addr_hva2gpa(vm, virt_get_pte_hva(vm, TEST_GVA)); 562 + virt_pg_map(vm, TEST_PTE_GVA, pte_gpa); 563 + } 564 + 565 + enum pf_test_memslots { 566 + CODE_AND_DATA_MEMSLOT, 567 + PAGE_TABLE_MEMSLOT, 568 + TEST_DATA_MEMSLOT, 569 + }; 570 + 571 + /* 572 + * Create a memslot for code and data at pfn=0, and test-data and PT ones 573 + * at max_gfn. 574 + */ 575 + static void setup_memslots(struct kvm_vm *vm, struct test_params *p) 576 + { 577 + uint64_t backing_src_pagesz = get_backing_src_pagesz(p->src_type); 578 + uint64_t guest_page_size = vm->page_size; 579 + uint64_t max_gfn = vm_compute_max_gfn(vm); 580 + /* Enough for 2M of code when using 4K guest pages. 
*/ 581 + uint64_t code_npages = 512; 582 + uint64_t pt_size, data_size, data_gpa; 583 + 584 + /* 585 + * This test requires 1 pgd, 2 pud, 4 pmd, and 6 pte pages when using 586 + * VM_MODE_P48V48_4K. Note that the .text takes ~1.6MBs. That's 13 587 + * pages. VM_MODE_P48V48_4K is the mode with most PT pages; let's use 588 + * twice that just in case. 589 + */ 590 + pt_size = 26 * guest_page_size; 591 + 592 + /* memslot sizes and gpa's must be aligned to the backing page size */ 593 + pt_size = align_up(pt_size, backing_src_pagesz); 594 + data_size = align_up(guest_page_size, backing_src_pagesz); 595 + data_gpa = (max_gfn * guest_page_size) - data_size; 596 + data_gpa = align_down(data_gpa, backing_src_pagesz); 597 + 598 + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 599 + CODE_AND_DATA_MEMSLOT, code_npages, 0); 600 + vm->memslots[MEM_REGION_CODE] = CODE_AND_DATA_MEMSLOT; 601 + vm->memslots[MEM_REGION_DATA] = CODE_AND_DATA_MEMSLOT; 602 + 603 + vm_userspace_mem_region_add(vm, p->src_type, data_gpa - pt_size, 604 + PAGE_TABLE_MEMSLOT, pt_size / guest_page_size, 605 + p->test_desc->pt_memslot_flags); 606 + vm->memslots[MEM_REGION_PT] = PAGE_TABLE_MEMSLOT; 607 + 608 + vm_userspace_mem_region_add(vm, p->src_type, data_gpa, TEST_DATA_MEMSLOT, 609 + data_size / guest_page_size, 610 + p->test_desc->data_memslot_flags); 611 + vm->memslots[MEM_REGION_TEST_DATA] = TEST_DATA_MEMSLOT; 612 + } 613 + 614 + static void setup_ucall(struct kvm_vm *vm) 615 + { 616 + struct userspace_mem_region *region = vm_get_mem_region(vm, MEM_REGION_TEST_DATA); 617 + 618 + ucall_init(vm, region->region.guest_phys_addr + region->region.memory_size); 619 + } 620 + 621 + static void setup_default_handlers(struct test_desc *test) 622 + { 623 + if (!test->mmio_handler) 624 + test->mmio_handler = mmio_no_handler; 625 + 626 + if (!test->fail_vcpu_run_handler) 627 + test->fail_vcpu_run_handler = fail_vcpu_run_no_handler; 628 + } 629 + 630 + static void check_event_counts(struct test_desc *test) 
631 + { 632 + ASSERT_EQ(test->expected_events.uffd_faults, events.uffd_faults); 633 + ASSERT_EQ(test->expected_events.mmio_exits, events.mmio_exits); 634 + ASSERT_EQ(test->expected_events.fail_vcpu_runs, events.fail_vcpu_runs); 635 + } 636 + 637 + static void print_test_banner(enum vm_guest_mode mode, struct test_params *p) 638 + { 639 + struct test_desc *test = p->test_desc; 640 + 641 + pr_debug("Test: %s\n", test->name); 642 + pr_debug("Testing guest mode: %s\n", vm_guest_mode_string(mode)); 643 + pr_debug("Testing memory backing src type: %s\n", 644 + vm_mem_backing_src_alias(p->src_type)->name); 645 + } 646 + 647 + static void reset_event_counts(void) 648 + { 649 + memset(&events, 0, sizeof(events)); 650 + } 651 + 652 + /* 653 + * This function either succeeds, skips the test (after setting test->skip), or 654 + * fails with a TEST_FAIL that aborts all tests. 655 + */ 656 + static void vcpu_run_loop(struct kvm_vm *vm, struct kvm_vcpu *vcpu, 657 + struct test_desc *test) 658 + { 659 + struct kvm_run *run; 660 + struct ucall uc; 661 + int ret; 662 + 663 + run = vcpu->run; 664 + 665 + for (;;) { 666 + ret = _vcpu_run(vcpu); 667 + if (ret) { 668 + test->fail_vcpu_run_handler(ret); 669 + goto done; 670 + } 671 + 672 + switch (get_ucall(vcpu, &uc)) { 673 + case UCALL_SYNC: 674 + if (!handle_cmd(vm, uc.args[1])) { 675 + test->skip = true; 676 + goto done; 677 + } 678 + break; 679 + case UCALL_ABORT: 680 + REPORT_GUEST_ASSERT_2(uc, "values: %#lx, %#lx"); 681 + break; 682 + case UCALL_DONE: 683 + goto done; 684 + case UCALL_NONE: 685 + if (run->exit_reason == KVM_EXIT_MMIO) 686 + test->mmio_handler(vm, run); 687 + break; 688 + default: 689 + TEST_FAIL("Unknown ucall %lu", uc.cmd); 690 + } 691 + } 692 + 693 + done: 694 + pr_debug(test->skip ? 
"Skipped.\n" : "Done.\n"); 695 + } 696 + 697 + static void run_test(enum vm_guest_mode mode, void *arg) 698 + { 699 + struct test_params *p = (struct test_params *)arg; 700 + struct test_desc *test = p->test_desc; 701 + struct kvm_vm *vm; 702 + struct kvm_vcpu *vcpu; 703 + struct uffd_desc *pt_uffd, *data_uffd; 704 + 705 + print_test_banner(mode, p); 706 + 707 + vm = ____vm_create(mode); 708 + setup_memslots(vm, p); 709 + kvm_vm_elf_load(vm, program_invocation_name); 710 + setup_ucall(vm); 711 + vcpu = vm_vcpu_add(vm, 0, guest_code); 712 + 713 + setup_gva_maps(vm); 714 + 715 + reset_event_counts(); 716 + 717 + /* 718 + * Set some code in the data memslot for the guest to execute (only 719 + * applicable to the EXEC tests). This has to be done before 720 + * setup_uffd() as that function copies the memslot data for the uffd 721 + * handler. 722 + */ 723 + load_exec_code_for_test(vm); 724 + setup_uffd(vm, p, &pt_uffd, &data_uffd); 725 + setup_abort_handlers(vm, vcpu, test); 726 + setup_default_handlers(test); 727 + vcpu_args_set(vcpu, 1, test); 728 + 729 + vcpu_run_loop(vm, vcpu, test); 730 + 731 + kvm_vm_free(vm); 732 + free_uffd(test, pt_uffd, data_uffd); 733 + 734 + /* 735 + * Make sure we check the events after the uffd threads have exited, 736 + * which means they updated their respective event counters. 
737 + */ 738 + if (!test->skip) 739 + check_event_counts(test); 740 + } 741 + 742 + static void help(char *name) 743 + { 744 + puts(""); 745 + printf("usage: %s [-h] [-s mem-type]\n", name); 746 + puts(""); 747 + guest_modes_help(); 748 + backing_src_help("-s"); 749 + puts(""); 750 + } 751 + 752 + #define SNAME(s) #s 753 + #define SCAT2(a, b) SNAME(a ## _ ## b) 754 + #define SCAT3(a, b, c) SCAT2(a, SCAT2(b, c)) 755 + #define SCAT4(a, b, c, d) SCAT2(a, SCAT3(b, c, d)) 756 + 757 + #define _CHECK(_test) _CHECK_##_test 758 + #define _PREPARE(_test) _PREPARE_##_test 759 + #define _PREPARE_guest_read64 NULL 760 + #define _PREPARE_guest_ld_preidx NULL 761 + #define _PREPARE_guest_write64 NULL 762 + #define _PREPARE_guest_st_preidx NULL 763 + #define _PREPARE_guest_exec NULL 764 + #define _PREPARE_guest_at NULL 765 + #define _PREPARE_guest_dc_zva guest_check_dc_zva 766 + #define _PREPARE_guest_cas guest_check_lse 767 + 768 + /* With or without access flag checks */ 769 + #define _PREPARE_with_af guest_set_ha, guest_clear_pte_af 770 + #define _PREPARE_no_af NULL 771 + #define _CHECK_with_af guest_check_pte_af 772 + #define _CHECK_no_af NULL 773 + 774 + /* Performs an access and checks that no faults were triggered. 
*/ 775 + #define TEST_ACCESS(_access, _with_af, _mark_cmd) \ 776 + { \ 777 + .name = SCAT3(_access, _with_af, #_mark_cmd), \ 778 + .guest_prepare = { _PREPARE(_with_af), \ 779 + _PREPARE(_access) }, \ 780 + .mem_mark_cmd = _mark_cmd, \ 781 + .guest_test = _access, \ 782 + .guest_test_check = { _CHECK(_with_af) }, \ 783 + .expected_events = { 0 }, \ 784 + } 785 + 786 + #define TEST_UFFD(_access, _with_af, _mark_cmd, \ 787 + _uffd_data_handler, _uffd_pt_handler, _uffd_faults) \ 788 + { \ 789 + .name = SCAT4(uffd, _access, _with_af, #_mark_cmd), \ 790 + .guest_prepare = { _PREPARE(_with_af), \ 791 + _PREPARE(_access) }, \ 792 + .guest_test = _access, \ 793 + .mem_mark_cmd = _mark_cmd, \ 794 + .guest_test_check = { _CHECK(_with_af) }, \ 795 + .uffd_data_handler = _uffd_data_handler, \ 796 + .uffd_pt_handler = _uffd_pt_handler, \ 797 + .expected_events = { .uffd_faults = _uffd_faults, }, \ 798 + } 799 + 800 + #define TEST_DIRTY_LOG(_access, _with_af, _test_check) \ 801 + { \ 802 + .name = SCAT3(dirty_log, _access, _with_af), \ 803 + .data_memslot_flags = KVM_MEM_LOG_DIRTY_PAGES, \ 804 + .pt_memslot_flags = KVM_MEM_LOG_DIRTY_PAGES, \ 805 + .guest_prepare = { _PREPARE(_with_af), \ 806 + _PREPARE(_access) }, \ 807 + .guest_test = _access, \ 808 + .guest_test_check = { _CHECK(_with_af), _test_check, \ 809 + guest_check_s1ptw_wr_in_dirty_log}, \ 810 + .expected_events = { 0 }, \ 811 + } 812 + 813 + #define TEST_UFFD_AND_DIRTY_LOG(_access, _with_af, _uffd_data_handler, \ 814 + _uffd_faults, _test_check) \ 815 + { \ 816 + .name = SCAT3(uffd_and_dirty_log, _access, _with_af), \ 817 + .data_memslot_flags = KVM_MEM_LOG_DIRTY_PAGES, \ 818 + .pt_memslot_flags = KVM_MEM_LOG_DIRTY_PAGES, \ 819 + .guest_prepare = { _PREPARE(_with_af), \ 820 + _PREPARE(_access) }, \ 821 + .guest_test = _access, \ 822 + .mem_mark_cmd = CMD_HOLE_DATA | CMD_HOLE_PT, \ 823 + .guest_test_check = { _CHECK(_with_af), _test_check }, \ 824 + .uffd_data_handler = _uffd_data_handler, \ 825 + .uffd_pt_handler = 
uffd_pt_write_handler, \ 826 + .expected_events = { .uffd_faults = _uffd_faults, }, \ 827 + } 828 + 829 + #define TEST_RO_MEMSLOT(_access, _mmio_handler, _mmio_exits) \ 830 + { \ 831 + .name = SCAT3(ro_memslot, _access, _with_af), \ 832 + .data_memslot_flags = KVM_MEM_READONLY, \ 833 + .guest_prepare = { _PREPARE(_access) }, \ 834 + .guest_test = _access, \ 835 + .mmio_handler = _mmio_handler, \ 836 + .expected_events = { .mmio_exits = _mmio_exits }, \ 837 + } 838 + 839 + #define TEST_RO_MEMSLOT_NO_SYNDROME(_access) \ 840 + { \ 841 + .name = SCAT2(ro_memslot_no_syndrome, _access), \ 842 + .data_memslot_flags = KVM_MEM_READONLY, \ 843 + .guest_test = _access, \ 844 + .fail_vcpu_run_handler = fail_vcpu_run_mmio_no_syndrome_handler, \ 845 + .expected_events = { .fail_vcpu_runs = 1 }, \ 846 + } 847 + 848 + #define TEST_RO_MEMSLOT_AND_DIRTY_LOG(_access, _mmio_handler, _mmio_exits, \ 849 + _test_check) \ 850 + { \ 851 + .name = SCAT3(ro_memslot, _access, _with_af), \ 852 + .data_memslot_flags = KVM_MEM_READONLY | KVM_MEM_LOG_DIRTY_PAGES, \ 853 + .pt_memslot_flags = KVM_MEM_LOG_DIRTY_PAGES, \ 854 + .guest_prepare = { _PREPARE(_access) }, \ 855 + .guest_test = _access, \ 856 + .guest_test_check = { _test_check }, \ 857 + .mmio_handler = _mmio_handler, \ 858 + .expected_events = { .mmio_exits = _mmio_exits}, \ 859 + } 860 + 861 + #define TEST_RO_MEMSLOT_NO_SYNDROME_AND_DIRTY_LOG(_access, _test_check) \ 862 + { \ 863 + .name = SCAT2(ro_memslot_no_syn_and_dlog, _access), \ 864 + .data_memslot_flags = KVM_MEM_READONLY | KVM_MEM_LOG_DIRTY_PAGES, \ 865 + .pt_memslot_flags = KVM_MEM_LOG_DIRTY_PAGES, \ 866 + .guest_test = _access, \ 867 + .guest_test_check = { _test_check }, \ 868 + .fail_vcpu_run_handler = fail_vcpu_run_mmio_no_syndrome_handler, \ 869 + .expected_events = { .fail_vcpu_runs = 1 }, \ 870 + } 871 + 872 + #define TEST_RO_MEMSLOT_AND_UFFD(_access, _mmio_handler, _mmio_exits, \ 873 + _uffd_data_handler, _uffd_faults) \ 874 + { \ 875 + .name = SCAT2(ro_memslot_uffd, 
_access), \ 876 + .data_memslot_flags = KVM_MEM_READONLY, \ 877 + .mem_mark_cmd = CMD_HOLE_DATA | CMD_HOLE_PT, \ 878 + .guest_prepare = { _PREPARE(_access) }, \ 879 + .guest_test = _access, \ 880 + .uffd_data_handler = _uffd_data_handler, \ 881 + .uffd_pt_handler = uffd_pt_write_handler, \ 882 + .mmio_handler = _mmio_handler, \ 883 + .expected_events = { .mmio_exits = _mmio_exits, \ 884 + .uffd_faults = _uffd_faults }, \ 885 + } 886 + 887 + #define TEST_RO_MEMSLOT_NO_SYNDROME_AND_UFFD(_access, _uffd_data_handler, \ 888 + _uffd_faults) \ 889 + { \ 890 + .name = SCAT2(ro_memslot_no_syndrome, _access), \ 891 + .data_memslot_flags = KVM_MEM_READONLY, \ 892 + .mem_mark_cmd = CMD_HOLE_DATA | CMD_HOLE_PT, \ 893 + .guest_test = _access, \ 894 + .uffd_data_handler = _uffd_data_handler, \ 895 + .uffd_pt_handler = uffd_pt_write_handler, \ 896 + .fail_vcpu_run_handler = fail_vcpu_run_mmio_no_syndrome_handler, \ 897 + .expected_events = { .fail_vcpu_runs = 1, \ 898 + .uffd_faults = _uffd_faults }, \ 899 + } 900 + 901 + static struct test_desc tests[] = { 902 + 903 + /* Check that HW is setting the Access Flag (AF) (sanity checks). */ 904 + TEST_ACCESS(guest_read64, with_af, CMD_NONE), 905 + TEST_ACCESS(guest_ld_preidx, with_af, CMD_NONE), 906 + TEST_ACCESS(guest_cas, with_af, CMD_NONE), 907 + TEST_ACCESS(guest_write64, with_af, CMD_NONE), 908 + TEST_ACCESS(guest_st_preidx, with_af, CMD_NONE), 909 + TEST_ACCESS(guest_dc_zva, with_af, CMD_NONE), 910 + TEST_ACCESS(guest_exec, with_af, CMD_NONE), 911 + 912 + /* 913 + * Punch a hole in the data backing store, and then try multiple 914 + * accesses: reads should return zeroes, and writes should 915 + * re-populate the page. Moreover, the test also checks that no 916 + * exception was generated in the guest. Note that this 917 + * reading/writing behavior is the same as reading/writing a 918 + * punched page (with fallocate(FALLOC_FL_PUNCH_HOLE)) from 919 + * userspace. 
920 + */ 921 + TEST_ACCESS(guest_read64, no_af, CMD_HOLE_DATA), 922 + TEST_ACCESS(guest_cas, no_af, CMD_HOLE_DATA), 923 + TEST_ACCESS(guest_ld_preidx, no_af, CMD_HOLE_DATA), 924 + TEST_ACCESS(guest_write64, no_af, CMD_HOLE_DATA), 925 + TEST_ACCESS(guest_st_preidx, no_af, CMD_HOLE_DATA), 926 + TEST_ACCESS(guest_at, no_af, CMD_HOLE_DATA), 927 + TEST_ACCESS(guest_dc_zva, no_af, CMD_HOLE_DATA), 928 + 929 + /* 930 + * Punch holes in the data and PT backing stores and mark them for 931 + * userfaultfd handling. This should result in 2 faults: the access 932 + * on the data backing store, and its respective S1 page table walk 933 + * (S1PTW). 934 + */ 935 + TEST_UFFD(guest_read64, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 936 + uffd_data_read_handler, uffd_pt_write_handler, 2), 937 + /* no_af should also lead to a PT write. */ 938 + TEST_UFFD(guest_read64, no_af, CMD_HOLE_DATA | CMD_HOLE_PT, 939 + uffd_data_read_handler, uffd_pt_write_handler, 2), 940 + /* Note that cas invokes the read handler. */ 941 + TEST_UFFD(guest_cas, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 942 + uffd_data_read_handler, uffd_pt_write_handler, 2), 943 + /* 944 + * Can't test guest_at with_af as it's IMPDEF whether the AF is set. 945 + * The S1PTW fault should still be marked as a write. 
946 + */ 947 + TEST_UFFD(guest_at, no_af, CMD_HOLE_DATA | CMD_HOLE_PT, 948 + uffd_data_read_handler, uffd_pt_write_handler, 1), 949 + TEST_UFFD(guest_ld_preidx, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 950 + uffd_data_read_handler, uffd_pt_write_handler, 2), 951 + TEST_UFFD(guest_write64, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 952 + uffd_data_write_handler, uffd_pt_write_handler, 2), 953 + TEST_UFFD(guest_dc_zva, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 954 + uffd_data_write_handler, uffd_pt_write_handler, 2), 955 + TEST_UFFD(guest_st_preidx, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 956 + uffd_data_write_handler, uffd_pt_write_handler, 2), 957 + TEST_UFFD(guest_exec, with_af, CMD_HOLE_DATA | CMD_HOLE_PT, 958 + uffd_data_read_handler, uffd_pt_write_handler, 2), 959 + 960 + /* 961 + * Try accesses when the data and PT memory regions are both 962 + * tracked for dirty logging. 963 + */ 964 + TEST_DIRTY_LOG(guest_read64, with_af, guest_check_no_write_in_dirty_log), 965 + /* no_af should also lead to a PT write. */ 966 + TEST_DIRTY_LOG(guest_read64, no_af, guest_check_no_write_in_dirty_log), 967 + TEST_DIRTY_LOG(guest_ld_preidx, with_af, guest_check_no_write_in_dirty_log), 968 + TEST_DIRTY_LOG(guest_at, no_af, guest_check_no_write_in_dirty_log), 969 + TEST_DIRTY_LOG(guest_exec, with_af, guest_check_no_write_in_dirty_log), 970 + TEST_DIRTY_LOG(guest_write64, with_af, guest_check_write_in_dirty_log), 971 + TEST_DIRTY_LOG(guest_cas, with_af, guest_check_write_in_dirty_log), 972 + TEST_DIRTY_LOG(guest_dc_zva, with_af, guest_check_write_in_dirty_log), 973 + TEST_DIRTY_LOG(guest_st_preidx, with_af, guest_check_write_in_dirty_log), 974 + 975 + /* 976 + * Access when the data and PT memory regions are both marked for 977 + * dirty logging and UFFD at the same time. The expected result is 978 + * that writes should mark the dirty log and trigger a userfaultfd 979 + * write fault. Reads/execs should result in a read userfaultfd 980 + * fault, and nothing in the dirty log. 
Any S1PTW should result in 981 + * a write in the dirty log and a userfaultfd write. 982 + */ 983 + TEST_UFFD_AND_DIRTY_LOG(guest_read64, with_af, uffd_data_read_handler, 2, 984 + guest_check_no_write_in_dirty_log), 985 + /* no_af should also lead to a PT write. */ 986 + TEST_UFFD_AND_DIRTY_LOG(guest_read64, no_af, uffd_data_read_handler, 2, 987 + guest_check_no_write_in_dirty_log), 988 + TEST_UFFD_AND_DIRTY_LOG(guest_ld_preidx, with_af, uffd_data_read_handler, 989 + 2, guest_check_no_write_in_dirty_log), 990 + TEST_UFFD_AND_DIRTY_LOG(guest_at, with_af, 0, 1, 991 + guest_check_no_write_in_dirty_log), 992 + TEST_UFFD_AND_DIRTY_LOG(guest_exec, with_af, uffd_data_read_handler, 2, 993 + guest_check_no_write_in_dirty_log), 994 + TEST_UFFD_AND_DIRTY_LOG(guest_write64, with_af, uffd_data_write_handler, 995 + 2, guest_check_write_in_dirty_log), 996 + TEST_UFFD_AND_DIRTY_LOG(guest_cas, with_af, uffd_data_read_handler, 2, 997 + guest_check_write_in_dirty_log), 998 + TEST_UFFD_AND_DIRTY_LOG(guest_dc_zva, with_af, uffd_data_write_handler, 999 + 2, guest_check_write_in_dirty_log), 1000 + TEST_UFFD_AND_DIRTY_LOG(guest_st_preidx, with_af, 1001 + uffd_data_write_handler, 2, 1002 + guest_check_write_in_dirty_log), 1003 + 1004 + /* 1005 + * Try accesses when the data memory region is marked read-only 1006 + * (with KVM_MEM_READONLY). Writes with a syndrome result in an 1007 + * MMIO exit, writes with no syndrome (e.g., CAS) result in a 1008 + * failed vcpu run, and reads/execs with and without syndromes do 1009 + * not fault.
1010 + */ 1011 + TEST_RO_MEMSLOT(guest_read64, 0, 0), 1012 + TEST_RO_MEMSLOT(guest_ld_preidx, 0, 0), 1013 + TEST_RO_MEMSLOT(guest_at, 0, 0), 1014 + TEST_RO_MEMSLOT(guest_exec, 0, 0), 1015 + TEST_RO_MEMSLOT(guest_write64, mmio_on_test_gpa_handler, 1), 1016 + TEST_RO_MEMSLOT_NO_SYNDROME(guest_dc_zva), 1017 + TEST_RO_MEMSLOT_NO_SYNDROME(guest_cas), 1018 + TEST_RO_MEMSLOT_NO_SYNDROME(guest_st_preidx), 1019 + 1020 + /* 1021 + * Access when the data region is both read-only and marked 1022 + * for dirty logging at the same time. The expected result is that 1023 + * for writes there should be no write in the dirty log. The 1024 + * readonly handling is the same as if the memslot was not marked 1025 + * for dirty logging: writes with a syndrome result in an MMIO 1026 + * exit, and writes with no syndrome result in a failed vcpu run. 1027 + */ 1028 + TEST_RO_MEMSLOT_AND_DIRTY_LOG(guest_read64, 0, 0, 1029 + guest_check_no_write_in_dirty_log), 1030 + TEST_RO_MEMSLOT_AND_DIRTY_LOG(guest_ld_preidx, 0, 0, 1031 + guest_check_no_write_in_dirty_log), 1032 + TEST_RO_MEMSLOT_AND_DIRTY_LOG(guest_at, 0, 0, 1033 + guest_check_no_write_in_dirty_log), 1034 + TEST_RO_MEMSLOT_AND_DIRTY_LOG(guest_exec, 0, 0, 1035 + guest_check_no_write_in_dirty_log), 1036 + TEST_RO_MEMSLOT_AND_DIRTY_LOG(guest_write64, mmio_on_test_gpa_handler, 1037 + 1, guest_check_no_write_in_dirty_log), 1038 + TEST_RO_MEMSLOT_NO_SYNDROME_AND_DIRTY_LOG(guest_dc_zva, 1039 + guest_check_no_write_in_dirty_log), 1040 + TEST_RO_MEMSLOT_NO_SYNDROME_AND_DIRTY_LOG(guest_cas, 1041 + guest_check_no_write_in_dirty_log), 1042 + TEST_RO_MEMSLOT_NO_SYNDROME_AND_DIRTY_LOG(guest_st_preidx, 1043 + guest_check_no_write_in_dirty_log), 1044 + 1045 + /* 1046 + * Access when the data region is both read-only and punched with 1047 + * holes tracked with userfaultfd. The expected result is the 1048 + * union of both userfaultfd and read-only behaviors.
For example, 1049 + * write accesses result in a userfaultfd write fault and an MMIO 1050 + * exit. Writes with no syndrome result in a failed vcpu run and 1051 + * no userfaultfd write fault. Reads result in userfaultfd getting 1052 + * triggered. 1053 + */ 1054 + TEST_RO_MEMSLOT_AND_UFFD(guest_read64, 0, 0, 1055 + uffd_data_read_handler, 2), 1056 + TEST_RO_MEMSLOT_AND_UFFD(guest_ld_preidx, 0, 0, 1057 + uffd_data_read_handler, 2), 1058 + TEST_RO_MEMSLOT_AND_UFFD(guest_at, 0, 0, 1059 + uffd_no_handler, 1), 1060 + TEST_RO_MEMSLOT_AND_UFFD(guest_exec, 0, 0, 1061 + uffd_data_read_handler, 2), 1062 + TEST_RO_MEMSLOT_AND_UFFD(guest_write64, mmio_on_test_gpa_handler, 1, 1063 + uffd_data_write_handler, 2), 1064 + TEST_RO_MEMSLOT_NO_SYNDROME_AND_UFFD(guest_cas, 1065 + uffd_data_read_handler, 2), 1066 + TEST_RO_MEMSLOT_NO_SYNDROME_AND_UFFD(guest_dc_zva, 1067 + uffd_no_handler, 1), 1068 + TEST_RO_MEMSLOT_NO_SYNDROME_AND_UFFD(guest_st_preidx, 1069 + uffd_no_handler, 1), 1070 + 1071 + { 0 } 1072 + }; 1073 + 1074 + static void for_each_test_and_guest_mode(enum vm_mem_backing_src_type src_type) 1075 + { 1076 + struct test_desc *t; 1077 + 1078 + for (t = &tests[0]; t->name; t++) { 1079 + if (t->skip) 1080 + continue; 1081 + 1082 + struct test_params p = { 1083 + .src_type = src_type, 1084 + .test_desc = t, 1085 + }; 1086 + 1087 + for_each_guest_mode(run_test, &p); 1088 + } 1089 + } 1090 + 1091 + int main(int argc, char *argv[]) 1092 + { 1093 + enum vm_mem_backing_src_type src_type; 1094 + int opt; 1095 + 1096 + setbuf(stdout, NULL); 1097 + 1098 + src_type = DEFAULT_VM_MEM_SRC; 1099 + 1100 + while ((opt = getopt(argc, argv, "hm:s:")) != -1) { 1101 + switch (opt) { 1102 + case 'm': 1103 + guest_modes_cmdline(optarg); 1104 + break; 1105 + case 's': 1106 + src_type = parse_backing_src_type(optarg); 1107 + break; 1108 + case 'h': 1109 + default: 1110 + help(argv[0]); 1111 + exit(0); 1112 + } 1113 + } 1114 + 1115 + for_each_test_and_guest_mode(src_type); 1116 + return 0; 1117 + }
+1 -7
tools/testing/selftests/kvm/access_tracking_perf_test.c
··· 58 58 ITERATION_MARK_IDLE, 59 59 } iteration_work; 60 60 61 - /* Set to true when vCPU threads should exit. */ 62 - static bool done; 63 - 64 61 /* The iteration that was last completed by each vCPU. */ 65 62 static int vcpu_last_completed_iteration[KVM_MAX_VCPUS]; 66 63 ··· 208 211 int last_iteration = *current_iteration; 209 212 210 213 do { 211 - if (READ_ONCE(done)) 214 + if (READ_ONCE(memstress_args.stop_vcpus)) 212 215 return false; 213 216 214 217 *current_iteration = READ_ONCE(iteration); ··· 317 320 access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory"); 318 321 mark_memory_idle(vm, nr_vcpus); 319 322 access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory"); 320 - 321 - /* Set done to signal the vCPU threads to exit */ 322 - done = true; 323 323 324 324 memstress_join_vcpu_threads(nr_vcpus); 325 325 memstress_destroy_vm(vm);
+31 -199
tools/testing/selftests/kvm/demand_paging_test.c
··· 22 22 #include "test_util.h" 23 23 #include "memstress.h" 24 24 #include "guest_modes.h" 25 + #include "userfaultfd_util.h" 25 26 26 27 #ifdef __NR_userfaultfd 27 28 28 - #ifdef PRINT_PER_PAGE_UPDATES 29 - #define PER_PAGE_DEBUG(...) printf(__VA_ARGS__) 30 - #else 31 - #define PER_PAGE_DEBUG(...) _no_printf(__VA_ARGS__) 32 - #endif 33 - 34 - #ifdef PRINT_PER_VCPU_UPDATES 35 - #define PER_VCPU_DEBUG(...) printf(__VA_ARGS__) 36 - #else 37 - #define PER_VCPU_DEBUG(...) _no_printf(__VA_ARGS__) 38 - #endif 39 - 40 29 static int nr_vcpus = 1; 41 30 static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE; 31 + 42 32 static size_t demand_paging_size; 43 33 static char *guest_data_prototype; 44 34 ··· 57 67 ts_diff.tv_sec, ts_diff.tv_nsec); 58 68 } 59 69 60 - static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t addr) 70 + static int handle_uffd_page_request(int uffd_mode, int uffd, 71 + struct uffd_msg *msg) 61 72 { 62 73 pid_t tid = syscall(__NR_gettid); 74 + uint64_t addr = msg->arg.pagefault.address; 63 75 struct timespec start; 64 76 struct timespec ts_diff; 65 77 int r; ··· 108 116 return 0; 109 117 } 110 118 111 - bool quit_uffd_thread; 112 - 113 - struct uffd_handler_args { 114 - int uffd_mode; 115 - int uffd; 116 - int pipefd; 117 - useconds_t delay; 118 - }; 119 - 120 - static void *uffd_handler_thread_fn(void *arg) 121 - { 122 - struct uffd_handler_args *uffd_args = (struct uffd_handler_args *)arg; 123 - int uffd = uffd_args->uffd; 124 - int pipefd = uffd_args->pipefd; 125 - useconds_t delay = uffd_args->delay; 126 - int64_t pages = 0; 127 - struct timespec start; 128 - struct timespec ts_diff; 129 - 130 - clock_gettime(CLOCK_MONOTONIC, &start); 131 - while (!quit_uffd_thread) { 132 - struct uffd_msg msg; 133 - struct pollfd pollfd[2]; 134 - char tmp_chr; 135 - int r; 136 - uint64_t addr; 137 - 138 - pollfd[0].fd = uffd; 139 - pollfd[0].events = POLLIN; 140 - pollfd[1].fd = pipefd; 141 - pollfd[1].events = POLLIN; 142 - 143 - r = 
poll(pollfd, 2, -1); 144 - switch (r) { 145 - case -1: 146 - pr_info("poll err"); 147 - continue; 148 - case 0: 149 - continue; 150 - case 1: 151 - break; 152 - default: 153 - pr_info("Polling uffd returned %d", r); 154 - return NULL; 155 - } 156 - 157 - if (pollfd[0].revents & POLLERR) { 158 - pr_info("uffd revents has POLLERR"); 159 - return NULL; 160 - } 161 - 162 - if (pollfd[1].revents & POLLIN) { 163 - r = read(pollfd[1].fd, &tmp_chr, 1); 164 - TEST_ASSERT(r == 1, 165 - "Error reading pipefd in UFFD thread\n"); 166 - return NULL; 167 - } 168 - 169 - if (!(pollfd[0].revents & POLLIN)) 170 - continue; 171 - 172 - r = read(uffd, &msg, sizeof(msg)); 173 - if (r == -1) { 174 - if (errno == EAGAIN) 175 - continue; 176 - pr_info("Read of uffd got errno %d\n", errno); 177 - return NULL; 178 - } 179 - 180 - if (r != sizeof(msg)) { 181 - pr_info("Read on uffd returned unexpected size: %d bytes", r); 182 - return NULL; 183 - } 184 - 185 - if (!(msg.event & UFFD_EVENT_PAGEFAULT)) 186 - continue; 187 - 188 - if (delay) 189 - usleep(delay); 190 - addr = msg.arg.pagefault.address; 191 - r = handle_uffd_page_request(uffd_args->uffd_mode, uffd, addr); 192 - if (r < 0) 193 - return NULL; 194 - pages++; 195 - } 196 - 197 - ts_diff = timespec_elapsed(start); 198 - PER_VCPU_DEBUG("userfaulted %ld pages over %ld.%.9lds. 
(%f/sec)\n", 199 - pages, ts_diff.tv_sec, ts_diff.tv_nsec, 200 - pages / ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0)); 201 - 202 - return NULL; 203 - } 204 - 205 - static void setup_demand_paging(struct kvm_vm *vm, 206 - pthread_t *uffd_handler_thread, int pipefd, 207 - int uffd_mode, useconds_t uffd_delay, 208 - struct uffd_handler_args *uffd_args, 209 - void *hva, void *alias, uint64_t len) 210 - { 211 - bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR); 212 - int uffd; 213 - struct uffdio_api uffdio_api; 214 - struct uffdio_register uffdio_register; 215 - uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY; 216 - int ret; 217 - 218 - PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n", 219 - is_minor ? "MINOR" : "MISSING", 220 - is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY"); 221 - 222 - /* In order to get minor faults, prefault via the alias. */ 223 - if (is_minor) { 224 - size_t p; 225 - 226 - expected_ioctls = ((uint64_t) 1) << _UFFDIO_CONTINUE; 227 - 228 - TEST_ASSERT(alias != NULL, "Alias required for minor faults"); 229 - for (p = 0; p < (len / demand_paging_size); ++p) { 230 - memcpy(alias + (p * demand_paging_size), 231 - guest_data_prototype, demand_paging_size); 232 - } 233 - } 234 - 235 - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); 236 - TEST_ASSERT(uffd >= 0, __KVM_SYSCALL_ERROR("userfaultfd()", uffd)); 237 - 238 - uffdio_api.api = UFFD_API; 239 - uffdio_api.features = 0; 240 - ret = ioctl(uffd, UFFDIO_API, &uffdio_api); 241 - TEST_ASSERT(ret != -1, __KVM_SYSCALL_ERROR("UFFDIO_API", ret)); 242 - 243 - uffdio_register.range.start = (uint64_t)hva; 244 - uffdio_register.range.len = len; 245 - uffdio_register.mode = uffd_mode; 246 - ret = ioctl(uffd, UFFDIO_REGISTER, &uffdio_register); 247 - TEST_ASSERT(ret != -1, __KVM_SYSCALL_ERROR("UFFDIO_REGISTER", ret)); 248 - TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) == 249 - expected_ioctls, "missing userfaultfd ioctls"); 250 - 251 - 
uffd_args->uffd_mode = uffd_mode; 252 - uffd_args->uffd = uffd; 253 - uffd_args->pipefd = pipefd; 254 - uffd_args->delay = uffd_delay; 255 - pthread_create(uffd_handler_thread, NULL, uffd_handler_thread_fn, 256 - uffd_args); 257 - 258 - PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n", 259 - hva, hva + len); 260 - } 261 - 262 119 struct test_params { 263 120 int uffd_mode; 264 121 useconds_t uffd_delay; ··· 115 274 bool partition_vcpu_memory_access; 116 275 }; 117 276 277 + static void prefault_mem(void *alias, uint64_t len) 278 + { 279 + size_t p; 280 + 281 + TEST_ASSERT(alias != NULL, "Alias required for minor faults"); 282 + for (p = 0; p < (len / demand_paging_size); ++p) { 283 + memcpy(alias + (p * demand_paging_size), 284 + guest_data_prototype, demand_paging_size); 285 + } 286 + } 287 + 118 288 static void run_test(enum vm_guest_mode mode, void *arg) 119 289 { 120 290 struct test_params *p = arg; 121 - pthread_t *uffd_handler_threads = NULL; 122 - struct uffd_handler_args *uffd_args = NULL; 291 + struct uffd_desc **uffd_descs = NULL; 123 292 struct timespec start; 124 293 struct timespec ts_diff; 125 - int *pipefds = NULL; 126 294 struct kvm_vm *vm; 127 - int r, i; 295 + int i; 128 296 129 297 vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 130 298 p->src_type, p->partition_vcpu_memory_access); ··· 146 296 memset(guest_data_prototype, 0xAB, demand_paging_size); 147 297 148 298 if (p->uffd_mode) { 149 - uffd_handler_threads = 150 - malloc(nr_vcpus * sizeof(*uffd_handler_threads)); 151 - TEST_ASSERT(uffd_handler_threads, "Memory allocation failed"); 152 - 153 - uffd_args = malloc(nr_vcpus * sizeof(*uffd_args)); 154 - TEST_ASSERT(uffd_args, "Memory allocation failed"); 155 - 156 - pipefds = malloc(sizeof(int) * nr_vcpus * 2); 157 - TEST_ASSERT(pipefds, "Unable to allocate memory for pipefd"); 299 + uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *)); 300 + TEST_ASSERT(uffd_descs, "Memory allocation failed"); 158 301 159 
302 for (i = 0; i < nr_vcpus; i++) { 160 303 struct memstress_vcpu_args *vcpu_args; ··· 160 317 vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa); 161 318 vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa); 162 319 320 + prefault_mem(vcpu_alias, 321 + vcpu_args->pages * memstress_args.guest_page_size); 322 + 163 323 /* 164 324 * Set up user fault fd to handle demand paging 165 325 * requests. 166 326 */ 167 - r = pipe2(&pipefds[i * 2], 168 - O_CLOEXEC | O_NONBLOCK); 169 - TEST_ASSERT(!r, "Failed to set up pipefd"); 170 - 171 - setup_demand_paging(vm, &uffd_handler_threads[i], 172 - pipefds[i * 2], p->uffd_mode, 173 - p->uffd_delay, &uffd_args[i], 174 - vcpu_hva, vcpu_alias, 175 - vcpu_args->pages * memstress_args.guest_page_size); 327 + uffd_descs[i] = uffd_setup_demand_paging( 328 + p->uffd_mode, p->uffd_delay, vcpu_hva, 329 + vcpu_args->pages * memstress_args.guest_page_size, 330 + &handle_uffd_page_request); 176 331 } 177 332 } 178 333 ··· 185 344 pr_info("All vCPU threads joined\n"); 186 345 187 346 if (p->uffd_mode) { 188 - char c; 189 - 190 347 /* Tell the user fault fd handler threads to quit */ 191 - for (i = 0; i < nr_vcpus; i++) { 192 - r = write(pipefds[i * 2 + 1], &c, 1); 193 - TEST_ASSERT(r == 1, "Unable to write to pipefd"); 194 - 195 - pthread_join(uffd_handler_threads[i], NULL); 196 - } 348 + for (i = 0; i < nr_vcpus; i++) 349 + uffd_stop_demand_paging(uffd_descs[i]); 197 350 } 198 351 199 352 pr_info("Total guest execution time: %ld.%.9lds\n", ··· 199 364 memstress_destroy_vm(vm); 200 365 201 366 free(guest_data_prototype); 202 - if (p->uffd_mode) { 203 - free(uffd_handler_threads); 204 - free(uffd_args); 205 - free(pipefds); 206 - } 367 + if (p->uffd_mode) 368 + free(uffd_descs); 207 369 } 208 370 209 371 static void help(char *name)
+39 -14
tools/testing/selftests/kvm/dirty_log_test.c
··· 24 24 #include "guest_modes.h" 25 25 #include "processor.h" 26 26 27 + #define DIRTY_MEM_BITS 30 /* 1G */ 28 + #define PAGE_SHIFT_4K 12 29 + 27 30 /* The memory slot index to track dirty pages */ 28 31 #define TEST_MEM_SLOT_INDEX 1 29 32 ··· 229 226 } 230 227 231 228 static void dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, 232 - void *bitmap, uint32_t num_pages) 229 + void *bitmap, uint32_t num_pages, 230 + uint32_t *unused) 233 231 { 234 232 kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap); 235 233 } 236 234 237 235 static void clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, 238 - void *bitmap, uint32_t num_pages) 236 + void *bitmap, uint32_t num_pages, 237 + uint32_t *unused) 239 238 { 240 239 kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap); 241 240 kvm_vm_clear_dirty_log(vcpu->vm, slot, bitmap, 0, num_pages); ··· 276 271 277 272 static void dirty_ring_create_vm_done(struct kvm_vm *vm) 278 273 { 274 + uint64_t pages; 275 + uint32_t limit; 276 + 277 + /* 278 + * We rely on vcpu exit due to full dirty ring state. Adjust 279 + * the ring buffer size to ensure we're able to reach the 280 + * full dirty ring state. 281 + */ 282 + pages = (1ul << (DIRTY_MEM_BITS - vm->page_shift)) + 3; 283 + pages = vm_adjust_num_guest_pages(vm->mode, pages); 284 + if (vm->page_size < getpagesize()) 285 + pages = vm_num_host_pages(vm->mode, pages); 286 + 287 + limit = 1 << (31 - __builtin_clz(pages)); 288 + test_dirty_ring_count = 1 << (31 - __builtin_clz(test_dirty_ring_count)); 289 + test_dirty_ring_count = min(limit, test_dirty_ring_count); 290 + pr_info("dirty ring count: 0x%x\n", test_dirty_ring_count); 291 + 279 292 /* 280 293 * Switch to dirty ring mode after VM creation but before any 281 294 * of the vcpu creation. 
··· 352 329 } 353 330 354 331 static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, 355 - void *bitmap, uint32_t num_pages) 332 + void *bitmap, uint32_t num_pages, 333 + uint32_t *ring_buf_idx) 356 334 { 357 - /* We only have one vcpu */ 358 - static uint32_t fetch_index = 0; 359 335 uint32_t count = 0, cleared; 360 336 bool continued_vcpu = false; 361 337 ··· 371 349 372 350 /* Only have one vcpu */ 373 351 count = dirty_ring_collect_one(vcpu_map_dirty_ring(vcpu), 374 - slot, bitmap, num_pages, &fetch_index); 352 + slot, bitmap, num_pages, 353 + ring_buf_idx); 375 354 376 355 cleared = kvm_vm_reset_dirty_ring(vcpu->vm); 377 356 ··· 429 406 void (*create_vm_done)(struct kvm_vm *vm); 430 407 /* Hook to collect the dirty pages into the bitmap provided */ 431 408 void (*collect_dirty_pages) (struct kvm_vcpu *vcpu, int slot, 432 - void *bitmap, uint32_t num_pages); 409 + void *bitmap, uint32_t num_pages, 410 + uint32_t *ring_buf_idx); 433 411 /* Hook to call when after each vcpu run */ 434 412 void (*after_vcpu_run)(struct kvm_vcpu *vcpu, int ret, int err); 435 413 void (*before_vcpu_join) (void); ··· 495 471 } 496 472 497 473 static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot, 498 - void *bitmap, uint32_t num_pages) 474 + void *bitmap, uint32_t num_pages, 475 + uint32_t *ring_buf_idx) 499 476 { 500 477 struct log_mode *mode = &log_modes[host_log_mode]; 501 478 502 479 TEST_ASSERT(mode->collect_dirty_pages != NULL, 503 480 "collect_dirty_pages() is required for any log mode!"); 504 - mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages); 481 + mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx); 505 482 } 506 483 507 484 static void log_mode_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err) ··· 706 681 return vm; 707 682 } 708 683 709 - #define DIRTY_MEM_BITS 30 /* 1G */ 710 - #define PAGE_SHIFT_4K 12 711 - 712 684 struct test_params { 713 685 unsigned long iterations; 714 686 unsigned long 
interval; ··· 718 696 struct kvm_vcpu *vcpu; 719 697 struct kvm_vm *vm; 720 698 unsigned long *bmap; 699 + uint32_t ring_buf_idx = 0; 721 700 722 701 if (!log_mode_supported()) { 723 702 print_skip("Log mode '%s' not supported", ··· 792 769 host_dirty_count = 0; 793 770 host_clear_count = 0; 794 771 host_track_next_count = 0; 772 + WRITE_ONCE(dirty_ring_vcpu_ring_full, false); 795 773 796 774 pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu); 797 775 ··· 800 776 /* Give the vcpu thread some time to dirty some pages */ 801 777 usleep(p->interval * 1000); 802 778 log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX, 803 - bmap, host_num_pages); 779 + bmap, host_num_pages, 780 + &ring_buf_idx); 804 781 805 782 /* 806 783 * See vcpu_sync_stop_requested definition for details on why ··· 845 820 printf("usage: %s [-h] [-i iterations] [-I interval] " 846 821 "[-p offset] [-m mode]\n", name); 847 822 puts(""); 848 - printf(" -c: specify dirty ring size, in number of entries\n"); 823 + printf(" -c: hint to dirty ring size, in number of entries\n"); 849 824 printf(" (only useful for dirty-ring test; default: %"PRIu32")\n", 850 825 TEST_DIRTY_RING_COUNT); 851 826 printf(" -i: specify iteration counts (default: %"PRIu64")\n",
+29 -6
tools/testing/selftests/kvm/include/aarch64/processor.h
··· 38 38 * NORMAL 4 1111:1111 39 39 * NORMAL_WT 5 1011:1011 40 40 */ 41 - #define DEFAULT_MAIR_EL1 ((0x00ul << (0 * 8)) | \ 42 - (0x04ul << (1 * 8)) | \ 43 - (0x0cul << (2 * 8)) | \ 44 - (0x44ul << (3 * 8)) | \ 45 - (0xfful << (4 * 8)) | \ 46 - (0xbbul << (5 * 8))) 41 + 42 + /* Linux doesn't use these memory types, so let's define them. */ 43 + #define MAIR_ATTR_DEVICE_GRE UL(0x0c) 44 + #define MAIR_ATTR_NORMAL_WT UL(0xbb) 45 + 46 + #define MT_DEVICE_nGnRnE 0 47 + #define MT_DEVICE_nGnRE 1 48 + #define MT_DEVICE_GRE 2 49 + #define MT_NORMAL_NC 3 50 + #define MT_NORMAL 4 51 + #define MT_NORMAL_WT 5 52 + 53 + #define DEFAULT_MAIR_EL1 \ 54 + (MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRnE, MT_DEVICE_nGnRnE) | \ 55 + MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRE, MT_DEVICE_nGnRE) | \ 56 + MAIR_ATTRIDX(MAIR_ATTR_DEVICE_GRE, MT_DEVICE_GRE) | \ 57 + MAIR_ATTRIDX(MAIR_ATTR_NORMAL_NC, MT_NORMAL_NC) | \ 58 + MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL) | \ 59 + MAIR_ATTRIDX(MAIR_ATTR_NORMAL_WT, MT_NORMAL_WT)) 47 60 48 61 #define MPIDR_HWID_BITMASK (0xff00fffffful) 49 62 ··· 105 92 #define ESR_EC_MASK (ESR_EC_NUM - 1) 106 93 107 94 #define ESR_EC_SVC64 0x15 95 + #define ESR_EC_IABT 0x21 96 + #define ESR_EC_DABT 0x25 108 97 #define ESR_EC_HW_BP_CURRENT 0x31 109 98 #define ESR_EC_SSTEP_CURRENT 0x33 110 99 #define ESR_EC_WP_CURRENT 0x35 111 100 #define ESR_EC_BRK_INS 0x3c 101 + 102 + /* Access flag */ 103 + #define PTE_AF (1ULL << 10) 104 + 105 + /* Access flag update enable/disable */ 106 + #define TCR_EL1_HA (1ULL << 39) 112 107 113 108 void aarch64_get_supported_page_sizes(uint32_t ipa, 114 109 bool *ps4k, bool *ps16k, bool *ps64k); ··· 129 108 int vector, handler_fn handler); 130 109 void vm_install_sync_handler(struct kvm_vm *vm, 131 110 int vector, int ec, handler_fn handler); 111 + 112 + uint64_t *virt_get_pte_hva(struct kvm_vm *vm, vm_vaddr_t gva); 132 113 133 114 static inline void cpu_relax(void) 134 115 {
+29 -2
tools/testing/selftests/kvm/include/kvm_util_base.h
··· 35 35 struct sparsebit *unused_phy_pages; 36 36 int fd; 37 37 off_t offset; 38 + enum vm_mem_backing_src_type backing_src_type; 38 39 void *host_mem; 39 40 void *host_alias; 40 41 void *mmap_start; ··· 64 63 struct rb_root gpa_tree; 65 64 struct rb_root hva_tree; 66 65 DECLARE_HASHTABLE(slot_hash, 9); 66 + }; 67 + 68 + enum kvm_mem_region_type { 69 + MEM_REGION_CODE, 70 + MEM_REGION_DATA, 71 + MEM_REGION_PT, 72 + MEM_REGION_TEST_DATA, 73 + NR_MEM_REGIONS, 67 74 }; 68 75 69 76 struct kvm_vm { ··· 103 94 int stats_fd; 104 95 struct kvm_stats_header stats_header; 105 96 struct kvm_stats_desc *stats_desc; 97 + 98 + /* 99 + * KVM region slots. These are the default memslots used by page 100 + * allocators, e.g., lib/elf uses the memslots[MEM_REGION_CODE] 101 + * memslot. 102 + */ 103 + uint32_t memslots[NR_MEM_REGIONS]; 106 104 }; 107 105 108 106 ··· 121 105 122 106 struct userspace_mem_region * 123 107 memslot2region(struct kvm_vm *vm, uint32_t memslot); 108 + 109 + static inline struct userspace_mem_region *vm_get_mem_region(struct kvm_vm *vm, 110 + enum kvm_mem_region_type type) 111 + { 112 + assert(type < NR_MEM_REGIONS); 113 + return memslot2region(vm, vm->memslots[type]); 114 + } 124 115 125 116 /* Minimum allocated guest virtual and physical addresses */ 126 117 #define KVM_UTIL_MIN_VADDR 0x2000 ··· 410 387 struct kvm_vcpu *__vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id); 411 388 vm_vaddr_t vm_vaddr_unused_gap(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min); 412 389 vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min); 390 + vm_vaddr_t __vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min, 391 + enum kvm_mem_region_type type); 413 392 vm_vaddr_t vm_vaddr_alloc_pages(struct kvm_vm *vm, int nr_pages); 393 + vm_vaddr_t __vm_vaddr_alloc_page(struct kvm_vm *vm, 394 + enum kvm_mem_region_type type); 414 395 vm_vaddr_t vm_vaddr_alloc_page(struct kvm_vm *vm); 415 396 416 397 void virt_map(struct kvm_vm *vm, uint64_t 
vaddr, uint64_t paddr, ··· 676 649 * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to 677 650 * calculate the amount of memory needed for per-vCPU data, e.g. stacks. 678 651 */ 679 - struct kvm_vm *____vm_create(enum vm_guest_mode mode, uint64_t nr_pages); 652 + struct kvm_vm *____vm_create(enum vm_guest_mode mode); 680 653 struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus, 681 654 uint64_t nr_extra_pages); 682 655 683 656 static inline struct kvm_vm *vm_create_barebones(void) 684 657 { 685 - return ____vm_create(VM_MODE_DEFAULT, 0); 658 + return ____vm_create(VM_MODE_DEFAULT); 686 659 } 687 660 688 661 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
+3
tools/testing/selftests/kvm/include/memstress.h
··· 47 47 /* The vCPU=>pCPU pinning map. Only valid if pin_vcpus is true. */ 48 48 uint32_t vcpu_to_pcpu[KVM_MAX_VCPUS]; 49 49 50 + /* Test is done, stop running vCPUs. */ 51 + bool stop_vcpus; 52 + 50 53 struct memstress_vcpu_args vcpu_args[KVM_MAX_VCPUS]; 51 54 }; 52 55
+45
tools/testing/selftests/kvm/include/userfaultfd_util.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * KVM userfaultfd util 4 + * 5 + * Copyright (C) 2018, Red Hat, Inc. 6 + * Copyright (C) 2019-2022 Google LLC 7 + */ 8 + 9 + #define _GNU_SOURCE /* for pipe2 */ 10 + 11 + #include <inttypes.h> 12 + #include <time.h> 13 + #include <pthread.h> 14 + #include <linux/userfaultfd.h> 15 + 16 + #include "test_util.h" 17 + 18 + typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg); 19 + 20 + struct uffd_desc { 21 + int uffd_mode; 22 + int uffd; 23 + int pipefds[2]; 24 + useconds_t delay; 25 + uffd_handler_t handler; 26 + pthread_t thread; 27 + }; 28 + 29 + struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay, 30 + void *hva, uint64_t len, 31 + uffd_handler_t handler); 32 + 33 + void uffd_stop_demand_paging(struct uffd_desc *uffd); 34 + 35 + #ifdef PRINT_PER_PAGE_UPDATES 36 + #define PER_PAGE_DEBUG(...) printf(__VA_ARGS__) 37 + #else 38 + #define PER_PAGE_DEBUG(...) _no_printf(__VA_ARGS__) 39 + #endif 40 + 41 + #ifdef PRINT_PER_VCPU_UPDATES 42 + #define PER_VCPU_DEBUG(...) printf(__VA_ARGS__) 43 + #else 44 + #define PER_VCPU_DEBUG(...) _no_printf(__VA_ARGS__) 45 + #endif
+34 -21
tools/testing/selftests/kvm/lib/aarch64/processor.c
··· 11 11 #include "guest_modes.h" 12 12 #include "kvm_util.h" 13 13 #include "processor.h" 14 + #include <linux/bitfield.h> 14 15 15 16 #define DEFAULT_ARM64_GUEST_STACK_VADDR_MIN 0xac0000 16 17 ··· 77 76 78 77 void virt_arch_pgd_alloc(struct kvm_vm *vm) 79 78 { 80 - if (!vm->pgd_created) { 81 - vm_paddr_t paddr = vm_phy_pages_alloc(vm, 82 - page_align(vm, ptrs_per_pgd(vm) * 8) / vm->page_size, 83 - KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0); 84 - vm->pgd = paddr; 85 - vm->pgd_created = true; 86 - } 79 + size_t nr_pages = page_align(vm, ptrs_per_pgd(vm) * 8) / vm->page_size; 80 + 81 + if (vm->pgd_created) 82 + return; 83 + 84 + vm->pgd = vm_phy_pages_alloc(vm, nr_pages, 85 + KVM_GUEST_PAGE_TABLE_MIN_PADDR, 86 + vm->memslots[MEM_REGION_PT]); 87 + vm->pgd_created = true; 87 88 } 88 89 89 90 static void _virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, ··· 136 133 137 134 void virt_arch_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr) 138 135 { 139 - uint64_t attr_idx = 4; /* NORMAL (See DEFAULT_MAIR_EL1) */ 136 + uint64_t attr_idx = MT_NORMAL; 140 137 141 138 _virt_pg_map(vm, vaddr, paddr, attr_idx); 142 139 } 143 140 144 - vm_paddr_t addr_arch_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva) 141 + uint64_t *virt_get_pte_hva(struct kvm_vm *vm, vm_vaddr_t gva) 145 142 { 146 143 uint64_t *ptep; 147 144 ··· 172 169 TEST_FAIL("Page table levels must be 2, 3, or 4"); 173 170 } 174 171 175 - return pte_addr(vm, *ptep) + (gva & (vm->page_size - 1)); 172 + return ptep; 176 173 177 174 unmapped_gva: 178 175 TEST_FAIL("No mapping for vm virtual address, gva: 0x%lx", gva); 179 - exit(1); 176 + exit(EXIT_FAILURE); 177 + } 178 + 179 + vm_paddr_t addr_arch_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva) 180 + { 181 + uint64_t *ptep = virt_get_pte_hva(vm, gva); 182 + 183 + return pte_addr(vm, *ptep) + (gva & (vm->page_size - 1)); 180 184 } 181 185 182 186 static void pte_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent, uint64_t page, int level) ··· 328 318 struct kvm_vcpu 
*aarch64_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id, 329 319 struct kvm_vcpu_init *init, void *guest_code) 330 320 { 331 - size_t stack_size = vm->page_size == 4096 ? 332 - DEFAULT_STACK_PGS * vm->page_size : 333 - vm->page_size; 334 - uint64_t stack_vaddr = vm_vaddr_alloc(vm, stack_size, 335 - DEFAULT_ARM64_GUEST_STACK_VADDR_MIN); 321 + size_t stack_size; 322 + uint64_t stack_vaddr; 336 323 struct kvm_vcpu *vcpu = __vm_vcpu_add(vm, vcpu_id); 324 + 325 + stack_size = vm->page_size == 4096 ? DEFAULT_STACK_PGS * vm->page_size : 326 + vm->page_size; 327 + stack_vaddr = __vm_vaddr_alloc(vm, stack_size, 328 + DEFAULT_ARM64_GUEST_STACK_VADDR_MIN, 329 + MEM_REGION_DATA); 337 330 338 331 aarch64_vcpu_setup(vcpu, init); 339 332 ··· 441 428 442 429 void vm_init_descriptor_tables(struct kvm_vm *vm) 443 430 { 444 - vm->handlers = vm_vaddr_alloc(vm, sizeof(struct handlers), 445 - vm->page_size); 431 + vm->handlers = __vm_vaddr_alloc(vm, sizeof(struct handlers), 432 + vm->page_size, MEM_REGION_DATA); 446 433 447 434 *(vm_vaddr_t *)addr_gva2hva(vm, (vm_vaddr_t)(&exception_handlers)) = vm->handlers; 448 435 } ··· 499 486 err = ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg); 500 487 TEST_ASSERT(err == 0, KVM_IOCTL_ERROR(KVM_GET_ONE_REG, vcpu_fd)); 501 488 502 - *ps4k = ((val >> 28) & 0xf) != 0xf; 503 - *ps64k = ((val >> 24) & 0xf) == 0; 504 - *ps16k = ((val >> 20) & 0xf) != 0; 489 + *ps4k = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_TGRAN4), val) != 0xf; 490 + *ps64k = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_TGRAN64), val) == 0; 491 + *ps16k = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_TGRAN16), val) != 0; 505 492 506 493 close(vcpu_fd); 507 494 close(vm_fd);
+2 -1
tools/testing/selftests/kvm/lib/elf.c
··· 161 161 seg_vend |= vm->page_size - 1; 162 162 size_t seg_size = seg_vend - seg_vstart + 1; 163 163 164 - vm_vaddr_t vaddr = vm_vaddr_alloc(vm, seg_size, seg_vstart); 164 + vm_vaddr_t vaddr = __vm_vaddr_alloc(vm, seg_size, seg_vstart, 165 + MEM_REGION_CODE); 165 166 TEST_ASSERT(vaddr == seg_vstart, "Unable to allocate " 166 167 "virtual memory for segment at requested min addr,\n" 167 168 " segment idx: %u\n"
+53 -31
tools/testing/selftests/kvm/lib/kvm_util.c
··· 186 186 _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct vm_guest_mode_params) == NUM_VM_MODES, 187 187 "Missing new mode params?"); 188 188 189 - struct kvm_vm *____vm_create(enum vm_guest_mode mode, uint64_t nr_pages) 189 + struct kvm_vm *____vm_create(enum vm_guest_mode mode) 190 190 { 191 191 struct kvm_vm *vm; 192 - 193 - pr_debug("%s: mode='%s' pages='%ld'\n", __func__, 194 - vm_guest_mode_string(mode), nr_pages); 195 192 196 193 vm = calloc(1, sizeof(*vm)); 197 194 TEST_ASSERT(vm != NULL, "Insufficient Memory"); ··· 285 288 286 289 /* Allocate and setup memory for guest. */ 287 290 vm->vpages_mapped = sparsebit_alloc(); 288 - if (nr_pages != 0) 289 - vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 290 - 0, 0, nr_pages, 0); 291 291 292 292 return vm; 293 293 } ··· 331 337 nr_extra_pages); 332 338 struct userspace_mem_region *slot0; 333 339 struct kvm_vm *vm; 340 + int i; 334 341 335 - vm = ____vm_create(mode, nr_pages); 342 + pr_debug("%s: mode='%s' pages='%ld'\n", __func__, 343 + vm_guest_mode_string(mode), nr_pages); 344 + 345 + vm = ____vm_create(mode); 346 + 347 + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0); 348 + for (i = 0; i < NR_MEM_REGIONS; i++) 349 + vm->memslots[i] = 0; 336 350 337 351 kvm_vm_elf_load(vm, program_invocation_name); 338 352 ··· 651 649 sparsebit_free(&region->unused_phy_pages); 652 650 ret = munmap(region->mmap_start, region->mmap_size); 653 651 TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret)); 652 + if (region->fd >= 0) { 653 + /* There's an extra map when using shared memory. 
*/ 654 + ret = munmap(region->mmap_alias, region->mmap_size); 655 + TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret)); 656 + close(region->fd); 657 + } 654 658 655 659 free(region); 656 660 } ··· 994 986 vm_mem_backing_src_alias(src_type)->name); 995 987 } 996 988 989 + region->backing_src_type = src_type; 997 990 region->unused_phy_pages = sparsebit_alloc(); 998 991 sparsebit_set_num(region->unused_phy_pages, 999 992 guest_paddr >> vm->page_shift, npages); ··· 1289 1280 return pgidx_start * vm->page_size; 1290 1281 } 1291 1282 1292 - /* 1293 - * VM Virtual Address Allocate 1294 - * 1295 - * Input Args: 1296 - * vm - Virtual Machine 1297 - * sz - Size in bytes 1298 - * vaddr_min - Minimum starting virtual address 1299 - * 1300 - * Output Args: None 1301 - * 1302 - * Return: 1303 - * Starting guest virtual address 1304 - * 1305 - * Allocates at least sz bytes within the virtual address space of the vm 1306 - * given by vm. The allocated bytes are mapped to a virtual address >= 1307 - * the address given by vaddr_min. Note that each allocation uses a 1308 - * a unique set of pages, with the minimum real allocation being at least 1309 - * a page. 
1310 - */ 1311 - vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min) 1283 + vm_vaddr_t __vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min, 1284 + enum kvm_mem_region_type type) 1312 1285 { 1313 1286 uint64_t pages = (sz >> vm->page_shift) + ((sz % vm->page_size) != 0); 1314 1287 1315 1288 virt_pgd_alloc(vm); 1316 1289 vm_paddr_t paddr = vm_phy_pages_alloc(vm, pages, 1317 - KVM_UTIL_MIN_PFN * vm->page_size, 0); 1290 + KVM_UTIL_MIN_PFN * vm->page_size, 1291 + vm->memslots[type]); 1318 1292 1319 1293 /* 1320 1294 * Find an unused range of virtual page addresses of at least ··· 1318 1326 } 1319 1327 1320 1328 /* 1329 + * VM Virtual Address Allocate 1330 + * 1331 + * Input Args: 1332 + * vm - Virtual Machine 1333 + * sz - Size in bytes 1334 + * vaddr_min - Minimum starting virtual address 1335 + * 1336 + * Output Args: None 1337 + * 1338 + * Return: 1339 + * Starting guest virtual address 1340 + * 1341 + * Allocates at least sz bytes within the virtual address space of the vm 1342 + * given by vm. The allocated bytes are mapped to a virtual address >= 1343 + * the address given by vaddr_min. Note that each allocation uses a 1344 + * a unique set of pages, with the minimum real allocation being at least 1345 + * a page. The allocated physical space comes from the TEST_DATA memory region. 
1346 + */ 1347 + vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min) 1348 + { 1349 + return __vm_vaddr_alloc(vm, sz, vaddr_min, MEM_REGION_TEST_DATA); 1350 + } 1351 + 1352 + /* 1321 1353 * VM Virtual Address Allocate Pages 1322 1354 * 1323 1355 * Input Args: ··· 1358 1342 vm_vaddr_t vm_vaddr_alloc_pages(struct kvm_vm *vm, int nr_pages) 1359 1343 { 1360 1344 return vm_vaddr_alloc(vm, nr_pages * getpagesize(), KVM_UTIL_MIN_VADDR); 1345 + } 1346 + 1347 + vm_vaddr_t __vm_vaddr_alloc_page(struct kvm_vm *vm, enum kvm_mem_region_type type) 1348 + { 1349 + return __vm_vaddr_alloc(vm, getpagesize(), KVM_UTIL_MIN_VADDR, type); 1361 1350 } 1362 1351 1363 1352 /* ··· 1591 1570 1592 1571 void *vcpu_map_dirty_ring(struct kvm_vcpu *vcpu) 1593 1572 { 1594 - uint32_t page_size = vcpu->vm->page_size; 1573 + uint32_t page_size = getpagesize(); 1595 1574 uint32_t size = vcpu->vm->dirty_ring_size; 1596 1575 1597 1576 TEST_ASSERT(size > 0, "Should enable dirty ring first"); ··· 1932 1911 1933 1912 vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm) 1934 1913 { 1935 - return vm_phy_page_alloc(vm, KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0); 1914 + return vm_phy_page_alloc(vm, KVM_GUEST_PAGE_TABLE_MIN_PADDR, 1915 + vm->memslots[MEM_REGION_PT]); 1936 1916 } 1937 1917 1938 1918 /*
+3
tools/testing/selftests/kvm/lib/memstress.c
··· 292 292 293 293 vcpu_thread_fn = vcpu_fn; 294 294 WRITE_ONCE(all_vcpu_threads_running, false); 295 + WRITE_ONCE(memstress_args.stop_vcpus, false); 295 296 296 297 for (i = 0; i < nr_vcpus; i++) { 297 298 struct vcpu_thread *vcpu = &vcpu_threads[i]; ··· 314 313 void memstress_join_vcpu_threads(int nr_vcpus) 315 314 { 316 315 int i; 316 + 317 + WRITE_ONCE(memstress_args.stop_vcpus, true); 317 318 318 319 for (i = 0; i < nr_vcpus; i++) 319 320 pthread_join(vcpu_threads[i].thread, NULL);
+17 -12
tools/testing/selftests/kvm/lib/riscv/processor.c
··· 55 55 56 56 void virt_arch_pgd_alloc(struct kvm_vm *vm) 57 57 { 58 - if (!vm->pgd_created) { 59 - vm_paddr_t paddr = vm_phy_pages_alloc(vm, 60 - page_align(vm, ptrs_per_pte(vm) * 8) / vm->page_size, 61 - KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0); 62 - vm->pgd = paddr; 63 - vm->pgd_created = true; 64 - } 58 + size_t nr_pages = page_align(vm, ptrs_per_pte(vm) * 8) / vm->page_size; 59 + 60 + if (vm->pgd_created) 61 + return; 62 + 63 + vm->pgd = vm_phy_pages_alloc(vm, nr_pages, 64 + KVM_GUEST_PAGE_TABLE_MIN_PADDR, 65 + vm->memslots[MEM_REGION_PT]); 66 + vm->pgd_created = true; 65 67 } 66 68 67 69 void virt_arch_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr) ··· 281 279 void *guest_code) 282 280 { 283 281 int r; 284 - size_t stack_size = vm->page_size == 4096 ? 285 - DEFAULT_STACK_PGS * vm->page_size : 286 - vm->page_size; 287 - unsigned long stack_vaddr = vm_vaddr_alloc(vm, stack_size, 288 - DEFAULT_RISCV_GUEST_STACK_VADDR_MIN); 282 + size_t stack_size; 283 + unsigned long stack_vaddr; 289 284 unsigned long current_gp = 0; 290 285 struct kvm_mp_state mps; 291 286 struct kvm_vcpu *vcpu; 287 + 288 + stack_size = vm->page_size == 4096 ? DEFAULT_STACK_PGS * vm->page_size : 289 + vm->page_size; 290 + stack_vaddr = __vm_vaddr_alloc(vm, stack_size, 291 + DEFAULT_RISCV_GUEST_STACK_VADDR_MIN, 292 + MEM_REGION_DATA); 292 293 293 294 vcpu = __vm_vcpu_add(vm, vcpu_id); 294 295 riscv_vcpu_mmu_setup(vcpu);
+5 -3
tools/testing/selftests/kvm/lib/s390x/processor.c
··· 21 21 return; 22 22 23 23 paddr = vm_phy_pages_alloc(vm, PAGES_PER_REGION, 24 - KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0); 24 + KVM_GUEST_PAGE_TABLE_MIN_PADDR, 25 + vm->memslots[MEM_REGION_PT]); 25 26 memset(addr_gpa2hva(vm, paddr), 0xff, PAGES_PER_REGION * vm->page_size); 26 27 27 28 vm->pgd = paddr; ··· 168 167 TEST_ASSERT(vm->page_size == 4096, "Unsupported page size: 0x%x", 169 168 vm->page_size); 170 169 171 - stack_vaddr = vm_vaddr_alloc(vm, stack_size, 172 - DEFAULT_GUEST_STACK_VADDR_MIN); 170 + stack_vaddr = __vm_vaddr_alloc(vm, stack_size, 171 + DEFAULT_GUEST_STACK_VADDR_MIN, 172 + MEM_REGION_DATA); 173 173 174 174 vcpu = __vm_vcpu_add(vm, vcpu_id); 175 175
+186
tools/testing/selftests/kvm/lib/userfaultfd_util.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * KVM userfaultfd util 4 + * Adapted from demand_paging_test.c 5 + * 6 + * Copyright (C) 2018, Red Hat, Inc. 7 + * Copyright (C) 2019-2022 Google LLC 8 + */ 9 + 10 + #define _GNU_SOURCE /* for pipe2 */ 11 + 12 + #include <inttypes.h> 13 + #include <stdio.h> 14 + #include <stdlib.h> 15 + #include <time.h> 16 + #include <poll.h> 17 + #include <pthread.h> 18 + #include <linux/userfaultfd.h> 19 + #include <sys/syscall.h> 20 + 21 + #include "kvm_util.h" 22 + #include "test_util.h" 23 + #include "memstress.h" 24 + #include "userfaultfd_util.h" 25 + 26 + #ifdef __NR_userfaultfd 27 + 28 + static void *uffd_handler_thread_fn(void *arg) 29 + { 30 + struct uffd_desc *uffd_desc = (struct uffd_desc *)arg; 31 + int uffd = uffd_desc->uffd; 32 + int pipefd = uffd_desc->pipefds[0]; 33 + useconds_t delay = uffd_desc->delay; 34 + int64_t pages = 0; 35 + struct timespec start; 36 + struct timespec ts_diff; 37 + 38 + clock_gettime(CLOCK_MONOTONIC, &start); 39 + while (1) { 40 + struct uffd_msg msg; 41 + struct pollfd pollfd[2]; 42 + char tmp_chr; 43 + int r; 44 + 45 + pollfd[0].fd = uffd; 46 + pollfd[0].events = POLLIN; 47 + pollfd[1].fd = pipefd; 48 + pollfd[1].events = POLLIN; 49 + 50 + r = poll(pollfd, 2, -1); 51 + switch (r) { 52 + case -1: 53 + pr_info("poll err"); 54 + continue; 55 + case 0: 56 + continue; 57 + case 1: 58 + break; 59 + default: 60 + pr_info("Polling uffd returned %d", r); 61 + return NULL; 62 + } 63 + 64 + if (pollfd[0].revents & POLLERR) { 65 + pr_info("uffd revents has POLLERR"); 66 + return NULL; 67 + } 68 + 69 + if (pollfd[1].revents & POLLIN) { 70 + r = read(pollfd[1].fd, &tmp_chr, 1); 71 + TEST_ASSERT(r == 1, 72 + "Error reading pipefd in UFFD thread\n"); 73 + return NULL; 74 + } 75 + 76 + if (!(pollfd[0].revents & POLLIN)) 77 + continue; 78 + 79 + r = read(uffd, &msg, sizeof(msg)); 80 + if (r == -1) { 81 + if (errno == EAGAIN) 82 + continue; 83 + pr_info("Read of uffd got errno %d\n", errno); 84 + 
return NULL; 85 + } 86 + 87 + if (r != sizeof(msg)) { 88 + pr_info("Read on uffd returned unexpected size: %d bytes", r); 89 + return NULL; 90 + } 91 + 92 + if (!(msg.event & UFFD_EVENT_PAGEFAULT)) 93 + continue; 94 + 95 + if (delay) 96 + usleep(delay); 97 + r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg); 98 + if (r < 0) 99 + return NULL; 100 + pages++; 101 + } 102 + 103 + ts_diff = timespec_elapsed(start); 104 + PER_VCPU_DEBUG("userfaulted %ld pages over %ld.%.9lds. (%f/sec)\n", 105 + pages, ts_diff.tv_sec, ts_diff.tv_nsec, 106 + pages / ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 1000000000.0)); 107 + 108 + return NULL; 109 + } 110 + 111 + struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay, 112 + void *hva, uint64_t len, 113 + uffd_handler_t handler) 114 + { 115 + struct uffd_desc *uffd_desc; 116 + bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR); 117 + int uffd; 118 + struct uffdio_api uffdio_api; 119 + struct uffdio_register uffdio_register; 120 + uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY; 121 + int ret; 122 + 123 + PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n", 124 + is_minor ? "MINOR" : "MISSING", 125 + is_minor ? "UFFDIO_CONTINUE" : "UFFDIO_COPY"); 126 + 127 + uffd_desc = malloc(sizeof(struct uffd_desc)); 128 + TEST_ASSERT(uffd_desc, "malloc failed"); 129 + 130 + /* In order to get minor faults, prefault via the alias. 
*/ 131 + if (is_minor) 132 + expected_ioctls = ((uint64_t) 1) << _UFFDIO_CONTINUE; 133 + 134 + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); 135 + TEST_ASSERT(uffd >= 0, "uffd creation failed, errno: %d", errno); 136 + 137 + uffdio_api.api = UFFD_API; 138 + uffdio_api.features = 0; 139 + TEST_ASSERT(ioctl(uffd, UFFDIO_API, &uffdio_api) != -1, 140 + "ioctl UFFDIO_API failed: %" PRIu64, 141 + (uint64_t)uffdio_api.api); 142 + 143 + uffdio_register.range.start = (uint64_t)hva; 144 + uffdio_register.range.len = len; 145 + uffdio_register.mode = uffd_mode; 146 + TEST_ASSERT(ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) != -1, 147 + "ioctl UFFDIO_REGISTER failed"); 148 + TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) == 149 + expected_ioctls, "missing userfaultfd ioctls"); 150 + 151 + ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK); 152 + TEST_ASSERT(!ret, "Failed to set up pipefd"); 153 + 154 + uffd_desc->uffd_mode = uffd_mode; 155 + uffd_desc->uffd = uffd; 156 + uffd_desc->delay = delay; 157 + uffd_desc->handler = handler; 158 + pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn, 159 + uffd_desc); 160 + 161 + PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n", 162 + hva, hva + len); 163 + 164 + return uffd_desc; 165 + } 166 + 167 + void uffd_stop_demand_paging(struct uffd_desc *uffd) 168 + { 169 + char c = 0; 170 + int ret; 171 + 172 + ret = write(uffd->pipefds[1], &c, 1); 173 + TEST_ASSERT(ret == 1, "Unable to write to pipefd"); 174 + 175 + ret = pthread_join(uffd->thread, NULL); 176 + TEST_ASSERT(ret == 0, "Pthread_join failed."); 177 + 178 + close(uffd->uffd); 179 + 180 + close(uffd->pipefds[1]); 181 + close(uffd->pipefds[0]); 182 + 183 + free(uffd); 184 + } 185 + 186 + #endif /* __NR_userfaultfd */
+7 -6
tools/testing/selftests/kvm/lib/x86_64/processor.c
··· 499 499 static void kvm_setup_gdt(struct kvm_vm *vm, struct kvm_dtable *dt) 500 500 { 501 501 if (!vm->gdt) 502 - vm->gdt = vm_vaddr_alloc_page(vm); 502 + vm->gdt = __vm_vaddr_alloc_page(vm, MEM_REGION_DATA); 503 503 504 504 dt->base = vm->gdt; 505 505 dt->limit = getpagesize(); ··· 509 509 int selector) 510 510 { 511 511 if (!vm->tss) 512 - vm->tss = vm_vaddr_alloc_page(vm); 512 + vm->tss = __vm_vaddr_alloc_page(vm, MEM_REGION_DATA); 513 513 514 514 memset(segp, 0, sizeof(*segp)); 515 515 segp->base = vm->tss; ··· 599 599 vm_vaddr_t stack_vaddr; 600 600 struct kvm_vcpu *vcpu; 601 601 602 - stack_vaddr = vm_vaddr_alloc(vm, DEFAULT_STACK_PGS * getpagesize(), 603 - DEFAULT_GUEST_STACK_VADDR_MIN); 602 + stack_vaddr = __vm_vaddr_alloc(vm, DEFAULT_STACK_PGS * getpagesize(), 603 + DEFAULT_GUEST_STACK_VADDR_MIN, 604 + MEM_REGION_DATA); 604 605 605 606 vcpu = __vm_vcpu_add(vm, vcpu_id); 606 607 vcpu_init_cpuid(vcpu, kvm_get_supported_cpuid()); ··· 1094 1093 extern void *idt_handlers; 1095 1094 int i; 1096 1095 1097 - vm->idt = vm_vaddr_alloc_page(vm); 1098 - vm->handlers = vm_vaddr_alloc_page(vm); 1096 + vm->idt = __vm_vaddr_alloc_page(vm, MEM_REGION_DATA); 1097 + vm->handlers = __vm_vaddr_alloc_page(vm, MEM_REGION_DATA); 1099 1098 /* Handlers have the same address in both address spaces.*/ 1100 1099 for (i = 0; i < NUM_INTERRUPTS; i++) 1101 1100 set_idt_entry(vm, i, (unsigned long)(&idt_handlers)[i], 0,
+1 -5
tools/testing/selftests/kvm/memslot_modification_stress_test.c
··· 34 34 static int nr_vcpus = 1; 35 35 static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE; 36 36 37 - static bool run_vcpus = true; 38 - 39 37 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args) 40 38 { 41 39 struct kvm_vcpu *vcpu = vcpu_args->vcpu; ··· 43 45 run = vcpu->run; 44 46 45 47 /* Let the guest access its memory until a stop signal is received */ 46 - while (READ_ONCE(run_vcpus)) { 48 + while (!READ_ONCE(memstress_args.stop_vcpus)) { 47 49 ret = _vcpu_run(vcpu); 48 50 TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret); 49 51 ··· 106 108 pr_info("Started all vCPUs\n"); 107 109 108 110 add_remove_memslot(vm, p->delay, p->nr_iterations); 109 - 110 - run_vcpus = false; 111 111 112 112 memstress_join_vcpu_threads(nr_vcpus); 113 113 pr_info("All vCPU threads joined\n");
+207 -106
tools/testing/selftests/kvm/memslot_perf_test.c
··· 20 20 #include <unistd.h> 21 21 22 22 #include <linux/compiler.h> 23 + #include <linux/sizes.h> 23 24 24 25 #include <test_util.h> 25 26 #include <kvm_util.h> 26 27 #include <processor.h> 27 28 28 - #define MEM_SIZE ((512U << 20) + 4096) 29 - #define MEM_SIZE_PAGES (MEM_SIZE / 4096) 30 - #define MEM_GPA 0x10000000UL 29 + #define MEM_EXTRA_SIZE SZ_64K 30 + 31 + #define MEM_SIZE (SZ_512M + MEM_EXTRA_SIZE) 32 + #define MEM_GPA SZ_256M 31 33 #define MEM_AUX_GPA MEM_GPA 32 34 #define MEM_SYNC_GPA MEM_AUX_GPA 33 - #define MEM_TEST_GPA (MEM_AUX_GPA + 4096) 34 - #define MEM_TEST_SIZE (MEM_SIZE - 4096) 35 - static_assert(MEM_SIZE % 4096 == 0, "invalid mem size"); 36 - static_assert(MEM_TEST_SIZE % 4096 == 0, "invalid mem test size"); 35 + #define MEM_TEST_GPA (MEM_AUX_GPA + MEM_EXTRA_SIZE) 36 + #define MEM_TEST_SIZE (MEM_SIZE - MEM_EXTRA_SIZE) 37 37 38 38 /* 39 39 * 32 MiB is max size that gets well over 100 iterations on 509 slots. ··· 41 41 * 8194 slots in use can then be tested (although with slightly 42 42 * limited resolution). 
43 43 */ 44 - #define MEM_SIZE_MAP ((32U << 20) + 4096) 45 - #define MEM_SIZE_MAP_PAGES (MEM_SIZE_MAP / 4096) 46 - #define MEM_TEST_MAP_SIZE (MEM_SIZE_MAP - 4096) 47 - #define MEM_TEST_MAP_SIZE_PAGES (MEM_TEST_MAP_SIZE / 4096) 48 - static_assert(MEM_SIZE_MAP % 4096 == 0, "invalid map test region size"); 49 - static_assert(MEM_TEST_MAP_SIZE % 4096 == 0, "invalid map test region size"); 50 - static_assert(MEM_TEST_MAP_SIZE_PAGES % 2 == 0, "invalid map test region size"); 51 - static_assert(MEM_TEST_MAP_SIZE_PAGES > 2, "invalid map test region size"); 44 + #define MEM_SIZE_MAP (SZ_32M + MEM_EXTRA_SIZE) 45 + #define MEM_TEST_MAP_SIZE (MEM_SIZE_MAP - MEM_EXTRA_SIZE) 52 46 53 47 /* 54 48 * 128 MiB is min size that fills 32k slots with at least one page in each 55 49 * while at the same time gets 100+ iterations in such test 50 + * 51 + * 2 MiB chunk size like a typical huge page 56 52 */ 57 - #define MEM_TEST_UNMAP_SIZE (128U << 20) 58 - #define MEM_TEST_UNMAP_SIZE_PAGES (MEM_TEST_UNMAP_SIZE / 4096) 59 - /* 2 MiB chunk size like a typical huge page */ 60 - #define MEM_TEST_UNMAP_CHUNK_PAGES (2U << (20 - 12)) 61 - static_assert(MEM_TEST_UNMAP_SIZE <= MEM_TEST_SIZE, 62 - "invalid unmap test region size"); 63 - static_assert(MEM_TEST_UNMAP_SIZE % 4096 == 0, 64 - "invalid unmap test region size"); 65 - static_assert(MEM_TEST_UNMAP_SIZE_PAGES % 66 - (2 * MEM_TEST_UNMAP_CHUNK_PAGES) == 0, 67 - "invalid unmap test region size"); 53 + #define MEM_TEST_UNMAP_SIZE SZ_128M 54 + #define MEM_TEST_UNMAP_CHUNK_SIZE SZ_2M 68 55 69 56 /* 70 57 * For the move active test the middle of the test area is placed on 71 58 * a memslot boundary: half lies in the memslot being moved, half in 72 59 * other memslot(s). 73 60 * 74 - * When running this test with 32k memslots (32764, really) each memslot 75 - * contains 4 pages. 76 - * The last one additionally contains the remaining 21 pages of memory, 77 - * for the total size of 25 pages. 78 - * Hence, the maximum size here is 50 pages. 
61 + * We have different number of memory slots, excluding the reserved 62 + * memory slot 0, on various architectures and configurations. The 63 + * memory size in this test is calculated by picking the maximal 64 + * last memory slot's memory size, with alignment to the largest 65 + * supported page size (64KB). In this way, the selected memory 66 + * size for this test is compatible with test_memslot_move_prepare(). 67 + * 68 + * architecture slots memory-per-slot memory-on-last-slot 69 + * -------------------------------------------------------------- 70 + * x86-4KB 32763 16KB 160KB 71 + * arm64-4KB 32766 16KB 112KB 72 + * arm64-16KB 32766 16KB 112KB 73 + * arm64-64KB 8192 64KB 128KB 79 74 */ 80 - #define MEM_TEST_MOVE_SIZE_PAGES (50) 81 - #define MEM_TEST_MOVE_SIZE (MEM_TEST_MOVE_SIZE_PAGES * 4096) 75 + #define MEM_TEST_MOVE_SIZE (3 * SZ_64K) 82 76 #define MEM_TEST_MOVE_GPA_DEST (MEM_GPA + MEM_SIZE) 83 77 static_assert(MEM_TEST_MOVE_SIZE <= MEM_TEST_SIZE, 84 78 "invalid move test region size"); ··· 94 100 }; 95 101 96 102 struct sync_area { 103 + uint32_t guest_page_size; 97 104 atomic_bool start_flag; 98 105 atomic_bool exit_flag; 99 106 atomic_bool sync_flag; ··· 187 192 uint64_t gpage, pgoffs; 188 193 uint32_t slot, slotoffs; 189 194 void *base; 195 + uint32_t guest_page_size = data->vm->page_size; 190 196 191 197 TEST_ASSERT(gpa >= MEM_GPA, "Too low gpa to translate"); 192 - TEST_ASSERT(gpa < MEM_GPA + data->npages * 4096, 198 + TEST_ASSERT(gpa < MEM_GPA + data->npages * guest_page_size, 193 199 "Too high gpa to translate"); 194 200 gpa -= MEM_GPA; 195 201 196 - gpage = gpa / 4096; 197 - pgoffs = gpa % 4096; 202 + gpage = gpa / guest_page_size; 203 + pgoffs = gpa % guest_page_size; 198 204 slot = min(gpage / data->pages_per_slot, (uint64_t)data->nslots - 1); 199 205 slotoffs = gpage - (slot * data->pages_per_slot); 200 206 ··· 213 217 } 214 218 215 219 base = data->hva_slots[slot]; 216 - return (uint8_t *)base + slotoffs * 4096 + pgoffs; 220 + return 
(uint8_t *)base + slotoffs * guest_page_size + pgoffs; 217 221 } 218 222 219 223 static uint64_t vm_slot2gpa(struct vm_data *data, uint32_t slot) 220 224 { 225 + uint32_t guest_page_size = data->vm->page_size; 226 + 221 227 TEST_ASSERT(slot < data->nslots, "Too high slot number"); 222 228 223 - return MEM_GPA + slot * data->pages_per_slot * 4096; 229 + return MEM_GPA + slot * data->pages_per_slot * guest_page_size; 224 230 } 225 231 226 232 static struct vm_data *alloc_vm(void) ··· 239 241 return data; 240 242 } 241 243 244 + static bool check_slot_pages(uint32_t host_page_size, uint32_t guest_page_size, 245 + uint64_t pages_per_slot, uint64_t rempages) 246 + { 247 + if (!pages_per_slot) 248 + return false; 249 + 250 + if ((pages_per_slot * guest_page_size) % host_page_size) 251 + return false; 252 + 253 + if ((rempages * guest_page_size) % host_page_size) 254 + return false; 255 + 256 + return true; 257 + } 258 + 259 + 260 + static uint64_t get_max_slots(struct vm_data *data, uint32_t host_page_size) 261 + { 262 + uint32_t guest_page_size = data->vm->page_size; 263 + uint64_t mempages, pages_per_slot, rempages; 264 + uint64_t slots; 265 + 266 + mempages = data->npages; 267 + slots = data->nslots; 268 + while (--slots > 1) { 269 + pages_per_slot = mempages / slots; 270 + rempages = mempages % pages_per_slot; 271 + if (check_slot_pages(host_page_size, guest_page_size, 272 + pages_per_slot, rempages)) 273 + return slots + 1; /* slot 0 is reserved */ 274 + } 275 + 276 + return 0; 277 + } 278 + 242 279 static bool prepare_vm(struct vm_data *data, int nslots, uint64_t *maxslots, 243 - void *guest_code, uint64_t mempages, 280 + void *guest_code, uint64_t mem_size, 244 281 struct timespec *slot_runtime) 245 282 { 246 - uint32_t max_mem_slots; 247 - uint64_t rempages; 283 + uint64_t mempages, rempages; 248 284 uint64_t guest_addr; 249 - uint32_t slot; 285 + uint32_t slot, host_page_size, guest_page_size; 250 286 struct timespec tstart; 251 287 struct sync_area *sync; 252 
288 253 - max_mem_slots = kvm_check_cap(KVM_CAP_NR_MEMSLOTS); 254 - TEST_ASSERT(max_mem_slots > 1, 255 - "KVM_CAP_NR_MEMSLOTS should be greater than 1"); 256 - TEST_ASSERT(nslots > 1 || nslots == -1, 257 - "Slot count cap should be greater than 1"); 258 - if (nslots != -1) 259 - max_mem_slots = min(max_mem_slots, (uint32_t)nslots); 260 - pr_info_v("Allowed number of memory slots: %"PRIu32"\n", max_mem_slots); 289 + host_page_size = getpagesize(); 290 + guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size; 291 + mempages = mem_size / guest_page_size; 261 292 262 - TEST_ASSERT(mempages > 1, 263 - "Can't test without any memory"); 293 + data->vm = __vm_create_with_one_vcpu(&data->vcpu, mempages, guest_code); 294 + TEST_ASSERT(data->vm->page_size == guest_page_size, "Invalid VM page size"); 264 295 265 296 data->npages = mempages; 266 - data->nslots = max_mem_slots - 1; 267 - data->pages_per_slot = mempages / data->nslots; 268 - if (!data->pages_per_slot) { 269 - *maxslots = mempages + 1; 297 + TEST_ASSERT(data->npages > 1, "Can't test without any memory"); 298 + data->nslots = nslots; 299 + data->pages_per_slot = data->npages / data->nslots; 300 + rempages = data->npages % data->nslots; 301 + if (!check_slot_pages(host_page_size, guest_page_size, 302 + data->pages_per_slot, rempages)) { 303 + *maxslots = get_max_slots(data, host_page_size); 270 304 return false; 271 305 } 272 306 273 - rempages = mempages % data->nslots; 274 307 data->hva_slots = malloc(sizeof(*data->hva_slots) * data->nslots); 275 308 TEST_ASSERT(data->hva_slots, "malloc() fail"); 276 309 277 310 data->vm = __vm_create_with_one_vcpu(&data->vcpu, mempages, guest_code); 278 311 279 312 pr_info_v("Adding slots 1..%i, each slot with %"PRIu64" pages + %"PRIu64" extra pages last\n", 280 - max_mem_slots - 1, data->pages_per_slot, rempages); 313 + data->nslots, data->pages_per_slot, rempages); 281 314 282 315 clock_gettime(CLOCK_MONOTONIC, &tstart); 283 - for (slot = 1, guest_addr = MEM_GPA; 
slot < max_mem_slots; slot++) { 316 + for (slot = 1, guest_addr = MEM_GPA; slot <= data->nslots; slot++) { 284 317 uint64_t npages; 285 318 286 319 npages = data->pages_per_slot; 287 - if (slot == max_mem_slots - 1) 320 + if (slot == data->nslots) 288 321 npages += rempages; 289 322 290 323 vm_userspace_mem_region_add(data->vm, VM_MEM_SRC_ANONYMOUS, 291 324 guest_addr, slot, npages, 292 325 0); 293 - guest_addr += npages * 4096; 326 + guest_addr += npages * guest_page_size; 294 327 } 295 328 *slot_runtime = timespec_elapsed(tstart); 296 329 297 - for (slot = 0, guest_addr = MEM_GPA; slot < max_mem_slots - 1; slot++) { 330 + for (slot = 1, guest_addr = MEM_GPA; slot <= data->nslots; slot++) { 298 331 uint64_t npages; 299 332 uint64_t gpa; 300 333 301 334 npages = data->pages_per_slot; 302 - if (slot == max_mem_slots - 2) 335 + if (slot == data->nslots) 303 336 npages += rempages; 304 337 305 - gpa = vm_phy_pages_alloc(data->vm, npages, guest_addr, 306 - slot + 1); 338 + gpa = vm_phy_pages_alloc(data->vm, npages, guest_addr, slot); 307 339 TEST_ASSERT(gpa == guest_addr, 308 340 "vm_phy_pages_alloc() failed\n"); 309 341 310 - data->hva_slots[slot] = addr_gpa2hva(data->vm, guest_addr); 311 - memset(data->hva_slots[slot], 0, npages * 4096); 342 + data->hva_slots[slot - 1] = addr_gpa2hva(data->vm, guest_addr); 343 + memset(data->hva_slots[slot - 1], 0, npages * guest_page_size); 312 344 313 - guest_addr += npages * 4096; 345 + guest_addr += npages * guest_page_size; 314 346 } 315 347 316 - virt_map(data->vm, MEM_GPA, MEM_GPA, mempages); 348 + virt_map(data->vm, MEM_GPA, MEM_GPA, data->npages); 317 349 318 350 sync = (typeof(sync))vm_gpa2hva(data, MEM_SYNC_GPA, NULL); 319 351 atomic_init(&sync->start_flag, false); ··· 442 414 static void guest_code_test_memslot_move(void) 443 415 { 444 416 struct sync_area *sync = (typeof(sync))MEM_SYNC_GPA; 417 + uint32_t page_size = (typeof(page_size))READ_ONCE(sync->guest_page_size); 445 418 uintptr_t base = 
(typeof(base))READ_ONCE(sync->move_area_ptr); 446 419 447 420 GUEST_SYNC(0); ··· 453 424 uintptr_t ptr; 454 425 455 426 for (ptr = base; ptr < base + MEM_TEST_MOVE_SIZE; 456 - ptr += 4096) 427 + ptr += page_size) 457 428 *(uint64_t *)ptr = MEM_TEST_VAL_1; 458 429 459 430 /* ··· 471 442 static void guest_code_test_memslot_map(void) 472 443 { 473 444 struct sync_area *sync = (typeof(sync))MEM_SYNC_GPA; 445 + uint32_t page_size = (typeof(page_size))READ_ONCE(sync->guest_page_size); 474 446 475 447 GUEST_SYNC(0); 476 448 ··· 481 451 uintptr_t ptr; 482 452 483 453 for (ptr = MEM_TEST_GPA; 484 - ptr < MEM_TEST_GPA + MEM_TEST_MAP_SIZE / 2; ptr += 4096) 454 + ptr < MEM_TEST_GPA + MEM_TEST_MAP_SIZE / 2; 455 + ptr += page_size) 485 456 *(uint64_t *)ptr = MEM_TEST_VAL_1; 486 457 487 458 if (!guest_perform_sync()) 488 459 break; 489 460 490 461 for (ptr = MEM_TEST_GPA + MEM_TEST_MAP_SIZE / 2; 491 - ptr < MEM_TEST_GPA + MEM_TEST_MAP_SIZE; ptr += 4096) 462 + ptr < MEM_TEST_GPA + MEM_TEST_MAP_SIZE; 463 + ptr += page_size) 492 464 *(uint64_t *)ptr = MEM_TEST_VAL_2; 493 465 494 466 if (!guest_perform_sync()) ··· 537 505 538 506 static void guest_code_test_memslot_rw(void) 539 507 { 508 + struct sync_area *sync = (typeof(sync))MEM_SYNC_GPA; 509 + uint32_t page_size = (typeof(page_size))READ_ONCE(sync->guest_page_size); 510 + 540 511 GUEST_SYNC(0); 541 512 542 513 guest_spin_until_start(); ··· 548 513 uintptr_t ptr; 549 514 550 515 for (ptr = MEM_TEST_GPA; 551 - ptr < MEM_TEST_GPA + MEM_TEST_SIZE; ptr += 4096) 516 + ptr < MEM_TEST_GPA + MEM_TEST_SIZE; ptr += page_size) 552 517 *(uint64_t *)ptr = MEM_TEST_VAL_1; 553 518 554 519 if (!guest_perform_sync()) 555 520 break; 556 521 557 - for (ptr = MEM_TEST_GPA + 4096 / 2; 558 - ptr < MEM_TEST_GPA + MEM_TEST_SIZE; ptr += 4096) { 522 + for (ptr = MEM_TEST_GPA + page_size / 2; 523 + ptr < MEM_TEST_GPA + MEM_TEST_SIZE; ptr += page_size) { 559 524 uint64_t val = *(uint64_t *)ptr; 560 525 561 526 GUEST_ASSERT_1(val == MEM_TEST_VAL_2, val); ··· 
573 538 struct sync_area *sync, 574 539 uint64_t *maxslots, bool isactive) 575 540 { 541 + uint32_t guest_page_size = data->vm->page_size; 576 542 uint64_t movesrcgpa, movetestgpa; 577 543 578 544 movesrcgpa = vm_slot2gpa(data, data->nslots - 1); ··· 582 546 uint64_t lastpages; 583 547 584 548 vm_gpa2hva(data, movesrcgpa, &lastpages); 585 - if (lastpages < MEM_TEST_MOVE_SIZE_PAGES / 2) { 549 + if (lastpages * guest_page_size < MEM_TEST_MOVE_SIZE / 2) { 586 550 *maxslots = 0; 587 551 return false; 588 552 } ··· 628 592 uint64_t offsp, uint64_t count) 629 593 { 630 594 uint64_t gpa, ctr; 595 + uint32_t guest_page_size = data->vm->page_size; 631 596 632 - for (gpa = MEM_TEST_GPA + offsp * 4096, ctr = 0; ctr < count; ) { 597 + for (gpa = MEM_TEST_GPA + offsp * guest_page_size, ctr = 0; ctr < count; ) { 633 598 uint64_t npages; 634 599 void *hva; 635 600 int ret; ··· 638 601 hva = vm_gpa2hva(data, gpa, &npages); 639 602 TEST_ASSERT(npages, "Empty memory slot at gptr 0x%"PRIx64, gpa); 640 603 npages = min(npages, count - ctr); 641 - ret = madvise(hva, npages * 4096, MADV_DONTNEED); 604 + ret = madvise(hva, npages * guest_page_size, MADV_DONTNEED); 642 605 TEST_ASSERT(!ret, 643 606 "madvise(%p, MADV_DONTNEED) on VM memory should not fail for gptr 0x%"PRIx64, 644 607 hva, gpa); 645 608 ctr += npages; 646 - gpa += npages * 4096; 609 + gpa += npages * guest_page_size; 647 610 } 648 611 TEST_ASSERT(ctr == count, 649 612 "madvise(MADV_DONTNEED) should exactly cover all of the requested area"); ··· 654 617 { 655 618 uint64_t gpa; 656 619 uint64_t *val; 620 + uint32_t guest_page_size = data->vm->page_size; 657 621 658 622 if (!map_unmap_verify) 659 623 return; 660 624 661 - gpa = MEM_TEST_GPA + offsp * 4096; 625 + gpa = MEM_TEST_GPA + offsp * guest_page_size; 662 626 val = (typeof(val))vm_gpa2hva(data, gpa, NULL); 663 627 TEST_ASSERT(*val == valexp, 664 628 "Guest written values should read back correctly before unmap (%"PRIu64" vs %"PRIu64" @ %"PRIx64")", ··· 669 631 670 632 
static void test_memslot_map_loop(struct vm_data *data, struct sync_area *sync) 671 633 { 634 + uint32_t guest_page_size = data->vm->page_size; 635 + uint64_t guest_pages = MEM_TEST_MAP_SIZE / guest_page_size; 636 + 672 637 /* 673 638 * Unmap the second half of the test area while guest writes to (maps) 674 639 * the first half. 675 640 */ 676 - test_memslot_do_unmap(data, MEM_TEST_MAP_SIZE_PAGES / 2, 677 - MEM_TEST_MAP_SIZE_PAGES / 2); 641 + test_memslot_do_unmap(data, guest_pages / 2, guest_pages / 2); 678 642 679 643 /* 680 644 * Wait for the guest to finish writing the first half of the test ··· 687 647 */ 688 648 host_perform_sync(sync); 689 649 test_memslot_map_unmap_check(data, 0, MEM_TEST_VAL_1); 690 - test_memslot_map_unmap_check(data, 691 - MEM_TEST_MAP_SIZE_PAGES / 2 - 1, 692 - MEM_TEST_VAL_1); 693 - test_memslot_do_unmap(data, 0, MEM_TEST_MAP_SIZE_PAGES / 2); 650 + test_memslot_map_unmap_check(data, guest_pages / 2 - 1, MEM_TEST_VAL_1); 651 + test_memslot_do_unmap(data, 0, guest_pages / 2); 694 652 695 653 696 654 /* ··· 701 663 * the test area. 
702 664 */ 703 665 host_perform_sync(sync); 704 - test_memslot_map_unmap_check(data, MEM_TEST_MAP_SIZE_PAGES / 2, 705 - MEM_TEST_VAL_2); 706 - test_memslot_map_unmap_check(data, MEM_TEST_MAP_SIZE_PAGES - 1, 707 - MEM_TEST_VAL_2); 666 + test_memslot_map_unmap_check(data, guest_pages / 2, MEM_TEST_VAL_2); 667 + test_memslot_map_unmap_check(data, guest_pages - 1, MEM_TEST_VAL_2); 708 668 } 709 669 710 670 static void test_memslot_unmap_loop_common(struct vm_data *data, 711 671 struct sync_area *sync, 712 672 uint64_t chunk) 713 673 { 674 + uint32_t guest_page_size = data->vm->page_size; 675 + uint64_t guest_pages = MEM_TEST_UNMAP_SIZE / guest_page_size; 714 676 uint64_t ctr; 715 677 716 678 /* ··· 722 684 */ 723 685 host_perform_sync(sync); 724 686 test_memslot_map_unmap_check(data, 0, MEM_TEST_VAL_1); 725 - for (ctr = 0; ctr < MEM_TEST_UNMAP_SIZE_PAGES / 2; ctr += chunk) 687 + for (ctr = 0; ctr < guest_pages / 2; ctr += chunk) 726 688 test_memslot_do_unmap(data, ctr, chunk); 727 689 728 690 /* Likewise, but for the opposite host / guest areas */ 729 691 host_perform_sync(sync); 730 - test_memslot_map_unmap_check(data, MEM_TEST_UNMAP_SIZE_PAGES / 2, 731 - MEM_TEST_VAL_2); 732 - for (ctr = MEM_TEST_UNMAP_SIZE_PAGES / 2; 733 - ctr < MEM_TEST_UNMAP_SIZE_PAGES; ctr += chunk) 692 + test_memslot_map_unmap_check(data, guest_pages / 2, MEM_TEST_VAL_2); 693 + for (ctr = guest_pages / 2; ctr < guest_pages; ctr += chunk) 734 694 test_memslot_do_unmap(data, ctr, chunk); 735 695 } 736 696 737 697 static void test_memslot_unmap_loop(struct vm_data *data, 738 698 struct sync_area *sync) 739 699 { 740 - test_memslot_unmap_loop_common(data, sync, 1); 700 + uint32_t host_page_size = getpagesize(); 701 + uint32_t guest_page_size = data->vm->page_size; 702 + uint64_t guest_chunk_pages = guest_page_size >= host_page_size ? 
703 + 1 : host_page_size / guest_page_size; 704 + 705 + test_memslot_unmap_loop_common(data, sync, guest_chunk_pages); 741 706 } 742 707 743 708 static void test_memslot_unmap_loop_chunked(struct vm_data *data, 744 709 struct sync_area *sync) 745 710 { 746 - test_memslot_unmap_loop_common(data, sync, MEM_TEST_UNMAP_CHUNK_PAGES); 711 + uint32_t guest_page_size = data->vm->page_size; 712 + uint64_t guest_chunk_pages = MEM_TEST_UNMAP_CHUNK_SIZE / guest_page_size; 713 + 714 + test_memslot_unmap_loop_common(data, sync, guest_chunk_pages); 747 715 } 748 716 749 717 static void test_memslot_rw_loop(struct vm_data *data, struct sync_area *sync) 750 718 { 751 719 uint64_t gptr; 720 + uint32_t guest_page_size = data->vm->page_size; 752 721 753 - for (gptr = MEM_TEST_GPA + 4096 / 2; 754 - gptr < MEM_TEST_GPA + MEM_TEST_SIZE; gptr += 4096) 722 + for (gptr = MEM_TEST_GPA + guest_page_size / 2; 723 + gptr < MEM_TEST_GPA + MEM_TEST_SIZE; gptr += guest_page_size) 755 724 *(uint64_t *)vm_gpa2hva(data, gptr, NULL) = MEM_TEST_VAL_2; 756 725 757 726 host_perform_sync(sync); 758 727 759 728 for (gptr = MEM_TEST_GPA; 760 - gptr < MEM_TEST_GPA + MEM_TEST_SIZE; gptr += 4096) { 729 + gptr < MEM_TEST_GPA + MEM_TEST_SIZE; gptr += guest_page_size) { 761 730 uint64_t *vptr = (typeof(vptr))vm_gpa2hva(data, gptr, NULL); 762 731 uint64_t val = *vptr; 763 732 ··· 793 748 struct timespec *slot_runtime, 794 749 struct timespec *guest_runtime) 795 750 { 796 - uint64_t mem_size = tdata->mem_size ? : MEM_SIZE_PAGES; 751 + uint64_t mem_size = tdata->mem_size ? 
: MEM_SIZE; 797 752 struct vm_data *data; 798 753 struct sync_area *sync; 799 754 struct timespec tstart; ··· 808 763 809 764 sync = (typeof(sync))vm_gpa2hva(data, MEM_SYNC_GPA, NULL); 810 765 766 + sync->guest_page_size = data->vm->page_size; 811 767 if (tdata->prepare && 812 768 !tdata->prepare(data, sync, maxslots)) { 813 769 ret = false; ··· 842 796 static const struct test_data tests[] = { 843 797 { 844 798 .name = "map", 845 - .mem_size = MEM_SIZE_MAP_PAGES, 799 + .mem_size = MEM_SIZE_MAP, 846 800 .guest_code = guest_code_test_memslot_map, 847 801 .loop = test_memslot_map_loop, 848 802 }, 849 803 { 850 804 .name = "unmap", 851 - .mem_size = MEM_TEST_UNMAP_SIZE_PAGES + 1, 805 + .mem_size = MEM_TEST_UNMAP_SIZE + MEM_EXTRA_SIZE, 852 806 .guest_code = guest_code_test_memslot_unmap, 853 807 .loop = test_memslot_unmap_loop, 854 808 }, 855 809 { 856 810 .name = "unmap chunked", 857 - .mem_size = MEM_TEST_UNMAP_SIZE_PAGES + 1, 811 + .mem_size = MEM_TEST_UNMAP_SIZE + MEM_EXTRA_SIZE, 858 812 .guest_code = guest_code_test_memslot_unmap, 859 813 .loop = test_memslot_unmap_loop_chunked, 860 814 }, ··· 912 866 pr_info("%d: %s\n", ctr, tests[ctr].name); 913 867 } 914 868 869 + static bool check_memory_sizes(void) 870 + { 871 + uint32_t host_page_size = getpagesize(); 872 + uint32_t guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size; 873 + 874 + if (host_page_size > SZ_64K || guest_page_size > SZ_64K) { 875 + pr_info("Unsupported page size on host (0x%x) or guest (0x%x)\n", 876 + host_page_size, guest_page_size); 877 + return false; 878 + } 879 + 880 + if (MEM_SIZE % guest_page_size || 881 + MEM_TEST_SIZE % guest_page_size) { 882 + pr_info("invalid MEM_SIZE or MEM_TEST_SIZE\n"); 883 + return false; 884 + } 885 + 886 + if (MEM_SIZE_MAP % guest_page_size || 887 + MEM_TEST_MAP_SIZE % guest_page_size || 888 + (MEM_TEST_MAP_SIZE / guest_page_size) <= 2 || 889 + (MEM_TEST_MAP_SIZE / guest_page_size) % 2) { 890 + pr_info("invalid MEM_SIZE_MAP or 
MEM_TEST_MAP_SIZE\n"); 891 + return false; 892 + } 893 + 894 + if (MEM_TEST_UNMAP_SIZE > MEM_TEST_SIZE || 895 + MEM_TEST_UNMAP_SIZE % guest_page_size || 896 + (MEM_TEST_UNMAP_SIZE / guest_page_size) % 897 + (2 * MEM_TEST_UNMAP_CHUNK_SIZE / guest_page_size)) { 898 + pr_info("invalid MEM_TEST_UNMAP_SIZE or MEM_TEST_UNMAP_CHUNK_SIZE\n"); 899 + return false; 900 + } 901 + 902 + return true; 903 + } 904 + 915 905 static bool parse_args(int argc, char *argv[], 916 906 struct test_args *targs) 917 907 { 908 + uint32_t max_mem_slots; 918 909 int opt; 919 910 920 911 while ((opt = getopt(argc, argv, "hvds:f:e:l:r:")) != -1) { ··· 968 885 break; 969 886 case 's': 970 887 targs->nslots = atoi_paranoid(optarg); 971 - if (targs->nslots <= 0 && targs->nslots != -1) { 972 - pr_info("Slot count cap has to be positive or -1 for no cap\n"); 888 + if (targs->nslots <= 1 && targs->nslots != -1) { 889 + pr_info("Slot count cap must be larger than 1 or -1 for no cap\n"); 973 890 return false; 974 891 } 975 892 break; ··· 1002 919 pr_info("First test to run cannot be greater than the last test to run\n"); 1003 920 return false; 1004 921 } 922 + 923 + max_mem_slots = kvm_check_cap(KVM_CAP_NR_MEMSLOTS); 924 + if (max_mem_slots <= 1) { 925 + pr_info("KVM_CAP_NR_MEMSLOTS should be greater than 1\n"); 926 + return false; 927 + } 928 + 929 + /* Memory slot 0 is reserved */ 930 + if (targs->nslots == -1) 931 + targs->nslots = max_mem_slots - 1; 932 + else 933 + targs->nslots = min_t(int, targs->nslots, max_mem_slots) - 1; 934 + 935 + pr_info_v("Allowed Number of memory slots: %"PRIu32"\n", 936 + targs->nslots + 1); 1005 937 1006 938 return true; 1007 939 } ··· 1091 993 }; 1092 994 struct test_result rbestslottime; 1093 995 int tctr; 996 + 997 + if (!check_memory_sizes()) 998 + return -1; 1094 999 1095 1000 if (!parse_args(argc, argv, &targs)) 1096 1001 return -1;
+6
virt/kvm/Kconfig
··· 33 33 bool 34 34 select HAVE_KVM_DIRTY_RING 35 35 36 + # Allow enabling both the dirty bitmap and dirty ring. Only architectures 37 + # that need to dirty memory outside of a vCPU context should select this. 38 + config NEED_KVM_DIRTY_RING_WITH_BITMAP 39 + bool 40 + depends on HAVE_KVM_DIRTY_RING 41 + 36 42 config HAVE_KVM_EVENTFD 37 43 bool 38 44 select EVENTFD
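Per the comment above, the new symbol is not user-visible; an architecture that dirties guest memory outside a vCPU context would opt in from its own `config KVM` entry. A sketch of what that selection looks like (the surrounding options shown are illustrative, not a copy of any particular arch's Kconfig):

```kconfig
config KVM
	bool "Kernel-based Virtual Machine (KVM) support"
	# ...existing selects...
	select HAVE_KVM_DIRTY_RING_ACQ_REL
	# Opt in to ring+bitmap because some writes (e.g. from in-kernel
	# device emulation) happen with no running vCPU to push to a ring.
	select NEED_KVM_DIRTY_RING_WITH_BITMAP
```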
+44 -2
virt/kvm/dirty_ring.c
··· 21 21 return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size(); 22 22 } 23 23 24 + bool kvm_use_dirty_bitmap(struct kvm *kvm) 25 + { 26 + lockdep_assert_held(&kvm->slots_lock); 27 + 28 + return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap; 29 + } 30 + 31 + #ifndef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP 32 + bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm) 33 + { 34 + return false; 35 + } 36 + #endif 37 + 24 38 static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring) 25 39 { 26 40 return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index); 27 41 } 28 42 29 - bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring) 43 + static bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring) 30 44 { 31 45 return kvm_dirty_ring_used(ring) >= ring->soft_limit; 32 46 } ··· 156 142 157 143 kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask); 158 144 145 + /* 146 + * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared 147 + * by the VCPU thread next time when it enters the guest. 148 + */ 149 + 159 150 trace_kvm_dirty_ring_reset(ring); 160 151 161 152 return count; 162 153 } 163 154 164 - void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) 155 + void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset) 165 156 { 157 + struct kvm_dirty_ring *ring = &vcpu->dirty_ring; 166 158 struct kvm_dirty_gfn *entry; 167 159 168 160 /* It should never get full */ ··· 186 166 kvm_dirty_gfn_set_dirtied(entry); 187 167 ring->dirty_index++; 188 168 trace_kvm_dirty_ring_push(ring, slot, offset); 169 + 170 + if (kvm_dirty_ring_soft_full(ring)) 171 + kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu); 172 + } 173 + 174 + bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu) 175 + { 176 + /* 177 + * The VCPU isn't runnable when the dirty ring becomes soft full. 
178 + * The KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent 179 + * the VCPU from running until the dirty pages are harvested and 180 + * the dirty ring is reset by userspace. 181 + */ 182 + if (kvm_check_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu) && 183 + kvm_dirty_ring_soft_full(&vcpu->dirty_ring)) { 184 + kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu); 185 + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; 186 + trace_kvm_dirty_ring_exit(vcpu); 187 + return true; 188 + } 189 + 190 + return false; 189 191 } 190 192 191 193 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+53 -12
virt/kvm/kvm_main.c
··· 1617 1617 new->dirty_bitmap = NULL; 1618 1618 else if (old && old->dirty_bitmap) 1619 1619 new->dirty_bitmap = old->dirty_bitmap; 1620 - else if (!kvm->dirty_ring_size) { 1620 + else if (kvm_use_dirty_bitmap(kvm)) { 1621 1621 r = kvm_alloc_dirty_bitmap(new); 1622 1622 if (r) 1623 1623 return r; ··· 2068 2068 unsigned long n; 2069 2069 unsigned long any = 0; 2070 2070 2071 - /* Dirty ring tracking is exclusive to dirty log tracking */ 2072 - if (kvm->dirty_ring_size) 2071 + /* Dirty ring tracking may be exclusive to dirty log tracking */ 2072 + if (!kvm_use_dirty_bitmap(kvm)) 2073 2073 return -ENXIO; 2074 2074 2075 2075 *memslot = NULL; ··· 2133 2133 unsigned long *dirty_bitmap_buffer; 2134 2134 bool flush; 2135 2135 2136 - /* Dirty ring tracking is exclusive to dirty log tracking */ 2137 - if (kvm->dirty_ring_size) 2136 + /* Dirty ring tracking may be exclusive to dirty log tracking */ 2137 + if (!kvm_use_dirty_bitmap(kvm)) 2138 2138 return -ENXIO; 2139 2139 2140 2140 as_id = log->slot >> 16; ··· 2245 2245 unsigned long *dirty_bitmap_buffer; 2246 2246 bool flush; 2247 2247 2248 - /* Dirty ring tracking is exclusive to dirty log tracking */ 2249 - if (kvm->dirty_ring_size) 2248 + /* Dirty ring tracking may be exclusive to dirty log tracking */ 2249 + if (!kvm_use_dirty_bitmap(kvm)) 2250 2250 return -ENXIO; 2251 2251 2252 2252 as_id = log->slot >> 16; ··· 3321 3321 struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); 3322 3322 3323 3323 #ifdef CONFIG_HAVE_KVM_DIRTY_RING 3324 - if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm)) 3324 + if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm)) 3325 3325 return; 3326 + 3327 + WARN_ON_ONCE(!vcpu && !kvm_arch_allow_write_without_running_vcpu(kvm)); 3326 3328 #endif 3327 3329 3328 3330 if (memslot && kvm_slot_dirty_track_enabled(memslot)) { 3329 3331 unsigned long rel_gfn = gfn - memslot->base_gfn; 3330 3332 u32 slot = (memslot->as_id << 16) | memslot->id; 3331 3333 3332 - if (kvm->dirty_ring_size) 3333 - 
kvm_dirty_ring_push(&vcpu->dirty_ring, 3334 - slot, rel_gfn); 3335 - else 3334 + if (kvm->dirty_ring_size && vcpu) 3335 + kvm_dirty_ring_push(vcpu, slot, rel_gfn); 3336 + else if (memslot->dirty_bitmap) 3336 3337 set_bit_le(rel_gfn, memslot->dirty_bitmap); 3337 3338 } 3338 3339 } ··· 4501 4500 #else 4502 4501 return 0; 4503 4502 #endif 4503 + #ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP 4504 + case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: 4505 + #endif 4504 4506 case KVM_CAP_BINARY_STATS_FD: 4505 4507 case KVM_CAP_SYSTEM_EVENT_DATA: 4506 4508 return 1; ··· 4579 4575 return -EINVAL; 4580 4576 } 4581 4577 4578 + static bool kvm_are_all_memslots_empty(struct kvm *kvm) 4579 + { 4580 + int i; 4581 + 4582 + lockdep_assert_held(&kvm->slots_lock); 4583 + 4584 + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 4585 + if (!kvm_memslots_empty(__kvm_memslots(kvm, i))) 4586 + return false; 4587 + } 4588 + 4589 + return true; 4590 + } 4591 + 4582 4592 static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm, 4583 4593 struct kvm_enable_cap *cap) 4584 4594 { ··· 4623 4605 return -EINVAL; 4624 4606 4625 4607 return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]); 4608 + case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: { 4609 + int r = -EINVAL; 4610 + 4611 + if (!IS_ENABLED(CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP) || 4612 + !kvm->dirty_ring_size || cap->flags) 4613 + return r; 4614 + 4615 + mutex_lock(&kvm->slots_lock); 4616 + 4617 + /* 4618 + * For simplicity, allow enabling ring+bitmap if and only if 4619 + * there are no memslots, e.g. to ensure all memslots allocate 4620 + * a bitmap after the capability is enabled. 4621 + */ 4622 + if (kvm_are_all_memslots_empty(kvm)) { 4623 + kvm->dirty_ring_with_bitmap = true; 4624 + r = 0; 4625 + } 4626 + 4627 + mutex_unlock(&kvm->slots_lock); 4628 + 4629 + return r; 4630 + } 4626 4631 default: 4627 4632 return kvm_vm_ioctl_enable_cap(kvm, cap); 4628 4633 }