Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
"Generic:

- Use memdup_array_user() to harden against overflow.

- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.

- Clean up Kconfigs that all KVM architectures were selecting.

- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.

- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential VMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).

x86:

- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.

- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.

- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.

- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.

- Let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.

- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.

- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.

- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.

- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs

- Support for virtualizing Linear Address Masking (LAM)

- Fix a variety of vPMU bugs where KVM fails to stop/reset counters
and other state prior to refreshing the vPMU model.

- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.

- Turn off KVM_WERROR by default for all configs so that it's not
inadvertently enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.

- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".

- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.

- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.

- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.

- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.

ARM64:

- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.

- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.

- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargeting the NV support to
that version of the architecture.

- A small set of vgic fixes and associated cleanups.

Loongarch:

- Optimization for memslot hugepage checking

- Cleanup and fix some HW/SW timer issues

- Add LSX/LASX (128bit/256bit SIMD) support

RISC-V:

- KVM_GET_REG_LIST improvement for vector registers

- Generate ISA extension reg_list using macros in get-reg-list
selftest

- Support for reporting steal time along with selftest

s390:

- Bugfixes

Selftests:

- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.

- Fix build errors when a header is deleted/moved due to a missing
flag in the Makefile.

- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.

- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...

+7481 -2722
+3 -3
Documentation/admin-guide/kernel-parameters.txt
···
 			vulnerability. System may allow data leaks with this
 			option.
 
-	no-steal-acc	[X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized
-			steal time accounting. steal time is computed, but
-			won't influence scheduler behaviour
+	no-steal-acc	[X86,PV_OPS,ARM64,PPC/PSERIES,RISCV] Disable
+			paravirtualized steal time accounting. steal time is
+			computed, but won't influence scheduler behaviour
 
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
+207 -12
Documentation/virt/kvm/api.rst
··· 147 147 The new VM has no virtual cpus and no memory. 148 148 You probably want to use 0 as machine type. 149 149 150 + X86: 151 + ^^^^ 152 + 153 + Supported X86 VM types can be queried via KVM_CAP_VM_TYPES. 154 + 155 + S390: 156 + ^^^^^ 157 + 150 158 In order to create user controlled virtual machines on S390, check 151 159 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as 152 160 privileged user (CAP_SYS_ADMIN). 161 + 162 + MIPS: 163 + ^^^^^ 164 + 165 + To use hardware assisted virtualization on MIPS (VZ ASE) rather than 166 + the default trap & emulate implementation (which changes the virtual 167 + memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the 168 + flag KVM_VM_MIPS_VZ. 169 + 170 + ARM64: 171 + ^^^^^^ 153 172 154 173 On arm64, the physical address size for a VM (IPA Size limit) is limited 155 174 to 40bits by default. The limit can be configured if the host supports the ··· 625 606 interrupt number dequeues the interrupt. 626 607 627 608 This is an asynchronous vcpu ioctl and can be invoked from any thread. 628 - 629 - 630 - 4.17 KVM_DEBUG_GUEST 631 - -------------------- 632 - 633 - :Capability: basic 634 - :Architectures: none 635 - :Type: vcpu ioctl 636 - :Parameters: none) 637 - :Returns: -1 on error 638 - 639 - Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. 640 609 641 610 642 611 4.18 KVM_GET_MSRS ··· 6199 6192 ``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a 6200 6193 superset of the features supported by the system. 6201 6194 6195 + 4.140 KVM_SET_USER_MEMORY_REGION2 6196 + --------------------------------- 6197 + 6198 + :Capability: KVM_CAP_USER_MEMORY2 6199 + :Architectures: all 6200 + :Type: vm ioctl 6201 + :Parameters: struct kvm_userspace_memory_region2 (in) 6202 + :Returns: 0 on success, -1 on error 6203 + 6204 + KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that 6205 + allows mapping guest_memfd memory into a guest. 
All fields shared with 6206 + KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_GUEST_MEMFD 6207 + in flags to have KVM bind the memory region to a given guest_memfd range of 6208 + [guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd 6209 + must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and 6210 + the target range must not be bound to any other memory region. All standard 6211 + bounds checks apply (use common sense). 6212 + 6213 + :: 6214 + 6215 + struct kvm_userspace_memory_region2 { 6216 + __u32 slot; 6217 + __u32 flags; 6218 + __u64 guest_phys_addr; 6219 + __u64 memory_size; /* bytes */ 6220 + __u64 userspace_addr; /* start of the userspace allocated memory */ 6221 + __u64 guest_memfd_offset; 6222 + __u32 guest_memfd; 6223 + __u32 pad1; 6224 + __u64 pad2[14]; 6225 + }; 6226 + 6227 + A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and 6228 + userspace_addr (shared memory). However, "valid" for userspace_addr simply 6229 + means that the address itself must be a legal userspace address. The backing 6230 + mapping for userspace_addr is not required to be valid/populated at the time of 6231 + KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated 6232 + on-demand. 6233 + 6234 + When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes 6235 + userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE 6236 + state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute 6237 + is '0' for all gfns. Userspace can control whether memory is shared/private by 6238 + toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed. 
6239 + 6240 + 4.141 KVM_SET_MEMORY_ATTRIBUTES 6241 + ------------------------------- 6242 + 6243 + :Capability: KVM_CAP_MEMORY_ATTRIBUTES 6244 + :Architectures: x86 6245 + :Type: vm ioctl 6246 + :Parameters: struct kvm_memory_attributes (in) 6247 + :Returns: 0 on success, <0 on error 6248 + 6249 + KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range 6250 + of guest physical memory. 6251 + 6252 + :: 6253 + 6254 + struct kvm_memory_attributes { 6255 + __u64 address; 6256 + __u64 size; 6257 + __u64 attributes; 6258 + __u64 flags; 6259 + }; 6260 + 6261 + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) 6262 + 6263 + The address and size must be page aligned. The supported attributes can be 6264 + retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If 6265 + executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes 6266 + supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES 6267 + returns all attributes supported by KVM. The only attribute defined at this 6268 + time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being 6269 + guest private memory. 6270 + 6271 + Note, there is no "get" API. Userspace is responsible for explicitly tracking 6272 + the state of a gfn/page as needed. 6273 + 6274 + The "flags" field is reserved for future extensions and must be '0'. 6275 + 6276 + 4.142 KVM_CREATE_GUEST_MEMFD 6277 + ---------------------------- 6278 + 6279 + :Capability: KVM_CAP_GUEST_MEMFD 6280 + :Architectures: none 6281 + :Type: vm ioctl 6282 + :Parameters: struct kvm_create_guest_memfd(in) 6283 + :Returns: 0 on success, <0 on error 6284 + 6285 + KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor 6286 + that refers to it. guest_memfd files are roughly analogous to files created 6287 + via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage, 6288 + and are automatically released when the last reference is dropped. 
Unlike 6289 + "regular" memfd_create() files, guest_memfd files are bound to their owning 6290 + virtual machine (see below), cannot be mapped, read, or written by userspace, 6291 + and cannot be resized (guest_memfd files do however support PUNCH_HOLE). 6292 + 6293 + :: 6294 + 6295 + struct kvm_create_guest_memfd { 6296 + __u64 size; 6297 + __u64 flags; 6298 + __u64 reserved[6]; 6299 + }; 6300 + 6301 + Conceptually, the inode backing a guest_memfd file represents physical memory, 6302 + i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The 6303 + file itself, which is bound to a "struct kvm", is that instance's view of the 6304 + underlying memory, e.g. effectively provides the translation of guest addresses 6305 + to host memory. This allows for use cases where multiple KVM structures are 6306 + used to manage a single virtual machine, e.g. when performing intrahost 6307 + migration of a virtual machine. 6308 + 6309 + KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2, 6310 + and more specifically via the guest_memfd and guest_memfd_offset fields in 6311 + "struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset 6312 + into the guest_memfd instance. For a given guest_memfd file, there can be at 6313 + most one mapping per page, i.e. binding multiple memory regions to a single 6314 + guest_memfd range is not allowed (any number of memory regions can be bound to 6315 + a single guest_memfd file, but the bound ranges must not overlap). 6316 + 6317 + See KVM_SET_USER_MEMORY_REGION2 for additional details. 6318 + 6202 6319 5. The kvm_run structure 6203 6320 ======================== 6204 6321 ··· 6954 6823 array field represents return values. The userspace should update the return 6955 6824 values of SBI call before resuming the VCPU. For more details on RISC-V SBI 6956 6825 spec refer, https://github.com/riscv/riscv-sbi-doc. 
6826 + 6827 + :: 6828 + 6829 + /* KVM_EXIT_MEMORY_FAULT */ 6830 + struct { 6831 + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3) 6832 + __u64 flags; 6833 + __u64 gpa; 6834 + __u64 size; 6835 + } memory_fault; 6836 + 6837 + KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that 6838 + could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the 6839 + guest physical address range [gpa, gpa + size) of the fault. The 'flags' field 6840 + describes properties of the faulting access that are likely pertinent: 6841 + 6842 + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred 6843 + on a private memory access. When clear, indicates the fault occurred on a 6844 + shared access. 6845 + 6846 + Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it 6847 + accompanies a return code of '-1', not '0'! errno will always be set to EFAULT 6848 + or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume 6849 + kvm_run.exit_reason is stale/undefined for all other error numbers. 6957 6850 6958 6851 :: 6959 6852 ··· 8013 7858 cause CPU stuck (due to event windows don't open up) and make the CPU 8014 7859 unavailable to host or other VMs. 8015 7860 7861 + 7.34 KVM_CAP_MEMORY_FAULT_INFO 7862 + ------------------------------ 7863 + 7864 + :Architectures: x86 7865 + :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. 7866 + 7867 + The presence of this capability indicates that KVM_RUN will fill 7868 + kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if 7869 + there is a valid memslot but no backing VMA for the corresponding host virtual 7870 + address. 7871 + 7872 + The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns 7873 + an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set 7874 + to KVM_EXIT_MEMORY_FAULT. 
7875 + 7876 + Note: Userspaces which attempt to resolve memory faults so that they can retry 7877 + KVM_RUN are encouraged to guard against repeatedly receiving the same 7878 + error/annotated fault. 7879 + 7880 + See KVM_EXIT_MEMORY_FAULT for more information. 7881 + 8016 7882 8. Other capabilities. 8017 7883 ====================== 8018 7884 ··· 8550 8374 #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) 8551 8375 #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) 8552 8376 #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6) 8377 + #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7) 8553 8378 8554 8379 The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG 8555 8380 ioctl is available, for the guest to set its hypercall page. ··· 8593 8416 behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless 8594 8417 specifically enabled (by the guest making the hypercall, causing the VMM 8595 8418 to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute). 8419 + 8420 + The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports 8421 + clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be 8422 + done when the KVM_CAP_XEN_HVM ioctl sets the 8423 + KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag. 8596 8424 8597 8425 8.31 KVM_CAP_PPC_MULTITCE 8598 8426 ------------------------- ··· 8777 8595 block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 8778 8596 64-bit bitmap (each bit describing a block size). The default value is 8779 8597 0, to disable the eager page splitting. 8598 + 8599 + 8.41 KVM_CAP_VM_TYPES 8600 + --------------------- 8601 + 8602 + :Capability: KVM_CAP_MEMORY_ATTRIBUTES 8603 + :Architectures: x86 8604 + :Type: system ioctl 8605 + 8606 + This capability returns a bitmap of support VM types. The 1-setting of bit @n 8607 + means the VM type with value @n is supported. 
Possible values of @n are:: 8608 + 8609 + #define KVM_X86_DEFAULT_VM 0 8610 + #define KVM_X86_SW_PROTECTED_VM 1 8780 8611 8781 8612 9. Known KVM API problems 8782 8613 =========================
+3 -4
Documentation/virt/kvm/locking.rst
···
 
 - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock
 
-- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
-  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
-  cannot be taken without already holding kvm->arch.mmu_lock (typically with
-  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
+- kvm->arch.mmu_lock is an rwlock; critical sections for
+  kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
+  also take kvm->arch.mmu_lock
 
 Everything else is a leaf: no other lock is taken inside the critical
 sections.
+15
arch/arm64/include/asm/esr.h
···
 	return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR;
 }
 
+static inline bool esr_fsc_is_translation_fault(unsigned long esr)
+{
+	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
+}
+
+static inline bool esr_fsc_is_permission_fault(unsigned long esr)
+{
+	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_PERM;
+}
+
+static inline bool esr_fsc_is_access_flag_fault(unsigned long esr)
+{
+	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_ACCESS;
+}
+
 const char *esr_get_class_string(unsigned long esr);
 #endif /* __ASSEMBLY__ */
+35 -22
arch/arm64/include/asm/kvm_arm.h
···
 #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En)
 
 /* TCR_EL2 Registers bits */
+#define TCR_EL2_DS		(1UL << 32)
 #define TCR_EL2_RES1		((1U << 31) | (1 << 23))
 #define TCR_EL2_TBI		(1 << 20)
 #define TCR_EL2_PS_SHIFT	16
···
 			 TCR_EL2_ORGN0_MASK | TCR_EL2_IRGN0_MASK | TCR_EL2_T0SZ_MASK)
 
 /* VTCR_EL2 Registers bits */
+#define VTCR_EL2_DS		TCR_EL2_DS
 #define VTCR_EL2_RES1		(1U << 31)
 #define VTCR_EL2_HD		(1 << 22)
 #define VTCR_EL2_HA		(1 << 21)
···
  * Once we get to a point where the two describe the same thing, we'll
  * merge the definitions. One day.
  */
-#define __HFGRTR_EL2_RES0	(GENMASK(63, 56) | GENMASK(53, 51))
+#define __HFGRTR_EL2_RES0	HFGxTR_EL2_RES0
 #define __HFGRTR_EL2_MASK	GENMASK(49, 0)
-#define __HFGRTR_EL2_nMASK	(GENMASK(58, 57) | GENMASK(55, 54) | BIT(50))
+#define __HFGRTR_EL2_nMASK	~(__HFGRTR_EL2_RES0 | __HFGRTR_EL2_MASK)
 
-#define __HFGWTR_EL2_RES0	(GENMASK(63, 56) | GENMASK(53, 51) | \
-				 BIT(46) | BIT(42) | BIT(40) | BIT(28) | \
-				 GENMASK(26, 25) | BIT(21) | BIT(18) | \
+/*
+ * The HFGWTR bits are a subset of HFGRTR bits. To ensure we don't miss any
+ * future additions, define __HFGWTR* macros relative to __HFGRTR* ones.
+ */
+#define __HFGRTR_ONLY_MASK	(BIT(46) | BIT(42) | BIT(40) | BIT(28) | \
+				 GENMASK(26, 25) | BIT(21) | BIT(18) | \
 				 GENMASK(15, 14) | GENMASK(10, 9) | BIT(2))
-#define __HFGWTR_EL2_MASK	GENMASK(49, 0)
-#define __HFGWTR_EL2_nMASK	(GENMASK(58, 57) | GENMASK(55, 54) | BIT(50))
+#define __HFGWTR_EL2_RES0	(__HFGRTR_EL2_RES0 | __HFGRTR_ONLY_MASK)
+#define __HFGWTR_EL2_MASK	(__HFGRTR_EL2_MASK & ~__HFGRTR_ONLY_MASK)
+#define __HFGWTR_EL2_nMASK	~(__HFGWTR_EL2_RES0 | __HFGWTR_EL2_MASK)
 
-#define __HFGITR_EL2_RES0	GENMASK(63, 57)
-#define __HFGITR_EL2_MASK	GENMASK(54, 0)
-#define __HFGITR_EL2_nMASK	GENMASK(56, 55)
+#define __HFGITR_EL2_RES0	HFGITR_EL2_RES0
+#define __HFGITR_EL2_MASK	(BIT(62) | BIT(60) | GENMASK(54, 0))
+#define __HFGITR_EL2_nMASK	~(__HFGITR_EL2_RES0 | __HFGITR_EL2_MASK)
 
-#define __HDFGRTR_EL2_RES0	(BIT(49) | BIT(42) | GENMASK(39, 38) | \
-				 GENMASK(21, 20) | BIT(8))
-#define __HDFGRTR_EL2_MASK	~__HDFGRTR_EL2_nMASK
-#define __HDFGRTR_EL2_nMASK	GENMASK(62, 59)
+#define __HDFGRTR_EL2_RES0	HDFGRTR_EL2_RES0
+#define __HDFGRTR_EL2_MASK	(BIT(63) | GENMASK(58, 50) | GENMASK(48, 43) | \
+				 GENMASK(41, 40) | GENMASK(37, 22) | \
+				 GENMASK(19, 9) | GENMASK(7, 0))
+#define __HDFGRTR_EL2_nMASK	~(__HDFGRTR_EL2_RES0 | __HDFGRTR_EL2_MASK)
 
-#define __HDFGWTR_EL2_RES0	(BIT(63) | GENMASK(59, 58) | BIT(51) | BIT(47) | \
-				 BIT(43) | GENMASK(40, 38) | BIT(34) | BIT(30) | \
-				 BIT(22) | BIT(9) | BIT(6))
-#define __HDFGWTR_EL2_MASK	~__HDFGWTR_EL2_nMASK
-#define __HDFGWTR_EL2_nMASK	GENMASK(62, 60)
+#define __HDFGWTR_EL2_RES0	HDFGWTR_EL2_RES0
+#define __HDFGWTR_EL2_MASK	(GENMASK(57, 52) | GENMASK(50, 48) | \
+				 GENMASK(46, 44) | GENMASK(42, 41) | \
+				 GENMASK(37, 35) | GENMASK(33, 31) | \
+				 GENMASK(29, 23) | GENMASK(21, 10) | \
+				 GENMASK(8, 7) | GENMASK(5, 0))
+#define __HDFGWTR_EL2_nMASK	~(__HDFGWTR_EL2_RES0 | __HDFGWTR_EL2_MASK)
+
+#define __HAFGRTR_EL2_RES0	HAFGRTR_EL2_RES0
+#define __HAFGRTR_EL2_MASK	(GENMASK(49, 17) | GENMASK(4, 0))
+#define __HAFGRTR_EL2_nMASK	~(__HAFGRTR_EL2_RES0 | __HAFGRTR_EL2_MASK)
 
 /* Similar definitions for HCRX_EL2 */
-#define __HCRX_EL2_RES0		(GENMASK(63, 16) | GENMASK(13, 12))
-#define __HCRX_EL2_MASK		(0)
-#define __HCRX_EL2_nMASK	(GENMASK(15, 14) | GENMASK(4, 0))
+#define __HCRX_EL2_RES0		HCRX_EL2_RES0
+#define __HCRX_EL2_MASK		(BIT(6))
+#define __HCRX_EL2_nMASK	~(__HCRX_EL2_RES0 | __HCRX_EL2_MASK)
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~UL(0xf))
+18 -16
arch/arm64/include/asm/kvm_emulate.h
···
 #include <asm/esr.h>
 #include <asm/kvm_arm.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_nested.h>
 #include <asm/ptrace.h>
 #include <asm/cputype.h>
 #include <asm/virt.h>
···
 void kvm_emulate_nested_eret(struct kvm_vcpu *vcpu);
 int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2);
 int kvm_inject_nested_irq(struct kvm_vcpu *vcpu);
-
-static inline bool vcpu_has_feature(const struct kvm_vcpu *vcpu, int feature)
-{
-	return test_bit(feature, vcpu->kvm->arch.vcpu_features);
-}
 
 #if defined(__KVM_VHE_HYPERVISOR__) || defined(__KVM_NVHE_HYPERVISOR__)
 static __always_inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu)
···
 
 static inline bool is_hyp_ctxt(const struct kvm_vcpu *vcpu)
 {
-	return __is_hyp_ctxt(&vcpu->arch.ctxt);
+	return vcpu_has_nv(vcpu) && __is_hyp_ctxt(&vcpu->arch.ctxt);
 }
 
 /*
···
 	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC;
 }
 
-static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
+static inline
+bool kvm_vcpu_trap_is_permission_fault(const struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
+	return esr_fsc_is_permission_fault(kvm_vcpu_get_esr(vcpu));
 }
 
-static __always_inline u8 kvm_vcpu_trap_get_fault_level(const struct kvm_vcpu *vcpu)
+static inline
+bool kvm_vcpu_trap_is_translation_fault(const struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_LEVEL;
+	return esr_fsc_is_translation_fault(kvm_vcpu_get_esr(vcpu));
+}
+
+static inline
+u64 kvm_vcpu_trap_get_perm_fault_granule(const struct kvm_vcpu *vcpu)
+{
+	unsigned long esr = kvm_vcpu_get_esr(vcpu);
+
+	BUG_ON(!esr_fsc_is_permission_fault(esr));
+	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(esr & ESR_ELx_FSC_LEVEL));
 }
 
 static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
···
 		 * first), then a permission fault to allow the flags
 		 * to be set.
 		 */
-		switch (kvm_vcpu_trap_get_fault_type(vcpu)) {
-		case ESR_ELx_FSC_PERM:
-			return true;
-		default:
-			return false;
-		}
+		return kvm_vcpu_trap_is_permission_fault(vcpu);
 	}
 
 	if (kvm_vcpu_trap_is_iabt(vcpu))
+92 -48
arch/arm64/include/asm/kvm_host.h
··· 27 27 #include <asm/fpsimd.h> 28 28 #include <asm/kvm.h> 29 29 #include <asm/kvm_asm.h> 30 + #include <asm/vncr_mapping.h> 30 31 31 32 #define __KVM_HAVE_ARCH_INTC_INITIALIZED 32 33 ··· 307 306 * Atomic access to multiple idregs are guarded by kvm_arch.config_lock. 308 307 */ 309 308 #define IDREG_IDX(id) (((sys_reg_CRm(id) - 1) << 3) | sys_reg_Op2(id)) 309 + #define IDX_IDREG(idx) sys_reg(3, 0, 0, ((idx) >> 3) + 1, (idx) & Op2_mask) 310 310 #define IDREG(kvm, id) ((kvm)->arch.id_regs[IDREG_IDX(id)]) 311 311 #define KVM_ARM_ID_REG_NUM (IDREG_IDX(sys_reg(3, 0, 0, 7, 7)) + 1) 312 312 u64 id_regs[KVM_ARM_ID_REG_NUM]; ··· 326 324 u64 disr_el1; /* Deferred [SError] Status Register */ 327 325 }; 328 326 327 + /* 328 + * VNCR() just places the VNCR_capable registers in the enum after 329 + * __VNCR_START__, and the value (after correction) to be an 8-byte offset 330 + * from the VNCR base. As we don't require the enum to be otherwise ordered, 331 + * we need the terrible hack below to ensure that we correctly size the 332 + * sys_regs array, no matter what. 
333 + * 334 + * The __MAX__ macro has been lifted from Sean Eron Anderson's wonderful 335 + * treasure trove of bit hacks: 336 + * https://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax 337 + */ 338 + #define __MAX__(x,y) ((x) ^ (((x) ^ (y)) & -((x) < (y)))) 339 + #define VNCR(r) \ 340 + __before_##r, \ 341 + r = __VNCR_START__ + ((VNCR_ ## r) / 8), \ 342 + __after_##r = __MAX__(__before_##r - 1, r) 343 + 329 344 enum vcpu_sysreg { 330 345 __INVALID_SYSREG__, /* 0 is reserved as an invalid value */ 331 346 MPIDR_EL1, /* MultiProcessor Affinity Register */ 332 347 CLIDR_EL1, /* Cache Level ID Register */ 333 348 CSSELR_EL1, /* Cache Size Selection Register */ 334 - SCTLR_EL1, /* System Control Register */ 335 - ACTLR_EL1, /* Auxiliary Control Register */ 336 - CPACR_EL1, /* Coprocessor Access Control */ 337 - ZCR_EL1, /* SVE Control */ 338 - TTBR0_EL1, /* Translation Table Base Register 0 */ 339 - TTBR1_EL1, /* Translation Table Base Register 1 */ 340 - TCR_EL1, /* Translation Control Register */ 341 - TCR2_EL1, /* Extended Translation Control Register */ 342 - ESR_EL1, /* Exception Syndrome Register */ 343 - AFSR0_EL1, /* Auxiliary Fault Status Register 0 */ 344 - AFSR1_EL1, /* Auxiliary Fault Status Register 1 */ 345 - FAR_EL1, /* Fault Address Register */ 346 - MAIR_EL1, /* Memory Attribute Indirection Register */ 347 - VBAR_EL1, /* Vector Base Address Register */ 348 - CONTEXTIDR_EL1, /* Context ID Register */ 349 349 TPIDR_EL0, /* Thread ID, User R/W */ 350 350 TPIDRRO_EL0, /* Thread ID, User R/O */ 351 351 TPIDR_EL1, /* Thread ID, Privileged */ 352 - AMAIR_EL1, /* Aux Memory Attribute Indirection Register */ 353 352 CNTKCTL_EL1, /* Timer Control Register (EL1) */ 354 353 PAR_EL1, /* Physical Address Register */ 355 - MDSCR_EL1, /* Monitor Debug System Control Register */ 356 354 MDCCINT_EL1, /* Monitor Debug Comms Channel Interrupt Enable Reg */ 357 355 OSLSR_EL1, /* OS Lock Status Register */ 358 356 DISR_EL1, /* Deferred Interrupt Status 
Register */ ··· 383 381 APGAKEYLO_EL1, 384 382 APGAKEYHI_EL1, 385 383 386 - ELR_EL1, 387 - SP_EL1, 388 - SPSR_EL1, 389 - 390 - CNTVOFF_EL2, 391 - CNTV_CVAL_EL0, 392 - CNTV_CTL_EL0, 393 - CNTP_CVAL_EL0, 394 - CNTP_CTL_EL0, 395 - 396 384 /* Memory Tagging Extension registers */ 397 385 RGSR_EL1, /* Random Allocation Tag Seed Register */ 398 386 GCR_EL1, /* Tag Control Register */ 399 - TFSR_EL1, /* Tag Fault Status Register (EL1) */ 400 387 TFSRE0_EL1, /* Tag Fault Status Register (EL0) */ 401 - 402 - /* Permission Indirection Extension registers */ 403 - PIR_EL1, /* Permission Indirection Register 1 (EL1) */ 404 - PIRE0_EL1, /* Permission Indirection Register 0 (EL1) */ 405 388 406 389 /* 32bit specific registers. */ 407 390 DACR32_EL2, /* Domain Access Control Register */ ··· 395 408 DBGVCR32_EL2, /* Debug Vector Catch Register */ 396 409 397 410 /* EL2 registers */ 398 - VPIDR_EL2, /* Virtualization Processor ID Register */ 399 - VMPIDR_EL2, /* Virtualization Multiprocessor ID Register */ 400 411 SCTLR_EL2, /* System Control Register (EL2) */ 401 412 ACTLR_EL2, /* Auxiliary Control Register (EL2) */ 402 - HCR_EL2, /* Hypervisor Configuration Register */ 403 413 MDCR_EL2, /* Monitor Debug Configuration Register (EL2) */ 404 414 CPTR_EL2, /* Architectural Feature Trap Register (EL2) */ 405 - HSTR_EL2, /* Hypervisor System Trap Register */ 406 415 HACR_EL2, /* Hypervisor Auxiliary Control Register */ 407 - HCRX_EL2, /* Extended Hypervisor Configuration Register */ 408 416 TTBR0_EL2, /* Translation Table Base Register 0 (EL2) */ 409 417 TTBR1_EL2, /* Translation Table Base Register 1 (EL2) */ 410 418 TCR_EL2, /* Translation Control Register (EL2) */ 411 - VTTBR_EL2, /* Virtualization Translation Table Base Register */ 412 - VTCR_EL2, /* Virtualization Translation Control Register */ 413 419 SPSR_EL2, /* EL2 saved program status register */ 414 420 ELR_EL2, /* EL2 exception link register */ 415 421 AFSR0_EL2, /* Auxiliary Fault Status Register 0 (EL2) */ ··· 415 435 
VBAR_EL2, /* Vector Base Address Register (EL2) */ 416 436 RVBAR_EL2, /* Reset Vector Base Address Register */ 417 437 CONTEXTIDR_EL2, /* Context ID Register (EL2) */ 418 - TPIDR_EL2, /* EL2 Software Thread ID Register */ 419 438 CNTHCTL_EL2, /* Counter-timer Hypervisor Control register */ 420 439 SP_EL2, /* EL2 Stack Pointer */ 421 - HFGRTR_EL2, 422 - HFGWTR_EL2, 423 - HFGITR_EL2, 424 - HDFGRTR_EL2, 425 - HDFGWTR_EL2, 426 440 CNTHP_CTL_EL2, 427 441 CNTHP_CVAL_EL2, 428 442 CNTHV_CTL_EL2, 429 443 CNTHV_CVAL_EL2, 444 + 445 + __VNCR_START__, /* Any VNCR-capable reg goes after this point */ 446 + 447 + VNCR(SCTLR_EL1),/* System Control Register */ 448 + VNCR(ACTLR_EL1),/* Auxiliary Control Register */ 449 + VNCR(CPACR_EL1),/* Coprocessor Access Control */ 450 + VNCR(ZCR_EL1), /* SVE Control */ 451 + VNCR(TTBR0_EL1),/* Translation Table Base Register 0 */ 452 + VNCR(TTBR1_EL1),/* Translation Table Base Register 1 */ 453 + VNCR(TCR_EL1), /* Translation Control Register */ 454 + VNCR(TCR2_EL1), /* Extended Translation Control Register */ 455 + VNCR(ESR_EL1), /* Exception Syndrome Register */ 456 + VNCR(AFSR0_EL1),/* Auxiliary Fault Status Register 0 */ 457 + VNCR(AFSR1_EL1),/* Auxiliary Fault Status Register 1 */ 458 + VNCR(FAR_EL1), /* Fault Address Register */ 459 + VNCR(MAIR_EL1), /* Memory Attribute Indirection Register */ 460 + VNCR(VBAR_EL1), /* Vector Base Address Register */ 461 + VNCR(CONTEXTIDR_EL1), /* Context ID Register */ 462 + VNCR(AMAIR_EL1),/* Aux Memory Attribute Indirection Register */ 463 + VNCR(MDSCR_EL1),/* Monitor Debug System Control Register */ 464 + VNCR(ELR_EL1), 465 + VNCR(SP_EL1), 466 + VNCR(SPSR_EL1), 467 + VNCR(TFSR_EL1), /* Tag Fault Status Register (EL1) */ 468 + VNCR(VPIDR_EL2),/* Virtualization Processor ID Register */ 469 + VNCR(VMPIDR_EL2),/* Virtualization Multiprocessor ID Register */ 470 + VNCR(HCR_EL2), /* Hypervisor Configuration Register */ 471 + VNCR(HSTR_EL2), /* Hypervisor System Trap Register */ 472 + VNCR(VTTBR_EL2),/* 
Virtualization Translation Table Base Register */ 473 + VNCR(VTCR_EL2), /* Virtualization Translation Control Register */ 474 + VNCR(TPIDR_EL2),/* EL2 Software Thread ID Register */ 475 + VNCR(HCRX_EL2), /* Extended Hypervisor Configuration Register */ 476 + 477 + /* Permission Indirection Extension registers */ 478 + VNCR(PIR_EL1), /* Permission Indirection Register 1 (EL1) */ 479 + VNCR(PIRE0_EL1), /* Permission Indirection Register 0 (EL1) */ 480 + 481 + VNCR(HFGRTR_EL2), 482 + VNCR(HFGWTR_EL2), 483 + VNCR(HFGITR_EL2), 484 + VNCR(HDFGRTR_EL2), 485 + VNCR(HDFGWTR_EL2), 486 + VNCR(HAFGRTR_EL2), 487 + 488 + VNCR(CNTVOFF_EL2), 489 + VNCR(CNTV_CVAL_EL0), 490 + VNCR(CNTV_CTL_EL0), 491 + VNCR(CNTP_CVAL_EL0), 492 + VNCR(CNTP_CTL_EL0), 430 493 431 494 NR_SYS_REGS /* Nothing after this line! */ 432 495 }; ··· 487 464 u64 sys_regs[NR_SYS_REGS]; 488 465 489 466 struct kvm_vcpu *__hyp_running_vcpu; 467 + 468 + /* This pointer has to be 4kB aligned. */ 469 + u64 *vncr_array; 490 470 }; 491 471 492 472 struct kvm_host_data { ··· 852 826 * accessed by a running VCPU. For example, for userspace access or 853 827 * for system registers that are never context switched, but only 854 828 * emulated. 829 + * 830 + * Don't bother with VNCR-based accesses in the nVHE code, it has no 831 + * business dealing with NV. 
855 832 */ 856 - #define __ctxt_sys_reg(c,r) (&(c)->sys_regs[(r)]) 833 + static inline u64 *__ctxt_sys_reg(const struct kvm_cpu_context *ctxt, int r) 834 + { 835 + #if !defined (__KVM_NVHE_HYPERVISOR__) 836 + if (unlikely(cpus_have_final_cap(ARM64_HAS_NESTED_VIRT) && 837 + r >= __VNCR_START__ && ctxt->vncr_array)) 838 + return &ctxt->vncr_array[r - __VNCR_START__]; 839 + #endif 840 + return (u64 *)&ctxt->sys_regs[r]; 841 + } 857 842 858 843 #define ctxt_sys_reg(c,r) (*__ctxt_sys_reg(c,r)) 859 844 ··· 908 871 case AMAIR_EL1: *val = read_sysreg_s(SYS_AMAIR_EL12); break; 909 872 case CNTKCTL_EL1: *val = read_sysreg_s(SYS_CNTKCTL_EL12); break; 910 873 case ELR_EL1: *val = read_sysreg_s(SYS_ELR_EL12); break; 874 + case SPSR_EL1: *val = read_sysreg_s(SYS_SPSR_EL12); break; 911 875 case PAR_EL1: *val = read_sysreg_par(); break; 912 876 case DACR32_EL2: *val = read_sysreg_s(SYS_DACR32_EL2); break; 913 877 case IFSR32_EL2: *val = read_sysreg_s(SYS_IFSR32_EL2); break; ··· 953 915 case AMAIR_EL1: write_sysreg_s(val, SYS_AMAIR_EL12); break; 954 916 case CNTKCTL_EL1: write_sysreg_s(val, SYS_CNTKCTL_EL12); break; 955 917 case ELR_EL1: write_sysreg_s(val, SYS_ELR_EL12); break; 918 + case SPSR_EL1: write_sysreg_s(val, SYS_SPSR_EL12); break; 956 919 case PAR_EL1: write_sysreg_s(val, SYS_PAR_EL1); break; 957 920 case DACR32_EL2: write_sysreg_s(val, SYS_DACR32_EL2); break; 958 921 case IFSR32_EL2: write_sysreg_s(val, SYS_IFSR32_EL2); break; ··· 992 953 993 954 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, 994 955 struct kvm_vcpu_events *events); 995 - 996 - #define KVM_ARCH_WANT_MMU_NOTIFIER 997 956 998 957 void kvm_arm_halt_guest(struct kvm *kvm); 999 958 void kvm_arm_resume_guest(struct kvm *kvm); ··· 1213 1176 1214 1177 #define kvm_vm_has_ran_once(kvm) \ 1215 1178 (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &(kvm)->arch.flags)) 1179 + 1180 + static inline bool __vcpu_has_feature(const struct kvm_arch *ka, int feature) 1181 + { 1182 + return test_bit(feature, ka->vcpu_features); 
1183 + } 1184 + 1185 + #define vcpu_has_feature(v, f) __vcpu_has_feature(&(v)->kvm->arch, (f)) 1216 1186 1217 1187 int kvm_trng_call(struct kvm_vcpu *vcpu); 1218 1188 #ifdef CONFIG_KVM
+50 -6
arch/arm64/include/asm/kvm_nested.h
··· 2 2 #ifndef __ARM64_KVM_NESTED_H 3 3 #define __ARM64_KVM_NESTED_H 4 4 5 - #include <asm/kvm_emulate.h> 5 + #include <linux/bitfield.h> 6 6 #include <linux/kvm_host.h> 7 + #include <asm/kvm_emulate.h> 7 8 8 9 static inline bool vcpu_has_nv(const struct kvm_vcpu *vcpu) 9 10 { ··· 13 12 vcpu_has_feature(vcpu, KVM_ARM_VCPU_HAS_EL2)); 14 13 } 15 14 15 + /* Translation helpers from non-VHE EL2 to EL1 */ 16 + static inline u64 tcr_el2_ps_to_tcr_el1_ips(u64 tcr_el2) 17 + { 18 + return (u64)FIELD_GET(TCR_EL2_PS_MASK, tcr_el2) << TCR_IPS_SHIFT; 19 + } 20 + 21 + static inline u64 translate_tcr_el2_to_tcr_el1(u64 tcr) 22 + { 23 + return TCR_EPD1_MASK | /* disable TTBR1_EL1 */ 24 + ((tcr & TCR_EL2_TBI) ? TCR_TBI0 : 0) | 25 + tcr_el2_ps_to_tcr_el1_ips(tcr) | 26 + (tcr & TCR_EL2_TG0_MASK) | 27 + (tcr & TCR_EL2_ORGN0_MASK) | 28 + (tcr & TCR_EL2_IRGN0_MASK) | 29 + (tcr & TCR_EL2_T0SZ_MASK); 30 + } 31 + 32 + static inline u64 translate_cptr_el2_to_cpacr_el1(u64 cptr_el2) 33 + { 34 + u64 cpacr_el1 = 0; 35 + 36 + if (cptr_el2 & CPTR_EL2_TTA) 37 + cpacr_el1 |= CPACR_ELx_TTA; 38 + if (!(cptr_el2 & CPTR_EL2_TFP)) 39 + cpacr_el1 |= CPACR_ELx_FPEN; 40 + if (!(cptr_el2 & CPTR_EL2_TZ)) 41 + cpacr_el1 |= CPACR_ELx_ZEN; 42 + 43 + return cpacr_el1; 44 + } 45 + 46 + static inline u64 translate_sctlr_el2_to_sctlr_el1(u64 val) 47 + { 48 + /* Only preserve the minimal set of bits we support */ 49 + val &= (SCTLR_ELx_M | SCTLR_ELx_A | SCTLR_ELx_C | SCTLR_ELx_SA | 50 + SCTLR_ELx_I | SCTLR_ELx_IESB | SCTLR_ELx_WXN | SCTLR_ELx_EE); 51 + val |= SCTLR_EL1_RES1; 52 + 53 + return val; 54 + } 55 + 56 + static inline u64 translate_ttbr0_el2_to_ttbr0_el1(u64 ttbr0) 57 + { 58 + /* Clear the ASID field */ 59 + return ttbr0 & ~GENMASK_ULL(63, 48); 60 + } 61 + 16 62 extern bool __check_nv_sr_forward(struct kvm_vcpu *vcpu); 17 63 18 - struct sys_reg_params; 19 - struct sys_reg_desc; 20 - 21 - void access_nested_id_reg(struct kvm_vcpu *v, struct sys_reg_params *p, 22 - const struct sys_reg_desc *r); 64 + int 
kvm_init_nv_sysregs(struct kvm *kvm); 23 65 24 66 #endif /* __ARM64_KVM_NESTED_H */
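The CPTR_EL2 to CPACR_EL1 helper above is worth spelling out: nVHE CPTR_EL2 uses positive trap polarity while CPACR_EL1 mostly uses enable polarity, so the translation inverts TFP/TZ but copies TTA straight through. A minimal standalone sketch of that logic — the bit positions below are illustrative stand-ins, not the kernel's authoritative CPTR_EL2_*/CPACR_ELx_* definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's CPTR_EL2_* / CPACR_ELx_* masks. */
#define CPTR_EL2_TTA   (1ULL << 28)  /* trap trace (set = trap)   */
#define CPTR_EL2_TFP   (1ULL << 10)  /* trap FP/SIMD (set = trap) */
#define CPTR_EL2_TZ    (1ULL << 8)   /* trap SVE (set = trap)     */
#define CPACR_ELx_TTA  (1ULL << 28)  /* trap trace (set = trap)   */
#define CPACR_ELx_FPEN (3ULL << 20)  /* FP enable (clear = trap)  */
#define CPACR_ELx_ZEN  (3ULL << 16)  /* SVE enable (clear = trap) */

/*
 * Same shape as translate_cptr_el2_to_cpacr_el1(): TTA keeps its
 * polarity, while an *unset* trap bit in CPTR_EL2 becomes a *set*
 * enable field in CPACR_EL1.
 */
uint64_t cptr_el2_to_cpacr_el1(uint64_t cptr_el2)
{
	uint64_t cpacr_el1 = 0;

	if (cptr_el2 & CPTR_EL2_TTA)
		cpacr_el1 |= CPACR_ELx_TTA;
	if (!(cptr_el2 & CPTR_EL2_TFP))
		cpacr_el1 |= CPACR_ELx_FPEN;
	if (!(cptr_el2 & CPTR_EL2_TZ))
		cpacr_el1 |= CPACR_ELx_ZEN;

	return cpacr_el1;
}
```

With CPTR_EL2 all-zero (nothing trapped at virtual EL2), the translated CPACR_EL1 has both FPEN and ZEN fully enabled.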
+51 -25
arch/arm64/include/asm/kvm_pgtable.h
··· 11 11 #include <linux/kvm_host.h> 12 12 #include <linux/types.h> 13 13 14 - #define KVM_PGTABLE_MAX_LEVELS 4U 14 + #define KVM_PGTABLE_FIRST_LEVEL -1 15 + #define KVM_PGTABLE_LAST_LEVEL 3 15 16 16 17 /* 17 18 * The largest supported block sizes for KVM (no 52-bit PA support): ··· 21 20 * - 64K (level 2): 512MB 22 21 */ 23 22 #ifdef CONFIG_ARM64_4K_PAGES 24 - #define KVM_PGTABLE_MIN_BLOCK_LEVEL 1U 23 + #define KVM_PGTABLE_MIN_BLOCK_LEVEL 1 25 24 #else 26 - #define KVM_PGTABLE_MIN_BLOCK_LEVEL 2U 25 + #define KVM_PGTABLE_MIN_BLOCK_LEVEL 2 27 26 #endif 28 27 29 - #define kvm_lpa2_is_enabled() false 28 + #define kvm_lpa2_is_enabled() system_supports_lpa2() 29 + 30 + static inline u64 kvm_get_parange_max(void) 31 + { 32 + if (kvm_lpa2_is_enabled() || 33 + (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && PAGE_SHIFT == 16)) 34 + return ID_AA64MMFR0_EL1_PARANGE_52; 35 + else 36 + return ID_AA64MMFR0_EL1_PARANGE_48; 37 + } 30 38 31 39 static inline u64 kvm_get_parange(u64 mmfr0) 32 40 { 41 + u64 parange_max = kvm_get_parange_max(); 33 42 u64 parange = cpuid_feature_extract_unsigned_field(mmfr0, 34 43 ID_AA64MMFR0_EL1_PARANGE_SHIFT); 35 - if (parange > ID_AA64MMFR0_EL1_PARANGE_MAX) 36 - parange = ID_AA64MMFR0_EL1_PARANGE_MAX; 44 + if (parange > parange_max) 45 + parange = parange_max; 37 46 38 47 return parange; 39 48 } ··· 54 43 55 44 #define KVM_PTE_ADDR_MASK GENMASK(47, PAGE_SHIFT) 56 45 #define KVM_PTE_ADDR_51_48 GENMASK(15, 12) 46 + #define KVM_PTE_ADDR_MASK_LPA2 GENMASK(49, PAGE_SHIFT) 47 + #define KVM_PTE_ADDR_51_50_LPA2 GENMASK(9, 8) 57 48 58 49 #define KVM_PHYS_INVALID (-1ULL) 59 50 ··· 66 53 67 54 static inline u64 kvm_pte_to_phys(kvm_pte_t pte) 68 55 { 69 - u64 pa = pte & KVM_PTE_ADDR_MASK; 56 + u64 pa; 70 57 71 - if (PAGE_SHIFT == 16) 72 - pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48; 58 + if (kvm_lpa2_is_enabled()) { 59 + pa = pte & KVM_PTE_ADDR_MASK_LPA2; 60 + pa |= FIELD_GET(KVM_PTE_ADDR_51_50_LPA2, pte) << 50; 61 + } else { 62 + pa = pte & KVM_PTE_ADDR_MASK; 63 + 
if (PAGE_SHIFT == 16) 64 + pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48; 65 + } 73 66 74 67 return pa; 75 68 } 76 69 77 70 static inline kvm_pte_t kvm_phys_to_pte(u64 pa) 78 71 { 79 - kvm_pte_t pte = pa & KVM_PTE_ADDR_MASK; 72 + kvm_pte_t pte; 80 73 81 - if (PAGE_SHIFT == 16) { 82 - pa &= GENMASK(51, 48); 83 - pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48); 74 + if (kvm_lpa2_is_enabled()) { 75 + pte = pa & KVM_PTE_ADDR_MASK_LPA2; 76 + pa &= GENMASK(51, 50); 77 + pte |= FIELD_PREP(KVM_PTE_ADDR_51_50_LPA2, pa >> 50); 78 + } else { 79 + pte = pa & KVM_PTE_ADDR_MASK; 80 + if (PAGE_SHIFT == 16) { 81 + pa &= GENMASK(51, 48); 82 + pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48); 83 + } 84 84 } 85 85 86 86 return pte; ··· 104 78 return __phys_to_pfn(kvm_pte_to_phys(pte)); 105 79 } 106 80 107 - static inline u64 kvm_granule_shift(u32 level) 81 + static inline u64 kvm_granule_shift(s8 level) 108 82 { 109 - /* Assumes KVM_PGTABLE_MAX_LEVELS is 4 */ 83 + /* Assumes KVM_PGTABLE_LAST_LEVEL is 3 */ 110 84 return ARM64_HW_PGTABLE_LEVEL_SHIFT(level); 111 85 } 112 86 113 - static inline u64 kvm_granule_size(u32 level) 87 + static inline u64 kvm_granule_size(s8 level) 114 88 { 115 89 return BIT(kvm_granule_shift(level)); 116 90 } 117 91 118 - static inline bool kvm_level_supports_block_mapping(u32 level) 92 + static inline bool kvm_level_supports_block_mapping(s8 level) 119 93 { 120 94 return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL; 121 95 } 122 96 123 97 static inline u32 kvm_supported_block_sizes(void) 124 98 { 125 - u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL; 99 + s8 level = KVM_PGTABLE_MIN_BLOCK_LEVEL; 126 100 u32 r = 0; 127 101 128 - for (; level < KVM_PGTABLE_MAX_LEVELS; level++) 102 + for (; level <= KVM_PGTABLE_LAST_LEVEL; level++) 129 103 r |= BIT(kvm_granule_shift(level)); 130 104 131 105 return r; ··· 170 144 void* (*zalloc_page)(void *arg); 171 145 void* (*zalloc_pages_exact)(size_t size); 172 146 void (*free_pages_exact)(void *addr, size_t size); 173 - void 
(*free_unlinked_table)(void *addr, u32 level); 147 + void (*free_unlinked_table)(void *addr, s8 level); 174 148 void (*get_page)(void *addr); 175 149 void (*put_page)(void *addr); 176 150 int (*page_count)(void *addr); ··· 266 240 u64 start; 267 241 u64 addr; 268 242 u64 end; 269 - u32 level; 243 + s8 level; 270 244 enum kvm_pgtable_walk_flags flags; 271 245 }; 272 246 ··· 369 343 */ 370 344 struct kvm_pgtable { 371 345 u32 ia_bits; 372 - u32 start_level; 346 + s8 start_level; 373 347 kvm_pteref_t pgd; 374 348 struct kvm_pgtable_mm_ops *mm_ops; 375 349 ··· 503 477 * The page-table is assumed to be unreachable by any hardware walkers prior to 504 478 * freeing and therefore no TLB invalidation is performed. 505 479 */ 506 - void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level); 480 + void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level); 507 481 508 482 /** 509 483 * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure. ··· 527 501 * an ERR_PTR(error) on failure. 528 502 */ 529 503 kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt, 530 - u64 phys, u32 level, 504 + u64 phys, s8 level, 531 505 enum kvm_pgtable_prot prot, 532 506 void *mc, bool force_pte); 533 507 ··· 753 727 * Return: 0 on success, negative error code on failure. 754 728 */ 755 729 int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr, 756 - kvm_pte_t *ptep, u32 *level); 730 + kvm_pte_t *ptep, s8 *level); 757 731 758 732 /** 759 733 * kvm_pgtable_stage2_pte_prot() - Retrieve the protection attributes of a
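The LPA2 branch added to kvm_pte_to_phys()/kvm_phys_to_pte() changes where the high output-address bits live: PA bits [49:PAGE_SHIFT] stay in place, while PA[51:50] move into PTE bits [9:8]. A self-contained round-trip sketch of that packing, assuming 4K pages (PAGE_SHIFT = 12) and using a local GENMASK helper in place of the kernel macros:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assume 4K pages for this sketch */
#define GENMASK64(h, l) (((~0ULL) >> (63 - (h))) & (~0ULL << (l)))

/* Mirrors KVM_PTE_ADDR_MASK_LPA2 / KVM_PTE_ADDR_51_50_LPA2 above. */
uint64_t phys_to_pte_lpa2(uint64_t pa)
{
	uint64_t pte = pa & GENMASK64(49, PAGE_SHIFT);

	/* PA[51:50] are carried in PTE bits [9:8] */
	pte |= ((pa & GENMASK64(51, 50)) >> 50) << 8;
	return pte;
}

uint64_t pte_to_phys_lpa2(uint64_t pte)
{
	uint64_t pa = pte & GENMASK64(49, PAGE_SHIFT);

	pa |= ((pte & GENMASK64(9, 8)) >> 8) << 50;
	return pa;
}
```

Any page-aligned 52-bit physical address survives the phys-to-pte-to-phys round trip unchanged, which is the invariant the kernel helpers rely on.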
+3 -2
arch/arm64/include/asm/kvm_pkvm.h
··· 56 56 57 57 static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages) 58 58 { 59 - unsigned long total = 0, i; 59 + unsigned long total = 0; 60 + int i; 60 61 61 62 /* Provision the worst case scenario */ 62 - for (i = 0; i < KVM_PGTABLE_MAX_LEVELS; i++) { 63 + for (i = KVM_PGTABLE_FIRST_LEVEL; i <= KVM_PGTABLE_LAST_LEVEL; i++) { 63 64 nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE); 64 65 total += nr_pages; 65 66 }
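The reworked loop above now walks one extra level (-1) so the worst-case provisioning accounts for LPA2's 5-level format. A standalone sketch with the same shape, assuming 512 PTEs per 4K table page:

```c
#include <assert.h>

#define KVM_PGTABLE_FIRST_LEVEL	-1
#define KVM_PGTABLE_LAST_LEVEL	3
#define PTRS_PER_PTE		512	/* 4K pages: 512 u64 entries per table */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Worst-case page-table pages needed to map nr_pages, per the hunk above. */
unsigned long hyp_pgtable_max_pages(unsigned long nr_pages)
{
	unsigned long total = 0;
	int i;

	for (i = KVM_PGTABLE_FIRST_LEVEL; i <= KVM_PGTABLE_LAST_LEVEL; i++) {
		/* each level needs one table page per 512 entries below it */
		nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
		total += nr_pages;
	}
	return total;
}
```

For a 4GB range (1M 4K pages) this provisions 2048 + 4 + 1 + 1 + 1 = 2055 table pages; the extra level costs at most one additional page per walk.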
+103
arch/arm64/include/asm/vncr_mapping.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * System register offsets in the VNCR page 4 + * All offsets are *byte* displacements! 5 + */ 6 + 7 + #ifndef __ARM64_VNCR_MAPPING_H__ 8 + #define __ARM64_VNCR_MAPPING_H__ 9 + 10 + #define VNCR_VTTBR_EL2 0x020 11 + #define VNCR_VTCR_EL2 0x040 12 + #define VNCR_VMPIDR_EL2 0x050 13 + #define VNCR_CNTVOFF_EL2 0x060 14 + #define VNCR_HCR_EL2 0x078 15 + #define VNCR_HSTR_EL2 0x080 16 + #define VNCR_VPIDR_EL2 0x088 17 + #define VNCR_TPIDR_EL2 0x090 18 + #define VNCR_HCRX_EL2 0x0A0 19 + #define VNCR_VNCR_EL2 0x0B0 20 + #define VNCR_CPACR_EL1 0x100 21 + #define VNCR_CONTEXTIDR_EL1 0x108 22 + #define VNCR_SCTLR_EL1 0x110 23 + #define VNCR_ACTLR_EL1 0x118 24 + #define VNCR_TCR_EL1 0x120 25 + #define VNCR_AFSR0_EL1 0x128 26 + #define VNCR_AFSR1_EL1 0x130 27 + #define VNCR_ESR_EL1 0x138 28 + #define VNCR_MAIR_EL1 0x140 29 + #define VNCR_AMAIR_EL1 0x148 30 + #define VNCR_MDSCR_EL1 0x158 31 + #define VNCR_SPSR_EL1 0x160 32 + #define VNCR_CNTV_CVAL_EL0 0x168 33 + #define VNCR_CNTV_CTL_EL0 0x170 34 + #define VNCR_CNTP_CVAL_EL0 0x178 35 + #define VNCR_CNTP_CTL_EL0 0x180 36 + #define VNCR_SCXTNUM_EL1 0x188 37 + #define VNCR_TFSR_EL1 0x190 38 + #define VNCR_HFGRTR_EL2 0x1B8 39 + #define VNCR_HFGWTR_EL2 0x1C0 40 + #define VNCR_HFGITR_EL2 0x1C8 41 + #define VNCR_HDFGRTR_EL2 0x1D0 42 + #define VNCR_HDFGWTR_EL2 0x1D8 43 + #define VNCR_ZCR_EL1 0x1E0 44 + #define VNCR_HAFGRTR_EL2 0x1E8 45 + #define VNCR_TTBR0_EL1 0x200 46 + #define VNCR_TTBR1_EL1 0x210 47 + #define VNCR_FAR_EL1 0x220 48 + #define VNCR_ELR_EL1 0x230 49 + #define VNCR_SP_EL1 0x240 50 + #define VNCR_VBAR_EL1 0x250 51 + #define VNCR_TCR2_EL1 0x270 52 + #define VNCR_PIRE0_EL1 0x290 53 + #define VNCR_PIRE0_EL2 0x298 54 + #define VNCR_PIR_EL1 0x2A0 55 + #define VNCR_ICH_LR0_EL2 0x400 56 + #define VNCR_ICH_LR1_EL2 0x408 57 + #define VNCR_ICH_LR2_EL2 0x410 58 + #define VNCR_ICH_LR3_EL2 0x418 59 + #define VNCR_ICH_LR4_EL2 0x420 60 + #define VNCR_ICH_LR5_EL2 0x428 61 + #define 
VNCR_ICH_LR6_EL2 0x430 62 + #define VNCR_ICH_LR7_EL2 0x438 63 + #define VNCR_ICH_LR8_EL2 0x440 64 + #define VNCR_ICH_LR9_EL2 0x448 65 + #define VNCR_ICH_LR10_EL2 0x450 66 + #define VNCR_ICH_LR11_EL2 0x458 67 + #define VNCR_ICH_LR12_EL2 0x460 68 + #define VNCR_ICH_LR13_EL2 0x468 69 + #define VNCR_ICH_LR14_EL2 0x470 70 + #define VNCR_ICH_LR15_EL2 0x478 71 + #define VNCR_ICH_AP0R0_EL2 0x480 72 + #define VNCR_ICH_AP0R1_EL2 0x488 73 + #define VNCR_ICH_AP0R2_EL2 0x490 74 + #define VNCR_ICH_AP0R3_EL2 0x498 75 + #define VNCR_ICH_AP1R0_EL2 0x4A0 76 + #define VNCR_ICH_AP1R1_EL2 0x4A8 77 + #define VNCR_ICH_AP1R2_EL2 0x4B0 78 + #define VNCR_ICH_AP1R3_EL2 0x4B8 79 + #define VNCR_ICH_HCR_EL2 0x4C0 80 + #define VNCR_ICH_VMCR_EL2 0x4C8 81 + #define VNCR_VDISR_EL2 0x500 82 + #define VNCR_PMBLIMITR_EL1 0x800 83 + #define VNCR_PMBPTR_EL1 0x810 84 + #define VNCR_PMBSR_EL1 0x820 85 + #define VNCR_PMSCR_EL1 0x828 86 + #define VNCR_PMSEVFR_EL1 0x830 87 + #define VNCR_PMSICR_EL1 0x838 88 + #define VNCR_PMSIRR_EL1 0x840 89 + #define VNCR_PMSLATFR_EL1 0x848 90 + #define VNCR_TRFCR_EL1 0x880 91 + #define VNCR_MPAM1_EL1 0x900 92 + #define VNCR_MPAMHCR_EL2 0x930 93 + #define VNCR_MPAMVPMV_EL2 0x938 94 + #define VNCR_MPAMVPM0_EL2 0x940 95 + #define VNCR_MPAMVPM1_EL2 0x948 96 + #define VNCR_MPAMVPM2_EL2 0x950 97 + #define VNCR_MPAMVPM3_EL2 0x958 98 + #define VNCR_MPAMVPM4_EL2 0x960 99 + #define VNCR_MPAMVPM5_EL2 0x968 100 + #define VNCR_MPAMVPM6_EL2 0x970 101 + #define VNCR_MPAMVPM7_EL2 0x978 102 + 103 + #endif /* __ARM64_VNCR_MAPPING_H__ */
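As the header notes, every VNCR_* value is a *byte* displacement into the 4kB VNCR page, so a register's slot in a u64-typed backing array is simply offset / 8. A small sketch of that lookup (the accessor name and demo array are hypothetical; the offsets are the ones defined above):

```c
#include <assert.h>
#include <stdint.h>

/* A few byte offsets from vncr_mapping.h (all 8-byte aligned). */
#define VNCR_HCR_EL2	0x078
#define VNCR_TPIDR_EL2	0x090
#define VNCR_ZCR_EL1	0x1E0

/* Hypothetical demo backing store: one 4kB VNCR page = 512 u64 slots. */
static uint64_t demo_vncr_page[512];

/*
 * Index the VNCR page by byte offset; the assert mirrors the fact
 * that the architecture only defines doubleword-aligned mappings.
 */
uint64_t vncr_read(const uint64_t *vncr_page, unsigned int byte_off)
{
	assert(byte_off % 8 == 0 && byte_off < 4096);
	return vncr_page[byte_off / 8];
}
```

This is the arithmetic behind the __ctxt_sys_reg() change earlier in the pull, where VNCR-capable registers resolve to `&ctxt->vncr_array[r - __VNCR_START__]` instead of the usual sys_regs slot.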
+1 -1
arch/arm64/kernel/cpufeature.c
··· 2341 2341 .capability = ARM64_HAS_NESTED_VIRT, 2342 2342 .type = ARM64_CPUCAP_SYSTEM_FEATURE, 2343 2343 .matches = has_nested_virt_support, 2344 - ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, NV, IMP) 2344 + ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, NV, NV2) 2345 2345 }, 2346 2346 { 2347 2347 .capability = ARM64_HAS_32BIT_EL0_DO_NOT_USE,
+2 -5
arch/arm64/kvm/Kconfig
··· 21 21 menuconfig KVM 22 22 bool "Kernel-based Virtual Machine (KVM) support" 23 23 depends on HAVE_KVM 24 + select KVM_COMMON 24 25 select KVM_GENERIC_HARDWARE_ENABLING 25 - select MMU_NOTIFIER 26 - select PREEMPT_NOTIFIERS 26 + select KVM_GENERIC_MMU_NOTIFIER 27 27 select HAVE_KVM_CPU_RELAX_INTERCEPT 28 28 select KVM_MMIO 29 29 select KVM_GENERIC_DIRTYLOG_READ_PROTECT 30 30 select KVM_XFER_TO_GUEST_WORK 31 31 select KVM_VFIO 32 - select HAVE_KVM_EVENTFD 33 - select HAVE_KVM_IRQFD 34 32 select HAVE_KVM_DIRTY_RING_ACQ_REL 35 33 select NEED_KVM_DIRTY_RING_WITH_BITMAP 36 34 select HAVE_KVM_MSI ··· 39 41 select HAVE_KVM_VCPU_RUN_PID_CHANGE 40 42 select SCHED_INFO 41 43 select GUEST_PERF_EVENTS if PERF_EVENTS 42 - select INTERVAL_TREE 43 44 select XARRAY_MULTI 44 45 help 45 46 Support hosting virtualized guest machines.
+1 -2
arch/arm64/kvm/arch_timer.c
··· 295 295 u64 val = vcpu_get_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu)); 296 296 struct arch_timer_context *ctx; 297 297 298 - ctx = (vcpu_has_nv(vcpu) && is_hyp_ctxt(vcpu)) ? vcpu_hvtimer(vcpu) 299 - : vcpu_vtimer(vcpu); 298 + ctx = is_hyp_ctxt(vcpu) ? vcpu_hvtimer(vcpu) : vcpu_vtimer(vcpu); 300 299 301 300 return kvm_counter_compute_delta(ctx, val); 302 301 }
+11 -1
arch/arm64/kvm/arm.c
··· 221 221 r = vgic_present; 222 222 break; 223 223 case KVM_CAP_IOEVENTFD: 224 - case KVM_CAP_DEVICE_CTRL: 225 224 case KVM_CAP_USER_MEMORY: 226 225 case KVM_CAP_SYNC_MMU: 227 226 case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: ··· 664 665 * first time on this VM. 665 666 */ 666 667 ret = kvm_vgic_map_resources(kvm); 668 + if (ret) 669 + return ret; 670 + } 671 + 672 + if (vcpu_has_nv(vcpu)) { 673 + ret = kvm_init_nv_sysregs(vcpu->kvm); 667 674 if (ret) 668 675 return ret; 669 676 } ··· 1842 1837 static void __init cpu_prepare_hyp_mode(int cpu, u32 hyp_va_bits) 1843 1838 { 1844 1839 struct kvm_nvhe_init_params *params = per_cpu_ptr_nvhe_sym(kvm_init_params, cpu); 1840 + u64 mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1); 1845 1841 unsigned long tcr; 1846 1842 1847 1843 /* ··· 1865 1859 } 1866 1860 tcr &= ~TCR_T0SZ_MASK; 1867 1861 tcr |= TCR_T0SZ(hyp_va_bits); 1862 + tcr &= ~TCR_EL2_PS_MASK; 1863 + tcr |= FIELD_PREP(TCR_EL2_PS_MASK, kvm_get_parange(mmfr0)); 1864 + if (kvm_lpa2_is_enabled()) 1865 + tcr |= TCR_EL2_DS; 1868 1866 params->tcr_el2 = tcr; 1869 1867 1870 1868 params->pgd_pa = kvm_mmu_get_httbr();
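The cpu_prepare_hyp_mode() change above moves TCR_EL2 PS-field programming out of the hyp-init.S assembly (see the tcr_compute_pa_size removal later in this pull) into C, and sets TCR_EL2.DS when LPA2 is in use. A hedged sketch of that step — the mask and DS bit positions here are assumptions for illustration, not the kernel's definitions:

```c
#include <assert.h>
#include <stdint.h>

#define TCR_EL2_PS_SHIFT 16
#define TCR_EL2_PS_MASK  (7ULL << TCR_EL2_PS_SHIFT) /* assumed bits [18:16] */
#define TCR_EL2_DS       (1ULL << 32)               /* assumed bit position */

/*
 * Program the physical-address-size field from the sanitised PARange
 * value once at init time, and flag LPA2 descriptors via TCR_EL2.DS.
 */
uint64_t prepare_tcr_el2(uint64_t tcr, uint64_t parange, int lpa2)
{
	tcr &= ~TCR_EL2_PS_MASK;
	tcr |= (parange << TCR_EL2_PS_SHIFT) & TCR_EL2_PS_MASK;
	if (lpa2)
		tcr |= TCR_EL2_DS;
	return tcr;
}
```

Doing this once in C, from read_sanitised_ftr_reg(), avoids re-deriving the PA size on every CPU in assembly.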
+63
arch/arm64/kvm/emulate-nested.c
··· 1012 1012 HDFGRTR_GROUP, 1013 1013 HDFGWTR_GROUP, 1014 1014 HFGITR_GROUP, 1015 + HAFGRTR_GROUP, 1015 1016 1016 1017 /* Must be last */ 1017 1018 __NR_FGT_GROUP_IDS__ ··· 1043 1042 1044 1043 static const struct encoding_to_trap_config encoding_to_fgt[] __initconst = { 1045 1044 /* HFGRTR_EL2, HFGWTR_EL2 */ 1045 + SR_FGT(SYS_AMAIR2_EL1, HFGxTR, nAMAIR2_EL1, 0), 1046 + SR_FGT(SYS_MAIR2_EL1, HFGxTR, nMAIR2_EL1, 0), 1047 + SR_FGT(SYS_S2POR_EL1, HFGxTR, nS2POR_EL1, 0), 1048 + SR_FGT(SYS_POR_EL1, HFGxTR, nPOR_EL1, 0), 1049 + SR_FGT(SYS_POR_EL0, HFGxTR, nPOR_EL0, 0), 1046 1050 SR_FGT(SYS_PIR_EL1, HFGxTR, nPIR_EL1, 0), 1047 1051 SR_FGT(SYS_PIRE0_EL1, HFGxTR, nPIRE0_EL1, 0), 1052 + SR_FGT(SYS_RCWMASK_EL1, HFGxTR, nRCWMASK_EL1, 0), 1048 1053 SR_FGT(SYS_TPIDR2_EL0, HFGxTR, nTPIDR2_EL0, 0), 1049 1054 SR_FGT(SYS_SMPRI_EL1, HFGxTR, nSMPRI_EL1, 0), 1055 + SR_FGT(SYS_GCSCR_EL1, HFGxTR, nGCS_EL1, 0), 1056 + SR_FGT(SYS_GCSPR_EL1, HFGxTR, nGCS_EL1, 0), 1057 + SR_FGT(SYS_GCSCRE0_EL1, HFGxTR, nGCS_EL0, 0), 1058 + SR_FGT(SYS_GCSPR_EL0, HFGxTR, nGCS_EL0, 0), 1050 1059 SR_FGT(SYS_ACCDATA_EL1, HFGxTR, nACCDATA_EL1, 0), 1051 1060 SR_FGT(SYS_ERXADDR_EL1, HFGxTR, ERXADDR_EL1, 1), 1052 1061 SR_FGT(SYS_ERXPFGCDN_EL1, HFGxTR, ERXPFGCDN_EL1, 1), ··· 1118 1107 SR_FGT(SYS_AFSR1_EL1, HFGxTR, AFSR1_EL1, 1), 1119 1108 SR_FGT(SYS_AFSR0_EL1, HFGxTR, AFSR0_EL1, 1), 1120 1109 /* HFGITR_EL2 */ 1110 + SR_FGT(OP_AT_S1E1A, HFGITR, ATS1E1A, 1), 1111 + SR_FGT(OP_COSP_RCTX, HFGITR, COSPRCTX, 1), 1112 + SR_FGT(OP_GCSPUSHX, HFGITR, nGCSEPP, 0), 1113 + SR_FGT(OP_GCSPOPX, HFGITR, nGCSEPP, 0), 1114 + SR_FGT(OP_GCSPUSHM, HFGITR, nGCSPUSHM_EL1, 0), 1121 1115 SR_FGT(OP_BRB_IALL, HFGITR, nBRBIALL, 0), 1122 1116 SR_FGT(OP_BRB_INJ, HFGITR, nBRBINJ, 0), 1123 1117 SR_FGT(SYS_DC_CVAC, HFGITR, DCCVAC, 1), ··· 1690 1674 SR_FGT(SYS_PMCR_EL0, HDFGWTR, PMCR_EL0, 1), 1691 1675 SR_FGT(SYS_PMSWINC_EL0, HDFGWTR, PMSWINC_EL0, 1), 1692 1676 SR_FGT(SYS_OSLAR_EL1, HDFGWTR, OSLAR_EL1, 1), 1677 + /* 1678 + * HAFGRTR_EL2 1679 + */ 1680 + 
SR_FGT(SYS_AMEVTYPER1_EL0(15), HAFGRTR, AMEVTYPER115_EL0, 1), 1681 + SR_FGT(SYS_AMEVTYPER1_EL0(14), HAFGRTR, AMEVTYPER114_EL0, 1), 1682 + SR_FGT(SYS_AMEVTYPER1_EL0(13), HAFGRTR, AMEVTYPER113_EL0, 1), 1683 + SR_FGT(SYS_AMEVTYPER1_EL0(12), HAFGRTR, AMEVTYPER112_EL0, 1), 1684 + SR_FGT(SYS_AMEVTYPER1_EL0(11), HAFGRTR, AMEVTYPER111_EL0, 1), 1685 + SR_FGT(SYS_AMEVTYPER1_EL0(10), HAFGRTR, AMEVTYPER110_EL0, 1), 1686 + SR_FGT(SYS_AMEVTYPER1_EL0(9), HAFGRTR, AMEVTYPER19_EL0, 1), 1687 + SR_FGT(SYS_AMEVTYPER1_EL0(8), HAFGRTR, AMEVTYPER18_EL0, 1), 1688 + SR_FGT(SYS_AMEVTYPER1_EL0(7), HAFGRTR, AMEVTYPER17_EL0, 1), 1689 + SR_FGT(SYS_AMEVTYPER1_EL0(6), HAFGRTR, AMEVTYPER16_EL0, 1), 1690 + SR_FGT(SYS_AMEVTYPER1_EL0(5), HAFGRTR, AMEVTYPER15_EL0, 1), 1691 + SR_FGT(SYS_AMEVTYPER1_EL0(4), HAFGRTR, AMEVTYPER14_EL0, 1), 1692 + SR_FGT(SYS_AMEVTYPER1_EL0(3), HAFGRTR, AMEVTYPER13_EL0, 1), 1693 + SR_FGT(SYS_AMEVTYPER1_EL0(2), HAFGRTR, AMEVTYPER12_EL0, 1), 1694 + SR_FGT(SYS_AMEVTYPER1_EL0(1), HAFGRTR, AMEVTYPER11_EL0, 1), 1695 + SR_FGT(SYS_AMEVTYPER1_EL0(0), HAFGRTR, AMEVTYPER10_EL0, 1), 1696 + SR_FGT(SYS_AMEVCNTR1_EL0(15), HAFGRTR, AMEVCNTR115_EL0, 1), 1697 + SR_FGT(SYS_AMEVCNTR1_EL0(14), HAFGRTR, AMEVCNTR114_EL0, 1), 1698 + SR_FGT(SYS_AMEVCNTR1_EL0(13), HAFGRTR, AMEVCNTR113_EL0, 1), 1699 + SR_FGT(SYS_AMEVCNTR1_EL0(12), HAFGRTR, AMEVCNTR112_EL0, 1), 1700 + SR_FGT(SYS_AMEVCNTR1_EL0(11), HAFGRTR, AMEVCNTR111_EL0, 1), 1701 + SR_FGT(SYS_AMEVCNTR1_EL0(10), HAFGRTR, AMEVCNTR110_EL0, 1), 1702 + SR_FGT(SYS_AMEVCNTR1_EL0(9), HAFGRTR, AMEVCNTR19_EL0, 1), 1703 + SR_FGT(SYS_AMEVCNTR1_EL0(8), HAFGRTR, AMEVCNTR18_EL0, 1), 1704 + SR_FGT(SYS_AMEVCNTR1_EL0(7), HAFGRTR, AMEVCNTR17_EL0, 1), 1705 + SR_FGT(SYS_AMEVCNTR1_EL0(6), HAFGRTR, AMEVCNTR16_EL0, 1), 1706 + SR_FGT(SYS_AMEVCNTR1_EL0(5), HAFGRTR, AMEVCNTR15_EL0, 1), 1707 + SR_FGT(SYS_AMEVCNTR1_EL0(4), HAFGRTR, AMEVCNTR14_EL0, 1), 1708 + SR_FGT(SYS_AMEVCNTR1_EL0(3), HAFGRTR, AMEVCNTR13_EL0, 1), 1709 + SR_FGT(SYS_AMEVCNTR1_EL0(2), HAFGRTR, AMEVCNTR12_EL0, 1), 
1710 + SR_FGT(SYS_AMEVCNTR1_EL0(1), HAFGRTR, AMEVCNTR11_EL0, 1), 1711 + SR_FGT(SYS_AMEVCNTR1_EL0(0), HAFGRTR, AMEVCNTR10_EL0, 1), 1712 + SR_FGT(SYS_AMCNTENCLR1_EL0, HAFGRTR, AMCNTEN1, 1), 1713 + SR_FGT(SYS_AMCNTENSET1_EL0, HAFGRTR, AMCNTEN1, 1), 1714 + SR_FGT(SYS_AMCNTENCLR0_EL0, HAFGRTR, AMCNTEN0, 1), 1715 + SR_FGT(SYS_AMCNTENSET0_EL0, HAFGRTR, AMCNTEN0, 1), 1716 + SR_FGT(SYS_AMEVCNTR0_EL0(3), HAFGRTR, AMEVCNTR03_EL0, 1), 1717 + SR_FGT(SYS_AMEVCNTR0_EL0(2), HAFGRTR, AMEVCNTR02_EL0, 1), 1718 + SR_FGT(SYS_AMEVCNTR0_EL0(1), HAFGRTR, AMEVCNTR01_EL0, 1), 1719 + SR_FGT(SYS_AMEVCNTR0_EL0(0), HAFGRTR, AMEVCNTR00_EL0, 1), 1693 1720 }; 1694 1721 1695 1722 static union trap_config get_trap_config(u32 sysreg) ··· 1951 1892 val = sanitised_sys_reg(vcpu, HDFGRTR_EL2); 1952 1893 else 1953 1894 val = sanitised_sys_reg(vcpu, HDFGWTR_EL2); 1895 + break; 1896 + 1897 + case HAFGRTR_GROUP: 1898 + val = sanitised_sys_reg(vcpu, HAFGRTR_EL2); 1954 1899 break; 1955 1900 1956 1901 case HFGITR_GROUP:
+1 -1
arch/arm64/kvm/hyp/include/hyp/fault.h
··· 60 60 */ 61 61 if (!(esr & ESR_ELx_S1PTW) && 62 62 (cpus_have_final_cap(ARM64_WORKAROUND_834220) || 63 - (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_PERM)) { 63 + esr_fsc_is_permission_fault(esr))) { 64 64 if (!__translate_far_to_hpfar(far, &hpfar)) 65 65 return false; 66 66 } else {
+62 -31
arch/arm64/kvm/hyp/include/hyp/switch.h
··· 79 79 clr |= ~hfg & __ ## reg ## _nMASK; \ 80 80 } while(0) 81 81 82 + #define update_fgt_traps_cs(vcpu, reg, clr, set) \ 83 + do { \ 84 + struct kvm_cpu_context *hctxt = \ 85 + &this_cpu_ptr(&kvm_host_data)->host_ctxt; \ 86 + u64 c = 0, s = 0; \ 87 + \ 88 + ctxt_sys_reg(hctxt, reg) = read_sysreg_s(SYS_ ## reg); \ 89 + compute_clr_set(vcpu, reg, c, s); \ 90 + s |= set; \ 91 + c |= clr; \ 92 + if (c || s) { \ 93 + u64 val = __ ## reg ## _nMASK; \ 94 + val |= s; \ 95 + val &= ~c; \ 96 + write_sysreg_s(val, SYS_ ## reg); \ 97 + } \ 98 + } while(0) 99 + 100 + #define update_fgt_traps(vcpu, reg) \ 101 + update_fgt_traps_cs(vcpu, reg, 0, 0) 102 + 103 + /* 104 + * Validate the fine grain trap masks. 105 + * Check that the masks do not overlap and that all bits are accounted for. 106 + */ 107 + #define CHECK_FGT_MASKS(reg) \ 108 + do { \ 109 + BUILD_BUG_ON((__ ## reg ## _MASK) & (__ ## reg ## _nMASK)); \ 110 + BUILD_BUG_ON(~((__ ## reg ## _RES0) ^ (__ ## reg ## _MASK) ^ \ 111 + (__ ## reg ## _nMASK))); \ 112 + } while(0) 113 + 114 + static inline bool cpu_has_amu(void) 115 + { 116 + u64 pfr0 = read_sysreg_s(SYS_ID_AA64PFR0_EL1); 117 + 118 + return cpuid_feature_extract_unsigned_field(pfr0, 119 + ID_AA64PFR0_EL1_AMU_SHIFT); 120 + } 82 121 83 122 static inline void __activate_traps_hfgxtr(struct kvm_vcpu *vcpu) 84 123 { 85 124 struct kvm_cpu_context *hctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt; 86 125 u64 r_clr = 0, w_clr = 0, r_set = 0, w_set = 0, tmp; 87 126 u64 r_val, w_val; 127 + 128 + CHECK_FGT_MASKS(HFGRTR_EL2); 129 + CHECK_FGT_MASKS(HFGWTR_EL2); 130 + CHECK_FGT_MASKS(HFGITR_EL2); 131 + CHECK_FGT_MASKS(HDFGRTR_EL2); 132 + CHECK_FGT_MASKS(HDFGWTR_EL2); 133 + CHECK_FGT_MASKS(HAFGRTR_EL2); 134 + CHECK_FGT_MASKS(HCRX_EL2); 88 135 89 136 if (!cpus_have_final_cap(ARM64_HAS_FGT)) 90 137 return; ··· 157 110 compute_clr_set(vcpu, HFGWTR_EL2, w_clr, w_set); 158 111 } 159 112 160 - /* The default is not to trap anything but ACCDATA_EL1 */ 161 - r_val = __HFGRTR_EL2_nMASK 
& ~HFGxTR_EL2_nACCDATA_EL1; 113 + /* The default to trap everything not handled or supported in KVM. */ 114 + tmp = HFGxTR_EL2_nAMAIR2_EL1 | HFGxTR_EL2_nMAIR2_EL1 | HFGxTR_EL2_nS2POR_EL1 | 115 + HFGxTR_EL2_nPOR_EL1 | HFGxTR_EL2_nPOR_EL0 | HFGxTR_EL2_nACCDATA_EL1; 116 + 117 + r_val = __HFGRTR_EL2_nMASK & ~tmp; 162 118 r_val |= r_set; 163 119 r_val &= ~r_clr; 164 120 165 - w_val = __HFGWTR_EL2_nMASK & ~HFGxTR_EL2_nACCDATA_EL1; 121 + w_val = __HFGWTR_EL2_nMASK & ~tmp; 166 122 w_val |= w_set; 167 123 w_val &= ~w_clr; 168 124 ··· 175 125 if (!vcpu_has_nv(vcpu) || is_hyp_ctxt(vcpu)) 176 126 return; 177 127 178 - ctxt_sys_reg(hctxt, HFGITR_EL2) = read_sysreg_s(SYS_HFGITR_EL2); 128 + update_fgt_traps(vcpu, HFGITR_EL2); 129 + update_fgt_traps(vcpu, HDFGRTR_EL2); 130 + update_fgt_traps(vcpu, HDFGWTR_EL2); 179 131 180 - r_set = r_clr = 0; 181 - compute_clr_set(vcpu, HFGITR_EL2, r_clr, r_set); 182 - r_val = __HFGITR_EL2_nMASK; 183 - r_val |= r_set; 184 - r_val &= ~r_clr; 185 - 186 - write_sysreg_s(r_val, SYS_HFGITR_EL2); 187 - 188 - ctxt_sys_reg(hctxt, HDFGRTR_EL2) = read_sysreg_s(SYS_HDFGRTR_EL2); 189 - ctxt_sys_reg(hctxt, HDFGWTR_EL2) = read_sysreg_s(SYS_HDFGWTR_EL2); 190 - 191 - r_clr = r_set = w_clr = w_set = 0; 192 - 193 - compute_clr_set(vcpu, HDFGRTR_EL2, r_clr, r_set); 194 - compute_clr_set(vcpu, HDFGWTR_EL2, w_clr, w_set); 195 - 196 - r_val = __HDFGRTR_EL2_nMASK; 197 - r_val |= r_set; 198 - r_val &= ~r_clr; 199 - 200 - w_val = __HDFGWTR_EL2_nMASK; 201 - w_val |= w_set; 202 - w_val &= ~w_clr; 203 - 204 - write_sysreg_s(r_val, SYS_HDFGRTR_EL2); 205 - write_sysreg_s(w_val, SYS_HDFGWTR_EL2); 132 + if (cpu_has_amu()) 133 + update_fgt_traps(vcpu, HAFGRTR_EL2); 206 134 } 207 135 208 136 static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu) ··· 199 171 write_sysreg_s(ctxt_sys_reg(hctxt, HFGITR_EL2), SYS_HFGITR_EL2); 200 172 write_sysreg_s(ctxt_sys_reg(hctxt, HDFGRTR_EL2), SYS_HDFGRTR_EL2); 201 173 write_sysreg_s(ctxt_sys_reg(hctxt, HDFGWTR_EL2), 
SYS_HDFGWTR_EL2); 174 + 175 + if (cpu_has_amu()) 176 + write_sysreg_s(ctxt_sys_reg(hctxt, HAFGRTR_EL2), SYS_HAFGRTR_EL2); 202 177 } 203 178 204 179 static inline void __activate_traps_common(struct kvm_vcpu *vcpu) ··· 622 591 if (static_branch_unlikely(&vgic_v2_cpuif_trap)) { 623 592 bool valid; 624 593 625 - valid = kvm_vcpu_trap_get_fault_type(vcpu) == ESR_ELx_FSC_FAULT && 594 + valid = kvm_vcpu_trap_is_translation_fault(vcpu) && 626 595 kvm_vcpu_dabt_isvalid(vcpu) && 627 596 !kvm_vcpu_abt_issea(vcpu) && 628 597 !kvm_vcpu_abt_iss1tw(vcpu);
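The update_fgt_traps_cs() macro introduced above repeats one core pattern per FGT register: start from the register's negative-polarity "don't trap" mask, OR in the bits the guest hypervisor wants set, then apply clears last so a clear wins over a conflicting set. Isolated as a plain function (mask values in the test are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine a register's nMASK default with requested set/clear bits.
 * Order matters: clr is applied after set, so clearing always wins.
 */
uint64_t apply_fgt_clr_set(uint64_t nmask, uint64_t set, uint64_t clr)
{
	uint64_t val = nmask;

	val |= set;
	val &= ~clr;
	return val;
}
```

This is also why the CHECK_FGT_MASKS() build-time checks matter: if MASK, nMASK and RES0 overlapped or left bits unaccounted for, the set/clear arithmetic above could silently flip polarity on a bit.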
+18 -4
arch/arm64/kvm/hyp/include/nvhe/fixed_config.h
··· 69 69 ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_SSBS) \ 70 70 ) 71 71 72 + #define PVM_ID_AA64PFR2_ALLOW 0ULL 73 + 72 74 /* 73 75 * Allow for protected VMs: 74 76 * - Mixed-endian ··· 103 101 * - Privileged Access Never 104 102 * - SError interrupt exceptions from speculative reads 105 103 * - Enhanced Translation Synchronization 104 + * - Control for cache maintenance permission 106 105 */ 107 106 #define PVM_ID_AA64MMFR1_ALLOW (\ 108 107 ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_HAFDBS) | \ ··· 111 108 ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_HPDS) | \ 112 109 ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_PAN) | \ 113 110 ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_SpecSEI) | \ 114 - ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_ETS) \ 111 + ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_ETS) | \ 112 + ARM64_FEATURE_MASK(ID_AA64MMFR1_EL1_CMOW) \ 115 113 ) 116 114 117 115 /* ··· 136 132 ARM64_FEATURE_MASK(ID_AA64MMFR2_EL1_BBM) | \ 137 133 ARM64_FEATURE_MASK(ID_AA64MMFR2_EL1_E0PD) \ 138 134 ) 135 + 136 + #define PVM_ID_AA64MMFR3_ALLOW (0ULL) 139 137 140 138 /* 141 139 * No support for Scalable Vectors for protected VMs: ··· 184 178 ARM64_FEATURE_MASK(ID_AA64ISAR0_EL1_RNDR) \ 185 179 ) 186 180 181 + /* Restrict pointer authentication to the basic version. 
*/ 182 + #define PVM_ID_AA64ISAR1_RESTRICT_UNSIGNED (\ 183 + FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_APA), ID_AA64ISAR1_EL1_APA_PAuth) | \ 184 + FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_API), ID_AA64ISAR1_EL1_API_PAuth) \ 185 + ) 186 + 187 + #define PVM_ID_AA64ISAR2_RESTRICT_UNSIGNED (\ 188 + FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_APA3), ID_AA64ISAR2_EL1_APA3_PAuth) \ 189 + ) 190 + 187 191 #define PVM_ID_AA64ISAR1_ALLOW (\ 188 192 ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_DPB) | \ 189 - ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_APA) | \ 190 - ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_API) | \ 191 193 ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_JSCVT) | \ 192 194 ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_FCMA) | \ 193 195 ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_LRCPC) | \ ··· 210 196 ) 211 197 212 198 #define PVM_ID_AA64ISAR2_ALLOW (\ 199 + ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_ATS1A)| \ 213 200 ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_GPA3) | \ 214 - ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_APA3) | \ 215 201 ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_MOPS) \ 216 202 ) 217 203
+2 -4
arch/arm64/kvm/hyp/nvhe/hyp-init.S
··· 122 122 alternative_else_nop_endif 123 123 msr ttbr0_el2, x2 124 124 125 - /* 126 - * Set the PS bits in TCR_EL2. 127 - */ 128 125 ldr x0, [x0, #NVHE_INIT_TCR_EL2] 129 - tcr_compute_pa_size x0, #TCR_EL2_PS_SHIFT, x1, x2 130 126 msr tcr_el2, x0 131 127 132 128 isb ··· 288 292 mov sp, x0 289 293 290 294 /* And turn the MMU back on! */ 295 + dsb nsh 296 + isb 291 297 set_sctlr_el2 x2 292 298 ret x1 293 299 SYM_FUNC_END(__pkvm_init_switch_pgd)
+3 -3
arch/arm64/kvm/hyp/nvhe/mem_protect.c
··· 91 91 hyp_put_page(&host_s2_pool, addr); 92 92 } 93 93 94 - static void host_s2_free_unlinked_table(void *addr, u32 level) 94 + static void host_s2_free_unlinked_table(void *addr, s8 level) 95 95 { 96 96 kvm_pgtable_stage2_free_unlinked(&host_mmu.mm_ops, addr, level); 97 97 } ··· 443 443 { 444 444 struct kvm_mem_range cur; 445 445 kvm_pte_t pte; 446 - u32 level; 446 + s8 level; 447 447 int ret; 448 448 449 449 hyp_assert_lock_held(&host_mmu.lock); ··· 462 462 cur.start = ALIGN_DOWN(addr, granule); 463 463 cur.end = cur.start + granule; 464 464 level++; 465 - } while ((level < KVM_PGTABLE_MAX_LEVELS) && 465 + } while ((level <= KVM_PGTABLE_LAST_LEVEL) && 466 466 !(kvm_level_supports_block_mapping(level) && 467 467 range_included(&cur, range))); 468 468
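The level variables in this file (and throughout the pull) change from u32 to s8 because an LPA2 walk can start at level -1; the granule-shift arithmetic has to cope with a signed level. A sketch of the ARM64_HW_PGTABLE_LEVEL_SHIFT arithmetic for 4K pages, showing why the signed range matters:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assume 4K pages */

/*
 * Shift covered by a table entry at a given level: 9 address bits per
 * level with 4K pages, counted up from the page offset. Level -1
 * (LPA2's extra top level) covers bit 48 and up, which a u32 level
 * could never express.
 */
uint64_t granule_shift(int8_t level)
{
	return (uint64_t)((PAGE_SHIFT - 3) * (4 - level) + 3);
}
```

So the host_stage2_adjust_range() loop above can legitimately iterate from a block-capable level all the way to KVM_PGTABLE_LAST_LEVEL with signed comparisons.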
+2 -2
arch/arm64/kvm/hyp/nvhe/mm.c
··· 260 260 * https://lore.kernel.org/kvm/20221017115209.2099-1-will@kernel.org/T/#mf10dfbaf1eaef9274c581b81c53758918c1d0f03 261 261 */ 262 262 dsb(ishst); 263 - __tlbi_level(vale2is, __TLBI_VADDR(addr, 0), (KVM_PGTABLE_MAX_LEVELS - 1)); 263 + __tlbi_level(vale2is, __TLBI_VADDR(addr, 0), KVM_PGTABLE_LAST_LEVEL); 264 264 dsb(ish); 265 265 isb(); 266 266 } ··· 275 275 { 276 276 struct hyp_fixmap_slot *slot = per_cpu_ptr(&fixmap_slots, (u64)ctx->arg); 277 277 278 - if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MAX_LEVELS - 1) 278 + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_LAST_LEVEL) 279 279 return -EINVAL; 280 280 281 281 slot->addr = ctx->addr;
arch/arm64/kvm/hyp/nvhe/pkvm.c (+4)
@@ -136,6 +136,10 @@
		cptr_set |= CPTR_EL2_TTA;
	}
 
+	/* Trap External Trace */
+	if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_ExtTrcBuff), feature_ids))
+		mdcr_clear |= MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT;
+
	vcpu->arch.mdcr_el2 |= mdcr_set;
	vcpu->arch.mdcr_el2 &= ~mdcr_clear;
	vcpu->arch.cptr_el2 |= cptr_set;
arch/arm64/kvm/hyp/nvhe/setup.c (+1 -1)
@@ -181,7 +181,7 @@
	if (!kvm_pte_valid(ctx->old))
		return 0;
 
-	if (ctx->level != (KVM_PGTABLE_MAX_LEVELS - 1))
+	if (ctx->level != KVM_PGTABLE_LAST_LEVEL)
		return -EINVAL;
 
	phys = kvm_pte_to_phys(ctx->old);
arch/arm64/kvm/hyp/pgtable.c (+57 -33)
@@ -79,7 +79,10 @@
 
 static bool kvm_phys_is_valid(u64 phys)
 {
-	return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_EL1_PARANGE_MAX));
+	u64 parange_max = kvm_get_parange_max();
+	u8 shift = id_aa64mmfr0_parange_to_phys_shift(parange_max);
+
+	return phys < BIT(shift);
 }
 
 static bool kvm_block_mapping_supported(const struct kvm_pgtable_visit_ctx *ctx, u64 phys)
@@ -101,7 +98,7 @@
	return IS_ALIGNED(ctx->addr, granule);
 }
 
-static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, u32 level)
+static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, s8 level)
 {
	u64 shift = kvm_granule_shift(level);
	u64 mask = BIT(PAGE_SHIFT - 3) - 1;
@@ -117,7 +114,7 @@
	return (addr & mask) >> shift;
 }
 
-static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
+static u32 kvm_pgd_pages(u32 ia_bits, s8 start_level)
 {
	struct kvm_pgtable pgt = {
		.ia_bits	= ia_bits,
@@ -127,9 +124,9 @@
	return kvm_pgd_page_idx(&pgt, -1ULL) + 1;
 }
 
-static bool kvm_pte_table(kvm_pte_t pte, u32 level)
+static bool kvm_pte_table(kvm_pte_t pte, s8 level)
 {
-	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
+	if (level == KVM_PGTABLE_LAST_LEVEL)
		return false;
 
	if (!kvm_pte_valid(pte))
@@ -157,11 +154,11 @@
	return pte;
 }
 
-static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, s8 level)
 {
	kvm_pte_t pte = kvm_phys_to_pte(pa);
-	u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
-						       KVM_PTE_TYPE_BLOCK;
+	u64 type = (level == KVM_PGTABLE_LAST_LEVEL) ? KVM_PTE_TYPE_PAGE :
+						       KVM_PTE_TYPE_BLOCK;
 
	pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
	pte |= FIELD_PREP(KVM_PTE_TYPE, type);
@@ -206,11 +203,11 @@
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
-			      struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, u32 level);
+			      struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, s8 level);
 
 static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
				      struct kvm_pgtable_mm_ops *mm_ops,
-				      kvm_pteref_t pteref, u32 level)
+				      kvm_pteref_t pteref, s8 level)
 {
	enum kvm_pgtable_walk_flags flags = data->walker->flags;
	kvm_pte_t *ptep = kvm_dereference_pteref(data->walker, pteref);
@@ -275,12 +272,13 @@
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
-			      struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, u32 level)
+			      struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, s8 level)
 {
	u32 idx;
	int ret = 0;
 
-	if (WARN_ON_ONCE(level >= KVM_PGTABLE_MAX_LEVELS))
+	if (WARN_ON_ONCE(level < KVM_PGTABLE_FIRST_LEVEL ||
+			 level > KVM_PGTABLE_LAST_LEVEL))
		return -EINVAL;
 
	for (idx = kvm_pgtable_idx(data, level); idx < PTRS_PER_PTE; ++idx) {
@@ -344,7 +340,7 @@
 
 struct leaf_walk_data {
	kvm_pte_t	pte;
-	u32		level;
+	s8		level;
 };
 
 static int leaf_walker(const struct kvm_pgtable_visit_ctx *ctx,
@@ -359,7 +355,7 @@
 }
 
 int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
-			 kvm_pte_t *ptep, u32 *level)
+			 kvm_pte_t *ptep, s8 *level)
 {
	struct leaf_walk_data data;
	struct kvm_pgtable_walker walker = {
@@ -412,7 +408,8 @@
	}
 
	attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
-	attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
+	if (!kvm_lpa2_is_enabled())
+		attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
	attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
	attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
	*ptep = attr;
@@ -472,7 +467,7 @@
	if (hyp_map_walker_try_leaf(ctx, data))
		return 0;
 
-	if (WARN_ON(ctx->level == KVM_PGTABLE_MAX_LEVELS - 1))
+	if (WARN_ON(ctx->level == KVM_PGTABLE_LAST_LEVEL))
		return -EINVAL;
 
	childp = (kvm_pte_t *)mm_ops->zalloc_page(NULL);
@@ -568,14 +563,19 @@
 int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
			 struct kvm_pgtable_mm_ops *mm_ops)
 {
-	u64 levels = ARM64_HW_PGTABLE_LEVELS(va_bits);
+	s8 start_level = KVM_PGTABLE_LAST_LEVEL + 1 -
+			 ARM64_HW_PGTABLE_LEVELS(va_bits);
+
+	if (start_level < KVM_PGTABLE_FIRST_LEVEL ||
+	    start_level > KVM_PGTABLE_LAST_LEVEL)
+		return -EINVAL;
 
	pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_page(NULL);
	if (!pgt->pgd)
		return -ENOMEM;
 
	pgt->ia_bits		= va_bits;
-	pgt->start_level	= KVM_PGTABLE_MAX_LEVELS - levels;
+	pgt->start_level	= start_level;
	pgt->mm_ops		= mm_ops;
	pgt->mmu		= NULL;
	pgt->force_pte_cb	= NULL;
@@ -634,7 +624,7 @@
 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 {
	u64 vtcr = VTCR_EL2_FLAGS;
-	u8 lvls;
+	s8 lvls;
 
	vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
	vtcr |= VTCR_EL2_T0SZ(phys_shift);
@@ -645,6 +635,15 @@
	lvls = stage2_pgtable_levels(phys_shift);
	if (lvls < 2)
		lvls = 2;
+
+	/*
+	 * When LPA2 is enabled, the HW supports an extra level of translation
+	 * (for 5 in total) when using 4K pages. It also introduces VTCR_EL2.SL2
+	 * as an addition to SL0 to enable encoding this extra start level.
+	 * However, since we always use concatenated pages for the first level
+	 * lookup, we will never need this extra level and therefore do not need
+	 * to touch SL2.
+	 */
	vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
 
 #ifdef CONFIG_ARM64_HW_AFDBM
@@ -672,6 +653,9 @@
	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
		vtcr |= VTCR_EL2_HA;
 #endif /* CONFIG_ARM64_HW_AFDBM */
+
+	if (kvm_lpa2_is_enabled())
+		vtcr |= VTCR_EL2_DS;
 
	/* Set the vmid bits */
	vtcr |= (get_vmid_bits(mmfr1) == 16) ?
@@ -733,7 +711,9 @@
	if (prot & KVM_PGTABLE_PROT_W)
		attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
 
-	attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
+	if (!kvm_lpa2_is_enabled())
+		attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
+
	attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
	attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
	*ptep = attr;
@@ -926,7 +902,7 @@
 {
	u64 phys = stage2_map_walker_phys_addr(ctx, data);
 
-	if (data->force_pte && (ctx->level < (KVM_PGTABLE_MAX_LEVELS - 1)))
+	if (data->force_pte && ctx->level < KVM_PGTABLE_LAST_LEVEL)
		return false;
 
	return kvm_block_mapping_supported(ctx, phys);
@@ -1005,7 +981,7 @@
	if (ret != -E2BIG)
		return ret;
 
-	if (WARN_ON(ctx->level == KVM_PGTABLE_MAX_LEVELS - 1))
+	if (WARN_ON(ctx->level == KVM_PGTABLE_LAST_LEVEL))
		return -EINVAL;
 
	if (!data->memcache)
@@ -1175,7 +1151,7 @@
	kvm_pte_t			attr_set;
	kvm_pte_t			attr_clr;
	kvm_pte_t			pte;
-	u32				level;
+	s8				level;
 };
 
 static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
@@ -1218,7 +1194,7 @@
 static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
				    u64 size, kvm_pte_t attr_set,
				    kvm_pte_t attr_clr, kvm_pte_t *orig_pte,
-				    u32 *level, enum kvm_pgtable_walk_flags flags)
+				    s8 *level, enum kvm_pgtable_walk_flags flags)
 {
	int ret;
	kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
@@ -1320,7 +1296,7 @@
				   enum kvm_pgtable_prot prot)
 {
	int ret;
-	u32 level;
+	s8 level;
	kvm_pte_t set = 0, clr = 0;
 
	if (prot & KVM_PTE_LEAF_ATTR_HI_SW)
@@ -1373,7 +1349,7 @@
 }
 
 kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
-					      u64 phys, u32 level,
+					      u64 phys, s8 level,
					      enum kvm_pgtable_prot prot,
					      void *mc, bool force_pte)
 {
@@ -1431,7 +1407,7 @@
  * fully populated tree up to the PTE entries. Note that @level is
  * interpreted as in "level @level entry".
  */
-static int stage2_block_get_nr_page_tables(u32 level)
+static int stage2_block_get_nr_page_tables(s8 level)
 {
	switch (level) {
	case 1:
@@ -1442,7 +1418,7 @@
		return 0;
	default:
		WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL ||
-			     level >= KVM_PGTABLE_MAX_LEVELS);
+			     level > KVM_PGTABLE_LAST_LEVEL);
		return -EINVAL;
	};
 }
@@ -1455,13 +1431,13 @@
	struct kvm_s2_mmu *mmu;
	kvm_pte_t pte = ctx->old, new, *childp;
	enum kvm_pgtable_prot prot;
-	u32 level = ctx->level;
+	s8 level = ctx->level;
	bool force_pte;
	int nr_pages;
	u64 phys;
 
	/* No huge-pages exist at the last level */
-	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
+	if (level == KVM_PGTABLE_LAST_LEVEL)
		return 0;
 
	/* We only split valid block mappings */
@@ -1538,7 +1514,7 @@
	u64 vtcr = mmu->vtcr;
	u32 ia_bits = VTCR_EL2_IPA(vtcr);
	u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
-	u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
+	s8 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
 
	pgd_sz = kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
	pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_pages_exact(pgd_sz);
@@ -1561,7 +1537,7 @@
 {
	u32 ia_bits = VTCR_EL2_IPA(vtcr);
	u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
-	u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
+	s8 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
 
	return kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
 }
@@ -1597,7 +1573,7 @@
	pgt->pgd = NULL;
 }
 
-void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level)
+void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level)
 {
	kvm_pteref_t ptep = (kvm_pteref_t)pgtable;
	struct kvm_pgtable_walker walker = {
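The recurring change in this file swaps unsigned `u32` walk levels for signed `s8`, so that level -1 (the extra level a 5-level 4K LPA2 walk needs) is representable, and replaces `level < KVM_PGTABLE_MAX_LEVELS` upper-bound checks with explicit `[FIRST, LAST]` range checks. A minimal standalone sketch of that convention (constant values are assumptions for a 4K-granule walk, not taken from the kernel headers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for a 4K-granule walk with LPA2: levels -1..3. */
#define PGTABLE_FIRST_LEVEL (-1)
#define PGTABLE_LAST_LEVEL  3

/*
 * An unsigned level cannot encode -1: (u32)-1 would wrap to a huge value
 * and only fail a "< max" check by accident. A signed int8_t makes the
 * range check explicit on both ends.
 */
static bool level_is_valid(int8_t level)
{
	return level >= PGTABLE_FIRST_LEVEL && level <= PGTABLE_LAST_LEVEL;
}

/* Mirrors the "level == KVM_PGTABLE_LAST_LEVEL" leaf-only checks above. */
static bool level_is_leaf_only(int8_t level)
{
	return level == PGTABLE_LAST_LEVEL;
}
```

The point of the two-sided check is that a corrupted or uninitialized level now fails loudly in both directions instead of silently passing an unsigned comparison.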
arch/arm64/kvm/mmu.c (+25 -24)
@@ -223,12 +223,12 @@
 {
	struct page *page = container_of(head, struct page, rcu_head);
	void *pgtable = page_to_virt(page);
-	u32 level = page_private(page);
+	s8 level = page_private(page);
 
	kvm_pgtable_stage2_free_unlinked(&kvm_s2_mm_ops, pgtable, level);
 }
 
-static void stage2_free_unlinked_table(void *addr, u32 level)
+static void stage2_free_unlinked_table(void *addr, s8 level)
 {
	struct page *page = virt_to_page(addr);
 
@@ -804,13 +804,13 @@
	struct kvm_pgtable pgt = {
		.pgd		= (kvm_pteref_t)kvm->mm->pgd,
		.ia_bits	= vabits_actual,
-		.start_level	= (KVM_PGTABLE_MAX_LEVELS -
-				   CONFIG_PGTABLE_LEVELS),
+		.start_level	= (KVM_PGTABLE_LAST_LEVEL -
+				   CONFIG_PGTABLE_LEVELS + 1),
		.mm_ops		= &kvm_user_mm_ops,
	};
	unsigned long flags;
	kvm_pte_t pte = 0;	/* Keep GCC quiet... */
-	u32 level = ~0;
+	s8 level = S8_MAX;
	int ret;
 
	/*
@@ -829,7 +829,9 @@
	 * Not seeing an error, but not updating level? Something went
	 * deeply wrong...
	 */
-	if (WARN_ON(level >= KVM_PGTABLE_MAX_LEVELS))
+	if (WARN_ON(level > KVM_PGTABLE_LAST_LEVEL))
+		return -EFAULT;
+	if (WARN_ON(level < KVM_PGTABLE_FIRST_LEVEL))
		return -EFAULT;
 
	/* Oops, the userspace PTs are gone... Replay the fault */
@@ -1376,7 +1374,7 @@
 
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
			  struct kvm_memory_slot *memslot, unsigned long hva,
-			  unsigned long fault_status)
+			  bool fault_is_perm)
 {
	int ret = 0;
	bool write_fault, writable, force_pte = false;
@@ -1390,17 +1388,17 @@
	gfn_t gfn;
	kvm_pfn_t pfn;
	bool logging_active = memslot_is_logging(memslot);
-	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
	long vma_pagesize, fault_granule;
	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
	struct kvm_pgtable *pgt;
 
-	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
+	if (fault_is_perm)
+		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
	write_fault = kvm_is_write_fault(vcpu);
	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
	VM_BUG_ON(write_fault && exec_fault);
 
-	if (fault_status == ESR_ELx_FSC_PERM && !write_fault && !exec_fault) {
+	if (fault_is_perm && !write_fault && !exec_fault) {
		kvm_err("Unexpected L2 read permission error\n");
		return -EFAULT;
	}
@@ -1411,8 +1409,7 @@
	 * only exception to this is when dirty logging is enabled at runtime
	 * and a write fault needs to collapse a block entry into a table.
	 */
-	if (fault_status != ESR_ELx_FSC_PERM ||
-	    (logging_active && write_fault)) {
+	if (!fault_is_perm || (logging_active && write_fault)) {
		ret = kvm_mmu_topup_memory_cache(memcache,
						 kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
		if (ret)
@@ -1528,8 +1527,7 @@
	 * backed by a THP and thus use block mapping if possible.
	 */
	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
-		if (fault_status == ESR_ELx_FSC_PERM &&
-		    fault_granule > PAGE_SIZE)
+		if (fault_is_perm && fault_granule > PAGE_SIZE)
			vma_pagesize = fault_granule;
		else
			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
@@ -1541,7 +1541,7 @@
		}
	}
 
-	if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
+	if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
		/* Check the VMM hasn't introduced a new disallowed VMA */
		if (mte_allowed) {
			sanitise_mte_tags(kvm, pfn, vma_pagesize);
@@ -1567,7 +1567,7 @@
	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
	 * kvm_pgtable_stage2_map() should be called to change block size.
	 */
-	if (fault_status == ESR_ELx_FSC_PERM && vma_pagesize == fault_granule)
+	if (fault_is_perm && vma_pagesize == fault_granule)
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	else
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
@@ -1618,7 +1618,7 @@
  */
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 {
-	unsigned long fault_status;
+	unsigned long esr;
	phys_addr_t fault_ipa;
	struct kvm_memory_slot *memslot;
	unsigned long hva;
@@ -1626,12 +1626,12 @@
	gfn_t gfn;
	int ret, idx;
 
-	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
+	esr = kvm_vcpu_get_esr(vcpu);
 
	fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
	is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
 
-	if (fault_status == ESR_ELx_FSC_FAULT) {
+	if (esr_fsc_is_permission_fault(esr)) {
		/* Beyond sanitised PARange (which is the IPA limit) */
		if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
			kvm_inject_size_fault(vcpu);
@@ -1666,9 +1666,9 @@
			      kvm_vcpu_get_hfar(vcpu), fault_ipa);
 
	/* Check the stage-2 fault is trans. fault or write fault */
-	if (fault_status != ESR_ELx_FSC_FAULT &&
-	    fault_status != ESR_ELx_FSC_PERM &&
-	    fault_status != ESR_ELx_FSC_ACCESS) {
+	if (!esr_fsc_is_translation_fault(esr) &&
+	    !esr_fsc_is_permission_fault(esr) &&
+	    !esr_fsc_is_access_flag_fault(esr)) {
		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
			kvm_vcpu_trap_get_class(vcpu),
			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
@@ -1730,13 +1730,14 @@
	/* Userspace should not be able to register out-of-bounds IPAs */
	VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->arch.hw_mmu));
 
-	if (fault_status == ESR_ELx_FSC_ACCESS) {
+	if (esr_fsc_is_access_flag_fault(esr)) {
		handle_access_fault(vcpu, fault_ipa);
		ret = 1;
		goto out_unlock;
	}
 
-	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
+	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva,
+			     esr_fsc_is_permission_fault(esr));
	if (ret == 0)
		ret = 1;
 out:
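The abort path above stops comparing the raw fault-status code against individual `ESR_ELx_FSC_*` values and instead uses `esr_fsc_is_*()` predicate helpers, which can absorb extra encodings (such as the level -1 faults LPA2 introduces) in one place. A simplified standalone sketch of that predicate style; the FSC field position and the level 0-3 encodings below are stated as assumptions rather than pulled from the architecture headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The fault status code (DFSC/IFSC) lives in the low bits of ESR_ELx.
 * Values here model only the classic level 0..3 encodings; the real
 * helpers also cover the LPA2 level -1 cases.
 */
#define ESR_FSC_MASK    0x3fu
#define FSC_TRANS_BASE  0x04u  /* translation fault, level 0 */
#define FSC_ACCESS_BASE 0x08u  /* access flag fault, level 0 */
#define FSC_PERM_BASE   0x0cu  /* permission fault, level 0 */

/* True if the FSC falls in [base, base + 3], i.e. levels 0..3. */
static bool fsc_in(uint64_t esr, uint32_t base)
{
	uint32_t fsc = esr & ESR_FSC_MASK;

	return fsc >= base && fsc <= base + 3;
}

static bool esr_fsc_is_translation_fault(uint64_t esr)
{
	return fsc_in(esr, FSC_TRANS_BASE);
}

static bool esr_fsc_is_access_flag_fault(uint64_t esr)
{
	return fsc_in(esr, FSC_ACCESS_BASE);
}

static bool esr_fsc_is_permission_fault(uint64_t esr)
{
	return fsc_in(esr, FSC_PERM_BASE);
}
```

The design win is that callers like `kvm_handle_guest_abort()` ask a question ("is this a permission fault?") instead of enumerating encodings, so new fault levels only change the predicates.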
arch/arm64/kvm/nested.c (+15 -7)
@@ -23,13 +23,9 @@
  * This list should get updated as new features get added to the NV
  * support, and new extension to the architecture.
  */
-void access_nested_id_reg(struct kvm_vcpu *v, struct sys_reg_params *p,
-			  const struct sys_reg_desc *r)
+static u64 limit_nv_id_reg(u32 id, u64 val)
 {
-	u32 id = reg_to_encoding(r);
-	u64 val, tmp;
-
-	val = p->regval;
+	u64 tmp;
 
	switch (id) {
	case SYS_ID_AA64ISAR0_EL1:
@@ -154,5 +158,17 @@
		break;
	}
 
-	p->regval = val;
+	return val;
+}
+
+int kvm_init_nv_sysregs(struct kvm *kvm)
+{
+	mutex_lock(&kvm->arch.config_lock);
+
+	for (int i = 0; i < KVM_ARM_ID_REG_NUM; i++)
+		kvm->arch.id_regs[i] = limit_nv_id_reg(IDX_IDREG(i),
+						       kvm->arch.id_regs[i]);
+
+	mutex_unlock(&kvm->arch.config_lock);
+
+	return 0;
 }
arch/arm64/kvm/reset.c (+4 -5)
@@ -280,12 +280,11 @@
	parange = cpuid_feature_extract_unsigned_field(mmfr0,
				ID_AA64MMFR0_EL1_PARANGE_SHIFT);
	/*
-	 * IPA size beyond 48 bits could not be supported
-	 * on either 4K or 16K page size. Hence let's cap
-	 * it to 48 bits, in case it's reported as larger
-	 * on the system.
+	 * IPA size beyond 48 bits for 4K and 16K page size is only supported
+	 * when LPA2 is available. So if we have LPA2, enable it, else cap to 48
+	 * bits, in case it's reported as larger on the system.
	 */
-	if (PAGE_SIZE != SZ_64K)
+	if (!kvm_lpa2_is_enabled() && PAGE_SIZE != SZ_64K)
		parange = min(parange, (unsigned int)ID_AA64MMFR0_EL1_PARANGE_48);
 
	/*
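The capping rule above can be modelled in isolation: the advertised PA range is clamped to 48 bits only when neither LPA2 nor 64K pages lifts the restriction. A hedged standalone sketch; the PARange field encodings (0b0101 = 48-bit, 0b0110 = 52-bit) are assumptions stated here, not read from the ID register headers:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ID_AA64MMFR0_EL1.PARange encodings. */
#define PARANGE_48 5u
#define PARANGE_52 6u

/*
 * Cap the reported PA range at 48 bits unless either LPA2 is enabled
 * (for 4K/16K pages) or the kernel runs with 64K pages, which never
 * needed the cap. Both conditions are modelled as plain flags.
 */
static uint32_t cap_parange(uint32_t parange, int lpa2, int page_is_64k)
{
	if (!lpa2 && !page_is_64k && parange > PARANGE_48)
		return PARANGE_48;
	return parange;
}
```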
arch/arm64/kvm/sys_regs.c (+187 -48)
@@ -45,44 +45,170 @@
 static int set_id_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
		      u64 val);
 
+static bool bad_trap(struct kvm_vcpu *vcpu,
+		     struct sys_reg_params *params,
+		     const struct sys_reg_desc *r,
+		     const char *msg)
+{
+	WARN_ONCE(1, "Unexpected %s\n", msg);
+	print_sys_reg_instr(params);
+	kvm_inject_undefined(vcpu);
+	return false;
+}
+
 static bool read_from_write_only(struct kvm_vcpu *vcpu,
				 struct sys_reg_params *params,
				 const struct sys_reg_desc *r)
 {
-	WARN_ONCE(1, "Unexpected sys_reg read to write-only register\n");
-	print_sys_reg_instr(params);
-	kvm_inject_undefined(vcpu);
-	return false;
+	return bad_trap(vcpu, params, r,
+			"sys_reg read to write-only register");
 }
 
 static bool write_to_read_only(struct kvm_vcpu *vcpu,
			       struct sys_reg_params *params,
			       const struct sys_reg_desc *r)
 {
-	WARN_ONCE(1, "Unexpected sys_reg write to read-only register\n");
-	print_sys_reg_instr(params);
-	kvm_inject_undefined(vcpu);
-	return false;
+	return bad_trap(vcpu, params, r,
+			"sys_reg write to read-only register");
+}
+
+#define PURE_EL2_SYSREG(el2)						\
+	case el2: {							\
+		*el1r = el2;						\
+		return true;						\
+	}
+
+#define MAPPED_EL2_SYSREG(el2, el1, fn)					\
+	case el2: {							\
+		*xlate = fn;						\
+		*el1r = el1;						\
+		return true;						\
+	}
+
+static bool get_el2_to_el1_mapping(unsigned int reg,
+				   unsigned int *el1r, u64 (**xlate)(u64))
+{
+	switch (reg) {
+		PURE_EL2_SYSREG(VPIDR_EL2);
+		PURE_EL2_SYSREG(VMPIDR_EL2);
+		PURE_EL2_SYSREG(ACTLR_EL2);
+		PURE_EL2_SYSREG(HCR_EL2);
+		PURE_EL2_SYSREG(MDCR_EL2);
+		PURE_EL2_SYSREG(HSTR_EL2);
+		PURE_EL2_SYSREG(HACR_EL2);
+		PURE_EL2_SYSREG(VTTBR_EL2);
+		PURE_EL2_SYSREG(VTCR_EL2);
+		PURE_EL2_SYSREG(RVBAR_EL2);
+		PURE_EL2_SYSREG(TPIDR_EL2);
+		PURE_EL2_SYSREG(HPFAR_EL2);
+		PURE_EL2_SYSREG(CNTHCTL_EL2);
+		MAPPED_EL2_SYSREG(SCTLR_EL2, SCTLR_EL1,
+				  translate_sctlr_el2_to_sctlr_el1);
+		MAPPED_EL2_SYSREG(CPTR_EL2, CPACR_EL1,
+				  translate_cptr_el2_to_cpacr_el1);
+		MAPPED_EL2_SYSREG(TTBR0_EL2, TTBR0_EL1,
+				  translate_ttbr0_el2_to_ttbr0_el1);
+		MAPPED_EL2_SYSREG(TTBR1_EL2, TTBR1_EL1, NULL);
+		MAPPED_EL2_SYSREG(TCR_EL2, TCR_EL1,
+				  translate_tcr_el2_to_tcr_el1);
+		MAPPED_EL2_SYSREG(VBAR_EL2, VBAR_EL1, NULL);
+		MAPPED_EL2_SYSREG(AFSR0_EL2, AFSR0_EL1, NULL);
+		MAPPED_EL2_SYSREG(AFSR1_EL2, AFSR1_EL1, NULL);
+		MAPPED_EL2_SYSREG(ESR_EL2, ESR_EL1, NULL);
+		MAPPED_EL2_SYSREG(FAR_EL2, FAR_EL1, NULL);
+		MAPPED_EL2_SYSREG(MAIR_EL2, MAIR_EL1, NULL);
+		MAPPED_EL2_SYSREG(AMAIR_EL2, AMAIR_EL1, NULL);
+		MAPPED_EL2_SYSREG(ELR_EL2, ELR_EL1, NULL);
+		MAPPED_EL2_SYSREG(SPSR_EL2, SPSR_EL1, NULL);
+	default:
+		return false;
+	}
 }
 
 u64 vcpu_read_sys_reg(const struct kvm_vcpu *vcpu, int reg)
 {
	u64 val = 0x8badf00d8badf00d;
+	u64 (*xlate)(u64) = NULL;
+	unsigned int el1r;
 
-	if (vcpu_get_flag(vcpu, SYSREGS_ON_CPU) &&
-	    __vcpu_read_sys_reg_from_cpu(reg, &val))
+	if (!vcpu_get_flag(vcpu, SYSREGS_ON_CPU))
+		goto memory_read;
+
+	if (unlikely(get_el2_to_el1_mapping(reg, &el1r, &xlate))) {
+		if (!is_hyp_ctxt(vcpu))
+			goto memory_read;
+
+		/*
+		 * If this register does not have an EL1 counterpart,
+		 * then read the stored EL2 version.
+		 */
+		if (reg == el1r)
+			goto memory_read;
+
+		/*
+		 * If we have a non-VHE guest and the sysreg requires
+		 * translation to be used at EL1, use the in-memory copy
+		 * instead.
+		 */
+		if (!vcpu_el2_e2h_is_set(vcpu) && xlate)
+			goto memory_read;
+
+		/* Get the current version of the EL1 counterpart. */
+		WARN_ON(!__vcpu_read_sys_reg_from_cpu(el1r, &val));
+		return val;
+	}
+
+	/* EL1 register can't be on the CPU if the guest is in vEL2. */
+	if (unlikely(is_hyp_ctxt(vcpu)))
+		goto memory_read;
+
+	if (__vcpu_read_sys_reg_from_cpu(reg, &val))
		return val;
 
+memory_read:
	return __vcpu_sys_reg(vcpu, reg);
 }
 
 void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, int reg)
 {
-	if (vcpu_get_flag(vcpu, SYSREGS_ON_CPU) &&
-	    __vcpu_write_sys_reg_to_cpu(val, reg))
+	u64 (*xlate)(u64) = NULL;
+	unsigned int el1r;
+
+	if (!vcpu_get_flag(vcpu, SYSREGS_ON_CPU))
+		goto memory_write;
+
+	if (unlikely(get_el2_to_el1_mapping(reg, &el1r, &xlate))) {
+		if (!is_hyp_ctxt(vcpu))
+			goto memory_write;
+
+		/*
+		 * Always store a copy of the write to memory to avoid having
+		 * to reverse-translate virtual EL2 system registers for a
+		 * non-VHE guest hypervisor.
+		 */
+		__vcpu_sys_reg(vcpu, reg) = val;
+
+		/* No EL1 counterpart? We're done here. */
+		if (reg == el1r)
+			return;
+
+		if (!vcpu_el2_e2h_is_set(vcpu) && xlate)
+			val = xlate(val);
+
+		/* Redirect this to the EL1 version of the register. */
+		WARN_ON(!__vcpu_write_sys_reg_to_cpu(val, el1r));
+		return;
+	}
+
+	/* EL1 register can't be on the CPU if the guest is in vEL2. */
+	if (unlikely(is_hyp_ctxt(vcpu)))
+		goto memory_write;
+
+	if (__vcpu_write_sys_reg_to_cpu(val, reg))
		return;
 
-	__vcpu_sys_reg(vcpu, reg) = val;
+memory_write:
+	__vcpu_sys_reg(vcpu, reg) = val;
 }
 
 /* CSSELR values; used to index KVM_REG_ARM_DEMUX_ID_CCSIDR */
@@ -1631,8 +1505,6 @@
		return write_to_read_only(vcpu, p, r);
 
	p->regval = read_id_reg(vcpu, r);
-	if (vcpu_has_nv(vcpu))
-		access_nested_id_reg(vcpu, p, r);
 
	return true;
 }
@@ -2009,6 +1885,32 @@
	return REG_HIDDEN;
 }
 
+static bool bad_vncr_trap(struct kvm_vcpu *vcpu,
+			  struct sys_reg_params *p,
+			  const struct sys_reg_desc *r)
+{
+	/*
+	 * We really shouldn't be here, and this is likely the result
+	 * of a misconfigured trap, as this register should target the
+	 * VNCR page, and nothing else.
+	 */
+	return bad_trap(vcpu, p, r,
+			"trap of VNCR-backed register");
+}
+
+static bool bad_redir_trap(struct kvm_vcpu *vcpu,
+			   struct sys_reg_params *p,
+			   const struct sys_reg_desc *r)
+{
+	/*
+	 * We really shouldn't be here, and this is likely the result
+	 * of a misconfigured trap, as this register should target the
+	 * corresponding EL1, and nothing else.
+	 */
+	return bad_trap(vcpu, p, r,
+			"trap of EL2 register redirected to EL1");
+}
+
 #define EL2_REG(name, acc, rst, v) {		\
	SYS_DESC(SYS_##name),			\
	.access = acc,				\
@@ -2043,6 +1893,9 @@
	.visibility = el2_visibility,		\
	.val = v,				\
 }
+
+#define EL2_REG_VNCR(name, rst, v)	EL2_REG(name, bad_vncr_trap, rst, v)
+#define EL2_REG_REDIR(name, rst, v)	EL2_REG(name, bad_redir_trap, rst, v)
 
 /*
  * EL{0,1}2 registers are the EL2 view on an EL0 or EL1 register when
@@ -2661,32 +2508,33 @@
	{ PMU_SYS_REG(PMCCFILTR_EL0), .access = access_pmu_evtyper,
	  .reset = reset_val, .reg = PMCCFILTR_EL0, .val = 0 },
 
-	EL2_REG(VPIDR_EL2, access_rw, reset_unknown, 0),
-	EL2_REG(VMPIDR_EL2, access_rw, reset_unknown, 0),
+	EL2_REG_VNCR(VPIDR_EL2, reset_unknown, 0),
+	EL2_REG_VNCR(VMPIDR_EL2, reset_unknown, 0),
	EL2_REG(SCTLR_EL2, access_rw, reset_val, SCTLR_EL2_RES1),
	EL2_REG(ACTLR_EL2, access_rw, reset_val, 0),
-	EL2_REG(HCR_EL2, access_rw, reset_val, 0),
+	EL2_REG_VNCR(HCR_EL2, reset_val, 0),
	EL2_REG(MDCR_EL2, access_rw, reset_val, 0),
	EL2_REG(CPTR_EL2, access_rw, reset_val, CPTR_NVHE_EL2_RES1),
-	EL2_REG(HSTR_EL2, access_rw, reset_val, 0),
-	EL2_REG(HFGRTR_EL2, access_rw, reset_val, 0),
-	EL2_REG(HFGWTR_EL2, access_rw, reset_val, 0),
-	EL2_REG(HFGITR_EL2, access_rw, reset_val, 0),
-	EL2_REG(HACR_EL2, access_rw, reset_val, 0),
+	EL2_REG_VNCR(HSTR_EL2, reset_val, 0),
+	EL2_REG_VNCR(HFGRTR_EL2, reset_val, 0),
+	EL2_REG_VNCR(HFGWTR_EL2, reset_val, 0),
+	EL2_REG_VNCR(HFGITR_EL2, reset_val, 0),
+	EL2_REG_VNCR(HACR_EL2, reset_val, 0),
 
-	EL2_REG(HCRX_EL2, access_rw, reset_val, 0),
+	EL2_REG_VNCR(HCRX_EL2, reset_val, 0),
 
	EL2_REG(TTBR0_EL2, access_rw, reset_val, 0),
	EL2_REG(TTBR1_EL2, access_rw, reset_val, 0),
	EL2_REG(TCR_EL2, access_rw, reset_val, TCR_EL2_RES1),
-	EL2_REG(VTTBR_EL2, access_rw, reset_val, 0),
-	EL2_REG(VTCR_EL2, access_rw, reset_val, 0),
+	EL2_REG_VNCR(VTTBR_EL2, reset_val, 0),
+	EL2_REG_VNCR(VTCR_EL2, reset_val, 0),
 
	{ SYS_DESC(SYS_DACR32_EL2), trap_undef, reset_unknown, DACR32_EL2 },
-	EL2_REG(HDFGRTR_EL2, access_rw, reset_val, 0),
-	EL2_REG(HDFGWTR_EL2, access_rw, reset_val, 0),
-	EL2_REG(SPSR_EL2, access_rw, reset_val, 0),
-	EL2_REG(ELR_EL2, access_rw, reset_val, 0),
+	EL2_REG_VNCR(HDFGRTR_EL2, reset_val, 0),
+	EL2_REG_VNCR(HDFGWTR_EL2, reset_val, 0),
+	EL2_REG_VNCR(HAFGRTR_EL2, reset_val, 0),
+	EL2_REG_REDIR(SPSR_EL2, reset_val, 0),
+	EL2_REG_REDIR(ELR_EL2, reset_val, 0),
	{ SYS_DESC(SYS_SP_EL1), access_sp_el1},
 
	/* AArch32 SPSR_* are RES0 if trapped from a NV guest */
@@ -2703,10 +2549,10 @@
	{ SYS_DESC(SYS_IFSR32_EL2), trap_undef, reset_unknown, IFSR32_EL2 },
	EL2_REG(AFSR0_EL2, access_rw, reset_val, 0),
	EL2_REG(AFSR1_EL2, access_rw, reset_val, 0),
-	EL2_REG(ESR_EL2, access_rw, reset_val, 0),
+	EL2_REG_REDIR(ESR_EL2, reset_val, 0),
	{ SYS_DESC(SYS_FPEXC32_EL2), trap_undef, reset_val, FPEXC32_EL2, 0x700 },
 
-	EL2_REG(FAR_EL2, access_rw, reset_val, 0),
+	EL2_REG_REDIR(FAR_EL2, reset_val, 0),
	EL2_REG(HPFAR_EL2, access_rw, reset_val, 0),
 
	EL2_REG(MAIR_EL2, access_rw, reset_val, 0),
@@ -2719,24 +2565,9 @@
	EL2_REG(CONTEXTIDR_EL2, access_rw, reset_val, 0),
	EL2_REG(TPIDR_EL2, access_rw, reset_val, 0),
 
-	EL2_REG(CNTVOFF_EL2, access_rw, reset_val, 0),
+	EL2_REG_VNCR(CNTVOFF_EL2, reset_val, 0),
	EL2_REG(CNTHCTL_EL2, access_rw, reset_val, 0),
 
-	EL12_REG(SCTLR, access_vm_reg, reset_val, 0x00C50078),
-	EL12_REG(CPACR, access_rw, reset_val, 0),
-	EL12_REG(TTBR0, access_vm_reg, reset_unknown, 0),
-	EL12_REG(TTBR1, access_vm_reg, reset_unknown, 0),
-	EL12_REG(TCR, access_vm_reg, reset_val, 0),
-	{ SYS_DESC(SYS_SPSR_EL12), access_spsr},
-	{ SYS_DESC(SYS_ELR_EL12), access_elr},
-	EL12_REG(AFSR0, access_vm_reg, reset_unknown, 0),
-	EL12_REG(AFSR1, access_vm_reg, reset_unknown, 0),
-	EL12_REG(ESR, access_vm_reg, reset_unknown, 0),
-	EL12_REG(FAR, access_vm_reg, reset_unknown, 0),
-	EL12_REG(MAIR, access_vm_reg, reset_unknown, 0),
-	EL12_REG(AMAIR, access_vm_reg, reset_amair_el1, 0),
-	EL12_REG(VBAR, access_rw, reset_val, 0),
-	EL12_REG(CONTEXTIDR, access_vm_reg, reset_val, 0),
	EL12_REG(CNTKCTL, access_rw, reset_val, 0),
 
	EL2_REG(SP_EL2, NULL, reset_unknown, 0),
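The new `get_el2_to_el1_mapping()` above splits EL2 registers into "pure" ones (stored and read as themselves) and ones mapped onto an EL1 counterpart, optionally through a translate function. The shape of that dispatch can be modelled standalone; the register IDs and the one translate rule below are invented for illustration, not the real vgic/sysreg definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum reg { SCTLR_EL1, TCR_EL1, SCTLR_EL2, TCR_EL2, HCR_EL2 };

/* Pretend fixup between the EL2 and EL1 layouts of a register. */
static uint64_t xlate_sctlr(uint64_t v)
{
	return v | 1;
}

/*
 * Returns true if 'reg' is an EL2 register; *el1r receives its EL1
 * home (the register itself for "pure" EL2 regs) and *fn an optional
 * translate function, mirroring the PURE/MAPPED macro split.
 */
static bool el2_to_el1(enum reg reg, enum reg *el1r,
		       uint64_t (**fn)(uint64_t))
{
	*fn = NULL;
	switch (reg) {
	case HCR_EL2:		/* pure EL2: no EL1 counterpart */
		*el1r = HCR_EL2;
		return true;
	case SCTLR_EL2:		/* mapped, needs translation */
		*el1r = SCTLR_EL1;
		*fn = xlate_sctlr;
		return true;
	case TCR_EL2:		/* mapped, stored verbatim */
		*el1r = TCR_EL1;
		return true;
	default:		/* not an EL2 register */
		return false;
	}
}

/* Convenience checker so callers can assert on the mapping in one call. */
static bool maps_to(enum reg reg, enum reg expect_el1, bool expect_xlate)
{
	enum reg el1r;
	uint64_t (*fn)(uint64_t);

	if (!el2_to_el1(reg, &el1r, &fn))
		return false;
	return el1r == expect_el1 && (fn != NULL) == expect_xlate;
}
```

The "pure register maps to itself" convention lets a single `reg == el1r` test decide whether to fall back to the in-memory copy, which is exactly how the read/write paths above use it.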
arch/arm64/kvm/vgic/vgic-its.c (+5)
@@ -590,7 +590,11 @@
	unsigned long flags;
 
	raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
+
	irq = __vgic_its_check_cache(dist, db, devid, eventid);
+	if (irq)
+		vgic_get_irq_kref(irq);
+
	raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);
 
	return irq;
@@ -773,6 +769,7 @@
	raw_spin_lock_irqsave(&irq->irq_lock, flags);
	irq->pending_latch = true;
	vgic_queue_irq_unlock(kvm, irq, flags);
+	vgic_put_irq(kvm, irq);
 
	return 0;
 }
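The fix above takes the reference while `lpi_list_lock` is still held, so the IRQ cannot be freed between lookup and use; the injection path then drops that reference once the IRQ is queued. A generic sketch of this take-ref-under-lock pattern (the types and names are illustrative, not the vgic ones, and the lock is shown only as comments):

```c
#include <assert.h>
#include <stddef.h>

struct obj {
	int refcount;
	int key;
};

/* Toy one-slot "cache", normally guarded by a spinlock. */
static struct obj cache_slot = { .refcount = 1, .key = 42 };

static struct obj *cache_lookup(int key)
{
	/* lock(); -- lookup and ref-take must happen under the same lock */
	struct obj *o = (cache_slot.key == key) ? &cache_slot : NULL;

	if (o)
		o->refcount++;	/* grab the reference before unlocking */
	/* unlock(); */
	return o;
}

/* Drop the reference once the object has been used; returns new count. */
static int obj_put(struct obj *o)
{
	return --o->refcount;
}
```

Taking the reference outside the lock would leave a window where a concurrent teardown could free the object after lookup but before the ref is grabbed, which is precisely the race the +5 lines close.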
arch/arm64/kvm/vgic/vgic-mmio-v3.c (+5 -23)
@@ -357,31 +357,13 @@
				 gpa_t addr, unsigned int len,
				 unsigned long val)
 {
-	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
-	int i;
-	unsigned long flags;
+	int ret;
 
-	for (i = 0; i < len * 8; i++) {
-		struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i);
+	ret = vgic_uaccess_write_spending(vcpu, addr, len, val);
+	if (ret)
+		return ret;
 
-		raw_spin_lock_irqsave(&irq->irq_lock, flags);
-		if (test_bit(i, &val)) {
-			/*
-			 * pending_latch is set irrespective of irq type
-			 * (level or edge) to avoid dependency that VM should
-			 * restore irq config before pending info.
-			 */
-			irq->pending_latch = true;
-			vgic_queue_irq_unlock(vcpu->kvm, irq, flags);
-		} else {
-			irq->pending_latch = false;
-			raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
-		}
-
-		vgic_put_irq(vcpu->kvm, irq);
-	}
-
-	return 0;
+	return vgic_uaccess_write_cpending(vcpu, addr, len, ~val);
 }
 
 /* We want to avoid outer shareable. */
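The rewritten userspace handler is reduced to two existing helpers: set every bit in `val`, then clear every bit in `~val`, leaving the pending state exactly equal to the written value. A tiny standalone model of that idiom on a plain bitmap (function names invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Set the bits of 'val' in the pending bitmap (ISPENDR-style write). */
static uint32_t write_spending(uint32_t pending, uint32_t val)
{
	return pending | val;
}

/* Clear the bits of 'val' in the pending bitmap (ICPENDR-style write). */
static uint32_t write_cpending(uint32_t pending, uint32_t val)
{
	return pending & ~val;
}

/*
 * Userspace write: set val's bits, then clear ~val's bits. Algebraically
 * this is (pending | val) & val == val, so the final state is exactly
 * the value written, regardless of the previous pending bits.
 */
static uint32_t uaccess_write_pending(uint32_t pending, uint32_t val)
{
	return write_cpending(write_spending(pending, val), ~val);
}
```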
arch/arm64/kvm/vgic/vgic-mmio.c (+43 -58)
@@ -301,9 +301,8 @@
	       vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V2);
 }
 
-void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
-			      gpa_t addr, unsigned int len,
-			      unsigned long val)
+static void __set_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len,
+			  unsigned long val, bool is_user)
 {
	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
	int i;
@@ -311,13 +312,21 @@
	for_each_set_bit(i, &val, len * 8) {
		struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i);
 
-		/* GICD_ISPENDR0 SGI bits are WI */
-		if (is_vgic_v2_sgi(vcpu, irq)) {
+		/* GICD_ISPENDR0 SGI bits are WI when written from the guest. */
+		if (is_vgic_v2_sgi(vcpu, irq) && !is_user) {
			vgic_put_irq(vcpu->kvm, irq);
			continue;
		}
 
		raw_spin_lock_irqsave(&irq->irq_lock, flags);
+
+		/*
+		 * GICv2 SGIs are terribly broken. We can't restore
+		 * the source of the interrupt, so just pick the vcpu
+		 * itself as the source...
+		 */
+		if (is_vgic_v2_sgi(vcpu, irq))
+			irq->source |= BIT(vcpu->vcpu_id);
 
		if (irq->hw && vgic_irq_is_sgi(irq->intid)) {
			/* HW SGI? Ask the GIC to inject it */
@@ -342,7 +335,7 @@
		}
 
		irq->pending_latch = true;
-		if (irq->hw)
+		if (irq->hw && !is_user)
			vgic_irq_set_phys_active(irq, true);
 
		vgic_queue_irq_unlock(vcpu->kvm, irq, flags);
@@ -350,32 +343,17 @@
	}
 }
 
+void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
+			      gpa_t addr, unsigned int len,
+			      unsigned long val)
+{
+	__set_pending(vcpu, addr, len, val, false);
+}
+
 int vgic_uaccess_write_spending(struct kvm_vcpu *vcpu,
				gpa_t addr, unsigned int len,
				unsigned long val)
 {
-	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
-	int i;
-	unsigned long flags;
-
-	for_each_set_bit(i, &val, len * 8) {
-		struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i);
-
-		raw_spin_lock_irqsave(&irq->irq_lock, flags);
-		irq->pending_latch = true;
-
-		/*
-		 * GICv2 SGIs are terribly broken. We can't restore
-		 * the source of the interrupt, so just pick the vcpu
-		 * itself as the source...
-		 */
-		if (is_vgic_v2_sgi(vcpu, irq))
-			irq->source |= BIT(vcpu->vcpu_id);
-
-		vgic_queue_irq_unlock(vcpu->kvm, irq, flags);
-
-		vgic_put_irq(vcpu->kvm, irq);
-	}
-
+	__set_pending(vcpu, addr, len, val, true);
	return 0;
 }
 
@@ -386,9 +394,9 @@
	vgic_irq_set_phys_active(irq, false);
 }
 
-void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
-			      gpa_t addr, unsigned int len,
-			      unsigned long val)
+static void __clear_pending(struct kvm_vcpu *vcpu,
			    gpa_t addr, unsigned int len,
+			    unsigned long val, bool is_user)
 {
	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
	int i;
@@ -397,13 +405,21 @@
	for_each_set_bit(i, &val, len * 8) {
		struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i);
 
-		/* GICD_ICPENDR0 SGI bits are WI */
-		if (is_vgic_v2_sgi(vcpu, irq)) {
+		/* GICD_ICPENDR0 SGI bits are WI when written from the guest. */
+		if (is_vgic_v2_sgi(vcpu, irq) && !is_user) {
			vgic_put_irq(vcpu->kvm, irq);
			continue;
		}
 
		raw_spin_lock_irqsave(&irq->irq_lock, flags);
+
+		/*
+		 * More fun with GICv2 SGIs! If we're clearing one of them
+		 * from userspace, which source vcpu to clear? Let's not
+		 * even think of it, and blow the whole set.
+		 */
+		if (is_vgic_v2_sgi(vcpu, irq))
+			irq->source = 0;
 
		if (irq->hw && vgic_irq_is_sgi(irq->intid)) {
			/* HW SGI?
371 - */ 372 - if (is_vgic_v2_sgi(vcpu, irq)) 373 - irq->source |= BIT(vcpu->vcpu_id); 374 - 375 - vgic_queue_irq_unlock(vcpu->kvm, irq, flags); 376 - 377 - vgic_put_irq(vcpu->kvm, irq); 378 - } 379 - 357 + __set_pending(vcpu, addr, len, val, true); 380 358 return 0; 381 359 } 382 360 ··· 386 394 vgic_irq_set_phys_active(irq, false); 387 395 } 388 396 389 - void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu, 390 - gpa_t addr, unsigned int len, 391 - unsigned long val) 397 + static void __clear_pending(struct kvm_vcpu *vcpu, 398 + gpa_t addr, unsigned int len, 399 + unsigned long val, bool is_user) 392 400 { 393 401 u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 394 402 int i; ··· 397 405 for_each_set_bit(i, &val, len * 8) { 398 406 struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); 399 407 400 - /* GICD_ICPENDR0 SGI bits are WI */ 401 - if (is_vgic_v2_sgi(vcpu, irq)) { 408 + /* GICD_ICPENDR0 SGI bits are WI when written from the guest. */ 409 + if (is_vgic_v2_sgi(vcpu, irq) && !is_user) { 402 410 vgic_put_irq(vcpu->kvm, irq); 403 411 continue; 404 412 } 405 413 406 414 raw_spin_lock_irqsave(&irq->irq_lock, flags); 415 + 416 + /* 417 + * More fun with GICv2 SGIs! If we're clearing one of them 418 + * from userspace, which source vcpu to clear? Let's not 419 + * even think of it, and blow the whole set. 420 + */ 421 + if (is_vgic_v2_sgi(vcpu, irq)) 422 + irq->source = 0; 407 423 408 424 if (irq->hw && vgic_irq_is_sgi(irq->intid)) { 409 425 /* HW SGI? 
Ask the GIC to clear its pending bit */ ··· 427 427 continue; 428 428 } 429 429 430 - if (irq->hw) 430 + if (irq->hw && !is_user) 431 431 vgic_hw_irq_cpending(vcpu, irq); 432 432 else 433 433 irq->pending_latch = false; ··· 437 437 } 438 438 } 439 439 440 + void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu, 441 + gpa_t addr, unsigned int len, 442 + unsigned long val) 443 + { 444 + __clear_pending(vcpu, addr, len, val, false); 445 + } 446 + 440 447 int vgic_uaccess_write_cpending(struct kvm_vcpu *vcpu, 441 448 gpa_t addr, unsigned int len, 442 449 unsigned long val) 443 450 { 444 - u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 445 - int i; 446 - unsigned long flags; 447 - 448 - for_each_set_bit(i, &val, len * 8) { 449 - struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); 450 - 451 - raw_spin_lock_irqsave(&irq->irq_lock, flags); 452 - /* 453 - * More fun with GICv2 SGIs! If we're clearing one of them 454 - * from userspace, which source vcpu to clear? Let's not 455 - * even think of it, and blow the whole set. 456 - */ 457 - if (is_vgic_v2_sgi(vcpu, irq)) 458 - irq->source = 0; 459 - 460 - irq->pending_latch = false; 461 - 462 - raw_spin_unlock_irqrestore(&irq->irq_lock, flags); 463 - 464 - vgic_put_irq(vcpu->kvm, irq); 465 - } 466 - 451 + __clear_pending(vcpu, addr, len, val, true); 467 452 return 0; 468 453 } 469 454
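The vgic-mmio.c hunk above folds the former guest-MMIO and userspace (uaccess) pending writers into shared `__set_pending()`/`__clear_pending()` helpers selected by an `is_user` flag. A minimal userspace C model of that pattern (toy types and names, not kernel code) might look like:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the shared pending helpers: one function serves both the
 * guest MMIO path and the userspace restore path, with is_user selecting
 * the semantics that differ between them. Illustrative only. */
struct toy_irq {
	bool pending_latch;
	uint32_t source;   /* GICv2 SGI source bitmap */
	bool is_v2_sgi;
};

static void toy_set_pending(struct toy_irq *irq, int vcpu_id, bool is_user)
{
	/* Guest writes to GICv2 SGI pending bits are write-ignored;
	 * userspace restore writes are not. */
	if (irq->is_v2_sgi && !is_user)
		return;

	/* For SGIs the original source vcpu cannot be restored, so the
	 * target vcpu itself is recorded as the source. */
	if (irq->is_v2_sgi)
		irq->source |= UINT32_C(1) << vcpu_id;

	irq->pending_latch = true;
}

static void toy_clear_pending(struct toy_irq *irq, bool is_user)
{
	if (irq->is_v2_sgi && !is_user)
		return;

	/* No way to pick which source to clear: blow the whole set. */
	if (irq->is_v2_sgi)
		irq->source = 0;

	irq->pending_latch = false;
}
```

The point of the refactor is that the two callers differ only in the `is_user` argument, so the SGI write-ignore rule and the source-bitmap handling live in exactly one place each.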
+22 -3
arch/loongarch/include/asm/kvm_host.h
··· 45 45 u64 signal_exits;
46 46 };
47 47
48 + #define KVM_MEM_HUGEPAGE_CAPABLE (1UL << 0)
49 + #define KVM_MEM_HUGEPAGE_INCAPABLE (1UL << 1)
48 50 struct kvm_arch_memory_slot {
51 + unsigned long flags;
49 52 };
50 53
51 54 struct kvm_context {
··· 95 92 };
96 93
97 94 #define KVM_LARCH_FPU (0x1 << 0)
98 - #define KVM_LARCH_SWCSR_LATEST (0x1 << 1)
99 - #define KVM_LARCH_HWCSR_USABLE (0x1 << 2)
95 + #define KVM_LARCH_LSX (0x1 << 1)
96 + #define KVM_LARCH_LASX (0x1 << 2)
97 + #define KVM_LARCH_SWCSR_LATEST (0x1 << 3)
98 + #define KVM_LARCH_HWCSR_USABLE (0x1 << 4)
100 99
101 100 struct kvm_vcpu_arch {
102 101 /*
··· 180 175 csr->csrs[reg] = val;
181 176 }
182 177
178 + static inline bool kvm_guest_has_fpu(struct kvm_vcpu_arch *arch)
179 + {
180 + return arch->cpucfg[2] & CPUCFG2_FP;
181 + }
182 +
183 + static inline bool kvm_guest_has_lsx(struct kvm_vcpu_arch *arch)
184 + {
185 + return arch->cpucfg[2] & CPUCFG2_LSX;
186 + }
187 +
188 + static inline bool kvm_guest_has_lasx(struct kvm_vcpu_arch *arch)
189 + {
190 + return arch->cpucfg[2] & CPUCFG2_LASX;
191 + }
192 +
183 193 /* Debug: dump vcpu state */
184 194 int kvm_arch_vcpu_dump_regs(struct kvm_vcpu *vcpu);
185 195
··· 203 183 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
204 184 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
205 185
206 - #define KVM_ARCH_WANT_MMU_NOTIFIER
207 186 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
208 187 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable);
209 188 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+20 -1
arch/loongarch/include/asm/kvm_vcpu.h
··· 55 55 void kvm_restore_fpu(struct loongarch_fpu *fpu);
56 56 void kvm_restore_fcsr(struct loongarch_fpu *fpu);
57 57
58 - void kvm_acquire_timer(struct kvm_vcpu *vcpu);
58 + #ifdef CONFIG_CPU_HAS_LSX
59 + int kvm_own_lsx(struct kvm_vcpu *vcpu);
60 + void kvm_save_lsx(struct loongarch_fpu *fpu);
61 + void kvm_restore_lsx(struct loongarch_fpu *fpu);
62 + #else
63 + static inline int kvm_own_lsx(struct kvm_vcpu *vcpu) { return -EINVAL; }
64 + static inline void kvm_save_lsx(struct loongarch_fpu *fpu) { }
65 + static inline void kvm_restore_lsx(struct loongarch_fpu *fpu) { }
66 + #endif
67 +
68 + #ifdef CONFIG_CPU_HAS_LASX
69 + int kvm_own_lasx(struct kvm_vcpu *vcpu);
70 + void kvm_save_lasx(struct loongarch_fpu *fpu);
71 + void kvm_restore_lasx(struct loongarch_fpu *fpu);
72 + #else
73 + static inline int kvm_own_lasx(struct kvm_vcpu *vcpu) { return -EINVAL; }
74 + static inline void kvm_save_lasx(struct loongarch_fpu *fpu) { }
75 + static inline void kvm_restore_lasx(struct loongarch_fpu *fpu) { }
76 + #endif
77 +
59 78 void kvm_init_timer(struct kvm_vcpu *vcpu, unsigned long hz);
60 79 void kvm_reset_timer(struct kvm_vcpu *vcpu);
61 80 void kvm_save_timer(struct kvm_vcpu *vcpu);
+1
arch/loongarch/include/uapi/asm/kvm.h
··· 79 79 #define LOONGARCH_REG_64(TYPE, REG) (TYPE | KVM_REG_SIZE_U64 | (REG << LOONGARCH_REG_SHIFT))
80 80 #define KVM_IOC_CSRID(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, REG)
81 81 #define KVM_IOC_CPUCFG(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG)
82 + #define KVM_LOONGARCH_VCPU_CPUCFG 0
82 83
83 84 struct kvm_debug_exit_arch {
84 85 };
+2
arch/loongarch/kernel/fpu.S
··· 349 349 lsx_restore_all_upper a0 t0 t1
350 350 jr ra
351 351 SYM_FUNC_END(_restore_lsx_upper)
352 + EXPORT_SYMBOL(_restore_lsx_upper)
352 353
353 354 SYM_FUNC_START(_init_lsx_upper)
354 355 lsx_init_all_upper t1
··· 385 384 lasx_restore_all_upper a0 t0 t1
386 385 jr ra
387 386 SYM_FUNC_END(_restore_lasx_upper)
387 + EXPORT_SYMBOL(_restore_lasx_upper)
388 388
389 389 SYM_FUNC_START(_init_lasx_upper)
390 390 lasx_init_all_upper t1
+2 -3
arch/loongarch/kvm/Kconfig
··· 22 22 depends on AS_HAS_LVZ_EXTENSION
23 23 depends on HAVE_KVM
24 24 select HAVE_KVM_DIRTY_RING_ACQ_REL
25 - select HAVE_KVM_EVENTFD
26 25 select HAVE_KVM_VCPU_ASYNC_IOCTL
26 + select KVM_COMMON
27 27 select KVM_GENERIC_DIRTYLOG_READ_PROTECT
28 28 select KVM_GENERIC_HARDWARE_ENABLING
29 + select KVM_GENERIC_MMU_NOTIFIER
29 30 select KVM_MMIO
30 31 select KVM_XFER_TO_GUEST_WORK
31 - select MMU_NOTIFIER
32 - select PREEMPT_NOTIFIERS
33 32 help
34 33 Support hosting virtualized guest machines using
35 34 hardware virtualization extensions. You will need
+39 -11
arch/loongarch/kvm/exit.c
··· 200 200 ++vcpu->stat.idle_exits; 201 201 trace_kvm_exit_idle(vcpu, KVM_TRACE_EXIT_IDLE); 202 202 203 - if (!kvm_arch_vcpu_runnable(vcpu)) { 204 - /* 205 - * Switch to the software timer before halt-polling/blocking as 206 - * the guest's timer may be a break event for the vCPU, and the 207 - * hypervisor timer runs only when the CPU is in guest mode. 208 - * Switch before halt-polling so that KVM recognizes an expired 209 - * timer before blocking. 210 - */ 211 - kvm_save_timer(vcpu); 212 - kvm_vcpu_block(vcpu); 213 - } 203 + if (!kvm_arch_vcpu_runnable(vcpu)) 204 + kvm_vcpu_halt(vcpu); 214 205 215 206 return EMULATE_DONE; 216 207 } ··· 634 643 { 635 644 struct kvm_run *run = vcpu->run; 636 645 646 + if (!kvm_guest_has_fpu(&vcpu->arch)) { 647 + kvm_queue_exception(vcpu, EXCCODE_INE, 0); 648 + return RESUME_GUEST; 649 + } 650 + 637 651 /* 638 652 * If guest FPU not present, the FPU operation should have been 639 653 * treated as a reserved instruction! ··· 651 655 } 652 656 653 657 kvm_own_fpu(vcpu); 658 + 659 + return RESUME_GUEST; 660 + } 661 + 662 + /* 663 + * kvm_handle_lsx_disabled() - Guest used LSX while disabled in root. 664 + * @vcpu: Virtual CPU context. 665 + * 666 + * Handle when the guest attempts to use LSX when it is disabled in the root 667 + * context. 668 + */ 669 + static int kvm_handle_lsx_disabled(struct kvm_vcpu *vcpu) 670 + { 671 + if (kvm_own_lsx(vcpu)) 672 + kvm_queue_exception(vcpu, EXCCODE_INE, 0); 673 + 674 + return RESUME_GUEST; 675 + } 676 + 677 + /* 678 + * kvm_handle_lasx_disabled() - Guest used LASX while disabled in root. 679 + * @vcpu: Virtual CPU context. 680 + * 681 + * Handle when the guest attempts to use LASX when it is disabled in the root 682 + * context. 
683 + */ 684 + static int kvm_handle_lasx_disabled(struct kvm_vcpu *vcpu) 685 + { 686 + if (kvm_own_lasx(vcpu)) 687 + kvm_queue_exception(vcpu, EXCCODE_INE, 0); 654 688 655 689 return RESUME_GUEST; 656 690 } ··· 713 687 [EXCCODE_TLBS] = kvm_handle_write_fault, 714 688 [EXCCODE_TLBM] = kvm_handle_write_fault, 715 689 [EXCCODE_FPDIS] = kvm_handle_fpu_disabled, 690 + [EXCCODE_LSXDIS] = kvm_handle_lsx_disabled, 691 + [EXCCODE_LASXDIS] = kvm_handle_lasx_disabled, 716 692 [EXCCODE_GSPR] = kvm_handle_gspr, 717 693 }; 718 694
-1
arch/loongarch/kvm/main.c
··· 287 287 if (env & CSR_GCFG_MATC_ROOT)
288 288 gcfg |= CSR_GCFG_MATC_ROOT;
289 289
290 - gcfg |= CSR_GCFG_TIT;
291 290 write_csr_gcfg(gcfg);
292 291
293 292 kvm_flush_tlb_all();
+83 -41
arch/loongarch/kvm/mmu.c
··· 13 13 #include <asm/tlb.h> 14 14 #include <asm/kvm_mmu.h> 15 15 16 + static inline bool kvm_hugepage_capable(struct kvm_memory_slot *slot) 17 + { 18 + return slot->arch.flags & KVM_MEM_HUGEPAGE_CAPABLE; 19 + } 20 + 21 + static inline bool kvm_hugepage_incapable(struct kvm_memory_slot *slot) 22 + { 23 + return slot->arch.flags & KVM_MEM_HUGEPAGE_INCAPABLE; 24 + } 25 + 16 26 static inline void kvm_ptw_prepare(struct kvm *kvm, kvm_ptw_ctx *ctx) 17 27 { 18 28 ctx->level = kvm->arch.root_level; ··· 375 365 kvm_ptw_top(kvm->arch.pgd, start << PAGE_SHIFT, end << PAGE_SHIFT, &ctx); 376 366 } 377 367 368 + int kvm_arch_prepare_memory_region(struct kvm *kvm, const struct kvm_memory_slot *old, 369 + struct kvm_memory_slot *new, enum kvm_mr_change change) 370 + { 371 + gpa_t gpa_start; 372 + hva_t hva_start; 373 + size_t size, gpa_offset, hva_offset; 374 + 375 + if ((change != KVM_MR_MOVE) && (change != KVM_MR_CREATE)) 376 + return 0; 377 + /* 378 + * Prevent userspace from creating a memory region outside of the 379 + * VM GPA address space 380 + */ 381 + if ((new->base_gfn + new->npages) > (kvm->arch.gpa_size >> PAGE_SHIFT)) 382 + return -ENOMEM; 383 + 384 + new->arch.flags = 0; 385 + size = new->npages * PAGE_SIZE; 386 + gpa_start = new->base_gfn << PAGE_SHIFT; 387 + hva_start = new->userspace_addr; 388 + if (IS_ALIGNED(size, PMD_SIZE) && IS_ALIGNED(gpa_start, PMD_SIZE) 389 + && IS_ALIGNED(hva_start, PMD_SIZE)) 390 + new->arch.flags |= KVM_MEM_HUGEPAGE_CAPABLE; 391 + else { 392 + /* 393 + * Pages belonging to memslots that don't have the same 394 + * alignment within a PMD for userspace and GPA cannot be 395 + * mapped with PMD entries, because we'll end up mapping 396 + * the wrong pages. 
397 + * 398 + * Consider a layout like the following: 399 + * 400 + * memslot->userspace_addr: 401 + * +-----+--------------------+--------------------+---+ 402 + * |abcde|fgh Stage-1 block | Stage-1 block tv|xyz| 403 + * +-----+--------------------+--------------------+---+ 404 + * 405 + * memslot->base_gfn << PAGE_SIZE: 406 + * +---+--------------------+--------------------+-----+ 407 + * |abc|def Stage-2 block | Stage-2 block |tvxyz| 408 + * +---+--------------------+--------------------+-----+ 409 + * 410 + * If we create those stage-2 blocks, we'll end up with this 411 + * incorrect mapping: 412 + * d -> f 413 + * e -> g 414 + * f -> h 415 + */ 416 + gpa_offset = gpa_start & (PMD_SIZE - 1); 417 + hva_offset = hva_start & (PMD_SIZE - 1); 418 + if (gpa_offset != hva_offset) { 419 + new->arch.flags |= KVM_MEM_HUGEPAGE_INCAPABLE; 420 + } else { 421 + if (gpa_offset == 0) 422 + gpa_offset = PMD_SIZE; 423 + if ((size + gpa_offset) < (PMD_SIZE * 2)) 424 + new->arch.flags |= KVM_MEM_HUGEPAGE_INCAPABLE; 425 + } 426 + } 427 + 428 + return 0; 429 + } 430 + 378 431 void kvm_arch_commit_memory_region(struct kvm *kvm, 379 432 struct kvm_memory_slot *old, 380 433 const struct kvm_memory_slot *new, ··· 635 562 } 636 563 637 564 static bool fault_supports_huge_mapping(struct kvm_memory_slot *memslot, 638 - unsigned long hva, unsigned long map_size, bool write) 565 + unsigned long hva, bool write) 639 566 { 640 - size_t size; 641 - gpa_t gpa_start; 642 - hva_t uaddr_start, uaddr_end; 567 + hva_t start, end; 643 568 644 569 /* Disable dirty logging on HugePages */ 645 570 if (kvm_slot_dirty_track_enabled(memslot) && write) 646 571 return false; 647 572 648 - size = memslot->npages * PAGE_SIZE; 649 - gpa_start = memslot->base_gfn << PAGE_SHIFT; 650 - uaddr_start = memslot->userspace_addr; 651 - uaddr_end = uaddr_start + size; 573 + if (kvm_hugepage_capable(memslot)) 574 + return true; 652 575 653 - /* 654 - * Pages belonging to memslots that don't have the same alignment 655 - * 
within a PMD for userspace and GPA cannot be mapped with stage-2 656 - * PMD entries, because we'll end up mapping the wrong pages. 657 - * 658 - * Consider a layout like the following: 659 - * 660 - * memslot->userspace_addr: 661 - * +-----+--------------------+--------------------+---+ 662 - * |abcde|fgh Stage-1 block | Stage-1 block tv|xyz| 663 - * +-----+--------------------+--------------------+---+ 664 - * 665 - * memslot->base_gfn << PAGE_SIZE: 666 - * +---+--------------------+--------------------+-----+ 667 - * |abc|def Stage-2 block | Stage-2 block |tvxyz| 668 - * +---+--------------------+--------------------+-----+ 669 - * 670 - * If we create those stage-2 blocks, we'll end up with this incorrect 671 - * mapping: 672 - * d -> f 673 - * e -> g 674 - * f -> h 675 - */ 676 - if ((gpa_start & (map_size - 1)) != (uaddr_start & (map_size - 1))) 576 + if (kvm_hugepage_incapable(memslot)) 677 577 return false; 578 + 579 + start = memslot->userspace_addr; 580 + end = start + memslot->npages * PAGE_SIZE; 678 581 679 582 /* 680 583 * Next, let's make sure we're not trying to map anything not covered ··· 664 615 * userspace_addr or the base_gfn, as both are equally aligned (per 665 616 * the check above) and equally sized. 
666 617 */ 667 - return (hva & ~(map_size - 1)) >= uaddr_start && 668 - (hva & ~(map_size - 1)) + map_size <= uaddr_end; 618 + return (hva >= ALIGN(start, PMD_SIZE)) && (hva < ALIGN_DOWN(end, PMD_SIZE)); 669 619 } 670 620 671 621 /* ··· 890 842 891 843 /* Disable dirty logging on HugePages */ 892 844 level = 0; 893 - if (!fault_supports_huge_mapping(memslot, hva, PMD_SIZE, write)) { 845 + if (!fault_supports_huge_mapping(memslot, hva, write)) { 894 846 level = 0; 895 847 } else { 896 848 level = host_pfn_mapping_level(kvm, gfn, memslot); ··· 947 899 948 900 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot) 949 901 { 950 - } 951 - 952 - int kvm_arch_prepare_memory_region(struct kvm *kvm, const struct kvm_memory_slot *old, 953 - struct kvm_memory_slot *new, enum kvm_mr_change change) 954 - { 955 - return 0; 956 902 } 957 903 958 904 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
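The mmu.c changes above move the PMD-alignment analysis out of the per-fault path: `kvm_arch_prepare_memory_region()` now classifies each memslot once at creation, and `fault_supports_huge_mapping()` just consults the cached flags. A simplified, self-contained sketch of the classification arithmetic (collapsing the kernel's three-way capable/incapable/check-at-fault outcome into a boolean, with illustrative names):

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PMD_SIZE 0x200000UL	/* 2 MiB, the usual PMD granule */

/* Toy model: a slot can back PMD-level (huge) mappings only if the GPA
 * and HVA share the same offset within a PMD -- otherwise a PMD entry
 * would map the wrong pages -- and the slot actually covers at least
 * one whole PMD block. Names are illustrative, not the kernel's. */
static bool toy_hugepage_capable(unsigned long gpa_start,
				 unsigned long hva_start, size_t size)
{
	unsigned long gpa_off = gpa_start & (TOY_PMD_SIZE - 1);
	unsigned long hva_off = hva_start & (TOY_PMD_SIZE - 1);

	/* Mismatched sub-PMD offsets: stage-2 blocks would be skewed
	 * against the stage-1 layout (the d->f, e->g, f->h case). */
	if (gpa_off != hva_off)
		return false;

	/* Fully PMD-aligned start and size: always capable. */
	if (gpa_off == 0 && (size % TOY_PMD_SIZE) == 0)
		return true;

	/* Partially aligned: there must be room for at least one whole
	 * PMD block between the unaligned head and tail. */
	if (gpa_off == 0)
		gpa_off = TOY_PMD_SIZE;
	return (size + gpa_off) >= TOY_PMD_SIZE * 2;
}
```

Doing this once per memslot is both cheaper and catches the misaligned case before any stage-2 block is ever created.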
+31
arch/loongarch/kvm/switch.S
··· 245 245 jr ra
246 246 SYM_FUNC_END(kvm_restore_fpu)
247 247
248 + #ifdef CONFIG_CPU_HAS_LSX
249 + SYM_FUNC_START(kvm_save_lsx)
250 + fpu_save_csr a0 t1
251 + fpu_save_cc a0 t1 t2
252 + lsx_save_data a0 t1
253 + jr ra
254 + SYM_FUNC_END(kvm_save_lsx)
255 +
256 + SYM_FUNC_START(kvm_restore_lsx)
257 + lsx_restore_data a0 t1
258 + fpu_restore_cc a0 t1 t2
259 + fpu_restore_csr a0 t1 t2
260 + jr ra
261 + SYM_FUNC_END(kvm_restore_lsx)
262 + #endif
263 +
264 + #ifdef CONFIG_CPU_HAS_LASX
265 + SYM_FUNC_START(kvm_save_lasx)
266 + fpu_save_csr a0 t1
267 + fpu_save_cc a0 t1 t2
268 + lasx_save_data a0 t1
269 + jr ra
270 + SYM_FUNC_END(kvm_save_lasx)
271 +
272 + SYM_FUNC_START(kvm_restore_lasx)
273 + lasx_restore_data a0 t1
274 + fpu_restore_cc a0 t1 t2
275 + fpu_restore_csr a0 t1 t2
276 + jr ra
277 + SYM_FUNC_END(kvm_restore_lasx)
278 + #endif
248 279 .section ".rodata"
249 280 SYM_DATA(kvm_exception_size, .quad kvm_exc_entry_end - kvm_exc_entry)
250 281 SYM_DATA(kvm_enter_guest_size, .quad kvm_enter_guest_end - kvm_enter_guest)
+75 -52
arch/loongarch/kvm/timer.c
··· 65 65 } 66 66 67 67 /* 68 - * Restore hard timer state and enable guest to access timer registers 69 - * without trap, should be called with irq disabled 70 - */ 71 - void kvm_acquire_timer(struct kvm_vcpu *vcpu) 72 - { 73 - unsigned long cfg; 74 - 75 - cfg = read_csr_gcfg(); 76 - if (!(cfg & CSR_GCFG_TIT)) 77 - return; 78 - 79 - /* Enable guest access to hard timer */ 80 - write_csr_gcfg(cfg & ~CSR_GCFG_TIT); 81 - 82 - /* 83 - * Freeze the soft-timer and sync the guest stable timer with it. We do 84 - * this with interrupts disabled to avoid latency. 85 - */ 86 - hrtimer_cancel(&vcpu->arch.swtimer); 87 - } 88 - 89 - /* 90 68 * Restore soft timer state from saved context. 91 69 */ 92 70 void kvm_restore_timer(struct kvm_vcpu *vcpu) 93 71 { 94 - unsigned long cfg, delta, period; 72 + unsigned long cfg, estat; 73 + unsigned long ticks, delta, period; 95 74 ktime_t expire, now; 96 75 struct loongarch_csrs *csr = vcpu->arch.csr; 97 76 98 77 /* 99 78 * Set guest stable timer cfg csr 79 + * Disable timer before restore estat CSR register, avoid to 80 + * get invalid timer interrupt for old timer cfg 100 81 */ 101 82 cfg = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_TCFG); 83 + 84 + write_gcsr_timercfg(0); 102 85 kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_ESTAT); 103 86 kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_TCFG); 104 87 if (!(cfg & CSR_TCFG_EN)) { ··· 91 108 } 92 109 93 110 /* 111 + * Freeze the soft-timer and sync the guest stable timer with it. 112 + */ 113 + hrtimer_cancel(&vcpu->arch.swtimer); 114 + 115 + /* 116 + * From LoongArch Reference Manual Volume 1 Chapter 7.6.2 117 + * If oneshot timer is fired, CSR TVAL will be -1, there are two 118 + * conditions: 119 + * 1) timer is fired during exiting to host 120 + * 2) timer is fired and vm is doing timer irq, and then exiting to 121 + * host. 
Host should not inject timer irq to avoid spurious 122 + * timer interrupt again 123 + */ 124 + ticks = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_TVAL); 125 + estat = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_ESTAT); 126 + if (!(cfg & CSR_TCFG_PERIOD) && (ticks > cfg)) { 127 + /* 128 + * Writing 0 to LOONGARCH_CSR_TVAL will inject timer irq 129 + * and set CSR TVAL with -1 130 + */ 131 + write_gcsr_timertick(0); 132 + 133 + /* 134 + * Writing CSR_TINTCLR_TI to LOONGARCH_CSR_TINTCLR will clear 135 + * timer interrupt, and CSR TVAL keeps unchanged with -1, it 136 + * avoids spurious timer interrupt 137 + */ 138 + if (!(estat & CPU_TIMER)) 139 + gcsr_write(CSR_TINTCLR_TI, LOONGARCH_CSR_TINTCLR); 140 + return; 141 + } 142 + 143 + /* 94 144 * Set remainder tick value if not expired 95 145 */ 146 + delta = 0; 96 147 now = ktime_get(); 97 148 expire = vcpu->arch.expire; 98 149 if (ktime_before(now, expire)) 99 150 delta = ktime_to_tick(vcpu, ktime_sub(expire, now)); 100 - else { 101 - if (cfg & CSR_TCFG_PERIOD) { 102 - period = cfg & CSR_TCFG_VAL; 103 - delta = ktime_to_tick(vcpu, ktime_sub(now, expire)); 104 - delta = period - (delta % period); 105 - } else 106 - delta = 0; 151 + else if (cfg & CSR_TCFG_PERIOD) { 152 + period = cfg & CSR_TCFG_VAL; 153 + delta = ktime_to_tick(vcpu, ktime_sub(now, expire)); 154 + delta = period - (delta % period); 155 + 107 156 /* 108 157 * Inject timer here though sw timer should inject timer 109 158 * interrupt async already, since sw timer may be cancelled 110 - * during injecting intr async in function kvm_acquire_timer 159 + * during injecting intr async 111 160 */ 112 161 kvm_queue_irq(vcpu, INT_TI); 113 162 } ··· 154 139 */ 155 140 static void _kvm_save_timer(struct kvm_vcpu *vcpu) 156 141 { 157 - unsigned long ticks, delta; 142 + unsigned long ticks, delta, cfg; 158 143 ktime_t expire; 159 144 struct loongarch_csrs *csr = vcpu->arch.csr; 160 145 146 + cfg = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_TCFG); 161 147 ticks = kvm_read_sw_gcsr(csr, 
LOONGARCH_CSR_TVAL);
162 - delta = tick_to_ns(vcpu, ticks);
163 - expire = ktime_add_ns(ktime_get(), delta);
164 - vcpu->arch.expire = expire;
165 - if (ticks) {
148 +
149 + /*
150 + * From LoongArch Reference Manual Volume 1 Chapter 7.6.2
151 + * If the periodic timer is fired, CSR TVAL will be reloaded from CSR TCFG
152 + * If the one-shot timer is fired, CSR TVAL will be -1
153 + * A fired one-shot timer is detected here by checking whether TVAL
154 + * is larger than TCFG
155 + */
156 + if (ticks < cfg) {
157 + delta = tick_to_ns(vcpu, ticks);
158 + expire = ktime_add_ns(ktime_get(), delta);
159 + vcpu->arch.expire = expire;
160 +
166 161 /*
167 - * Update hrtimer to use new timeout
168 162 * HRTIMER_MODE_PINNED is suggested since vcpu may run in
169 163 * the same physical cpu in next time
170 164 */
171 - hrtimer_cancel(&vcpu->arch.swtimer);
172 165 hrtimer_start(&vcpu->arch.swtimer, expire, HRTIMER_MODE_ABS_PINNED);
173 - } else
166 + } else if (vcpu->stat.generic.blocking) {
174 167 /*
175 - * Inject timer interrupt so that hall polling can dectect and exit
168 + * Inject a timer interrupt so that halt polling can detect it and exit.
169 + * The VCPU is already scheduled out and sleeps in the rcuwait queue and
170 + * will not poll pending events again. kvm_queue_irq() is not enough;
171 + * the hrtimer swtimer should be used here.
176 172 */
177 - kvm_queue_irq(vcpu, INT_TI);
173 + expire = ktime_add_ns(ktime_get(), 10);
174 + vcpu->arch.expire = expire;
175 + hrtimer_start(&vcpu->arch.swtimer, expire, HRTIMER_MODE_ABS_PINNED);
176 + }
178 177 }
179 178
180 179 /*
··· 197 168 */
198 169 void kvm_save_timer(struct kvm_vcpu *vcpu)
199 170 {
200 - unsigned long cfg;
201 171 struct loongarch_csrs *csr = vcpu->arch.csr;
202 172
203 173 preempt_disable();
204 - cfg = read_csr_gcfg();
205 - if (!(cfg & CSR_GCFG_TIT)) {
206 - /* Disable guest use of hard timer */
207 - write_csr_gcfg(cfg | CSR_GCFG_TIT);
208 174
209 - /* Save hard timer state */
210 - kvm_save_hw_gcsr(csr, LOONGARCH_CSR_TCFG);
211 - kvm_save_hw_gcsr(csr, LOONGARCH_CSR_TVAL);
212 - if (kvm_read_sw_gcsr(csr, LOONGARCH_CSR_TCFG) & CSR_TCFG_EN)
213 - _kvm_save_timer(vcpu);
214 - }
175 + /* Save hard timer state */
176 + kvm_save_hw_gcsr(csr, LOONGARCH_CSR_TCFG);
177 + kvm_save_hw_gcsr(csr, LOONGARCH_CSR_TVAL);
178 + if (kvm_read_sw_gcsr(csr, LOONGARCH_CSR_TCFG) & CSR_TCFG_EN)
179 + _kvm_save_timer(vcpu);
215 180
216 181 /* Save timer-related state to vCPU context */
217 182 kvm_save_hw_gcsr(csr, LOONGARCH_CSR_ESTAT);
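The timer.c changes above key off a LoongArch hardware detail: per the LoongArch Reference Manual (Volume 1, Section 7.6.2), a fired one-shot timer leaves CSR.TVAL at -1, which as an unsigned value compares greater than any programmed TCFG count. Comparing TVAL against TCFG therefore tells whether the timer is still counting. A toy model of that check (illustrative name, not kernel code):

```c
#include <stdbool.h>

/* Toy model: a one-shot timer that has fired reads back TVAL == -1,
 * which is always larger than the programmed TCFG count, so
 * "tval < tcfg" means the timer is still running. */
static bool toy_timer_still_running(unsigned long tval, unsigned long tcfg)
{
	return tval < tcfg;
}
```

This is why the save path arms the soft timer only when `ticks < cfg`, and the restore path treats `ticks > cfg` on a non-periodic timer as "already fired, don't inject again".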
+5 -1
arch/loongarch/kvm/trace.h
··· 102 102 #define KVM_TRACE_AUX_DISCARD 4
103 103
104 104 #define KVM_TRACE_AUX_FPU 1
105 + #define KVM_TRACE_AUX_LSX 2
106 + #define KVM_TRACE_AUX_LASX 3
105 107
106 108 #define kvm_trace_symbol_aux_op \
107 109 { KVM_TRACE_AUX_SAVE, "save" }, \
··· 113 111 { KVM_TRACE_AUX_DISCARD, "discard" }
114 112
115 113 #define kvm_trace_symbol_aux_state \
116 - { KVM_TRACE_AUX_FPU, "FPU" }
114 + { KVM_TRACE_AUX_FPU, "FPU" }, \
115 + { KVM_TRACE_AUX_LSX, "LSX" }, \
116 + { KVM_TRACE_AUX_LASX, "LASX" }
117 117
118 118 TRACE_EVENT(kvm_aux,
119 119 TP_PROTO(struct kvm_vcpu *vcpu, unsigned int op,
+272 -35
arch/loongarch/kvm/vcpu.c
··· 95 95 * check vmid before vcpu enter guest 96 96 */ 97 97 local_irq_disable(); 98 - kvm_acquire_timer(vcpu); 99 98 kvm_deliver_intr(vcpu); 100 99 kvm_deliver_exception(vcpu); 101 100 /* Make sure the vcpu mode has been written */ ··· 186 187 187 188 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu) 188 189 { 189 - return kvm_pending_timer(vcpu) || 190 + int ret; 191 + 192 + /* Protect from TOD sync and vcpu_load/put() */ 193 + preempt_disable(); 194 + ret = kvm_pending_timer(vcpu) || 190 195 kvm_read_hw_gcsr(LOONGARCH_CSR_ESTAT) & (1 << INT_TI); 196 + preempt_enable(); 197 + 198 + return ret; 191 199 } 192 200 193 201 int kvm_arch_vcpu_dump_regs(struct kvm_vcpu *vcpu) ··· 250 244 return -EINVAL; 251 245 } 252 246 253 - /** 254 - * kvm_migrate_count() - Migrate timer. 255 - * @vcpu: Virtual CPU. 256 - * 257 - * Migrate hrtimer to the current CPU by cancelling and restarting it 258 - * if the hrtimer is active. 259 - * 260 - * Must be called when the vCPU is migrated to a different CPU, so that 261 - * the timer can interrupt the guest at the new CPU, and the timer irq can 262 - * be delivered to the vCPU. 
263 - */
264 - static void kvm_migrate_count(struct kvm_vcpu *vcpu)
265 - {
266 - if (hrtimer_cancel(&vcpu->arch.swtimer))
267 - hrtimer_restart(&vcpu->arch.swtimer);
268 - }
269 -
270 247 static int _kvm_getcsr(struct kvm_vcpu *vcpu, unsigned int id, u64 *val)
271 248 {
272 249 unsigned long gintc;
··· 295 306
296 307 kvm_write_sw_gcsr(csr, id, val);
297 308
309 + return ret;
310 + }
311 +
312 + static int _kvm_get_cpucfg(int id, u64 *v)
313 + {
314 + int ret = 0;
315 +
316 + if (id < 0 || id >= KVM_MAX_CPUCFG_REGS)
317 + return -EINVAL;
318 +
319 + switch (id) {
320 + case 2:
321 + /* Return CPUCFG2 features which have been supported by KVM */
322 + *v = CPUCFG2_FP | CPUCFG2_FPSP | CPUCFG2_FPDP |
323 + CPUCFG2_FPVERS | CPUCFG2_LLFTP | CPUCFG2_LLFTPREV |
324 + CPUCFG2_LAM;
325 + /*
326 + * If LSX is supported by the CPU, it is also supported by KVM,
327 + * as we implement it.
328 + */
329 + if (cpu_has_lsx)
330 + *v |= CPUCFG2_LSX;
331 + /*
332 + * If LASX is supported by the CPU, it is also supported by KVM,
333 + * as we implement it.
334 + */
335 + if (cpu_has_lasx)
336 + *v |= CPUCFG2_LASX;
337 +
338 + break;
339 + default:
340 + ret = -EINVAL;
341 + break;
342 + }
343 + return ret;
344 + }
345 +
346 + static int kvm_check_cpucfg(int id, u64 val)
347 + {
348 + u64 mask;
349 + int ret = 0;
350 +
351 + if (id < 0 || id >= KVM_MAX_CPUCFG_REGS)
352 + return -EINVAL;
353 +
354 + if (_kvm_get_cpucfg(id, &mask))
355 + return -EINVAL;
356 +
357 + switch (id) {
358 + case 2:
359 + /* CPUCFG2 features checking */
360 + if (val & ~mask)
361 + /* Unsupported features must not be set */
362 + ret = -EINVAL;
363 + else if (!(val & CPUCFG2_LLFTP))
364 + /* LLFTP must be set, as the guest must have a constant timer */
365 + ret = -EINVAL;
366 + else if ((val & CPUCFG2_FP) && (!(val & CPUCFG2_FPSP) || !(val & CPUCFG2_FPDP)))
367 + /* Single and double precision FP must both be set when FP is enabled */
368 + ret = -EINVAL;
369 + else if ((val & CPUCFG2_LSX) && !(val & CPUCFG2_FP))
370 + /* FP must be set when LSX is enabled */
371 + ret = -EINVAL;
372 + else if ((val & CPUCFG2_LASX) && !(val & CPUCFG2_LSX))
373 + /* LSX and FP must be set when LASX is enabled; FP has been checked before.
 */
+		ret = -EINVAL;
+		break;
+	default:
+		break;
+	}
 	return ret;
 }
 
···
 		break;
 	case KVM_REG_LOONGARCH_CPUCFG:
 		id = KVM_GET_IOC_CPUCFG_IDX(reg->id);
-		if (id >= 0 && id < KVM_MAX_CPUCFG_REGS)
-			vcpu->arch.cpucfg[id] = (u32)v;
-		else
-			ret = -EINVAL;
+		ret = kvm_check_cpucfg(id, v);
+		if (ret)
+			break;
+		vcpu->arch.cpucfg[id] = (u32)v;
 		break;
 	case KVM_REG_LOONGARCH_KVM:
 		switch (reg->id) {
···
 	return -EINVAL;
 }
 
+static int kvm_loongarch_cpucfg_has_attr(struct kvm_vcpu *vcpu,
+					 struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case 2:
+		return 0;
+	default:
+		return -ENXIO;
+	}
+
+	return -ENXIO;
+}
+
+static int kvm_loongarch_vcpu_has_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	case KVM_LOONGARCH_VCPU_CPUCFG:
+		ret = kvm_loongarch_cpucfg_has_attr(vcpu, attr);
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+static int kvm_loongarch_get_cpucfg_attr(struct kvm_vcpu *vcpu,
+					 struct kvm_device_attr *attr)
+{
+	int ret = 0;
+	uint64_t val;
+	uint64_t __user *uaddr = (uint64_t __user *)attr->addr;
+
+	ret = _kvm_get_cpucfg(attr->attr, &val);
+	if (ret)
+		return ret;
+
+	put_user(val, uaddr);
+
+	return ret;
+}
+
+static int kvm_loongarch_vcpu_get_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	case KVM_LOONGARCH_VCPU_CPUCFG:
+		ret = kvm_loongarch_get_cpucfg_attr(vcpu, attr);
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+static int kvm_loongarch_cpucfg_set_attr(struct kvm_vcpu *vcpu,
+					 struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static int kvm_loongarch_vcpu_set_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	case KVM_LOONGARCH_VCPU_CPUCFG:
+		ret = kvm_loongarch_cpucfg_set_attr(vcpu, attr);
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
 	long r;
+	struct kvm_device_attr attr;
 	void __user *argp = (void __user *)arg;
 	struct kvm_vcpu *vcpu = filp->private_data;
 
···
 		if (copy_from_user(&cap, argp, sizeof(cap)))
 			break;
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
+		break;
+	}
+	case KVM_HAS_DEVICE_ATTR: {
+		r = -EFAULT;
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			break;
+		r = kvm_loongarch_vcpu_has_attr(vcpu, &attr);
+		break;
+	}
+	case KVM_GET_DEVICE_ATTR: {
+		r = -EFAULT;
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			break;
+		r = kvm_loongarch_vcpu_get_attr(vcpu, &attr);
+		break;
+	}
+	case KVM_SET_DEVICE_ATTR: {
+		r = -EFAULT;
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			break;
+		r = kvm_loongarch_vcpu_set_attr(vcpu, &attr);
 		break;
 	}
 	default:
···
 	preempt_enable();
 }
 
+#ifdef CONFIG_CPU_HAS_LSX
+/* Enable LSX and restore context */
+int kvm_own_lsx(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_guest_has_fpu(&vcpu->arch) || !kvm_guest_has_lsx(&vcpu->arch))
+		return -EINVAL;
+
+	preempt_disable();
+
+	/* Enable LSX for guest */
+	set_csr_euen(CSR_EUEN_LSXEN | CSR_EUEN_FPEN);
+	switch (vcpu->arch.aux_inuse & KVM_LARCH_FPU) {
+	case KVM_LARCH_FPU:
+		/*
+		 * Guest FPU state already loaded,
+		 * only restore upper LSX state
+		 */
+		_restore_lsx_upper(&vcpu->arch.fpu);
+		break;
+	default:
+		/* Neither FP or LSX already active,
+		 * restore full LSX state
+		 */
+		kvm_restore_lsx(&vcpu->arch.fpu);
+		break;
+	}
+
+	trace_kvm_aux(vcpu, KVM_TRACE_AUX_RESTORE, KVM_TRACE_AUX_LSX);
+	vcpu->arch.aux_inuse |= KVM_LARCH_LSX | KVM_LARCH_FPU;
+	preempt_enable();
+
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_CPU_HAS_LASX
+/* Enable LASX and restore context */
+int kvm_own_lasx(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_guest_has_fpu(&vcpu->arch) || !kvm_guest_has_lsx(&vcpu->arch) || !kvm_guest_has_lasx(&vcpu->arch))
+		return -EINVAL;
+
+	preempt_disable();
+
+	set_csr_euen(CSR_EUEN_FPEN | CSR_EUEN_LSXEN | CSR_EUEN_LASXEN);
+	switch (vcpu->arch.aux_inuse & (KVM_LARCH_FPU | KVM_LARCH_LSX)) {
+	case KVM_LARCH_LSX:
+	case KVM_LARCH_LSX | KVM_LARCH_FPU:
+		/* Guest LSX state already loaded, only restore upper LASX state */
+		_restore_lasx_upper(&vcpu->arch.fpu);
+		break;
+	case KVM_LARCH_FPU:
+		/* Guest FP state already loaded, only restore upper LSX & LASX state */
+		_restore_lsx_upper(&vcpu->arch.fpu);
+		_restore_lasx_upper(&vcpu->arch.fpu);
+		break;
+	default:
+		/* Neither FP or LSX already active, restore full LASX state */
+		kvm_restore_lasx(&vcpu->arch.fpu);
+		break;
+	}
+
+	trace_kvm_aux(vcpu, KVM_TRACE_AUX_RESTORE, KVM_TRACE_AUX_LASX);
+	vcpu->arch.aux_inuse |= KVM_LARCH_LASX | KVM_LARCH_LSX | KVM_LARCH_FPU;
+	preempt_enable();
+
+	return 0;
+}
+#endif
+
 /* Save context and disable FPU */
 void kvm_lose_fpu(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
 
-	if (vcpu->arch.aux_inuse & KVM_LARCH_FPU) {
+	if (vcpu->arch.aux_inuse & KVM_LARCH_LASX) {
+		kvm_save_lasx(&vcpu->arch.fpu);
+		vcpu->arch.aux_inuse &= ~(KVM_LARCH_LSX | KVM_LARCH_FPU | KVM_LARCH_LASX);
+		trace_kvm_aux(vcpu, KVM_TRACE_AUX_SAVE, KVM_TRACE_AUX_LASX);
+
+		/* Disable LASX & LSX & FPU */
+		clear_csr_euen(CSR_EUEN_FPEN | CSR_EUEN_LSXEN | CSR_EUEN_LASXEN);
+	} else if (vcpu->arch.aux_inuse & KVM_LARCH_LSX) {
+		kvm_save_lsx(&vcpu->arch.fpu);
+		vcpu->arch.aux_inuse &= ~(KVM_LARCH_LSX | KVM_LARCH_FPU);
+		trace_kvm_aux(vcpu, KVM_TRACE_AUX_SAVE, KVM_TRACE_AUX_LSX);
+
+		/* Disable LSX & FPU */
+		clear_csr_euen(CSR_EUEN_FPEN | CSR_EUEN_LSXEN);
+	} else if (vcpu->arch.aux_inuse & KVM_LARCH_FPU) {
 		kvm_save_fpu(&vcpu->arch.fpu);
 		vcpu->arch.aux_inuse &= ~KVM_LARCH_FPU;
 		trace_kvm_aux(vcpu, KVM_TRACE_AUX_SAVE, KVM_TRACE_AUX_FPU);
···
 	unsigned long flags;
 
 	local_irq_save(flags);
-	if (vcpu->arch.last_sched_cpu != cpu) {
-		kvm_debug("[%d->%d]KVM vCPU[%d] switch\n",
-			  vcpu->arch.last_sched_cpu, cpu, vcpu->vcpu_id);
-		/*
-		 * Migrate the timer interrupt to the current CPU so that it
-		 * always interrupts the guest and synchronously triggers a
-		 * guest timer interrupt.
-		 */
-		kvm_migrate_count(vcpu);
-	}
-
 	/* Restore guest state to registers */
 	_kvm_vcpu_load(vcpu, cpu);
 	local_irq_restore(flags);
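The lazy FPU/LSX/LASX handoff added above is driven entirely by the aux_inuse bitmask: owning a wider vector unit implies owning every narrower one, and kvm_lose_fpu() saves only the widest active unit before clearing all the bits it covers. A minimal single-file model of that state machine (the flag values here are illustrative, not the kernel's actual KVM_LARCH_* encodings):

```c
#include <assert.h>

/* Illustrative stand-ins for the KVM_LARCH_* ownership flags */
#define LARCH_FPU  (1u << 0)
#define LARCH_LSX  (1u << 1)
#define LARCH_LASX (1u << 2)

/* Owning a wider unit implies owning all narrower ones, as in kvm_own_lasx() */
static unsigned int own_lsx(unsigned int aux)
{
	return aux | LARCH_LSX | LARCH_FPU;
}

static unsigned int own_lasx(unsigned int aux)
{
	return aux | LARCH_LASX | LARCH_LSX | LARCH_FPU;
}

/* kvm_lose_fpu() saves the widest active unit and drops every bit it covers */
static unsigned int lose_fpu(unsigned int aux)
{
	if (aux & LARCH_LASX)
		return aux & ~(LARCH_LASX | LARCH_LSX | LARCH_FPU);
	if (aux & LARCH_LSX)
		return aux & ~(LARCH_LSX | LARCH_FPU);
	return aux & ~LARCH_FPU;
}
```

Because a save of the widest unit also covers the narrower register files, one save path suffices per state, which is what the if/else-if chain in kvm_lose_fpu() encodes.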
arch/mips/include/asm/kvm_host.h (-2)
···
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
arch/mips/kvm/Kconfig (+2 -4)
···
 	depends on HAVE_KVM
 	depends on MIPS_FP_SUPPORT
 	select EXPORT_UASM
-	select PREEMPT_NOTIFIERS
+	select KVM_COMMON
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
-	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
-	select MMU_NOTIFIER
-	select INTERVAL_TREE
+	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help
 	  Support for hosting Guest kernels.
arch/powerpc/include/asm/kvm_host.h (-2)
···
 
 #include <linux/mmu_notifier.h>
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
 #define HPTEG_HASH_BITS_PTE_LONG	12
arch/powerpc/kvm/Kconfig (+5 -9)
···
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
-	select HAVE_KVM_EVENTFD
+	select KVM_COMMON
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_VFIO
 	select IRQ_BYPASS_MANAGER
 	select HAVE_KVM_IRQ_BYPASS
-	select INTERVAL_TREE
 
 config KVM_BOOK3S_HANDLER
 	bool
···
 config KVM_BOOK3S_PR_POSSIBLE
 	bool
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_BOOK3S_HV_POSSIBLE
 	bool
···
 	tristate "KVM for POWER7 and later using hypervisor mode in host"
 	depends on KVM_BOOK3S_64 && PPC_POWERNV
 	select KVM_BOOK3S_HV_POSSIBLE
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select CMA
 	help
 	  Support running unmodified book3s_64 guest kernels in
···
 	depends on !CONTEXT_TRACKING_USER
 	select KVM
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500 guest kernels in virtual machines on
 	  E500v2 host processors.
···
 	select KVM
 	select KVM_MMIO
 	select KVM_BOOKE_HV
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500MC/E5500/E6500 guest kernels in
 	  virtual machines on E500MC/E5500/E6500 host processors.
···
 	bool "KVM in-kernel MPIC emulation"
 	depends on KVM && PPC_E500
 	select HAVE_KVM_IRQCHIP
-	select HAVE_KVM_IRQFD
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_MSI
 	help
···
 	bool "KVM in-kernel XICS emulation"
 	depends on KVM_BOOK3S_64 && !KVM_MPIC
 	select HAVE_KVM_IRQCHIP
-	select HAVE_KVM_IRQFD
 	default y
 	help
 	  Include support for the XICS (eXternal Interrupt Controller
arch/powerpc/kvm/book3s_hv.c (+1 -1)
···
 	}
 
 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct kvm_memory_slot *memslot;
 		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
 		int bkt;
arch/powerpc/kvm/powerpc.c (+2 -8)
···
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
-	case KVM_CAP_DEVICE_CTRL:
 	case KVM_CAP_IMMEDIATE_EXIT:
 	case KVM_CAP_SET_GUEST_DEBUG:
 		r = 1;
···
 		break;
 #endif
 
-#ifdef CONFIG_HAVE_KVM_IRQFD
+#ifdef CONFIG_HAVE_KVM_IRQCHIP
 	case KVM_CAP_IRQFD_RESAMPLE:
 		r = !xive_enabled();
 		break;
···
 		break;
 #endif
 	case KVM_CAP_SYNC_MMU:
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-		r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+		BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_GENERIC_MMU_NOTIFIER));
 		r = 1;
-#else
-		r = 0;
-#endif
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	case KVM_CAP_PPC_HTAB_FD:
arch/riscv/Kconfig (+19)
···
 
 	  If you want to execute 32-bit userspace applications, say Y.
 
+config PARAVIRT
+	bool "Enable paravirtualization code"
+	depends on RISCV_SBI
+	help
+	  This changes the kernel so it can modify itself when it is run
+	  under a hypervisor, potentially improving performance significantly
+	  over full virtualization.
+
+config PARAVIRT_TIME_ACCOUNTING
+	bool "Paravirtual steal time accounting"
+	depends on PARAVIRT
+	help
+	  Select this option to enable fine granularity task steal time
+	  accounting. Time spent executing other tasks in parallel with
+	  the current vCPU is discounted from the vCPU power. To account for
+	  that, there can be a small performance impact.
+
+	  If in doubt, say N here.
+
 config RELOCATABLE
 	bool "Build a relocatable kernel"
 	depends on MMU && 64BIT && !XIP_KERNEL
arch/riscv/include/asm/kvm_host.h (+10 -2)
···
 	KVM_ARCH_REQ_FLAGS(4, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HFENCE			\
 	KVM_ARCH_REQ_FLAGS(5, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_STEAL_UPDATE		KVM_ARCH_REQ(6)
 
 enum kvm_riscv_hfence_type {
 	KVM_RISCV_HFENCE_UNKNOWN = 0,
···
 
 	/* 'static' configurations which are set only once */
 	struct kvm_vcpu_config cfg;
+
+	/* SBI steal-time accounting */
+	struct {
+		gpa_t shmem;
+		u64 last_steal;
+	} sta;
 };
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
-
-#define KVM_ARCH_WANT_MMU_NOTIFIER
 
 #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER		12
···
 bool kvm_riscv_vcpu_has_interrupts(struct kvm_vcpu *vcpu, u64 mask);
 void kvm_riscv_vcpu_power_off(struct kvm_vcpu *vcpu);
 void kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu);
+
+void kvm_riscv_vcpu_sbi_sta_reset(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_record_steal_time(struct kvm_vcpu *vcpu);
 
 #endif /* __RISCV_KVM_HOST_H__ */
arch/riscv/include/asm/kvm_vcpu_sbi.h (+16 -4)
···
 #define KVM_SBI_VERSION_MINOR 0
 
 enum kvm_riscv_sbi_ext_status {
-	KVM_RISCV_SBI_EXT_UNINITIALIZED,
-	KVM_RISCV_SBI_EXT_AVAILABLE,
-	KVM_RISCV_SBI_EXT_UNAVAILABLE,
+	KVM_RISCV_SBI_EXT_STATUS_UNINITIALIZED,
+	KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE,
+	KVM_RISCV_SBI_EXT_STATUS_ENABLED,
+	KVM_RISCV_SBI_EXT_STATUS_DISABLED,
 };
 
 struct kvm_vcpu_sbi_context {
···
 	unsigned long extid_start;
 	unsigned long extid_end;
 
-	bool default_unavail;
+	bool default_disabled;
 
 	/**
 	 * SBI extension handler. It can be defined for a given extension or group of
···
 			   const struct kvm_one_reg *reg);
 int kvm_riscv_vcpu_get_reg_sbi_ext(struct kvm_vcpu *vcpu,
 				   const struct kvm_one_reg *reg);
+int kvm_riscv_vcpu_set_reg_sbi(struct kvm_vcpu *vcpu,
+			       const struct kvm_one_reg *reg);
+int kvm_riscv_vcpu_get_reg_sbi(struct kvm_vcpu *vcpu,
+			       const struct kvm_one_reg *reg);
 const struct kvm_vcpu_sbi_extension *kvm_vcpu_sbi_find_ext(
 				struct kvm_vcpu *vcpu, unsigned long extid);
+bool riscv_vcpu_supports_sbi_ext(struct kvm_vcpu *vcpu, int idx);
 int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run);
 void kvm_riscv_vcpu_sbi_init(struct kvm_vcpu *vcpu);
+
+int kvm_riscv_vcpu_get_reg_sbi_sta(struct kvm_vcpu *vcpu, unsigned long reg_num,
+				   unsigned long *reg_val);
+int kvm_riscv_vcpu_set_reg_sbi_sta(struct kvm_vcpu *vcpu, unsigned long reg_num,
+				   unsigned long reg_val);
 
 #ifdef CONFIG_RISCV_SBI_V01
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_v01;
···
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_srst;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_hsm;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_dbcn;
+extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_sta;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;
 
arch/riscv/include/asm/paravirt.h (new, +28)
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RISCV_PARAVIRT_H
+#define _ASM_RISCV_PARAVIRT_H
+
+#ifdef CONFIG_PARAVIRT
+#include <linux/static_call_types.h>
+
+struct static_key;
+extern struct static_key paravirt_steal_enabled;
+extern struct static_key paravirt_steal_rq_enabled;
+
+u64 dummy_steal_clock(int cpu);
+
+DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+	return static_call(pv_steal_clock)(cpu);
+}
+
+int __init pv_time_init(void);
+
+#else
+
+#define pv_time_init() do {} while (0)
+
+#endif /* CONFIG_PARAVIRT */
+#endif /* _ASM_RISCV_PARAVIRT_H */
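paravirt_steal_clock() above dispatches through a static call that boots pointing at a native stub and is retargeted once at init. The same shape can be modeled with an ordinary function pointer (the kernel's static_call patches the call site instead, avoiding the indirection); sbi_steal_clock here is a made-up stand-in provider, not a real kernel symbol:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*steal_clock_fn)(int cpu);

/* Default: bare metal reports no steal time, matching native_steal_clock() */
static uint64_t native_steal_clock(int cpu)
{
	(void)cpu;
	return 0;
}

/* Hypothetical provider standing in for the SBI STA-backed implementation */
static uint64_t sbi_steal_clock(int cpu)
{
	return 1000 + (uint64_t)cpu;
}

/* Models the static call target; static_call_update() would repoint this */
static steal_clock_fn pv_steal_clock = native_steal_clock;

static uint64_t paravirt_steal_clock(int cpu)
{
	return pv_steal_clock(cpu);
}
```

The design point of static_call over a function pointer is that the retarget happens once at boot, so the kernel can patch the branch directly and callers pay no load-plus-indirect-call cost on every tick.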
arch/riscv/include/asm/paravirt_api_clock.h (new, +1)
+#include <asm/paravirt.h>
arch/riscv/include/asm/sbi.h (+17)
···
 	SBI_EXT_SRST = 0x53525354,
 	SBI_EXT_PMU = 0x504D55,
 	SBI_EXT_DBCN = 0x4442434E,
+	SBI_EXT_STA = 0x535441,
 
 	/* Experimentals extensions must lie within this range */
 	SBI_EXT_EXPERIMENTAL_START = 0x08000000,
···
 	SBI_EXT_DBCN_CONSOLE_WRITE_BYTE = 2,
 };
 
+/* SBI STA (steal-time accounting) extension */
+enum sbi_ext_sta_fid {
+	SBI_EXT_STA_STEAL_TIME_SET_SHMEM = 0,
+};
+
+struct sbi_sta_struct {
+	__le32 sequence;
+	__le32 flags;
+	__le64 steal;
+	u8 preempted;
+	u8 pad[47];
+} __packed;
+
+#define SBI_STA_SHMEM_DISABLE		-1
+
+/* SBI spec version fields */
 #define SBI_SPEC_VERSION_DEFAULT	0x1
 #define SBI_SPEC_VERSION_MAJOR_SHIFT	24
 #define SBI_SPEC_VERSION_MAJOR_MASK	0x7f
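The sbi_sta_struct layout added above is a shared-memory ABI between hypervisor and guest: exactly 64 bytes, little-endian fields at fixed offsets. A userspace mirror can sanity-check the layout (stdint types substituted for the kernel's __le32/__le64; endianness handling is omitted here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the 64-byte SBI STA shared-memory record */
struct sbi_sta_shmem {
	uint32_t sequence;   /* bumped odd/even around each update */
	uint32_t flags;
	uint64_t steal;      /* accumulated steal time, nanoseconds */
	uint8_t  preempted;
	uint8_t  pad[47];    /* pads the record out to 64 bytes */
} __attribute__((packed));
```

The 64-byte size and 64-byte alignment requirement together guarantee the record never straddles a page boundary, which is what lets kvm_riscv_vcpu_record_steal_time() resolve a single hva for all field accesses.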
arch/riscv/include/uapi/asm/kvm.h (+13)
···
 	KVM_RISCV_SBI_EXT_EXPERIMENTAL,
 	KVM_RISCV_SBI_EXT_VENDOR,
 	KVM_RISCV_SBI_EXT_DBCN,
+	KVM_RISCV_SBI_EXT_STA,
 	KVM_RISCV_SBI_EXT_MAX,
+};
+
+/* SBI STA extension registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
+struct kvm_riscv_sbi_sta {
+	unsigned long shmem_lo;
+	unsigned long shmem_hi;
 };
 
 /* Possible states for kvm_riscv_timer */
···
 	(offsetof(struct __riscv_v_ext_state, name) / sizeof(unsigned long))
 #define KVM_REG_RISCV_VECTOR_REG(n)	\
 	((n) + sizeof(struct __riscv_v_ext_state) / sizeof(unsigned long))
+
+/* Registers for specific SBI extensions are mapped as type 10 */
+#define KVM_REG_RISCV_SBI_STATE		(0x0a << KVM_REG_RISCV_TYPE_SHIFT)
+#define KVM_REG_RISCV_SBI_STA		(0x0 << KVM_REG_RISCV_SUBTYPE_SHIFT)
+#define KVM_REG_RISCV_SBI_STA_REG(name)		\
+	(offsetof(struct kvm_riscv_sbi_sta, name) / sizeof(unsigned long))
 
 /* Device Control API: RISC-V AIA */
 #define KVM_DEV_RISCV_APLIC_ALIGN		0x1000
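The new KVM_REG_RISCV_SBI_STATE type shows how a RISC-V ONE_REG id is composed from a size field, a type, a subtype, and a word-offset register index. A hedged sketch of that composition; the shift and size constants below mirror my reading of the uapi header, so real code should take KVM_REG_RISCV and friends from <asm/kvm.h> rather than these illustrative values:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; authoritative values live in the uapi headers */
#define REG_SIZE_SHIFT      52
#define REG_SIZE_U64        (3ULL << REG_SIZE_SHIFT)
#define RISCV_TYPE_SHIFT    24
#define RISCV_SUBTYPE_SHIFT 16
#define RISCV_SBI_STATE     (0x0aULL << RISCV_TYPE_SHIFT)   /* type 10 */
#define RISCV_SBI_STA       (0x0ULL << RISCV_SUBTYPE_SHIFT) /* subtype 0 */

/* Userspace mirror of struct kvm_riscv_sbi_sta on a 64-bit host */
struct sbi_sta_regs {
	uint64_t shmem_lo;
	uint64_t shmem_hi;
};

/* Register index = field offset in machine words, like KVM_REG_RISCV_SBI_STA_REG() */
static uint64_t sta_reg_id(size_t field_offset)
{
	return REG_SIZE_U64 | RISCV_SBI_STATE | RISCV_SBI_STA |
	       (uint64_t)(field_offset / sizeof(uint64_t));
}
```

Deriving the index from offsetof keeps the register numbering in lockstep with the struct layout, so adding a field to the uapi struct automatically yields the next register id.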
arch/riscv/kernel/Makefile (+1)
···
 obj-$(CONFIG_SMP)		+= cpu_ops_sbi.o
 endif
 obj-$(CONFIG_HOTPLUG_CPU)	+= cpu-hotplug.o
+obj-$(CONFIG_PARAVIRT)		+= paravirt.o
 obj-$(CONFIG_KGDB)		+= kgdb.o
 obj-$(CONFIG_KEXEC_CORE)	+= kexec_relocate.o crash_save_regs.o machine_kexec.o
 obj-$(CONFIG_KEXEC_FILE)	+= elf_kexec.o machine_kexec_file.o
arch/riscv/kernel/paravirt.c (new, +135)
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023 Ventana Micro Systems Inc.
+ */
+
+#define pr_fmt(fmt) "riscv-pv: " fmt
+
+#include <linux/cpuhotplug.h>
+#include <linux/compiler.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/jump_label.h>
+#include <linux/kconfig.h>
+#include <linux/kernel.h>
+#include <linux/percpu-defs.h>
+#include <linux/printk.h>
+#include <linux/static_call.h>
+#include <linux/types.h>
+
+#include <asm/barrier.h>
+#include <asm/page.h>
+#include <asm/paravirt.h>
+#include <asm/sbi.h>
+
+struct static_key paravirt_steal_enabled;
+struct static_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+	return 0;
+}
+
+DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+
+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+	steal_acc = false;
+	return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
+DEFINE_PER_CPU(struct sbi_sta_struct, steal_time) __aligned(64);
+
+static bool __init has_pv_steal_clock(void)
+{
+	if (sbi_spec_version >= sbi_mk_version(2, 0) &&
+	    sbi_probe_extension(SBI_EXT_STA) > 0) {
+		pr_info("SBI STA extension detected\n");
+		return true;
+	}
+
+	return false;
+}
+
+static int sbi_sta_steal_time_set_shmem(unsigned long lo, unsigned long hi,
+					unsigned long flags)
+{
+	struct sbiret ret;
+
+	ret = sbi_ecall(SBI_EXT_STA, SBI_EXT_STA_STEAL_TIME_SET_SHMEM,
+			lo, hi, flags, 0, 0, 0);
+	if (ret.error) {
+		if (lo == SBI_STA_SHMEM_DISABLE && hi == SBI_STA_SHMEM_DISABLE)
+			pr_warn("Failed to disable steal-time shmem");
+		else
+			pr_warn("Failed to set steal-time shmem");
+		return sbi_err_map_linux_errno(ret.error);
+	}
+
+	return 0;
+}
+
+static int pv_time_cpu_online(unsigned int cpu)
+{
+	struct sbi_sta_struct *st = this_cpu_ptr(&steal_time);
+	phys_addr_t pa = __pa(st);
+	unsigned long lo = (unsigned long)pa;
+	unsigned long hi = IS_ENABLED(CONFIG_32BIT) ? upper_32_bits((u64)pa) : 0;
+
+	return sbi_sta_steal_time_set_shmem(lo, hi, 0);
+}
+
+static int pv_time_cpu_down_prepare(unsigned int cpu)
+{
+	return sbi_sta_steal_time_set_shmem(SBI_STA_SHMEM_DISABLE,
+					    SBI_STA_SHMEM_DISABLE, 0);
+}
+
+static u64 pv_time_steal_clock(int cpu)
+{
+	struct sbi_sta_struct *st = per_cpu_ptr(&steal_time, cpu);
+	u32 sequence;
+	u64 steal;
+
+	/*
+	 * Check the sequence field before and after reading the steal
+	 * field. Repeat the read if it is different or odd.
+	 */
+	do {
+		sequence = READ_ONCE(st->sequence);
+		virt_rmb();
+		steal = READ_ONCE(st->steal);
+		virt_rmb();
+	} while ((le32_to_cpu(sequence) & 1) ||
+		 sequence != READ_ONCE(st->sequence));
+
+	return le64_to_cpu(steal);
+}
+
+int __init pv_time_init(void)
+{
+	int ret;
+
+	if (!has_pv_steal_clock())
+		return 0;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+				"riscv/pv_time:online",
+				pv_time_cpu_online,
+				pv_time_cpu_down_prepare);
+	if (ret < 0)
+		return ret;
+
+	static_call_update(pv_steal_clock, pv_time_steal_clock);
+
+	static_key_slow_inc(&paravirt_steal_enabled);
+	if (steal_acc)
+		static_key_slow_inc(&paravirt_steal_rq_enabled);
+
+	pr_info("Computing paravirt steal-time\n");
+
+	return 0;
+}
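pv_time_steal_clock() above relies on the STA sequence protocol: the hypervisor bumps sequence to an odd value before updating steal and back to an even value afterwards, so a reader retries any snapshot that was torn mid-update. A single-threaded sketch of that reader/writer pairing (the real code additionally needs virt_rmb() barriers and READ_ONCE(), which a single-threaded model can safely omit):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified shared record: just the two fields the protocol covers */
struct sta {
	uint32_t sequence;
	uint64_t steal;
};

/* Writer side (hypervisor role): odd sequence marks an update in progress */
static void writer_add_steal(struct sta *st, uint64_t delta)
{
	st->sequence++;		/* odd: update in progress */
	st->steal += delta;
	st->sequence++;		/* even: update complete */
}

/* Reader side (guest role): retry while the snapshot is odd or changed */
static uint64_t read_steal(const struct sta *st)
{
	uint32_t seq;
	uint64_t steal;

	do {
		seq = st->sequence;
		steal = st->steal;
	} while ((seq & 1) || seq != st->sequence);

	return steal;
}
```

This is the classic seqcount pattern: the writer never blocks, and the reader's only cost in the common case is two loads of the sequence word.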
arch/riscv/kernel/time.c (+3)
···
 #include <asm/sbi.h>
 #include <asm/processor.h>
 #include <asm/timex.h>
+#include <asm/paravirt.h>
 
 unsigned long riscv_timebase __ro_after_init;
 EXPORT_SYMBOL_GPL(riscv_timebase);
···
 	timer_probe();
 
 	tick_setup_hrtimer_broadcast();
+
+	pv_time_init();
 }
arch/riscv/kvm/Kconfig (+3 -4)
···
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support (EXPERIMENTAL)"
 	depends on RISCV_SBI && MMU
-	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQCHIP
-	select HAVE_KVM_IRQFD
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_MSI
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
+	select KVM_COMMON
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_MMIO
 	select KVM_XFER_TO_GUEST_WORK
-	select MMU_NOTIFIER
-	select PREEMPT_NOTIFIERS
+	select KVM_GENERIC_MMU_NOTIFIER
+	select SCHED_INFO
 	help
 	  Support hosting virtualized guest machines.
 
arch/riscv/kvm/Makefile (+1)
···
 kvm-y += vcpu_sbi_base.o
 kvm-y += vcpu_sbi_replace.o
 kvm-y += vcpu_sbi_hsm.o
+kvm-y += vcpu_sbi_sta.o
 kvm-y += vcpu_timer.o
 kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_pmu.o vcpu_sbi_pmu.o
 kvm-y += aia.o
arch/riscv/kvm/vcpu.c (+8 -2)
···
 	vcpu->arch.hfence_tail = 0;
 	memset(vcpu->arch.hfence_queue, 0, sizeof(vcpu->arch.hfence_queue));
 
+	kvm_riscv_vcpu_sbi_sta_reset(vcpu);
+
 	/* Reset the guest CSRs for hotplug usecase */
 	if (loaded)
 		kvm_arch_vcpu_load(vcpu, smp_processor_id());
···
 
 	kvm_riscv_vcpu_aia_load(vcpu, cpu);
 
+	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+
 	vcpu->cpu = cpu;
 }
···
 
 		if (kvm_check_request(KVM_REQ_HFENCE, vcpu))
 			kvm_riscv_hfence_process(vcpu);
+
+		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
+			kvm_riscv_vcpu_record_steal_time(vcpu);
 	}
 }
···
 		/* Update HVIP CSR for current CPU */
 		kvm_riscv_update_hvip(vcpu);
 
-		if (ret <= 0 ||
-		    kvm_riscv_gstage_vmid_ver_changed(&vcpu->kvm->arch.vmid) ||
+		if (kvm_riscv_gstage_vmid_ver_changed(&vcpu->kvm->arch.vmid) ||
 		    kvm_request_pending(vcpu) ||
 		    xfer_to_guest_mode_work_pending()) {
 			vcpu->mode = OUTSIDE_GUEST_MODE;
arch/riscv/kvm/vcpu_onereg.c (+111 -38)
···
 		if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN))
 			rc = kvm_riscv_vcpu_smstateen_set_csr(vcpu, reg_num,
 							      reg_val);
-	break;
+		break;
 	default:
 		rc = -ENOENT;
 		break;
···
 	return copy_isa_ext_reg_indices(vcpu, NULL);;
 }
 
-static inline unsigned long num_sbi_ext_regs(void)
+static int copy_sbi_ext_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
 {
-	/*
-	 * number of KVM_REG_RISCV_SBI_SINGLE +
-	 * 2 x (number of KVM_REG_RISCV_SBI_MULTI)
-	 */
-	return KVM_RISCV_SBI_EXT_MAX + 2*(KVM_REG_RISCV_SBI_MULTI_REG_LAST+1);
-}
+	unsigned int n = 0;
 
-static int copy_sbi_ext_reg_indices(u64 __user *uindices)
-{
-	int n;
-
-	/* copy KVM_REG_RISCV_SBI_SINGLE */
-	n = KVM_RISCV_SBI_EXT_MAX;
-	for (int i = 0; i < n; i++) {
+	for (int i = 0; i < KVM_RISCV_SBI_EXT_MAX; i++) {
 		u64 size = IS_ENABLED(CONFIG_32BIT) ?
 			   KVM_REG_SIZE_U32 : KVM_REG_SIZE_U64;
 		u64 reg = KVM_REG_RISCV | size | KVM_REG_RISCV_SBI_EXT |
 			  KVM_REG_RISCV_SBI_SINGLE | i;
 
+		if (!riscv_vcpu_supports_sbi_ext(vcpu, i))
+			continue;
+
+		if (uindices) {
+			if (put_user(reg, uindices))
+				return -EFAULT;
+			uindices++;
+		}
+
+		n++;
+	}
+
+	return n;
+}
+
+static unsigned long num_sbi_ext_regs(struct kvm_vcpu *vcpu)
+{
+	return copy_sbi_ext_reg_indices(vcpu, NULL);
+}
+
+static int copy_sbi_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
+{
+	struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context;
+	int total = 0;
+
+	if (scontext->ext_status[KVM_RISCV_SBI_EXT_STA] == KVM_RISCV_SBI_EXT_STATUS_ENABLED) {
+		u64 size = IS_ENABLED(CONFIG_32BIT) ? KVM_REG_SIZE_U32 : KVM_REG_SIZE_U64;
+		int n = sizeof(struct kvm_riscv_sbi_sta) / sizeof(unsigned long);
+
+		for (int i = 0; i < n; i++) {
+			u64 reg = KVM_REG_RISCV | size |
+				  KVM_REG_RISCV_SBI_STATE |
+				  KVM_REG_RISCV_SBI_STA | i;
+
+			if (uindices) {
+				if (put_user(reg, uindices))
+					return -EFAULT;
+				uindices++;
+			}
+		}
+
+		total += n;
+	}
+
+	return total;
+}
+
+static inline unsigned long num_sbi_regs(struct kvm_vcpu *vcpu)
+{
+	return copy_sbi_reg_indices(vcpu, NULL);
+}
+
+static inline unsigned long num_vector_regs(const struct kvm_vcpu *vcpu)
+{
+	if (!riscv_isa_extension_available(vcpu->arch.isa, v))
+		return 0;
+
+	/* vstart, vl, vtype, vcsr, vlenb and 32 vector regs */
+	return 37;
+}
+
+static int copy_vector_reg_indices(const struct kvm_vcpu *vcpu,
+				   u64 __user *uindices)
+{
+	const struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+	int n = num_vector_regs(vcpu);
+	u64 reg, size;
+	int i;
+
+	if (n == 0)
+		return 0;
+
+	/* copy vstart, vl, vtype, vcsr and vlenb */
+	size = IS_ENABLED(CONFIG_32BIT) ? KVM_REG_SIZE_U32 : KVM_REG_SIZE_U64;
+	for (i = 0; i < 5; i++) {
+		reg = KVM_REG_RISCV | size | KVM_REG_RISCV_VECTOR | i;
+
 		if (uindices) {
 			if (put_user(reg, uindices))
 				return -EFAULT;
···
 		}
 	}
 
-	/* copy KVM_REG_RISCV_SBI_MULTI */
-	n = KVM_REG_RISCV_SBI_MULTI_REG_LAST + 1;
-	for (int i = 0; i < n; i++) {
-		u64 size = IS_ENABLED(CONFIG_32BIT) ?
-			   KVM_REG_SIZE_U32 : KVM_REG_SIZE_U64;
-		u64 reg = KVM_REG_RISCV | size | KVM_REG_RISCV_SBI_EXT |
-			  KVM_REG_RISCV_SBI_MULTI_EN | i;
-
-		if (uindices) {
-			if (put_user(reg, uindices))
-				return -EFAULT;
-			uindices++;
-		}
-
-		reg = KVM_REG_RISCV | size | KVM_REG_RISCV_SBI_EXT |
-		      KVM_REG_RISCV_SBI_MULTI_DIS | i;
+	/* vector_regs have a variable 'vlenb' size */
+	size = __builtin_ctzl(cntx->vector.vlenb);
+	size <<= KVM_REG_SIZE_SHIFT;
+	for (i = 0; i < 32; i++) {
+		reg = KVM_REG_RISCV | KVM_REG_RISCV_VECTOR | size |
+		      KVM_REG_RISCV_VECTOR_REG(i);
 
 		if (uindices) {
 			if (put_user(reg, uindices))
···
 		}
 	}
 
-	return num_sbi_ext_regs();
+	return n;
 }
 
 /*
···
 	res += num_timer_regs();
 	res += num_fp_f_regs(vcpu);
 	res += num_fp_d_regs(vcpu);
+	res += num_vector_regs(vcpu);
 	res += num_isa_ext_regs(vcpu);
-	res += num_sbi_ext_regs();
+	res += num_sbi_ext_regs(vcpu);
+	res += num_sbi_regs(vcpu);
 
 	return res;
 }
···
 		return ret;
 	uindices += ret;
 
+	ret = copy_vector_reg_indices(vcpu, uindices);
+	if (ret < 0)
+		return ret;
+	uindices += ret;
+
 	ret = copy_isa_ext_reg_indices(vcpu, uindices);
 	if (ret < 0)
 		return ret;
 	uindices += ret;
 
-	ret = copy_sbi_ext_reg_indices(uindices);
+	ret = copy_sbi_ext_reg_indices(vcpu, uindices);
 	if (ret < 0)
 		return ret;
+	uindices += ret;
+
+	ret = copy_sbi_reg_indices(vcpu, uindices);
+	if (ret < 0)
+		return ret;
+	uindices += ret;
 
 	return 0;
 }
···
 	case KVM_REG_RISCV_FP_D:
 		return kvm_riscv_vcpu_set_reg_fp(vcpu, reg,
 						 KVM_REG_RISCV_FP_D);
+	case KVM_REG_RISCV_VECTOR:
+		return kvm_riscv_vcpu_set_reg_vector(vcpu, reg);
 	case KVM_REG_RISCV_ISA_EXT:
 		return kvm_riscv_vcpu_set_reg_isa_ext(vcpu, reg);
 	case KVM_REG_RISCV_SBI_EXT:
 		return kvm_riscv_vcpu_set_reg_sbi_ext(vcpu, reg);
-	case KVM_REG_RISCV_VECTOR:
-		return kvm_riscv_vcpu_set_reg_vector(vcpu, reg);
+	case KVM_REG_RISCV_SBI_STATE:
+		return kvm_riscv_vcpu_set_reg_sbi(vcpu, reg);
 	default:
 		break;
 	}
···
 	case KVM_REG_RISCV_FP_D:
 		return kvm_riscv_vcpu_get_reg_fp(vcpu, reg,
 						 KVM_REG_RISCV_FP_D);
+	case KVM_REG_RISCV_VECTOR:
+		return kvm_riscv_vcpu_get_reg_vector(vcpu, reg);
 	case KVM_REG_RISCV_ISA_EXT:
 		return kvm_riscv_vcpu_get_reg_isa_ext(vcpu, reg);
 	case KVM_REG_RISCV_SBI_EXT:
 		return kvm_riscv_vcpu_get_reg_sbi_ext(vcpu, reg);
-	case KVM_REG_RISCV_VECTOR:
-		return kvm_riscv_vcpu_get_reg_vector(vcpu, reg);
+	case KVM_REG_RISCV_SBI_STATE:
+		return kvm_riscv_vcpu_get_reg_sbi(vcpu, reg);
 	default:
 		break;
 	}
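The rewritten copy_sbi_ext_reg_indices()/num_sbi_ext_regs() pair above uses a count-or-copy idiom: a single walk either counts the register ids it would emit (when the destination is NULL) or writes them out, so the count and the copy can never disagree. A stand-alone sketch of the idiom, with a plain array in place of the vCPU state:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count-or-copy: returns how many ids were (or would be) emitted.
 * A zero entry models an unsupported register slot that is skipped,
 * like the !riscv_vcpu_supports_sbi_ext() continue in the kernel code.
 */
static int copy_reg_indices(const uint64_t *regs, int navail,
			    uint64_t *uindices)
{
	int n = 0;

	for (int i = 0; i < navail; i++) {
		if (regs[i] == 0)
			continue;	/* skip unsupported slots */
		if (uindices)
			*uindices++ = regs[i];
		n++;
	}

	return n;
}
```

Callers first invoke the function with a NULL destination to size the buffer, then again to fill it, which is exactly how kvm_riscv_vcpu_num_regs() and kvm_riscv_vcpu_copy_reg_indices() share these helpers.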
arch/riscv/kvm/vcpu_sbi.c (+110 -32)
···
 		.ext_ptr = &vcpu_sbi_ext_dbcn,
 	},
 	{
+		.ext_idx = KVM_RISCV_SBI_EXT_STA,
+		.ext_ptr = &vcpu_sbi_ext_sta,
+	},
+	{
 		.ext_idx = KVM_RISCV_SBI_EXT_EXPERIMENTAL,
 		.ext_ptr = &vcpu_sbi_ext_experimental,
 	},
···
 		.ext_ptr = &vcpu_sbi_ext_vendor,
 	},
 };
+
+static const struct kvm_riscv_sbi_extension_entry *
+riscv_vcpu_get_sbi_ext(struct kvm_vcpu *vcpu, unsigned long idx)
+{
+	const struct kvm_riscv_sbi_extension_entry *sext = NULL;
+
+	if (idx >= KVM_RISCV_SBI_EXT_MAX)
+		return NULL;
+
+	for (int i = 0; i < ARRAY_SIZE(sbi_ext); i++) {
+		if (sbi_ext[i].ext_idx == idx) {
+			sext = &sbi_ext[i];
+			break;
+		}
+	}
+
+	return sext;
+}
+
+bool riscv_vcpu_supports_sbi_ext(struct kvm_vcpu *vcpu, int idx)
+{
+	struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context;
+	const struct kvm_riscv_sbi_extension_entry *sext;
+
+	sext = riscv_vcpu_get_sbi_ext(vcpu, idx);
+
+	return sext && scontext->ext_status[sext->ext_idx] != KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE;
+}
 
 void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
···
 				 unsigned long reg_num,
 				 unsigned long reg_val)
 {
-	unsigned long i;
-	const struct kvm_riscv_sbi_extension_entry *sext = NULL;
 	struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context;
-
-	if (reg_num >= KVM_RISCV_SBI_EXT_MAX)
-		return -ENOENT;
+	const struct kvm_riscv_sbi_extension_entry *sext;
 
 	if (reg_val != 1 && reg_val != 0)
 		return -EINVAL;
 
-	for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) {
-		if (sbi_ext[i].ext_idx == reg_num) {
-			sext = &sbi_ext[i];
-			break;
-		}
-	}
-	if (!sext)
+	sext = riscv_vcpu_get_sbi_ext(vcpu, reg_num);
+	if (!sext || scontext->ext_status[sext->ext_idx] == KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE)
 		return -ENOENT;
 
 	scontext->ext_status[sext->ext_idx] = (reg_val) ?
-			KVM_RISCV_SBI_EXT_AVAILABLE :
-			KVM_RISCV_SBI_EXT_UNAVAILABLE;
+			KVM_RISCV_SBI_EXT_STATUS_ENABLED :
+			KVM_RISCV_SBI_EXT_STATUS_DISABLED;
 
 	return 0;
 }
···
 				 unsigned long reg_num,
 				 unsigned long *reg_val)
 {
-	unsigned long i;
-	const struct kvm_riscv_sbi_extension_entry *sext = NULL;
 	struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context;
+	const struct kvm_riscv_sbi_extension_entry *sext;
 
-	if (reg_num >= KVM_RISCV_SBI_EXT_MAX)
-		return -ENOENT;
-
-	for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) {
-		if (sbi_ext[i].ext_idx == reg_num) {
-			sext = &sbi_ext[i];
-			break;
-		}
-	}
-	if (!sext)
+	sext = riscv_vcpu_get_sbi_ext(vcpu, reg_num);
+	if (!sext || scontext->ext_status[sext->ext_idx] == KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE)
 		return -ENOENT;
 
 	*reg_val = scontext->ext_status[sext->ext_idx] ==
-				KVM_RISCV_SBI_EXT_AVAILABLE;
+				KVM_RISCV_SBI_EXT_STATUS_ENABLED;
+
 	return 0;
 }
···
 	return 0;
 }
 
+int kvm_riscv_vcpu_set_reg_sbi(struct kvm_vcpu *vcpu,
+			       const struct kvm_one_reg *reg)
+{
+	unsigned long __user *uaddr =
+			(unsigned long __user *)(unsigned long)reg->addr;
+	unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+					    KVM_REG_SIZE_MASK |
+					    KVM_REG_RISCV_SBI_STATE);
+	unsigned long reg_subtype, reg_val;
+
+	if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+		return -EINVAL;
+
+	if (copy_from_user(&reg_val, uaddr, KVM_REG_SIZE(reg->id)))
+		return -EFAULT;
+
+	reg_subtype = reg_num & KVM_REG_RISCV_SUBTYPE_MASK;
+	reg_num &= ~KVM_REG_RISCV_SUBTYPE_MASK;
+
+	switch (reg_subtype) {
+	case KVM_REG_RISCV_SBI_STA:
+		return kvm_riscv_vcpu_set_reg_sbi_sta(vcpu, reg_num, reg_val);
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int kvm_riscv_vcpu_get_reg_sbi(struct kvm_vcpu *vcpu,
+			       const struct kvm_one_reg *reg)
+{
+	unsigned long __user *uaddr =
+			(unsigned long __user *)(unsigned long)reg->addr;
+	unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK |
+					    KVM_REG_SIZE_MASK |
+					    KVM_REG_RISCV_SBI_STATE);
+	unsigned long reg_subtype, reg_val;
+	int ret;
+
+	if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long))
+		return -EINVAL;
+
+	reg_subtype = reg_num & KVM_REG_RISCV_SUBTYPE_MASK;
+	reg_num &= ~KVM_REG_RISCV_SUBTYPE_MASK;
+
+	switch (reg_subtype) {
+	case KVM_REG_RISCV_SBI_STA:
+		ret = kvm_riscv_vcpu_get_reg_sbi_sta(vcpu, reg_num, &reg_val);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (ret)
+		return ret;
+
+	if (copy_to_user(uaddr, &reg_val, KVM_REG_SIZE(reg->id)))
+		return -EFAULT;
+
+	return 0;
+}
+
 const struct kvm_vcpu_sbi_extension *kvm_vcpu_sbi_find_ext(
 				struct kvm_vcpu *vcpu, unsigned long extid)
 {
···
 		if (ext->extid_start <= extid && ext->extid_end >= extid) {
 			if (entry->ext_idx >= KVM_RISCV_SBI_EXT_MAX ||
 			    scontext->ext_status[entry->ext_idx] ==
-						KVM_RISCV_SBI_EXT_AVAILABLE)
+						KVM_RISCV_SBI_EXT_STATUS_ENABLED)
 				return ext;
 
 			return NULL;
···
 
 		if (ext->probe && !ext->probe(vcpu)) {
 			scontext->ext_status[entry->ext_idx] =
-				KVM_RISCV_SBI_EXT_UNAVAILABLE;
+				KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE;
 			continue;
 		}
 
-		scontext->ext_status[entry->ext_idx] = ext->default_unavail ?
-					KVM_RISCV_SBI_EXT_UNAVAILABLE :
-					KVM_RISCV_SBI_EXT_AVAILABLE;
+		scontext->ext_status[entry->ext_idx] = ext->default_disabled ?
+					KVM_RISCV_SBI_EXT_STATUS_DISABLED :
+					KVM_RISCV_SBI_EXT_STATUS_ENABLED;
 	}
 }
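The rename above from a two-way AVAILABLE/UNAVAILABLE status to a four-state enum separates "the host cannot offer this extension at all" from "userspace turned a supported extension off". A tiny model of the two predicates layered on that status (enum names shortened for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the shape of enum kvm_riscv_sbi_ext_status after the rename */
enum ext_status {
	STATUS_UNINITIALIZED,
	STATUS_UNAVAILABLE,	/* host cannot offer the extension */
	STATUS_ENABLED,
	STATUS_DISABLED,	/* supported, but switched off by userspace */
};

/* Like riscv_vcpu_supports_sbi_ext(): visible to userspace unless unavailable */
static bool ext_supported(enum ext_status s)
{
	return s != STATUS_UNAVAILABLE;
}

/* Like the check in kvm_vcpu_sbi_find_ext(): callable from the guest only when enabled */
static bool ext_callable(enum ext_status s)
{
	return s == STATUS_ENABLED;
}
```

The distinction matters for the ONE_REG interface: a DISABLED extension still shows up in the register list (userspace may re-enable it), while an UNAVAILABLE one returns -ENOENT.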
arch/riscv/kvm/vcpu_sbi_replace.c (+1 -1)
··· 204 204 const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_dbcn = { 205 205 .extid_start = SBI_EXT_DBCN, 206 206 .extid_end = SBI_EXT_DBCN, 207 - .default_unavail = true, 207 + .default_disabled = true, 208 208 .handler = kvm_sbi_ext_dbcn_handler, 209 209 };
arch/riscv/kvm/vcpu_sbi_sta.c (+208)
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (c) 2023 Ventana Micro Systems Inc. 4 + */ 5 + 6 + #include <linux/kconfig.h> 7 + #include <linux/kernel.h> 8 + #include <linux/kvm_host.h> 9 + #include <linux/mm.h> 10 + #include <linux/sizes.h> 11 + 12 + #include <asm/bug.h> 13 + #include <asm/current.h> 14 + #include <asm/kvm_vcpu_sbi.h> 15 + #include <asm/page.h> 16 + #include <asm/sbi.h> 17 + #include <asm/uaccess.h> 18 + 19 + void kvm_riscv_vcpu_sbi_sta_reset(struct kvm_vcpu *vcpu) 20 + { 21 + vcpu->arch.sta.shmem = INVALID_GPA; 22 + vcpu->arch.sta.last_steal = 0; 23 + } 24 + 25 + void kvm_riscv_vcpu_record_steal_time(struct kvm_vcpu *vcpu) 26 + { 27 + gpa_t shmem = vcpu->arch.sta.shmem; 28 + u64 last_steal = vcpu->arch.sta.last_steal; 29 + u32 *sequence_ptr, sequence; 30 + u64 *steal_ptr, steal; 31 + unsigned long hva; 32 + gfn_t gfn; 33 + 34 + if (shmem == INVALID_GPA) 35 + return; 36 + 37 + /* 38 + * shmem is 64-byte aligned (see the enforcement in 39 + * kvm_sbi_sta_steal_time_set_shmem()) and the size of sbi_sta_struct 40 + * is 64 bytes, so we know all its offsets are in the same page. 
41 + */ 42 + gfn = shmem >> PAGE_SHIFT; 43 + hva = kvm_vcpu_gfn_to_hva(vcpu, gfn); 44 + 45 + if (WARN_ON(kvm_is_error_hva(hva))) { 46 + vcpu->arch.sta.shmem = INVALID_GPA; 47 + return; 48 + } 49 + 50 + sequence_ptr = (u32 *)(hva + offset_in_page(shmem) + 51 + offsetof(struct sbi_sta_struct, sequence)); 52 + steal_ptr = (u64 *)(hva + offset_in_page(shmem) + 53 + offsetof(struct sbi_sta_struct, steal)); 54 + 55 + if (WARN_ON(get_user(sequence, sequence_ptr))) 56 + return; 57 + 58 + sequence = le32_to_cpu(sequence); 59 + sequence += 1; 60 + 61 + if (WARN_ON(put_user(cpu_to_le32(sequence), sequence_ptr))) 62 + return; 63 + 64 + if (!WARN_ON(get_user(steal, steal_ptr))) { 65 + steal = le64_to_cpu(steal); 66 + vcpu->arch.sta.last_steal = READ_ONCE(current->sched_info.run_delay); 67 + steal += vcpu->arch.sta.last_steal - last_steal; 68 + WARN_ON(put_user(cpu_to_le64(steal), steal_ptr)); 69 + } 70 + 71 + sequence += 1; 72 + WARN_ON(put_user(cpu_to_le32(sequence), sequence_ptr)); 73 + 74 + kvm_vcpu_mark_page_dirty(vcpu, gfn); 75 + } 76 + 77 + static int kvm_sbi_sta_steal_time_set_shmem(struct kvm_vcpu *vcpu) 78 + { 79 + struct kvm_cpu_context *cp = &vcpu->arch.guest_context; 80 + unsigned long shmem_phys_lo = cp->a0; 81 + unsigned long shmem_phys_hi = cp->a1; 82 + u32 flags = cp->a2; 83 + struct sbi_sta_struct zero_sta = {0}; 84 + unsigned long hva; 85 + bool writable; 86 + gpa_t shmem; 87 + int ret; 88 + 89 + if (flags != 0) 90 + return SBI_ERR_INVALID_PARAM; 91 + 92 + if (shmem_phys_lo == SBI_STA_SHMEM_DISABLE && 93 + shmem_phys_hi == SBI_STA_SHMEM_DISABLE) { 94 + vcpu->arch.sta.shmem = INVALID_GPA; 95 + return 0; 96 + } 97 + 98 + if (shmem_phys_lo & (SZ_64 - 1)) 99 + return SBI_ERR_INVALID_PARAM; 100 + 101 + shmem = shmem_phys_lo; 102 + 103 + if (shmem_phys_hi != 0) { 104 + if (IS_ENABLED(CONFIG_32BIT)) 105 + shmem |= ((gpa_t)shmem_phys_hi << 32); 106 + else 107 + return SBI_ERR_INVALID_ADDRESS; 108 + } 109 + 110 + hva = kvm_vcpu_gfn_to_hva_prot(vcpu, shmem >> 
PAGE_SHIFT, &writable); 111 + if (kvm_is_error_hva(hva) || !writable) 112 + return SBI_ERR_INVALID_ADDRESS; 113 + 114 + ret = kvm_vcpu_write_guest(vcpu, shmem, &zero_sta, sizeof(zero_sta)); 115 + if (ret) 116 + return SBI_ERR_FAILURE; 117 + 118 + vcpu->arch.sta.shmem = shmem; 119 + vcpu->arch.sta.last_steal = current->sched_info.run_delay; 120 + 121 + return 0; 122 + } 123 + 124 + static int kvm_sbi_ext_sta_handler(struct kvm_vcpu *vcpu, struct kvm_run *run, 125 + struct kvm_vcpu_sbi_return *retdata) 126 + { 127 + struct kvm_cpu_context *cp = &vcpu->arch.guest_context; 128 + unsigned long funcid = cp->a6; 129 + int ret; 130 + 131 + switch (funcid) { 132 + case SBI_EXT_STA_STEAL_TIME_SET_SHMEM: 133 + ret = kvm_sbi_sta_steal_time_set_shmem(vcpu); 134 + break; 135 + default: 136 + ret = SBI_ERR_NOT_SUPPORTED; 137 + break; 138 + } 139 + 140 + retdata->err_val = ret; 141 + 142 + return 0; 143 + } 144 + 145 + static unsigned long kvm_sbi_ext_sta_probe(struct kvm_vcpu *vcpu) 146 + { 147 + return !!sched_info_on(); 148 + } 149 + 150 + const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_sta = { 151 + .extid_start = SBI_EXT_STA, 152 + .extid_end = SBI_EXT_STA, 153 + .handler = kvm_sbi_ext_sta_handler, 154 + .probe = kvm_sbi_ext_sta_probe, 155 + }; 156 + 157 + int kvm_riscv_vcpu_get_reg_sbi_sta(struct kvm_vcpu *vcpu, 158 + unsigned long reg_num, 159 + unsigned long *reg_val) 160 + { 161 + switch (reg_num) { 162 + case KVM_REG_RISCV_SBI_STA_REG(shmem_lo): 163 + *reg_val = (unsigned long)vcpu->arch.sta.shmem; 164 + break; 165 + case KVM_REG_RISCV_SBI_STA_REG(shmem_hi): 166 + if (IS_ENABLED(CONFIG_32BIT)) 167 + *reg_val = upper_32_bits(vcpu->arch.sta.shmem); 168 + else 169 + *reg_val = 0; 170 + break; 171 + default: 172 + return -EINVAL; 173 + } 174 + 175 + return 0; 176 + } 177 + 178 + int kvm_riscv_vcpu_set_reg_sbi_sta(struct kvm_vcpu *vcpu, 179 + unsigned long reg_num, 180 + unsigned long reg_val) 181 + { 182 + switch (reg_num) { 183 + case KVM_REG_RISCV_SBI_STA_REG(shmem_lo): 
184 + if (IS_ENABLED(CONFIG_32BIT)) { 185 + gpa_t hi = upper_32_bits(vcpu->arch.sta.shmem); 186 + 187 + vcpu->arch.sta.shmem = reg_val; 188 + vcpu->arch.sta.shmem |= hi << 32; 189 + } else { 190 + vcpu->arch.sta.shmem = reg_val; 191 + } 192 + break; 193 + case KVM_REG_RISCV_SBI_STA_REG(shmem_hi): 194 + if (IS_ENABLED(CONFIG_32BIT)) { 195 + gpa_t lo = lower_32_bits(vcpu->arch.sta.shmem); 196 + 197 + vcpu->arch.sta.shmem = ((gpa_t)reg_val << 32); 198 + vcpu->arch.sta.shmem |= lo; 199 + } else if (reg_val != 0) { 200 + return -EINVAL; 201 + } 202 + break; 203 + default: 204 + return -EINVAL; 205 + } 206 + 207 + return 0; 208 + }
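kvm_riscv_vcpu_record_steal_time() above publishes the steal value under an even/odd sequence counter: the hypervisor bumps the sequence to odd before touching the payload and back to even afterwards, so a guest reader that sees an odd or changed sequence retries. A single-threaded model of that protocol (the struct and names are simplified stand-ins for sbi_sta_struct):

```c
#include <assert.h>
#include <stdint.h>

struct sta_rec { uint32_t sequence; uint64_t steal; };

static void sta_add_steal(struct sta_rec *r, uint64_t delta)
{
    r->sequence += 1;          /* odd: update in progress */
    r->steal += delta;
    r->sequence += 1;          /* even: record consistent again */
}

static int sta_read(const struct sta_rec *r, uint64_t *steal)
{
    uint32_t seq = r->sequence;
    if (seq & 1)
        return -1;             /* writer active, caller should retry */
    *steal = r->steal;
    return r->sequence == seq ? 0 : -1;  /* changed under us? retry */
}
```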
arch/riscv/kvm/vcpu_switch.S (+14 -18)
··· 15 15 .altmacro 16 16 .option norelax 17 17 18 - ENTRY(__kvm_riscv_switch_to) 18 + SYM_FUNC_START(__kvm_riscv_switch_to) 19 19 /* Save Host GPRs (except A0 and T0-T6) */ 20 20 REG_S ra, (KVM_ARCH_HOST_RA)(a0) 21 21 REG_S sp, (KVM_ARCH_HOST_SP)(a0) ··· 45 45 REG_L t0, (KVM_ARCH_GUEST_SSTATUS)(a0) 46 46 REG_L t1, (KVM_ARCH_GUEST_HSTATUS)(a0) 47 47 REG_L t2, (KVM_ARCH_GUEST_SCOUNTEREN)(a0) 48 - la t4, __kvm_switch_return 48 + la t4, .Lkvm_switch_return 49 49 REG_L t5, (KVM_ARCH_GUEST_SEPC)(a0) 50 50 51 51 /* Save Host and Restore Guest SSTATUS */ ··· 113 113 114 114 /* Back to Host */ 115 115 .align 2 116 - __kvm_switch_return: 116 + .Lkvm_switch_return: 117 117 /* Swap Guest A0 with SSCRATCH */ 118 118 csrrw a0, CSR_SSCRATCH, a0 119 119 ··· 208 208 209 209 /* Return to C code */ 210 210 ret 211 - ENDPROC(__kvm_riscv_switch_to) 211 + SYM_FUNC_END(__kvm_riscv_switch_to) 212 212 213 - ENTRY(__kvm_riscv_unpriv_trap) 213 + SYM_CODE_START(__kvm_riscv_unpriv_trap) 214 214 /* 215 215 * We assume that faulting unpriv load/store instruction is 216 216 * 4-byte long and blindly increment SEPC by 4. 
··· 231 231 csrr a1, CSR_HTINST 232 232 REG_S a1, (KVM_ARCH_TRAP_HTINST)(a0) 233 233 sret 234 - ENDPROC(__kvm_riscv_unpriv_trap) 234 + SYM_CODE_END(__kvm_riscv_unpriv_trap) 235 235 236 236 #ifdef CONFIG_FPU 237 - .align 3 238 - .global __kvm_riscv_fp_f_save 239 - __kvm_riscv_fp_f_save: 237 + SYM_FUNC_START(__kvm_riscv_fp_f_save) 240 238 csrr t2, CSR_SSTATUS 241 239 li t1, SR_FS 242 240 csrs CSR_SSTATUS, t1 ··· 274 276 sw t0, KVM_ARCH_FP_F_FCSR(a0) 275 277 csrw CSR_SSTATUS, t2 276 278 ret 279 + SYM_FUNC_END(__kvm_riscv_fp_f_save) 277 280 278 - .align 3 279 - .global __kvm_riscv_fp_d_save 280 - __kvm_riscv_fp_d_save: 281 + SYM_FUNC_START(__kvm_riscv_fp_d_save) 281 282 csrr t2, CSR_SSTATUS 282 283 li t1, SR_FS 283 284 csrs CSR_SSTATUS, t1 ··· 316 319 sw t0, KVM_ARCH_FP_D_FCSR(a0) 317 320 csrw CSR_SSTATUS, t2 318 321 ret 322 + SYM_FUNC_END(__kvm_riscv_fp_d_save) 319 323 320 - .align 3 321 - .global __kvm_riscv_fp_f_restore 322 - __kvm_riscv_fp_f_restore: 324 + SYM_FUNC_START(__kvm_riscv_fp_f_restore) 323 325 csrr t2, CSR_SSTATUS 324 326 li t1, SR_FS 325 327 lw t0, KVM_ARCH_FP_F_FCSR(a0) ··· 358 362 fscsr t0 359 363 csrw CSR_SSTATUS, t2 360 364 ret 365 + SYM_FUNC_END(__kvm_riscv_fp_f_restore) 361 366 362 - .align 3 363 - .global __kvm_riscv_fp_d_restore 364 - __kvm_riscv_fp_d_restore: 367 + SYM_FUNC_START(__kvm_riscv_fp_d_restore) 365 368 csrr t2, CSR_SSTATUS 366 369 li t1, SR_FS 367 370 lw t0, KVM_ARCH_FP_D_FCSR(a0) ··· 400 405 fscsr t0 401 406 csrw CSR_SSTATUS, t2 402 407 ret 408 + SYM_FUNC_END(__kvm_riscv_fp_d_restore) 403 409 #endif
arch/riscv/kvm/vcpu_vector.c (+16)
··· 76 76 cntx->vector.datap = kmalloc(riscv_v_vsize, GFP_KERNEL); 77 77 if (!cntx->vector.datap) 78 78 return -ENOMEM; 79 + cntx->vector.vlenb = riscv_v_vsize / 32; 79 80 80 81 vcpu->arch.host_context.vector.datap = kzalloc(riscv_v_vsize, GFP_KERNEL); 81 82 if (!vcpu->arch.host_context.vector.datap) ··· 115 114 break; 116 115 case KVM_REG_RISCV_VECTOR_CSR_REG(vcsr): 117 116 *reg_addr = &cntx->vector.vcsr; 117 + break; 118 + case KVM_REG_RISCV_VECTOR_CSR_REG(vlenb): 119 + *reg_addr = &cntx->vector.vlenb; 118 120 break; 119 121 case KVM_REG_RISCV_VECTOR_CSR_REG(datap): 120 122 default: ··· 176 172 177 173 if (!riscv_isa_extension_available(isa, v)) 178 174 return -ENOENT; 175 + 176 + if (reg_num == KVM_REG_RISCV_VECTOR_CSR_REG(vlenb)) { 177 + struct kvm_cpu_context *cntx = &vcpu->arch.guest_context; 178 + unsigned long reg_val; 179 + 180 + if (copy_from_user(&reg_val, uaddr, reg_size)) 181 + return -EFAULT; 182 + if (reg_val != cntx->vector.vlenb) 183 + return -EINVAL; 184 + 185 + return 0; 186 + } 179 187 180 188 rc = kvm_riscv_vcpu_vreg_addr(vcpu, reg_num, reg_size, &reg_addr); 181 189 if (rc)
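The vector changes expose vlenb (riscv_v_vsize divided by the 32 vector registers) as an effectively read-only pseudo-register: a "set" succeeds only when the written value matches the current one, so save/restore round-trips work but resizing is rejected with -EINVAL. A minimal sketch of that accept-only-if-equal check (demo_vcpu is a hypothetical stand-in):

```c
#include <assert.h>

struct demo_vcpu { unsigned long vlenb; };

/* Accept a write to a read-only register only if it is a no-op. */
static int set_vlenb(struct demo_vcpu *v, unsigned long val)
{
    if (val != v->vlenb)
        return -22;    /* -EINVAL: vlenb cannot be changed */
    return 0;
}
```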
arch/riscv/kvm/vm.c (-1)
··· 179 179 r = kvm_riscv_aia_available(); 180 180 break; 181 181 case KVM_CAP_IOEVENTFD: 182 - case KVM_CAP_DEVICE_CTRL: 183 182 case KVM_CAP_USER_MEMORY: 184 183 case KVM_CAP_SYNC_MMU: 185 184 case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
arch/s390/include/asm/facility.h (+6)
··· 111 111 preempt_enable(); 112 112 } 113 113 114 + /** 115 + * stfle_size - Actual size of the facility list as specified by stfle 116 + * (number of double words) 117 + */ 118 + unsigned int stfle_size(void); 119 + 114 120 #endif /* __ASM_FACILITY_H */
arch/s390/include/asm/kvm_host.h (+1 -1)
··· 818 818 819 819 struct kvm_s390_cpu_model { 820 820 /* facility mask supported by kvm & hosting machine */ 821 - __u64 fac_mask[S390_ARCH_FAC_LIST_SIZE_U64]; 821 + __u64 fac_mask[S390_ARCH_FAC_MASK_SIZE_U64]; 822 822 struct kvm_s390_vm_cpu_subfunc subfuncs; 823 823 /* facility list requested by guest (in dma page) */ 824 824 __u64 *fac_list;
arch/s390/kernel/Makefile (+1 -1)
··· 41 41 obj-y += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o 42 42 obj-y += entry.o reipl.o kdebugfs.o alternative.o 43 43 obj-y += nospec-branch.o ipl_vmparm.o machine_kexec_reloc.o unwind_bc.o 44 - obj-y += smp.o text_amode31.o stacktrace.o abs_lowcore.o 44 + obj-y += smp.o text_amode31.o stacktrace.o abs_lowcore.o facility.o 45 45 46 46 extra-y += vmlinux.lds 47 47
arch/s390/kernel/facility.c (+21)
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright IBM Corp. 2023 4 + */ 5 + 6 + #include <asm/facility.h> 7 + 8 + unsigned int stfle_size(void) 9 + { 10 + static unsigned int size; 11 + unsigned int r; 12 + u64 dummy; 13 + 14 + r = READ_ONCE(size); 15 + if (!r) { 16 + r = __stfle_asm(&dummy, 1) + 1; 17 + WRITE_ONCE(size, r); 18 + } 19 + return r; 20 + } 21 + EXPORT_SYMBOL(stfle_size);
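stfle_size() above computes the facility-list size once and caches it in a static; the kernel version wraps the accesses in READ_ONCE()/WRITE_ONCE() so a racing first call at worst repeats the idempotent probe. A userspace sketch of that compute-once caching pattern, with a hypothetical slow_probe() standing in for the STFLE instruction:

```c
#include <assert.h>

static int probe_calls;
static unsigned int slow_probe(void) { probe_calls++; return 3; }

static unsigned int cached_size(void)
{
    static unsigned int size;          /* 0 means "not probed yet" */
    unsigned int r = size;

    if (!r) {
        r = slow_probe() + 1;          /* stfle reports size - 1 doublewords */
        size = r;                      /* cache for subsequent callers */
    }
    return r;
}
```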
arch/s390/kvm/Kconfig (+1 -4)
··· 20 20 def_tristate y 21 21 prompt "Kernel-based Virtual Machine (KVM) support" 22 22 depends on HAVE_KVM 23 - select PREEMPT_NOTIFIERS 24 23 select HAVE_KVM_CPU_RELAX_INTERCEPT 25 24 select HAVE_KVM_VCPU_ASYNC_IOCTL 26 - select HAVE_KVM_EVENTFD 27 25 select KVM_ASYNC_PF 28 26 select KVM_ASYNC_PF_SYNC 27 + select KVM_COMMON 29 28 select HAVE_KVM_IRQCHIP 30 - select HAVE_KVM_IRQFD 31 29 select HAVE_KVM_IRQ_ROUTING 32 30 select HAVE_KVM_INVALID_WAKEUPS 33 31 select HAVE_KVM_NO_POLL 34 32 select KVM_VFIO 35 - select INTERVAL_TREE 36 33 select MMU_NOTIFIER 37 34 help 38 35 Support hosting paravirtualized guest machines using the SIE
arch/s390/kvm/guestdbg.c (+2 -2)
··· 213 213 else if (dbg->arch.nr_hw_bp > MAX_BP_COUNT) 214 214 return -EINVAL; 215 215 216 - bp_data = memdup_user(dbg->arch.hw_bp, 217 - sizeof(*bp_data) * dbg->arch.nr_hw_bp); 216 + bp_data = memdup_array_user(dbg->arch.hw_bp, dbg->arch.nr_hw_bp, 217 + sizeof(*bp_data)); 218 218 if (IS_ERR(bp_data)) 219 219 return PTR_ERR(bp_data); 220 220
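The guestdbg.c change is one of the memdup_array_user() conversions mentioned in the merge description: the hardened helper rejects an n * size product that overflows instead of silently wrapping and allocating a too-small buffer. A userspace analog of that check, using the compiler's overflow builtin:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate an n-element array, refusing sizes that overflow size_t. */
static void *memdup_array(const void *src, size_t n, size_t size)
{
    size_t bytes;
    void *dst;

    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL;               /* n * size would wrap */

    dst = malloc(bytes);
    if (dst)
        memcpy(dst, src, bytes);
    return dst;
}
```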
arch/s390/kvm/kvm-s390.c (-1)
··· 563 563 case KVM_CAP_ENABLE_CAP: 564 564 case KVM_CAP_S390_CSS_SUPPORT: 565 565 case KVM_CAP_IOEVENTFD: 566 - case KVM_CAP_DEVICE_CTRL: 567 566 case KVM_CAP_S390_IRQCHIP: 568 567 case KVM_CAP_VM_ATTRIBUTES: 569 568 case KVM_CAP_MP_STATE:
arch/s390/kvm/vsie.c (+17 -2)
··· 19 19 #include <asm/nmi.h> 20 20 #include <asm/dis.h> 21 21 #include <asm/fpu/api.h> 22 + #include <asm/facility.h> 22 23 #include "kvm-s390.h" 23 24 #include "gaccess.h" 24 25 ··· 985 984 static int handle_stfle(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) 986 985 { 987 986 struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s; 988 - __u32 fac = READ_ONCE(vsie_page->scb_o->fac) & 0x7ffffff8U; 987 + __u32 fac = READ_ONCE(vsie_page->scb_o->fac); 989 988 989 + /* 990 + * Alternate-STFLE-Interpretive-Execution facilities are not supported 991 + * -> format-0 flcb 992 + */ 990 993 if (fac && test_kvm_facility(vcpu->kvm, 7)) { 991 994 retry_vsie_icpt(vsie_page); 995 + /* 996 + * The facility list origin (FLO) is in bits 1 - 28 of the FLD 997 + * so we need to mask here before reading. 998 + */ 999 + fac = fac & 0x7ffffff8U; 1000 + /* 1001 + * format-0 -> size of nested guest's facility list == guest's size 1002 + * guest's size == host's size, since STFLE is interpretatively executed 1003 + * using a format-0 for the guest, too. 1004 + */ 992 1005 if (read_guest_real(vcpu, fac, &vsie_page->fac, 993 - sizeof(vsie_page->fac))) 1006 + stfle_size() * sizeof(u64))) 994 1007 return set_validity_icpt(scb_s, 0x1090U); 995 1008 scb_s->fac = (__u32)(__u64) &vsie_page->fac; 996 1009 }
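The vsie.c comment added above notes that the facility-list origin lives in bits 1-28 of the FLD word, so the low three bits and the top bit must be masked off before the value is used as an address. The mask from the patch, as a tiny testable helper:

```c
#include <stdint.h>

/* Extract the facility-list origin: keep bits 1-28, i.e. a
 * doubleword-aligned address, dropping the top bit and bits 0-2. */
static uint32_t flcb_origin(uint32_t fld)
{
    return fld & 0x7ffffff8U;
}
```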
arch/x86/include/asm/kvm-x86-ops.h (+3)
··· 55 55 KVM_X86_OP(get_if_flag) 56 56 KVM_X86_OP(flush_tlb_all) 57 57 KVM_X86_OP(flush_tlb_current) 58 + #if IS_ENABLED(CONFIG_HYPERV) 58 59 KVM_X86_OP_OPTIONAL(flush_remote_tlbs) 59 60 KVM_X86_OP_OPTIONAL(flush_remote_tlbs_range) 61 + #endif 60 62 KVM_X86_OP(flush_tlb_gva) 61 63 KVM_X86_OP(flush_tlb_guest) 62 64 KVM_X86_OP(vcpu_pre_run) ··· 137 135 KVM_X86_OP(complete_emulated_msr) 138 136 KVM_X86_OP(vcpu_deliver_sipi_vector) 139 137 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons); 138 + KVM_X86_OP_OPTIONAL(get_untagged_addr) 140 139 141 140 #undef KVM_X86_OP 142 141 #undef KVM_X86_OP_OPTIONAL
arch/x86/include/asm/kvm-x86-pmu-ops.h (+1 -1)
··· 22 22 KVM_X86_PMU_OP(set_msr) 23 23 KVM_X86_PMU_OP(refresh) 24 24 KVM_X86_PMU_OP(init) 25 - KVM_X86_PMU_OP(reset) 25 + KVM_X86_PMU_OP_OPTIONAL(reset) 26 26 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi) 27 27 KVM_X86_PMU_OP_OPTIONAL(cleanup) 28 28
arch/x86/include/asm/kvm_host.h (+63 -12)
··· 133 133 | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \ 134 134 | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \ 135 135 | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \ 136 - | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP)) 136 + | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \ 137 + | X86_CR4_LAM_SUP)) 137 138 138 139 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR) 139 140 ··· 501 500 u8 idx; 502 501 bool is_paused; 503 502 bool intr; 503 + /* 504 + * Base value of the PMC counter, relative to the *consumed* count in 505 + * the associated perf_event. This value includes counter updates from 506 + * the perf_event and emulated_count since the last time the counter 507 + * was reprogrammed, but it is *not* the current value as seen by the 508 + * guest or userspace. 509 + * 510 + * The count is relative to the associated perf_event so that KVM 511 + * doesn't need to reprogram the perf_event every time the guest writes 512 + * to the counter. 513 + */ 504 514 u64 counter; 505 - u64 prev_counter; 515 + /* 516 + * PMC events triggered by KVM emulation that haven't been fully 517 + * processed, i.e. haven't undergone overflow detection. 
518 + */ 519 + u64 emulated_counter; 506 520 u64 eventsel; 507 521 struct perf_event *perf_event; 508 522 struct kvm_vcpu *vcpu; ··· 953 937 /* used for guest single stepping over the given code position */ 954 938 unsigned long singlestep_rip; 955 939 940 + #ifdef CONFIG_KVM_HYPERV 956 941 bool hyperv_enabled; 957 942 struct kvm_vcpu_hv *hyperv; 943 + #endif 958 944 #ifdef CONFIG_KVM_XEN 959 945 struct kvm_vcpu_xen xen; 960 946 #endif ··· 1113 1095 HV_TSC_PAGE_BROKEN, 1114 1096 }; 1115 1097 1098 + #ifdef CONFIG_KVM_HYPERV 1116 1099 /* Hyper-V emulation context */ 1117 1100 struct kvm_hv { 1118 1101 struct mutex hv_lock; ··· 1144 1125 */ 1145 1126 unsigned int synic_auto_eoi_used; 1146 1127 1147 - struct hv_partition_assist_pg *hv_pa_pg; 1148 1128 struct kvm_hv_syndbg hv_syndbg; 1149 1129 }; 1130 + #endif 1150 1131 1151 1132 struct msr_bitmap_range { 1152 1133 u32 flags; ··· 1155 1136 unsigned long *bitmap; 1156 1137 }; 1157 1138 1139 + #ifdef CONFIG_KVM_XEN 1158 1140 /* Xen emulation context */ 1159 1141 struct kvm_xen { 1160 1142 struct mutex xen_lock; ··· 1167 1147 struct idr evtchn_ports; 1168 1148 unsigned long poll_mask[BITS_TO_LONGS(KVM_MAX_VCPUS)]; 1169 1149 }; 1150 + #endif 1170 1151 1171 1152 enum kvm_irqchip_mode { 1172 1153 KVM_IRQCHIP_NONE, ··· 1276 1255 }; 1277 1256 1278 1257 struct kvm_arch { 1258 + unsigned long vm_type; 1279 1259 unsigned long n_used_mmu_pages; 1280 1260 unsigned long n_requested_mmu_pages; 1281 1261 unsigned long n_max_mmu_pages; ··· 1369 1347 /* reads protected by irq_srcu, writes by irq_lock */ 1370 1348 struct hlist_head mask_notifier_list; 1371 1349 1350 + #ifdef CONFIG_KVM_HYPERV 1372 1351 struct kvm_hv hyperv; 1352 + #endif 1353 + 1354 + #ifdef CONFIG_KVM_XEN 1373 1355 struct kvm_xen xen; 1356 + #endif 1374 1357 1375 1358 bool backwards_tsc_observed; 1376 1359 bool boot_vcpu_runs_old_kvmclock; ··· 1433 1406 * the MMU lock in read mode + RCU or 1434 1407 * the MMU lock in write mode 1435 1408 * 1436 - * For writes, this list 
is protected by: 1437 - * the MMU lock in read mode + the tdp_mmu_pages_lock or 1438 - * the MMU lock in write mode 1409 + * For writes, this list is protected by tdp_mmu_pages_lock; see 1410 + * below for the details. 1439 1411 * 1440 1412 * Roots will remain in the list until their tdp_mmu_root_count 1441 1413 * drops to zero, at which point the thread that decremented the ··· 1451 1425 * - possible_nx_huge_pages; 1452 1426 * - the possible_nx_huge_page_link field of kvm_mmu_page structs used 1453 1427 * by the TDP MMU 1454 - * It is acceptable, but not necessary, to acquire this lock when 1455 - * the thread holds the MMU lock in write mode. 1428 + * Because the lock is only taken within the MMU lock, strictly 1429 + * speaking it is redundant to acquire this lock when the thread 1430 + * holds the MMU lock in write mode. However it often simplifies 1431 + * the code to do so. 1456 1432 */ 1457 1433 spinlock_t tdp_mmu_pages_lock; 1458 1434 #endif /* CONFIG_X86_64 */ ··· 1469 1441 #if IS_ENABLED(CONFIG_HYPERV) 1470 1442 hpa_t hv_root_tdp; 1471 1443 spinlock_t hv_root_tdp_lock; 1444 + struct hv_partition_assist_pg *hv_pa_pg; 1472 1445 #endif 1473 1446 /* 1474 1447 * VM-scope maximum vCPU ID. Used to determine the size of structures ··· 1642 1613 1643 1614 void (*flush_tlb_all)(struct kvm_vcpu *vcpu); 1644 1615 void (*flush_tlb_current)(struct kvm_vcpu *vcpu); 1616 + #if IS_ENABLED(CONFIG_HYPERV) 1645 1617 int (*flush_remote_tlbs)(struct kvm *kvm); 1646 1618 int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn, 1647 1619 gfn_t nr_pages); 1620 + #endif 1648 1621 1649 1622 /* 1650 1623 * Flush any TLB entries associated with the given GVA. 
··· 1792 1761 * Returns vCPU specific APICv inhibit reasons 1793 1762 */ 1794 1763 unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu); 1764 + 1765 + gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags); 1795 1766 }; 1796 1767 1797 1768 struct kvm_x86_nested_ops { ··· 1857 1824 #define __KVM_HAVE_ARCH_VM_FREE 1858 1825 void kvm_arch_free_vm(struct kvm *kvm); 1859 1826 1827 + #if IS_ENABLED(CONFIG_HYPERV) 1860 1828 #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS 1861 1829 static inline int kvm_arch_flush_remote_tlbs(struct kvm *kvm) 1862 1830 { ··· 1869 1835 } 1870 1836 1871 1837 #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE 1838 + static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, 1839 + u64 nr_pages) 1840 + { 1841 + if (!kvm_x86_ops.flush_remote_tlbs_range) 1842 + return -EOPNOTSUPP; 1843 + 1844 + return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages); 1845 + } 1846 + #endif /* CONFIG_HYPERV */ 1872 1847 1873 1848 #define kvm_arch_pmi_in_guest(vcpu) \ 1874 1849 ((vcpu) && (vcpu)->arch.handling_intr_from_guest) ··· 1890 1847 int kvm_mmu_create(struct kvm_vcpu *vcpu); 1891 1848 void kvm_mmu_init_vm(struct kvm *kvm); 1892 1849 void kvm_mmu_uninit_vm(struct kvm *kvm); 1850 + 1851 + void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm, 1852 + struct kvm_memory_slot *slot); 1893 1853 1894 1854 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu); 1895 1855 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); ··· 2132 2086 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, 2133 2087 int tdp_max_root_level, int tdp_huge_page_level); 2134 2088 2089 + #ifdef CONFIG_KVM_PRIVATE_MEM 2090 + #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM) 2091 + #else 2092 + #define kvm_arch_has_private_mem(kvm) false 2093 + #endif 2094 + 2135 2095 static inline u16 kvm_read_ldt(void) 2136 2096 { 2137 2097 u16 ldt; ··· 2185 2133 #define HF_SMM_MASK (1 << 
1) 2186 2134 #define HF_SMM_INSIDE_NMI_MASK (1 << 2) 2187 2135 2188 - # define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE 2189 - # define KVM_ADDRESS_SPACE_NUM 2 2136 + # define KVM_MAX_NR_ADDRESS_SPACES 2 2137 + /* SMM is currently unsupported for guests with private memory. */ 2138 + # define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2) 2190 2139 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0) 2191 2140 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm) 2192 2141 #else 2193 2142 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0) 2194 2143 #endif 2195 - 2196 - #define KVM_ARCH_WANT_MMU_NOTIFIER 2197 2144 2198 2145 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v); 2199 2146 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
arch/x86/include/uapi/asm/kvm.h (+3)
··· 562 562 /* x86-specific KVM_EXIT_HYPERCALL flags. */ 563 563 #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0) 564 564 565 + #define KVM_X86_DEFAULT_VM 0 566 + #define KVM_X86_SW_PROTECTED_VM 1 567 + 565 568 #endif /* _ASM_X86_KVM_H */
arch/x86/kernel/kvmclock.c (+8 -4)
··· 24 24 25 25 static int kvmclock __initdata = 1; 26 26 static int kvmclock_vsyscall __initdata = 1; 27 - static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME; 28 - static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK; 27 + static int msr_kvm_system_time __ro_after_init; 28 + static int msr_kvm_wall_clock __ro_after_init; 29 29 static u64 kvm_sched_clock_offset __ro_after_init; 30 30 31 31 static int __init parse_no_kvmclock(char *arg) ··· 195 195 196 196 void kvmclock_disable(void) 197 197 { 198 - native_write_msr(msr_kvm_system_time, 0, 0); 198 + if (msr_kvm_system_time) 199 + native_write_msr(msr_kvm_system_time, 0, 0); 199 200 } 200 201 201 202 static void __init kvmclock_init_mem(void) ··· 295 294 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) { 296 295 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; 297 296 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; 298 - } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) { 297 + } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) { 298 + msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; 299 + msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; 300 + } else { 299 301 return; 300 302 } 301 303
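The kvmclock.c fix stops assuming the legacy MSR pair: the MSR variables start at 0, the new pair is preferred, the legacy pair is a fallback, and if neither feature is advertised kvmclock_disable() now skips the MSR write entirely. A model of that selection logic (the MSR numbers here are placeholders, not the real MSR_KVM_* values):

```c
enum { MSR_NONE = 0, MSR_OLD = 0x12, MSR_NEW = 0x4b564d01 };

/* Prefer the new clocksource MSR, fall back to legacy, else none. */
static int pick_system_time_msr(int has_new, int has_old)
{
    if (has_new)
        return MSR_NEW;
    if (has_old)
        return MSR_OLD;
    return MSR_NONE;   /* caller must not touch the MSR at all */
}
```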
arch/x86/kvm/Kconfig (+35 -12)
··· 23 23 depends on HAVE_KVM 24 24 depends on HIGH_RES_TIMERS 25 25 depends on X86_LOCAL_APIC 26 - select PREEMPT_NOTIFIERS 27 - select MMU_NOTIFIER 26 + select KVM_COMMON 27 + select KVM_GENERIC_MMU_NOTIFIER 28 28 select HAVE_KVM_IRQCHIP 29 29 select HAVE_KVM_PFNCACHE 30 - select HAVE_KVM_IRQFD 31 30 select HAVE_KVM_DIRTY_RING_TSO 32 31 select HAVE_KVM_DIRTY_RING_ACQ_REL 33 32 select IRQ_BYPASS_MANAGER 34 33 select HAVE_KVM_IRQ_BYPASS 35 34 select HAVE_KVM_IRQ_ROUTING 36 - select HAVE_KVM_EVENTFD 37 35 select KVM_ASYNC_PF 38 36 select USER_RETURN_NOTIFIER 39 37 select KVM_MMIO ··· 44 46 select KVM_XFER_TO_GUEST_WORK 45 47 select KVM_GENERIC_DIRTYLOG_READ_PROTECT 46 48 select KVM_VFIO 47 - select INTERVAL_TREE 48 49 select HAVE_KVM_PM_NOTIFIER if PM 49 50 select KVM_GENERIC_HARDWARE_ENABLING 50 51 help ··· 62 65 63 66 config KVM_WERROR 64 67 bool "Compile KVM with -Werror" 65 - # KASAN may cause the build to fail due to larger frames 66 - default y if X86_64 && !KASAN 67 - # We use the dependency on !COMPILE_TEST to not be enabled 68 - # blindly in allmodconfig or allyesconfig configurations 69 - depends on KVM 70 - depends on (X86_64 && !KASAN) || !COMPILE_TEST 71 - depends on EXPERT 68 + # Disallow KVM's -Werror if KASAN is enabled, e.g. to guard against 69 + # randomized configs from selecting KVM_WERROR=y, which doesn't play 70 + # nice with KASAN. KASAN builds generates warnings for the default 71 + # FRAME_WARN, i.e. KVM_WERROR=y with KASAN=y requires special tuning. 72 + # Building KVM with -Werror and KASAN is still doable via enabling 73 + # the kernel-wide WERROR=y. 74 + depends on KVM && EXPERT && !KASAN 72 75 help 73 76 Add -Werror to the build flags for KVM. 74 77 75 78 If in doubt, say "N". 79 + 80 + config KVM_SW_PROTECTED_VM 81 + bool "Enable support for KVM software-protected VMs" 82 + depends on EXPERT 83 + depends on KVM && X86_64 84 + select KVM_GENERIC_PRIVATE_MEM 85 + help 86 + Enable support for KVM software-protected VMs. 
Currently "protected" 87 + means the VM can be backed with memory provided by 88 + KVM_CREATE_GUEST_MEMFD. 89 + 90 + If unsure, say "N". 76 91 77 92 config KVM_INTEL 78 93 tristate "KVM for Intel (and compatible) processors support" ··· 137 128 firmware to implement UEFI secure boot. 138 129 139 130 If unsure, say Y. 131 + 132 + config KVM_HYPERV 133 + bool "Support for Microsoft Hyper-V emulation" 134 + depends on KVM 135 + default y 136 + help 137 + Provides KVM support for emulating Microsoft Hyper-V. This allows KVM 138 + to expose a subset of the paravirtualized interfaces defined in the 139 + Hyper-V Hypervisor Top-Level Functional Specification (TLFS): 140 + https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs 141 + These interfaces are required for the correct and performant functioning 142 + of Windows and Hyper-V guests on KVM. 143 + 144 + If unsure, say "Y". 140 145 141 146 config KVM_XEN 142 147 bool "Support for Xen hypercall interface"
arch/x86/kvm/Makefile (+9 -7)
··· 11 11 12 12 kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ 13 13 i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \ 14 - hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \ 14 + debugfs.o mmu/mmu.o mmu/page_track.o \ 15 15 mmu/spte.o 16 16 17 - ifdef CONFIG_HYPERV 18 - kvm-y += kvm_onhyperv.o 19 - endif 20 - 21 17 kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o 18 + kvm-$(CONFIG_KVM_HYPERV) += hyperv.o 22 19 kvm-$(CONFIG_KVM_XEN) += xen.o 23 20 kvm-$(CONFIG_KVM_SMM) += smm.o 24 21 25 22 kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \ 26 - vmx/hyperv.o vmx/nested.o vmx/posted_intr.o 23 + vmx/nested.o vmx/posted_intr.o 24 + 27 25 kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o 26 + kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o 28 27 29 28 kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \ 30 - svm/sev.o svm/hyperv.o 29 + svm/sev.o 30 + kvm-amd-$(CONFIG_KVM_HYPERV) += svm/hyperv.o 31 31 32 32 ifdef CONFIG_HYPERV 33 + kvm-y += kvm_onhyperv.o 34 + kvm-intel-y += vmx/vmx_onhyperv.o vmx/hyperv_evmcs.o 33 35 kvm-amd-y += svm/svm_onhyperv.o 34 36 endif 35 37
arch/x86/kvm/cpuid.c (+27 -6)
··· 314 314 315 315 static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent) 316 316 { 317 + #ifdef CONFIG_KVM_HYPERV 317 318 struct kvm_cpuid_entry2 *entry; 318 319 319 320 entry = cpuid_entry2_find(entries, nent, HYPERV_CPUID_INTERFACE, 320 321 KVM_CPUID_INDEX_NOT_SIGNIFICANT); 321 322 return entry && entry->eax == HYPERV_CPUID_SIGNATURE_EAX; 323 + #else 324 + return false; 325 + #endif 322 326 } 323 327 324 328 static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) ··· 437 433 return 0; 438 434 } 439 435 436 + #ifdef CONFIG_KVM_HYPERV 440 437 if (kvm_cpuid_has_hyperv(e2, nent)) { 441 438 r = kvm_hv_vcpu_init(vcpu); 442 439 if (r) 443 440 return r; 444 441 } 442 + #endif 445 443 446 444 r = kvm_check_cpuid(vcpu, e2, nent); 447 445 if (r) ··· 475 469 return -E2BIG; 476 470 477 471 if (cpuid->nent) { 478 - e = vmemdup_user(entries, array_size(sizeof(*e), cpuid->nent)); 472 + e = vmemdup_array_user(entries, cpuid->nent, sizeof(*e)); 479 473 if (IS_ERR(e)) 480 474 return PTR_ERR(e); 481 475 ··· 519 513 return -E2BIG; 520 514 521 515 if (cpuid->nent) { 522 - e2 = vmemdup_user(entries, array_size(sizeof(*e2), cpuid->nent)); 516 + e2 = vmemdup_array_user(entries, cpuid->nent, sizeof(*e2)); 523 517 if (IS_ERR(e2)) 524 518 return PTR_ERR(e2); 525 519 } ··· 677 671 kvm_cpu_cap_mask(CPUID_7_1_EAX, 678 672 F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) | 679 673 F(FZRM) | F(FSRS) | F(FSRC) | 680 - F(AMX_FP16) | F(AVX_IFMA) 674 + F(AMX_FP16) | F(AVX_IFMA) | F(LAM) 681 675 ); 682 676 683 677 kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX, 684 678 F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) | 685 679 F(AMX_COMPLEX) 680 + ); 681 + 682 + kvm_cpu_cap_init_kvm_defined(CPUID_7_2_EDX, 683 + F(INTEL_PSFD) | F(IPRED_CTRL) | F(RRSBA_CTRL) | F(DDPD_U) | 684 + F(BHI_CTRL) | F(MCDT_NO) 686 685 ); 687 686 688 687 kvm_cpu_cap_mask(CPUID_D_1_EAX, ··· 971 960 break; 972 961 /* function 7 has additional index. 
*/ 973 962 case 7: 974 - entry->eax = min(entry->eax, 1u); 963 + max_idx = entry->eax = min(entry->eax, 2u); 975 964 cpuid_entry_override(entry, CPUID_7_0_EBX); 976 965 cpuid_entry_override(entry, CPUID_7_ECX); 977 966 cpuid_entry_override(entry, CPUID_7_EDX); 978 967 979 - /* KVM only supports 0x7.0 and 0x7.1, capped above via min(). */ 980 - if (entry->eax == 1) { 968 + /* KVM only supports up to 0x7.2, capped above via min(). */ 969 + if (max_idx >= 1) { 981 970 entry = do_host_cpuid(array, function, 1); 982 971 if (!entry) 983 972 goto out; ··· 986 975 cpuid_entry_override(entry, CPUID_7_1_EDX); 987 976 entry->ebx = 0; 988 977 entry->ecx = 0; 978 + } 979 + if (max_idx >= 2) { 980 + entry = do_host_cpuid(array, function, 2); 981 + if (!entry) 982 + goto out; 983 + 984 + cpuid_entry_override(entry, CPUID_7_2_EDX); 985 + entry->ecx = 0; 986 + entry->ebx = 0; 987 + entry->eax = 0; 989 988 } 990 989 break; 991 990 case 0xa: { /* Architectural Performance Monitoring */
arch/x86/kvm/cpuid.h (+8 -5)
··· 47 47 return !(gpa & vcpu->arch.reserved_gpa_bits); 48 48 } 49 49 50 - static inline bool kvm_vcpu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa) 51 - { 52 - return !kvm_vcpu_is_legal_gpa(vcpu, gpa); 53 - } 54 - 55 50 static inline bool kvm_vcpu_is_legal_aligned_gpa(struct kvm_vcpu *vcpu, 56 51 gpa_t gpa, gpa_t alignment) 57 52 { ··· 272 277 273 278 return test_bit(kvm_governed_feature_index(x86_feature), 274 279 vcpu->arch.governed_features.enabled); 280 + } 281 + 282 + static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) 283 + { 284 + if (guest_can_use(vcpu, X86_FEATURE_LAM)) 285 + cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57); 286 + 287 + return kvm_vcpu_is_legal_gpa(vcpu, cr3); 275 288 } 276 289 277 290 #endif
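kvm_vcpu_is_legal_cr3() above strips the LAM control bits (X86_CR3_LAM_U48 and X86_CR3_LAM_U57) before the reserved-bit check, since with LAM enabled they are legitimate control bits rather than address bits. A sketch of that check; the bit positions (61 and 62) match the LAM definitions, while the maxphyaddr of 52 is only an example value:

```c
#include <stdint.h>

#define CR3_LAM_U57 (1ULL << 61)
#define CR3_LAM_U48 (1ULL << 62)

/* Mask LAM control bits when the guest may use LAM, then verify no
 * bits above the physical address width are set. */
static int cr3_is_legal(uint64_t cr3, int lam_enabled, int maxphyaddr)
{
    if (lam_enabled)
        cr3 &= ~(CR3_LAM_U48 | CR3_LAM_U57);
    return (cr3 >> maxphyaddr) == 0;
}
```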
+1 -1
arch/x86/kvm/debugfs.c
··· 111 111 mutex_lock(&kvm->slots_lock); 112 112 write_lock(&kvm->mmu_lock); 113 113 114 - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 114 + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { 115 115 int bkt; 116 116 117 117 slots = __kvm_memslots(kvm, i);
+15 -12
arch/x86/kvm/emulate.c
··· 687 687 static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt, 688 688 struct segmented_address addr, 689 689 unsigned *max_size, unsigned size, 690 - bool write, bool fetch, 691 - enum x86emul_mode mode, ulong *linear) 690 + enum x86emul_mode mode, ulong *linear, 691 + unsigned int flags) 692 692 { 693 693 struct desc_struct desc; 694 694 bool usable; ··· 701 701 *max_size = 0; 702 702 switch (mode) { 703 703 case X86EMUL_MODE_PROT64: 704 - *linear = la; 704 + *linear = la = ctxt->ops->get_untagged_addr(ctxt, la, flags); 705 705 va_bits = ctxt_virt_addr_bits(ctxt); 706 706 if (!__is_canonical_address(la, va_bits)) 707 707 goto bad; ··· 717 717 if (!usable) 718 718 goto bad; 719 719 /* code segment in protected mode or read-only data segment */ 720 - if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8)) 721 - || !(desc.type & 2)) && write) 720 + if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8)) || !(desc.type & 2)) && 721 + (flags & X86EMUL_F_WRITE)) 722 722 goto bad; 723 723 /* unreadable code segment */ 724 - if (!fetch && (desc.type & 8) && !(desc.type & 2)) 724 + if (!(flags & X86EMUL_F_FETCH) && (desc.type & 8) && !(desc.type & 2)) 725 725 goto bad; 726 726 lim = desc_limit_scaled(&desc); 727 727 if (!(desc.type & 8) && (desc.type & 4)) { ··· 757 757 ulong *linear) 758 758 { 759 759 unsigned max_size; 760 - return __linearize(ctxt, addr, &max_size, size, write, false, 761 - ctxt->mode, linear); 760 + return __linearize(ctxt, addr, &max_size, size, ctxt->mode, linear, 761 + write ? 
X86EMUL_F_WRITE : 0); 762 762 } 763 763 764 764 static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst) ··· 771 771 772 772 if (ctxt->op_bytes != sizeof(unsigned long)) 773 773 addr.ea = dst & ((1UL << (ctxt->op_bytes << 3)) - 1); 774 - rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode, &linear); 774 + rc = __linearize(ctxt, addr, &max_size, 1, ctxt->mode, &linear, 775 + X86EMUL_F_FETCH); 775 776 if (rc == X86EMUL_CONTINUE) 776 777 ctxt->_eip = addr.ea; 777 778 return rc; ··· 908 907 * boundary check itself. Instead, we use max_size to check 909 908 * against op_size. 910 909 */ 911 - rc = __linearize(ctxt, addr, &max_size, 0, false, true, ctxt->mode, 912 - &linear); 910 + rc = __linearize(ctxt, addr, &max_size, 0, ctxt->mode, &linear, 911 + X86EMUL_F_FETCH); 913 912 if (unlikely(rc != X86EMUL_CONTINUE)) 914 913 return rc; 915 914 ··· 3440 3439 { 3441 3440 int rc; 3442 3441 ulong linear; 3442 + unsigned int max_size; 3443 3443 3444 - rc = linearize(ctxt, ctxt->src.addr.mem, 1, false, &linear); 3444 + rc = __linearize(ctxt, ctxt->src.addr.mem, &max_size, 1, ctxt->mode, 3445 + &linear, X86EMUL_F_INVLPG); 3445 3446 if (rc == X86EMUL_CONTINUE) 3446 3447 ctxt->ops->invlpg(ctxt, linear); 3447 3448 /* Disable writeback. */
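The __linearize() refactor above folds the old write/fetch booleans into a single flags bitmask so that new access kinds (X86EMUL_F_IMPLICIT, X86EMUL_F_INVLPG) do not have to widen every call signature. A simplified standalone model of the two segment-type checks follows; it ignores the real-mode carve-out, and relies on the x86 descriptor type field convention (bit 3 = code segment, bit 1 = writable for data / readable for code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the low X86EMUL_F_* flag layout introduced in kvm_emulate.h. */
#define F_WRITE (1u << 0)
#define F_FETCH (1u << 1)

/*
 * Simplified version of the descriptor checks in __linearize():
 * writes to code segments or to read-only data segments are rejected,
 * as are data reads from an execute-only code segment.
 */
static bool seg_access_ok(unsigned int type, unsigned int flags)
{
	if ((flags & F_WRITE) && ((type & 8) || !(type & 2)))
		return false;
	if (!(flags & F_FETCH) && (type & 8) && !(type & 2))
		return false;
	return true;
}
```

Call sites then pass a flags word (`write ? F_WRITE : 0`, `F_FETCH`, ...) exactly as the diff does for linearize() and the fetch paths.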
+1
arch/x86/kvm/governed_features.h
··· 16 16 KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD) 17 17 KVM_GOVERNED_X86_FEATURE(VGIF) 18 18 KVM_GOVERNED_X86_FEATURE(VNMI) 19 + KVM_GOVERNED_X86_FEATURE(LAM) 19 20 20 21 #undef KVM_GOVERNED_X86_FEATURE 21 22 #undef KVM_GOVERNED_FEATURE
+85 -2
arch/x86/kvm/hyperv.h
··· 24 24 #include <linux/kvm_host.h> 25 25 #include "x86.h" 26 26 27 + #ifdef CONFIG_KVM_HYPERV 28 + 27 29 /* "Hv#1" signature */ 28 30 #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648 29 31 ··· 106 104 int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vcpu_id, u32 sint); 107 105 void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector); 108 106 int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages); 107 + 108 + static inline bool kvm_hv_synic_has_vector(struct kvm_vcpu *vcpu, int vector) 109 + { 110 + return to_hv_vcpu(vcpu) && test_bit(vector, to_hv_synic(vcpu)->vec_bitmap); 111 + } 112 + 113 + static inline bool kvm_hv_synic_auto_eoi_set(struct kvm_vcpu *vcpu, int vector) 114 + { 115 + return to_hv_vcpu(vcpu) && 116 + test_bit(vector, to_hv_synic(vcpu)->auto_eoi_bitmap); 117 + } 109 118 110 119 void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu); 111 120 ··· 249 236 return kvm_hv_get_assist_page(vcpu); 250 237 } 251 238 252 - int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); 239 + static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, 240 + bool tdp_enabled) 241 + { 242 + /* 243 + * KVM_REQ_HV_TLB_FLUSH flushes entries from either L1's VP_ID or 244 + * L2's VP_ID upon request from the guest. Make sure we check for 245 + * pending entries in the right FIFO upon L1/L2 transition as these 246 + * requests are put by other vCPUs asynchronously. 
247 + */ 248 + if (to_hv_vcpu(vcpu) && tdp_enabled) 249 + kvm_make_request(KVM_REQ_HV_TLB_FLUSH, vcpu); 250 + } 253 251 254 - #endif 252 + int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); 253 + #else /* CONFIG_KVM_HYPERV */ 254 + static inline void kvm_hv_setup_tsc_page(struct kvm *kvm, 255 + struct pvclock_vcpu_time_info *hv_clock) {} 256 + static inline void kvm_hv_request_tsc_page_update(struct kvm *kvm) {} 257 + static inline void kvm_hv_init_vm(struct kvm *kvm) {} 258 + static inline void kvm_hv_destroy_vm(struct kvm *kvm) {} 259 + static inline int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu) 260 + { 261 + return 0; 262 + } 263 + static inline void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu) {} 264 + static inline bool kvm_hv_hypercall_enabled(struct kvm_vcpu *vcpu) 265 + { 266 + return false; 267 + } 268 + static inline int kvm_hv_hypercall(struct kvm_vcpu *vcpu) 269 + { 270 + return HV_STATUS_ACCESS_DENIED; 271 + } 272 + static inline void kvm_hv_vcpu_purge_flush_tlb(struct kvm_vcpu *vcpu) {} 273 + static inline void kvm_hv_free_pa_page(struct kvm *kvm) {} 274 + static inline bool kvm_hv_synic_has_vector(struct kvm_vcpu *vcpu, int vector) 275 + { 276 + return false; 277 + } 278 + static inline bool kvm_hv_synic_auto_eoi_set(struct kvm_vcpu *vcpu, int vector) 279 + { 280 + return false; 281 + } 282 + static inline void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector) {} 283 + static inline bool kvm_hv_invtsc_suppressed(struct kvm_vcpu *vcpu) 284 + { 285 + return false; 286 + } 287 + static inline void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu, bool hyperv_enabled) {} 288 + static inline bool kvm_hv_has_stimer_pending(struct kvm_vcpu *vcpu) 289 + { 290 + return false; 291 + } 292 + static inline bool kvm_hv_is_tlb_flush_hcall(struct kvm_vcpu *vcpu) 293 + { 294 + return false; 295 + } 296 + static inline bool guest_hv_cpuid_has_l2_tlb_flush(struct kvm_vcpu *vcpu) 297 + { 298 + return false; 299 + } 300 + static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu 
*vcpu) 301 + { 302 + return 0; 303 + } 304 + static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu) 305 + { 306 + return vcpu->vcpu_idx; 307 + } 308 + static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {} 309 + #endif /* CONFIG_KVM_HYPERV */ 310 + 311 + #endif /* __ARCH_X86_KVM_HYPERV_H__ */
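Most of the hyperv.h diff above is the usual kernel pattern for a compile-time optional feature: when CONFIG_KVM_HYPERV is off, static inline stubs return a neutral value (false, 0, or nothing) so call sites compile unchanged and the optimizer drops the dead branches. An illustrative reduction of the pattern, with a made-up FEATURE_X config symbol:

```c
#include <assert.h>
#include <stdbool.h>

/* Define this to model building with the feature enabled. */
/* #define CONFIG_FEATURE_X */

#ifdef CONFIG_FEATURE_X
bool feature_x_enabled(void);	/* real implementation lives elsewhere */
#else
/*
 * Stub: a constant-returning static inline lets callers write
 * "if (feature_x_enabled())" with no #ifdef of their own, and the
 * compiler eliminates the branch entirely.
 */
static inline bool feature_x_enabled(void) { return false; }
#endif
```

This is why, for example, kvm_hv_hypercall() can keep its callers intact while returning HV_STATUS_ACCESS_DENIED in the compiled-out case.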
+2
arch/x86/kvm/irq.c
··· 118 118 if (!lapic_in_kernel(v)) 119 119 return v->arch.interrupt.nr; 120 120 121 + #ifdef CONFIG_KVM_XEN 121 122 if (kvm_xen_has_interrupt(v)) 122 123 return v->kvm->arch.xen.upcall_vector; 124 + #endif 123 125 124 126 if (irqchip_split(v->kvm)) { 125 127 int vector = v->arch.pending_external_vector;
+8 -1
arch/x86/kvm/irq_comm.c
··· 144 144 return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL); 145 145 } 146 146 147 - 147 + #ifdef CONFIG_KVM_HYPERV 148 148 static int kvm_hv_set_sint(struct kvm_kernel_irq_routing_entry *e, 149 149 struct kvm *kvm, int irq_source_id, int level, 150 150 bool line_status) ··· 154 154 155 155 return kvm_hv_synic_set_irq(kvm, e->hv_sint.vcpu, e->hv_sint.sint); 156 156 } 157 + #endif 157 158 158 159 int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, 159 160 struct kvm *kvm, int irq_source_id, int level, ··· 164 163 int r; 165 164 166 165 switch (e->type) { 166 + #ifdef CONFIG_KVM_HYPERV 167 167 case KVM_IRQ_ROUTING_HV_SINT: 168 168 return kvm_hv_set_sint(e, kvm, irq_source_id, level, 169 169 line_status); 170 + #endif 170 171 171 172 case KVM_IRQ_ROUTING_MSI: 172 173 if (kvm_msi_route_invalid(kvm, e)) ··· 317 314 if (kvm_msi_route_invalid(kvm, e)) 318 315 return -EINVAL; 319 316 break; 317 + #ifdef CONFIG_KVM_HYPERV 320 318 case KVM_IRQ_ROUTING_HV_SINT: 321 319 e->set = kvm_hv_set_sint; 322 320 e->hv_sint.vcpu = ue->u.hv_sint.vcpu; 323 321 e->hv_sint.sint = ue->u.hv_sint.sint; 324 322 break; 323 + #endif 325 324 #ifdef CONFIG_KVM_XEN 326 325 case KVM_IRQ_ROUTING_XEN_EVTCHN: 327 326 return kvm_xen_setup_evtchn(kvm, e, ue); ··· 443 438 444 439 void kvm_arch_irq_routing_update(struct kvm *kvm) 445 440 { 441 + #ifdef CONFIG_KVM_HYPERV 446 442 kvm_hv_irq_routing_update(kvm); 443 + #endif 447 444 }
+9
arch/x86/kvm/kvm_emulate.h
··· 88 88 #define X86EMUL_IO_NEEDED 5 /* IO is needed to complete emulation */ 89 89 #define X86EMUL_INTERCEPTED 6 /* Intercepted by nested VMCB/VMCS */ 90 90 91 + /* x86-specific emulation flags */ 92 + #define X86EMUL_F_WRITE BIT(0) 93 + #define X86EMUL_F_FETCH BIT(1) 94 + #define X86EMUL_F_IMPLICIT BIT(2) 95 + #define X86EMUL_F_INVLPG BIT(3) 96 + 91 97 struct x86_emulate_ops { 92 98 void (*vm_bugged)(struct x86_emulate_ctxt *ctxt); 93 99 /* ··· 230 224 int (*leave_smm)(struct x86_emulate_ctxt *ctxt); 231 225 void (*triple_fault)(struct x86_emulate_ctxt *ctxt); 232 226 int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr); 227 + 228 + gva_t (*get_untagged_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr, 229 + unsigned int flags); 233 230 }; 234 231 235 232 /* Type, address-of, and value of an instruction's operand. */
+20
arch/x86/kvm/kvm_onhyperv.h
··· 10 10 int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages); 11 11 int hv_flush_remote_tlbs(struct kvm *kvm); 12 12 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp); 13 + static inline hpa_t hv_get_partition_assist_page(struct kvm_vcpu *vcpu) 14 + { 15 + /* 16 + * Partition assist page is something which Hyper-V running in L0 17 + * requires from KVM running in L1 before direct TLB flush for L2 18 + * guests can be enabled. KVM doesn't currently use the page but to 19 + * comply with TLFS it still needs to be allocated. For now, this 20 + * is a single page shared among all vCPUs. 21 + */ 22 + struct hv_partition_assist_pg **p_hv_pa_pg = 23 + &vcpu->kvm->arch.hv_pa_pg; 24 + 25 + if (!*p_hv_pa_pg) 26 + *p_hv_pa_pg = kzalloc(PAGE_SIZE, GFP_KERNEL_ACCOUNT); 27 + 28 + if (!*p_hv_pa_pg) 29 + return INVALID_PAGE; 30 + 31 + return __pa(*p_hv_pa_pg); 32 + } 13 33 #else /* !CONFIG_HYPERV */ 14 34 static inline int hv_flush_remote_tlbs(struct kvm *kvm) 15 35 {
+2 -3
arch/x86/kvm/lapic.c
··· 1475 1475 apic_clear_isr(vector, apic); 1476 1476 apic_update_ppr(apic); 1477 1477 1478 - if (to_hv_vcpu(apic->vcpu) && 1479 - test_bit(vector, to_hv_synic(apic->vcpu)->vec_bitmap)) 1478 + if (kvm_hv_synic_has_vector(apic->vcpu, vector)) 1480 1479 kvm_hv_synic_send_eoi(apic->vcpu, vector); 1481 1480 1482 1481 kvm_ioapic_send_eoi(apic, vector); ··· 2904 2905 */ 2905 2906 2906 2907 apic_clear_irr(vector, apic); 2907 - if (to_hv_vcpu(vcpu) && test_bit(vector, to_hv_synic(vcpu)->auto_eoi_bitmap)) { 2908 + if (kvm_hv_synic_auto_eoi_set(vcpu, vector)) { 2908 2909 /* 2909 2910 * For auto-EOI interrupts, there might be another pending 2910 2911 * interrupt above PPR, so check whether to raise another
+8
arch/x86/kvm/mmu.h
··· 146 146 return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu)); 147 147 } 148 148 149 + static inline unsigned long kvm_get_active_cr3_lam_bits(struct kvm_vcpu *vcpu) 150 + { 151 + if (!guest_can_use(vcpu, X86_FEATURE_LAM)) 152 + return 0; 153 + 154 + return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57); 155 + } 156 + 149 157 static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu) 150 158 { 151 159 u64 root_hpa = vcpu->arch.mmu->root.hpa;
+266 -27
arch/x86/kvm/mmu/mmu.c
··· 271 271 272 272 static inline bool kvm_available_flush_remote_tlbs_range(void) 273 273 { 274 + #if IS_ENABLED(CONFIG_HYPERV) 274 275 return kvm_x86_ops.flush_remote_tlbs_range; 275 - } 276 - 277 - int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages) 278 - { 279 - if (!kvm_x86_ops.flush_remote_tlbs_range) 280 - return -EOPNOTSUPP; 281 - 282 - return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages); 276 + #else 277 + return false; 278 + #endif 283 279 } 284 280 285 281 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index); ··· 791 795 return &slot->arch.lpage_info[level - 2][idx]; 792 796 } 793 797 798 + /* 799 + * The most significant bit in disallow_lpage tracks whether or not memory 800 + * attributes are mixed, i.e. not identical for all gfns at the current level. 801 + * The lower order bits are used to refcount other cases where a hugepage is 802 + * disallowed, e.g. if KVM has shadow a page table at the gfn. 803 + */ 804 + #define KVM_LPAGE_MIXED_FLAG BIT(31) 805 + 794 806 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot, 795 807 gfn_t gfn, int count) 796 808 { 797 809 struct kvm_lpage_info *linfo; 798 - int i; 810 + int old, i; 799 811 800 812 for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) { 801 813 linfo = lpage_info_slot(gfn, slot, i); 814 + 815 + old = linfo->disallow_lpage; 802 816 linfo->disallow_lpage += count; 803 - WARN_ON_ONCE(linfo->disallow_lpage < 0); 817 + WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG); 804 818 } 805 819 } 806 820 ··· 1388 1382 gfn_t end = slot->base_gfn + gfn_offset + __fls(mask); 1389 1383 1390 1384 if (READ_ONCE(eager_page_split)) 1391 - kvm_mmu_try_split_huge_pages(kvm, slot, start, end, PG_LEVEL_4K); 1385 + kvm_mmu_try_split_huge_pages(kvm, slot, start, end + 1, PG_LEVEL_4K); 1392 1386 1393 1387 kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M); 1394 1388 ··· 2846 2840 /* 2847 2841 * Recheck 
after taking the spinlock, a different vCPU 2848 2842 * may have since marked the page unsync. A false 2849 - * positive on the unprotected check above is not 2843 + * negative on the unprotected check above is not 2850 2844 * possible as clearing sp->unsync _must_ hold mmu_lock 2851 - * for write, i.e. unsync cannot transition from 0->1 2845 + * for write, i.e. unsync cannot transition from 1->0 2852 2846 * while this CPU holds mmu_lock for read (or write). 2853 2847 */ 2854 2848 if (READ_ONCE(sp->unsync)) ··· 3062 3056 * 3063 3057 * There are several ways to safely use this helper: 3064 3058 * 3065 - * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before 3059 + * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before 3066 3060 * consuming it. In this case, mmu_lock doesn't need to be held during the 3067 3061 * lookup, but it does need to be held while checking the MMU notifier. 3068 3062 * ··· 3143 3137 return level; 3144 3138 } 3145 3139 3146 - int kvm_mmu_max_mapping_level(struct kvm *kvm, 3147 - const struct kvm_memory_slot *slot, gfn_t gfn, 3148 - int max_level) 3140 + static int __kvm_mmu_max_mapping_level(struct kvm *kvm, 3141 + const struct kvm_memory_slot *slot, 3142 + gfn_t gfn, int max_level, bool is_private) 3149 3143 { 3150 3144 struct kvm_lpage_info *linfo; 3151 3145 int host_level; ··· 3157 3151 break; 3158 3152 } 3159 3153 3154 + if (is_private) 3155 + return max_level; 3156 + 3160 3157 if (max_level == PG_LEVEL_4K) 3161 3158 return PG_LEVEL_4K; 3162 3159 3163 3160 host_level = host_pfn_mapping_level(kvm, gfn, slot); 3164 3161 return min(host_level, max_level); 3162 + } 3163 + 3164 + int kvm_mmu_max_mapping_level(struct kvm *kvm, 3165 + const struct kvm_memory_slot *slot, gfn_t gfn, 3166 + int max_level) 3167 + { 3168 + bool is_private = kvm_slot_can_be_private(slot) && 3169 + kvm_mem_is_private(kvm, gfn); 3170 + 3171 + return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private); 3165 3172 
} 3166 3173 3167 3174 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) ··· 3197 3178 * Enforce the iTLB multihit workaround after capturing the requested 3198 3179 * level, which will be used to do precise, accurate accounting. 3199 3180 */ 3200 - fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, 3201 - fault->gfn, fault->max_level); 3181 + fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot, 3182 + fault->gfn, fault->max_level, 3183 + fault->is_private); 3202 3184 if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed) 3203 3185 return; 3204 3186 ··· 3576 3556 return; 3577 3557 3578 3558 if (is_tdp_mmu_page(sp)) 3579 - kvm_tdp_mmu_put_root(kvm, sp, false); 3559 + kvm_tdp_mmu_put_root(kvm, sp); 3580 3560 else if (!--sp->root_count && sp->role.invalid) 3581 3561 kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); 3582 3562 ··· 3759 3739 kvm_page_track_write_tracking_enabled(kvm)) 3760 3740 goto out_success; 3761 3741 3762 - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 3742 + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { 3763 3743 slots = __kvm_memslots(kvm, i); 3764 3744 kvm_for_each_memslot(slot, bkt, slots) { 3765 3745 /* ··· 3802 3782 hpa_t root; 3803 3783 3804 3784 root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu); 3805 - root_gfn = root_pgd >> PAGE_SHIFT; 3785 + root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT; 3806 3786 3807 3787 if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) { 3808 3788 mmu->root.hpa = kvm_mmu_get_dummy_root(); ··· 4279 4259 kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL); 4280 4260 } 4281 4261 4262 + static inline u8 kvm_max_level_for_order(int order) 4263 + { 4264 + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G); 4265 + 4266 + KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) && 4267 + order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) && 4268 + order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K)); 4269 + 4270 + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G)) 
4271 + return PG_LEVEL_1G; 4272 + 4273 + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M)) 4274 + return PG_LEVEL_2M; 4275 + 4276 + return PG_LEVEL_4K; 4277 + } 4278 + 4279 + static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, 4280 + struct kvm_page_fault *fault) 4281 + { 4282 + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, 4283 + PAGE_SIZE, fault->write, fault->exec, 4284 + fault->is_private); 4285 + } 4286 + 4287 + static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, 4288 + struct kvm_page_fault *fault) 4289 + { 4290 + int max_order, r; 4291 + 4292 + if (!kvm_slot_can_be_private(fault->slot)) { 4293 + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); 4294 + return -EFAULT; 4295 + } 4296 + 4297 + r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn, 4298 + &max_order); 4299 + if (r) { 4300 + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); 4301 + return r; 4302 + } 4303 + 4304 + fault->max_level = min(kvm_max_level_for_order(max_order), 4305 + fault->max_level); 4306 + fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY); 4307 + 4308 + return RET_PF_CONTINUE; 4309 + } 4310 + 4282 4311 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) 4283 4312 { 4284 4313 struct kvm_memory_slot *slot = fault->slot; ··· 4359 4290 !kvm_apicv_activated(vcpu->kvm)) 4360 4291 return RET_PF_EMULATE; 4361 4292 } 4293 + 4294 + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) { 4295 + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); 4296 + return -EFAULT; 4297 + } 4298 + 4299 + if (fault->is_private) 4300 + return kvm_faultin_pfn_private(vcpu, fault); 4362 4301 4363 4302 async = false; 4364 4303 fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async, ··· 4443 4366 return true; 4444 4367 4445 4368 return fault->slot && 4446 - mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva); 4369 + mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn); 4447 
4370 } 4448 4371 4449 4372 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) ··· 6305 6228 if (!kvm_memslots_have_rmaps(kvm)) 6306 6229 return flush; 6307 6230 6308 - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 6231 + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { 6309 6232 slots = __kvm_memslots(kvm, i); 6310 6233 6311 6234 kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) { ··· 6337 6260 6338 6261 write_lock(&kvm->mmu_lock); 6339 6262 6340 - kvm_mmu_invalidate_begin(kvm, 0, -1ul); 6263 + kvm_mmu_invalidate_begin(kvm); 6264 + 6265 + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end); 6341 6266 6342 6267 flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end); 6343 6268 ··· 6349 6270 if (flush) 6350 6271 kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start); 6351 6272 6352 - kvm_mmu_invalidate_end(kvm, 0, -1ul); 6273 + kvm_mmu_invalidate_end(kvm); 6353 6274 6354 6275 write_unlock(&kvm->mmu_lock); 6355 6276 } ··· 6802 6723 * modifier prior to checking for a wrap of the MMIO generation so 6803 6724 * that a wrap in any address space is detected. 6804 6725 */ 6805 - gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1); 6726 + gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1); 6806 6727 6807 6728 /* 6808 6729 * The very rare case: if the MMIO generation number has wrapped, ··· 7255 7176 if (kvm->arch.nx_huge_page_recovery_thread) 7256 7177 kthread_stop(kvm->arch.nx_huge_page_recovery_thread); 7257 7178 } 7179 + 7180 + #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES 7181 + bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, 7182 + struct kvm_gfn_range *range) 7183 + { 7184 + /* 7185 + * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only 7186 + * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM 7187 + * can simply ignore such slots. But if userspace is making memory 7188 + * PRIVATE, then KVM must prevent the guest from accessing the memory 7189 + * as shared. 
And if userspace is making memory SHARED and this point 7190 + * is reached, then at least one page within the range was previously 7191 + * PRIVATE, i.e. the slot's possible hugepage ranges are changing. 7192 + * Zapping SPTEs in this case ensures KVM will reassess whether or not 7193 + * a hugepage can be used for affected ranges. 7194 + */ 7195 + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) 7196 + return false; 7197 + 7198 + return kvm_unmap_gfn_range(kvm, range); 7199 + } 7200 + 7201 + static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn, 7202 + int level) 7203 + { 7204 + return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_MIXED_FLAG; 7205 + } 7206 + 7207 + static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn, 7208 + int level) 7209 + { 7210 + lpage_info_slot(gfn, slot, level)->disallow_lpage &= ~KVM_LPAGE_MIXED_FLAG; 7211 + } 7212 + 7213 + static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn, 7214 + int level) 7215 + { 7216 + lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG; 7217 + } 7218 + 7219 + static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot, 7220 + gfn_t gfn, int level, unsigned long attrs) 7221 + { 7222 + const unsigned long start = gfn; 7223 + const unsigned long end = start + KVM_PAGES_PER_HPAGE(level); 7224 + 7225 + if (level == PG_LEVEL_2M) 7226 + return kvm_range_has_memory_attributes(kvm, start, end, attrs); 7227 + 7228 + for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) { 7229 + if (hugepage_test_mixed(slot, gfn, level - 1) || 7230 + attrs != kvm_get_memory_attributes(kvm, gfn)) 7231 + return false; 7232 + } 7233 + return true; 7234 + } 7235 + 7236 + bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, 7237 + struct kvm_gfn_range *range) 7238 + { 7239 + unsigned long attrs = range->arg.attributes; 7240 + struct kvm_memory_slot *slot = range->slot; 7241 + int level; 7242 + 7243 + 
lockdep_assert_held_write(&kvm->mmu_lock); 7244 + lockdep_assert_held(&kvm->slots_lock); 7245 + 7246 + /* 7247 + * Calculate which ranges can be mapped with hugepages even if the slot 7248 + * can't map memory PRIVATE. KVM mustn't create a SHARED hugepage over 7249 + * a range that has PRIVATE GFNs, and conversely converting a range to 7250 + * SHARED may now allow hugepages. 7251 + */ 7252 + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) 7253 + return false; 7254 + 7255 + /* 7256 + * The sequence matters here: upper levels consume the result of lower 7257 + * level's scanning. 7258 + */ 7259 + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) { 7260 + gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level); 7261 + gfn_t gfn = gfn_round_for_level(range->start, level); 7262 + 7263 + /* Process the head page if it straddles the range. */ 7264 + if (gfn != range->start || gfn + nr_pages > range->end) { 7265 + /* 7266 + * Skip mixed tracking if the aligned gfn isn't covered 7267 + * by the memslot, KVM can't use a hugepage due to the 7268 + * misaligned address regardless of memory attributes. 7269 + */ 7270 + if (gfn >= slot->base_gfn) { 7271 + if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) 7272 + hugepage_clear_mixed(slot, gfn, level); 7273 + else 7274 + hugepage_set_mixed(slot, gfn, level); 7275 + } 7276 + gfn += nr_pages; 7277 + } 7278 + 7279 + /* 7280 + * Pages entirely covered by the range are guaranteed to have 7281 + * only the attributes which were just set. 7282 + */ 7283 + for ( ; gfn + nr_pages <= range->end; gfn += nr_pages) 7284 + hugepage_clear_mixed(slot, gfn, level); 7285 + 7286 + /* 7287 + * Process the last tail page if it straddles the range and is 7288 + * contained by the memslot. Like the head page, KVM can't 7289 + * create a hugepage if the slot size is misaligned. 
7290 + */ 7291 + if (gfn < range->end && 7292 + (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) { 7293 + if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) 7294 + hugepage_clear_mixed(slot, gfn, level); 7295 + else 7296 + hugepage_set_mixed(slot, gfn, level); 7297 + } 7298 + } 7299 + return false; 7300 + } 7301 + 7302 + void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm, 7303 + struct kvm_memory_slot *slot) 7304 + { 7305 + int level; 7306 + 7307 + if (!kvm_arch_has_private_mem(kvm)) 7308 + return; 7309 + 7310 + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) { 7311 + /* 7312 + * Don't bother tracking mixed attributes for pages that can't 7313 + * be huge due to alignment, i.e. process only pages that are 7314 + * entirely contained by the memslot. 7315 + */ 7316 + gfn_t end = gfn_round_for_level(slot->base_gfn + slot->npages, level); 7317 + gfn_t start = gfn_round_for_level(slot->base_gfn, level); 7318 + gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level); 7319 + gfn_t gfn; 7320 + 7321 + if (start < slot->base_gfn) 7322 + start += nr_pages; 7323 + 7324 + /* 7325 + * Unlike setting attributes, every potential hugepage needs to 7326 + * be manually checked as the attributes may already be mixed. 7327 + */ 7328 + for (gfn = start; gfn < end; gfn += nr_pages) { 7329 + unsigned long attrs = kvm_get_memory_attributes(kvm, gfn); 7330 + 7331 + if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) 7332 + hugepage_clear_mixed(slot, gfn, level); 7333 + else 7334 + hugepage_set_mixed(slot, gfn, level); 7335 + } 7336 + } 7337 + } 7338 + #endif
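kvm_arch_post_set_memory_attributes() above walks each hugepage level in three phases: a possibly straddling head page, the fully covered middle (whose mixed flag can be cleared unconditionally), and a possibly straddling tail. A standalone model of that split for a single 2MiB level, using a hypothetical helper and 512 4KiB gfns per hugepage:

```c
#include <assert.h>
#include <stdint.h>

#define PAGES_PER_HPAGE 512ULL	/* 2MiB hugepage = 512 x 4KiB gfns */

/* Round a gfn down to its hugepage boundary, like gfn_round_for_level(). */
static uint64_t gfn_round_down(uint64_t gfn)
{
	return gfn & ~(PAGES_PER_HPAGE - 1);
}

/*
 * Count the hugepages entirely covered by [start, end): these are the
 * pages whose attributes are guaranteed uniform after the update, i.e.
 * no per-gfn scan is needed for them. Straddling head/tail pages are
 * excluded; the real code re-checks those for mixed attributes.
 */
static uint64_t fully_covered_hugepages(uint64_t start, uint64_t end)
{
	uint64_t gfn = gfn_round_down(start);
	uint64_t n = 0;

	/* Skip a head page that straddles the range start. */
	if (gfn != start)
		gfn += PAGES_PER_HPAGE;

	for (; gfn + PAGES_PER_HPAGE <= end; gfn += PAGES_PER_HPAGE)
		n++;

	return n;
}
```

The sequencing comment in the diff ("upper levels consume the result of lower level's scanning") applies the same walk per level, with 1GiB pages consulting the 2MiB mixed flags instead of rescanning every gfn.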
+3
arch/x86/kvm/mmu/mmu_internal.h
··· 13 13 #endif 14 14 15 15 /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */ 16 + #define __PT_BASE_ADDR_MASK GENMASK_ULL(51, 12) 16 17 #define __PT_LEVEL_SHIFT(level, bits_per_level) \ 17 18 (PAGE_SHIFT + ((level) - 1) * (bits_per_level)) 18 19 #define __PT_INDEX(address, level, bits_per_level) \ ··· 202 201 203 202 /* Derived from mmu and global state. */ 204 203 const bool is_tdp; 204 + const bool is_private; 205 205 const bool nx_huge_page_workaround_enabled; 206 206 207 207 /* ··· 298 296 .max_level = KVM_MAX_HUGEPAGE_LEVEL, 299 297 .req_level = PG_LEVEL_4K, 300 298 .goal_level = PG_LEVEL_4K, 299 + .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT), 301 300 }; 302 301 int r; 303 302
+1 -1
arch/x86/kvm/mmu/paging_tmpl.h
··· 62 62 #endif 63 63 64 64 /* Common logic, but per-type values. These also need to be undefined. */ 65 - #define PT_BASE_ADDR_MASK ((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))) 65 + #define PT_BASE_ADDR_MASK ((pt_element_t)__PT_BASE_ADDR_MASK) 66 66 #define PT_LVL_ADDR_MASK(lvl) __PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS) 67 67 #define PT_LVL_OFFSET_MASK(lvl) __PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS) 68 68 #define PT_INDEX(addr, lvl) __PT_INDEX(addr, lvl, PT_LEVEL_BITS)
+43 -52
arch/x86/kvm/mmu/tdp_mmu.c
··· 73 73 tdp_mmu_free_sp(sp); 74 74 } 75 75 76 - void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, 77 - bool shared) 76 + void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root) 78 77 { 79 - kvm_lockdep_assert_mmu_lock_held(kvm, shared); 80 - 81 78 if (!refcount_dec_and_test(&root->tdp_mmu_root_count)) 82 79 return; 83 80 ··· 103 106 */ 104 107 static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, 105 108 struct kvm_mmu_page *prev_root, 106 - bool shared, bool only_valid) 109 + bool only_valid) 107 110 { 108 111 struct kvm_mmu_page *next_root; 112 + 113 + /* 114 + * While the roots themselves are RCU-protected, fields such as 115 + * role.invalid are protected by mmu_lock. 116 + */ 117 + lockdep_assert_held(&kvm->mmu_lock); 109 118 110 119 rcu_read_lock(); 111 120 ··· 135 132 rcu_read_unlock(); 136 133 137 134 if (prev_root) 138 - kvm_tdp_mmu_put_root(kvm, prev_root, shared); 135 + kvm_tdp_mmu_put_root(kvm, prev_root); 139 136 140 137 return next_root; 141 138 } ··· 147 144 * recent root. (Unless keeping a live reference is desirable.) 148 145 * 149 146 * If shared is set, this function is operating under the MMU lock in read 150 - * mode. In the unlikely event that this thread must free a root, the lock 151 - * will be temporarily dropped and reacquired in write mode. 147 + * mode. 
152 148 */ 153 - #define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\ 154 - for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \ 155 - _root; \ 156 - _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \ 157 - if (kvm_lockdep_assert_mmu_lock_held(_kvm, _shared) && \ 158 - kvm_mmu_page_as_id(_root) != _as_id) { \ 149 + #define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid)\ 150 + for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \ 151 + ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \ 152 + _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \ 153 + if (kvm_mmu_page_as_id(_root) != _as_id) { \ 159 154 } else 160 155 161 - #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ 162 - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true) 156 + #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \ 157 + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, true) 163 158 164 - #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _shared) \ 165 - for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, false); \ 166 - _root; \ 167 - _root = tdp_mmu_next_root(_kvm, _root, _shared, false)) \ 168 - if (!kvm_lockdep_assert_mmu_lock_held(_kvm, _shared)) { \ 169 - } else 159 + #define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \ 160 + for (_root = tdp_mmu_next_root(_kvm, NULL, false); \ 161 + ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \ 162 + _root = tdp_mmu_next_root(_kvm, _root, false)) 170 163 171 164 /* 172 165 * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write, ··· 275 276 * 276 277 * @kvm: kvm instance 277 278 * @sp: the page to be removed 278 - * @shared: This operation may not be running under the exclusive use of 279 - * the MMU lock and the operation must synchronize with other 280 - * threads that might be adding or removing pages. 
281 279 */ 282 - static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp, 283 - bool shared) 280 + static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp) 284 281 { 285 282 tdp_unaccount_mmu_page(kvm, sp); 286 283 287 284 if (!sp->nx_huge_page_disallowed) 288 285 return; 289 286 290 - if (shared) 291 - spin_lock(&kvm->arch.tdp_mmu_pages_lock); 292 - else 293 - lockdep_assert_held_write(&kvm->mmu_lock); 294 - 287 + spin_lock(&kvm->arch.tdp_mmu_pages_lock); 295 288 sp->nx_huge_page_disallowed = false; 296 289 untrack_possible_nx_huge_page(kvm, sp); 297 - 298 - if (shared) 299 - spin_unlock(&kvm->arch.tdp_mmu_pages_lock); 290 + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); 300 291 } 301 292 302 293 /** ··· 315 326 316 327 trace_kvm_mmu_prepare_zap_page(sp); 317 328 318 - tdp_mmu_unlink_sp(kvm, sp, shared); 329 + tdp_mmu_unlink_sp(kvm, sp); 319 330 320 331 for (i = 0; i < SPTE_ENT_PER_PAGE; i++) { 321 332 tdp_ptep_t sptep = pt + i; ··· 821 832 { 822 833 struct kvm_mmu_page *root; 823 834 824 - for_each_tdp_mmu_root_yield_safe(kvm, root, false) 835 + lockdep_assert_held_write(&kvm->mmu_lock); 836 + for_each_tdp_mmu_root_yield_safe(kvm, root) 825 837 flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush); 826 838 827 839 return flush; ··· 844 854 * is being destroyed or the userspace VMM has exited. In both cases, 845 855 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request. 
846 856 */ 847 - for_each_tdp_mmu_root_yield_safe(kvm, root, false) 857 + lockdep_assert_held_write(&kvm->mmu_lock); 858 + for_each_tdp_mmu_root_yield_safe(kvm, root) 848 859 tdp_mmu_zap_root(kvm, root, false); 849 860 } 850 861 ··· 859 868 860 869 read_lock(&kvm->mmu_lock); 861 870 862 - for_each_tdp_mmu_root_yield_safe(kvm, root, true) { 871 + for_each_tdp_mmu_root_yield_safe(kvm, root) { 863 872 if (!root->tdp_mmu_scheduled_root_to_zap) 864 873 continue; 865 874 ··· 882 891 * the root must be reachable by mmu_notifiers while it's being 883 892 * zapped 884 893 */ 885 - kvm_tdp_mmu_put_root(kvm, root, true); 894 + kvm_tdp_mmu_put_root(kvm, root); 886 895 } 887 896 888 897 read_unlock(&kvm->mmu_lock); ··· 1116 1125 { 1117 1126 struct kvm_mmu_page *root; 1118 1127 1119 - __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false, false) 1128 + __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) 1120 1129 flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end, 1121 1130 range->may_block, flush); 1122 1131 ··· 1305 1314 1306 1315 lockdep_assert_held_read(&kvm->mmu_lock); 1307 1316 1308 - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) 1317 + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) 1309 1318 spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn, 1310 1319 slot->base_gfn + slot->npages, min_level); 1311 1320 ··· 1336 1345 bool shared) 1337 1346 { 1338 1347 struct kvm_mmu_page *sp; 1348 + 1349 + kvm_lockdep_assert_mmu_lock_held(kvm, shared); 1339 1350 1340 1351 /* 1341 1352 * Since we are allocating while under the MMU lock we have to be ··· 1489 1496 int r = 0; 1490 1497 1491 1498 kvm_lockdep_assert_mmu_lock_held(kvm, shared); 1492 - 1493 - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) { 1499 + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) { 1494 1500 r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared); 1495 1501 if 
(r) { 1496 - kvm_tdp_mmu_put_root(kvm, root, shared); 1502 + kvm_tdp_mmu_put_root(kvm, root); 1497 1503 break; 1498 1504 } 1499 1505 } ··· 1514 1522 1515 1523 rcu_read_lock(); 1516 1524 1517 - tdp_root_for_each_leaf_pte(iter, root, start, end) { 1525 + tdp_root_for_each_pte(iter, root, start, end) { 1518 1526 retry: 1519 - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) 1527 + if (!is_shadow_present_pte(iter.old_spte) || 1528 + !is_last_spte(iter.old_spte, iter.level)) 1520 1529 continue; 1521 1530 1522 - if (!is_shadow_present_pte(iter.old_spte)) 1531 + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) 1523 1532 continue; 1524 1533 1525 1534 KVM_MMU_WARN_ON(kvm_ad_enabled() && ··· 1553 1560 bool spte_set = false; 1554 1561 1555 1562 lockdep_assert_held_read(&kvm->mmu_lock); 1556 - 1557 - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) 1563 + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) 1558 1564 spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn, 1559 1565 slot->base_gfn + slot->npages); 1560 1566 ··· 1687 1695 struct kvm_mmu_page *root; 1688 1696 1689 1697 lockdep_assert_held_read(&kvm->mmu_lock); 1690 - 1691 - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) 1698 + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) 1692 1699 zap_collapsible_spte_range(kvm, root, slot); 1693 1700 } 1694 1701
+1 -2
arch/x86/kvm/mmu/tdp_mmu.h
··· 17 17 return refcount_inc_not_zero(&root->tdp_mmu_root_count); 18 18 } 19 19 20 - void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, 21 - bool shared); 20 + void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root); 22 21 23 22 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush); 24 23 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
+117 -23
arch/x86/kvm/pmu.c
··· 127 127 struct kvm_pmc *pmc = perf_event->overflow_handler_context; 128 128 129 129 /* 130 - * Ignore overflow events for counters that are scheduled to be 131 - * reprogrammed, e.g. if a PMI for the previous event races with KVM's 132 - * handling of a related guest WRMSR. 130 + * Ignore asynchronous overflow events for counters that are scheduled 131 + * to be reprogrammed, e.g. if a PMI for the previous event races with 132 + * KVM's handling of a related guest WRMSR. 133 133 */ 134 134 if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi)) 135 135 return; ··· 159 159 * comes from the host counters or the guest. 160 160 */ 161 161 return 1; 162 + } 163 + 164 + static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value) 165 + { 166 + u64 sample_period = (-counter_value) & pmc_bitmask(pmc); 167 + 168 + if (!sample_period) 169 + sample_period = pmc_bitmask(pmc) + 1; 170 + return sample_period; 162 171 } 163 172 164 173 static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config, ··· 224 215 return 0; 225 216 } 226 217 227 - static void pmc_pause_counter(struct kvm_pmc *pmc) 218 + static bool pmc_pause_counter(struct kvm_pmc *pmc) 228 219 { 229 220 u64 counter = pmc->counter; 230 - 231 - if (!pmc->perf_event || pmc->is_paused) 232 - return; 221 + u64 prev_counter; 233 222 234 223 /* update counter, reset event value to avoid redundant accumulation */ 235 - counter += perf_event_pause(pmc->perf_event, true); 224 + if (pmc->perf_event && !pmc->is_paused) 225 + counter += perf_event_pause(pmc->perf_event, true); 226 + 227 + /* 228 + * Snapshot the previous counter *after* accumulating state from perf. 229 + * If overflow already happened, hardware (via perf) is responsible for 230 + * generating a PMI. KVM just needs to detect overflow on emulated 231 + * counter events that haven't yet been processed. 
232 + */ 233 + prev_counter = counter & pmc_bitmask(pmc); 234 + 235 + counter += pmc->emulated_counter; 236 236 pmc->counter = counter & pmc_bitmask(pmc); 237 + 238 + pmc->emulated_counter = 0; 237 239 pmc->is_paused = true; 240 + 241 + return pmc->counter < prev_counter; 238 242 } 239 243 240 244 static bool pmc_resume_counter(struct kvm_pmc *pmc) ··· 271 249 272 250 return true; 273 251 } 252 + 253 + static void pmc_release_perf_event(struct kvm_pmc *pmc) 254 + { 255 + if (pmc->perf_event) { 256 + perf_event_release_kernel(pmc->perf_event); 257 + pmc->perf_event = NULL; 258 + pmc->current_config = 0; 259 + pmc_to_pmu(pmc)->event_count--; 260 + } 261 + } 262 + 263 + static void pmc_stop_counter(struct kvm_pmc *pmc) 264 + { 265 + if (pmc->perf_event) { 266 + pmc->counter = pmc_read_counter(pmc); 267 + pmc_release_perf_event(pmc); 268 + } 269 + } 270 + 271 + static void pmc_update_sample_period(struct kvm_pmc *pmc) 272 + { 273 + if (!pmc->perf_event || pmc->is_paused || 274 + !is_sampling_event(pmc->perf_event)) 275 + return; 276 + 277 + perf_event_period(pmc->perf_event, 278 + get_sample_period(pmc, pmc->counter)); 279 + } 280 + 281 + void pmc_write_counter(struct kvm_pmc *pmc, u64 val) 282 + { 283 + /* 284 + * Drop any unconsumed accumulated counts, the WRMSR is a write, not a 285 + * read-modify-write. Adjust the counter value so that its value is 286 + * relative to the current count, as reading the current count from 287 + * perf is faster than pausing and reprogramming the event in order to 288 + * reset it to '0'. Note, this very sneakily offsets the accumulated 289 + * emulated count too, by using pmc_read_counter()! 
290 + */ 291 + pmc->emulated_counter = 0; 292 + pmc->counter += val - pmc_read_counter(pmc); 293 + pmc->counter &= pmc_bitmask(pmc); 294 + pmc_update_sample_period(pmc); 295 + } 296 + EXPORT_SYMBOL_GPL(pmc_write_counter); 274 297 275 298 static int filter_cmp(const void *pa, const void *pb, u64 mask) 276 299 { ··· 450 383 struct kvm_pmu *pmu = pmc_to_pmu(pmc); 451 384 u64 eventsel = pmc->eventsel; 452 385 u64 new_config = eventsel; 386 + bool emulate_overflow; 453 387 u8 fixed_ctr_ctrl; 454 388 455 - pmc_pause_counter(pmc); 389 + emulate_overflow = pmc_pause_counter(pmc); 456 390 457 391 if (!pmc_event_is_allowed(pmc)) 458 392 goto reprogram_complete; 459 393 460 - if (pmc->counter < pmc->prev_counter) 394 + if (emulate_overflow) 461 395 __kvm_perf_overflow(pmc, false); 462 396 463 397 if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL) ··· 498 430 499 431 reprogram_complete: 500 432 clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi); 501 - pmc->prev_counter = 0; 502 433 } 503 434 504 435 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) ··· 706 639 return 0; 707 640 } 708 641 709 - /* refresh PMU settings. This function generally is called when underlying 710 - * settings are changed (such as changes of PMU CPUID by guest VMs), which 711 - * should rarely happen. 
642 + static void kvm_pmu_reset(struct kvm_vcpu *vcpu) 643 + { 644 + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 645 + struct kvm_pmc *pmc; 646 + int i; 647 + 648 + pmu->need_cleanup = false; 649 + 650 + bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX); 651 + 652 + for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) { 653 + pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i); 654 + if (!pmc) 655 + continue; 656 + 657 + pmc_stop_counter(pmc); 658 + pmc->counter = 0; 659 + pmc->emulated_counter = 0; 660 + 661 + if (pmc_is_gp(pmc)) 662 + pmc->eventsel = 0; 663 + } 664 + 665 + pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0; 666 + 667 + static_call_cond(kvm_x86_pmu_reset)(vcpu); 668 + } 669 + 670 + 671 + /* 672 + * Refresh the PMU configuration for the vCPU, e.g. if userspace changes CPUID 673 + * and/or PERF_CAPABILITIES. 712 674 */ 713 675 void kvm_pmu_refresh(struct kvm_vcpu *vcpu) 714 676 { 715 677 if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm)) 716 678 return; 717 679 680 + /* 681 + * Stop/release all existing counters/events before realizing the new 682 + * vPMU model. 683 + */ 684 + kvm_pmu_reset(vcpu); 685 + 718 686 bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX); 719 687 static_call(kvm_x86_pmu_refresh)(vcpu); 720 - } 721 - 722 - void kvm_pmu_reset(struct kvm_vcpu *vcpu) 723 - { 724 - static_call(kvm_x86_pmu_reset)(vcpu); 725 688 } 726 689 727 690 void kvm_pmu_init(struct kvm_vcpu *vcpu) ··· 760 663 761 664 memset(pmu, 0, sizeof(*pmu)); 762 665 static_call(kvm_x86_pmu_init)(vcpu); 763 - pmu->event_count = 0; 764 - pmu->need_cleanup = false; 765 666 kvm_pmu_refresh(vcpu); 766 667 } 767 668 ··· 795 700 796 701 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc) 797 702 { 798 - pmc->prev_counter = pmc->counter; 799 - pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc); 703 + pmc->emulated_counter++; 800 704 kvm_pmu_request_counter_reprogram(pmc); 801 705 } 802 706
+3 -44
arch/x86/kvm/pmu.h
··· 66 66 { 67 67 u64 counter, enabled, running; 68 68 69 - counter = pmc->counter; 69 + counter = pmc->counter + pmc->emulated_counter; 70 + 70 71 if (pmc->perf_event && !pmc->is_paused) 71 72 counter += perf_event_read_value(pmc->perf_event, 72 73 &enabled, &running); ··· 75 74 return counter & pmc_bitmask(pmc); 76 75 } 77 76 78 - static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val) 79 - { 80 - pmc->counter += val - pmc_read_counter(pmc); 81 - pmc->counter &= pmc_bitmask(pmc); 82 - } 83 - 84 - static inline void pmc_release_perf_event(struct kvm_pmc *pmc) 85 - { 86 - if (pmc->perf_event) { 87 - perf_event_release_kernel(pmc->perf_event); 88 - pmc->perf_event = NULL; 89 - pmc->current_config = 0; 90 - pmc_to_pmu(pmc)->event_count--; 91 - } 92 - } 93 - 94 - static inline void pmc_stop_counter(struct kvm_pmc *pmc) 95 - { 96 - if (pmc->perf_event) { 97 - pmc->counter = pmc_read_counter(pmc); 98 - pmc_release_perf_event(pmc); 99 - } 100 - } 77 + void pmc_write_counter(struct kvm_pmc *pmc, u64 val); 101 78 102 79 static inline bool pmc_is_gp(struct kvm_pmc *pmc) 103 80 { ··· 123 144 } 124 145 125 146 return NULL; 126 - } 127 - 128 - static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value) 129 - { 130 - u64 sample_period = (-counter_value) & pmc_bitmask(pmc); 131 - 132 - if (!sample_period) 133 - sample_period = pmc_bitmask(pmc) + 1; 134 - return sample_period; 135 - } 136 - 137 - static inline void pmc_update_sample_period(struct kvm_pmc *pmc) 138 - { 139 - if (!pmc->perf_event || pmc->is_paused || 140 - !is_sampling_event(pmc->perf_event)) 141 - return; 142 - 143 - perf_event_period(pmc->perf_event, 144 - get_sample_period(pmc, pmc->counter)); 145 147 } 146 148 147 149 static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc) ··· 221 261 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); 222 262 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); 223 263 void kvm_pmu_refresh(struct kvm_vcpu 
*vcpu); 224 - void kvm_pmu_reset(struct kvm_vcpu *vcpu); 225 264 void kvm_pmu_init(struct kvm_vcpu *vcpu); 226 265 void kvm_pmu_cleanup(struct kvm_vcpu *vcpu); 227 266 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+22 -11
arch/x86/kvm/reverse_cpuid.h
··· 16 16 CPUID_7_1_EDX, 17 17 CPUID_8000_0007_EDX, 18 18 CPUID_8000_0022_EAX, 19 + CPUID_7_2_EDX, 19 20 NR_KVM_CPU_CAPS, 20 21 21 22 NKVMCAPINTS = NR_KVM_CPU_CAPS - NCAPINTS, ··· 46 45 #define X86_FEATURE_AVX_NE_CONVERT KVM_X86_FEATURE(CPUID_7_1_EDX, 5) 47 46 #define X86_FEATURE_AMX_COMPLEX KVM_X86_FEATURE(CPUID_7_1_EDX, 8) 48 47 #define X86_FEATURE_PREFETCHITI KVM_X86_FEATURE(CPUID_7_1_EDX, 14) 48 + 49 + /* Intel-defined sub-features, CPUID level 0x00000007:2 (EDX) */ 50 + #define X86_FEATURE_INTEL_PSFD KVM_X86_FEATURE(CPUID_7_2_EDX, 0) 51 + #define X86_FEATURE_IPRED_CTRL KVM_X86_FEATURE(CPUID_7_2_EDX, 1) 52 + #define KVM_X86_FEATURE_RRSBA_CTRL KVM_X86_FEATURE(CPUID_7_2_EDX, 2) 53 + #define X86_FEATURE_DDPD_U KVM_X86_FEATURE(CPUID_7_2_EDX, 3) 54 + #define X86_FEATURE_BHI_CTRL KVM_X86_FEATURE(CPUID_7_2_EDX, 4) 55 + #define X86_FEATURE_MCDT_NO KVM_X86_FEATURE(CPUID_7_2_EDX, 5) 49 56 50 57 /* CPUID level 0x80000007 (EDX). */ 51 58 #define KVM_X86_FEATURE_CONSTANT_TSC KVM_X86_FEATURE(CPUID_8000_0007_EDX, 8) ··· 89 80 [CPUID_8000_0007_EDX] = {0x80000007, 0, CPUID_EDX}, 90 81 [CPUID_8000_0021_EAX] = {0x80000021, 0, CPUID_EAX}, 91 82 [CPUID_8000_0022_EAX] = {0x80000022, 0, CPUID_EAX}, 83 + [CPUID_7_2_EDX] = { 7, 2, CPUID_EDX}, 92 84 }; 93 85 94 86 /* ··· 116 106 */ 117 107 static __always_inline u32 __feature_translate(int x86_feature) 118 108 { 119 - if (x86_feature == X86_FEATURE_SGX1) 120 - return KVM_X86_FEATURE_SGX1; 121 - else if (x86_feature == X86_FEATURE_SGX2) 122 - return KVM_X86_FEATURE_SGX2; 123 - else if (x86_feature == X86_FEATURE_SGX_EDECCSSA) 124 - return KVM_X86_FEATURE_SGX_EDECCSSA; 125 - else if (x86_feature == X86_FEATURE_CONSTANT_TSC) 126 - return KVM_X86_FEATURE_CONSTANT_TSC; 127 - else if (x86_feature == X86_FEATURE_PERFMON_V2) 128 - return KVM_X86_FEATURE_PERFMON_V2; 109 + #define KVM_X86_TRANSLATE_FEATURE(f) \ 110 + case X86_FEATURE_##f: return KVM_X86_FEATURE_##f 129 111 130 - return x86_feature; 112 + switch (x86_feature) { 113 + 
KVM_X86_TRANSLATE_FEATURE(SGX1); 114 + KVM_X86_TRANSLATE_FEATURE(SGX2); 115 + KVM_X86_TRANSLATE_FEATURE(SGX_EDECCSSA); 116 + KVM_X86_TRANSLATE_FEATURE(CONSTANT_TSC); 117 + KVM_X86_TRANSLATE_FEATURE(PERFMON_V2); 118 + KVM_X86_TRANSLATE_FEATURE(RRSBA_CTRL); 119 + default: 120 + return x86_feature; 121 + } 131 122 } 132 123 133 124 static __always_inline u32 __feature_leaf(int x86_feature)
+9
arch/x86/kvm/svm/hyperv.h
··· 11 11 #include "../hyperv.h" 12 12 #include "svm.h" 13 13 14 + #ifdef CONFIG_KVM_HYPERV 14 15 static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) 15 16 { 16 17 struct vcpu_svm *svm = to_svm(vcpu); ··· 42 41 } 43 42 44 43 void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu); 44 + #else /* CONFIG_KVM_HYPERV */ 45 + static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) {} 46 + static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) 47 + { 48 + return false; 49 + } 50 + static inline void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu) {} 51 + #endif /* CONFIG_KVM_HYPERV */ 45 52 46 53 #endif /* __ARCH_X86_KVM_SVM_HYPERV_H__ */
+18 -31
arch/x86/kvm/svm/nested.c
··· 187 187 */ 188 188 static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) 189 189 { 190 - struct hv_vmcb_enlightenments *hve = &svm->nested.ctl.hv_enlightenments; 191 190 int i; 192 191 193 192 /* ··· 197 198 * - Nested hypervisor (L1) is using Hyper-V emulation interface and 198 199 * tells KVM (L0) there were no changes in MSR bitmap for L2. 199 200 */ 200 - if (!svm->nested.force_msr_bitmap_recalc && 201 - kvm_hv_hypercall_enabled(&svm->vcpu) && 202 - hve->hv_enlightenments_control.msr_bitmap && 203 - (svm->nested.ctl.clean & BIT(HV_VMCB_NESTED_ENLIGHTENMENTS))) 204 - goto set_msrpm_base_pa; 201 + #ifdef CONFIG_KVM_HYPERV 202 + if (!svm->nested.force_msr_bitmap_recalc) { 203 + struct hv_vmcb_enlightenments *hve = &svm->nested.ctl.hv_enlightenments; 204 + 205 + if (kvm_hv_hypercall_enabled(&svm->vcpu) && 206 + hve->hv_enlightenments_control.msr_bitmap && 207 + (svm->nested.ctl.clean & BIT(HV_VMCB_NESTED_ENLIGHTENMENTS))) 208 + goto set_msrpm_base_pa; 209 + } 210 + #endif 205 211 206 212 if (!(vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_MSR_PROT))) 207 213 return true; ··· 234 230 235 231 svm->nested.force_msr_bitmap_recalc = false; 236 232 233 + #ifdef CONFIG_KVM_HYPERV 237 234 set_msrpm_base_pa: 235 + #endif 238 236 svm->vmcb->control.msrpm_base_pa = __sme_set(__pa(svm->nested.msrpm)); 239 237 240 238 return true; ··· 251 245 252 246 return kvm_vcpu_is_legal_gpa(vcpu, addr) && 253 247 kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1); 254 - } 255 - 256 - static bool nested_svm_check_tlb_ctl(struct kvm_vcpu *vcpu, u8 tlb_ctl) 257 - { 258 - /* Nested FLUSHBYASID is not supported yet. 
*/ 259 - switch(tlb_ctl) { 260 - case TLB_CONTROL_DO_NOTHING: 261 - case TLB_CONTROL_FLUSH_ALL_ASID: 262 - return true; 263 - default: 264 - return false; 265 - } 266 248 } 267 249 268 250 static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu, ··· 270 276 return false; 271 277 if (CC(!nested_svm_check_bitmap_pa(vcpu, control->iopm_base_pa, 272 278 IOPM_SIZE))) 273 - return false; 274 - 275 - if (CC(!nested_svm_check_tlb_ctl(vcpu, control->tlb_ctl))) 276 279 return false; 277 280 278 281 if (CC((control->int_ctl & V_NMI_ENABLE_MASK) && ··· 302 311 if ((save->efer & EFER_LME) && (save->cr0 & X86_CR0_PG)) { 303 312 if (CC(!(save->cr4 & X86_CR4_PAE)) || 304 313 CC(!(save->cr0 & X86_CR0_PE)) || 305 - CC(kvm_vcpu_is_illegal_gpa(vcpu, save->cr3))) 314 + CC(!kvm_vcpu_is_legal_cr3(vcpu, save->cr3))) 306 315 return false; 307 316 } 308 317 ··· 369 378 to->msrpm_base_pa &= ~0x0fffULL; 370 379 to->iopm_base_pa &= ~0x0fffULL; 371 380 381 + #ifdef CONFIG_KVM_HYPERV 372 382 /* Hyper-V extensions (Enlightened VMCB) */ 373 383 if (kvm_hv_hypercall_enabled(vcpu)) { 374 384 to->clean = from->clean; 375 385 memcpy(&to->hv_enlightenments, &from->hv_enlightenments, 376 386 sizeof(to->hv_enlightenments)); 377 387 } 388 + #endif 378 389 } 379 390 380 391 void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm, ··· 480 487 481 488 static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu) 482 489 { 483 - /* 484 - * KVM_REQ_HV_TLB_FLUSH flushes entries from either L1's VP_ID or 485 - * L2's VP_ID upon request from the guest. Make sure we check for 486 - * pending entries in the right FIFO upon L1/L2 transition as these 487 - * requests are put by other vCPUs asynchronously. 488 - */ 489 - if (to_hv_vcpu(vcpu) && npt_enabled) 490 - kvm_make_request(KVM_REQ_HV_TLB_FLUSH, vcpu); 490 + /* Handle pending Hyper-V TLB flush requests */ 491 + kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled); 491 492 492 493 /* 493 494 * TODO: optimize unconditional TLB flush/MMU sync. 
A partial list of ··· 507 520 static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, 508 521 bool nested_npt, bool reload_pdptrs) 509 522 { 510 - if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) 523 + if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) 511 524 return -EINVAL; 512 525 513 526 if (reload_pdptrs && !nested_npt && is_pae_paging(vcpu) &&
-17
arch/x86/kvm/svm/pmu.c
··· 161 161 pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); 162 162 if (pmc) { 163 163 pmc_write_counter(pmc, data); 164 - pmc_update_sample_period(pmc); 165 164 return 0; 166 165 } 167 166 /* MSR_EVNTSELn */ ··· 232 233 } 233 234 } 234 235 235 - static void amd_pmu_reset(struct kvm_vcpu *vcpu) 236 - { 237 - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 238 - int i; 239 - 240 - for (i = 0; i < KVM_AMD_PMC_MAX_GENERIC; i++) { 241 - struct kvm_pmc *pmc = &pmu->gp_counters[i]; 242 - 243 - pmc_stop_counter(pmc); 244 - pmc->counter = pmc->prev_counter = pmc->eventsel = 0; 245 - } 246 - 247 - pmu->global_ctrl = pmu->global_status = 0; 248 - } 249 - 250 236 struct kvm_pmu_ops amd_pmu_ops __initdata = { 251 237 .hw_event_available = amd_hw_event_available, 252 238 .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, ··· 243 259 .set_msr = amd_pmu_set_msr, 244 260 .refresh = amd_pmu_refresh, 245 261 .init = amd_pmu_init, 246 - .reset = amd_pmu_reset, 247 262 .EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT, 248 263 .MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC, 249 264 .MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
+5 -2
arch/x86/kvm/svm/sev.c
··· 2191 2191 /* 2192 2192 * SEV must obviously be supported in hardware. Sanity check that the 2193 2193 * CPU supports decode assists, which is mandatory for SEV guests to 2194 - * support instruction emulation. 2194 + * support instruction emulation. Ditto for flushing by ASID, as SEV 2195 + * guests are bound to a single ASID, i.e. KVM can't rotate to a new 2196 + * ASID to effect a TLB flush. 2195 2197 */ 2196 2198 if (!boot_cpu_has(X86_FEATURE_SEV) || 2197 - WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_DECODEASSISTS))) 2199 + WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_DECODEASSISTS)) || 2200 + WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_FLUSHBYASID))) 2198 2201 goto out; 2199 2202 2200 2203 /* Retrieve SEV CPUID information */
+16 -2
arch/x86/kvm/svm/svm.c
··· 3563 3563 if (svm->nmi_l1_to_l2) 3564 3564 return; 3565 3565 3566 - svm->nmi_masked = true; 3567 - svm_set_iret_intercept(svm); 3566 + /* 3567 + * No need to manually track NMI masking when vNMI is enabled, hardware 3568 + * automatically sets V_NMI_BLOCKING_MASK as appropriate, including the 3569 + * case where software directly injects an NMI. 3570 + */ 3571 + if (!is_vnmi_enabled(svm)) { 3572 + svm->nmi_masked = true; 3573 + svm_set_iret_intercept(svm); 3574 + } 3568 3575 ++vcpu->stat.nmi_injections; 3569 3576 } 3570 3577 ··· 5085 5078 if (nested) { 5086 5079 kvm_cpu_cap_set(X86_FEATURE_SVM); 5087 5080 kvm_cpu_cap_set(X86_FEATURE_VMCBCLEAN); 5081 + 5082 + /* 5083 + * KVM currently flushes TLBs on *every* nested SVM transition, 5084 + * and so for all intents and purposes KVM supports flushing by 5085 + * ASID, i.e. KVM is guaranteed to honor every L1 ASID flush. 5086 + */ 5087 + kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID); 5088 5088 5089 5089 if (nrips) 5090 5090 kvm_cpu_cap_set(X86_FEATURE_NRIPS);
+2
arch/x86/kvm/svm/svm.h
··· 148 148 u64 virt_ext; 149 149 u32 clean; 150 150 union { 151 + #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_KVM_HYPERV) 151 152 struct hv_vmcb_enlightenments hv_enlightenments; 153 + #endif 152 154 u8 reserved_sw[32]; 153 155 }; 154 156 };
+3 -7
arch/x86/kvm/svm/svm_onhyperv.c
··· 18 18 int svm_hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu) 19 19 { 20 20 struct hv_vmcb_enlightenments *hve; 21 - struct hv_partition_assist_pg **p_hv_pa_pg = 22 - &to_kvm_hv(vcpu->kvm)->hv_pa_pg; 21 + hpa_t partition_assist_page = hv_get_partition_assist_page(vcpu); 23 22 24 - if (!*p_hv_pa_pg) 25 - *p_hv_pa_pg = kzalloc(PAGE_SIZE, GFP_KERNEL); 26 - 27 - if (!*p_hv_pa_pg) 23 + if (partition_assist_page == INVALID_PAGE) 28 24 return -ENOMEM; 29 25 30 26 hve = &to_svm(vcpu)->vmcb->control.hv_enlightenments; 31 27 32 - hve->partition_assist_page = __pa(*p_hv_pa_pg); 28 + hve->partition_assist_page = partition_assist_page; 33 29 hve->hv_vm_id = (unsigned long)vcpu->kvm; 34 30 if (!hve->hv_enlightenments_control.nested_flush_hypercall) { 35 31 hve->hv_enlightenments_control.nested_flush_hypercall = 1;
+5 -5
arch/x86/kvm/svm/vmenter.S
··· 270 270 RESTORE_GUEST_SPEC_CTRL_BODY 271 271 RESTORE_HOST_SPEC_CTRL_BODY 272 272 273 - 10: cmpb $0, kvm_rebooting 273 + 10: cmpb $0, _ASM_RIP(kvm_rebooting) 274 274 jne 2b 275 275 ud2 276 - 30: cmpb $0, kvm_rebooting 276 + 30: cmpb $0, _ASM_RIP(kvm_rebooting) 277 277 jne 4b 278 278 ud2 279 - 50: cmpb $0, kvm_rebooting 279 + 50: cmpb $0, _ASM_RIP(kvm_rebooting) 280 280 jne 6b 281 281 ud2 282 - 70: cmpb $0, kvm_rebooting 282 + 70: cmpb $0, _ASM_RIP(kvm_rebooting) 283 283 jne 8b 284 284 ud2 285 285 ··· 381 381 RESTORE_GUEST_SPEC_CTRL_BODY 382 382 RESTORE_HOST_SPEC_CTRL_BODY 383 383 384 - 3: cmpb $0, kvm_rebooting 384 + 3: cmpb $0, _ASM_RIP(kvm_rebooting) 385 385 jne 2b 386 386 ud2 387 387
-447
arch/x86/kvm/vmx/hyperv.c
··· 13 13 14 14 #define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK 15 15 16 - /* 17 - * Enlightened VMCSv1 doesn't support these: 18 - * 19 - * POSTED_INTR_NV = 0x00000002, 20 - * GUEST_INTR_STATUS = 0x00000810, 21 - * APIC_ACCESS_ADDR = 0x00002014, 22 - * POSTED_INTR_DESC_ADDR = 0x00002016, 23 - * EOI_EXIT_BITMAP0 = 0x0000201c, 24 - * EOI_EXIT_BITMAP1 = 0x0000201e, 25 - * EOI_EXIT_BITMAP2 = 0x00002020, 26 - * EOI_EXIT_BITMAP3 = 0x00002022, 27 - * GUEST_PML_INDEX = 0x00000812, 28 - * PML_ADDRESS = 0x0000200e, 29 - * VM_FUNCTION_CONTROL = 0x00002018, 30 - * EPTP_LIST_ADDRESS = 0x00002024, 31 - * VMREAD_BITMAP = 0x00002026, 32 - * VMWRITE_BITMAP = 0x00002028, 33 - * 34 - * TSC_MULTIPLIER = 0x00002032, 35 - * PLE_GAP = 0x00004020, 36 - * PLE_WINDOW = 0x00004022, 37 - * VMX_PREEMPTION_TIMER_VALUE = 0x0000482E, 38 - * 39 - * Currently unsupported in KVM: 40 - * GUEST_IA32_RTIT_CTL = 0x00002814, 41 - */ 42 - #define EVMCS1_SUPPORTED_PINCTRL \ 43 - (PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 44 - PIN_BASED_EXT_INTR_MASK | \ 45 - PIN_BASED_NMI_EXITING | \ 46 - PIN_BASED_VIRTUAL_NMIS) 47 - 48 - #define EVMCS1_SUPPORTED_EXEC_CTRL \ 49 - (CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 50 - CPU_BASED_HLT_EXITING | \ 51 - CPU_BASED_CR3_LOAD_EXITING | \ 52 - CPU_BASED_CR3_STORE_EXITING | \ 53 - CPU_BASED_UNCOND_IO_EXITING | \ 54 - CPU_BASED_MOV_DR_EXITING | \ 55 - CPU_BASED_USE_TSC_OFFSETTING | \ 56 - CPU_BASED_MWAIT_EXITING | \ 57 - CPU_BASED_MONITOR_EXITING | \ 58 - CPU_BASED_INVLPG_EXITING | \ 59 - CPU_BASED_RDPMC_EXITING | \ 60 - CPU_BASED_INTR_WINDOW_EXITING | \ 61 - CPU_BASED_CR8_LOAD_EXITING | \ 62 - CPU_BASED_CR8_STORE_EXITING | \ 63 - CPU_BASED_RDTSC_EXITING | \ 64 - CPU_BASED_TPR_SHADOW | \ 65 - CPU_BASED_USE_IO_BITMAPS | \ 66 - CPU_BASED_MONITOR_TRAP_FLAG | \ 67 - CPU_BASED_USE_MSR_BITMAPS | \ 68 - CPU_BASED_NMI_WINDOW_EXITING | \ 69 - CPU_BASED_PAUSE_EXITING | \ 70 - CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) 71 - 72 - #define EVMCS1_SUPPORTED_2NDEXEC \ 73 - 
-	(SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |	\
-	 SECONDARY_EXEC_WBINVD_EXITING |		\
-	 SECONDARY_EXEC_ENABLE_VPID |			\
-	 SECONDARY_EXEC_ENABLE_EPT |			\
-	 SECONDARY_EXEC_UNRESTRICTED_GUEST |		\
-	 SECONDARY_EXEC_DESC |				\
-	 SECONDARY_EXEC_ENABLE_RDTSCP |			\
-	 SECONDARY_EXEC_ENABLE_INVPCID |		\
-	 SECONDARY_EXEC_ENABLE_XSAVES |			\
-	 SECONDARY_EXEC_RDSEED_EXITING |		\
-	 SECONDARY_EXEC_RDRAND_EXITING |		\
-	 SECONDARY_EXEC_TSC_SCALING |			\
-	 SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |		\
-	 SECONDARY_EXEC_PT_USE_GPA |			\
-	 SECONDARY_EXEC_PT_CONCEAL_VMX |		\
-	 SECONDARY_EXEC_BUS_LOCK_DETECTION |		\
-	 SECONDARY_EXEC_NOTIFY_VM_EXITING |		\
-	 SECONDARY_EXEC_ENCLS_EXITING)
-
- #define EVMCS1_SUPPORTED_3RDEXEC (0ULL)
-
- #define EVMCS1_SUPPORTED_VMEXIT_CTRL			\
-	(VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |		\
-	 VM_EXIT_SAVE_DEBUG_CONTROLS |			\
-	 VM_EXIT_ACK_INTR_ON_EXIT |			\
-	 VM_EXIT_HOST_ADDR_SPACE_SIZE |			\
-	 VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |		\
-	 VM_EXIT_SAVE_IA32_PAT |			\
-	 VM_EXIT_LOAD_IA32_PAT |			\
-	 VM_EXIT_SAVE_IA32_EFER |			\
-	 VM_EXIT_LOAD_IA32_EFER |			\
-	 VM_EXIT_CLEAR_BNDCFGS |			\
-	 VM_EXIT_PT_CONCEAL_PIP |			\
-	 VM_EXIT_CLEAR_IA32_RTIT_CTL)
-
- #define EVMCS1_SUPPORTED_VMENTRY_CTRL			\
-	(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |		\
-	 VM_ENTRY_LOAD_DEBUG_CONTROLS |			\
-	 VM_ENTRY_IA32E_MODE |				\
-	 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |		\
-	 VM_ENTRY_LOAD_IA32_PAT |			\
-	 VM_ENTRY_LOAD_IA32_EFER |			\
-	 VM_ENTRY_LOAD_BNDCFGS |			\
-	 VM_ENTRY_PT_CONCEAL_PIP |			\
-	 VM_ENTRY_LOAD_IA32_RTIT_CTL)
-
- #define EVMCS1_SUPPORTED_VMFUNC (0)
-
- #define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
- #define EVMCS1_FIELD(number, name, clean_field)[ROL16(number, 6)] = \
-	{EVMCS1_OFFSET(name), clean_field}
-
- const struct evmcs_field vmcs_field_to_evmcs_1[] = {
-	/* 64 bit rw */
-	EVMCS1_FIELD(GUEST_RIP, guest_rip,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(GUEST_RSP, guest_rsp,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
-	EVMCS1_FIELD(GUEST_RFLAGS, guest_rflags,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
-	EVMCS1_FIELD(HOST_IA32_PAT, host_ia32_pat,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_IA32_EFER, host_ia32_efer,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_CR0, host_cr0,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_CR3, host_cr3,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_CR4, host_cr4,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_RIP, host_rip,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(IO_BITMAP_A, io_bitmap_a,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP),
-	EVMCS1_FIELD(IO_BITMAP_B, io_bitmap_b,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP),
-	EVMCS1_FIELD(MSR_BITMAP, msr_bitmap,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP),
-	EVMCS1_FIELD(GUEST_ES_BASE, guest_es_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_CS_BASE, guest_cs_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_SS_BASE, guest_ss_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_DS_BASE, guest_ds_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_FS_BASE, guest_fs_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_GS_BASE, guest_gs_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_LDTR_BASE, guest_ldtr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_TR_BASE, guest_tr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_GDTR_BASE, guest_gdtr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_IDTR_BASE, guest_idtr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(TSC_OFFSET, tsc_offset,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
-	EVMCS1_FIELD(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
-	EVMCS1_FIELD(VMCS_LINK_POINTER, vmcs_link_pointer,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_IA32_PAT, guest_ia32_pat,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_IA32_EFER, guest_ia32_efer,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_IA32_PERF_GLOBAL_CTRL, guest_ia32_perf_global_ctrl,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_PDPTR0, guest_pdptr0,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_PDPTR1, guest_pdptr1,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_PDPTR2, guest_pdptr2,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_PDPTR3, guest_pdptr3,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(CR0_READ_SHADOW, cr0_read_shadow,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(CR4_READ_SHADOW, cr4_read_shadow,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(GUEST_CR0, guest_cr0,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(GUEST_CR3, guest_cr3,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(GUEST_CR4, guest_cr4,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(GUEST_DR7, guest_dr7,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
-	EVMCS1_FIELD(HOST_FS_BASE, host_fs_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
-	EVMCS1_FIELD(HOST_GS_BASE, host_gs_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
-	EVMCS1_FIELD(HOST_TR_BASE, host_tr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
-	EVMCS1_FIELD(HOST_GDTR_BASE, host_gdtr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
-	EVMCS1_FIELD(HOST_IDTR_BASE, host_idtr_base,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
-	EVMCS1_FIELD(HOST_RSP, host_rsp,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
-	EVMCS1_FIELD(EPT_POINTER, ept_pointer,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT),
-	EVMCS1_FIELD(GUEST_BNDCFGS, guest_bndcfgs,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(XSS_EXIT_BITMAP, xss_exit_bitmap,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
-	EVMCS1_FIELD(ENCLS_EXITING_BITMAP, encls_exiting_bitmap,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
-	EVMCS1_FIELD(TSC_MULTIPLIER, tsc_multiplier,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
-	/*
-	 * Not used by KVM:
-	 *
-	 * EVMCS1_FIELD(0x00006828, guest_ia32_s_cet,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	 * EVMCS1_FIELD(0x0000682A, guest_ssp,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
-	 * EVMCS1_FIELD(0x0000682C, guest_ia32_int_ssp_table_addr,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	 * EVMCS1_FIELD(0x00002816, guest_ia32_lbr_ctl,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	 * EVMCS1_FIELD(0x00006C18, host_ia32_s_cet,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	 * EVMCS1_FIELD(0x00006C1A, host_ssp,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	 * EVMCS1_FIELD(0x00006C1C, host_ia32_int_ssp_table_addr,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	 */
-
-	/* 64 bit read only */
-	EVMCS1_FIELD(GUEST_PHYSICAL_ADDRESS, guest_physical_address,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(EXIT_QUALIFICATION, exit_qualification,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	/*
-	 * Not defined in KVM:
-	 *
-	 * EVMCS1_FIELD(0x00006402, exit_io_instruction_ecx,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
-	 * EVMCS1_FIELD(0x00006404, exit_io_instruction_esi,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
-	 * EVMCS1_FIELD(0x00006406, exit_io_instruction_esi,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
-	 * EVMCS1_FIELD(0x00006408, exit_io_instruction_eip,
-	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
-	 */
-	EVMCS1_FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-
-	/*
-	 * No mask defined in the spec as Hyper-V doesn't currently support
-	 * these. Future proof by resetting the whole clean field mask on
-	 * access.
-	 */
-	EVMCS1_FIELD(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-
-	/* 32 bit rw */
-	EVMCS1_FIELD(TPR_THRESHOLD, tpr_threshold,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
-	EVMCS1_FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC),
-	EVMCS1_FIELD(EXCEPTION_BITMAP, exception_bitmap,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN),
-	EVMCS1_FIELD(VM_ENTRY_CONTROLS, vm_entry_controls,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY),
-	EVMCS1_FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
-	EVMCS1_FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE,
-		     vm_entry_exception_error_code,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
-	EVMCS1_FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
-	EVMCS1_FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
-	EVMCS1_FIELD(VM_EXIT_CONTROLS, vm_exit_controls,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
-	EVMCS1_FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
-	EVMCS1_FIELD(GUEST_ES_LIMIT, guest_es_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_CS_LIMIT, guest_cs_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_SS_LIMIT, guest_ss_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_DS_LIMIT, guest_ds_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_FS_LIMIT, guest_fs_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_GS_LIMIT, guest_gs_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_TR_LIMIT, guest_tr_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_ACTIVITY_STATE, guest_activity_state,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-	EVMCS1_FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
-
-	/* 32 bit read only */
-	EVMCS1_FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(VM_EXIT_REASON, vm_exit_reason,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-	EVMCS1_FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
-
-	/* No mask defined in the spec (not used) */
-	EVMCS1_FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(CR3_TARGET_COUNT, cr3_target_count,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
-
-	/* 16 bit rw */
-	EVMCS1_FIELD(HOST_ES_SELECTOR, host_es_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_CS_SELECTOR, host_cs_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_SS_SELECTOR, host_ss_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_DS_SELECTOR, host_ds_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_FS_SELECTOR, host_fs_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_GS_SELECTOR, host_gs_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(HOST_TR_SELECTOR, host_tr_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
-	EVMCS1_FIELD(GUEST_ES_SELECTOR, guest_es_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_CS_SELECTOR, guest_cs_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_SS_SELECTOR, guest_ss_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_DS_SELECTOR, guest_ds_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_FS_SELECTOR, guest_fs_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_GS_SELECTOR, guest_gs_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(GUEST_TR_SELECTOR, guest_tr_selector,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
-	EVMCS1_FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id,
-		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT),
- };
- const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
-
  u64 nested_get_evmptr(struct kvm_vcpu *vcpu)
  {
  	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
···
  	return 0;
  }
-
- #if IS_ENABLED(CONFIG_HYPERV)
- DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs);
-
- /*
-  * KVM on Hyper-V always uses the latest known eVMCSv1 revision, the assumption
-  * is: in case a feature has corresponding fields in eVMCS described and it was
-  * exposed in VMX feature MSRs, KVM is free to use it. Warn if KVM meets a
-  * feature which has no corresponding eVMCS field, this likely means that KVM
-  * needs to be updated.
-  */
- #define evmcs_check_vmcs_conf(field, ctrl)				\
-	do {								\
-		typeof(vmcs_conf->field) unsupported;			\
-									\
-		unsupported = vmcs_conf->field & ~EVMCS1_SUPPORTED_ ## ctrl; \
-		if (unsupported) {					\
-			pr_warn_once(#field " unsupported with eVMCS: 0x%llx\n",\
-				     (u64)unsupported);			\
-			vmcs_conf->field &= EVMCS1_SUPPORTED_ ## ctrl;	\
-		}							\
-	}								\
-	while (0)
-
- void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
- {
-	evmcs_check_vmcs_conf(cpu_based_exec_ctrl, EXEC_CTRL);
-	evmcs_check_vmcs_conf(pin_based_exec_ctrl, PINCTRL);
-	evmcs_check_vmcs_conf(cpu_based_2nd_exec_ctrl, 2NDEXEC);
-	evmcs_check_vmcs_conf(cpu_based_3rd_exec_ctrl, 3RDEXEC);
-	evmcs_check_vmcs_conf(vmentry_ctrl, VMENTRY_CTRL);
-	evmcs_check_vmcs_conf(vmexit_ctrl, VMEXIT_CTRL);
- }
- #endif
  
  int nested_enable_evmcs(struct kvm_vcpu *vcpu,
  			uint16_t *vmcs_version)
+64 -174
arch/x86/kvm/vmx/hyperv.h
···
  #ifndef __KVM_X86_VMX_HYPERV_H
  #define __KVM_X86_VMX_HYPERV_H
  
- #include <linux/jump_label.h>
-
- #include <asm/hyperv-tlfs.h>
- #include <asm/mshyperv.h>
- #include <asm/vmx.h>
-
- #include "../hyperv.h"
-
- #include "capabilities.h"
- #include "vmcs.h"
+ #include <linux/kvm_host.h>
  #include "vmcs12.h"
-
- struct vmcs_config;
-
- #define current_evmcs ((struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs))
-
- #define KVM_EVMCS_VERSION 1
-
- struct evmcs_field {
-	u16 offset;
-	u16 clean_field;
- };
-
- extern const struct evmcs_field vmcs_field_to_evmcs_1[];
- extern const unsigned int nr_evmcs_1_fields;
-
- static __always_inline int evmcs_field_offset(unsigned long field,
-					      u16 *clean_field)
- {
-	unsigned int index = ROL16(field, 6);
-	const struct evmcs_field *evmcs_field;
-
-	if (unlikely(index >= nr_evmcs_1_fields))
-		return -ENOENT;
-
-	evmcs_field = &vmcs_field_to_evmcs_1[index];
-
-	/*
-	 * Use offset=0 to detect holes in eVMCS. This offset belongs to
-	 * 'revision_id' but this field has no encoding and is supposed to
-	 * be accessed directly.
-	 */
-	if (unlikely(!evmcs_field->offset))
-		return -ENOENT;
-
-	if (clean_field)
-		*clean_field = evmcs_field->clean_field;
-
-	return evmcs_field->offset;
- }
-
- static inline u64 evmcs_read_any(struct hv_enlightened_vmcs *evmcs,
-				 unsigned long field, u16 offset)
- {
-	/*
-	 * vmcs12_read_any() doesn't care whether the supplied structure
-	 * is 'struct vmcs12' or 'struct hv_enlightened_vmcs' as it takes
-	 * the exact offset of the required field, use it for convenience
-	 * here.
-	 */
-	return vmcs12_read_any((void *)evmcs, field, offset);
- }
-
- #if IS_ENABLED(CONFIG_HYPERV)
-
- DECLARE_STATIC_KEY_FALSE(__kvm_is_using_evmcs);
-
- static __always_inline bool kvm_is_using_evmcs(void)
- {
-	return static_branch_unlikely(&__kvm_is_using_evmcs);
- }
-
- static __always_inline int get_evmcs_offset(unsigned long field,
-					    u16 *clean_field)
- {
-	int offset = evmcs_field_offset(field, clean_field);
-
-	WARN_ONCE(offset < 0, "accessing unsupported EVMCS field %lx\n", field);
-	return offset;
- }
-
- static __always_inline void evmcs_write64(unsigned long field, u64 value)
- {
-	u16 clean_field;
-	int offset = get_evmcs_offset(field, &clean_field);
-
-	if (offset < 0)
-		return;
-
-	*(u64 *)((char *)current_evmcs + offset) = value;
-
-	current_evmcs->hv_clean_fields &= ~clean_field;
- }
-
- static __always_inline void evmcs_write32(unsigned long field, u32 value)
- {
-	u16 clean_field;
-	int offset = get_evmcs_offset(field, &clean_field);
-
-	if (offset < 0)
-		return;
-
-	*(u32 *)((char *)current_evmcs + offset) = value;
-	current_evmcs->hv_clean_fields &= ~clean_field;
- }
-
- static __always_inline void evmcs_write16(unsigned long field, u16 value)
- {
-	u16 clean_field;
-	int offset = get_evmcs_offset(field, &clean_field);
-
-	if (offset < 0)
-		return;
-
-	*(u16 *)((char *)current_evmcs + offset) = value;
-	current_evmcs->hv_clean_fields &= ~clean_field;
- }
-
- static __always_inline u64 evmcs_read64(unsigned long field)
- {
-	int offset = get_evmcs_offset(field, NULL);
-
-	if (offset < 0)
-		return 0;
-
-	return *(u64 *)((char *)current_evmcs + offset);
- }
-
- static __always_inline u32 evmcs_read32(unsigned long field)
- {
-	int offset = get_evmcs_offset(field, NULL);
-
-	if (offset < 0)
-		return 0;
-
-	return *(u32 *)((char *)current_evmcs + offset);
- }
-
- static __always_inline u16 evmcs_read16(unsigned long field)
- {
-	int offset = get_evmcs_offset(field, NULL);
-
-	if (offset < 0)
-		return 0;
-
-	return *(u16 *)((char *)current_evmcs + offset);
- }
-
- static inline void evmcs_load(u64 phys_addr)
- {
-	struct hv_vp_assist_page *vp_ap =
-		hv_get_vp_assist_page(smp_processor_id());
-
-	if (current_evmcs->hv_enlightenments_control.nested_flush_hypercall)
-		vp_ap->nested_control.features.directhypercall = 1;
-	vp_ap->current_nested_vmcs = phys_addr;
-	vp_ap->enlighten_vmentry = 1;
- }
-
- void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf);
- #else /* !IS_ENABLED(CONFIG_HYPERV) */
- static __always_inline bool kvm_is_using_evmcs(void) { return false; }
- static __always_inline void evmcs_write64(unsigned long field, u64 value) {}
- static __always_inline void evmcs_write32(unsigned long field, u32 value) {}
- static __always_inline void evmcs_write16(unsigned long field, u16 value) {}
- static __always_inline u64 evmcs_read64(unsigned long field) { return 0; }
- static __always_inline u32 evmcs_read32(unsigned long field) { return 0; }
- static __always_inline u16 evmcs_read16(unsigned long field) { return 0; }
- static inline void evmcs_load(u64 phys_addr) {}
- #endif /* IS_ENABLED(CONFIG_HYPERV) */
+ #include "vmx.h"
  
  #define EVMPTR_INVALID (-1ULL)
  #define EVMPTR_MAP_PENDING (-2ULL)
-
- static inline bool evmptr_is_valid(u64 evmptr)
- {
-	return evmptr != EVMPTR_INVALID && evmptr != EVMPTR_MAP_PENDING;
- }
  
  enum nested_evmptrld_status {
  	EVMPTRLD_DISABLED,
···
  	EVMPTRLD_VMFAIL,
  	EVMPTRLD_ERROR,
  };
+
+ #ifdef CONFIG_KVM_HYPERV
+ static inline bool evmptr_is_valid(u64 evmptr)
+ {
+	return evmptr != EVMPTR_INVALID && evmptr != EVMPTR_MAP_PENDING;
+ }
+
+ static inline bool nested_vmx_is_evmptr12_valid(struct vcpu_vmx *vmx)
+ {
+	return evmptr_is_valid(vmx->nested.hv_evmcs_vmptr);
+ }
+
+ static inline bool evmptr_is_set(u64 evmptr)
+ {
+	return evmptr != EVMPTR_INVALID;
+ }
+
+ static inline bool nested_vmx_is_evmptr12_set(struct vcpu_vmx *vmx)
+ {
+	return evmptr_is_set(vmx->nested.hv_evmcs_vmptr);
+ }
+
+ static inline struct hv_enlightened_vmcs *nested_vmx_evmcs(struct vcpu_vmx *vmx)
+ {
+	return vmx->nested.hv_evmcs;
+ }
+
+ static inline bool guest_cpuid_has_evmcs(struct kvm_vcpu *vcpu)
+ {
+	/*
+	 * eVMCS is exposed to the guest if Hyper-V is enabled in CPUID and
+	 * eVMCS has been explicitly enabled by userspace.
+	 */
+	return vcpu->arch.hyperv_enabled &&
+	       to_vmx(vcpu)->nested.enlightened_vmcs_enabled;
+ }
  
  u64 nested_get_evmptr(struct kvm_vcpu *vcpu);
  uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu);
···
  int nested_evmcs_check_controls(struct vmcs12 *vmcs12);
  bool nested_evmcs_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu);
  void vmx_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu);
+ #else
+ static inline bool evmptr_is_valid(u64 evmptr)
+ {
+	return false;
+ }
+
+ static inline bool nested_vmx_is_evmptr12_valid(struct vcpu_vmx *vmx)
+ {
+	return false;
+ }
+
+ static inline bool evmptr_is_set(u64 evmptr)
+ {
+	return false;
+ }
+
+ static inline bool nested_vmx_is_evmptr12_set(struct vcpu_vmx *vmx)
+ {
+	return false;
+ }
+
+ static inline struct hv_enlightened_vmcs *nested_vmx_evmcs(struct vcpu_vmx *vmx)
+ {
+	return NULL;
+ }
+ #endif
  
  #endif /* __KVM_X86_VMX_HYPERV_H */
+315
arch/x86/kvm/vmx/hyperv_evmcs.c
···
+ // SPDX-License-Identifier: GPL-2.0
+ /*
+  * This file contains common code for working with Enlightened VMCS which is
+  * used both by Hyper-V on KVM and KVM on Hyper-V.
+  */
+
+ #include "hyperv_evmcs.h"
+
+ #define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
+ #define EVMCS1_FIELD(number, name, clean_field)[ROL16(number, 6)] = \
+	{EVMCS1_OFFSET(name), clean_field}
+
+ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
+	/* 64 bit rw */
+	EVMCS1_FIELD(GUEST_RIP, guest_rip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(GUEST_RSP, guest_rsp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	EVMCS1_FIELD(GUEST_RFLAGS, guest_rflags,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	EVMCS1_FIELD(HOST_IA32_PAT, host_ia32_pat,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_EFER, host_ia32_efer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CR0, host_cr0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CR3, host_cr3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CR4, host_cr4,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_RIP, host_rip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(IO_BITMAP_A, io_bitmap_a,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP),
+	EVMCS1_FIELD(IO_BITMAP_B, io_bitmap_b,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP),
+	EVMCS1_FIELD(MSR_BITMAP, msr_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP),
+	EVMCS1_FIELD(GUEST_ES_BASE, guest_es_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_BASE, guest_cs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_BASE, guest_ss_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_BASE, guest_ds_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_BASE, guest_fs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_BASE, guest_gs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_BASE, guest_ldtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_BASE, guest_tr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GDTR_BASE, guest_gdtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IDTR_BASE, guest_idtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(TSC_OFFSET, tsc_offset,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(VMCS_LINK_POINTER, vmcs_link_pointer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_PAT, guest_ia32_pat,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_EFER, guest_ia32_efer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_IA32_PERF_GLOBAL_CTRL, guest_ia32_perf_global_ctrl,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR0, guest_pdptr0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR1, guest_pdptr1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR2, guest_pdptr2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PDPTR3, guest_pdptr3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(CR0_READ_SHADOW, cr0_read_shadow,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(CR4_READ_SHADOW, cr4_read_shadow,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_CR0, guest_cr0,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_CR3, guest_cr3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_CR4, guest_cr4,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(GUEST_DR7, guest_dr7,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR),
+	EVMCS1_FIELD(HOST_FS_BASE, host_fs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_GS_BASE, host_gs_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_TR_BASE, host_tr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_GDTR_BASE, host_gdtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IDTR_BASE, host_idtr_base,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_RSP, host_rsp,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(EPT_POINTER, ept_pointer,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT),
+	EVMCS1_FIELD(GUEST_BNDCFGS, guest_bndcfgs,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(XSS_EXIT_BITMAP, xss_exit_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(ENCLS_EXITING_BITMAP, encls_exiting_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(TSC_MULTIPLIER, tsc_multiplier,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	/*
+	 * Not used by KVM:
+	 *
+	 * EVMCS1_FIELD(0x00006828, guest_ia32_s_cet,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	 * EVMCS1_FIELD(0x0000682A, guest_ssp,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	 * EVMCS1_FIELD(0x0000682C, guest_ia32_int_ssp_table_addr,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	 * EVMCS1_FIELD(0x00002816, guest_ia32_lbr_ctl,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	 * EVMCS1_FIELD(0x00006C18, host_ia32_s_cet,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	 * EVMCS1_FIELD(0x00006C1A, host_ssp,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	 * EVMCS1_FIELD(0x00006C1C, host_ia32_int_ssp_table_addr,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	 */
+
+	/* 64 bit read only */
+	EVMCS1_FIELD(GUEST_PHYSICAL_ADDRESS, guest_physical_address,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(EXIT_QUALIFICATION, exit_qualification,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	/*
+	 * Not defined in KVM:
+	 *
+	 * EVMCS1_FIELD(0x00006402, exit_io_instruction_ecx,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 * EVMCS1_FIELD(0x00006404, exit_io_instruction_esi,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 * EVMCS1_FIELD(0x00006406, exit_io_instruction_esi,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 * EVMCS1_FIELD(0x00006408, exit_io_instruction_eip,
+	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE);
+	 */
+	EVMCS1_FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+
+	/*
+	 * No mask defined in the spec as Hyper-V doesn't currently support
+	 * these. Future proof by resetting the whole clean field mask on
+	 * access.
+	 */
+	EVMCS1_FIELD(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+
+	/* 32 bit rw */
+	EVMCS1_FIELD(TPR_THRESHOLD, tpr_threshold,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+	EVMCS1_FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC),
+	EVMCS1_FIELD(EXCEPTION_BITMAP, exception_bitmap,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN),
+	EVMCS1_FIELD(VM_ENTRY_CONTROLS, vm_entry_controls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY),
+	EVMCS1_FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
+	EVMCS1_FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		     vm_entry_exception_error_code,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
+	EVMCS1_FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
+	EVMCS1_FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(VM_EXIT_CONTROLS, vm_exit_controls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(GUEST_ES_LIMIT, guest_es_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_LIMIT, guest_cs_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_LIMIT, guest_ss_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_LIMIT, guest_ds_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_LIMIT, guest_fs_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_LIMIT, guest_gs_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_LIMIT, guest_tr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_ACTIVITY_STATE, guest_activity_state,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+	EVMCS1_FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+
+	/* 32 bit read only */
+	EVMCS1_FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_REASON, vm_exit_reason,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+
+	/* No mask defined in the spec (not used) */
+	EVMCS1_FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(CR3_TARGET_COUNT, cr3_target_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+	EVMCS1_FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL),
+
+	/* 16 bit rw */
+	EVMCS1_FIELD(HOST_ES_SELECTOR, host_es_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_CS_SELECTOR, host_cs_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_SS_SELECTOR, host_ss_selector,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+	EVMCS1_FIELD(HOST_DS_SELECTOR, host_ds_selector,
HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), 290 + EVMCS1_FIELD(HOST_FS_SELECTOR, host_fs_selector, 291 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), 292 + EVMCS1_FIELD(HOST_GS_SELECTOR, host_gs_selector, 293 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), 294 + EVMCS1_FIELD(HOST_TR_SELECTOR, host_tr_selector, 295 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), 296 + EVMCS1_FIELD(GUEST_ES_SELECTOR, guest_es_selector, 297 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 298 + EVMCS1_FIELD(GUEST_CS_SELECTOR, guest_cs_selector, 299 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 300 + EVMCS1_FIELD(GUEST_SS_SELECTOR, guest_ss_selector, 301 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 302 + EVMCS1_FIELD(GUEST_DS_SELECTOR, guest_ds_selector, 303 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 304 + EVMCS1_FIELD(GUEST_FS_SELECTOR, guest_fs_selector, 305 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 306 + EVMCS1_FIELD(GUEST_GS_SELECTOR, guest_gs_selector, 307 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 308 + EVMCS1_FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector, 309 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 310 + EVMCS1_FIELD(GUEST_TR_SELECTOR, guest_tr_selector, 311 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2), 312 + EVMCS1_FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id, 313 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT), 314 + }; 315 + const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
+166
arch/x86/kvm/vmx/hyperv_evmcs.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * This file contains common definitions for working with Enlightened VMCS which 4 + * are used both by Hyper-V on KVM and KVM on Hyper-V. 5 + */ 6 + #ifndef __KVM_X86_VMX_HYPERV_EVMCS_H 7 + #define __KVM_X86_VMX_HYPERV_EVMCS_H 8 + 9 + #include <asm/hyperv-tlfs.h> 10 + 11 + #include "capabilities.h" 12 + #include "vmcs12.h" 13 + 14 + #define KVM_EVMCS_VERSION 1 15 + 16 + /* 17 + * Enlightened VMCSv1 doesn't support these: 18 + * 19 + * POSTED_INTR_NV = 0x00000002, 20 + * GUEST_INTR_STATUS = 0x00000810, 21 + * APIC_ACCESS_ADDR = 0x00002014, 22 + * POSTED_INTR_DESC_ADDR = 0x00002016, 23 + * EOI_EXIT_BITMAP0 = 0x0000201c, 24 + * EOI_EXIT_BITMAP1 = 0x0000201e, 25 + * EOI_EXIT_BITMAP2 = 0x00002020, 26 + * EOI_EXIT_BITMAP3 = 0x00002022, 27 + * GUEST_PML_INDEX = 0x00000812, 28 + * PML_ADDRESS = 0x0000200e, 29 + * VM_FUNCTION_CONTROL = 0x00002018, 30 + * EPTP_LIST_ADDRESS = 0x00002024, 31 + * VMREAD_BITMAP = 0x00002026, 32 + * VMWRITE_BITMAP = 0x00002028, 33 + * 34 + * TSC_MULTIPLIER = 0x00002032, 35 + * PLE_GAP = 0x00004020, 36 + * PLE_WINDOW = 0x00004022, 37 + * VMX_PREEMPTION_TIMER_VALUE = 0x0000482E, 38 + * 39 + * Currently unsupported in KVM: 40 + * GUEST_IA32_RTIT_CTL = 0x00002814, 41 + */ 42 + #define EVMCS1_SUPPORTED_PINCTRL \ 43 + (PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 44 + PIN_BASED_EXT_INTR_MASK | \ 45 + PIN_BASED_NMI_EXITING | \ 46 + PIN_BASED_VIRTUAL_NMIS) 47 + 48 + #define EVMCS1_SUPPORTED_EXEC_CTRL \ 49 + (CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 50 + CPU_BASED_HLT_EXITING | \ 51 + CPU_BASED_CR3_LOAD_EXITING | \ 52 + CPU_BASED_CR3_STORE_EXITING | \ 53 + CPU_BASED_UNCOND_IO_EXITING | \ 54 + CPU_BASED_MOV_DR_EXITING | \ 55 + CPU_BASED_USE_TSC_OFFSETTING | \ 56 + CPU_BASED_MWAIT_EXITING | \ 57 + CPU_BASED_MONITOR_EXITING | \ 58 + CPU_BASED_INVLPG_EXITING | \ 59 + CPU_BASED_RDPMC_EXITING | \ 60 + CPU_BASED_INTR_WINDOW_EXITING | \ 61 + CPU_BASED_CR8_LOAD_EXITING | \ 62 + CPU_BASED_CR8_STORE_EXITING | \ 63 
+ CPU_BASED_RDTSC_EXITING | \ 64 + CPU_BASED_TPR_SHADOW | \ 65 + CPU_BASED_USE_IO_BITMAPS | \ 66 + CPU_BASED_MONITOR_TRAP_FLAG | \ 67 + CPU_BASED_USE_MSR_BITMAPS | \ 68 + CPU_BASED_NMI_WINDOW_EXITING | \ 69 + CPU_BASED_PAUSE_EXITING | \ 70 + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) 71 + 72 + #define EVMCS1_SUPPORTED_2NDEXEC \ 73 + (SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | \ 74 + SECONDARY_EXEC_WBINVD_EXITING | \ 75 + SECONDARY_EXEC_ENABLE_VPID | \ 76 + SECONDARY_EXEC_ENABLE_EPT | \ 77 + SECONDARY_EXEC_UNRESTRICTED_GUEST | \ 78 + SECONDARY_EXEC_DESC | \ 79 + SECONDARY_EXEC_ENABLE_RDTSCP | \ 80 + SECONDARY_EXEC_ENABLE_INVPCID | \ 81 + SECONDARY_EXEC_ENABLE_XSAVES | \ 82 + SECONDARY_EXEC_RDSEED_EXITING | \ 83 + SECONDARY_EXEC_RDRAND_EXITING | \ 84 + SECONDARY_EXEC_TSC_SCALING | \ 85 + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE | \ 86 + SECONDARY_EXEC_PT_USE_GPA | \ 87 + SECONDARY_EXEC_PT_CONCEAL_VMX | \ 88 + SECONDARY_EXEC_BUS_LOCK_DETECTION | \ 89 + SECONDARY_EXEC_NOTIFY_VM_EXITING | \ 90 + SECONDARY_EXEC_ENCLS_EXITING) 91 + 92 + #define EVMCS1_SUPPORTED_3RDEXEC (0ULL) 93 + 94 + #define EVMCS1_SUPPORTED_VMEXIT_CTRL \ 95 + (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | \ 96 + VM_EXIT_SAVE_DEBUG_CONTROLS | \ 97 + VM_EXIT_ACK_INTR_ON_EXIT | \ 98 + VM_EXIT_HOST_ADDR_SPACE_SIZE | \ 99 + VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | \ 100 + VM_EXIT_SAVE_IA32_PAT | \ 101 + VM_EXIT_LOAD_IA32_PAT | \ 102 + VM_EXIT_SAVE_IA32_EFER | \ 103 + VM_EXIT_LOAD_IA32_EFER | \ 104 + VM_EXIT_CLEAR_BNDCFGS | \ 105 + VM_EXIT_PT_CONCEAL_PIP | \ 106 + VM_EXIT_CLEAR_IA32_RTIT_CTL) 107 + 108 + #define EVMCS1_SUPPORTED_VMENTRY_CTRL \ 109 + (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | \ 110 + VM_ENTRY_LOAD_DEBUG_CONTROLS | \ 111 + VM_ENTRY_IA32E_MODE | \ 112 + VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | \ 113 + VM_ENTRY_LOAD_IA32_PAT | \ 114 + VM_ENTRY_LOAD_IA32_EFER | \ 115 + VM_ENTRY_LOAD_BNDCFGS | \ 116 + VM_ENTRY_PT_CONCEAL_PIP | \ 117 + VM_ENTRY_LOAD_IA32_RTIT_CTL) 118 + 119 + #define EVMCS1_SUPPORTED_VMFUNC (0) 120 + 121 + struct 
evmcs_field { 122 + u16 offset; 123 + u16 clean_field; 124 + }; 125 + 126 + extern const struct evmcs_field vmcs_field_to_evmcs_1[]; 127 + extern const unsigned int nr_evmcs_1_fields; 128 + 129 + static __always_inline int evmcs_field_offset(unsigned long field, 130 + u16 *clean_field) 131 + { 132 + const struct evmcs_field *evmcs_field; 133 + unsigned int index = ROL16(field, 6); 134 + 135 + if (unlikely(index >= nr_evmcs_1_fields)) 136 + return -ENOENT; 137 + 138 + evmcs_field = &vmcs_field_to_evmcs_1[index]; 139 + 140 + /* 141 + * Use offset=0 to detect holes in eVMCS. This offset belongs to 142 + * 'revision_id' but this field has no encoding and is supposed to 143 + * be accessed directly. 144 + */ 145 + if (unlikely(!evmcs_field->offset)) 146 + return -ENOENT; 147 + 148 + if (clean_field) 149 + *clean_field = evmcs_field->clean_field; 150 + 151 + return evmcs_field->offset; 152 + } 153 + 154 + static inline u64 evmcs_read_any(struct hv_enlightened_vmcs *evmcs, 155 + unsigned long field, u16 offset) 156 + { 157 + /* 158 + * vmcs12_read_any() doesn't care whether the supplied structure 159 + * is 'struct vmcs12' or 'struct hv_enlightened_vmcs' as it takes 160 + * the exact offset of the required field, use it for convenience 161 + * here. 162 + */ 163 + return vmcs12_read_any((void *)evmcs, field, offset); 164 + } 165 + 166 + #endif /* __KVM_X86_VMX_HYPERV_EVMCS_H */
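The lookup scheme in evmcs_field_offset() above can be illustrated outside the kernel: the 16-bit VMCS field encoding is rotated left by 6 bits to form a dense table index, and an offset of 0 marks a hole (offset 0 belongs to 'revision_id', which has no encoding). The following is a standalone sketch with a toy table; the offset and clean-field values are made up, and only the index math (ROL16(0x0800, 6) == 2 for GUEST_ES_SELECTOR) reflects the real code.

```c
#include <errno.h>
#include <stdint.h>

/* Toy clone of the kernel's ROL16(): rotate a 16-bit value left by n. */
static inline uint16_t rol16(uint16_t v, unsigned int n)
{
	return (uint16_t)((v << n) | (v >> (16 - n)));
}

struct toy_evmcs_field {
	uint16_t offset;      /* offset into the eVMCS page; 0 == hole */
	uint16_t clean_field; /* which clean-field bit covers it */
};

/* Index 2 == ROL16(0x0800, 6), i.e. the GUEST_ES_SELECTOR encoding.
 * The offset 0x01c0 and the clean-field bit are illustrative only. */
static const struct toy_evmcs_field toy_table[] = {
	[2] = { .offset = 0x01c0, .clean_field = 1u << 7 },
};

static int toy_field_offset(unsigned long field, uint16_t *clean_field)
{
	unsigned int index = rol16((uint16_t)field, 6);

	if (index >= sizeof(toy_table) / sizeof(toy_table[0]))
		return -ENOENT;

	/* Offset 0 would point at 'revision_id', so it marks a hole. */
	if (!toy_table[index].offset)
		return -ENOENT;

	if (clean_field)
		*clean_field = toy_table[index].clean_field;

	return toy_table[index].offset;
}
```

The rotation makes the low bits of the encoding (field width/type) the high bits of the index, which is what keeps the table dense enough to index directly.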
+102 -58
arch/x86/kvm/vmx/nested.c
··· 179 179 * VM_INSTRUCTION_ERROR is not shadowed. Enlightened VMCS 'shadows' all 180 180 * fields and thus must be synced. 181 181 */ 182 - if (to_vmx(vcpu)->nested.hv_evmcs_vmptr != EVMPTR_INVALID) 182 + if (nested_vmx_is_evmptr12_set(to_vmx(vcpu))) 183 183 to_vmx(vcpu)->nested.need_vmcs12_to_shadow_sync = true; 184 184 185 185 return kvm_skip_emulated_instruction(vcpu); ··· 194 194 * can't be done if there isn't a current VMCS. 195 195 */ 196 196 if (vmx->nested.current_vmptr == INVALID_GPA && 197 - !evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 197 + !nested_vmx_is_evmptr12_valid(vmx)) 198 198 return nested_vmx_failInvalid(vcpu); 199 199 200 200 return nested_vmx_failValid(vcpu, vm_instruction_error); ··· 226 226 227 227 static inline void nested_release_evmcs(struct kvm_vcpu *vcpu) 228 228 { 229 + #ifdef CONFIG_KVM_HYPERV 229 230 struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); 230 231 struct vcpu_vmx *vmx = to_vmx(vcpu); 231 232 232 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) { 233 + if (nested_vmx_is_evmptr12_valid(vmx)) { 233 234 kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map, true); 234 235 vmx->nested.hv_evmcs = NULL; 235 236 } ··· 242 241 hv_vcpu->nested.vm_id = 0; 243 242 hv_vcpu->nested.vp_id = 0; 244 243 } 244 + #endif 245 + } 246 + 247 + static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr) 248 + { 249 + #ifdef CONFIG_KVM_HYPERV 250 + struct vcpu_vmx *vmx = to_vmx(vcpu); 251 + /* 252 + * When Enlightened VMEntry is enabled on the calling CPU we treat 253 + * memory area pointer by vmptr as Enlightened VMCS (as there's no good 254 + * way to distinguish it from VMCS12) and we must not corrupt it by 255 + * writing to the non-existent 'launch_state' field. The area doesn't 256 + * have to be the currently active EVMCS on the calling CPU and there's 257 + * nothing KVM has to do to transition it from 'active' to 'non-active' 258 + * state. 
It is possible that the area will stay mapped as 259 + * vmx->nested.hv_evmcs but this shouldn't be a problem. 260 + */ 261 + if (!guest_cpuid_has_evmcs(vcpu) || 262 + !evmptr_is_valid(nested_get_evmptr(vcpu))) 263 + return false; 264 + 265 + if (nested_vmx_evmcs(vmx) && vmptr == vmx->nested.hv_evmcs_vmptr) 266 + nested_release_evmcs(vcpu); 267 + 268 + return true; 269 + #else 270 + return false; 271 + #endif 245 272 } 246 273 247 274 static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx, ··· 601 572 int msr; 602 573 unsigned long *msr_bitmap_l1; 603 574 unsigned long *msr_bitmap_l0 = vmx->nested.vmcs02.msr_bitmap; 604 - struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs; 605 575 struct kvm_host_map *map = &vmx->nested.msr_bitmap_map; 606 576 607 577 /* Nothing to do if the MSR bitmap is not in use. */ ··· 616 588 * - Nested hypervisor (L1) has enabled 'Enlightened MSR Bitmap' feature 617 589 * and tells KVM (L0) there were no changes in MSR bitmap for L2. 618 590 */ 619 - if (!vmx->nested.force_msr_bitmap_recalc && evmcs && 620 - evmcs->hv_enlightenments_control.msr_bitmap && 621 - evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP) 622 - return true; 591 + if (!vmx->nested.force_msr_bitmap_recalc) { 592 + struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); 593 + 594 + if (evmcs && evmcs->hv_enlightenments_control.msr_bitmap && 595 + evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP) 596 + return true; 597 + } 623 598 624 599 if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), map)) 625 600 return false; ··· 1116 1085 bool nested_ept, bool reload_pdptrs, 1117 1086 enum vm_entry_failure_code *entry_failure_code) 1118 1087 { 1119 - if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) { 1088 + if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) { 1120 1089 *entry_failure_code = ENTRY_FAIL_DEFAULT; 1121 1090 return -EINVAL; 1122 1091 } ··· 1170 1139 { 1171 1140 struct vcpu_vmx *vmx = to_vmx(vcpu); 1172 1141 1173 - /* 1174 - * 
KVM_REQ_HV_TLB_FLUSH flushes entries from either L1's VP_ID or 1175 - * L2's VP_ID upon request from the guest. Make sure we check for 1176 - * pending entries in the right FIFO upon L1/L2 transition as these 1177 - * requests are put by other vCPUs asynchronously. 1178 - */ 1179 - if (to_hv_vcpu(vcpu) && enable_ept) 1180 - kvm_make_request(KVM_REQ_HV_TLB_FLUSH, vcpu); 1142 + /* Handle pending Hyper-V TLB flush requests */ 1143 + kvm_hv_nested_transtion_tlb_flush(vcpu, enable_ept); 1181 1144 1182 1145 /* 1183 1146 * If vmcs12 doesn't use VPID, L1 expects linear and combined mappings ··· 1603 1578 1604 1579 static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields) 1605 1580 { 1581 + #ifdef CONFIG_KVM_HYPERV 1606 1582 struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12; 1607 - struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs; 1583 + struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); 1608 1584 struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(&vmx->vcpu); 1609 1585 1610 1586 /* HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE */ ··· 1844 1818 */ 1845 1819 1846 1820 return; 1821 + #else /* CONFIG_KVM_HYPERV */ 1822 + KVM_BUG_ON(1, vmx->vcpu.kvm); 1823 + #endif /* CONFIG_KVM_HYPERV */ 1847 1824 } 1848 1825 1849 1826 static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx) 1850 1827 { 1828 + #ifdef CONFIG_KVM_HYPERV 1851 1829 struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12; 1852 - struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs; 1830 + struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); 1853 1831 1854 1832 /* 1855 1833 * Should not be changed by KVM: ··· 2022 1992 evmcs->guest_bndcfgs = vmcs12->guest_bndcfgs; 2023 1993 2024 1994 return; 1995 + #else /* CONFIG_KVM_HYPERV */ 1996 + KVM_BUG_ON(1, vmx->vcpu.kvm); 1997 + #endif /* CONFIG_KVM_HYPERV */ 2025 1998 } 2026 1999 2027 2000 /* ··· 2034 2001 static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( 2035 2002 struct kvm_vcpu *vcpu, bool from_launch) 2036 2003 { 2004 
+ #ifdef CONFIG_KVM_HYPERV 2037 2005 struct vcpu_vmx *vmx = to_vmx(vcpu); 2038 2006 bool evmcs_gpa_changed = false; 2039 2007 u64 evmcs_gpa; ··· 2116 2082 } 2117 2083 2118 2084 return EVMPTRLD_SUCCEEDED; 2085 + #else 2086 + return EVMPTRLD_DISABLED; 2087 + #endif 2119 2088 } 2120 2089 2121 2090 void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu) 2122 2091 { 2123 2092 struct vcpu_vmx *vmx = to_vmx(vcpu); 2124 2093 2125 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 2094 + if (nested_vmx_is_evmptr12_valid(vmx)) 2126 2095 copy_vmcs12_to_enlightened(vmx); 2127 2096 else 2128 2097 copy_vmcs12_to_shadow(vmx); ··· 2279 2242 u32 exec_control; 2280 2243 u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12); 2281 2244 2282 - if (vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 2245 + if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) 2283 2246 prepare_vmcs02_early_rare(vmx, vmcs12); 2284 2247 2285 2248 /* ··· 2440 2403 2441 2404 static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) 2442 2405 { 2443 - struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs; 2406 + struct hv_enlightened_vmcs *hv_evmcs = nested_vmx_evmcs(vmx); 2444 2407 2445 2408 if (!hv_evmcs || !(hv_evmcs->hv_clean_fields & 2446 2409 HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) { ··· 2572 2535 enum vm_entry_failure_code *entry_failure_code) 2573 2536 { 2574 2537 struct vcpu_vmx *vmx = to_vmx(vcpu); 2538 + struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); 2575 2539 bool load_guest_pdptrs_vmcs12 = false; 2576 2540 2577 - if (vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) { 2541 + if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) { 2578 2542 prepare_vmcs02_rare(vmx, vmcs12); 2579 2543 vmx->nested.dirty_vmcs12 = false; 2580 2544 2581 - load_guest_pdptrs_vmcs12 = !evmptr_is_valid(vmx->nested.hv_evmcs_vmptr) || 2582 - !(vmx->nested.hv_evmcs->hv_clean_fields & 2583 - 
HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); 2545 + load_guest_pdptrs_vmcs12 = !nested_vmx_is_evmptr12_valid(vmx) || 2546 + !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); 2584 2547 } 2585 2548 2586 2549 if (vmx->nested.nested_run_pending && ··· 2701 2664 * bits when it changes a field in eVMCS. Mark all fields as clean 2702 2665 * here. 2703 2666 */ 2704 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 2705 - vmx->nested.hv_evmcs->hv_clean_fields |= 2706 - HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; 2667 + if (nested_vmx_is_evmptr12_valid(vmx)) 2668 + evmcs->hv_clean_fields |= HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; 2707 2669 2708 2670 return 0; 2709 2671 } ··· 2753 2717 } 2754 2718 2755 2719 /* Reserved bits should not be set */ 2756 - if (CC(kvm_vcpu_is_illegal_gpa(vcpu, new_eptp) || ((new_eptp >> 7) & 0x1f))) 2720 + if (CC(!kvm_vcpu_is_legal_gpa(vcpu, new_eptp) || ((new_eptp >> 7) & 0x1f))) 2757 2721 return false; 2758 2722 2759 2723 /* AD, if set, should be supported */ ··· 2924 2888 nested_check_vm_entry_controls(vcpu, vmcs12)) 2925 2889 return -EINVAL; 2926 2890 2891 + #ifdef CONFIG_KVM_HYPERV 2927 2892 if (guest_cpuid_has_evmcs(vcpu)) 2928 2893 return nested_evmcs_check_controls(vmcs12); 2894 + #endif 2929 2895 2930 2896 return 0; 2931 2897 } ··· 2950 2912 2951 2913 if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) || 2952 2914 CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) || 2953 - CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3))) 2915 + CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3))) 2954 2916 return -EINVAL; 2955 2917 2956 2918 if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) || ··· 3199 3161 return 0; 3200 3162 } 3201 3163 3164 + #ifdef CONFIG_KVM_HYPERV 3202 3165 static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu) 3203 3166 { 3204 3167 struct vcpu_vmx *vmx = to_vmx(vcpu); ··· 3227 3188 3228 3189 return true; 3229 3190 } 3191 + #endif 3230 3192 3231 3193 static bool nested_get_vmcs12_pages(struct 
kvm_vcpu *vcpu) 3232 3194 { ··· 3319 3279 3320 3280 static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu) 3321 3281 { 3282 + #ifdef CONFIG_KVM_HYPERV 3322 3283 /* 3323 3284 * Note: nested_get_evmcs_page() also updates 'vp_assist_page' copy 3324 3285 * in 'struct kvm_vcpu_hv' in case eVMCS is in use, this is mandatory ··· 3336 3295 3337 3296 return false; 3338 3297 } 3298 + #endif 3339 3299 3340 3300 if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu)) 3341 3301 return false; ··· 3580 3538 3581 3539 load_vmcs12_host_state(vcpu, vmcs12); 3582 3540 vmcs12->vm_exit_reason = exit_reason.full; 3583 - if (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 3541 + if (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx)) 3584 3542 vmx->nested.need_vmcs12_to_shadow_sync = true; 3585 3543 return NVMX_VMENTRY_VMEXIT; 3586 3544 } ··· 3611 3569 if (CC(evmptrld_status == EVMPTRLD_VMFAIL)) 3612 3570 return nested_vmx_failInvalid(vcpu); 3613 3571 3614 - if (CC(!evmptr_is_valid(vmx->nested.hv_evmcs_vmptr) && 3572 + if (CC(!nested_vmx_is_evmptr12_valid(vmx) && 3615 3573 vmx->nested.current_vmptr == INVALID_GPA)) 3616 3574 return nested_vmx_failInvalid(vcpu); 3617 3575 ··· 3626 3584 if (CC(vmcs12->hdr.shadow_vmcs)) 3627 3585 return nested_vmx_failInvalid(vcpu); 3628 3586 3629 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) { 3630 - copy_enlightened_to_vmcs12(vmx, vmx->nested.hv_evmcs->hv_clean_fields); 3587 + if (nested_vmx_is_evmptr12_valid(vmx)) { 3588 + struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); 3589 + 3590 + copy_enlightened_to_vmcs12(vmx, evmcs->hv_clean_fields); 3631 3591 /* Enlightened VMCS doesn't have launch state */ 3632 3592 vmcs12->launch_state = !launch; 3633 3593 } else if (enable_shadow_vmcs) { ··· 4373 4329 { 4374 4330 struct vcpu_vmx *vmx = to_vmx(vcpu); 4375 4331 4376 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 4332 + if (nested_vmx_is_evmptr12_valid(vmx)) 4377 4333 sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); 
4378 4334 4379 4335 vmx->nested.need_sync_vmcs02_to_vmcs12_rare = 4380 - !evmptr_is_valid(vmx->nested.hv_evmcs_vmptr); 4336 + !nested_vmx_is_evmptr12_valid(vmx); 4381 4337 4382 4338 vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12); 4383 4339 vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12); ··· 4776 4732 /* trying to cancel vmlaunch/vmresume is a bug */ 4777 4733 WARN_ON_ONCE(vmx->nested.nested_run_pending); 4778 4734 4735 + #ifdef CONFIG_KVM_HYPERV 4779 4736 if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { 4780 4737 /* 4781 4738 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map ··· 4786 4741 */ 4787 4742 (void)nested_get_evmcs_page(vcpu); 4788 4743 } 4744 + #endif 4789 4745 4790 4746 /* Service pending TLB flush requests for L2 before switching to L1. */ 4791 4747 kvm_service_local_tlb_flush_requests(vcpu); ··· 4900 4854 } 4901 4855 4902 4856 if ((vm_exit_reason != -1) && 4903 - (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr))) 4857 + (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx))) 4904 4858 vmx->nested.need_vmcs12_to_shadow_sync = true; 4905 4859 4906 4860 /* in case we halted in L2 */ ··· 5026 4980 else 5027 4981 *ret = off; 5028 4982 4983 + *ret = vmx_get_untagged_addr(vcpu, *ret, 0); 5029 4984 /* Long mode: #GP(0)/#SS(0) if the memory address is in a 5030 4985 * non-canonical form. This is the only check on the memory 5031 4986 * destination for long mode! ··· 5339 5292 if (vmptr == vmx->nested.vmxon_ptr) 5340 5293 return nested_vmx_fail(vcpu, VMXERR_VMCLEAR_VMXON_POINTER); 5341 5294 5342 - /* 5343 - * When Enlightened VMEntry is enabled on the calling CPU we treat 5344 - * memory area pointer by vmptr as Enlightened VMCS (as there's no good 5345 - * way to distinguish it from VMCS12) and we must not corrupt it by 5346 - * writing to the non-existent 'launch_state' field. 
The area doesn't 5347 - * have to be the currently active EVMCS on the calling CPU and there's 5348 - * nothing KVM has to do to transition it from 'active' to 'non-active' 5349 - * state. It is possible that the area will stay mapped as 5350 - * vmx->nested.hv_evmcs but this shouldn't be a problem. 5351 - */ 5352 - if (likely(!guest_cpuid_has_evmcs(vcpu) || 5353 - !evmptr_is_valid(nested_get_evmptr(vcpu)))) { 5295 + if (likely(!nested_evmcs_handle_vmclear(vcpu, vmptr))) { 5354 5296 if (vmptr == vmx->nested.current_vmptr) 5355 5297 nested_release_vmcs12(vcpu); 5356 5298 ··· 5356 5320 vmptr + offsetof(struct vmcs12, 5357 5321 launch_state), 5358 5322 &zero, sizeof(zero)); 5359 - } else if (vmx->nested.hv_evmcs && vmptr == vmx->nested.hv_evmcs_vmptr) { 5360 - nested_release_evmcs(vcpu); 5361 5323 } 5362 5324 5363 5325 return nested_vmx_succeed(vcpu); ··· 5394 5360 /* Decode instruction info and find the field to read */ 5395 5361 field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf)); 5396 5362 5397 - if (!evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) { 5363 + if (!nested_vmx_is_evmptr12_valid(vmx)) { 5398 5364 /* 5399 5365 * In VMX non-root operation, when the VMCS-link pointer is INVALID_GPA, 5400 5366 * any VMREAD sets the ALU flags for VMfailInvalid. 
··· 5432 5398 return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT); 5433 5399 5434 5400 /* Read the field, zero-extended to a u64 value */ 5435 - value = evmcs_read_any(vmx->nested.hv_evmcs, field, offset); 5401 + value = evmcs_read_any(nested_vmx_evmcs(vmx), field, offset); 5436 5402 } 5437 5403 5438 5404 /* ··· 5620 5586 return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_VMXON_POINTER); 5621 5587 5622 5588 /* Forbid normal VMPTRLD if Enlightened version was used */ 5623 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 5589 + if (nested_vmx_is_evmptr12_valid(vmx)) 5624 5590 return 1; 5625 5591 5626 5592 if (vmx->nested.current_vmptr != vmptr) { ··· 5683 5649 if (!nested_vmx_check_permission(vcpu)) 5684 5650 return 1; 5685 5651 5686 - if (unlikely(evmptr_is_valid(to_vmx(vcpu)->nested.hv_evmcs_vmptr))) 5652 + if (unlikely(nested_vmx_is_evmptr12_valid(to_vmx(vcpu)))) 5687 5653 return 1; 5688 5654 5689 5655 if (get_vmx_mem_address(vcpu, exit_qual, instr_info, ··· 5831 5797 vpid02 = nested_get_vpid02(vcpu); 5832 5798 switch (type) { 5833 5799 case VMX_VPID_EXTENT_INDIVIDUAL_ADDR: 5800 + /* 5801 + * LAM doesn't apply to addresses that are inputs to TLB 5802 + * invalidation. 5803 + */ 5834 5804 if (!operand.vpid || 5835 5805 is_noncanonical_address(operand.gla, vcpu)) 5836 5806 return nested_vmx_fail(vcpu, ··· 6246 6208 * Handle L2's bus locks in L0 directly. 
6247 6209 */ 6248 6210 return true; 6211 + #ifdef CONFIG_KVM_HYPERV 6249 6212 case EXIT_REASON_VMCALL: 6250 6213 /* Hyper-V L2 TLB flush hypercall is handled by L0 */ 6251 6214 return guest_hv_cpuid_has_l2_tlb_flush(vcpu) && 6252 6215 nested_evmcs_l2_tlb_flush_enabled(vcpu) && 6253 6216 kvm_hv_is_tlb_flush_hcall(vcpu); 6217 + #endif 6254 6218 default: 6255 6219 break; 6256 6220 } ··· 6475 6435 kvm_state.size += sizeof(user_vmx_nested_state->vmcs12); 6476 6436 6477 6437 /* 'hv_evmcs_vmptr' can also be EVMPTR_MAP_PENDING here */ 6478 - if (vmx->nested.hv_evmcs_vmptr != EVMPTR_INVALID) 6438 + if (nested_vmx_is_evmptr12_set(vmx)) 6479 6439 kvm_state.flags |= KVM_STATE_NESTED_EVMCS; 6480 6440 6481 6441 if (is_guest_mode(vcpu) && ··· 6531 6491 } else { 6532 6492 copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu)); 6533 6493 if (!vmx->nested.need_vmcs12_to_shadow_sync) { 6534 - if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) 6494 + if (nested_vmx_is_evmptr12_valid(vmx)) 6535 6495 /* 6536 6496 * L1 hypervisor is not obliged to keep eVMCS 6537 6497 * clean fields data always up-to-date while ··· 6672 6632 return -EINVAL; 6673 6633 6674 6634 set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa); 6635 + #ifdef CONFIG_KVM_HYPERV 6675 6636 } else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) { 6676 6637 /* 6677 6638 * nested_vmx_handle_enlightened_vmptrld() cannot be called ··· 6682 6641 */ 6683 6642 vmx->nested.hv_evmcs_vmptr = EVMPTR_MAP_PENDING; 6684 6643 kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); 6644 + #endif 6685 6645 } else { 6686 6646 return -EINVAL; 6687 6647 } ··· 7138 7096 .set_state = vmx_set_nested_state, 7139 7097 .get_nested_state_pages = vmx_get_nested_state_pages, 7140 7098 .write_log_dirty = nested_vmx_write_pml_buffer, 7099 + #ifdef CONFIG_KVM_HYPERV 7141 7100 .enable_evmcs = nested_enable_evmcs, 7142 7101 .get_evmcs_version = nested_get_evmcs_version, 7143 7102 .hv_inject_synthetic_vmexit_post_tlb_flush = 
vmx_hv_inject_synthetic_vmexit_post_tlb_flush, 7103 + #endif 7144 7104 };
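Several of the nested.c hunks above key off the eVMCS clean-fields protocol: the MSR-bitmap merge is skipped when L1 reports HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP, and after a successful entry KVM sets hv_clean_fields to HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL on L1's behalf. The protocol can be modeled in miniature as below; the names and bit values are illustrative, not the Hyper-V TLFS definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative clean-field bits (real values come from the TLFS). */
#define TOY_CLEAN_MSR_BITMAP	(1u << 0)
#define TOY_CLEAN_ALL		0xffffffffu

struct toy_evmcs {
	uint32_t clean_fields;
	uint32_t msr_bitmap_word;	/* stand-in for the MSR bitmap */
};

/* L0 side: the expensive merge is needed only when L1 may have
 * touched the bitmap, i.e. when the covering clean bit is clear. */
static bool toy_need_msr_bitmap_merge(const struct toy_evmcs *e)
{
	return !(e->clean_fields & TOY_CLEAN_MSR_BITMAP);
}

/*
 * L0 side: after consuming the eVMCS into vmcs02, mark all fields
 * clean, since the L1 hypervisor only clears bits and never sets them.
 */
static void toy_mark_all_clean(struct toy_evmcs *e)
{
	e->clean_fields = TOY_CLEAN_ALL;
}

/* L1 side: a write must clear the clean bit covering the field. */
static void toy_write_msr_bitmap(struct toy_evmcs *e, uint32_t v)
{
	e->msr_bitmap_word = v;
	e->clean_fields &= ~TOY_CLEAN_MSR_BITMAP;
}
```

The same pattern explains the "L1 is not obliged to keep clean fields up-to-date" caveat in the state-save path: a clean bit only means "unchanged since L0 last marked it clean", never the reverse.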
+2 -1
arch/x86/kvm/vmx/nested.h
··· 3 3 #define __KVM_X86_VMX_NESTED_H 4 4 5 5 #include "kvm_cache_regs.h" 6 + #include "hyperv.h" 6 7 #include "vmcs12.h" 7 8 #include "vmx.h" 8 9 ··· 58 57 59 58 /* 'hv_evmcs_vmptr' can also be EVMPTR_MAP_PENDING here */ 60 59 return vmx->nested.current_vmptr != -1ull || 61 - vmx->nested.hv_evmcs_vmptr != EVMPTR_INVALID; 60 + nested_vmx_is_evmptr12_set(vmx); 62 61 } 63 62 64 63 static inline u16 nested_get_vpid02(struct kvm_vcpu *vcpu)
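The hunks in this series repeatedly distinguish nested_vmx_is_evmptr12_set() from nested_vmx_is_evmptr12_valid(): "set" means L1 has supplied an eVMCS pointer even if it is not mapped yet (EVMPTR_MAP_PENDING, e.g. right after restoring nested state), while "valid" means the eVMCS is actually mapped. A toy model of that distinction, using sentinels patterned after the kernel's -1/-2 encoding (a real eVMCS pointer is a guest physical address and cannot collide with them):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sentinels modeled on EVMPTR_INVALID (-1) and EVMPTR_MAP_PENDING (-2). */
#define TOY_EVMPTR_INVALID	((uint64_t)-1)
#define TOY_EVMPTR_MAP_PENDING	((uint64_t)-2)

/* "Set": L1 handed KVM an eVMCS pointer, mapped or not. This is what
 * the state-save path checks before reporting KVM_STATE_NESTED_EVMCS. */
static bool toy_evmptr12_is_set(uint64_t evmptr)
{
	return evmptr != TOY_EVMPTR_INVALID;
}

/* "Valid": the pointer is set *and* the eVMCS is actually mapped, so
 * its contents can be read and written. */
static bool toy_evmptr12_is_valid(uint64_t evmptr)
{
	return evmptr != TOY_EVMPTR_INVALID &&
	       evmptr != TOY_EVMPTR_MAP_PENDING;
}
```

The map-pending state exists because after KVM_SET_NESTED_STATE the guest memory map may not be usable yet, so the actual mapping is deferred to KVM_REQ_GET_NESTED_STATE_PAGES.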
-22
arch/x86/kvm/vmx/pmu_intel.c
··· 437 437 !(msr & MSR_PMC_FULL_WIDTH_BIT)) 438 438 data = (s64)(s32)data; 439 439 pmc_write_counter(pmc, data); 440 - pmc_update_sample_period(pmc); 441 440 break; 442 441 } else if ((pmc = get_fixed_pmc(pmu, msr))) { 443 442 pmc_write_counter(pmc, data); 444 - pmc_update_sample_period(pmc); 445 443 break; 446 444 } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) { 447 445 reserved_bits = pmu->reserved_bits; ··· 630 632 631 633 static void intel_pmu_reset(struct kvm_vcpu *vcpu) 632 634 { 633 - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 634 - struct kvm_pmc *pmc = NULL; 635 - int i; 636 - 637 - for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) { 638 - pmc = &pmu->gp_counters[i]; 639 - 640 - pmc_stop_counter(pmc); 641 - pmc->counter = pmc->prev_counter = pmc->eventsel = 0; 642 - } 643 - 644 - for (i = 0; i < KVM_PMC_MAX_FIXED; i++) { 645 - pmc = &pmu->fixed_counters[i]; 646 - 647 - pmc_stop_counter(pmc); 648 - pmc->counter = pmc->prev_counter = 0; 649 - } 650 - 651 - pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0; 652 - 653 635 intel_pmu_release_guest_lbr_event(vcpu); 654 636 } 655 637
+1
arch/x86/kvm/vmx/sgx.c
··· 37 37 if (!IS_ALIGNED(*gva, alignment)) { 38 38 fault = true; 39 39 } else if (likely(is_64_bit_mode(vcpu))) { 40 + *gva = vmx_get_untagged_addr(vcpu, *gva, 0); 40 41 fault = is_noncanonical_address(*gva, vcpu); 41 42 } else { 42 43 *gva &= 0xffffffff;
+1 -1
arch/x86/kvm/vmx/vmenter.S
··· 289 289 RET 290 290 291 291 .Lfixup: 292 - cmpb $0, kvm_rebooting 292 + cmpb $0, _ASM_RIP(kvm_rebooting) 293 293 jne .Lvmfail 294 294 ud2 295 295 .Lvmfail:
+67 -19
arch/x86/kvm/vmx/vmx.c
···
  #include "vmx.h"
  #include "x86.h"
  #include "smm.h"
+ #include "vmx_onhyperv.h"

  MODULE_AUTHOR("Qumranet");
  MODULE_LICENSE("GPL");
···
  static int hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu)
  {
      struct hv_enlightened_vmcs *evmcs;
-     struct hv_partition_assist_pg **p_hv_pa_pg =
-         &to_kvm_hv(vcpu->kvm)->hv_pa_pg;
-     /*
-      * Synthetic VM-Exit is not enabled in current code and so All
-      * evmcs in singe VM shares same assist page.
-      */
-     if (!*p_hv_pa_pg)
-         *p_hv_pa_pg = kzalloc(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
+     hpa_t partition_assist_page = hv_get_partition_assist_page(vcpu);

-     if (!*p_hv_pa_pg)
+     if (partition_assist_page == INVALID_PAGE)
          return -ENOMEM;

      evmcs = (struct hv_enlightened_vmcs *)to_vmx(vcpu)->loaded_vmcs->vmcs;

-     evmcs->partition_assist_page =
-         __pa(*p_hv_pa_pg);
+     evmcs->partition_assist_page = partition_assist_page;
      evmcs->hv_vm_id = (unsigned long)vcpu->kvm;
      evmcs->hv_enlightenments_control.nested_flush_hypercall = 1;
···
      if (vmx_get_vmx_msr(&vmx->nested.msrs, msr_info->index,
                  &msr_info->data))
          return 1;
+ #ifdef CONFIG_KVM_HYPERV
      /*
       * Enlightened VMCS v1 doesn't have certain VMCS fields but
       * instead of just ignoring the features, different Hyper-V
···
      if (!msr_info->host_initiated && guest_cpuid_has_evmcs(vcpu))
          nested_evmcs_filter_control_msr(vcpu, msr_info->index,
                          &msr_info->data);
+ #endif
      break;
  case MSR_IA32_RTIT_CTL:
      if (!vmx_pt_mode_is_host_guest())
···
          update_guest_cr3 = false;
          vmx_ept_load_pdptrs(vcpu);
      } else {
-         guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
+         guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
+                 kvm_get_active_cr3_lam_bits(vcpu);
      }

      if (update_guest_cr3)
···
      vmx->nested.posted_intr_nv = -1;
      vmx->nested.vmxon_ptr = INVALID_GPA;
      vmx->nested.current_vmptr = INVALID_GPA;
+
+ #ifdef CONFIG_KVM_HYPERV
      vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
+ #endif

      vcpu->arch.microcode_version = 0x100000000ULL;
      vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
···
       * would also use advanced VM-exit information for EPT violations to
       * reconstruct the page fault error code.
       */
-     if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
+     if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
          return kvm_emulate_instruction(vcpu, 0);

      return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
···
          return;

      /*
-      * Grab the memslot so that the hva lookup for the mmu_notifier retry
-      * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-      * on the pfn lookup's validation of the memslot to ensure a valid hva
-      * is used for the retry check.
+      * Explicitly grab the memslot using KVM's internal slot ID to ensure
+      * KVM doesn't unintentionally grab a userspace memslot.  It _should_
+      * be impossible for userspace to create a memslot for the APIC when
+      * APICv is enabled, but paranoia won't hurt in this case.
       */
      slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
      if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
···
          return;

      read_lock(&vcpu->kvm->mmu_lock);
-     if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-                  gfn_to_hva_memslot(slot, gfn))) {
+     if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
          kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
          read_unlock(&vcpu->kvm->mmu_lock);
          goto out;
···
      cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
      cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));

+     entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
+     cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
+
  #undef cr4_fixed1_update
  }
···
      kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);

      kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
+     kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);

      vmx_setup_uret_msrs(vmx);
···
      free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
  }

+ /*
+  * Note, the SDM states that the linear address is masked *after* the modified
+  * canonicality check, whereas KVM masks (untags) the address and then performs
+  * a "normal" canonicality check.  Functionally, the two methods are identical,
+  * and when the masking occurs relative to the canonicality check isn't visible
+  * to software, i.e. KVM's behavior doesn't violate the SDM.
+  */
+ gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags)
+ {
+     int lam_bit;
+     unsigned long cr3_bits;
+
+     if (flags & (X86EMUL_F_FETCH | X86EMUL_F_IMPLICIT | X86EMUL_F_INVLPG))
+         return gva;
+
+     if (!is_64_bit_mode(vcpu))
+         return gva;
+
+     /*
+      * Bit 63 determines if the address should be treated as user address
+      * or a supervisor address.
+      */
+     if (!(gva & BIT_ULL(63))) {
+         cr3_bits = kvm_get_active_cr3_lam_bits(vcpu);
+         if (!(cr3_bits & (X86_CR3_LAM_U57 | X86_CR3_LAM_U48)))
+             return gva;
+
+         /* LAM_U48 is ignored if LAM_U57 is set. */
+         lam_bit = cr3_bits & X86_CR3_LAM_U57 ? 56 : 47;
+     } else {
+         if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_LAM_SUP))
+             return gva;
+
+         lam_bit = kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 56 : 47;
+     }
+
+     /*
+      * Untag the address by sign-extending the lam_bit, but NOT to bit 63.
+      * Bit 63 is retained from the raw virtual address so that untagging
+      * doesn't change a user access to a supervisor access, and vice versa.
+      */
+     return (sign_extend64(gva, lam_bit) & ~BIT_ULL(63)) | (gva & BIT_ULL(63));
+ }
+
  static struct kvm_x86_ops vmx_x86_ops __initdata = {
      .name = KBUILD_MODNAME,
···
      .complete_emulated_msr = kvm_complete_insn_gp,

      .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+     .get_untagged_addr = vmx_get_untagged_addr,
  };

  static unsigned int vmx_handle_intel_pt_intr(void)
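The sign-extension trick in vmx_get_untagged_addr() above can be modeled in plain user-space C. This is a hedged sketch, not kernel code: sign_extend64() is reimplemented locally, and untag_addr() and its arguments are our own names; only the lam_bit values (47/56) and the "keep bit 63" rule come from the hunk above.

```c
#include <assert.h>
#include <stdint.h>

/* Local reimplementation of the kernel's sign_extend64(): sign-extend
 * 'value' from bit 'index' (0-based) up through bit 63. */
static inline int64_t sign_extend64(uint64_t value, int index)
{
    int shift = 63 - index;

    return (int64_t)(value << shift) >> shift;
}

/* Hypothetical model of LAM untagging: sign-extend from lam_bit so the
 * metadata bits collapse, but restore bit 63 from the raw address so a
 * user access can't turn into a supervisor access, and vice versa. */
static uint64_t untag_addr(uint64_t gva, int lam_bit)
{
    return ((uint64_t)sign_extend64(gva, lam_bit) & ~(1ULL << 63)) |
           (gva & (1ULL << 63));
}
```

For a LAM_U48-style mask (lam_bit 47), a tag stored in bits 48-62 of a user pointer disappears while the low bits survive; for a supervisor address with LAM_SUP and 4-level paging (lam_bit 56), the tag bits are filled with copies of bit 56 and bit 63 stays set.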
+4 -10
arch/x86/kvm/vmx/vmx.h
···
      bool guest_mode;
  } smm;

+ #ifdef CONFIG_KVM_HYPERV
  gpa_t hv_evmcs_vmptr;
  struct kvm_host_map hv_evmcs_map;
  struct hv_enlightened_vmcs *hv_evmcs;
+ #endif
  };

  struct vcpu_vmx {
···
  u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
  u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+
+ gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);

  static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
                       int type, bool value)
···
  static inline bool vmx_can_use_ipiv(struct kvm_vcpu *vcpu)
  {
      return lapic_in_kernel(vcpu) && enable_ipiv;
- }
-
- static inline bool guest_cpuid_has_evmcs(struct kvm_vcpu *vcpu)
- {
-     /*
-      * eVMCS is exposed to the guest if Hyper-V is enabled in CPUID and
-      * eVMCS has been explicitly enabled by userspace.
-      */
-     return vcpu->arch.hyperv_enabled &&
-            to_vmx(vcpu)->nested.enlightened_vmcs_enabled;
  }

  #endif /* __KVM_X86_VMX_H */
+36
arch/x86/kvm/vmx/vmx_onhyperv.c
···
+ // SPDX-License-Identifier: GPL-2.0-only
+
+ #include "capabilities.h"
+ #include "vmx_onhyperv.h"
+
+ DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs);
+
+ /*
+  * KVM on Hyper-V always uses the latest known eVMCSv1 revision, the assumption
+  * is: in case a feature has corresponding fields in eVMCS described and it was
+  * exposed in VMX feature MSRs, KVM is free to use it.  Warn if KVM meets a
+  * feature which has no corresponding eVMCS field, this likely means that KVM
+  * needs to be updated.
+  */
+ #define evmcs_check_vmcs_conf(field, ctrl)                              \
+     do {                                                                \
+         typeof(vmcs_conf->field) unsupported;                           \
+                                                                         \
+         unsupported = vmcs_conf->field & ~EVMCS1_SUPPORTED_ ## ctrl;    \
+         if (unsupported) {                                              \
+             pr_warn_once(#field " unsupported with eVMCS: 0x%llx\n",    \
+                      (u64)unsupported);                                 \
+             vmcs_conf->field &= EVMCS1_SUPPORTED_ ## ctrl;              \
+         }                                                               \
+     }                                                                   \
+     while (0)
+
+ void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
+ {
+     evmcs_check_vmcs_conf(cpu_based_exec_ctrl, EXEC_CTRL);
+     evmcs_check_vmcs_conf(pin_based_exec_ctrl, PINCTRL);
+     evmcs_check_vmcs_conf(cpu_based_2nd_exec_ctrl, 2NDEXEC);
+     evmcs_check_vmcs_conf(cpu_based_3rd_exec_ctrl, 3RDEXEC);
+     evmcs_check_vmcs_conf(vmentry_ctrl, VMENTRY_CTRL);
+     evmcs_check_vmcs_conf(vmexit_ctrl, VMEXIT_CTRL);
+ }
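The evmcs_check_vmcs_conf() macro in the new file above is a mask-and-warn-once pattern: clear the bits the consumer can't handle, but complain only the first time. A hedged user-space sketch of the same idea (the function name, the warn mechanism, and the bit values are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Drop control bits outside the supported mask, warning only on the
 * first configuration that carried unsupported bits. */
static uint32_t sanitize_ctrl(uint32_t ctrl, uint32_t supported,
                              const char *name)
{
    static bool warned;
    uint32_t unsupported = ctrl & ~supported;

    if (unsupported && !warned) {
        fprintf(stderr, "%s unsupported bits: 0x%x\n", name, unsupported);
        warned = true;
    }
    return ctrl & supported;
}
```

The kernel macro uses `typeof` and token pasting so one definition covers every control field; the sketch hardcodes `uint32_t` for brevity.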
+125
arch/x86/kvm/vmx/vmx_onhyperv.h
···
+ /* SPDX-License-Identifier: GPL-2.0-only */
+
+ #ifndef __ARCH_X86_KVM_VMX_ONHYPERV_H__
+ #define __ARCH_X86_KVM_VMX_ONHYPERV_H__
+
+ #include <asm/hyperv-tlfs.h>
+ #include <asm/mshyperv.h>
+
+ #include <linux/jump_label.h>
+
+ #include "capabilities.h"
+ #include "hyperv_evmcs.h"
+ #include "vmcs12.h"
+
+ #define current_evmcs ((struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs))
+
+ #if IS_ENABLED(CONFIG_HYPERV)
+
+ DECLARE_STATIC_KEY_FALSE(__kvm_is_using_evmcs);
+
+ static __always_inline bool kvm_is_using_evmcs(void)
+ {
+     return static_branch_unlikely(&__kvm_is_using_evmcs);
+ }
+
+ static __always_inline int get_evmcs_offset(unsigned long field,
+                         u16 *clean_field)
+ {
+     int offset = evmcs_field_offset(field, clean_field);
+
+     WARN_ONCE(offset < 0, "accessing unsupported EVMCS field %lx\n", field);
+     return offset;
+ }
+
+ static __always_inline void evmcs_write64(unsigned long field, u64 value)
+ {
+     u16 clean_field;
+     int offset = get_evmcs_offset(field, &clean_field);
+
+     if (offset < 0)
+         return;
+
+     *(u64 *)((char *)current_evmcs + offset) = value;
+
+     current_evmcs->hv_clean_fields &= ~clean_field;
+ }
+
+ static __always_inline void evmcs_write32(unsigned long field, u32 value)
+ {
+     u16 clean_field;
+     int offset = get_evmcs_offset(field, &clean_field);
+
+     if (offset < 0)
+         return;
+
+     *(u32 *)((char *)current_evmcs + offset) = value;
+     current_evmcs->hv_clean_fields &= ~clean_field;
+ }
+
+ static __always_inline void evmcs_write16(unsigned long field, u16 value)
+ {
+     u16 clean_field;
+     int offset = get_evmcs_offset(field, &clean_field);
+
+     if (offset < 0)
+         return;
+
+     *(u16 *)((char *)current_evmcs + offset) = value;
+     current_evmcs->hv_clean_fields &= ~clean_field;
+ }
+
+ static __always_inline u64 evmcs_read64(unsigned long field)
+ {
+     int offset = get_evmcs_offset(field, NULL);
+
+     if (offset < 0)
+         return 0;
+
+     return *(u64 *)((char *)current_evmcs + offset);
+ }
+
+ static __always_inline u32 evmcs_read32(unsigned long field)
+ {
+     int offset = get_evmcs_offset(field, NULL);
+
+     if (offset < 0)
+         return 0;
+
+     return *(u32 *)((char *)current_evmcs + offset);
+ }
+
+ static __always_inline u16 evmcs_read16(unsigned long field)
+ {
+     int offset = get_evmcs_offset(field, NULL);
+
+     if (offset < 0)
+         return 0;
+
+     return *(u16 *)((char *)current_evmcs + offset);
+ }
+
+ static inline void evmcs_load(u64 phys_addr)
+ {
+     struct hv_vp_assist_page *vp_ap =
+         hv_get_vp_assist_page(smp_processor_id());
+
+     if (current_evmcs->hv_enlightenments_control.nested_flush_hypercall)
+         vp_ap->nested_control.features.directhypercall = 1;
+     vp_ap->current_nested_vmcs = phys_addr;
+     vp_ap->enlighten_vmentry = 1;
+ }
+
+ void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf);
+ #else /* !IS_ENABLED(CONFIG_HYPERV) */
+ static __always_inline bool kvm_is_using_evmcs(void) { return false; }
+ static __always_inline void evmcs_write64(unsigned long field, u64 value) {}
+ static __always_inline void evmcs_write32(unsigned long field, u32 value) {}
+ static __always_inline void evmcs_write16(unsigned long field, u16 value) {}
+ static __always_inline u64 evmcs_read64(unsigned long field) { return 0; }
+ static __always_inline u32 evmcs_read32(unsigned long field) { return 0; }
+ static __always_inline u16 evmcs_read16(unsigned long field) { return 0; }
+ static inline void evmcs_load(u64 phys_addr) {}
+ #endif /* IS_ENABLED(CONFIG_HYPERV) */
+
+ #endif /* __ARCH_X86_KVM_VMX_ONHYPERV_H__ */
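The evmcs_write*() helpers above all follow the eVMCS "clean fields" protocol: every write clears the field's clean-group bit so Hyper-V knows that group must be re-read on the next VM-entry. A toy user-space model, with made-up field layout and clean-group bit (only the clear-on-write pattern is taken from the code above):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for an enlightened VMCS: one field plus the clean bitmap. */
struct toy_evmcs {
    uint64_t guest_rip;
    uint64_t hv_clean_fields;
};

/* Hypothetical clean-group bit covering guest_rip. */
#define TOY_CLEAN_GUEST_BASIC (1ull << 3)

/* Model of evmcs_write64(): store the value, then mark its group dirty
 * by clearing the corresponding clean bit. */
static void toy_write_rip(struct toy_evmcs *e, uint64_t val)
{
    e->guest_rip = val;
    e->hv_clean_fields &= ~TOY_CLEAN_GUEST_BASIC;
}
```

The real helpers additionally translate a VMCS field encoding to a byte offset via evmcs_field_offset(), which also yields the clean-group bit to clear.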
+1 -1
arch/x86/kvm/vmx/vmx_ops.h
···
  #include <asm/vmx.h>

- #include "hyperv.h"
+ #include "vmx_onhyperv.h"
  #include "vmcs.h"
  #include "../x86.h"
+126 -42
arch/x86/kvm/x86.c
···
   * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
   * the current vCPU mode is accurate.
   */
- if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
+ if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
      return 1;

  if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
···
  static const u32 emulated_msrs_all[] = {
      MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
      MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
+
+ #ifdef CONFIG_KVM_HYPERV
      HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
      HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
      HV_X64_MSR_TSC_FREQUENCY, HV_X64_MSR_APIC_FREQUENCY,
···
      HV_X64_MSR_SYNDBG_CONTROL, HV_X64_MSR_SYNDBG_STATUS,
      HV_X64_MSR_SYNDBG_SEND_BUFFER, HV_X64_MSR_SYNDBG_RECV_BUFFER,
      HV_X64_MSR_SYNDBG_PENDING_BUFFER,
+ #endif

      MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
      MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT, MSR_KVM_ASYNC_PF_ACK,
···
  }
  #endif

- static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
  {
  #ifdef CONFIG_X86_64
-     bool vcpus_matched;
      struct kvm_arch *ka = &vcpu->kvm->arch;
      struct pvclock_gtod_data *gtod = &pvclock_gtod_data;

-     vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-              atomic_read(&vcpu->kvm->online_vcpus));
+     /*
+      * To use the masterclock, the host clocksource must be based on TSC
+      * and all vCPUs must have matching TSCs.  Note, the count for matching
+      * vCPUs doesn't include the reference vCPU, hence "+1".
+      */
+     bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
+                  atomic_read(&vcpu->kvm->online_vcpus)) &&
+                 gtod_is_based_on_tsc(gtod->clock.vclock_mode);

      /*
-      * Once the masterclock is enabled, always perform request in
-      * order to update it.
-      *
-      * In order to enable masterclock, the host clocksource must be TSC
-      * and the vcpus need to have matched TSCs.  When that happens,
-      * perform request to enable masterclock.
+      * Request a masterclock update if the masterclock needs to be toggled
+      * on/off, or when starting a new generation and the masterclock is
+      * enabled (compute_guest_tsc() requires the masterclock snapshot to be
+      * taken _after_ the new generation is created).
       */
-     if (ka->use_master_clock ||
-         (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
+     if ((ka->use_master_clock && new_generation) ||
+         (ka->use_master_clock != use_master_clock))
          kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

      trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
···
      vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
      vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

-     kvm_track_tsc_matching(vcpu);
+     kvm_track_tsc_matching(vcpu, !matched);
  }

  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
···
  static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
                      struct gfn_to_pfn_cache *gpc,
-                     unsigned int offset)
+                     unsigned int offset,
+                     bool force_tsc_unstable)
  {
      struct kvm_vcpu_arch *vcpu = &v->arch;
      struct pvclock_vcpu_time_info *guest_hv_clock;
···
      }

      memcpy(guest_hv_clock, &vcpu->hv_clock, sizeof(*guest_hv_clock));
+
+     if (force_tsc_unstable)
+         guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;
+
      smp_wmb();

      guest_hv_clock->version = ++vcpu->hv_clock.version;
···
      u64 tsc_timestamp, host_tsc;
      u8 pvclock_flags;
      bool use_master_clock;
+ #ifdef CONFIG_KVM_XEN
+     /*
+      * For Xen guests we may need to override PVCLOCK_TSC_STABLE_BIT as unless
+      * explicitly told to use TSC as its clocksource Xen will not set this bit.
+      * This default behaviour led to bugs in some guest kernels which cause
+      * problems if they observe PVCLOCK_TSC_STABLE_BIT in the pvclock flags.
+      */
+     bool xen_pvclock_tsc_unstable =
+         ka->xen_hvm_config.flags & KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
+ #endif

      kernel_ns = 0;
      host_tsc = 0;
···
      vcpu->hv_clock.flags = pvclock_flags;

      if (vcpu->pv_time.active)
-         kvm_setup_guest_pvclock(v, &vcpu->pv_time, 0);
+         kvm_setup_guest_pvclock(v, &vcpu->pv_time, 0, false);
  #ifdef CONFIG_KVM_XEN
      if (vcpu->xen.vcpu_info_cache.active)
          kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_info_cache,
-                     offsetof(struct compat_vcpu_info, time));
+                     offsetof(struct compat_vcpu_info, time),
+                     xen_pvclock_tsc_unstable);
      if (vcpu->xen.vcpu_time_info_cache.active)
-         kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_time_info_cache, 0);
+         kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_time_info_cache, 0,
+                     xen_pvclock_tsc_unstable);
  #endif
      kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
      return 0;
···
       * the need to ignore the workaround.
       */
      break;
+ #ifdef CONFIG_KVM_HYPERV
  case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
  case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
  case HV_X64_MSR_SYNDBG_OPTIONS:
···
  case HV_X64_MSR_TSC_INVARIANT_CONTROL:
      return kvm_hv_set_msr_common(vcpu, msr, data,
                       msr_info->host_initiated);
+ #endif
  case MSR_IA32_BBL_CR_CTL3:
      /* Drop writes to this legacy MSR -- see rdmsr
       * counterpart for further detail.
···
       */
      msr_info->data = 0x20000000;
      break;
+ #ifdef CONFIG_KVM_HYPERV
  case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
  case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
  case HV_X64_MSR_SYNDBG_OPTIONS:
···
      return kvm_hv_get_msr_common(vcpu,
                       msr_info->index, &msr_info->data,
                       msr_info->host_initiated);
+ #endif
  case MSR_IA32_BBL_CR_CTL3:
      /* This legacy MSR exists but isn't fully documented in current
       * silicon.  It is however accessed by winxp in very narrow
···
          boot_cpu_has(X86_FEATURE_ARAT);
  }

+ #ifdef CONFIG_KVM_HYPERV
  static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
                          struct kvm_cpuid2 __user *cpuid_arg)
  {
···
          return r;

      return 0;
+ }
+ #endif
+
+ static bool kvm_is_vm_type_supported(unsigned long type)
+ {
+     return type == KVM_X86_DEFAULT_VM ||
+            (type == KVM_X86_SW_PROTECTED_VM &&
+         IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
  }

  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
···
  case KVM_CAP_PIT_STATE2:
  case KVM_CAP_SET_IDENTITY_MAP_ADDR:
  case KVM_CAP_VCPU_EVENTS:
+ #ifdef CONFIG_KVM_HYPERV
  case KVM_CAP_HYPERV:
  case KVM_CAP_HYPERV_VAPIC:
  case KVM_CAP_HYPERV_SPIN:
+ case KVM_CAP_HYPERV_TIME:
  case KVM_CAP_HYPERV_SYNIC:
  case KVM_CAP_HYPERV_SYNIC2:
  case KVM_CAP_HYPERV_VP_INDEX:
···
  case KVM_CAP_HYPERV_CPUID:
  case KVM_CAP_HYPERV_ENFORCE_CPUID:
  case KVM_CAP_SYS_HYPERV_CPUID:
+ #endif
  case KVM_CAP_PCI_SEGMENT:
  case KVM_CAP_DEBUGREGS:
  case KVM_CAP_X86_ROBUST_SINGLESTEP:
···
  case KVM_CAP_GET_TSC_KHZ:
  case KVM_CAP_KVMCLOCK_CTRL:
  case KVM_CAP_READONLY_MEM:
- case KVM_CAP_HYPERV_TIME:
  case KVM_CAP_IOAPIC_POLARITY_IGNORED:
  case KVM_CAP_TSC_DEADLINE_TIMER:
  case KVM_CAP_DISABLE_QUIRKS:
···
  case KVM_CAP_ENABLE_CAP:
  case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
  case KVM_CAP_IRQFD_RESAMPLE:
+ case KVM_CAP_MEMORY_FAULT_INFO:
      r = 1;
      break;
  case KVM_CAP_EXIT_HYPERCALL:
···
          KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
          KVM_XEN_HVM_CONFIG_SHARED_INFO |
          KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
-         KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+         KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
+         KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
      if (sched_info_on())
          r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
               KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
···
      r = kvm_x86_ops.nested_ops->get_state ?
          kvm_x86_ops.nested_ops->get_state(NULL, NULL, 0) : 0;
      break;
+ #ifdef CONFIG_KVM_HYPERV
  case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
      r = kvm_x86_ops.enable_l2_tlb_flush != NULL;
      break;
  case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
      r = kvm_x86_ops.nested_ops->enable_evmcs != NULL;
      break;
+ #endif
  case KVM_CAP_SMALLER_MAXPHYADDR:
      r = (int) allow_smaller_maxphyaddr;
      break;
···
      break;
  case KVM_CAP_X86_NOTIFY_VMEXIT:
      r = kvm_caps.has_notify_vmexit;
      break;
+ case KVM_CAP_VM_TYPES:
+     r = BIT(KVM_X86_DEFAULT_VM);
+     if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
+         r |= BIT(KVM_X86_SW_PROTECTED_VM);
+     break;
  default:
      break;
···
  case KVM_GET_MSRS:
      r = msr_io(NULL, argp, do_get_msr_feature, 1);
      break;
+ #ifdef CONFIG_KVM_HYPERV
  case KVM_GET_SUPPORTED_HV_CPUID:
      r = kvm_ioctl_get_supported_hv_cpuid(NULL, argp);
      break;
+ #endif
  case KVM_GET_DEVICE_ATTR: {
      struct kvm_device_attr attr;
      r = -EFAULT;
···
  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
                       struct kvm_enable_cap *cap)
  {
-     int r;
-     uint16_t vmcs_version;
-     void __user *user_ptr;
-
      if (cap->flags)
          return -EINVAL;

      switch (cap->cap) {
+ #ifdef CONFIG_KVM_HYPERV
      case KVM_CAP_HYPERV_SYNIC2:
          if (cap->args[0])
              return -EINVAL;
···
          return kvm_hv_activate_synic(vcpu, cap->cap ==
                           KVM_CAP_HYPERV_SYNIC2);
      case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
-         if (!kvm_x86_ops.nested_ops->enable_evmcs)
-             return -ENOTTY;
-         r = kvm_x86_ops.nested_ops->enable_evmcs(vcpu, &vmcs_version);
-         if (!r) {
-             user_ptr = (void __user *)(uintptr_t)cap->args[0];
-             if (copy_to_user(user_ptr, &vmcs_version,
-                      sizeof(vmcs_version)))
-                 r = -EFAULT;
+     {
+         int r;
+         uint16_t vmcs_version;
+         void __user *user_ptr;
+
+         if (!kvm_x86_ops.nested_ops->enable_evmcs)
+             return -ENOTTY;
+         r = kvm_x86_ops.nested_ops->enable_evmcs(vcpu, &vmcs_version);
+         if (!r) {
+             user_ptr = (void __user *)(uintptr_t)cap->args[0];
+             if (copy_to_user(user_ptr, &vmcs_version,
+                      sizeof(vmcs_version)))
+                 r = -EFAULT;
+         }
+         return r;
      }
-         return r;
      case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
          if (!kvm_x86_ops.enable_l2_tlb_flush)
              return -ENOTTY;
···
      case KVM_CAP_HYPERV_ENFORCE_CPUID:
          return kvm_hv_set_enforce_cpuid(vcpu, cap->args[0]);
+ #endif

      case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
          vcpu->arch.pv_cpuid.enforce = cap->args[0];
···
      srcu_read_unlock(&vcpu->kvm->srcu, idx);
      break;
  }
+ #ifdef CONFIG_KVM_HYPERV
  case KVM_GET_SUPPORTED_HV_CPUID:
      r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
      break;
+ #endif
  #ifdef CONFIG_KVM_XEN
  case KVM_XEN_VCPU_GET_ATTR: {
      struct kvm_xen_vcpu_attr xva;
···
      r = static_call(kvm_x86_mem_enc_unregister_region)(kvm, &region);
      break;
  }
+ #ifdef CONFIG_KVM_HYPERV
  case KVM_HYPERV_EVENTFD: {
      struct kvm_hyperv_eventfd hvevfd;
···
      r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd);
      break;
  }
+ #endif
  case KVM_SET_PMU_EVENT_FILTER:
      r = kvm_vm_ioctl_set_pmu_event_filter(kvm, argp);
      break;
···
      kvm_vm_bugged(kvm);
  }

+ static gva_t emulator_get_untagged_addr(struct x86_emulate_ctxt *ctxt,
+                     gva_t addr, unsigned int flags)
+ {
+     if (!kvm_x86_ops.get_untagged_addr)
+         return addr;
+
+     return static_call(kvm_x86_get_untagged_addr)(emul_to_vcpu(ctxt), addr, flags);
+ }
+
  static const struct x86_emulate_ops emulate_ops = {
      .vm_bugged = emulator_vm_bugged,
      .read_gpr = emulator_read_gpr,
···
      .leave_smm = emulator_leave_smm,
      .triple_fault = emulator_triple_fault,
      .set_xcr = emulator_set_xcr,
+     .get_untagged_addr = emulator_get_untagged_addr,
  };

  static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
···
  static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
  {
-     u64 eoi_exit_bitmap[4];
-
      if (!kvm_apic_hw_enabled(vcpu->arch.apic))
          return;

+ #ifdef CONFIG_KVM_HYPERV
      if (to_hv_vcpu(vcpu)) {
+         u64 eoi_exit_bitmap[4];
+
          bitmap_or((ulong *)eoi_exit_bitmap,
                vcpu->arch.ioapic_handled_vectors,
                to_hv_synic(vcpu)->vec_bitmap, 256);
          static_call_cond(kvm_x86_load_eoi_exitmap)(vcpu, eoi_exit_bitmap);
          return;
      }
-
+ #endif
      static_call_cond(kvm_x86_load_eoi_exitmap)(
          vcpu, (u64 *)vcpu->arch.ioapic_handled_vectors);
  }
···
       * the flushes are considered "remote" and not "local" because
       * the requests can be initiated from other vCPUs.
       */
+ #ifdef CONFIG_KVM_HYPERV
      if (kvm_check_request(KVM_REQ_HV_TLB_FLUSH, vcpu) &&
          kvm_hv_vcpu_flush_tlb(vcpu))
          kvm_vcpu_flush_tlb_guest(vcpu);
+ #endif

      if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
          vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
···
          vcpu_load_eoi_exitmap(vcpu);
      if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
          kvm_vcpu_reload_apic_access_page(vcpu);
+ #ifdef CONFIG_KVM_HYPERV
      if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
          vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
          vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
···
       */
      if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
          kvm_hv_process_stimers(vcpu);
+ #endif
      if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
          kvm_vcpu_update_apicv(vcpu);
      if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
···
  {
      int r;

+     vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
      vcpu->arch.l1tf_flush_l1d = true;

      for (;;) {
···
       */
      if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
          return false;
-     if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
+     if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
          return false;
  } else {
      /*
···
  }

  if (!init_event) {
-     kvm_pmu_reset(vcpu);
      vcpu->arch.smbase = 0x30000;

      vcpu->arch.msr_misc_features_enables = 0;
···
  void kvm_arch_free_vm(struct kvm *kvm)
  {
-     kfree(to_kvm_hv(kvm)->hv_pa_pg);
+ #if IS_ENABLED(CONFIG_HYPERV)
+     kfree(kvm->arch.hv_pa_pg);
+ #endif
      __kvm_arch_free_vm(kvm);
  }
···
      int ret;
      unsigned long flags;

-     if (type)
+     if (!kvm_is_vm_type_supported(type))
          return -EINVAL;
+
+     kvm->arch.vm_type = type;

      ret = kvm_page_track_init(kvm);
      if (ret)
···
          hva = slot->userspace_addr;
      }

-     for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-         struct kvm_userspace_memory_region m;
+     for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+         struct kvm_userspace_memory_region2 m;

          m.slot = id | (i << 16);
          m.flags = 0;
···
              linfo[j].disallow_lpage = 1;
          }
      }
+
+ #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+     kvm_mmu_init_memslot_memory_attributes(kvm, slot);
+ #endif

      if (kvm_page_track_create_memslot(kvm, slot, npages))
          goto out_free;
···
      switch (type) {
      case INVPCID_TYPE_INDIV_ADDR:
+         /*
+          * LAM doesn't apply to addresses that are inputs to TLB
+          * invalidation.
+          */
          if ((!pcid_enabled && (operand.pcid != 0)) ||
              is_noncanonical_address(operand.gla, vcpu)) {
              kvm_inject_gp(vcpu, 0);
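The reworked kvm_track_tsc_matching() above boils down to a small predicate. This stand-alone sketch (the function name and parameter list are ours, not KVM's) captures when a masterclock update is requested: the masterclock is usable only when every other vCPU's TSC matches and the host clocksource is TSC-based, and an update is requested when that state toggles, or when a new TSC generation starts while the masterclock is already in use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical condensation of the kvm_track_tsc_matching() decision.
 * nr_matched deliberately excludes the reference vCPU, hence the "+1". */
static bool need_masterclock_update(int nr_matched, int online_vcpus,
                                    bool host_tsc_clocksource,
                                    bool use_master_clock,
                                    bool new_generation)
{
    bool can_use = (nr_matched + 1 == online_vcpus) && host_tsc_clocksource;

    return (use_master_clock && new_generation) ||
           (use_master_clock != can_use);
}
```

Compared with the old code, this no longer requests an update on every call while the masterclock is enabled, only on generation starts and on/off transitions.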
+2
arch/x86/kvm/x86.h
···
          __reserved_bits |= X86_CR4_VMXE;        \
      if (!__cpu_has(__c, X86_FEATURE_PCID))      \
          __reserved_bits |= X86_CR4_PCIDE;       \
+     if (!__cpu_has(__c, X86_FEATURE_LAM))       \
+         __reserved_bits |= X86_CR4_LAM_SUP;     \
      __reserved_bits;                            \
  })
+8 -1
arch/x86/kvm/xen.c
···
  {
      /* Only some feature flags need to be *enabled* by userspace */
      u32 permitted_flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
-                   KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+                   KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
+                   KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
+     u32 old_flags;

      if (xhc->flags & ~permitted_flags)
          return -EINVAL;
···
      else if (!xhc->msr && kvm->arch.xen_hvm_config.msr)
          static_branch_slow_dec_deferred(&kvm_xen_enabled);

+     old_flags = kvm->arch.xen_hvm_config.flags;
      memcpy(&kvm->arch.xen_hvm_config, xhc, sizeof(*xhc));

      mutex_unlock(&kvm->arch.xen.xen_lock);
+
+     if ((old_flags ^ xhc->flags) & KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE)
+         kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
+
      return 0;
  }
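The xen.c hunk above requests a guest clock update only when the TSC_UNSTABLE config flag actually changed; XOR of the old and new flag words isolates exactly the toggled bits. A minimal sketch of that check (the flag value here is illustrative, not the real KVM_XEN_HVM_CONFIG_* UAPI constant):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit; the real constant lives in the KVM UAPI headers. */
#define CFG_PVCLOCK_TSC_UNSTABLE (1u << 7)

/* An update is needed only if the bit of interest differs between the
 * old and new configuration words. */
static bool needs_clock_update(uint32_t old_flags, uint32_t new_flags)
{
    return (old_flags ^ new_flags) & CFG_PVCLOCK_TSC_UNSTABLE;
}
```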
+3
drivers/s390/char/uvdevice.c
···
                  struct uvio_attest *uvio_attest)
  {
      struct uvio_attest __user *user_uvio_attest = (void __user *)uv_ioctl->argument_addr;
+     u32 __user *user_buf_add_len = (u32 __user *)&user_uvio_attest->add_data_len;
      void __user *user_buf_add = (void __user *)uvio_attest->add_data_addr;
      void __user *user_buf_meas = (void __user *)uvio_attest->meas_addr;
      void __user *user_buf_uid = &user_uvio_attest->config_uid;
···
      if (copy_to_user(user_buf_meas, measurement, uvio_attest->meas_len))
          return -EFAULT;
      if (add_data && copy_to_user(user_buf_add, add_data, uvio_attest->add_data_len))
          return -EFAULT;
+     if (put_user(uvio_attest->add_data_len, user_buf_add_len))
+         return -EFAULT;
      if (copy_to_user(user_buf_uid, uvcb_attest->config_uid, sizeof(uvcb_attest->config_uid)))
          return -EFAULT;
+34 -17
fs/anon_inodes.c
···
                  const struct file_operations *fops,
                  void *priv, int flags,
                  const struct inode *context_inode,
-                 bool secure)
+                 bool make_inode)
  {
      struct inode *inode;
      struct file *file;
···
      if (fops->owner && !try_module_get(fops->owner))
          return ERR_PTR(-ENOENT);

-     if (secure) {
+     if (make_inode) {
          inode = anon_inode_make_secure_inode(name, context_inode);
          if (IS_ERR(inode)) {
              file = ERR_CAST(inode);
···
  EXPORT_SYMBOL_GPL(anon_inode_getfile);

  /**
-  * anon_inode_getfile_secure - Like anon_inode_getfile(), but creates a new
+  * anon_inode_create_getfile - Like anon_inode_getfile(), but creates a new
   *                             !S_PRIVATE anon inode rather than reuse the
   *                             singleton anon inode and calls the
-  *                             inode_init_security_anon() LSM hook.  This
-  *                             allows for both the inode to have its own
-  *                             security context and for the LSM to enforce
-  *                             policy on the inode's creation.
+  *                             inode_init_security_anon() LSM hook.
   *
   * @name:    [in]    name of the "class" of the new file
   * @fops:    [in]    file operations for the new file
···
   * @context_inode:
   *           [in]    the logical relationship with the new inode (optional)
   *
+  * Create a new anonymous inode and file pair.  This can be done for two
+  * reasons:
+  *
+  * - for the inode to have its own security context, so that LSMs can enforce
+  *   policy on the inode's creation;
+  *
+  * - if the caller needs a unique inode, for example in order to customize
+  *   the size returned by fstat()
+  *
   * The LSM may use @context_inode in inode_init_security_anon(), but a
-  * reference to it is not held.  Returns the newly created file* or an error
-  * pointer.  See the anon_inode_getfile() documentation for more information.
+  * reference to it is not held.
+  *
+  * Returns the newly created file* or an error pointer.
   */
- struct file *anon_inode_getfile_secure(const char *name,
+ struct file *anon_inode_create_getfile(const char *name,
                         const struct file_operations *fops,
                         void *priv, int flags,
                         const struct inode *context_inode)
···
      return __anon_inode_getfile(name, fops, priv, flags,
                      context_inode, true);
  }
+ EXPORT_SYMBOL_GPL(anon_inode_create_getfile);

  static int __anon_inode_getfd(const char *name,
                    const struct file_operations *fops,
                    void *priv, int flags,
                    const struct inode *context_inode,
-                   bool secure)
+                   bool make_inode)
  {
      int error, fd;
      struct file *file;
···
      fd = error;

      file = __anon_inode_getfile(name, fops, priv, flags, context_inode,
-                     secure);
+                     make_inode);
      if (IS_ERR(file)) {
          error = PTR_ERR(file);
          goto err_put_unused_fd;
···
  EXPORT_SYMBOL_GPL(anon_inode_getfd);

  /**
-  * anon_inode_getfd_secure - Like anon_inode_getfd(), but creates a new
+  * anon_inode_create_getfd - Like anon_inode_getfd(), but creates a new
   * !S_PRIVATE anon inode rather than reuse the singleton anon inode, and calls
-  * the inode_init_security_anon() LSM hook.  This allows the inode to have its
-  * own security context and for a LSM to reject creation of the inode.
+  * the inode_init_security_anon() LSM hook.
   *
   * @name:    [in]    name of the "class" of the new file
   * @fops:    [in]    file operations for the new file
···
   * @context_inode:
   *           [in]    the logical relationship with the new inode (optional)
   *
+  * Create a new anonymous inode and file pair.  This can be done for two
+  * reasons:
+  *
+  * - for the inode to have its own security context, so that LSMs can enforce
+  *   policy on the inode's creation;
+  *
+  * - if the caller needs a unique inode, for example in order to customize
+  *   the size returned by fstat()
+  *
   * The LSM may use @context_inode in inode_init_security_anon(), but a
   * reference to it is not held.
+  *
+  * Returns a newly created file descriptor or an error code.
   */
- int anon_inode_getfd_secure(const char *name, const struct file_operations *fops,
+ int anon_inode_create_getfd(const char *name, const struct file_operations *fops,
                  void *priv, int flags,
                  const struct inode *context_inode)
  {
      return __anon_inode_getfd(name, fops, priv, flags, context_inode, true);
  }
- EXPORT_SYMBOL_GPL(anon_inode_getfd_secure);

  static int __init anon_inode_init(void)
  {
+3 -2
fs/userfaultfd.c
··· 1032 1032 { 1033 1033 int fd; 1034 1034 1035 - fd = anon_inode_getfd_secure("[userfaultfd]", &userfaultfd_fops, new, 1035 + fd = anon_inode_create_getfd("[userfaultfd]", &userfaultfd_fops, new, 1036 1036 O_RDONLY | (new->flags & UFFD_SHARED_FCNTL_FLAGS), inode); 1037 1037 if (fd < 0) 1038 1038 return fd; ··· 2260 2260 /* prevent the mm struct to be freed */ 2261 2261 mmgrab(ctx->mm); 2262 2262 2263 - fd = anon_inode_getfd_secure("[userfaultfd]", &userfaultfd_fops, ctx, 2263 + /* Create a new inode so that the LSM can block the creation. */ 2264 + fd = anon_inode_create_getfd("[userfaultfd]", &userfaultfd_fops, ctx, 2264 2265 O_RDONLY | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL); 2265 2266 if (fd < 0) { 2266 2267 mmdrop(ctx->mm);
+2 -2
include/linux/anon_inodes.h
··· 15 15 struct file *anon_inode_getfile(const char *name, 16 16 const struct file_operations *fops, 17 17 void *priv, int flags); 18 - struct file *anon_inode_getfile_secure(const char *name, 18 + struct file *anon_inode_create_getfile(const char *name, 19 19 const struct file_operations *fops, 20 20 void *priv, int flags, 21 21 const struct inode *context_inode); 22 22 int anon_inode_getfd(const char *name, const struct file_operations *fops, 23 23 void *priv, int flags); 24 - int anon_inode_getfd_secure(const char *name, 24 + int anon_inode_create_getfd(const char *name, 25 25 const struct file_operations *fops, 26 26 void *priv, int flags, 27 27 const struct inode *context_inode);
+125 -56
include/linux/kvm_host.h
··· 80 80 /* Two fragments for cross MMIO pages. */ 81 81 #define KVM_MAX_MMIO_FRAGMENTS 2 82 82 83 - #ifndef KVM_ADDRESS_SPACE_NUM 84 - #define KVM_ADDRESS_SPACE_NUM 1 83 + #ifndef KVM_MAX_NR_ADDRESS_SPACES 84 + #define KVM_MAX_NR_ADDRESS_SPACES 1 85 85 #endif 86 86 87 87 /* ··· 253 253 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); 254 254 #endif 255 255 256 - #ifdef KVM_ARCH_WANT_MMU_NOTIFIER 256 + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER 257 257 union kvm_mmu_notifier_arg { 258 258 pte_t pte; 259 + unsigned long attributes; 259 260 }; 260 261 261 262 struct kvm_gfn_range { ··· 589 588 u32 flags; 590 589 short id; 591 590 u16 as_id; 591 + 592 + #ifdef CONFIG_KVM_PRIVATE_MEM 593 + struct { 594 + struct file __rcu *file; 595 + pgoff_t pgoff; 596 + } gmem; 597 + #endif 592 598 }; 599 + 600 + static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) 601 + { 602 + return slot && (slot->flags & KVM_MEM_GUEST_MEMFD); 603 + } 593 604 594 605 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) 595 606 { ··· 690 677 #define KVM_MEM_SLOTS_NUM SHRT_MAX 691 678 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS) 692 679 693 - #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE 680 + #if KVM_MAX_NR_ADDRESS_SPACES == 1 681 + static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm) 682 + { 683 + return KVM_MAX_NR_ADDRESS_SPACES; 684 + } 685 + 694 686 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) 695 687 { 696 688 return 0; 689 + } 690 + #endif 691 + 692 + /* 693 + * Arch code must define kvm_arch_has_private_mem if support for private memory 694 + * is enabled. 
695 + */ 696 + #if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) 697 + static inline bool kvm_arch_has_private_mem(struct kvm *kvm) 698 + { 699 + return false; 697 700 } 698 701 #endif 699 702 ··· 750 721 struct mm_struct *mm; /* userspace tied to this vm */ 751 722 unsigned long nr_memslot_pages; 752 723 /* The two memslot sets - active and inactive (per address space) */ 753 - struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2]; 724 + struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2]; 754 725 /* The current active memslot set for each address space */ 755 - struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM]; 726 + struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES]; 756 727 struct xarray vcpu_array; 757 728 /* 758 729 * Protected by slots_lock, but can be read outside if an ··· 782 753 struct list_head vm_list; 783 754 struct mutex lock; 784 755 struct kvm_io_bus __rcu *buses[KVM_NR_BUSES]; 785 - #ifdef CONFIG_HAVE_KVM_EVENTFD 756 + #ifdef CONFIG_HAVE_KVM_IRQCHIP 786 757 struct { 787 758 spinlock_t lock; 788 759 struct list_head items; ··· 790 761 struct list_head resampler_list; 791 762 struct mutex resampler_lock; 792 763 } irqfds; 793 - struct list_head ioeventfds; 794 764 #endif 765 + struct list_head ioeventfds; 795 766 struct kvm_vm_stat stat; 796 767 struct kvm_arch arch; 797 768 refcount_t users_count; ··· 807 778 * Update side is protected by irq_lock. 
808 779 */ 809 780 struct kvm_irq_routing_table __rcu *irq_routing; 810 - #endif 811 - #ifdef CONFIG_HAVE_KVM_IRQFD 781 + 812 782 struct hlist_head irq_ack_notifier_list; 813 783 #endif 814 784 815 - #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) 785 + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER 816 786 struct mmu_notifier mmu_notifier; 817 787 unsigned long mmu_invalidate_seq; 818 788 long mmu_invalidate_in_progress; 819 - unsigned long mmu_invalidate_range_start; 820 - unsigned long mmu_invalidate_range_end; 789 + gfn_t mmu_invalidate_range_start; 790 + gfn_t mmu_invalidate_range_end; 821 791 #endif 822 792 struct list_head devices; 823 793 u64 manual_dirty_log_protect; ··· 834 806 835 807 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER 836 808 struct notifier_block pm_notifier; 809 + #endif 810 + #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES 811 + /* Protected by slots_locks (for writes) and RCU (for reads) */ 812 + struct xarray mem_attr_array; 837 813 #endif 838 814 char stats_id[KVM_STATS_NAME_SIZE]; 839 815 }; ··· 997 965 } 998 966 #endif 999 967 1000 - #ifdef CONFIG_HAVE_KVM_IRQFD 968 + #ifdef CONFIG_HAVE_KVM_IRQCHIP 1001 969 int kvm_irqfd_init(void); 1002 970 void kvm_irqfd_exit(void); 1003 971 #else ··· 1021 989 1022 990 static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id) 1023 991 { 1024 - as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM); 992 + as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES); 1025 993 return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu, 1026 994 lockdep_is_held(&kvm->slots_lock) || 1027 995 !refcount_read(&kvm->users_count)); ··· 1178 1146 }; 1179 1147 1180 1148 int kvm_set_memory_region(struct kvm *kvm, 1181 - const struct kvm_userspace_memory_region *mem); 1149 + const struct kvm_userspace_memory_region2 *mem); 1182 1150 int __kvm_set_memory_region(struct kvm *kvm, 1183 - const struct kvm_userspace_memory_region *mem); 1151 + const struct kvm_userspace_memory_region2 *mem); 
1184 1152 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); 1185 1153 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); 1186 1154 int kvm_arch_prepare_memory_region(struct kvm *kvm, ··· 1424 1392 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); 1425 1393 #endif 1426 1394 1427 - void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, 1428 - unsigned long end); 1429 - void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start, 1430 - unsigned long end); 1395 + void kvm_mmu_invalidate_begin(struct kvm *kvm); 1396 + void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end); 1397 + void kvm_mmu_invalidate_end(struct kvm *kvm); 1398 + bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); 1431 1399 1432 1400 long kvm_arch_dev_ioctl(struct file *filp, 1433 1401 unsigned int ioctl, unsigned long arg); ··· 1979 1947 extern const struct kvm_stats_header kvm_vcpu_stats_header; 1980 1948 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; 1981 1949 1982 - #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) 1950 + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER 1983 1951 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) 1984 1952 { 1985 1953 if (unlikely(kvm->mmu_invalidate_in_progress)) ··· 2002 1970 return 0; 2003 1971 } 2004 1972 2005 - static inline int mmu_invalidate_retry_hva(struct kvm *kvm, 1973 + static inline int mmu_invalidate_retry_gfn(struct kvm *kvm, 2006 1974 unsigned long mmu_seq, 2007 - unsigned long hva) 1975 + gfn_t gfn) 2008 1976 { 2009 1977 lockdep_assert_held(&kvm->mmu_lock); 2010 1978 /* ··· 2013 1981 * that might be being invalidated. Note that it may include some false 2014 1982 * positives, due to shortcuts when handing concurrent invalidations. 
2015 1983 */ 2016 - if (unlikely(kvm->mmu_invalidate_in_progress) && 2017 - hva >= kvm->mmu_invalidate_range_start && 2018 - hva < kvm->mmu_invalidate_range_end) 2019 - return 1; 1984 + if (unlikely(kvm->mmu_invalidate_in_progress)) { 1985 + /* 1986 + * Dropping mmu_lock after bumping mmu_invalidate_in_progress 1987 + * but before updating the range is a KVM bug. 1988 + */ 1989 + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA || 1990 + kvm->mmu_invalidate_range_end == INVALID_GPA)) 1991 + return 1; 1992 + 1993 + if (gfn >= kvm->mmu_invalidate_range_start && 1994 + gfn < kvm->mmu_invalidate_range_end) 1995 + return 1; 1996 + } 1997 + 2020 1998 if (kvm->mmu_invalidate_seq != mmu_seq) 2021 1999 return 1; 2022 2000 return 0; ··· 2055 2013 2056 2014 int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi); 2057 2015 2058 - #ifdef CONFIG_HAVE_KVM_EVENTFD 2059 - 2060 2016 void kvm_eventfd_init(struct kvm *kvm); 2061 2017 int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args); 2062 2018 2063 - #ifdef CONFIG_HAVE_KVM_IRQFD 2019 + #ifdef CONFIG_HAVE_KVM_IRQCHIP 2064 2020 int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args); 2065 2021 void kvm_irqfd_release(struct kvm *kvm); 2066 2022 bool kvm_notify_irqfd_resampler(struct kvm *kvm, ··· 2079 2039 { 2080 2040 return false; 2081 2041 } 2082 - #endif 2083 - 2084 - #else 2085 - 2086 - static inline void kvm_eventfd_init(struct kvm *kvm) {} 2087 - 2088 - static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args) 2089 - { 2090 - return -EINVAL; 2091 - } 2092 - 2093 - static inline void kvm_irqfd_release(struct kvm *kvm) {} 2094 - 2095 - #ifdef CONFIG_HAVE_KVM_IRQCHIP 2096 - static inline void kvm_irq_routing_update(struct kvm *kvm) 2097 - { 2098 - } 2099 - #endif 2100 - 2101 - static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) 2102 - { 2103 - return -ENOSYS; 2104 - } 2105 - 2106 - #endif /* CONFIG_HAVE_KVM_EVENTFD */ 2042 + #endif /* CONFIG_HAVE_KVM_IRQCHIP */ 
2107 2043 2108 2044 void kvm_arch_irq_routing_update(struct kvm *kvm); 2109 2045 ··· 2333 2317 2334 2318 /* Max number of entries allowed for each kvm dirty ring */ 2335 2319 #define KVM_DIRTY_RING_MAX_ENTRIES 65536 2320 + 2321 + static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, 2322 + gpa_t gpa, gpa_t size, 2323 + bool is_write, bool is_exec, 2324 + bool is_private) 2325 + { 2326 + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT; 2327 + vcpu->run->memory_fault.gpa = gpa; 2328 + vcpu->run->memory_fault.size = size; 2329 + 2330 + /* RWX flags are not (yet) defined or communicated to userspace. */ 2331 + vcpu->run->memory_fault.flags = 0; 2332 + if (is_private) 2333 + vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE; 2334 + } 2335 + 2336 + #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES 2337 + static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn) 2338 + { 2339 + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)); 2340 + } 2341 + 2342 + bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end, 2343 + unsigned long attrs); 2344 + bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, 2345 + struct kvm_gfn_range *range); 2346 + bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, 2347 + struct kvm_gfn_range *range); 2348 + 2349 + static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) 2350 + { 2351 + return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && 2352 + kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE; 2353 + } 2354 + #else 2355 + static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) 2356 + { 2357 + return false; 2358 + } 2359 + #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */ 2360 + 2361 + #ifdef CONFIG_KVM_PRIVATE_MEM 2362 + int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, 2363 + gfn_t gfn, kvm_pfn_t *pfn, int *max_order); 2364 + #else 2365 + static inline int kvm_gmem_get_pfn(struct kvm *kvm, 2366 + struct kvm_memory_slot 
*slot, gfn_t gfn, 2367 + kvm_pfn_t *pfn, int *max_order) 2368 + { 2369 + KVM_BUG_ON(1, kvm); 2370 + return -EIO; 2371 + } 2372 + #endif /* CONFIG_KVM_PRIVATE_MEM */ 2336 2373 2337 2374 #endif
+1
include/linux/kvm_types.h
··· 6 6 struct kvm; 7 7 struct kvm_async_pf; 8 8 struct kvm_device_ops; 9 + struct kvm_gfn_range; 9 10 struct kvm_interrupt; 10 11 struct kvm_irq_routing_table; 11 12 struct kvm_memory_slot;
+17
include/linux/pagemap.h
··· 206 206 AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */ 207 207 AS_STABLE_WRITES, /* must wait for writeback before modifying 208 208 folio contents */ 209 + AS_UNMOVABLE, /* The mapping cannot be moved, ever */ 209 210 }; 210 211 211 212 /** ··· 305 304 static inline void mapping_clear_stable_writes(struct address_space *mapping) 306 305 { 307 306 clear_bit(AS_STABLE_WRITES, &mapping->flags); 307 + } 308 + 309 + static inline void mapping_set_unmovable(struct address_space *mapping) 310 + { 311 + /* 312 + * It's expected unmovable mappings are also unevictable. Compaction 313 + * migrate scanner (isolate_migratepages_block()) relies on this to 314 + * reduce page locking. 315 + */ 316 + set_bit(AS_UNEVICTABLE, &mapping->flags); 317 + set_bit(AS_UNMOVABLE, &mapping->flags); 318 + } 319 + 320 + static inline bool mapping_unmovable(struct address_space *mapping) 321 + { 322 + return test_bit(AS_UNMOVABLE, &mapping->flags); 308 323 } 309 324 310 325 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
+4 -4
include/trace/events/kvm.h
··· 62 62 __entry->valid ? "valid" : "invalid") 63 63 ); 64 64 65 - #if defined(CONFIG_HAVE_KVM_IRQFD) 65 + #if defined(CONFIG_HAVE_KVM_IRQCHIP) 66 66 TRACE_EVENT(kvm_set_irq, 67 67 TP_PROTO(unsigned int gsi, int level, int irq_source_id), 68 68 TP_ARGS(gsi, level, irq_source_id), ··· 82 82 TP_printk("gsi %u level %d source %d", 83 83 __entry->gsi, __entry->level, __entry->irq_source_id) 84 84 ); 85 - #endif /* defined(CONFIG_HAVE_KVM_IRQFD) */ 85 + #endif /* defined(CONFIG_HAVE_KVM_IRQCHIP) */ 86 86 87 87 #if defined(__KVM_HAVE_IOAPIC) 88 88 #define kvm_deliver_mode \ ··· 170 170 171 171 #endif /* defined(__KVM_HAVE_IOAPIC) */ 172 172 173 - #if defined(CONFIG_HAVE_KVM_IRQFD) 173 + #if defined(CONFIG_HAVE_KVM_IRQCHIP) 174 174 175 175 #ifdef kvm_irqchips 176 176 #define kvm_ack_irq_string "irqchip %s pin %u" ··· 197 197 TP_printk(kvm_ack_irq_string, kvm_ack_irq_parm) 198 198 ); 199 199 200 - #endif /* defined(CONFIG_HAVE_KVM_IRQFD) */ 200 + #endif /* defined(CONFIG_HAVE_KVM_IRQCHIP) */ 201 201 202 202 203 203
+50 -90
include/uapi/linux/kvm.h
··· 16 16 17 17 #define KVM_API_VERSION 12 18 18 19 - /* *** Deprecated interfaces *** */ 20 - 21 - #define KVM_TRC_SHIFT 16 22 - 23 - #define KVM_TRC_ENTRYEXIT (1 << KVM_TRC_SHIFT) 24 - #define KVM_TRC_HANDLER (1 << (KVM_TRC_SHIFT + 1)) 25 - 26 - #define KVM_TRC_VMENTRY (KVM_TRC_ENTRYEXIT + 0x01) 27 - #define KVM_TRC_VMEXIT (KVM_TRC_ENTRYEXIT + 0x02) 28 - #define KVM_TRC_PAGE_FAULT (KVM_TRC_HANDLER + 0x01) 29 - 30 - #define KVM_TRC_HEAD_SIZE 12 31 - #define KVM_TRC_CYCLE_SIZE 8 32 - #define KVM_TRC_EXTRA_MAX 7 33 - 34 - #define KVM_TRC_INJ_VIRQ (KVM_TRC_HANDLER + 0x02) 35 - #define KVM_TRC_REDELIVER_EVT (KVM_TRC_HANDLER + 0x03) 36 - #define KVM_TRC_PEND_INTR (KVM_TRC_HANDLER + 0x04) 37 - #define KVM_TRC_IO_READ (KVM_TRC_HANDLER + 0x05) 38 - #define KVM_TRC_IO_WRITE (KVM_TRC_HANDLER + 0x06) 39 - #define KVM_TRC_CR_READ (KVM_TRC_HANDLER + 0x07) 40 - #define KVM_TRC_CR_WRITE (KVM_TRC_HANDLER + 0x08) 41 - #define KVM_TRC_DR_READ (KVM_TRC_HANDLER + 0x09) 42 - #define KVM_TRC_DR_WRITE (KVM_TRC_HANDLER + 0x0A) 43 - #define KVM_TRC_MSR_READ (KVM_TRC_HANDLER + 0x0B) 44 - #define KVM_TRC_MSR_WRITE (KVM_TRC_HANDLER + 0x0C) 45 - #define KVM_TRC_CPUID (KVM_TRC_HANDLER + 0x0D) 46 - #define KVM_TRC_INTR (KVM_TRC_HANDLER + 0x0E) 47 - #define KVM_TRC_NMI (KVM_TRC_HANDLER + 0x0F) 48 - #define KVM_TRC_VMMCALL (KVM_TRC_HANDLER + 0x10) 49 - #define KVM_TRC_HLT (KVM_TRC_HANDLER + 0x11) 50 - #define KVM_TRC_CLTS (KVM_TRC_HANDLER + 0x12) 51 - #define KVM_TRC_LMSW (KVM_TRC_HANDLER + 0x13) 52 - #define KVM_TRC_APIC_ACCESS (KVM_TRC_HANDLER + 0x14) 53 - #define KVM_TRC_TDP_FAULT (KVM_TRC_HANDLER + 0x15) 54 - #define KVM_TRC_GTLB_WRITE (KVM_TRC_HANDLER + 0x16) 55 - #define KVM_TRC_STLB_WRITE (KVM_TRC_HANDLER + 0x17) 56 - #define KVM_TRC_STLB_INVAL (KVM_TRC_HANDLER + 0x18) 57 - #define KVM_TRC_PPC_INSTR (KVM_TRC_HANDLER + 0x19) 58 - 59 - struct kvm_user_trace_setup { 60 - __u32 buf_size; 61 - __u32 buf_nr; 62 - }; 63 - 64 - #define __KVM_DEPRECATED_MAIN_W_0x06 \ 65 - _IOW(KVMIO, 0x06, struct 
kvm_user_trace_setup) 66 - #define __KVM_DEPRECATED_MAIN_0x07 _IO(KVMIO, 0x07) 67 - #define __KVM_DEPRECATED_MAIN_0x08 _IO(KVMIO, 0x08) 68 - 69 - #define __KVM_DEPRECATED_VM_R_0x70 _IOR(KVMIO, 0x70, struct kvm_assigned_irq) 70 - 71 - struct kvm_breakpoint { 72 - __u32 enabled; 73 - __u32 padding; 74 - __u64 address; 75 - }; 76 - 77 - struct kvm_debug_guest { 78 - __u32 enabled; 79 - __u32 pad; 80 - struct kvm_breakpoint breakpoints[4]; 81 - __u32 singlestep; 82 - }; 83 - 84 - #define __KVM_DEPRECATED_VCPU_W_0x87 _IOW(KVMIO, 0x87, struct kvm_debug_guest) 85 - 86 - /* *** End of deprecated interfaces *** */ 87 - 88 - 89 19 /* for KVM_SET_USER_MEMORY_REGION */ 90 20 struct kvm_userspace_memory_region { 91 21 __u32 slot; ··· 25 95 __u64 userspace_addr; /* start of the userspace allocated memory */ 26 96 }; 27 97 98 + /* for KVM_SET_USER_MEMORY_REGION2 */ 99 + struct kvm_userspace_memory_region2 { 100 + __u32 slot; 101 + __u32 flags; 102 + __u64 guest_phys_addr; 103 + __u64 memory_size; 104 + __u64 userspace_addr; 105 + __u64 guest_memfd_offset; 106 + __u32 guest_memfd; 107 + __u32 pad1; 108 + __u64 pad2[14]; 109 + }; 110 + 28 111 /* 29 112 * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for 30 113 * userspace, other bits are reserved for kvm internal use which are defined ··· 45 102 */ 46 103 #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) 47 104 #define KVM_MEM_READONLY (1UL << 1) 105 + #define KVM_MEM_GUEST_MEMFD (1UL << 2) 48 106 49 107 /* for KVM_IRQ_LINE */ 50 108 struct kvm_irq_level { ··· 209 265 #define KVM_EXIT_RISCV_CSR 36 210 266 #define KVM_EXIT_NOTIFY 37 211 267 #define KVM_EXIT_LOONGARCH_IOCSR 38 268 + #define KVM_EXIT_MEMORY_FAULT 39 212 269 213 270 /* For KVM_EXIT_INTERNAL_ERROR */ 214 271 /* Emulate instruction failed. 
*/ ··· 463 518 #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) 464 519 __u32 flags; 465 520 } notify; 521 + /* KVM_EXIT_MEMORY_FAULT */ 522 + struct { 523 + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3) 524 + __u64 flags; 525 + __u64 gpa; 526 + __u64 size; 527 + } memory_fault; 466 528 /* Fix the size of the union. */ 467 529 char padding[256]; 468 530 }; ··· 897 945 */ 898 946 #define KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO, 0x04) /* in bytes */ 899 947 #define KVM_GET_SUPPORTED_CPUID _IOWR(KVMIO, 0x05, struct kvm_cpuid2) 900 - #define KVM_TRACE_ENABLE __KVM_DEPRECATED_MAIN_W_0x06 901 - #define KVM_TRACE_PAUSE __KVM_DEPRECATED_MAIN_0x07 902 - #define KVM_TRACE_DISABLE __KVM_DEPRECATED_MAIN_0x08 903 948 #define KVM_GET_EMULATED_CPUID _IOWR(KVMIO, 0x09, struct kvm_cpuid2) 904 949 #define KVM_GET_MSR_FEATURE_INDEX_LIST _IOWR(KVMIO, 0x0a, struct kvm_msr_list) 905 950 ··· 1150 1201 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228 1151 1202 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229 1152 1203 #define KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES 230 1204 + #define KVM_CAP_USER_MEMORY2 231 1205 + #define KVM_CAP_MEMORY_FAULT_INFO 232 1206 + #define KVM_CAP_MEMORY_ATTRIBUTES 233 1207 + #define KVM_CAP_GUEST_MEMFD 234 1208 + #define KVM_CAP_VM_TYPES 235 1153 1209 1154 1210 #ifdef KVM_CAP_IRQ_ROUTING 1155 1211 ··· 1245 1291 #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) 1246 1292 #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) 1247 1293 #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6) 1294 + #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7) 1248 1295 1249 1296 struct kvm_xen_hvm_config { 1250 1297 __u32 flags; ··· 1438 1483 struct kvm_userspace_memory_region) 1439 1484 #define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47) 1440 1485 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64) 1486 + #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \ 1487 + struct kvm_userspace_memory_region2) 1441 1488 1442 1489 /* enable ucontrol for s390 */ 1443 1490 struct kvm_s390_ucas_mapping 
{ ··· 1464 1507 _IOW(KVMIO, 0x67, struct kvm_coalesced_mmio_zone) 1465 1508 #define KVM_UNREGISTER_COALESCED_MMIO \ 1466 1509 _IOW(KVMIO, 0x68, struct kvm_coalesced_mmio_zone) 1467 - #define KVM_ASSIGN_PCI_DEVICE _IOR(KVMIO, 0x69, \ 1468 - struct kvm_assigned_pci_dev) 1469 1510 #define KVM_SET_GSI_ROUTING _IOW(KVMIO, 0x6a, struct kvm_irq_routing) 1470 - /* deprecated, replaced by KVM_ASSIGN_DEV_IRQ */ 1471 - #define KVM_ASSIGN_IRQ __KVM_DEPRECATED_VM_R_0x70 1472 - #define KVM_ASSIGN_DEV_IRQ _IOW(KVMIO, 0x70, struct kvm_assigned_irq) 1473 1511 #define KVM_REINJECT_CONTROL _IO(KVMIO, 0x71) 1474 - #define KVM_DEASSIGN_PCI_DEVICE _IOW(KVMIO, 0x72, \ 1475 - struct kvm_assigned_pci_dev) 1476 - #define KVM_ASSIGN_SET_MSIX_NR _IOW(KVMIO, 0x73, \ 1477 - struct kvm_assigned_msix_nr) 1478 - #define KVM_ASSIGN_SET_MSIX_ENTRY _IOW(KVMIO, 0x74, \ 1479 - struct kvm_assigned_msix_entry) 1480 - #define KVM_DEASSIGN_DEV_IRQ _IOW(KVMIO, 0x75, struct kvm_assigned_irq) 1481 1512 #define KVM_IRQFD _IOW(KVMIO, 0x76, struct kvm_irqfd) 1482 1513 #define KVM_CREATE_PIT2 _IOW(KVMIO, 0x77, struct kvm_pit_config) 1483 1514 #define KVM_SET_BOOT_CPU_ID _IO(KVMIO, 0x78) ··· 1482 1537 * KVM_CAP_VM_TSC_CONTROL to set defaults for a VM */ 1483 1538 #define KVM_SET_TSC_KHZ _IO(KVMIO, 0xa2) 1484 1539 #define KVM_GET_TSC_KHZ _IO(KVMIO, 0xa3) 1485 - /* Available with KVM_CAP_PCI_2_3 */ 1486 - #define KVM_ASSIGN_SET_INTX_MASK _IOW(KVMIO, 0xa4, \ 1487 - struct kvm_assigned_pci_dev) 1488 1540 /* Available with KVM_CAP_SIGNAL_MSI */ 1489 1541 #define KVM_SIGNAL_MSI _IOW(KVMIO, 0xa5, struct kvm_msi) 1490 1542 /* Available with KVM_CAP_PPC_GET_SMMU_INFO */ ··· 1534 1592 #define KVM_SET_SREGS _IOW(KVMIO, 0x84, struct kvm_sregs) 1535 1593 #define KVM_TRANSLATE _IOWR(KVMIO, 0x85, struct kvm_translation) 1536 1594 #define KVM_INTERRUPT _IOW(KVMIO, 0x86, struct kvm_interrupt) 1537 - /* KVM_DEBUG_GUEST is no longer supported, use KVM_SET_GUEST_DEBUG instead */ 1538 - #define KVM_DEBUG_GUEST 
__KVM_DEPRECATED_VCPU_W_0x87 1539 1595 #define KVM_GET_MSRS _IOWR(KVMIO, 0x88, struct kvm_msrs) 1540 1596 #define KVM_SET_MSRS _IOW(KVMIO, 0x89, struct kvm_msrs) 1541 1597 #define KVM_SET_CPUID _IOW(KVMIO, 0x8a, struct kvm_cpuid) ··· 2206 2266 2207 2267 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */ 2208 2268 #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0) 2269 + 2270 + /* Available with KVM_CAP_MEMORY_ATTRIBUTES */ 2271 + #define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd2, struct kvm_memory_attributes) 2272 + 2273 + struct kvm_memory_attributes { 2274 + __u64 address; 2275 + __u64 size; 2276 + __u64 attributes; 2277 + __u64 flags; 2278 + }; 2279 + 2280 + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) 2281 + 2282 + #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd) 2283 + 2284 + struct kvm_create_guest_memfd { 2285 + __u64 size; 2286 + __u64 flags; 2287 + __u64 reserved[6]; 2288 + }; 2209 2289 2210 2290 #endif /* __LINUX_KVM_H */
+2 -1
io_uring/io_uring.c
··· 3787 3787 */ 3788 3788 static struct file *io_uring_get_file(struct io_ring_ctx *ctx) 3789 3789 { 3790 - return anon_inode_getfile_secure("[io_uring]", &io_uring_fops, ctx, 3790 + /* Create a new inode so that the LSM can block the creation. */ 3791 + return anon_inode_create_getfile("[io_uring]", &io_uring_fops, ctx, 3791 3792 O_RDWR | O_CLOEXEC, NULL); 3792 3793 } 3793 3794
+31 -12
mm/compaction.c
··· 882 882 883 883 /* Time to isolate some pages for migration */ 884 884 for (; low_pfn < end_pfn; low_pfn++) { 885 + bool is_dirty, is_unevictable; 885 886 886 887 if (skip_on_failure && low_pfn >= next_skip_pfn) { 887 888 /* ··· 1080 1079 if (!folio_test_lru(folio)) 1081 1080 goto isolate_fail_put; 1082 1081 1082 + is_unevictable = folio_test_unevictable(folio); 1083 + 1083 1084 /* Compaction might skip unevictable pages but CMA takes them */ 1084 - if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio)) 1085 + if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable) 1085 1086 goto isolate_fail_put; 1086 1087 1087 1088 /* ··· 1095 1092 if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio)) 1096 1093 goto isolate_fail_put; 1097 1094 1098 - if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) { 1099 - bool migrate_dirty; 1095 + is_dirty = folio_test_dirty(folio); 1096 + 1097 + if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) || 1098 + (mapping && is_unevictable)) { 1099 + bool migrate_dirty = true; 1100 + bool is_unmovable; 1100 1101 1101 1102 /* 1102 1103 * Only folios without mappings or that have 1103 - * a ->migrate_folio callback are possible to 1104 - * migrate without blocking. However, we may 1105 - * be racing with truncation, which can free 1106 - * the mapping. Truncation holds the folio lock 1107 - * until after the folio is removed from the page 1108 - * cache so holding it ourselves is sufficient. 1104 + * a ->migrate_folio callback are possible to migrate 1105 + * without blocking. 1106 + * 1107 + * Folios from unmovable mappings are not migratable. 1108 + * 1109 + * However, we can be racing with truncation, which can 1110 + * free the mapping that we need to check. Truncation 1111 + * holds the folio lock until after the folio is removed 1112 + * from the page so holding it ourselves is sufficient. 
1113 + * 1114 + * To avoid locking the folio just to check unmovable, 1115 + * assume every unmovable folio is also unevictable, 1116 + * which is a cheaper test. If our assumption goes 1117 + * wrong, it's not a correctness bug, just potentially 1118 + * wasted cycles. 1109 1119 */ 1110 1120 if (!folio_trylock(folio)) 1111 1121 goto isolate_fail_put; 1112 1122 1113 1123 mapping = folio_mapping(folio); 1114 - migrate_dirty = !mapping || 1115 - mapping->a_ops->migrate_folio; 1124 + if ((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) { 1125 + migrate_dirty = !mapping || 1126 + mapping->a_ops->migrate_folio; 1127 + } 1128 + is_unmovable = mapping && mapping_unmovable(mapping); 1116 1129 folio_unlock(folio); 1117 - if (!migrate_dirty) 1130 + if (!migrate_dirty || is_unmovable) 1118 1131 goto isolate_fail_put; 1119 1132 } 1120 1133
+2
mm/migrate.c
··· 962 962 963 963 if (!mapping) 964 964 rc = migrate_folio(mapping, dst, src, mode); 965 + else if (mapping_unmovable(mapping)) 966 + rc = -EOPNOTSUPP; 965 967 else if (mapping->a_ops->migrate_folio) 966 968 /* 967 969 * Most folios have a mapping and most filesystems
+6 -3
tools/testing/selftests/kvm/Makefile
··· 77 77 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush 78 78 TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test 79 79 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test 80 - TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test 81 80 TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test 82 81 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test 83 82 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test 84 83 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test 84 + TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test 85 + TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test 85 86 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id 86 87 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test 87 88 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test ··· 125 124 TEST_GEN_PROGS_x86_64 += demand_paging_test 126 125 TEST_GEN_PROGS_x86_64 += dirty_log_test 127 126 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test 127 + TEST_GEN_PROGS_x86_64 += guest_memfd_test 128 128 TEST_GEN_PROGS_x86_64 += guest_print_test 129 129 TEST_GEN_PROGS_x86_64 += hardware_disable_test 130 130 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus ··· 186 184 187 185 TEST_GEN_PROGS_riscv += demand_paging_test 188 186 TEST_GEN_PROGS_riscv += dirty_log_test 189 - TEST_GEN_PROGS_riscv += guest_print_test 190 187 TEST_GEN_PROGS_riscv += get-reg-list 188 + TEST_GEN_PROGS_riscv += guest_print_test 189 + TEST_GEN_PROGS_riscv += kvm_binary_stats_test 191 190 TEST_GEN_PROGS_riscv += kvm_create_max_vcpus 192 191 TEST_GEN_PROGS_riscv += kvm_page_table_test 193 192 TEST_GEN_PROGS_riscv += set_memory_region_test 194 - TEST_GEN_PROGS_riscv += kvm_binary_stats_test 193 + TEST_GEN_PROGS_riscv += steal_time 195 194 196 195 SPLIT_TESTS += get-reg-list 197 196
+1 -1
tools/testing/selftests/kvm/aarch64/page_fault_test.c
··· 705 705 706 706 print_test_banner(mode, p); 707 707 708 - vm = ____vm_create(mode); 708 + vm = ____vm_create(VM_SHAPE(mode)); 709 709 setup_memslots(vm, p); 710 710 kvm_vm_elf_load(vm, program_invocation_name); 711 711 setup_ucall(vm);
+1 -1
tools/testing/selftests/kvm/dirty_log_test.c
··· 699 699 700 700 pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode)); 701 701 702 - vm = __vm_create(mode, 1, extra_mem_pages); 702 + vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages); 703 703 704 704 log_mode_create_vm_done(vm); 705 705 *vcpu = vm_vcpu_add(vm, 0, guest_code);
+198
tools/testing/selftests/kvm/guest_memfd_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright Intel Corporation, 2023 4 + * 5 + * Author: Chao Peng <chao.p.peng@linux.intel.com> 6 + */ 7 + 8 + #define _GNU_SOURCE 9 + #include <stdlib.h> 10 + #include <string.h> 11 + #include <unistd.h> 12 + #include <errno.h> 13 + #include <stdio.h> 14 + #include <fcntl.h> 15 + 16 + #include <linux/bitmap.h> 17 + #include <linux/falloc.h> 18 + #include <sys/mman.h> 19 + #include <sys/types.h> 20 + #include <sys/stat.h> 21 + 22 + #include "test_util.h" 23 + #include "kvm_util_base.h" 24 + 25 + static void test_file_read_write(int fd) 26 + { 27 + char buf[64]; 28 + 29 + TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0, 30 + "read on a guest_mem fd should fail"); 31 + TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0, 32 + "write on a guest_mem fd should fail"); 33 + TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0, 34 + "pread on a guest_mem fd should fail"); 35 + TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0, 36 + "pwrite on a guest_mem fd should fail"); 37 + } 38 + 39 + static void test_mmap(int fd, size_t page_size) 40 + { 41 + char *mem; 42 + 43 + mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); 44 + TEST_ASSERT_EQ(mem, MAP_FAILED); 45 + } 46 + 47 + static void test_file_size(int fd, size_t page_size, size_t total_size) 48 + { 49 + struct stat sb; 50 + int ret; 51 + 52 + ret = fstat(fd, &sb); 53 + TEST_ASSERT(!ret, "fstat should succeed"); 54 + TEST_ASSERT_EQ(sb.st_size, total_size); 55 + TEST_ASSERT_EQ(sb.st_blksize, page_size); 56 + } 57 + 58 + static void test_fallocate(int fd, size_t page_size, size_t total_size) 59 + { 60 + int ret; 61 + 62 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size); 63 + TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed"); 64 + 65 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 66 + page_size - 1, page_size); 67 + TEST_ASSERT(ret, "fallocate with unaligned offset should fail"); 68 + 69 + ret = fallocate(fd, 
FALLOC_FL_KEEP_SIZE, total_size, page_size); 70 + TEST_ASSERT(ret, "fallocate beginning at total_size should fail"); 71 + 72 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size); 73 + TEST_ASSERT(ret, "fallocate beginning after total_size should fail"); 74 + 75 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 76 + total_size, page_size); 77 + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed"); 78 + 79 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 80 + total_size + page_size, page_size); 81 + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed"); 82 + 83 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 84 + page_size, page_size - 1); 85 + TEST_ASSERT(ret, "fallocate with unaligned size should fail"); 86 + 87 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 88 + page_size, page_size); 89 + TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed"); 90 + 91 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size); 92 + TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed"); 93 + } 94 + 95 + static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size) 96 + { 97 + struct { 98 + off_t offset; 99 + off_t len; 100 + } testcases[] = { 101 + {0, 1}, 102 + {0, page_size - 1}, 103 + {0, page_size + 1}, 104 + 105 + {1, 1}, 106 + {1, page_size - 1}, 107 + {1, page_size}, 108 + {1, page_size + 1}, 109 + 110 + {page_size, 1}, 111 + {page_size, page_size - 1}, 112 + {page_size, page_size + 1}, 113 + }; 114 + int ret, i; 115 + 116 + for (i = 0; i < ARRAY_SIZE(testcases); i++) { 117 + ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 118 + testcases[i].offset, testcases[i].len); 119 + TEST_ASSERT(ret == -1 && errno == EINVAL, 120 + "PUNCH_HOLE with !PAGE_SIZE offset (%lx) and/or length (%lx) should fail", 121 + testcases[i].offset, testcases[i].len); 122 + } 123 + 
} 124 + 125 + static void test_create_guest_memfd_invalid(struct kvm_vm *vm) 126 + { 127 + size_t page_size = getpagesize(); 128 + uint64_t flag; 129 + size_t size; 130 + int fd; 131 + 132 + for (size = 1; size < page_size; size++) { 133 + fd = __vm_create_guest_memfd(vm, size, 0); 134 + TEST_ASSERT(fd == -1 && errno == EINVAL, 135 + "guest_memfd() with non-page-aligned page size '0x%lx' should fail with EINVAL", 136 + size); 137 + } 138 + 139 + for (flag = 0; flag; flag <<= 1) { 140 + fd = __vm_create_guest_memfd(vm, page_size, flag); 141 + TEST_ASSERT(fd == -1 && errno == EINVAL, 142 + "guest_memfd() with flag '0x%lx' should fail with EINVAL", 143 + flag); 144 + } 145 + } 146 + 147 + static void test_create_guest_memfd_multiple(struct kvm_vm *vm) 148 + { 149 + int fd1, fd2, ret; 150 + struct stat st1, st2; 151 + 152 + fd1 = __vm_create_guest_memfd(vm, 4096, 0); 153 + TEST_ASSERT(fd1 != -1, "memfd creation should succeed"); 154 + 155 + ret = fstat(fd1, &st1); 156 + TEST_ASSERT(ret != -1, "memfd fstat should succeed"); 157 + TEST_ASSERT(st1.st_size == 4096, "memfd st_size should match requested size"); 158 + 159 + fd2 = __vm_create_guest_memfd(vm, 8192, 0); 160 + TEST_ASSERT(fd2 != -1, "memfd creation should succeed"); 161 + 162 + ret = fstat(fd2, &st2); 163 + TEST_ASSERT(ret != -1, "memfd fstat should succeed"); 164 + TEST_ASSERT(st2.st_size == 8192, "second memfd st_size should match requested size"); 165 + 166 + ret = fstat(fd1, &st1); 167 + TEST_ASSERT(ret != -1, "memfd fstat should succeed"); 168 + TEST_ASSERT(st1.st_size == 4096, "first memfd st_size should still match requested size"); 169 + TEST_ASSERT(st1.st_ino != st2.st_ino, "different memfd should have different inode numbers"); 170 + } 171 + 172 + int main(int argc, char *argv[]) 173 + { 174 + size_t page_size; 175 + size_t total_size; 176 + int fd; 177 + struct kvm_vm *vm; 178 + 179 + TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD)); 180 + 181 + page_size = getpagesize(); 182 + total_size = page_size * 
4; 183 + 184 + vm = vm_create_barebones(); 185 + 186 + test_create_guest_memfd_invalid(vm); 187 + test_create_guest_memfd_multiple(vm); 188 + 189 + fd = vm_create_guest_memfd(vm, total_size, 0); 190 + 191 + test_file_read_write(fd); 192 + test_mmap(fd, page_size); 193 + test_file_size(fd, page_size, total_size); 194 + test_fallocate(fd, page_size, total_size); 195 + test_invalid_punch_hole(fd, page_size, total_size); 196 + 197 + close(fd); 198 + }
+2 -2
tools/testing/selftests/kvm/include/aarch64/processor.h
··· 119 119 /* Access flag update enable/disable */ 120 120 #define TCR_EL1_HA (1ULL << 39) 121 121 122 - void aarch64_get_supported_page_sizes(uint32_t ipa, 123 - bool *ps4k, bool *ps16k, bool *ps64k); 122 + void aarch64_get_supported_page_sizes(uint32_t ipa, uint32_t *ipa4k, 123 + uint32_t *ipa16k, uint32_t *ipa64k); 124 124 125 125 void vm_init_descriptor_tables(struct kvm_vm *vm); 126 126 void vcpu_init_descriptor_tables(struct kvm_vcpu *vcpu);
+2 -2
tools/testing/selftests/kvm/include/guest_modes.h
··· 11 11 12 12 extern struct guest_mode guest_modes[NUM_VM_MODES]; 13 13 14 - #define guest_mode_append(mode, supported, enabled) ({ \ 15 - guest_modes[mode] = (struct guest_mode){ supported, enabled }; \ 14 + #define guest_mode_append(mode, enabled) ({ \ 15 + guest_modes[mode] = (struct guest_mode){ (enabled), (enabled) }; \ 16 16 }) 17 17 18 18 void guest_modes_append_default(void);
+183 -34
tools/testing/selftests/kvm/include/kvm_util_base.h
··· 44 44 typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */ 45 45 46 46 struct userspace_mem_region { 47 - struct kvm_userspace_memory_region region; 47 + struct kvm_userspace_memory_region2 region; 48 48 struct sparsebit *unused_phy_pages; 49 49 int fd; 50 50 off_t offset; ··· 129 129 const char *name; 130 130 long capability; 131 131 int feature; 132 + int feature_type; 132 133 bool finalize; 133 134 __u64 *regs; 134 135 __u64 regs_n; ··· 172 171 173 172 enum vm_guest_mode { 174 173 VM_MODE_P52V48_4K, 174 + VM_MODE_P52V48_16K, 175 175 VM_MODE_P52V48_64K, 176 176 VM_MODE_P48V48_4K, 177 177 VM_MODE_P48V48_16K, ··· 189 187 VM_MODE_P36V47_16K, 190 188 NUM_VM_MODES, 191 189 }; 190 + 191 + struct vm_shape { 192 + enum vm_guest_mode mode; 193 + unsigned int type; 194 + }; 195 + 196 + #define VM_TYPE_DEFAULT 0 197 + 198 + #define VM_SHAPE(__mode) \ 199 + ({ \ 200 + struct vm_shape shape = { \ 201 + .mode = (__mode), \ 202 + .type = VM_TYPE_DEFAULT \ 203 + }; \ 204 + \ 205 + shape; \ 206 + }) 192 207 193 208 #if defined(__aarch64__) 194 209 ··· 239 220 240 221 #endif 241 222 223 + #define VM_SHAPE_DEFAULT VM_SHAPE(VM_MODE_DEFAULT) 224 + 242 225 #define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT) 243 226 #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE) 244 227 ··· 269 248 #define __KVM_SYSCALL_ERROR(_name, _ret) \ 270 249 "%s failed, rc: %i errno: %i (%s)", (_name), (_ret), errno, strerror(errno) 271 250 251 + /* 252 + * Use the "inner", double-underscore macro when reporting errors from within 253 + * other macros so that the name of ioctl() and not its literal numeric value 254 + * is printed on error. The "outer" macro is strongly preferred when reporting 255 + * errors "directly", i.e. without an additional layer of macros, as it reduces 256 + * the probability of passing in the wrong string. 
257 + */ 272 258 #define __KVM_IOCTL_ERROR(_name, _ret) __KVM_SYSCALL_ERROR(_name, _ret) 273 259 #define KVM_IOCTL_ERROR(_ioctl, _ret) __KVM_IOCTL_ERROR(#_ioctl, _ret) 274 260 ··· 288 260 #define __kvm_ioctl(kvm_fd, cmd, arg) \ 289 261 kvm_do_ioctl(kvm_fd, cmd, arg) 290 262 291 - 292 - #define _kvm_ioctl(kvm_fd, cmd, name, arg) \ 263 + #define kvm_ioctl(kvm_fd, cmd, arg) \ 293 264 ({ \ 294 265 int ret = __kvm_ioctl(kvm_fd, cmd, arg); \ 295 266 \ 296 - TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(name, ret)); \ 267 + TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(#cmd, ret)); \ 297 268 }) 298 - 299 - #define kvm_ioctl(kvm_fd, cmd, arg) \ 300 - _kvm_ioctl(kvm_fd, cmd, #cmd, arg) 301 269 302 270 static __always_inline void static_assert_is_vm(struct kvm_vm *vm) { } 303 271 ··· 303 279 kvm_do_ioctl((vm)->fd, cmd, arg); \ 304 280 }) 305 281 306 - #define _vm_ioctl(vm, cmd, name, arg) \ 282 + /* 283 + * Assert that a VM or vCPU ioctl() succeeded, with extra magic to detect if 284 + * the ioctl() failed because KVM killed/bugged the VM. To detect a dead VM, 285 + * probe KVM_CAP_USER_MEMORY, which (a) has been supported by KVM since before 286 + * selftests existed and (b) should never outright fail, i.e. is supposed to 287 + * return 0 or 1. If KVM kills a VM, KVM returns -EIO for all ioctl()s for the 288 + * VM and its vCPUs, including KVM_CHECK_EXTENSION. 
289 + */ 290 + #define __TEST_ASSERT_VM_VCPU_IOCTL(cond, name, ret, vm) \ 291 + do { \ 292 + int __errno = errno; \ 293 + \ 294 + static_assert_is_vm(vm); \ 295 + \ 296 + if (cond) \ 297 + break; \ 298 + \ 299 + if (errno == EIO && \ 300 + __vm_ioctl(vm, KVM_CHECK_EXTENSION, (void *)KVM_CAP_USER_MEMORY) < 0) { \ 301 + TEST_ASSERT(errno == EIO, "KVM killed the VM, should return -EIO"); \ 302 + TEST_FAIL("KVM killed/bugged the VM, check the kernel log for clues"); \ 303 + } \ 304 + errno = __errno; \ 305 + TEST_ASSERT(cond, __KVM_IOCTL_ERROR(name, ret)); \ 306 + } while (0) 307 + 308 + #define TEST_ASSERT_VM_VCPU_IOCTL(cond, cmd, ret, vm) \ 309 + __TEST_ASSERT_VM_VCPU_IOCTL(cond, #cmd, ret, vm) 310 + 311 + #define vm_ioctl(vm, cmd, arg) \ 307 312 ({ \ 308 313 int ret = __vm_ioctl(vm, cmd, arg); \ 309 314 \ 310 - TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(name, ret)); \ 315 + __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, vm); \ 311 316 }) 312 - 313 - #define vm_ioctl(vm, cmd, arg) \ 314 - _vm_ioctl(vm, cmd, #cmd, arg) 315 - 316 317 317 318 static __always_inline void static_assert_is_vcpu(struct kvm_vcpu *vcpu) { } 318 319 ··· 347 298 kvm_do_ioctl((vcpu)->fd, cmd, arg); \ 348 299 }) 349 300 350 - #define _vcpu_ioctl(vcpu, cmd, name, arg) \ 301 + #define vcpu_ioctl(vcpu, cmd, arg) \ 351 302 ({ \ 352 303 int ret = __vcpu_ioctl(vcpu, cmd, arg); \ 353 304 \ 354 - TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(name, ret)); \ 305 + __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, (vcpu)->vm); \ 355 306 }) 356 - 357 - #define vcpu_ioctl(vcpu, cmd, arg) \ 358 - _vcpu_ioctl(vcpu, cmd, #cmd, arg) 359 307 360 308 /* 361 309 * Looks up and returns the value corresponding to the capability ··· 362 316 { 363 317 int ret = __vm_ioctl(vm, KVM_CHECK_EXTENSION, (void *)cap); 364 318 365 - TEST_ASSERT(ret >= 0, KVM_IOCTL_ERROR(KVM_CHECK_EXTENSION, ret)); 319 + TEST_ASSERT_VM_VCPU_IOCTL(ret >= 0, KVM_CHECK_EXTENSION, ret, vm); 366 320 return ret; 367 321 } 368 322 ··· 377 331 struct kvm_enable_cap enable_cap = 
{ .cap = cap, .args = { arg0 } }; 378 332 379 333 vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap); 334 + } 335 + 336 + static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa, 337 + uint64_t size, uint64_t attributes) 338 + { 339 + struct kvm_memory_attributes attr = { 340 + .attributes = attributes, 341 + .address = gpa, 342 + .size = size, 343 + .flags = 0, 344 + }; 345 + 346 + /* 347 + * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes. These flows 348 + * need significant enhancements to support multiple attributes. 349 + */ 350 + TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE, 351 + "Update me to support multiple attributes!"); 352 + 353 + vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr); 354 + } 355 + 356 + 357 + static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa, 358 + uint64_t size) 359 + { 360 + vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE); 361 + } 362 + 363 + static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa, 364 + uint64_t size) 365 + { 366 + vm_set_memory_attributes(vm, gpa, size, 0); 367 + } 368 + 369 + void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size, 370 + bool punch_hole); 371 + 372 + static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa, 373 + uint64_t size) 374 + { 375 + vm_guest_mem_fallocate(vm, gpa, size, true); 376 + } 377 + 378 + static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa, 379 + uint64_t size) 380 + { 381 + vm_guest_mem_fallocate(vm, gpa, size, false); 380 382 } 381 383 382 384 void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size); ··· 469 375 { 470 376 int fd = __vm_ioctl(vm, KVM_GET_STATS_FD, NULL); 471 377 472 - TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_GET_STATS_FD, fd)); 378 + TEST_ASSERT_VM_VCPU_IOCTL(fd >= 0, KVM_GET_STATS_FD, fd, vm); 473 379 return fd; 474 380 } 475 381 ··· 525 431 526 432 void vm_create_irqchip(struct kvm_vm *vm); 527 433 434 + 
static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size, 435 + uint64_t flags) 436 + { 437 + struct kvm_create_guest_memfd guest_memfd = { 438 + .size = size, 439 + .flags = flags, 440 + }; 441 + 442 + return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &guest_memfd); 443 + } 444 + 445 + static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size, 446 + uint64_t flags) 447 + { 448 + int fd = __vm_create_guest_memfd(vm, size, flags); 449 + 450 + TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd)); 451 + return fd; 452 + } 453 + 528 454 void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags, 529 455 uint64_t gpa, uint64_t size, void *hva); 530 456 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags, 531 457 uint64_t gpa, uint64_t size, void *hva); 458 + void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags, 459 + uint64_t gpa, uint64_t size, void *hva, 460 + uint32_t guest_memfd, uint64_t guest_memfd_offset); 461 + int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags, 462 + uint64_t gpa, uint64_t size, void *hva, 463 + uint32_t guest_memfd, uint64_t guest_memfd_offset); 464 + 532 465 void vm_userspace_mem_region_add(struct kvm_vm *vm, 533 466 enum vm_mem_backing_src_type src_type, 534 467 uint64_t guest_paddr, uint32_t slot, uint64_t npages, 535 468 uint32_t flags); 469 + void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type, 470 + uint64_t guest_paddr, uint32_t slot, uint64_t npages, 471 + uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset); 536 472 537 473 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags); 538 474 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa); ··· 711 587 { 712 588 int fd = __vcpu_ioctl(vcpu, KVM_GET_STATS_FD, NULL); 713 589 714 - TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_GET_STATS_FD, fd)); 590 + 
TEST_ASSERT_VM_VCPU_IOCTL(fd >= 0, KVM_CHECK_EXTENSION, fd, vcpu->vm); 715 591 return fd; 716 592 } 717 593 ··· 837 713 * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to 838 714 * calculate the amount of memory needed for per-vCPU data, e.g. stacks. 839 715 */ 840 - struct kvm_vm *____vm_create(enum vm_guest_mode mode); 841 - struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus, 716 + struct kvm_vm *____vm_create(struct vm_shape shape); 717 + struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus, 842 718 uint64_t nr_extra_pages); 843 719 844 720 static inline struct kvm_vm *vm_create_barebones(void) 845 721 { 846 - return ____vm_create(VM_MODE_DEFAULT); 722 + return ____vm_create(VM_SHAPE_DEFAULT); 847 723 } 724 + 725 + #ifdef __x86_64__ 726 + static inline struct kvm_vm *vm_create_barebones_protected_vm(void) 727 + { 728 + const struct vm_shape shape = { 729 + .mode = VM_MODE_DEFAULT, 730 + .type = KVM_X86_SW_PROTECTED_VM, 731 + }; 732 + 733 + return ____vm_create(shape); 734 + } 735 + #endif 848 736 849 737 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus) 850 738 { 851 - return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0); 739 + return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0); 852 740 } 853 741 854 - struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus, 742 + struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus, 855 743 uint64_t extra_mem_pages, 856 744 void *guest_code, struct kvm_vcpu *vcpus[]); 857 745 ··· 871 735 void *guest_code, 872 736 struct kvm_vcpu *vcpus[]) 873 737 { 874 - return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0, 738 + return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0, 875 739 guest_code, vcpus); 876 740 } 741 + 742 + 743 + struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape, 744 + struct kvm_vcpu **vcpu, 745 + uint64_t extra_mem_pages, 746 + 
void *guest_code); 877 747 878 748 /* 879 749 * Create a VM with a single vCPU with reasonable defaults and @extra_mem_pages 880 750 * additional pages of guest memory. Returns the VM and vCPU (via out param). 881 751 */ 882 - struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu, 883 - uint64_t extra_mem_pages, 884 - void *guest_code); 752 + static inline struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu, 753 + uint64_t extra_mem_pages, 754 + void *guest_code) 755 + { 756 + return __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, vcpu, 757 + extra_mem_pages, guest_code); 758 + } 885 759 886 760 static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu, 887 761 void *guest_code) 888 762 { 889 763 return __vm_create_with_one_vcpu(vcpu, 0, guest_code); 764 + } 765 + 766 + static inline struct kvm_vm *vm_create_shape_with_one_vcpu(struct vm_shape shape, 767 + struct kvm_vcpu **vcpu, 768 + void *guest_code) 769 + { 770 + return __vm_create_shape_with_one_vcpu(shape, vcpu, 0, guest_code); 890 771 } 891 772 892 773 struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm); ··· 928 775 #endif 929 776 return n; 930 777 } 931 - 932 - struct kvm_userspace_memory_region * 933 - kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start, 934 - uint64_t end); 935 778 936 779 #define sync_global_to_guest(vm, g) ({ \ 937 780 typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
+45 -17
tools/testing/selftests/kvm/include/riscv/processor.h
··· 10 10 #include "kvm_util.h" 11 11 #include <linux/stringify.h> 12 12 13 - static inline uint64_t __kvm_reg_id(uint64_t type, uint64_t idx, 14 - uint64_t size) 13 + static inline uint64_t __kvm_reg_id(uint64_t type, uint64_t subtype, 14 + uint64_t idx, uint64_t size) 15 15 { 16 - return KVM_REG_RISCV | type | idx | size; 16 + return KVM_REG_RISCV | type | subtype | idx | size; 17 17 } 18 18 19 19 #if __riscv_xlen == 64 ··· 22 22 #define KVM_REG_SIZE_ULONG KVM_REG_SIZE_U32 23 23 #endif 24 24 25 - #define RISCV_CONFIG_REG(name) __kvm_reg_id(KVM_REG_RISCV_CONFIG, \ 26 - KVM_REG_RISCV_CONFIG_REG(name), \ 27 - KVM_REG_SIZE_ULONG) 25 + #define RISCV_CONFIG_REG(name) __kvm_reg_id(KVM_REG_RISCV_CONFIG, 0, \ 26 + KVM_REG_RISCV_CONFIG_REG(name), \ 27 + KVM_REG_SIZE_ULONG) 28 28 29 - #define RISCV_CORE_REG(name) __kvm_reg_id(KVM_REG_RISCV_CORE, \ 30 - KVM_REG_RISCV_CORE_REG(name), \ 31 - KVM_REG_SIZE_ULONG) 29 + #define RISCV_CORE_REG(name) __kvm_reg_id(KVM_REG_RISCV_CORE, 0, \ 30 + KVM_REG_RISCV_CORE_REG(name), \ 31 + KVM_REG_SIZE_ULONG) 32 32 33 - #define RISCV_CSR_REG(name) __kvm_reg_id(KVM_REG_RISCV_CSR, \ 34 - KVM_REG_RISCV_CSR_REG(name), \ 35 - KVM_REG_SIZE_ULONG) 33 + #define RISCV_GENERAL_CSR_REG(name) __kvm_reg_id(KVM_REG_RISCV_CSR, \ 34 + KVM_REG_RISCV_CSR_GENERAL, \ 35 + KVM_REG_RISCV_CSR_REG(name), \ 36 + KVM_REG_SIZE_ULONG) 36 37 37 - #define RISCV_TIMER_REG(name) __kvm_reg_id(KVM_REG_RISCV_TIMER, \ 38 - KVM_REG_RISCV_TIMER_REG(name), \ 39 - KVM_REG_SIZE_U64) 38 + #define RISCV_TIMER_REG(name) __kvm_reg_id(KVM_REG_RISCV_TIMER, 0, \ 39 + KVM_REG_RISCV_TIMER_REG(name), \ 40 + KVM_REG_SIZE_U64) 40 41 41 - #define RISCV_ISA_EXT_REG(idx) __kvm_reg_id(KVM_REG_RISCV_ISA_EXT, \ 42 - idx, KVM_REG_SIZE_ULONG) 42 + #define RISCV_ISA_EXT_REG(idx) __kvm_reg_id(KVM_REG_RISCV_ISA_EXT, \ 43 + KVM_REG_RISCV_ISA_SINGLE, \ 44 + idx, KVM_REG_SIZE_ULONG) 45 + 46 + #define RISCV_SBI_EXT_REG(idx) __kvm_reg_id(KVM_REG_RISCV_SBI_EXT, \ 47 + KVM_REG_RISCV_SBI_SINGLE, \ 48 + idx, 
KVM_REG_SIZE_ULONG) 43 49 44 50 /* L3 index Bit[47:39] */ 45 51 #define PGTBL_L3_INDEX_MASK 0x0000FF8000000000ULL ··· 108 102 #define SATP_ASID_SHIFT 44 109 103 #define SATP_ASID_MASK _AC(0xFFFF, UL) 110 104 105 + /* SBI return error codes */ 106 + #define SBI_SUCCESS 0 107 + #define SBI_ERR_FAILURE -1 108 + #define SBI_ERR_NOT_SUPPORTED -2 109 + #define SBI_ERR_INVALID_PARAM -3 110 + #define SBI_ERR_DENIED -4 111 + #define SBI_ERR_INVALID_ADDRESS -5 112 + #define SBI_ERR_ALREADY_AVAILABLE -6 113 + #define SBI_ERR_ALREADY_STARTED -7 114 + #define SBI_ERR_ALREADY_STOPPED -8 115 + 111 116 #define SBI_EXT_EXPERIMENTAL_START 0x08000000 112 117 #define SBI_EXT_EXPERIMENTAL_END 0x08FFFFFF 113 118 114 119 #define KVM_RISCV_SELFTESTS_SBI_EXT SBI_EXT_EXPERIMENTAL_END 115 120 #define KVM_RISCV_SELFTESTS_SBI_UCALL 0 116 121 #define KVM_RISCV_SELFTESTS_SBI_UNEXP 1 122 + 123 + enum sbi_ext_id { 124 + SBI_EXT_BASE = 0x10, 125 + SBI_EXT_STA = 0x535441, 126 + }; 127 + 128 + enum sbi_ext_base_fid { 129 + SBI_EXT_BASE_PROBE_EXT = 3, 130 + }; 117 131 118 132 struct sbiret { 119 133 long error; ··· 144 118 unsigned long arg1, unsigned long arg2, 145 119 unsigned long arg3, unsigned long arg4, 146 120 unsigned long arg5); 121 + 122 + bool guest_sbi_probe_extension(int extid, long *out_val); 147 123 148 124 #endif /* SELFTEST_KVM_PROCESSOR_H */
+6 -1
tools/testing/selftests/kvm/include/test_util.h
··· 142 142 return vm_mem_backing_src_alias(t)->flag & MAP_SHARED; 143 143 } 144 144 145 + static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t) 146 + { 147 + return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM; 148 + } 149 + 145 150 /* Aligns x up to the next multiple of size. Size must be a power of 2. */ 146 151 static inline uint64_t align_up(uint64_t x, uint64_t size) 147 152 { ··· 191 186 } 192 187 193 188 int guest_vsnprintf(char *buf, int n, const char *fmt, va_list args); 194 - int guest_snprintf(char *buf, int n, const char *fmt, ...); 189 + __printf(3, 4) int guest_snprintf(char *buf, int n, const char *fmt, ...); 195 190 196 191 char *strdup_printf(const char *fmt, ...) __attribute__((format(printf, 1, 2), nonnull(1))); 197 192
+15 -3
tools/testing/selftests/kvm/include/ucall_common.h
··· 34 34 void *ucall_arch_get_ucall(struct kvm_vcpu *vcpu); 35 35 36 36 void ucall(uint64_t cmd, int nargs, ...); 37 - void ucall_fmt(uint64_t cmd, const char *fmt, ...); 38 - void ucall_assert(uint64_t cmd, const char *exp, const char *file, 39 - unsigned int line, const char *fmt, ...); 37 + __printf(2, 3) void ucall_fmt(uint64_t cmd, const char *fmt, ...); 38 + __printf(5, 6) void ucall_assert(uint64_t cmd, const char *exp, 39 + const char *file, unsigned int line, 40 + const char *fmt, ...); 40 41 uint64_t get_ucall(struct kvm_vcpu *vcpu, struct ucall *uc); 41 42 void ucall_init(struct kvm_vm *vm, vm_paddr_t mmio_gpa); 42 43 int ucall_nr_pages_required(uint64_t page_size); ··· 53 52 #define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4) \ 54 53 ucall(UCALL_SYNC, 6, "hello", stage, arg1, arg2, arg3, arg4) 55 54 #define GUEST_SYNC(stage) ucall(UCALL_SYNC, 2, "hello", stage) 55 + #define GUEST_SYNC1(arg0) ucall(UCALL_SYNC, 1, arg0) 56 + #define GUEST_SYNC2(arg0, arg1) ucall(UCALL_SYNC, 2, arg0, arg1) 57 + #define GUEST_SYNC3(arg0, arg1, arg2) \ 58 + ucall(UCALL_SYNC, 3, arg0, arg1, arg2) 59 + #define GUEST_SYNC4(arg0, arg1, arg2, arg3) \ 60 + ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3) 61 + #define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \ 62 + ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, arg4) 63 + #define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \ 64 + ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, arg4, arg5) 65 + 56 66 #define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args) 57 67 #define GUEST_DONE() ucall(UCALL_DONE, 0) 58 68
+15
tools/testing/selftests/kvm/include/x86_64/processor.h
··· 15 15 #include <asm/msr-index.h> 16 16 #include <asm/prctl.h> 17 17 18 + #include <linux/kvm_para.h> 18 19 #include <linux/stringify.h> 19 20 20 21 #include "../kvm_util.h" ··· 1194 1193 uint64_t a3); 1195 1194 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1); 1196 1195 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1); 1196 + 1197 + static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa, 1198 + uint64_t size, uint64_t flags) 1199 + { 1200 + return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, flags, 0); 1201 + } 1202 + 1203 + static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size, 1204 + uint64_t flags) 1205 + { 1206 + uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags); 1207 + 1208 + GUEST_ASSERT(!ret); 1209 + } 1197 1210 1198 1211 void __vm_xsave_require_permission(uint64_t xfeature, const char *name); 1199 1212
+1 -1
tools/testing/selftests/kvm/kvm_page_table_test.c
··· 254 254 255 255 /* Create a VM with enough guest pages */ 256 256 guest_num_pages = test_mem_size / guest_page_size; 257 - vm = __vm_create_with_vcpus(mode, nr_vcpus, guest_num_pages, 257 + vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus, guest_num_pages, 258 258 guest_code, test_args.vcpus); 259 259 260 260 /* Align down GPA of the testing memslot */
+55 -14
tools/testing/selftests/kvm/lib/aarch64/processor.c
··· 12 12 #include "kvm_util.h" 13 13 #include "processor.h" 14 14 #include <linux/bitfield.h> 15 + #include <linux/sizes.h> 15 16 16 17 #define DEFAULT_ARM64_GUEST_STACK_VADDR_MIN 0xac0000 17 18 ··· 59 58 return (gva >> vm->page_shift) & mask; 60 59 } 61 60 61 + static inline bool use_lpa2_pte_format(struct kvm_vm *vm) 62 + { 63 + return (vm->page_size == SZ_4K || vm->page_size == SZ_16K) && 64 + (vm->pa_bits > 48 || vm->va_bits > 48); 65 + } 66 + 62 67 static uint64_t addr_pte(struct kvm_vm *vm, uint64_t pa, uint64_t attrs) 63 68 { 64 69 uint64_t pte; 65 70 66 - pte = pa & GENMASK(47, vm->page_shift); 67 - if (vm->page_shift == 16) 68 - pte |= FIELD_GET(GENMASK(51, 48), pa) << 12; 71 + if (use_lpa2_pte_format(vm)) { 72 + pte = pa & GENMASK(49, vm->page_shift); 73 + pte |= FIELD_GET(GENMASK(51, 50), pa) << 8; 74 + attrs &= ~GENMASK(9, 8); 75 + } else { 76 + pte = pa & GENMASK(47, vm->page_shift); 77 + if (vm->page_shift == 16) 78 + pte |= FIELD_GET(GENMASK(51, 48), pa) << 12; 79 + } 69 80 pte |= attrs; 70 81 71 82 return pte; ··· 87 74 { 88 75 uint64_t pa; 89 76 90 - pa = pte & GENMASK(47, vm->page_shift); 91 - if (vm->page_shift == 16) 92 - pa |= FIELD_GET(GENMASK(15, 12), pte) << 48; 77 + if (use_lpa2_pte_format(vm)) { 78 + pa = pte & GENMASK(49, vm->page_shift); 79 + pa |= FIELD_GET(GENMASK(9, 8), pte) << 50; 80 + } else { 81 + pa = pte & GENMASK(47, vm->page_shift); 82 + if (vm->page_shift == 16) 83 + pa |= FIELD_GET(GENMASK(15, 12), pte) << 48; 84 + } 93 85 94 86 return pa; 95 87 } ··· 284 266 285 267 /* Configure base granule size */ 286 268 switch (vm->mode) { 287 - case VM_MODE_P52V48_4K: 288 - TEST_FAIL("AArch64 does not support 4K sized pages " 289 - "with 52-bit physical address ranges"); 290 269 case VM_MODE_PXXV48_4K: 291 270 TEST_FAIL("AArch64 does not support 4K sized pages " 292 271 "with ANY-bit physical address ranges"); ··· 293 278 case VM_MODE_P36V48_64K: 294 279 tcr_el1 |= 1ul << 14; /* TG0 = 64KB */ 295 280 break; 281 + case 
VM_MODE_P52V48_16K: 296 282 case VM_MODE_P48V48_16K: 297 283 case VM_MODE_P40V48_16K: 298 284 case VM_MODE_P36V48_16K: 299 285 case VM_MODE_P36V47_16K: 300 286 tcr_el1 |= 2ul << 14; /* TG0 = 16KB */ 301 287 break; 288 + case VM_MODE_P52V48_4K: 302 289 case VM_MODE_P48V48_4K: 303 290 case VM_MODE_P40V48_4K: 304 291 case VM_MODE_P36V48_4K: ··· 314 297 315 298 /* Configure output size */ 316 299 switch (vm->mode) { 300 + case VM_MODE_P52V48_4K: 301 + case VM_MODE_P52V48_16K: 317 302 case VM_MODE_P52V48_64K: 318 303 tcr_el1 |= 6ul << 32; /* IPS = 52 bits */ 319 304 ttbr0_el1 |= FIELD_GET(GENMASK(51, 48), vm->pgd) << 2; ··· 344 325 /* TCR_EL1 |= IRGN0:WBWA | ORGN0:WBWA | SH0:Inner-Shareable */; 345 326 tcr_el1 |= (1 << 8) | (1 << 10) | (3 << 12); 346 327 tcr_el1 |= (64 - vm->va_bits) /* T0SZ */; 328 + if (use_lpa2_pte_format(vm)) 329 + tcr_el1 |= (1ul << 59) /* DS */; 347 330 348 331 vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_SCTLR_EL1), sctlr_el1); 349 332 vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_TCR_EL1), tcr_el1); ··· 513 492 return read_sysreg(tpidr_el1); 514 493 } 515 494 516 - void aarch64_get_supported_page_sizes(uint32_t ipa, 517 - bool *ps4k, bool *ps16k, bool *ps64k) 495 + static uint32_t max_ipa_for_page_size(uint32_t vm_ipa, uint32_t gran, 496 + uint32_t not_sup_val, uint32_t ipa52_min_val) 497 + { 498 + if (gran == not_sup_val) 499 + return 0; 500 + else if (gran >= ipa52_min_val && vm_ipa >= 52) 501 + return 52; 502 + else 503 + return min(vm_ipa, 48U); 504 + } 505 + 506 + void aarch64_get_supported_page_sizes(uint32_t ipa, uint32_t *ipa4k, 507 + uint32_t *ipa16k, uint32_t *ipa64k) 518 508 { 519 509 struct kvm_vcpu_init preferred_init; 520 510 int kvm_fd, vm_fd, vcpu_fd, err; 521 511 uint64_t val; 512 + uint32_t gran; 522 513 struct kvm_one_reg reg = { 523 514 .id = KVM_ARM64_SYS_REG(SYS_ID_AA64MMFR0_EL1), 524 515 .addr = (uint64_t)&val, ··· 551 518 err = ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg); 552 519 TEST_ASSERT(err == 0, KVM_IOCTL_ERROR(KVM_GET_ONE_REG, 
vcpu_fd)); 553 520 554 - *ps4k = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_EL1_TGRAN4), val) != 0xf; 555 - *ps64k = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_EL1_TGRAN64), val) == 0; 556 - *ps16k = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_EL1_TGRAN16), val) != 0; 521 + gran = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_EL1_TGRAN4), val); 522 + *ipa4k = max_ipa_for_page_size(ipa, gran, ID_AA64MMFR0_EL1_TGRAN4_NI, 523 + ID_AA64MMFR0_EL1_TGRAN4_52_BIT); 524 + 525 + gran = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_EL1_TGRAN64), val); 526 + *ipa64k = max_ipa_for_page_size(ipa, gran, ID_AA64MMFR0_EL1_TGRAN64_NI, 527 + ID_AA64MMFR0_EL1_TGRAN64_IMP); 528 + 529 + gran = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64MMFR0_EL1_TGRAN16), val); 530 + *ipa16k = max_ipa_for_page_size(ipa, gran, ID_AA64MMFR0_EL1_TGRAN16_NI, 531 + ID_AA64MMFR0_EL1_TGRAN16_52_BIT); 557 532 558 533 close(vcpu_fd); 559 534 close(vm_fd);
tools/testing/selftests/kvm/lib/guest_modes.c (+23 -27)
···
 void guest_modes_append_default(void)
 {
 #ifndef __aarch64__
-	guest_mode_append(VM_MODE_DEFAULT, true, true);
+	guest_mode_append(VM_MODE_DEFAULT, true);
 #else
 	{
 		unsigned int limit = kvm_check_cap(KVM_CAP_ARM_VM_IPA_SIZE);
-		bool ps4k, ps16k, ps64k;
+		uint32_t ipa4k, ipa16k, ipa64k;
 		int i;
 
-		aarch64_get_supported_page_sizes(limit, &ps4k, &ps16k, &ps64k);
+		aarch64_get_supported_page_sizes(limit, &ipa4k, &ipa16k, &ipa64k);
 
-		vm_mode_default = NUM_VM_MODES;
+		guest_mode_append(VM_MODE_P52V48_4K, ipa4k >= 52);
+		guest_mode_append(VM_MODE_P52V48_16K, ipa16k >= 52);
+		guest_mode_append(VM_MODE_P52V48_64K, ipa64k >= 52);
 
-		if (limit >= 52)
-			guest_mode_append(VM_MODE_P52V48_64K, ps64k, ps64k);
-		if (limit >= 48) {
-			guest_mode_append(VM_MODE_P48V48_4K, ps4k, ps4k);
-			guest_mode_append(VM_MODE_P48V48_16K, ps16k, ps16k);
-			guest_mode_append(VM_MODE_P48V48_64K, ps64k, ps64k);
-		}
-		if (limit >= 40) {
-			guest_mode_append(VM_MODE_P40V48_4K, ps4k, ps4k);
-			guest_mode_append(VM_MODE_P40V48_16K, ps16k, ps16k);
-			guest_mode_append(VM_MODE_P40V48_64K, ps64k, ps64k);
-			if (ps4k)
-				vm_mode_default = VM_MODE_P40V48_4K;
-		}
-		if (limit >= 36) {
-			guest_mode_append(VM_MODE_P36V48_4K, ps4k, ps4k);
-			guest_mode_append(VM_MODE_P36V48_16K, ps16k, ps16k);
-			guest_mode_append(VM_MODE_P36V48_64K, ps64k, ps64k);
-			guest_mode_append(VM_MODE_P36V47_16K, ps16k, ps16k);
-		}
+		guest_mode_append(VM_MODE_P48V48_4K, ipa4k >= 48);
+		guest_mode_append(VM_MODE_P48V48_16K, ipa16k >= 48);
+		guest_mode_append(VM_MODE_P48V48_64K, ipa64k >= 48);
+
+		guest_mode_append(VM_MODE_P40V48_4K, ipa4k >= 40);
+		guest_mode_append(VM_MODE_P40V48_16K, ipa16k >= 40);
+		guest_mode_append(VM_MODE_P40V48_64K, ipa64k >= 40);
+
+		guest_mode_append(VM_MODE_P36V48_4K, ipa4k >= 36);
+		guest_mode_append(VM_MODE_P36V48_16K, ipa16k >= 36);
+		guest_mode_append(VM_MODE_P36V48_64K, ipa64k >= 36);
+		guest_mode_append(VM_MODE_P36V47_16K, ipa16k >= 36);
+
+		vm_mode_default = ipa4k >= 40 ? VM_MODE_P40V48_4K : NUM_VM_MODES;
 
 		/*
 		 * Pick the first supported IPA size if the default
···
 		close(kvm_fd);
 		/* Starting with z13 we have 47bits of physical address */
 		if (info.ibc >= 0x30)
-			guest_mode_append(VM_MODE_P47V64_4K, true, true);
+			guest_mode_append(VM_MODE_P47V64_4K, true);
 	}
 #endif
 #ifdef __riscv
···
 		unsigned int sz = kvm_check_cap(KVM_CAP_VM_GPA_BITS);
 
 		if (sz >= 52)
-			guest_mode_append(VM_MODE_P52V48_4K, true, true);
+			guest_mode_append(VM_MODE_P52V48_4K, true);
 		if (sz >= 48)
-			guest_mode_append(VM_MODE_P48V48_4K, true, true);
+			guest_mode_append(VM_MODE_P48V48_4K, true);
 	}
 #endif
 }
tools/testing/selftests/kvm/lib/kvm_util.c (+138 -91)
···
 {
 	static const char * const strings[] = {
 		[VM_MODE_P52V48_4K]	= "PA-bits:52, VA-bits:48, 4K pages",
+		[VM_MODE_P52V48_16K]	= "PA-bits:52, VA-bits:48, 16K pages",
 		[VM_MODE_P52V48_64K]	= "PA-bits:52, VA-bits:48, 64K pages",
 		[VM_MODE_P48V48_4K]	= "PA-bits:48, VA-bits:48, 4K pages",
 		[VM_MODE_P48V48_16K]	= "PA-bits:48, VA-bits:48, 16K pages",
···
 
 const struct vm_guest_mode_params vm_guest_mode_params[] = {
 	[VM_MODE_P52V48_4K]	= { 52, 48, 0x1000, 12 },
+	[VM_MODE_P52V48_16K]	= { 52, 48, 0x4000, 14 },
 	[VM_MODE_P52V48_64K]	= { 52, 48, 0x10000, 16 },
 	[VM_MODE_P48V48_4K]	= { 48, 48, 0x1000, 12 },
 	[VM_MODE_P48V48_16K]	= { 48, 48, 0x4000, 14 },
···
 		    (1ULL << (vm->va_bits - 1)) >> vm->page_shift);
 }
 
-struct kvm_vm *____vm_create(enum vm_guest_mode mode)
+struct kvm_vm *____vm_create(struct vm_shape shape)
 {
 	struct kvm_vm *vm;
···
 	vm->regions.hva_tree = RB_ROOT;
 	hash_init(vm->regions.slot_hash);
 
-	vm->mode = mode;
-	vm->type = 0;
+	vm->mode = shape.mode;
+	vm->type = shape.type;
 
-	vm->pa_bits = vm_guest_mode_params[mode].pa_bits;
-	vm->va_bits = vm_guest_mode_params[mode].va_bits;
-	vm->page_size = vm_guest_mode_params[mode].page_size;
-	vm->page_shift = vm_guest_mode_params[mode].page_shift;
+	vm->pa_bits = vm_guest_mode_params[vm->mode].pa_bits;
+	vm->va_bits = vm_guest_mode_params[vm->mode].va_bits;
+	vm->page_size = vm_guest_mode_params[vm->mode].page_size;
+	vm->page_shift = vm_guest_mode_params[vm->mode].page_shift;
 
 	/* Setup mode specific traits. */
 	switch (vm->mode) {
···
 	case VM_MODE_P36V48_64K:
 		vm->pgtable_levels = 3;
 		break;
+	case VM_MODE_P52V48_16K:
 	case VM_MODE_P48V48_16K:
 	case VM_MODE_P40V48_16K:
 	case VM_MODE_P36V48_16K:
···
 		/*
 		 * Ignore KVM support for 5-level paging (vm->va_bits == 57),
 		 * it doesn't take effect unless a CR4.LA57 is set, which it
-		 * isn't for this VM_MODE.
+		 * isn't for this mode (48-bit virtual address space).
 		 */
 		TEST_ASSERT(vm->va_bits == 48 || vm->va_bits == 57,
 			    "Linear address width (%d bits) not supported",
···
 		vm->pgtable_levels = 5;
 		break;
 	default:
-		TEST_FAIL("Unknown guest mode, mode: 0x%x", mode);
+		TEST_FAIL("Unknown guest mode: 0x%x", vm->mode);
 	}
 
 #ifdef __aarch64__
+	TEST_ASSERT(!vm->type, "ARM doesn't support test-provided types");
 	if (vm->pa_bits != 40)
 		vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
 #endif
···
 	return vm_adjust_num_guest_pages(mode, nr_pages);
 }
 
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 			   uint64_t nr_extra_pages)
 {
-	uint64_t nr_pages = vm_nr_pages_required(mode, nr_runnable_vcpus,
+	uint64_t nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
 						 nr_extra_pages);
 	struct userspace_mem_region *slot0;
 	struct kvm_vm *vm;
 	int i;
 
-	pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
-		 vm_guest_mode_string(mode), nr_pages);
+	pr_debug("%s: mode='%s' type='%d', pages='%ld'\n", __func__,
+		 vm_guest_mode_string(shape.mode), shape.type, nr_pages);
 
-	vm = ____vm_create(mode);
+	vm = ____vm_create(shape);
 
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
···
 * extra_mem_pages is only used to calculate the maximum page table size,
 * no real memory allocation for non-slot0 memory in this function.
 */
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
 				      uint64_t extra_mem_pages,
 				      void *guest_code, struct kvm_vcpu *vcpus[])
 {
···
 
 	TEST_ASSERT(!nr_vcpus || vcpus, "Must provide vCPU array");
 
-	vm = __vm_create(mode, nr_vcpus, extra_mem_pages);
+	vm = __vm_create(shape, nr_vcpus, extra_mem_pages);
 
 	for (i = 0; i < nr_vcpus; ++i)
 		vcpus[i] = vm_vcpu_add(vm, i, guest_code);
···
 	return vm;
 }
 
-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
-					 uint64_t extra_mem_pages,
-					 void *guest_code)
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+					       struct kvm_vcpu **vcpu,
+					       uint64_t extra_mem_pages,
+					       void *guest_code)
 {
 	struct kvm_vcpu *vcpus[1];
 	struct kvm_vm *vm;
 
-	vm = __vm_create_with_vcpus(VM_MODE_DEFAULT, 1, extra_mem_pages,
-				    guest_code, vcpus);
+	vm = __vm_create_with_vcpus(shape, 1, extra_mem_pages, guest_code, vcpus);
 
 	*vcpu = vcpus[0];
 	return vm;
···
 	vm_create_irqchip(vmp);
 
 	hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
-		int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, &region->region);
-		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+		int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 			    "  rc: %i errno: %i\n"
 			    "  slot: %u flags: 0x%x\n"
 			    "  guest_phys_addr: 0x%llx size: 0x%llx",
···
 	return NULL;
 }
 
-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-				 uint64_t end)
-{
-	struct userspace_mem_region *region;
-
-	region = userspace_mem_region_find(vm, start, end);
-	if (!region)
-		return NULL;
-
-	return &region->region;
-}
-
 __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
 {
 
···
 	}
 
 	region->region.memory_size = 0;
-	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
 	sparsebit_free(&region->unused_phy_pages);
 	ret = munmap(region->mmap_start, region->mmap_size);
···
 		TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
 		close(region->fd);
 	}
+	if (region->region.guest_memfd >= 0)
+		close(region->region.guest_memfd);
 
 	free(region);
 }
···
 		    errno, strerror(errno));
 }
 
-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- *   vm - Virtual Machine
- *   src_type - Storage source for this region.
- *              NULL to use anonymous memory.
- *   guest_paddr - Starting guest physical address
- *   slot - KVM region slot
- *   npages - Number of physical pages
- *   flags - KVM memory region flags (e.g. KVM_MEM_LOG_DIRTY_PAGES)
- *
- * Output Args: None
- *
- * Return: None
- *
- * Allocates a memory area of the number of pages specified by npages
- * and maps it to the VM specified by vm, at a starting physical address
- * given by guest_paddr. The region is created with a KVM region slot
- * given by slot, which must be unique and < KVM_MEM_SLOTS_NUM. The
- * region is created with the flags given by flags.
- */
-void vm_userspace_mem_region_add(struct kvm_vm *vm,
-	enum vm_mem_backing_src_type src_type,
-	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-	uint32_t flags)
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				 uint64_t gpa, uint64_t size, void *hva,
+				 uint32_t guest_memfd, uint64_t guest_memfd_offset)
+{
+	struct kvm_userspace_memory_region2 region = {
+		.slot = slot,
+		.flags = flags,
+		.guest_phys_addr = gpa,
+		.memory_size = size,
+		.userspace_addr = (uintptr_t)hva,
+		.guest_memfd = guest_memfd,
+		.guest_memfd_offset = guest_memfd_offset,
+	};
+
+	return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				uint64_t gpa, uint64_t size, void *hva,
+				uint32_t guest_memfd, uint64_t guest_memfd_offset)
+{
+	int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+					       guest_memfd, guest_memfd_offset);
+
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+		    errno, strerror(errno));
+}
+
+
+/* FIXME: This thing needs to be ripped apart and rewritten. */
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
 {
 	int ret;
 	struct userspace_mem_region *region;
 	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
+	size_t mem_size = npages * vm->page_size;
 	size_t alignment;
 
 	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
···
 	/* Allocate and initialize new mem region structure. */
 	region = calloc(1, sizeof(*region));
 	TEST_ASSERT(region != NULL, "Insufficient Memory");
-	region->mmap_size = npages * vm->page_size;
+	region->mmap_size = mem_size;
 
 #ifdef __s390x__
 	/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
···
 	/* As needed perform madvise */
 	if ((src_type == VM_MEM_SRC_ANONYMOUS ||
 	     src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
-		ret = madvise(region->host_mem, npages * vm->page_size,
+		ret = madvise(region->host_mem, mem_size,
 			      src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
 		TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx src_type: %s",
-			    region->host_mem, npages * vm->page_size,
+			    region->host_mem, mem_size,
 			    vm_mem_backing_src_alias(src_type)->name);
 	}
 
 	region->backing_src_type = src_type;
+
+	if (flags & KVM_MEM_GUEST_MEMFD) {
+		if (guest_memfd < 0) {
+			uint32_t guest_memfd_flags = 0;
+			TEST_ASSERT(!guest_memfd_offset,
+				    "Offset must be zero when creating new guest_memfd");
+			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
+		} else {
+			/*
+			 * Install a unique fd for each memslot so that the fd
+			 * can be closed when the region is deleted without
+			 * needing to track if the fd is owned by the framework
+			 * or by the caller.
+			 */
+			guest_memfd = dup(guest_memfd);
+			TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+		}
+
+		region->region.guest_memfd = guest_memfd;
+		region->region.guest_memfd_offset = guest_memfd_offset;
+	} else {
+		region->region.guest_memfd = -1;
+	}
+
 	region->unused_phy_pages = sparsebit_alloc();
 	sparsebit_set_num(region->unused_phy_pages,
 		guest_paddr >> vm->page_shift, npages);
···
 	region->region.guest_phys_addr = guest_paddr;
 	region->region.memory_size = npages * vm->page_size;
 	region->region.userspace_addr = (uintptr_t) region->host_mem;
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i\n"
 		"  slot: %u flags: 0x%x\n"
-		"  guest_phys_addr: 0x%lx size: 0x%lx",
+		"  guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d\n",
 		ret, errno, slot, flags,
-		guest_paddr, (uint64_t) region->region.memory_size);
+		guest_paddr, (uint64_t) region->region.memory_size,
+		region->region.guest_memfd);
 
 	/* Add to quick lookup data structures */
 	vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
···
 		/* Align host alias address */
 		region->host_alias = align_ptr_up(region->mmap_alias, alignment);
 	}
+}
+
+void vm_userspace_mem_region_add(struct kvm_vm *vm,
+				 enum vm_mem_backing_src_type src_type,
+				 uint64_t guest_paddr, uint32_t slot,
+				 uint64_t npages, uint32_t flags)
+{
+	vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0);
 }
 
 /*
···
 
 	region->region.flags = flags;
 
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i slot: %u flags: 0x%x",
 		ret, errno, slot, flags);
 }
···
 
 	region->region.guest_phys_addr = new_gpa;
 
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
 		    "ret: %i errno: %i slot: %u new_gpa: 0x%lx",
 		    ret, errno, slot, new_gpa);
 }
···
 void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
 {
 	__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
+			    bool punch_hole)
+{
+	const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
+	struct userspace_mem_region *region;
+	uint64_t end = base + size;
+	uint64_t gpa, len;
+	off_t fd_offset;
+	int ret;
+
+	for (gpa = base; gpa < end; gpa += len) {
+		uint64_t offset;
+
+		region = userspace_mem_region_find(vm, gpa, gpa);
+		TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
+			    "Private memory region not found for GPA 0x%lx", gpa);
+
+		offset = gpa - region->region.guest_phys_addr;
+		fd_offset = region->region.guest_memfd_offset + offset;
+		len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
+
+		ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
+		TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx\n",
+			    punch_hole ? "punch hole" : "allocate", gpa, len,
+			    region->region.guest_memfd, mode, fd_offset);
+	}
 }
 
 /* Returns the size of a vCPU's kvm_run structure. */
···
 	vcpu->vm = vm;
 	vcpu->id = vcpu_id;
 	vcpu->fd = __vm_ioctl(vm, KVM_CREATE_VCPU, (void *)(unsigned long)vcpu_id);
-	TEST_ASSERT(vcpu->fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_VCPU, vcpu->fd));
+	TEST_ASSERT_VM_VCPU_IOCTL(vcpu->fd >= 0, KVM_CREATE_VCPU, vcpu->fd, vm);
 
 	TEST_ASSERT(vcpu_mmap_sz() >= sizeof(*vcpu->run), "vcpu mmap size "
 		    "smaller than expected, vcpu_mmap_sz: %i expected_min: %zi",
tools/testing/selftests/kvm/lib/memstress.c (+2 -1)
···
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in meslots.
 	 */
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+	vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus,
+				    slot0_pages + guest_num_pages,
 				    memstress_guest_code, vcpus);
 
 	args->vm = vm;
tools/testing/selftests/kvm/lib/riscv/processor.c (+47 -2)
···
 	satp = (vm->pgd >> PGTBL_PAGE_SIZE_SHIFT) & SATP_PPN;
 	satp |= SATP_MODE_48;
 
-	vcpu_set_reg(vcpu, RISCV_CSR_REG(satp), satp);
+	vcpu_set_reg(vcpu, RISCV_GENERAL_CSR_REG(satp), satp);
 }
 
 void vcpu_arch_dump(FILE *stream, struct kvm_vcpu *vcpu, uint8_t indent)
···
 	vcpu_set_reg(vcpu, RISCV_CORE_REG(regs.pc), (unsigned long)guest_code);
 
 	/* Setup default exception vector of guest */
-	vcpu_set_reg(vcpu, RISCV_CSR_REG(stvec), (unsigned long)guest_unexp_trap);
+	vcpu_set_reg(vcpu, RISCV_GENERAL_CSR_REG(stvec), (unsigned long)guest_unexp_trap);
 
 	return vcpu;
 }
···
 
 void assert_on_unhandled_exception(struct kvm_vcpu *vcpu)
 {
+}
+
+struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
+			unsigned long arg1, unsigned long arg2,
+			unsigned long arg3, unsigned long arg4,
+			unsigned long arg5)
+{
+	register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);
+	register uintptr_t a1 asm ("a1") = (uintptr_t)(arg1);
+	register uintptr_t a2 asm ("a2") = (uintptr_t)(arg2);
+	register uintptr_t a3 asm ("a3") = (uintptr_t)(arg3);
+	register uintptr_t a4 asm ("a4") = (uintptr_t)(arg4);
+	register uintptr_t a5 asm ("a5") = (uintptr_t)(arg5);
+	register uintptr_t a6 asm ("a6") = (uintptr_t)(fid);
+	register uintptr_t a7 asm ("a7") = (uintptr_t)(ext);
+	struct sbiret ret;
+
+	asm volatile (
+		"ecall"
+		: "+r" (a0), "+r" (a1)
+		: "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6), "r" (a7)
+		: "memory");
+	ret.error = a0;
+	ret.value = a1;
+
+	return ret;
+}
+
+bool guest_sbi_probe_extension(int extid, long *out_val)
+{
+	struct sbiret ret;
+
+	ret = sbi_ecall(SBI_EXT_BASE, SBI_EXT_BASE_PROBE_EXT, extid,
+			0, 0, 0, 0, 0);
+
+	__GUEST_ASSERT(!ret.error || ret.error == SBI_ERR_NOT_SUPPORTED,
+		       "ret.error=%ld, ret.value=%ld\n", ret.error, ret.value);
+
+	if (ret.error == SBI_ERR_NOT_SUPPORTED)
+		return false;
+
+	if (out_val)
+		*out_val = ret.value;
+
+	return true;
 }
tools/testing/selftests/kvm/lib/riscv/ucall.c (-26)
···
 #include "kvm_util.h"
 #include "processor.h"
 
-struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
-			unsigned long arg1, unsigned long arg2,
-			unsigned long arg3, unsigned long arg4,
-			unsigned long arg5)
-{
-	register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);
-	register uintptr_t a1 asm ("a1") = (uintptr_t)(arg1);
-	register uintptr_t a2 asm ("a2") = (uintptr_t)(arg2);
-	register uintptr_t a3 asm ("a3") = (uintptr_t)(arg3);
-	register uintptr_t a4 asm ("a4") = (uintptr_t)(arg4);
-	register uintptr_t a5 asm ("a5") = (uintptr_t)(arg5);
-	register uintptr_t a6 asm ("a6") = (uintptr_t)(fid);
-	register uintptr_t a7 asm ("a7") = (uintptr_t)(ext);
-	struct sbiret ret;
-
-	asm volatile (
-		"ecall"
-		: "+r" (a0), "+r" (a1)
-		: "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6), "r" (a7)
-		: "memory");
-	ret.error = a0;
-	ret.value = a1;
-
-	return ret;
-}
-
 void *ucall_arch_get_ucall(struct kvm_vcpu *vcpu)
 {
 	struct kvm_run *run = vcpu->run;
tools/testing/selftests/kvm/riscv/get-reg-list.c (+274 -302)
···
 
 #define REG_MASK (KVM_REG_ARCH_MASK | KVM_REG_SIZE_MASK)
 
+enum {
+	VCPU_FEATURE_ISA_EXT = 0,
+	VCPU_FEATURE_SBI_EXT,
+};
+
 static bool isa_ext_cant_disable[KVM_RISCV_ISA_EXT_MAX];
 
 bool filter_reg(__u64 reg)
···
 	 *
 	 * Note: The below list is alphabetically sorted.
 	 */
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_A:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_C:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_D:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_F:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_H:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_I:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_M:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_V:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SMSTATEEN:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SSAIA:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SSTC:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SVINVAL:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SVNAPOT:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SVPBMT:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZBA:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZBB:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZBS:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICBOM:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICBOZ:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICNTR:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICOND:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICSR:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZIFENCEI:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZIHINTPAUSE:
-	case KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZIHPM:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_A:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_C:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_D:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_F:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_H:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_I:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_M:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_V:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SMSTATEEN:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SSAIA:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SSTC:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SVINVAL:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SVNAPOT:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SVPBMT:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZBA:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZBB:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZBS:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICBOM:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICBOZ:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICNTR:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICOND:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICSR:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZIFENCEI:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZIHINTPAUSE:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZIHPM:
+	/*
+	 * Like ISA_EXT registers, SBI_EXT registers are only visible when the
+	 * host supports them and disabling them does not affect the visibility
+	 * of the SBI_EXT register itself.
+	 */
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_V01:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_TIME:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_IPI:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_RFENCE:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_SRST:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_HSM:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_PMU:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_DBCN:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_STA:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_EXPERIMENTAL:
+	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_VENDOR:
 		return true;
 	/* AIA registers are always available when Ssaia can't be disabled */
 	case KVM_REG_RISCV_CSR | KVM_REG_RISCV_CSR_AIA | KVM_REG_RISCV_CSR_AIA_REG(siselect):
···
 	return err == EINVAL;
 }
 
-static inline bool vcpu_has_ext(struct kvm_vcpu *vcpu, int ext)
+static bool vcpu_has_ext(struct kvm_vcpu *vcpu, uint64_t ext_id)
 {
 	int ret;
 	unsigned long value;
 
-	ret = __vcpu_get_reg(vcpu, RISCV_ISA_EXT_REG(ext), &value);
+	ret = __vcpu_get_reg(vcpu, ext_id, &value);
 	return (ret) ? false : !!value;
 }
···
 {
 	unsigned long isa_ext_state[KVM_RISCV_ISA_EXT_MAX] = { 0 };
 	struct vcpu_reg_sublist *s;
+	uint64_t feature;
 	int rc;
 
 	for (int i = 0; i < KVM_RISCV_ISA_EXT_MAX; i++)
···
 		isa_ext_cant_disable[i] = true;
 	}
 
+	for (int i = 0; i < KVM_RISCV_SBI_EXT_MAX; i++) {
+		rc = __vcpu_set_reg(vcpu, RISCV_SBI_EXT_REG(i), 0);
+		TEST_ASSERT(!rc || (rc == -1 && errno == ENOENT), "Unexpected error");
+	}
+
 	for_each_sublist(c, s) {
 		if (!s->feature)
 			continue;
 
+		switch (s->feature_type) {
+		case VCPU_FEATURE_ISA_EXT:
+			feature = RISCV_ISA_EXT_REG(s->feature);
+			break;
+		case VCPU_FEATURE_SBI_EXT:
+			feature = RISCV_SBI_EXT_REG(s->feature);
+			break;
+		default:
+			TEST_FAIL("Unknown feature type");
+		}
+
 		/* Try to enable the desired extension */
-		__vcpu_set_reg(vcpu, RISCV_ISA_EXT_REG(s->feature), 1);
+		__vcpu_set_reg(vcpu, feature, 1);
 
 		/* Double check whether the desired extension was enabled */
-		__TEST_REQUIRE(vcpu_has_ext(vcpu, s->feature),
+		__TEST_REQUIRE(vcpu_has_ext(vcpu, feature),
 			       "%s not available, skipping tests\n", s->name);
 	}
 }
···
 }
 
 #define KVM_ISA_EXT_ARR(ext)		\
-[KVM_RISCV_ISA_EXT_##ext] = "KVM_RISCV_ISA_EXT_" #ext
+[KVM_RISCV_ISA_EXT_##ext] = "KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_" #ext
 
-static const char *isa_ext_id_to_str(const char *prefix, __u64 id)
+static const char *isa_ext_single_id_to_str(__u64 reg_off)
 {
-	/* reg_off is the offset into unsigned long kvm_isa_ext_arr[] */
-	__u64 reg_off = id & ~(REG_MASK | KVM_REG_RISCV_ISA_EXT);
-
-	assert((id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_ISA_EXT);
-
 	static const char * const kvm_isa_ext_reg_name[] = {
 		KVM_ISA_EXT_ARR(A),
 		KVM_ISA_EXT_ARR(C),
···
 	};
 
 	if (reg_off >= ARRAY_SIZE(kvm_isa_ext_reg_name))
-		return strdup_printf("%lld /* UNKNOWN */", reg_off);
+		return strdup_printf("KVM_REG_RISCV_ISA_SINGLE | %lld /* UNKNOWN */", reg_off);
 
 	return kvm_isa_ext_reg_name[reg_off];
+}
+
+static const char *isa_ext_multi_id_to_str(__u64 reg_subtype, __u64 reg_off)
+{
+	const char *unknown = "";
+
+	if (reg_off > KVM_REG_RISCV_ISA_MULTI_REG_LAST)
+		unknown = " /* UNKNOWN */";
+
+	switch (reg_subtype) {
+	case KVM_REG_RISCV_ISA_MULTI_EN:
+		return strdup_printf("KVM_REG_RISCV_ISA_MULTI_EN | %lld%s", reg_off, unknown);
+	case KVM_REG_RISCV_ISA_MULTI_DIS:
+		return strdup_printf("KVM_REG_RISCV_ISA_MULTI_DIS | %lld%s", reg_off, unknown);
+	}
+
+	return strdup_printf("%lld | %lld /* UNKNOWN */", reg_subtype, reg_off);
+}
+
+static const char *isa_ext_id_to_str(const char *prefix, __u64 id)
+{
+	__u64 reg_off = id & ~(REG_MASK | KVM_REG_RISCV_ISA_EXT);
+	__u64 reg_subtype = reg_off & KVM_REG_RISCV_SUBTYPE_MASK;
+
+	assert((id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_ISA_EXT);
+
+	reg_off &= ~KVM_REG_RISCV_SUBTYPE_MASK;
+
+	switch (reg_subtype) {
+	case KVM_REG_RISCV_ISA_SINGLE:
+		return isa_ext_single_id_to_str(reg_off);
+	case KVM_REG_RISCV_ISA_MULTI_EN:
+	case KVM_REG_RISCV_ISA_MULTI_DIS:
+		return isa_ext_multi_id_to_str(reg_subtype, reg_off);
+	}
+
+	return strdup_printf("%lld | %lld /* UNKNOWN */", reg_subtype, reg_off);
 }
 
 #define KVM_SBI_EXT_ARR(ext)		\
···
 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_SRST),
 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_HSM),
 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_PMU),
+	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_STA),
 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_EXPERIMENTAL),
 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_VENDOR),
 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_DBCN),
···
 	case KVM_REG_RISCV_SBI_MULTI_EN:
 	case KVM_REG_RISCV_SBI_MULTI_DIS:
 		return sbi_ext_multi_id_to_str(reg_subtype, reg_off);
+	}
+
+	return strdup_printf("%lld | %lld /* UNKNOWN */", reg_subtype, reg_off);
+}
+
+static const char *sbi_sta_id_to_str(__u64 reg_off)
+{
+	switch (reg_off) {
+	case 0: return "KVM_REG_RISCV_SBI_STA | KVM_REG_RISCV_SBI_STA_REG(shmem_lo)";
+	case 1: return "KVM_REG_RISCV_SBI_STA | KVM_REG_RISCV_SBI_STA_REG(shmem_hi)";
+	}
+	return strdup_printf("KVM_REG_RISCV_SBI_STA | %lld /* UNKNOWN */", reg_off);
+}
+
+static const char *sbi_id_to_str(const char *prefix, __u64 id)
+{
+	__u64 reg_off = id & ~(REG_MASK | KVM_REG_RISCV_SBI_STATE);
+	__u64 reg_subtype = reg_off & KVM_REG_RISCV_SUBTYPE_MASK;
+
+	assert((id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_SBI_STATE);
+
+	reg_off &= ~KVM_REG_RISCV_SUBTYPE_MASK;
+
+	switch (reg_subtype) {
+	case KVM_REG_RISCV_SBI_STA:
+		return sbi_sta_id_to_str(reg_off);
 	}
 
 	return strdup_printf("%lld | %lld /* UNKNOWN */", reg_subtype, reg_off);
···
 	case KVM_REG_RISCV_SBI_EXT:
 		printf("\tKVM_REG_RISCV | %s | KVM_REG_RISCV_SBI_EXT | %s,\n",
 		       reg_size, sbi_ext_id_to_str(prefix, id));
+		break;
+	case KVM_REG_RISCV_SBI_STATE:
+		printf("\tKVM_REG_RISCV | %s | KVM_REG_RISCV_SBI_STATE | %s,\n",
+		       reg_size, sbi_id_to_str(prefix, id));
 		break;
 	default:
 		printf("\tKVM_REG_RISCV | %s | 0x%llx /* UNKNOWN */,\n",
···
 	KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_TIMER | KVM_REG_RISCV_TIMER_REG(time),
 	KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_TIMER | KVM_REG_RISCV_TIMER_REG(compare),
 	KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_TIMER | KVM_REG_RISCV_TIMER_REG(state),
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_V01,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_TIME,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_IPI,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_RFENCE,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_SRST,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_HSM,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_PMU,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_EXPERIMENTAL,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_VENDOR,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_DBCN,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_MULTI_EN | 0,
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_MULTI_DIS | 0,
 };
 
 /*
···
 	KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_TIMER | KVM_REG_RISCV_TIMER_REG(state),
 };
 
-static __u64 h_regs[] = {
-	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_H,
+static __u64 sbi_base_regs[] = {
+	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_V01,
+	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_TIME,
+	KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_IPI,
+	KVM_REG_RISCV | KVM_REG_SIZE_ULONG |
KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_RFENCE, 591 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_SRST, 592 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_HSM, 593 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_EXPERIMENTAL, 594 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_VENDOR, 595 + }; 596 + 597 + static __u64 sbi_sta_regs[] = { 598 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_STA, 599 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_STATE | KVM_REG_RISCV_SBI_STA | KVM_REG_RISCV_SBI_STA_REG(shmem_lo), 600 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_SBI_STATE | KVM_REG_RISCV_SBI_STA | KVM_REG_RISCV_SBI_STA_REG(shmem_hi), 677 601 }; 678 602 679 603 static __u64 zicbom_regs[] = { 680 604 KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_CONFIG | KVM_REG_RISCV_CONFIG_REG(zicbom_block_size), 681 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICBOM, 605 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICBOM, 682 606 }; 683 607 684 608 static __u64 zicboz_regs[] = { 685 609 KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_CONFIG | KVM_REG_RISCV_CONFIG_REG(zicboz_block_size), 686 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICBOZ, 687 - }; 688 - 689 - static __u64 svpbmt_regs[] = { 690 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SVPBMT, 691 - }; 692 - 693 - static __u64 sstc_regs[] = { 694 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SSTC, 695 - }; 696 - 697 - static __u64 svinval_regs[] = { 698 - KVM_REG_RISCV | 
KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SVINVAL, 699 - }; 700 - 701 - static __u64 zihintpause_regs[] = { 702 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZIHINTPAUSE, 703 - }; 704 - 705 - static __u64 zba_regs[] = { 706 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZBA, 707 - }; 708 - 709 - static __u64 zbb_regs[] = { 710 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZBB, 711 - }; 712 - 713 - static __u64 zbs_regs[] = { 714 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZBS, 715 - }; 716 - 717 - static __u64 zicntr_regs[] = { 718 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICNTR, 719 - }; 720 - 721 - static __u64 zicond_regs[] = { 722 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICOND, 723 - }; 724 - 725 - static __u64 zicsr_regs[] = { 726 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZICSR, 727 - }; 728 - 729 - static __u64 zifencei_regs[] = { 730 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZIFENCEI, 731 - }; 732 - 733 - static __u64 zihpm_regs[] = { 734 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_ZIHPM, 610 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_ZICBOZ, 735 611 }; 736 612 737 613 static __u64 aia_regs[] = { ··· 707 653 KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_CSR | KVM_REG_RISCV_CSR_AIA | KVM_REG_RISCV_CSR_AIA_REG(siph), 708 654 KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_CSR | KVM_REG_RISCV_CSR_AIA | KVM_REG_RISCV_CSR_AIA_REG(iprio1h), 709 655 KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_CSR | KVM_REG_RISCV_CSR_AIA | KVM_REG_RISCV_CSR_AIA_REG(iprio2h), 710 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | 
KVM_RISCV_ISA_EXT_SSAIA, 656 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SSAIA, 711 657 }; 712 658 713 659 static __u64 smstateen_regs[] = { 714 660 KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_CSR | KVM_REG_RISCV_CSR_SMSTATEEN | KVM_REG_RISCV_CSR_SMSTATEEN_REG(sstateen0), 715 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_SMSTATEEN, 661 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SMSTATEEN, 716 662 }; 717 663 718 664 static __u64 fp_f_regs[] = { ··· 749 695 KVM_REG_RISCV | KVM_REG_SIZE_U32 | KVM_REG_RISCV_FP_F | KVM_REG_RISCV_FP_F_REG(f[30]), 750 696 KVM_REG_RISCV | KVM_REG_SIZE_U32 | KVM_REG_RISCV_FP_F | KVM_REG_RISCV_FP_F_REG(f[31]), 751 697 KVM_REG_RISCV | KVM_REG_SIZE_U32 | KVM_REG_RISCV_FP_F | KVM_REG_RISCV_FP_F_REG(fcsr), 752 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_F, 698 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_F, 753 699 }; 754 700 755 701 static __u64 fp_d_regs[] = { ··· 786 732 KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_FP_D | KVM_REG_RISCV_FP_D_REG(f[30]), 787 733 KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_FP_D | KVM_REG_RISCV_FP_D_REG(f[31]), 788 734 KVM_REG_RISCV | KVM_REG_SIZE_U32 | KVM_REG_RISCV_FP_D | KVM_REG_RISCV_FP_D_REG(fcsr), 789 - KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_RISCV_ISA_EXT_D, 735 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_D, 790 736 }; 791 737 792 - #define BASE_SUBLIST \ 738 + #define SUBLIST_BASE \ 793 739 {"base", .regs = base_regs, .regs_n = ARRAY_SIZE(base_regs), \ 794 740 .skips_set = base_skips_set, .skips_set_n = ARRAY_SIZE(base_skips_set),} 795 - #define H_REGS_SUBLIST \ 796 - {"h", .feature = KVM_RISCV_ISA_EXT_H, .regs = h_regs, .regs_n = 
ARRAY_SIZE(h_regs),} 797 - #define ZICBOM_REGS_SUBLIST \ 741 + #define SUBLIST_SBI_BASE \ 742 + {"sbi-base", .feature_type = VCPU_FEATURE_SBI_EXT, .feature = KVM_RISCV_SBI_EXT_V01, \ 743 + .regs = sbi_base_regs, .regs_n = ARRAY_SIZE(sbi_base_regs),} 744 + #define SUBLIST_SBI_STA \ 745 + {"sbi-sta", .feature_type = VCPU_FEATURE_SBI_EXT, .feature = KVM_RISCV_SBI_EXT_STA, \ 746 + .regs = sbi_sta_regs, .regs_n = ARRAY_SIZE(sbi_sta_regs),} 747 + #define SUBLIST_ZICBOM \ 798 748 {"zicbom", .feature = KVM_RISCV_ISA_EXT_ZICBOM, .regs = zicbom_regs, .regs_n = ARRAY_SIZE(zicbom_regs),} 799 - #define ZICBOZ_REGS_SUBLIST \ 749 + #define SUBLIST_ZICBOZ \ 800 750 {"zicboz", .feature = KVM_RISCV_ISA_EXT_ZICBOZ, .regs = zicboz_regs, .regs_n = ARRAY_SIZE(zicboz_regs),} 801 - #define SVPBMT_REGS_SUBLIST \ 802 - {"svpbmt", .feature = KVM_RISCV_ISA_EXT_SVPBMT, .regs = svpbmt_regs, .regs_n = ARRAY_SIZE(svpbmt_regs),} 803 - #define SSTC_REGS_SUBLIST \ 804 - {"sstc", .feature = KVM_RISCV_ISA_EXT_SSTC, .regs = sstc_regs, .regs_n = ARRAY_SIZE(sstc_regs),} 805 - #define SVINVAL_REGS_SUBLIST \ 806 - {"svinval", .feature = KVM_RISCV_ISA_EXT_SVINVAL, .regs = svinval_regs, .regs_n = ARRAY_SIZE(svinval_regs),} 807 - #define ZIHINTPAUSE_REGS_SUBLIST \ 808 - {"zihintpause", .feature = KVM_RISCV_ISA_EXT_ZIHINTPAUSE, .regs = zihintpause_regs, .regs_n = ARRAY_SIZE(zihintpause_regs),} 809 - #define ZBA_REGS_SUBLIST \ 810 - {"zba", .feature = KVM_RISCV_ISA_EXT_ZBA, .regs = zba_regs, .regs_n = ARRAY_SIZE(zba_regs),} 811 - #define ZBB_REGS_SUBLIST \ 812 - {"zbb", .feature = KVM_RISCV_ISA_EXT_ZBB, .regs = zbb_regs, .regs_n = ARRAY_SIZE(zbb_regs),} 813 - #define ZBS_REGS_SUBLIST \ 814 - {"zbs", .feature = KVM_RISCV_ISA_EXT_ZBS, .regs = zbs_regs, .regs_n = ARRAY_SIZE(zbs_regs),} 815 - #define ZICNTR_REGS_SUBLIST \ 816 - {"zicntr", .feature = KVM_RISCV_ISA_EXT_ZICNTR, .regs = zicntr_regs, .regs_n = ARRAY_SIZE(zicntr_regs),} 817 - #define ZICOND_REGS_SUBLIST \ 818 - {"zicond", .feature = 
KVM_RISCV_ISA_EXT_ZICOND, .regs = zicond_regs, .regs_n = ARRAY_SIZE(zicond_regs),} 819 - #define ZICSR_REGS_SUBLIST \ 820 - {"zicsr", .feature = KVM_RISCV_ISA_EXT_ZICSR, .regs = zicsr_regs, .regs_n = ARRAY_SIZE(zicsr_regs),} 821 - #define ZIFENCEI_REGS_SUBLIST \ 822 - {"zifencei", .feature = KVM_RISCV_ISA_EXT_ZIFENCEI, .regs = zifencei_regs, .regs_n = ARRAY_SIZE(zifencei_regs),} 823 - #define ZIHPM_REGS_SUBLIST \ 824 - {"zihpm", .feature = KVM_RISCV_ISA_EXT_ZIHPM, .regs = zihpm_regs, .regs_n = ARRAY_SIZE(zihpm_regs),} 825 - #define AIA_REGS_SUBLIST \ 751 + #define SUBLIST_AIA \ 826 752 {"aia", .feature = KVM_RISCV_ISA_EXT_SSAIA, .regs = aia_regs, .regs_n = ARRAY_SIZE(aia_regs),} 827 - #define SMSTATEEN_REGS_SUBLIST \ 753 + #define SUBLIST_SMSTATEEN \ 828 754 {"smstateen", .feature = KVM_RISCV_ISA_EXT_SMSTATEEN, .regs = smstateen_regs, .regs_n = ARRAY_SIZE(smstateen_regs),} 829 - #define FP_F_REGS_SUBLIST \ 755 + #define SUBLIST_FP_F \ 830 756 {"fp_f", .feature = KVM_RISCV_ISA_EXT_F, .regs = fp_f_regs, \ 831 757 .regs_n = ARRAY_SIZE(fp_f_regs),} 832 - #define FP_D_REGS_SUBLIST \ 758 + #define SUBLIST_FP_D \ 833 759 {"fp_d", .feature = KVM_RISCV_ISA_EXT_D, .regs = fp_d_regs, \ 834 760 .regs_n = ARRAY_SIZE(fp_d_regs),} 835 761 836 - static struct vcpu_reg_list h_config = { 837 - .sublists = { 838 - BASE_SUBLIST, 839 - H_REGS_SUBLIST, 840 - {0}, 841 - }, 842 - }; 762 + #define KVM_ISA_EXT_SIMPLE_CONFIG(ext, extu) \ 763 + static __u64 regs_##ext[] = { \ 764 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | \ 765 + KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | \ 766 + KVM_RISCV_ISA_EXT_##extu, \ 767 + }; \ 768 + static struct vcpu_reg_list config_##ext = { \ 769 + .sublists = { \ 770 + SUBLIST_BASE, \ 771 + { \ 772 + .name = #ext, \ 773 + .feature = KVM_RISCV_ISA_EXT_##extu, \ 774 + .regs = regs_##ext, \ 775 + .regs_n = ARRAY_SIZE(regs_##ext), \ 776 + }, \ 777 + {0}, \ 778 + }, \ 779 + } \ 843 780 844 - static struct vcpu_reg_list zicbom_config = { 845 - .sublists = { 846 - 
BASE_SUBLIST, 847 - ZICBOM_REGS_SUBLIST, 848 - {0}, 849 - }, 850 - }; 781 + #define KVM_SBI_EXT_SIMPLE_CONFIG(ext, extu) \ 782 + static __u64 regs_sbi_##ext[] = { \ 783 + KVM_REG_RISCV | KVM_REG_SIZE_ULONG | \ 784 + KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | \ 785 + KVM_RISCV_SBI_EXT_##extu, \ 786 + }; \ 787 + static struct vcpu_reg_list config_sbi_##ext = { \ 788 + .sublists = { \ 789 + SUBLIST_BASE, \ 790 + { \ 791 + .name = "sbi-"#ext, \ 792 + .feature_type = VCPU_FEATURE_SBI_EXT, \ 793 + .feature = KVM_RISCV_SBI_EXT_##extu, \ 794 + .regs = regs_sbi_##ext, \ 795 + .regs_n = ARRAY_SIZE(regs_sbi_##ext), \ 796 + }, \ 797 + {0}, \ 798 + }, \ 799 + } \ 851 800 852 - static struct vcpu_reg_list zicboz_config = { 853 - .sublists = { 854 - BASE_SUBLIST, 855 - ZICBOZ_REGS_SUBLIST, 856 - {0}, 857 - }, 858 - }; 801 + #define KVM_ISA_EXT_SUBLIST_CONFIG(ext, extu) \ 802 + static struct vcpu_reg_list config_##ext = { \ 803 + .sublists = { \ 804 + SUBLIST_BASE, \ 805 + SUBLIST_##extu, \ 806 + {0}, \ 807 + }, \ 808 + } \ 859 809 860 - static struct vcpu_reg_list svpbmt_config = { 861 - .sublists = { 862 - BASE_SUBLIST, 863 - SVPBMT_REGS_SUBLIST, 864 - {0}, 865 - }, 866 - }; 810 + #define KVM_SBI_EXT_SUBLIST_CONFIG(ext, extu) \ 811 + static struct vcpu_reg_list config_sbi_##ext = { \ 812 + .sublists = { \ 813 + SUBLIST_BASE, \ 814 + SUBLIST_SBI_##extu, \ 815 + {0}, \ 816 + }, \ 817 + } \ 867 818 868 - static struct vcpu_reg_list sstc_config = { 869 - .sublists = { 870 - BASE_SUBLIST, 871 - SSTC_REGS_SUBLIST, 872 - {0}, 873 - }, 874 - }; 819 + /* Note: The below list is alphabetically sorted. 
*/ 875 820 876 - static struct vcpu_reg_list svinval_config = { 877 - .sublists = { 878 - BASE_SUBLIST, 879 - SVINVAL_REGS_SUBLIST, 880 - {0}, 881 - }, 882 - }; 821 + KVM_SBI_EXT_SUBLIST_CONFIG(base, BASE); 822 + KVM_SBI_EXT_SUBLIST_CONFIG(sta, STA); 823 + KVM_SBI_EXT_SIMPLE_CONFIG(pmu, PMU); 824 + KVM_SBI_EXT_SIMPLE_CONFIG(dbcn, DBCN); 883 825 884 - static struct vcpu_reg_list zihintpause_config = { 885 - .sublists = { 886 - BASE_SUBLIST, 887 - ZIHINTPAUSE_REGS_SUBLIST, 888 - {0}, 889 - }, 890 - }; 891 - 892 - static struct vcpu_reg_list zba_config = { 893 - .sublists = { 894 - BASE_SUBLIST, 895 - ZBA_REGS_SUBLIST, 896 - {0}, 897 - }, 898 - }; 899 - 900 - static struct vcpu_reg_list zbb_config = { 901 - .sublists = { 902 - BASE_SUBLIST, 903 - ZBB_REGS_SUBLIST, 904 - {0}, 905 - }, 906 - }; 907 - 908 - static struct vcpu_reg_list zbs_config = { 909 - .sublists = { 910 - BASE_SUBLIST, 911 - ZBS_REGS_SUBLIST, 912 - {0}, 913 - }, 914 - }; 915 - 916 - static struct vcpu_reg_list zicntr_config = { 917 - .sublists = { 918 - BASE_SUBLIST, 919 - ZICNTR_REGS_SUBLIST, 920 - {0}, 921 - }, 922 - }; 923 - 924 - static struct vcpu_reg_list zicond_config = { 925 - .sublists = { 926 - BASE_SUBLIST, 927 - ZICOND_REGS_SUBLIST, 928 - {0}, 929 - }, 930 - }; 931 - 932 - static struct vcpu_reg_list zicsr_config = { 933 - .sublists = { 934 - BASE_SUBLIST, 935 - ZICSR_REGS_SUBLIST, 936 - {0}, 937 - }, 938 - }; 939 - 940 - static struct vcpu_reg_list zifencei_config = { 941 - .sublists = { 942 - BASE_SUBLIST, 943 - ZIFENCEI_REGS_SUBLIST, 944 - {0}, 945 - }, 946 - }; 947 - 948 - static struct vcpu_reg_list zihpm_config = { 949 - .sublists = { 950 - BASE_SUBLIST, 951 - ZIHPM_REGS_SUBLIST, 952 - {0}, 953 - }, 954 - }; 955 - 956 - static struct vcpu_reg_list aia_config = { 957 - .sublists = { 958 - BASE_SUBLIST, 959 - AIA_REGS_SUBLIST, 960 - {0}, 961 - }, 962 - }; 963 - 964 - static struct vcpu_reg_list smstateen_config = { 965 - .sublists = { 966 - BASE_SUBLIST, 967 - SMSTATEEN_REGS_SUBLIST, 
968 - {0}, 969 - }, 970 - }; 971 - 972 - static struct vcpu_reg_list fp_f_config = { 973 - .sublists = { 974 - BASE_SUBLIST, 975 - FP_F_REGS_SUBLIST, 976 - {0}, 977 - }, 978 - }; 979 - 980 - static struct vcpu_reg_list fp_d_config = { 981 - .sublists = { 982 - BASE_SUBLIST, 983 - FP_D_REGS_SUBLIST, 984 - {0}, 985 - }, 986 - }; 826 + KVM_ISA_EXT_SUBLIST_CONFIG(aia, AIA); 827 + KVM_ISA_EXT_SUBLIST_CONFIG(fp_f, FP_F); 828 + KVM_ISA_EXT_SUBLIST_CONFIG(fp_d, FP_D); 829 + KVM_ISA_EXT_SIMPLE_CONFIG(h, H); 830 + KVM_ISA_EXT_SUBLIST_CONFIG(smstateen, SMSTATEEN); 831 + KVM_ISA_EXT_SIMPLE_CONFIG(sstc, SSTC); 832 + KVM_ISA_EXT_SIMPLE_CONFIG(svinval, SVINVAL); 833 + KVM_ISA_EXT_SIMPLE_CONFIG(svnapot, SVNAPOT); 834 + KVM_ISA_EXT_SIMPLE_CONFIG(svpbmt, SVPBMT); 835 + KVM_ISA_EXT_SIMPLE_CONFIG(zba, ZBA); 836 + KVM_ISA_EXT_SIMPLE_CONFIG(zbb, ZBB); 837 + KVM_ISA_EXT_SIMPLE_CONFIG(zbs, ZBS); 838 + KVM_ISA_EXT_SUBLIST_CONFIG(zicbom, ZICBOM); 839 + KVM_ISA_EXT_SUBLIST_CONFIG(zicboz, ZICBOZ); 840 + KVM_ISA_EXT_SIMPLE_CONFIG(zicntr, ZICNTR); 841 + KVM_ISA_EXT_SIMPLE_CONFIG(zicond, ZICOND); 842 + KVM_ISA_EXT_SIMPLE_CONFIG(zicsr, ZICSR); 843 + KVM_ISA_EXT_SIMPLE_CONFIG(zifencei, ZIFENCEI); 844 + KVM_ISA_EXT_SIMPLE_CONFIG(zihintpause, ZIHINTPAUSE); 845 + KVM_ISA_EXT_SIMPLE_CONFIG(zihpm, ZIHPM); 987 846 988 847 struct vcpu_reg_list *vcpu_configs[] = { 989 - &h_config, 990 - &zicbom_config, 991 - &zicboz_config, 992 - &svpbmt_config, 993 - &sstc_config, 994 - &svinval_config, 995 - &zihintpause_config, 996 - &zba_config, 997 - &zbb_config, 998 - &zbs_config, 999 - &zicntr_config, 1000 - &zicond_config, 1001 - &zicsr_config, 1002 - &zifencei_config, 1003 - &zihpm_config, 1004 - &aia_config, 1005 - &smstateen_config, 1006 - &fp_f_config, 1007 - &fp_d_config, 848 + &config_sbi_base, 849 + &config_sbi_sta, 850 + &config_sbi_pmu, 851 + &config_sbi_dbcn, 852 + &config_aia, 853 + &config_fp_f, 854 + &config_fp_d, 855 + &config_h, 856 + &config_smstateen, 857 + &config_sstc, 858 + &config_svinval, 859 
+ &config_svnapot, 860 + &config_svpbmt, 861 + &config_zba, 862 + &config_zbb, 863 + &config_zbs, 864 + &config_zicbom, 865 + &config_zicboz, 866 + &config_zicntr, 867 + &config_zicond, 868 + &config_zicsr, 869 + &config_zifencei, 870 + &config_zihintpause, 871 + &config_zihpm, 1008 872 }; 1009 873 int vcpu_configs_n = ARRAY_SIZE(vcpu_configs);
+3 -8
tools/testing/selftests/kvm/s390x/cmma_test.c
···
 	);
 }
 
-static struct kvm_vm *create_vm(void)
-{
-	return ____vm_create(VM_MODE_DEFAULT);
-}
-
 static void create_main_memslot(struct kvm_vm *vm)
 {
 	int i;
···
 {
 	struct kvm_vm *vm;
 
-	vm = create_vm();
+	vm = vm_create_barebones();
 
 	create_memslots(vm);
 
···
 
 static void test_migration_mode(void)
 {
-	struct kvm_vm *vm = create_vm();
+	struct kvm_vm *vm = vm_create_barebones();
 	struct kvm_vcpu *vcpu;
 	u64 orig_psw;
 	int rc;
···
  */
 static int machine_has_cmma(void)
 {
-	struct kvm_vm *vm = create_vm();
+	struct kvm_vm *vm = vm_create_barebones();
 	int r;
 
 	r = !__kvm_has_device_attr(vm->fd, KVM_S390_VM_MEM_CTRL, KVM_S390_VM_MEM_ENABLE_CMMA);
+156 -5
tools/testing/selftests/kvm/set_memory_region_test.c
··· 157 157 */ 158 158 val = guest_spin_on_val(0); 159 159 __GUEST_ASSERT(val == 1 || val == MMIO_VAL, 160 - "Expected '1' or MMIO ('%llx'), got '%llx'", MMIO_VAL, val); 160 + "Expected '1' or MMIO ('%lx'), got '%lx'", MMIO_VAL, val); 161 161 162 162 /* Spin until the misaligning memory region move completes. */ 163 163 val = guest_spin_on_val(MMIO_VAL); 164 164 __GUEST_ASSERT(val == 1 || val == 0, 165 - "Expected '0' or '1' (no MMIO), got '%llx'", val); 165 + "Expected '0' or '1' (no MMIO), got '%lx'", val); 166 166 167 167 /* Spin until the memory region starts to get re-aligned. */ 168 168 val = guest_spin_on_val(0); 169 169 __GUEST_ASSERT(val == 1 || val == MMIO_VAL, 170 - "Expected '1' or MMIO ('%llx'), got '%llx'", MMIO_VAL, val); 170 + "Expected '1' or MMIO ('%lx'), got '%lx'", MMIO_VAL, val); 171 171 172 172 /* Spin until the re-aligning memory region move completes. */ 173 173 val = guest_spin_on_val(MMIO_VAL); ··· 326 326 } 327 327 #endif /* __x86_64__ */ 328 328 329 + static void test_invalid_memory_region_flags(void) 330 + { 331 + uint32_t supported_flags = KVM_MEM_LOG_DIRTY_PAGES; 332 + const uint32_t v2_only_flags = KVM_MEM_GUEST_MEMFD; 333 + struct kvm_vm *vm; 334 + int r, i; 335 + 336 + #if defined __aarch64__ || defined __x86_64__ 337 + supported_flags |= KVM_MEM_READONLY; 338 + #endif 339 + 340 + #ifdef __x86_64__ 341 + if (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM)) 342 + vm = vm_create_barebones_protected_vm(); 343 + else 344 + #endif 345 + vm = vm_create_barebones(); 346 + 347 + if (kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES) & KVM_MEMORY_ATTRIBUTE_PRIVATE) 348 + supported_flags |= KVM_MEM_GUEST_MEMFD; 349 + 350 + for (i = 0; i < 32; i++) { 351 + if ((supported_flags & BIT(i)) && !(v2_only_flags & BIT(i))) 352 + continue; 353 + 354 + r = __vm_set_user_memory_region(vm, 0, BIT(i), 355 + 0, MEM_REGION_SIZE, NULL); 356 + 357 + TEST_ASSERT(r && errno == EINVAL, 358 + "KVM_SET_USER_MEMORY_REGION should have failed on v2 only flag 
0x%lx", BIT(i)); 359 + 360 + if (supported_flags & BIT(i)) 361 + continue; 362 + 363 + r = __vm_set_user_memory_region2(vm, 0, BIT(i), 364 + 0, MEM_REGION_SIZE, NULL, 0, 0); 365 + TEST_ASSERT(r && errno == EINVAL, 366 + "KVM_SET_USER_MEMORY_REGION2 should have failed on unsupported flag 0x%lx", BIT(i)); 367 + } 368 + 369 + if (supported_flags & KVM_MEM_GUEST_MEMFD) { 370 + r = __vm_set_user_memory_region2(vm, 0, 371 + KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_GUEST_MEMFD, 372 + 0, MEM_REGION_SIZE, NULL, 0, 0); 373 + TEST_ASSERT(r && errno == EINVAL, 374 + "KVM_SET_USER_MEMORY_REGION2 should have failed, dirty logging private memory is unsupported"); 375 + } 376 + } 377 + 329 378 /* 330 379 * Test it can be added memory slots up to KVM_CAP_NR_MEMSLOTS, then any 331 380 * tentative to add further slots should fail. ··· 434 385 kvm_vm_free(vm); 435 386 } 436 387 388 + 389 + #ifdef __x86_64__ 390 + static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd, 391 + size_t offset, const char *msg) 392 + { 393 + int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 394 + MEM_REGION_GPA, MEM_REGION_SIZE, 395 + 0, memfd, offset); 396 + TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg); 397 + } 398 + 399 + static void test_add_private_memory_region(void) 400 + { 401 + struct kvm_vm *vm, *vm2; 402 + int memfd, i; 403 + 404 + pr_info("Testing ADD of KVM_MEM_GUEST_MEMFD memory regions\n"); 405 + 406 + vm = vm_create_barebones_protected_vm(); 407 + 408 + test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail"); 409 + test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail"); 410 + 411 + memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false); 412 + test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail"); 413 + close(memfd); 414 + 415 + vm2 = vm_create_barebones_protected_vm(); 416 + memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0); 417 + test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should fail"); 418 + 419 + 
vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 420 + MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0); 421 + close(memfd); 422 + kvm_vm_free(vm2); 423 + 424 + memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0); 425 + for (i = 1; i < PAGE_SIZE; i++) 426 + test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should fail"); 427 + 428 + vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 429 + MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0); 430 + close(memfd); 431 + 432 + kvm_vm_free(vm); 433 + } 434 + 435 + static void test_add_overlapping_private_memory_regions(void) 436 + { 437 + struct kvm_vm *vm; 438 + int memfd; 439 + int r; 440 + 441 + pr_info("Testing ADD of overlapping KVM_MEM_GUEST_MEMFD memory regions\n"); 442 + 443 + vm = vm_create_barebones_protected_vm(); 444 + 445 + memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0); 446 + 447 + vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 448 + MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, memfd, 0); 449 + 450 + vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_GUEST_MEMFD, 451 + MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2, 452 + 0, memfd, MEM_REGION_SIZE * 2); 453 + 454 + /* 455 + * Delete the first memslot, and then attempt to recreate it except 456 + * with a "bad" offset that results in overlap in the guest_memfd(). 457 + */ 458 + vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 459 + MEM_REGION_GPA, 0, NULL, -1, 0); 460 + 461 + /* Overlap the front half of the other slot. */ 462 + r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 463 + MEM_REGION_GPA * 2 - MEM_REGION_SIZE, 464 + MEM_REGION_SIZE * 2, 465 + 0, memfd, 0); 466 + TEST_ASSERT(r == -1 && errno == EEXIST, "%s", 467 + "Overlapping guest_memfd() bindings should fail with EEXIST"); 468 + 469 + /* And now the back half of the other slot. 
*/ 470 + r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_GUEST_MEMFD, 471 + MEM_REGION_GPA * 2 + MEM_REGION_SIZE, 472 + MEM_REGION_SIZE * 2, 473 + 0, memfd, 0); 474 + TEST_ASSERT(r == -1 && errno == EEXIST, "%s", 475 + "Overlapping guest_memfd() bindings should fail with EEXIST"); 476 + 477 + close(memfd); 478 + kvm_vm_free(vm); 479 + } 480 + #endif 481 + 437 482 int main(int argc, char *argv[]) 438 483 { 439 484 #ifdef __x86_64__ 440 485 int i, loops; 441 - #endif 442 486 443 - #ifdef __x86_64__ 444 487 /* 445 488 * FIXME: the zero-memslot test fails on aarch64 and s390x because 446 489 * KVM_RUN fails with ENOEXEC or EFAULT. ··· 540 399 test_zero_memory_regions(); 541 400 #endif 542 401 402 + test_invalid_memory_region_flags(); 403 + 543 404 test_add_max_memory_regions(); 544 405 545 406 #ifdef __x86_64__ 407 + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) && 408 + (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) { 409 + test_add_private_memory_region(); 410 + test_add_overlapping_private_memory_regions(); 411 + } else { 412 + pr_info("Skipping tests for KVM_MEM_GUEST_MEMFD memory regions\n"); 413 + } 414 + 546 415 if (argc > 1) 547 416 loops = atoi_positive("Number of iterations", argv[1]); 548 417 else
+99
tools/testing/selftests/kvm/steal_time.c
··· 11 11 #include <pthread.h> 12 12 #include <linux/kernel.h> 13 13 #include <asm/kvm.h> 14 + #ifndef __riscv 14 15 #include <asm/kvm_para.h> 16 + #endif 15 17 16 18 #include "test_util.h" 17 19 #include "kvm_util.h" ··· 203 201 pr_info(" rev: %d\n", st->rev); 204 202 pr_info(" attr: %d\n", st->attr); 205 203 pr_info(" st_time: %ld\n", st->st_time); 204 + } 205 + 206 + #elif defined(__riscv) 207 + 208 + /* SBI STA shmem must have 64-byte alignment */ 209 + #define STEAL_TIME_SIZE ((sizeof(struct sta_struct) + 63) & ~63) 210 + 211 + static vm_paddr_t st_gpa[NR_VCPUS]; 212 + 213 + struct sta_struct { 214 + uint32_t sequence; 215 + uint32_t flags; 216 + uint64_t steal; 217 + uint8_t preempted; 218 + uint8_t pad[47]; 219 + } __packed; 220 + 221 + static void sta_set_shmem(vm_paddr_t gpa, unsigned long flags) 222 + { 223 + unsigned long lo = (unsigned long)gpa; 224 + #if __riscv_xlen == 32 225 + unsigned long hi = (unsigned long)(gpa >> 32); 226 + #else 227 + unsigned long hi = gpa == -1 ? -1 : 0; 228 + #endif 229 + struct sbiret ret = sbi_ecall(SBI_EXT_STA, 0, lo, hi, flags, 0, 0, 0); 230 + 231 + GUEST_ASSERT(ret.value == 0 && ret.error == 0); 232 + } 233 + 234 + static void check_status(struct sta_struct *st) 235 + { 236 + GUEST_ASSERT(!(READ_ONCE(st->sequence) & 1)); 237 + GUEST_ASSERT(READ_ONCE(st->flags) == 0); 238 + GUEST_ASSERT(READ_ONCE(st->preempted) == 0); 239 + } 240 + 241 + static void guest_code(int cpu) 242 + { 243 + struct sta_struct *st = st_gva[cpu]; 244 + uint32_t sequence; 245 + long out_val = 0; 246 + bool probe; 247 + 248 + probe = guest_sbi_probe_extension(SBI_EXT_STA, &out_val); 249 + GUEST_ASSERT(probe && out_val == 1); 250 + 251 + sta_set_shmem(st_gpa[cpu], 0); 252 + GUEST_SYNC(0); 253 + 254 + check_status(st); 255 + WRITE_ONCE(guest_stolen_time[cpu], st->steal); 256 + sequence = READ_ONCE(st->sequence); 257 + check_status(st); 258 + GUEST_SYNC(1); 259 + 260 + check_status(st); 261 + GUEST_ASSERT(sequence < READ_ONCE(st->sequence)); 262 + 
WRITE_ONCE(guest_stolen_time[cpu], st->steal); 263 + check_status(st); 264 + GUEST_DONE(); 265 + } 266 + 267 + static bool is_steal_time_supported(struct kvm_vcpu *vcpu) 268 + { 269 + uint64_t id = RISCV_SBI_EXT_REG(KVM_RISCV_SBI_EXT_STA); 270 + unsigned long enabled; 271 + 272 + vcpu_get_reg(vcpu, id, &enabled); 273 + TEST_ASSERT(enabled == 0 || enabled == 1, "Expected boolean result"); 274 + 275 + return enabled; 276 + } 277 + 278 + static void steal_time_init(struct kvm_vcpu *vcpu, uint32_t i) 279 + { 280 + /* ST_GPA_BASE is identity mapped */ 281 + st_gva[i] = (void *)(ST_GPA_BASE + i * STEAL_TIME_SIZE); 282 + st_gpa[i] = addr_gva2gpa(vcpu->vm, (vm_vaddr_t)st_gva[i]); 283 + sync_global_to_guest(vcpu->vm, st_gva[i]); 284 + sync_global_to_guest(vcpu->vm, st_gpa[i]); 285 + } 286 + 287 + static void steal_time_dump(struct kvm_vm *vm, uint32_t vcpu_idx) 288 + { 289 + struct sta_struct *st = addr_gva2hva(vm, (ulong)st_gva[vcpu_idx]); 290 + int i; 291 + 292 + pr_info("VCPU%d:\n", vcpu_idx); 293 + pr_info(" sequence: %d\n", st->sequence); 294 + pr_info(" flags: %d\n", st->flags); 295 + pr_info(" steal: %"PRIu64"\n", st->steal); 296 + pr_info(" preempted: %d\n", st->preempted); 297 + pr_info(" pad: "); 298 + for (i = 0; i < 47; ++i) 299 + pr_info("%d", st->pad[i]); 300 + pr_info("\n"); 206 301 } 207 302 208 303 #endif
+2
tools/testing/selftests/kvm/x86_64/hyperv_clock.c
···
 	vm_vaddr_t tsc_page_gva;
 	int stage;
 
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_TIME));
+
 	vm = vm_create_with_one_vcpu(&vcpu, guest_main);
 
 	vcpu_set_hv_cpuid(vcpu);
+3 -2
tools/testing/selftests/kvm/x86_64/hyperv_evmcs.c
···
 	struct ucall uc;
 	int stage;
 
-	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
-
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_NESTED_STATE));
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_ENLIGHTENED_VMCS));
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_DIRECT_TLBFLUSH));
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
 
 	hcall_page = vm_vaddr_alloc_pages(vm, 1);
 	memset(addr_gva2hva(vm, hcall_page), 0x0, getpagesize());
+2
tools/testing/selftests/kvm/x86_64/hyperv_extended_hypercalls.c
··· 
43 43     uint64_t *outval;
44 44     struct ucall uc;
45 45 
46 +     TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_CPUID));
47 + 
46 48     /* Verify if extended hypercalls are supported */
47 49     if (!kvm_cpuid_has(kvm_get_supported_hv_cpuid(),
48 50                HV_ENABLE_EXTENDED_HYPERCALLS)) {
+7 -5
tools/testing/selftests/kvm/x86_64/hyperv_features.c
··· 
55 55     if (msr->fault_expected)
56 56         __GUEST_ASSERT(vector == GP_VECTOR,
57 57                    "Expected #GP on %sMSR(0x%x), got vector '0x%x'",
58 -                    msr->idx, msr->write ? "WR" : "RD", vector);
58 +                    msr->write ? "WR" : "RD", msr->idx, vector);
59 59     else
60 60         __GUEST_ASSERT(!vector,
61 61                    "Expected success on %sMSR(0x%x), got vector '0x%x'",
62 -                    msr->idx, msr->write ? "WR" : "RD", vector);
62 +                    msr->write ? "WR" : "RD", msr->idx, vector);
63 63 
64 64     if (vector || is_write_only_msr(msr->idx))
65 65         goto done;
66 66 
67 67     if (msr->write)
68 68         __GUEST_ASSERT(!vector,
69 -                    "WRMSR(0x%x) to '0x%llx', RDMSR read '0x%llx'",
69 +                    "WRMSR(0x%x) to '0x%lx', RDMSR read '0x%lx'",
70 70                    msr->idx, msr->write_val, msr_val);
71 71 
72 72     /* Invariant TSC bit appears when TSC invariant control MSR is written to */
··· 
102 102     vector = __hyperv_hypercall(hcall->control, input, output, &res);
103 103     if (hcall->ud_expected) {
104 104         __GUEST_ASSERT(vector == UD_VECTOR,
105 -                    "Expected #UD for control '%u', got vector '0x%x'",
105 +                    "Expected #UD for control '%lu', got vector '0x%x'",
106 106                    hcall->control, vector);
107 107     } else {
108 108         __GUEST_ASSERT(!vector,
109 -                    "Expected no exception for control '%u', got vector '0x%x'",
109 +                    "Expected no exception for control '%lu', got vector '0x%x'",
110 110                    hcall->control, vector);
111 111         GUEST_ASSERT_EQ(res, hcall->expect);
112 112     }
··· 
690 690 
691 691 int main(void)
692 692 {
693 +     TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_ENFORCE_CPUID));
694 + 
693 695     pr_info("Testing access to Hyper-V specific MSRs\n");
694 696     guest_test_msrs_access();
695 697 
+2
tools/testing/selftests/kvm/x86_64/hyperv_ipi.c
··· 
248 248     int stage = 1, r;
249 249     struct ucall uc;
250 250 
251 +     TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_SEND_IPI));
252 + 
251 253     vm = vm_create_with_one_vcpu(&vcpu[0], sender_guest_code);
252 254 
253 255     /* Hypercall input/output */
+1
tools/testing/selftests/kvm/x86_64/hyperv_svm_test.c
··· 
158 158     int stage;
159 159 
160 160     TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM));
161 +     TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_DIRECT_TLBFLUSH));
161 162 
162 163     /* Create VM */
163 164     vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+2
tools/testing/selftests/kvm/x86_64/hyperv_tlb_flush.c
··· 
590 590     struct ucall uc;
591 591     int stage = 1, r, i;
592 592 
593 +     TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_TLBFLUSH));
594 + 
593 595     vm = vm_create_with_one_vcpu(&vcpu[0], sender_guest_code);
594 596 
595 597     /* Test data page */
-121
tools/testing/selftests/kvm/x86_64/mmio_warning_test.c
··· 
1 - /*
2 -  * mmio_warning_test
3 -  *
4 -  * Copyright (C) 2019, Google LLC.
5 -  *
6 -  * This work is licensed under the terms of the GNU GPL, version 2.
7 -  *
8 -  * Test that we don't get a kernel warning when we call KVM_RUN after a
9 -  * triple fault occurs. To get the triple fault to occur we call KVM_RUN
10 -  * on a VCPU that hasn't been properly setup.
11 -  *
12 -  */
13 - 
14 - #define _GNU_SOURCE
15 - #include <fcntl.h>
16 - #include <kvm_util.h>
17 - #include <linux/kvm.h>
18 - #include <processor.h>
19 - #include <pthread.h>
20 - #include <stdio.h>
21 - #include <stdlib.h>
22 - #include <string.h>
23 - #include <sys/ioctl.h>
24 - #include <sys/mman.h>
25 - #include <sys/stat.h>
26 - #include <sys/types.h>
27 - #include <sys/wait.h>
28 - #include <test_util.h>
29 - #include <unistd.h>
30 - 
31 - #define NTHREAD 4
32 - #define NPROCESS 5
33 - 
34 - struct thread_context {
35 -     int kvmcpu;
36 -     struct kvm_run *run;
37 - };
38 - 
39 - void *thr(void *arg)
40 - {
41 -     struct thread_context *tc = (struct thread_context *)arg;
42 -     int res;
43 -     int kvmcpu = tc->kvmcpu;
44 -     struct kvm_run *run = tc->run;
45 - 
46 -     res = ioctl(kvmcpu, KVM_RUN, 0);
47 -     pr_info("ret1=%d exit_reason=%d suberror=%d\n",
48 -         res, run->exit_reason, run->internal.suberror);
49 - 
50 -     return 0;
51 - }
52 - 
53 - void test(void)
54 - {
55 -     int i, kvm, kvmvm, kvmcpu;
56 -     pthread_t th[NTHREAD];
57 -     struct kvm_run *run;
58 -     struct thread_context tc;
59 - 
60 -     kvm = open("/dev/kvm", O_RDWR);
61 -     TEST_ASSERT(kvm != -1, "failed to open /dev/kvm");
62 -     kvmvm = __kvm_ioctl(kvm, KVM_CREATE_VM, NULL);
63 -     TEST_ASSERT(kvmvm > 0, KVM_IOCTL_ERROR(KVM_CREATE_VM, kvmvm));
64 -     kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
65 -     TEST_ASSERT(kvmcpu != -1, KVM_IOCTL_ERROR(KVM_CREATE_VCPU, kvmcpu));
66 -     run = (struct kvm_run *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
67 -                     kvmcpu, 0);
68 -     tc.kvmcpu = kvmcpu;
69 -     tc.run = run;
70 -     srand(getpid());
71 -     for (i = 0; i < NTHREAD; i++) {
72 -         pthread_create(&th[i], NULL, thr, (void *)(uintptr_t)&tc);
73 -         usleep(rand() % 10000);
74 -     }
75 -     for (i = 0; i < NTHREAD; i++)
76 -         pthread_join(th[i], NULL);
77 - }
78 - 
79 - int get_warnings_count(void)
80 - {
81 -     int warnings;
82 -     FILE *f;
83 - 
84 -     f = popen("dmesg | grep \"WARNING:\" | wc -l", "r");
85 -     if (fscanf(f, "%d", &warnings) < 1)
86 -         warnings = 0;
87 -     pclose(f);
88 - 
89 -     return warnings;
90 - }
91 - 
92 - int main(void)
93 - {
94 -     int warnings_before, warnings_after;
95 - 
96 -     TEST_REQUIRE(host_cpu_is_intel);
97 - 
98 -     TEST_REQUIRE(!vm_is_unrestricted_guest(NULL));
99 - 
100 -     warnings_before = get_warnings_count();
101 - 
102 -     for (int i = 0; i < NPROCESS; ++i) {
103 -         int status;
104 -         int pid = fork();
105 - 
106 -         if (pid < 0)
107 -             exit(1);
108 -         if (pid == 0) {
109 -             test();
110 -             exit(0);
111 -         }
112 -         while (waitpid(pid, &status, __WALL) != pid)
113 -             ;
114 -     }
115 - 
116 -     warnings_after = get_warnings_count();
117 -     TEST_ASSERT(warnings_before == warnings_after,
118 -         "Warnings found in kernel. Run 'dmesg' to inspect them.");
119 - 
120 -     return 0;
121 - }
+4 -2
tools/testing/selftests/kvm/x86_64/monitor_mwait_test.c
··· 
27 27     \
28 28     if (fault_wanted)                        \
29 29         __GUEST_ASSERT((vector) == UD_VECTOR,            \
30 -             "Expected #UD on " insn " for testcase '0x%x', got '0x%x'", vector); \
30 +             "Expected #UD on " insn " for testcase '0x%x', got '0x%x'", \
31 +             testcase, vector);                \
31 32     else                                \
32 33         __GUEST_ASSERT(!(vector),                \
33 -             "Expected success on " insn " for testcase '0x%x', got '0x%x'", vector); \
34 +             "Expected success on " insn " for testcase '0x%x', got '0x%x'", \
35 +             testcase, vector);                \
34 36 } while (0)
35 37 
36 38 static void guest_monitor_wait(int testcase)
+482
tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2022, Google LLC. 4 + */ 5 + #define _GNU_SOURCE /* for program_invocation_short_name */ 6 + #include <fcntl.h> 7 + #include <limits.h> 8 + #include <pthread.h> 9 + #include <sched.h> 10 + #include <signal.h> 11 + #include <stdio.h> 12 + #include <stdlib.h> 13 + #include <string.h> 14 + #include <sys/ioctl.h> 15 + 16 + #include <linux/compiler.h> 17 + #include <linux/kernel.h> 18 + #include <linux/kvm_para.h> 19 + #include <linux/memfd.h> 20 + #include <linux/sizes.h> 21 + 22 + #include <test_util.h> 23 + #include <kvm_util.h> 24 + #include <processor.h> 25 + 26 + #define BASE_DATA_SLOT 10 27 + #define BASE_DATA_GPA ((uint64_t)(1ull << 32)) 28 + #define PER_CPU_DATA_SIZE ((uint64_t)(SZ_2M + PAGE_SIZE)) 29 + 30 + /* Horrific macro so that the line info is captured accurately :-( */ 31 + #define memcmp_g(gpa, pattern, size) \ 32 + do { \ 33 + uint8_t *mem = (uint8_t *)gpa; \ 34 + size_t i; \ 35 + \ 36 + for (i = 0; i < size; i++) \ 37 + __GUEST_ASSERT(mem[i] == pattern, \ 38 + "Guest expected 0x%x at offset %lu (gpa 0x%lx), got 0x%x", \ 39 + pattern, i, gpa + i, mem[i]); \ 40 + } while (0) 41 + 42 + static void memcmp_h(uint8_t *mem, uint64_t gpa, uint8_t pattern, size_t size) 43 + { 44 + size_t i; 45 + 46 + for (i = 0; i < size; i++) 47 + TEST_ASSERT(mem[i] == pattern, 48 + "Host expected 0x%x at gpa 0x%lx, got 0x%x", 49 + pattern, gpa + i, mem[i]); 50 + } 51 + 52 + /* 53 + * Run memory conversion tests with explicit conversion: 54 + * Execute KVM hypercall to map/unmap gpa range which will cause userspace exit 55 + * to back/unback private memory. Subsequent accesses by guest to the gpa range 56 + * will not cause exit to userspace. 57 + * 58 + * Test memory conversion scenarios with following steps: 59 + * 1) Access private memory using private access and verify that memory contents 60 + * are not visible to userspace. 
61 + * 2) Convert memory to shared using explicit conversions and ensure that 62 + * userspace is able to access the shared regions. 63 + * 3) Convert memory back to private using explicit conversions and ensure that 64 + * userspace is again not able to access converted private regions. 65 + */ 66 + 67 + #define GUEST_STAGE(o, s) { .offset = o, .size = s } 68 + 69 + enum ucall_syncs { 70 + SYNC_SHARED, 71 + SYNC_PRIVATE, 72 + }; 73 + 74 + static void guest_sync_shared(uint64_t gpa, uint64_t size, 75 + uint8_t current_pattern, uint8_t new_pattern) 76 + { 77 + GUEST_SYNC5(SYNC_SHARED, gpa, size, current_pattern, new_pattern); 78 + } 79 + 80 + static void guest_sync_private(uint64_t gpa, uint64_t size, uint8_t pattern) 81 + { 82 + GUEST_SYNC4(SYNC_PRIVATE, gpa, size, pattern); 83 + } 84 + 85 + /* Arbitrary values, KVM doesn't care about the attribute flags. */ 86 + #define MAP_GPA_SET_ATTRIBUTES BIT(0) 87 + #define MAP_GPA_SHARED BIT(1) 88 + #define MAP_GPA_DO_FALLOCATE BIT(2) 89 + 90 + static void guest_map_mem(uint64_t gpa, uint64_t size, bool map_shared, 91 + bool do_fallocate) 92 + { 93 + uint64_t flags = MAP_GPA_SET_ATTRIBUTES; 94 + 95 + if (map_shared) 96 + flags |= MAP_GPA_SHARED; 97 + if (do_fallocate) 98 + flags |= MAP_GPA_DO_FALLOCATE; 99 + kvm_hypercall_map_gpa_range(gpa, size, flags); 100 + } 101 + 102 + static void guest_map_shared(uint64_t gpa, uint64_t size, bool do_fallocate) 103 + { 104 + guest_map_mem(gpa, size, true, do_fallocate); 105 + } 106 + 107 + static void guest_map_private(uint64_t gpa, uint64_t size, bool do_fallocate) 108 + { 109 + guest_map_mem(gpa, size, false, do_fallocate); 110 + } 111 + 112 + struct { 113 + uint64_t offset; 114 + uint64_t size; 115 + } static const test_ranges[] = { 116 + GUEST_STAGE(0, PAGE_SIZE), 117 + GUEST_STAGE(0, SZ_2M), 118 + GUEST_STAGE(PAGE_SIZE, PAGE_SIZE), 119 + GUEST_STAGE(PAGE_SIZE, SZ_2M), 120 + GUEST_STAGE(SZ_2M, PAGE_SIZE), 121 + }; 122 + 123 + static void guest_test_explicit_conversion(uint64_t 
base_gpa, bool do_fallocate) 124 + { 125 + const uint8_t def_p = 0xaa; 126 + const uint8_t init_p = 0xcc; 127 + uint64_t j; 128 + int i; 129 + 130 + /* Memory should be shared by default. */ 131 + memset((void *)base_gpa, def_p, PER_CPU_DATA_SIZE); 132 + memcmp_g(base_gpa, def_p, PER_CPU_DATA_SIZE); 133 + guest_sync_shared(base_gpa, PER_CPU_DATA_SIZE, def_p, init_p); 134 + 135 + memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE); 136 + 137 + for (i = 0; i < ARRAY_SIZE(test_ranges); i++) { 138 + uint64_t gpa = base_gpa + test_ranges[i].offset; 139 + uint64_t size = test_ranges[i].size; 140 + uint8_t p1 = 0x11; 141 + uint8_t p2 = 0x22; 142 + uint8_t p3 = 0x33; 143 + uint8_t p4 = 0x44; 144 + 145 + /* 146 + * Set the test region to pattern one to differentiate it from 147 + * the data range as a whole (contains the initial pattern). 148 + */ 149 + memset((void *)gpa, p1, size); 150 + 151 + /* 152 + * Convert to private, set and verify the private data, and 153 + * then verify that the rest of the data (map shared) still 154 + * holds the initial pattern, and that the host always sees the 155 + * shared memory (initial pattern). Unlike shared memory, 156 + * punching a hole in private memory is destructive, i.e. 157 + * previous values aren't guaranteed to be preserved. 158 + */ 159 + guest_map_private(gpa, size, do_fallocate); 160 + 161 + if (size > PAGE_SIZE) { 162 + memset((void *)gpa, p2, PAGE_SIZE); 163 + goto skip; 164 + } 165 + 166 + memset((void *)gpa, p2, size); 167 + guest_sync_private(gpa, size, p1); 168 + 169 + /* 170 + * Verify that the private memory was set to pattern two, and 171 + * that shared memory still holds the initial pattern. 
172 + */ 173 + memcmp_g(gpa, p2, size); 174 + if (gpa > base_gpa) 175 + memcmp_g(base_gpa, init_p, gpa - base_gpa); 176 + if (gpa + size < base_gpa + PER_CPU_DATA_SIZE) 177 + memcmp_g(gpa + size, init_p, 178 + (base_gpa + PER_CPU_DATA_SIZE) - (gpa + size)); 179 + 180 + /* 181 + * Convert odd-number page frames back to shared to verify KVM 182 + * also correctly handles holes in private ranges. 183 + */ 184 + for (j = 0; j < size; j += PAGE_SIZE) { 185 + if ((j >> PAGE_SHIFT) & 1) { 186 + guest_map_shared(gpa + j, PAGE_SIZE, do_fallocate); 187 + guest_sync_shared(gpa + j, PAGE_SIZE, p1, p3); 188 + 189 + memcmp_g(gpa + j, p3, PAGE_SIZE); 190 + } else { 191 + guest_sync_private(gpa + j, PAGE_SIZE, p1); 192 + } 193 + } 194 + 195 + skip: 196 + /* 197 + * Convert the entire region back to shared, explicitly write 198 + * pattern three to fill in the even-number frames before 199 + * asking the host to verify (and write pattern four). 200 + */ 201 + guest_map_shared(gpa, size, do_fallocate); 202 + memset((void *)gpa, p3, size); 203 + guest_sync_shared(gpa, size, p3, p4); 204 + memcmp_g(gpa, p4, size); 205 + 206 + /* Reset the shared memory back to the initial pattern. */ 207 + memset((void *)gpa, init_p, size); 208 + 209 + /* 210 + * Free (via PUNCH_HOLE) *all* private memory so that the next 211 + * iteration starts from a clean slate, e.g. with respect to 212 + * whether or not there are pages/folios in guest_mem. 213 + */ 214 + guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true); 215 + } 216 + } 217 + 218 + static void guest_punch_hole(uint64_t gpa, uint64_t size) 219 + { 220 + /* "Mapping" memory shared via fallocate() is done via PUNCH_HOLE. */ 221 + uint64_t flags = MAP_GPA_SHARED | MAP_GPA_DO_FALLOCATE; 222 + 223 + kvm_hypercall_map_gpa_range(gpa, size, flags); 224 + } 225 + 226 + /* 227 + * Test that PUNCH_HOLE actually frees memory by punching holes without doing a 228 + * proper conversion. 
Freeing (PUNCH_HOLE) should zap SPTEs, and reallocating 229 + * (subsequent fault) should zero memory. 230 + */ 231 + static void guest_test_punch_hole(uint64_t base_gpa, bool precise) 232 + { 233 + const uint8_t init_p = 0xcc; 234 + int i; 235 + 236 + /* 237 + * Convert the entire range to private, this testcase is all about 238 + * punching holes in guest_memfd, i.e. shared mappings aren't needed. 239 + */ 240 + guest_map_private(base_gpa, PER_CPU_DATA_SIZE, false); 241 + 242 + for (i = 0; i < ARRAY_SIZE(test_ranges); i++) { 243 + uint64_t gpa = base_gpa + test_ranges[i].offset; 244 + uint64_t size = test_ranges[i].size; 245 + 246 + /* 247 + * Free all memory before each iteration, even for the !precise 248 + * case where the memory will be faulted back in. Freeing and 249 + * reallocating should obviously work, and freeing all memory 250 + * minimizes the probability of cross-testcase influence. 251 + */ 252 + guest_punch_hole(base_gpa, PER_CPU_DATA_SIZE); 253 + 254 + /* Fault-in and initialize memory, and verify the pattern. */ 255 + if (precise) { 256 + memset((void *)gpa, init_p, size); 257 + memcmp_g(gpa, init_p, size); 258 + } else { 259 + memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE); 260 + memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE); 261 + } 262 + 263 + /* 264 + * Punch a hole at the target range and verify that reads from 265 + * the guest succeed and return zeroes. 266 + */ 267 + guest_punch_hole(gpa, size); 268 + memcmp_g(gpa, 0, size); 269 + } 270 + } 271 + 272 + static void guest_code(uint64_t base_gpa) 273 + { 274 + /* 275 + * Run the conversion test twice, with and without doing fallocate() on 276 + * the guest_memfd backing when converting between shared and private. 277 + */ 278 + guest_test_explicit_conversion(base_gpa, false); 279 + guest_test_explicit_conversion(base_gpa, true); 280 + 281 + /* 282 + * Run the PUNCH_HOLE test twice too, once with the entire guest_memfd 283 + * faulted in, once with only the target range faulted in. 
284 + */ 285 + guest_test_punch_hole(base_gpa, false); 286 + guest_test_punch_hole(base_gpa, true); 287 + GUEST_DONE(); 288 + } 289 + 290 + static void handle_exit_hypercall(struct kvm_vcpu *vcpu) 291 + { 292 + struct kvm_run *run = vcpu->run; 293 + uint64_t gpa = run->hypercall.args[0]; 294 + uint64_t size = run->hypercall.args[1] * PAGE_SIZE; 295 + bool set_attributes = run->hypercall.args[2] & MAP_GPA_SET_ATTRIBUTES; 296 + bool map_shared = run->hypercall.args[2] & MAP_GPA_SHARED; 297 + bool do_fallocate = run->hypercall.args[2] & MAP_GPA_DO_FALLOCATE; 298 + struct kvm_vm *vm = vcpu->vm; 299 + 300 + TEST_ASSERT(run->hypercall.nr == KVM_HC_MAP_GPA_RANGE, 301 + "Wanted MAP_GPA_RANGE (%u), got '%llu'", 302 + KVM_HC_MAP_GPA_RANGE, run->hypercall.nr); 303 + 304 + if (do_fallocate) 305 + vm_guest_mem_fallocate(vm, gpa, size, map_shared); 306 + 307 + if (set_attributes) 308 + vm_set_memory_attributes(vm, gpa, size, 309 + map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE); 310 + run->hypercall.ret = 0; 311 + } 312 + 313 + static bool run_vcpus; 314 + 315 + static void *__test_mem_conversions(void *__vcpu) 316 + { 317 + struct kvm_vcpu *vcpu = __vcpu; 318 + struct kvm_run *run = vcpu->run; 319 + struct kvm_vm *vm = vcpu->vm; 320 + struct ucall uc; 321 + 322 + while (!READ_ONCE(run_vcpus)) 323 + ; 324 + 325 + for ( ;; ) { 326 + vcpu_run(vcpu); 327 + 328 + if (run->exit_reason == KVM_EXIT_HYPERCALL) { 329 + handle_exit_hypercall(vcpu); 330 + continue; 331 + } 332 + 333 + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO, 334 + "Wanted KVM_EXIT_IO, got exit reason: %u (%s)", 335 + run->exit_reason, exit_reason_str(run->exit_reason)); 336 + 337 + switch (get_ucall(vcpu, &uc)) { 338 + case UCALL_ABORT: 339 + REPORT_GUEST_ASSERT(uc); 340 + case UCALL_SYNC: { 341 + uint64_t gpa = uc.args[1]; 342 + size_t size = uc.args[2]; 343 + size_t i; 344 + 345 + TEST_ASSERT(uc.args[0] == SYNC_SHARED || 346 + uc.args[0] == SYNC_PRIVATE, 347 + "Unknown sync command '%ld'", uc.args[0]); 348 + 349 + for 
(i = 0; i < size; i += vm->page_size) { 350 + size_t nr_bytes = min_t(size_t, vm->page_size, size - i); 351 + uint8_t *hva = addr_gpa2hva(vm, gpa + i); 352 + 353 + /* In all cases, the host should observe the shared data. */ 354 + memcmp_h(hva, gpa + i, uc.args[3], nr_bytes); 355 + 356 + /* For shared, write the new pattern to guest memory. */ 357 + if (uc.args[0] == SYNC_SHARED) 358 + memset(hva, uc.args[4], nr_bytes); 359 + } 360 + break; 361 + } 362 + case UCALL_DONE: 363 + return NULL; 364 + default: 365 + TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd); 366 + } 367 + } 368 + } 369 + 370 + static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus, 371 + uint32_t nr_memslots) 372 + { 373 + /* 374 + * Allocate enough memory so that each vCPU's chunk of memory can be 375 + * naturally aligned with respect to the size of the backing store. 376 + */ 377 + const size_t alignment = max_t(size_t, SZ_2M, get_backing_src_pagesz(src_type)); 378 + const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment); 379 + const size_t memfd_size = per_cpu_size * nr_vcpus; 380 + const size_t slot_size = memfd_size / nr_memslots; 381 + struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; 382 + pthread_t threads[KVM_MAX_VCPUS]; 383 + struct kvm_vm *vm; 384 + int memfd, i, r; 385 + 386 + const struct vm_shape shape = { 387 + .mode = VM_MODE_DEFAULT, 388 + .type = KVM_X86_SW_PROTECTED_VM, 389 + }; 390 + 391 + TEST_ASSERT(slot_size * nr_memslots == memfd_size, 392 + "The memfd size (0x%lx) needs to be cleanly divisible by the number of memslots (%u)", 393 + memfd_size, nr_memslots); 394 + vm = __vm_create_with_vcpus(shape, nr_vcpus, 0, guest_code, vcpus); 395 + 396 + vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE)); 397 + 398 + memfd = vm_create_guest_memfd(vm, memfd_size, 0); 399 + 400 + for (i = 0; i < nr_memslots; i++) 401 + vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i, 402 + BASE_DATA_SLOT + i, slot_size / vm->page_size, 403 + 
KVM_MEM_GUEST_MEMFD, memfd, slot_size * i); 404 + 405 + for (i = 0; i < nr_vcpus; i++) { 406 + uint64_t gpa = BASE_DATA_GPA + i * per_cpu_size; 407 + 408 + vcpu_args_set(vcpus[i], 1, gpa); 409 + 410 + /* 411 + * Map only what is needed so that an out-of-bounds access 412 + * results #PF => SHUTDOWN instead of data corruption. 413 + */ 414 + virt_map(vm, gpa, gpa, PER_CPU_DATA_SIZE / vm->page_size); 415 + 416 + pthread_create(&threads[i], NULL, __test_mem_conversions, vcpus[i]); 417 + } 418 + 419 + WRITE_ONCE(run_vcpus, true); 420 + 421 + for (i = 0; i < nr_vcpus; i++) 422 + pthread_join(threads[i], NULL); 423 + 424 + kvm_vm_free(vm); 425 + 426 + /* 427 + * Allocate and free memory from the guest_memfd after closing the VM 428 + * fd. The guest_memfd is gifted a reference to its owning VM, i.e. 429 + * should prevent the VM from being fully destroyed until the last 430 + * reference to the guest_memfd is also put. 431 + */ 432 + r = fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, memfd_size); 433 + TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r)); 434 + 435 + r = fallocate(memfd, FALLOC_FL_KEEP_SIZE, 0, memfd_size); 436 + TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r)); 437 + } 438 + 439 + static void usage(const char *cmd) 440 + { 441 + puts(""); 442 + printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd); 443 + puts(""); 444 + backing_src_help("-s"); 445 + puts(""); 446 + puts(" -n: specify the number of vcpus (default: 1)"); 447 + puts(""); 448 + puts(" -m: specify the number of memslots (default: 1)"); 449 + puts(""); 450 + } 451 + 452 + int main(int argc, char *argv[]) 453 + { 454 + enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC; 455 + uint32_t nr_memslots = 1; 456 + uint32_t nr_vcpus = 1; 457 + int opt; 458 + 459 + TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM)); 460 + 461 + while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) { 462 + switch (opt) { 463 + case 's': 464 
+ src_type = parse_backing_src_type(optarg); 465 + break; 466 + case 'n': 467 + nr_vcpus = atoi_positive("nr_vcpus", optarg); 468 + break; 469 + case 'm': 470 + nr_memslots = atoi_positive("nr_memslots", optarg); 471 + break; 472 + case 'h': 473 + default: 474 + usage(argv[0]); 475 + exit(0); 476 + } 477 + } 478 + 479 + test_mem_conversions(src_type, nr_vcpus, nr_memslots); 480 + 481 + return 0; 482 + }
+120
tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
··· 
1 + // SPDX-License-Identifier: GPL-2.0-only
2 + /*
3 +  * Copyright (C) 2023, Google LLC.
4 +  */
5 + #include <linux/kvm.h>
6 + #include <pthread.h>
7 + #include <stdint.h>
8 + 
9 + #include "kvm_util.h"
10 + #include "processor.h"
11 + #include "test_util.h"
12 + 
13 + /* Arbitrarily selected to avoid overlaps with anything else */
14 + #define EXITS_TEST_GVA 0xc0000000
15 + #define EXITS_TEST_GPA EXITS_TEST_GVA
16 + #define EXITS_TEST_NPAGES 1
17 + #define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
18 + #define EXITS_TEST_SLOT 10
19 + 
20 + static uint64_t guest_repeatedly_read(void)
21 + {
22 +     volatile uint64_t value;
23 + 
24 +     while (true)
25 +         value = *((uint64_t *) EXITS_TEST_GVA);
26 + 
27 +     return value;
28 + }
29 + 
30 + static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
31 + {
32 +     int r;
33 + 
34 +     r = _vcpu_run(vcpu);
35 +     if (r) {
36 +         TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
37 +         TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
38 +     }
39 +     return vcpu->run->exit_reason;
40 + }
41 + 
42 + const struct vm_shape protected_vm_shape = {
43 +     .mode = VM_MODE_DEFAULT,
44 +     .type = KVM_X86_SW_PROTECTED_VM,
45 + };
46 + 
47 + static void test_private_access_memslot_deleted(void)
48 + {
49 +     struct kvm_vm *vm;
50 +     struct kvm_vcpu *vcpu;
51 +     pthread_t vm_thread;
52 +     void *thread_return;
53 +     uint32_t exit_reason;
54 + 
55 +     vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
56 +                        guest_repeatedly_read);
57 + 
58 +     vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
59 +                     EXITS_TEST_GPA, EXITS_TEST_SLOT,
60 +                     EXITS_TEST_NPAGES,
61 +                     KVM_MEM_GUEST_MEMFD);
62 + 
63 +     virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
64 + 
65 +     /* Request to access page privately */
66 +     vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
67 + 
68 +     pthread_create(&vm_thread, NULL,
69 +                (void *(*)(void *))run_vcpu_get_exit_reason,
70 +                (void *)vcpu);
71 + 
72 +     vm_mem_region_delete(vm, EXITS_TEST_SLOT);
73 + 
74 +     pthread_join(vm_thread, &thread_return);
75 +     exit_reason = (uint32_t)(uint64_t)thread_return;
76 + 
77 +     TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
78 +     TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
79 +     TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
80 +     TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
81 + 
82 +     kvm_vm_free(vm);
83 + }
84 + 
85 + static void test_private_access_memslot_not_private(void)
86 + {
87 +     struct kvm_vm *vm;
88 +     struct kvm_vcpu *vcpu;
89 +     uint32_t exit_reason;
90 + 
91 +     vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
92 +                        guest_repeatedly_read);
93 + 
94 +     /* Add a non-private memslot (flags = 0) */
95 +     vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
96 +                     EXITS_TEST_GPA, EXITS_TEST_SLOT,
97 +                     EXITS_TEST_NPAGES, 0);
98 + 
99 +     virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
100 + 
101 +     /* Request to access page privately */
102 +     vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
103 + 
104 +     exit_reason = run_vcpu_get_exit_reason(vcpu);
105 + 
106 +     TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
107 +     TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
108 +     TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
109 +     TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
110 + 
111 +     kvm_vm_free(vm);
112 + }
113 + 
114 + int main(int argc, char *argv[])
115 + {
116 +     TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
117 + 
118 +     test_private_access_memslot_deleted();
119 +     test_private_access_memslot_not_private();
120 + }
+2 -2
tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
··· 
103 103 
104 104     run_guest(vmcb, svm->vmcb_gpa);
105 105     __GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL,
106 -                "Expected VMMCAL #VMEXIT, got '0x%x', info1 = '0x%llx, info2 = '0x%llx'",
106 +                "Expected VMMCAL #VMEXIT, got '0x%x', info1 = '0x%lx, info2 = '0x%lx'",
107 107                vmcb->control.exit_code,
108 108                vmcb->control.exit_info_1, vmcb->control.exit_info_2);
109 109 
··· 
133 133 
134 134     run_guest(vmcb, svm->vmcb_gpa);
135 135     __GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_HLT,
136 -                "Expected HLT #VMEXIT, got '0x%x', info1 = '0x%llx, info2 = '0x%llx'",
136 +                "Expected HLT #VMEXIT, got '0x%x', info1 = '0x%lx, info2 = '0x%lx'",
137 137                vmcb->control.exit_code,
138 138                vmcb->control.exit_info_1, vmcb->control.exit_info_2);
139 139 
+1 -1
tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
··· 
271 271 
272 272     kvm_check_cap(KVM_CAP_MCE);
273 273 
274 -     vm = __vm_create(VM_MODE_DEFAULT, 3, 0);
274 +     vm = __vm_create(VM_SHAPE_DEFAULT, 3, 0);
275 275 
276 276     kvm_ioctl(vm->kvm_fd, KVM_X86_GET_MCE_CAP_SUPPORTED,
277 277           &supported_mcg_caps);
+1 -1
tools/testing/selftests/kvm/x86_64/vmx_pmu_caps_test.c
··· 
56 56     uint8_t vector = wrmsr_safe(MSR_IA32_PERF_CAPABILITIES, val);
57 57 
58 58     __GUEST_ASSERT(vector == GP_VECTOR,
59 -                "Expected #GP for value '0x%llx', got vector '0x%x'",
59 +                "Expected #GP for value '0x%lx', got vector '0x%x'",
60 60                val, vector);
61 61 }
62 62 
+10 -6
tools/testing/selftests/kvm/x86_64/vmx_set_nested_state_test.c
··· 
125 125 
126 126     /*
127 127      * Setting vmxon_pa == -1ull and vmcs_pa == -1ull exits early without
128 -      * setting the nested state but flags other than eVMCS must be clear.
129 -      * The eVMCS flag can be set if the enlightened VMCS capability has
130 -      * been enabled.
128 +      * setting the nested state. When the eVMCS flag is not set, the
129 +      * expected return value is '0'.
131 130      */
132 131     set_default_vmx_state(state, state_sz);
132 +     state->flags = 0;
133 133     state->hdr.vmx.vmxon_pa = -1ull;
134 134     state->hdr.vmx.vmcs12_pa = -1ull;
135 -     test_nested_state_expect_einval(vcpu, state);
135 +     test_nested_state(vcpu, state);
136 136 
137 -     state->flags &= KVM_STATE_NESTED_EVMCS;
137 +     /*
138 +      * When eVMCS is supported, the eVMCS flag can only be set if the
139 +      * enlightened VMCS capability has been enabled.
140 +      */
138 141     if (have_evmcs) {
142 +         state->flags = KVM_STATE_NESTED_EVMCS;
139 143         test_nested_state_expect_einval(vcpu, state);
140 144         vcpu_enable_evmcs(vcpu);
145 +         test_nested_state(vcpu, state);
141 146     }
142 -     test_nested_state(vcpu, state);
143 147 
144 148     /* It is invalid to have vmxon_pa == -1ull and SMM flags non-zero. */
145 149     state->hdr.vmx.smm.flags = 1;
+4 -4
tools/testing/selftests/kvm/x86_64/xcr0_cpuid_test.c
··· 
25 25     \
26 26     __GUEST_ASSERT((__supported & (xfeatures)) != (xfeatures) ||    \
27 27                __supported == ((xfeatures) | (dependencies)),    \
28 -                "supported = 0x%llx, xfeatures = 0x%llx, dependencies = 0x%llx", \
28 +                "supported = 0x%lx, xfeatures = 0x%llx, dependencies = 0x%llx", \
29 29                __supported, (xfeatures), (dependencies));    \
30 30 } while (0)
31 31 
··· 
42 42     uint64_t __supported = (supported_xcr0) & (xfeatures);    \
43 43     \
44 44     __GUEST_ASSERT(!__supported || __supported == (xfeatures),    \
45 -                "supported = 0x%llx, xfeatures = 0x%llx",    \
45 +                "supported = 0x%lx, xfeatures = 0x%llx",    \
46 46                __supported, (xfeatures));    \
47 47 } while (0)
48 48 
··· 
81 81 
82 82     vector = xsetbv_safe(0, supported_xcr0);
83 83     __GUEST_ASSERT(!vector,
84 -                "Expected success on XSETBV(0x%llx), got vector '0x%x'",
84 +                "Expected success on XSETBV(0x%lx), got vector '0x%x'",
85 85                supported_xcr0, vector);
86 86 
87 87     for (i = 0; i < 64; i++) {
··· 
90 90 
91 91         vector = xsetbv_safe(0, supported_xcr0 | BIT_ULL(i));
92 92         __GUEST_ASSERT(vector == GP_VECTOR,
93 -                    "Expected #GP on XSETBV(0x%llx), supported XCR0 = %llx, got vector '0x%x'",
93 +                    "Expected #GP on XSETBV(0x%llx), supported XCR0 = %lx, got vector '0x%x'",
94 94                    BIT_ULL(i), supported_xcr0, vector);
95 95     }
96 96 
+23 -7
virt/kvm/Kconfig
··· 
4 4 config HAVE_KVM
5 5     bool
6 6 
7 + config KVM_COMMON
8 +     bool
9 +     select EVENTFD
10 +     select INTERVAL_TREE
11 +     select PREEMPT_NOTIFIERS
12 + 
7 13 config HAVE_KVM_PFNCACHE
8 14     bool
9 15 
10 16 config HAVE_KVM_IRQCHIP
11 -     bool
12 - 
13 - config HAVE_KVM_IRQFD
14 17     bool
15 18 
16 19 config HAVE_KVM_IRQ_ROUTING
··· 
41 38 config NEED_KVM_DIRTY_RING_WITH_BITMAP
42 39     bool
43 40     depends on HAVE_KVM_DIRTY_RING
44 - 
45 - config HAVE_KVM_EVENTFD
46 -     bool
47 -     select EVENTFD
48 41 
49 42 config KVM_MMIO
50 43     bool
··· 
90 91     bool
91 92 
92 93 config KVM_GENERIC_HARDWARE_ENABLING
94 +     bool
95 + 
96 + config KVM_GENERIC_MMU_NOTIFIER
97 +     select MMU_NOTIFIER
98 +     bool
99 + 
100 + config KVM_GENERIC_MEMORY_ATTRIBUTES
101 +     depends on KVM_GENERIC_MMU_NOTIFIER
102 +     bool
103 + 
104 + config KVM_PRIVATE_MEM
105 +     select XARRAY_MULTI
106 +     bool
107 + 
108 + config KVM_GENERIC_PRIVATE_MEM
109 +     select KVM_GENERIC_MEMORY_ATTRIBUTES
110 +     select KVM_PRIVATE_MEM
93 111     bool
+1
virt/kvm/Makefile.kvm
··· 
12 12 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
13 13 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
14 14 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
15 + kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
+1 -1
virt/kvm/dirty_ring.c
··· 
58 58     as_id = slot >> 16;
59 59     id = (u16)slot;
60 60 
61 -     if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
61 +     if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
62 62         return;
63 63 
64 64     memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+13 -15
virt/kvm/eventfd.c
··· 28 28 29 29 #include <kvm/iodev.h> 30 30 31 - #ifdef CONFIG_HAVE_KVM_IRQFD 31 + #ifdef CONFIG_HAVE_KVM_IRQCHIP 32 32 33 33 static struct workqueue_struct *irqfd_cleanup_wq; 34 34 ··· 526 526 synchronize_srcu(&kvm->irq_srcu); 527 527 kvm_arch_post_irq_ack_notifier_list_update(kvm); 528 528 } 529 - #endif 530 529 531 - void 532 - kvm_eventfd_init(struct kvm *kvm) 533 - { 534 - #ifdef CONFIG_HAVE_KVM_IRQFD 535 - spin_lock_init(&kvm->irqfds.lock); 536 - INIT_LIST_HEAD(&kvm->irqfds.items); 537 - INIT_LIST_HEAD(&kvm->irqfds.resampler_list); 538 - mutex_init(&kvm->irqfds.resampler_lock); 539 - #endif 540 - INIT_LIST_HEAD(&kvm->ioeventfds); 541 - } 542 - 543 - #ifdef CONFIG_HAVE_KVM_IRQFD 544 530 /* 545 531 * shutdown any irqfd's that match fd+gsi 546 532 */ ··· 997 1011 return kvm_deassign_ioeventfd(kvm, args); 998 1012 999 1013 return kvm_assign_ioeventfd(kvm, args); 1014 + } 1015 + 1016 + void 1017 + kvm_eventfd_init(struct kvm *kvm) 1018 + { 1019 + #ifdef CONFIG_HAVE_KVM_IRQCHIP 1020 + spin_lock_init(&kvm->irqfds.lock); 1021 + INIT_LIST_HEAD(&kvm->irqfds.items); 1022 + INIT_LIST_HEAD(&kvm->irqfds.resampler_list); 1023 + mutex_init(&kvm->irqfds.resampler_lock); 1024 + #endif 1025 + INIT_LIST_HEAD(&kvm->ioeventfds); 1000 1026 }
+532
virt/kvm/guest_memfd.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <linux/backing-dev.h> 3 + #include <linux/falloc.h> 4 + #include <linux/kvm_host.h> 5 + #include <linux/pagemap.h> 6 + #include <linux/anon_inodes.h> 7 + 8 + #include "kvm_mm.h" 9 + 10 + struct kvm_gmem { 11 + struct kvm *kvm; 12 + struct xarray bindings; 13 + struct list_head entry; 14 + }; 15 + 16 + static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index) 17 + { 18 + struct folio *folio; 19 + 20 + /* TODO: Support huge pages. */ 21 + folio = filemap_grab_folio(inode->i_mapping, index); 22 + if (IS_ERR_OR_NULL(folio)) 23 + return NULL; 24 + 25 + /* 26 + * Use the up-to-date flag to track whether or not the memory has been 27 + * zeroed before being handed off to the guest. There is no backing 28 + * storage for the memory, so the folio will remain up-to-date until 29 + * it's removed. 30 + * 31 + * TODO: Skip clearing pages when trusted firmware will do it when 32 + * assigning memory to the guest. 33 + */ 34 + if (!folio_test_uptodate(folio)) { 35 + unsigned long nr_pages = folio_nr_pages(folio); 36 + unsigned long i; 37 + 38 + for (i = 0; i < nr_pages; i++) 39 + clear_highpage(folio_page(folio, i)); 40 + 41 + folio_mark_uptodate(folio); 42 + } 43 + 44 + /* 45 + * Ignore accessed, referenced, and dirty flags. The memory is 46 + * unevictable and there is no storage to write back to. 
47 + */ 48 + return folio; 49 + } 50 + 51 + static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start, 52 + pgoff_t end) 53 + { 54 + bool flush = false, found_memslot = false; 55 + struct kvm_memory_slot *slot; 56 + struct kvm *kvm = gmem->kvm; 57 + unsigned long index; 58 + 59 + xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) { 60 + pgoff_t pgoff = slot->gmem.pgoff; 61 + 62 + struct kvm_gfn_range gfn_range = { 63 + .start = slot->base_gfn + max(pgoff, start) - pgoff, 64 + .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff, 65 + .slot = slot, 66 + .may_block = true, 67 + }; 68 + 69 + if (!found_memslot) { 70 + found_memslot = true; 71 + 72 + KVM_MMU_LOCK(kvm); 73 + kvm_mmu_invalidate_begin(kvm); 74 + } 75 + 76 + flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range); 77 + } 78 + 79 + if (flush) 80 + kvm_flush_remote_tlbs(kvm); 81 + 82 + if (found_memslot) 83 + KVM_MMU_UNLOCK(kvm); 84 + } 85 + 86 + static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start, 87 + pgoff_t end) 88 + { 89 + struct kvm *kvm = gmem->kvm; 90 + 91 + if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) { 92 + KVM_MMU_LOCK(kvm); 93 + kvm_mmu_invalidate_end(kvm); 94 + KVM_MMU_UNLOCK(kvm); 95 + } 96 + } 97 + 98 + static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) 99 + { 100 + struct list_head *gmem_list = &inode->i_mapping->i_private_list; 101 + pgoff_t start = offset >> PAGE_SHIFT; 102 + pgoff_t end = (offset + len) >> PAGE_SHIFT; 103 + struct kvm_gmem *gmem; 104 + 105 + /* 106 + * Bindings must be stable across invalidation to ensure the start+end 107 + * are balanced. 
108 + */ 109 + filemap_invalidate_lock(inode->i_mapping); 110 + 111 + list_for_each_entry(gmem, gmem_list, entry) 112 + kvm_gmem_invalidate_begin(gmem, start, end); 113 + 114 + truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1); 115 + 116 + list_for_each_entry(gmem, gmem_list, entry) 117 + kvm_gmem_invalidate_end(gmem, start, end); 118 + 119 + filemap_invalidate_unlock(inode->i_mapping); 120 + 121 + return 0; 122 + } 123 + 124 + static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len) 125 + { 126 + struct address_space *mapping = inode->i_mapping; 127 + pgoff_t start, index, end; 128 + int r; 129 + 130 + /* Dedicated guest is immutable by default. */ 131 + if (offset + len > i_size_read(inode)) 132 + return -EINVAL; 133 + 134 + filemap_invalidate_lock_shared(mapping); 135 + 136 + start = offset >> PAGE_SHIFT; 137 + end = (offset + len) >> PAGE_SHIFT; 138 + 139 + r = 0; 140 + for (index = start; index < end; ) { 141 + struct folio *folio; 142 + 143 + if (signal_pending(current)) { 144 + r = -EINTR; 145 + break; 146 + } 147 + 148 + folio = kvm_gmem_get_folio(inode, index); 149 + if (!folio) { 150 + r = -ENOMEM; 151 + break; 152 + } 153 + 154 + index = folio_next_index(folio); 155 + 156 + folio_unlock(folio); 157 + folio_put(folio); 158 + 159 + /* 64-bit only, wrapping the index should be impossible. 
*/ 160 + if (WARN_ON_ONCE(!index)) 161 + break; 162 + 163 + cond_resched(); 164 + } 165 + 166 + filemap_invalidate_unlock_shared(mapping); 167 + 168 + return r; 169 + } 170 + 171 + static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset, 172 + loff_t len) 173 + { 174 + int ret; 175 + 176 + if (!(mode & FALLOC_FL_KEEP_SIZE)) 177 + return -EOPNOTSUPP; 178 + 179 + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) 180 + return -EOPNOTSUPP; 181 + 182 + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) 183 + return -EINVAL; 184 + 185 + if (mode & FALLOC_FL_PUNCH_HOLE) 186 + ret = kvm_gmem_punch_hole(file_inode(file), offset, len); 187 + else 188 + ret = kvm_gmem_allocate(file_inode(file), offset, len); 189 + 190 + if (!ret) 191 + file_modified(file); 192 + return ret; 193 + } 194 + 195 + static int kvm_gmem_release(struct inode *inode, struct file *file) 196 + { 197 + struct kvm_gmem *gmem = file->private_data; 198 + struct kvm_memory_slot *slot; 199 + struct kvm *kvm = gmem->kvm; 200 + unsigned long index; 201 + 202 + /* 203 + * Prevent concurrent attempts to *unbind* a memslot. This is the last 204 + * reference to the file and thus no new bindings can be created, but 205 + * dereferencing the slot for existing bindings needs to be protected 206 + * against memslot updates, specifically so that unbind doesn't race 207 + * and free the memslot (kvm_gmem_get_file() will return NULL). 208 + */ 209 + mutex_lock(&kvm->slots_lock); 210 + 211 + filemap_invalidate_lock(inode->i_mapping); 212 + 213 + xa_for_each(&gmem->bindings, index, slot) 214 + rcu_assign_pointer(slot->gmem.file, NULL); 215 + 216 + synchronize_rcu(); 217 + 218 + /* 219 + * All in-flight operations are gone and new bindings can be created. 220 + * Zap all SPTEs pointed at by this file. Do not free the backing 221 + * memory, as its lifetime is associated with the inode, not the file. 
222 + */ 223 + kvm_gmem_invalidate_begin(gmem, 0, -1ul); 224 + kvm_gmem_invalidate_end(gmem, 0, -1ul); 225 + 226 + list_del(&gmem->entry); 227 + 228 + filemap_invalidate_unlock(inode->i_mapping); 229 + 230 + mutex_unlock(&kvm->slots_lock); 231 + 232 + xa_destroy(&gmem->bindings); 233 + kfree(gmem); 234 + 235 + kvm_put_kvm(kvm); 236 + 237 + return 0; 238 + } 239 + 240 + static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot) 241 + { 242 + /* 243 + * Do not return slot->gmem.file if it has already been closed; 244 + * there might be some time between the last fput() and when 245 + * kvm_gmem_release() clears slot->gmem.file, and you do not 246 + * want to spin in the meanwhile. 247 + */ 248 + return get_file_active(&slot->gmem.file); 249 + } 250 + 251 + static struct file_operations kvm_gmem_fops = { 252 + .open = generic_file_open, 253 + .release = kvm_gmem_release, 254 + .fallocate = kvm_gmem_fallocate, 255 + }; 256 + 257 + void kvm_gmem_init(struct module *module) 258 + { 259 + kvm_gmem_fops.owner = module; 260 + } 261 + 262 + static int kvm_gmem_migrate_folio(struct address_space *mapping, 263 + struct folio *dst, struct folio *src, 264 + enum migrate_mode mode) 265 + { 266 + WARN_ON_ONCE(1); 267 + return -EINVAL; 268 + } 269 + 270 + static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *folio) 271 + { 272 + struct list_head *gmem_list = &mapping->i_private_list; 273 + struct kvm_gmem *gmem; 274 + pgoff_t start, end; 275 + 276 + filemap_invalidate_lock_shared(mapping); 277 + 278 + start = folio->index; 279 + end = start + folio_nr_pages(folio); 280 + 281 + list_for_each_entry(gmem, gmem_list, entry) 282 + kvm_gmem_invalidate_begin(gmem, start, end); 283 + 284 + /* 285 + * Do not truncate the range, what action is taken in response to the 286 + * error is userspace's decision (assuming the architecture supports 287 + * gracefully handling memory errors). 
If/when the guest attempts to 288 + * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON, 289 + * at which point KVM can either terminate the VM or propagate the 290 + * error to userspace. 291 + */ 292 + 293 + list_for_each_entry(gmem, gmem_list, entry) 294 + kvm_gmem_invalidate_end(gmem, start, end); 295 + 296 + filemap_invalidate_unlock_shared(mapping); 297 + 298 + return MF_DELAYED; 299 + } 300 + 301 + static const struct address_space_operations kvm_gmem_aops = { 302 + .dirty_folio = noop_dirty_folio, 303 + .migrate_folio = kvm_gmem_migrate_folio, 304 + .error_remove_folio = kvm_gmem_error_folio, 305 + }; 306 + 307 + static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path, 308 + struct kstat *stat, u32 request_mask, 309 + unsigned int query_flags) 310 + { 311 + struct inode *inode = path->dentry->d_inode; 312 + 313 + generic_fillattr(idmap, request_mask, inode, stat); 314 + return 0; 315 + } 316 + 317 + static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry, 318 + struct iattr *attr) 319 + { 320 + return -EINVAL; 321 + } 322 + static const struct inode_operations kvm_gmem_iops = { 323 + .getattr = kvm_gmem_getattr, 324 + .setattr = kvm_gmem_setattr, 325 + }; 326 + 327 + static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) 328 + { 329 + const char *anon_name = "[kvm-gmem]"; 330 + struct kvm_gmem *gmem; 331 + struct inode *inode; 332 + struct file *file; 333 + int fd, err; 334 + 335 + fd = get_unused_fd_flags(0); 336 + if (fd < 0) 337 + return fd; 338 + 339 + gmem = kzalloc(sizeof(*gmem), GFP_KERNEL); 340 + if (!gmem) { 341 + err = -ENOMEM; 342 + goto err_fd; 343 + } 344 + 345 + file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem, 346 + O_RDWR, NULL); 347 + if (IS_ERR(file)) { 348 + err = PTR_ERR(file); 349 + goto err_gmem; 350 + } 351 + 352 + file->f_flags |= O_LARGEFILE; 353 + 354 + inode = file->f_inode; 355 + WARN_ON(file->f_mapping != inode->i_mapping); 356 + 357 + 
inode->i_private = (void *)(unsigned long)flags; 358 + inode->i_op = &kvm_gmem_iops; 359 + inode->i_mapping->a_ops = &kvm_gmem_aops; 360 + inode->i_mode |= S_IFREG; 361 + inode->i_size = size; 362 + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); 363 + mapping_set_unmovable(inode->i_mapping); 364 + /* Unmovable mappings are supposed to be marked unevictable as well. */ 365 + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); 366 + 367 + kvm_get_kvm(kvm); 368 + gmem->kvm = kvm; 369 + xa_init(&gmem->bindings); 370 + list_add(&gmem->entry, &inode->i_mapping->i_private_list); 371 + 372 + fd_install(fd, file); 373 + return fd; 374 + 375 + err_gmem: 376 + kfree(gmem); 377 + err_fd: 378 + put_unused_fd(fd); 379 + return err; 380 + } 381 + 382 + int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) 383 + { 384 + loff_t size = args->size; 385 + u64 flags = args->flags; 386 + u64 valid_flags = 0; 387 + 388 + if (flags & ~valid_flags) 389 + return -EINVAL; 390 + 391 + if (size <= 0 || !PAGE_ALIGNED(size)) 392 + return -EINVAL; 393 + 394 + return __kvm_gmem_create(kvm, size, flags); 395 + } 396 + 397 + int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot, 398 + unsigned int fd, loff_t offset) 399 + { 400 + loff_t size = slot->npages << PAGE_SHIFT; 401 + unsigned long start, end; 402 + struct kvm_gmem *gmem; 403 + struct inode *inode; 404 + struct file *file; 405 + int r = -EINVAL; 406 + 407 + BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff)); 408 + 409 + file = fget(fd); 410 + if (!file) 411 + return -EBADF; 412 + 413 + if (file->f_op != &kvm_gmem_fops) 414 + goto err; 415 + 416 + gmem = file->private_data; 417 + if (gmem->kvm != kvm) 418 + goto err; 419 + 420 + inode = file_inode(file); 421 + 422 + if (offset < 0 || !PAGE_ALIGNED(offset) || 423 + offset + size > i_size_read(inode)) 424 + goto err; 425 + 426 + filemap_invalidate_lock(inode->i_mapping); 427 + 428 + start = offset >> PAGE_SHIFT; 429 + end = start + slot->npages; 430 
+ 431 + if (!xa_empty(&gmem->bindings) && 432 + xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) { 433 + filemap_invalidate_unlock(inode->i_mapping); 434 + goto err; 435 + } 436 + 437 + /* 438 + * No synchronize_rcu() needed, any in-flight readers are guaranteed to 439 + * see either a NULL file or this new file, no need for them to go 440 + * away. 441 + */ 442 + rcu_assign_pointer(slot->gmem.file, file); 443 + slot->gmem.pgoff = start; 444 + 445 + xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL); 446 + filemap_invalidate_unlock(inode->i_mapping); 447 + 448 + /* 449 + * Drop the reference to the file, even on success. The file pins KVM, 450 + * not the other way 'round. Active bindings are invalidated if the 451 + * file is closed before memslots are destroyed. 452 + */ 453 + r = 0; 454 + err: 455 + fput(file); 456 + return r; 457 + } 458 + 459 + void kvm_gmem_unbind(struct kvm_memory_slot *slot) 460 + { 461 + unsigned long start = slot->gmem.pgoff; 462 + unsigned long end = start + slot->npages; 463 + struct kvm_gmem *gmem; 464 + struct file *file; 465 + 466 + /* 467 + * Nothing to do if the underlying file was already closed (or is being 468 + * closed right now), kvm_gmem_release() invalidates all bindings. 
469 + */ 470 + file = kvm_gmem_get_file(slot); 471 + if (!file) 472 + return; 473 + 474 + gmem = file->private_data; 475 + 476 + filemap_invalidate_lock(file->f_mapping); 477 + xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL); 478 + rcu_assign_pointer(slot->gmem.file, NULL); 479 + synchronize_rcu(); 480 + filemap_invalidate_unlock(file->f_mapping); 481 + 482 + fput(file); 483 + } 484 + 485 + int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, 486 + gfn_t gfn, kvm_pfn_t *pfn, int *max_order) 487 + { 488 + pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff; 489 + struct kvm_gmem *gmem; 490 + struct folio *folio; 491 + struct page *page; 492 + struct file *file; 493 + int r; 494 + 495 + file = kvm_gmem_get_file(slot); 496 + if (!file) 497 + return -EFAULT; 498 + 499 + gmem = file->private_data; 500 + 501 + if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) { 502 + r = -EIO; 503 + goto out_fput; 504 + } 505 + 506 + folio = kvm_gmem_get_folio(file_inode(file), index); 507 + if (!folio) { 508 + r = -ENOMEM; 509 + goto out_fput; 510 + } 511 + 512 + if (folio_test_hwpoison(folio)) { 513 + r = -EHWPOISON; 514 + goto out_unlock; 515 + } 516 + 517 + page = folio_file_page(folio, index); 518 + 519 + *pfn = page_to_pfn(page); 520 + if (max_order) 521 + *max_order = 0; 522 + 523 + r = 0; 524 + 525 + out_unlock: 526 + folio_unlock(folio); 527 + out_fput: 528 + fput(file); 529 + 530 + return r; 531 + } 532 + EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
+436 -86
virt/kvm/kvm_main.c
··· 533 533 } 534 534 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); 535 535 536 - #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) 536 + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER 537 537 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) 538 538 { 539 539 return container_of(mn, struct kvm, mmu_notifier); 540 540 } 541 541 542 - typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range); 542 + typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range); 543 543 544 - typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start, 545 - unsigned long end); 544 + typedef void (*on_lock_fn_t)(struct kvm *kvm); 546 545 547 - typedef void (*on_unlock_fn_t)(struct kvm *kvm); 548 - 549 - struct kvm_hva_range { 550 - unsigned long start; 551 - unsigned long end; 546 + struct kvm_mmu_notifier_range { 547 + /* 548 + * 64-bit addresses, as KVM notifiers can operate on host virtual 549 + * addresses (unsigned long) and guest physical addresses (64-bit). 550 + */ 551 + u64 start; 552 + u64 end; 552 553 union kvm_mmu_notifier_arg arg; 553 - hva_handler_t handler; 554 + gfn_handler_t handler; 554 555 on_lock_fn_t on_lock; 555 - on_unlock_fn_t on_unlock; 556 556 bool flush_on_ret; 557 557 bool may_block; 558 558 }; 559 + 560 + /* 561 + * The inner-most helper returns a tuple containing the return value from the 562 + * arch- and action-specific handler, plus a flag indicating whether or not at 563 + * least one memslot was found, i.e. if the handler found guest memory. 564 + * 565 + * Note, most notifiers are averse to booleans, so even though KVM tracks the 566 + * return from arch code as a bool, outer helpers will cast it to an int. 
:-( 567 + */ 568 + typedef struct kvm_mmu_notifier_return { 569 + bool ret; 570 + bool found_memslot; 571 + } kvm_mn_ret_t; 559 572 560 573 /* 561 574 * Use a dedicated stub instead of NULL to indicate that there is no callback ··· 591 578 node; \ 592 579 node = interval_tree_iter_next(node, start, last)) \ 593 580 594 - static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, 595 - const struct kvm_hva_range *range) 581 + static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, 582 + const struct kvm_mmu_notifier_range *range) 596 583 { 597 - bool ret = false, locked = false; 584 + struct kvm_mmu_notifier_return r = { 585 + .ret = false, 586 + .found_memslot = false, 587 + }; 598 588 struct kvm_gfn_range gfn_range; 599 589 struct kvm_memory_slot *slot; 600 590 struct kvm_memslots *slots; 601 591 int i, idx; 602 592 603 593 if (WARN_ON_ONCE(range->end <= range->start)) 604 - return 0; 594 + return r; 605 595 606 596 /* A null handler is allowed if and only if on_lock() is provided. 
*/ 607 597 if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) && 608 598 IS_KVM_NULL_FN(range->handler))) 609 - return 0; 599 + return r; 610 600 611 601 idx = srcu_read_lock(&kvm->srcu); 612 602 613 - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 603 + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { 614 604 struct interval_tree_node *node; 615 605 616 606 slots = __kvm_memslots(kvm, i); ··· 622 606 unsigned long hva_start, hva_end; 623 607 624 608 slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); 625 - hva_start = max(range->start, slot->userspace_addr); 626 - hva_end = min(range->end, slot->userspace_addr + 627 - (slot->npages << PAGE_SHIFT)); 609 + hva_start = max_t(unsigned long, range->start, slot->userspace_addr); 610 + hva_end = min_t(unsigned long, range->end, 611 + slot->userspace_addr + (slot->npages << PAGE_SHIFT)); 628 612 629 613 /* 630 614 * To optimize for the likely case where the address ··· 643 627 gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot); 644 628 gfn_range.slot = slot; 645 629 646 - if (!locked) { 647 - locked = true; 630 + if (!r.found_memslot) { 631 + r.found_memslot = true; 648 632 KVM_MMU_LOCK(kvm); 649 633 if (!IS_KVM_NULL_FN(range->on_lock)) 650 - range->on_lock(kvm, range->start, range->end); 634 + range->on_lock(kvm); 635 + 651 636 if (IS_KVM_NULL_FN(range->handler)) 652 637 break; 653 638 } 654 - ret |= range->handler(kvm, &gfn_range); 639 + r.ret |= range->handler(kvm, &gfn_range); 655 640 } 656 641 } 657 642 658 - if (range->flush_on_ret && ret) 643 + if (range->flush_on_ret && r.ret) 659 644 kvm_flush_remote_tlbs(kvm); 660 645 661 - if (locked) { 646 + if (r.found_memslot) 662 647 KVM_MMU_UNLOCK(kvm); 663 - if (!IS_KVM_NULL_FN(range->on_unlock)) 664 - range->on_unlock(kvm); 665 - } 666 648 667 649 srcu_read_unlock(&kvm->srcu, idx); 668 650 669 - /* The notifiers are averse to booleans. 
:-( */ 670 - return (int)ret; 651 + return r; 671 652 } 672 653 673 654 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, 674 655 unsigned long start, 675 656 unsigned long end, 676 657 union kvm_mmu_notifier_arg arg, 677 - hva_handler_t handler) 658 + gfn_handler_t handler) 678 659 { 679 660 struct kvm *kvm = mmu_notifier_to_kvm(mn); 680 - const struct kvm_hva_range range = { 661 + const struct kvm_mmu_notifier_range range = { 681 662 .start = start, 682 663 .end = end, 683 664 .arg = arg, 684 665 .handler = handler, 685 666 .on_lock = (void *)kvm_null_fn, 686 - .on_unlock = (void *)kvm_null_fn, 687 667 .flush_on_ret = true, 688 668 .may_block = false, 689 669 }; 690 670 691 - return __kvm_handle_hva_range(kvm, &range); 671 + return __kvm_handle_hva_range(kvm, &range).ret; 692 672 } 693 673 694 674 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn, 695 675 unsigned long start, 696 676 unsigned long end, 697 - hva_handler_t handler) 677 + gfn_handler_t handler) 698 678 { 699 679 struct kvm *kvm = mmu_notifier_to_kvm(mn); 700 - const struct kvm_hva_range range = { 680 + const struct kvm_mmu_notifier_range range = { 701 681 .start = start, 702 682 .end = end, 703 683 .handler = handler, 704 684 .on_lock = (void *)kvm_null_fn, 705 - .on_unlock = (void *)kvm_null_fn, 706 685 .flush_on_ret = false, 707 686 .may_block = false, 708 687 }; 709 688 710 - return __kvm_handle_hva_range(kvm, &range); 689 + return __kvm_handle_hva_range(kvm, &range).ret; 711 690 } 712 691 713 692 static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) ··· 747 736 kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn); 748 737 } 749 738 750 - void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, 751 - unsigned long end) 739 + void kvm_mmu_invalidate_begin(struct kvm *kvm) 752 740 { 741 + lockdep_assert_held_write(&kvm->mmu_lock); 753 742 /* 754 743 * The count increase must become visible 
at unlock time as no 755 744 * spte can be established without taking the mmu_lock and 756 745 * count is also read inside the mmu_lock critical section. 757 746 */ 758 747 kvm->mmu_invalidate_in_progress++; 748 + 759 749 if (likely(kvm->mmu_invalidate_in_progress == 1)) { 750 + kvm->mmu_invalidate_range_start = INVALID_GPA; 751 + kvm->mmu_invalidate_range_end = INVALID_GPA; 752 + } 753 + } 754 + 755 + void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end) 756 + { 757 + lockdep_assert_held_write(&kvm->mmu_lock); 758 + 759 + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress); 760 + 761 + if (likely(kvm->mmu_invalidate_range_start == INVALID_GPA)) { 760 762 kvm->mmu_invalidate_range_start = start; 761 763 kvm->mmu_invalidate_range_end = end; 762 764 } else { ··· 789 765 } 790 766 } 791 767 768 + bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) 769 + { 770 + kvm_mmu_invalidate_range_add(kvm, range->start, range->end); 771 + return kvm_unmap_gfn_range(kvm, range); 772 + } 773 + 792 774 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, 793 775 const struct mmu_notifier_range *range) 794 776 { 795 777 struct kvm *kvm = mmu_notifier_to_kvm(mn); 796 - const struct kvm_hva_range hva_range = { 778 + const struct kvm_mmu_notifier_range hva_range = { 797 779 .start = range->start, 798 780 .end = range->end, 799 - .handler = kvm_unmap_gfn_range, 781 + .handler = kvm_mmu_unmap_gfn_range, 800 782 .on_lock = kvm_mmu_invalidate_begin, 801 - .on_unlock = kvm_arch_guest_memory_reclaimed, 802 783 .flush_on_ret = true, 803 784 .may_block = mmu_notifier_range_blockable(range), 804 785 }; ··· 835 806 gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end, 836 807 hva_range.may_block); 837 808 838 - __kvm_handle_hva_range(kvm, &hva_range); 809 + /* 810 + * If one or more memslots were found and thus zapped, notify arch code 811 + * that guest memory has been reclaimed. 
This needs to be done *after* 812 + * dropping mmu_lock, as x86's reclaim path is slooooow. 813 + */ 814 + if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot) 815 + kvm_arch_guest_memory_reclaimed(kvm); 839 816 840 817 return 0; 841 818 } 842 819 843 - void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start, 844 - unsigned long end) 820 + void kvm_mmu_invalidate_end(struct kvm *kvm) 845 821 { 822 + lockdep_assert_held_write(&kvm->mmu_lock); 823 + 846 824 /* 847 825 * This sequence increase will notify the kvm page fault that 848 826 * the page that is going to be mapped in the spte could have ··· 863 827 * in conjunction with the smp_rmb in mmu_invalidate_retry(). 864 828 */ 865 829 kvm->mmu_invalidate_in_progress--; 830 + KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm); 831 + 832 + /* 833 + * Assert that at least one range was added between start() and end(). 834 + * Not adding a range isn't fatal, but it is a KVM bug. 835 + */ 836 + WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA); 866 837 } 867 838 868 839 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, 869 840 const struct mmu_notifier_range *range) 870 841 { 871 842 struct kvm *kvm = mmu_notifier_to_kvm(mn); 872 - const struct kvm_hva_range hva_range = { 843 + const struct kvm_mmu_notifier_range hva_range = { 873 844 .start = range->start, 874 845 .end = range->end, 875 846 .handler = (void *)kvm_null_fn, 876 847 .on_lock = kvm_mmu_invalidate_end, 877 - .on_unlock = (void *)kvm_null_fn, 878 848 .flush_on_ret = false, 879 849 .may_block = mmu_notifier_range_blockable(range), 880 850 }; ··· 899 857 */ 900 858 if (wake) 901 859 rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait); 902 - 903 - BUG_ON(kvm->mmu_invalidate_in_progress < 0); 904 860 } 905 861 906 862 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, ··· 972 932 return mmu_notifier_register(&kvm->mmu_notifier, current->mm); 973 933 } 974 934 975 - #else /* 
!(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */ 935 + #else /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */ 976 936 977 937 static int kvm_init_mmu_notifier(struct kvm *kvm) 978 938 { 979 939 return 0; 980 940 } 981 941 982 - #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ 942 + #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */ 983 943 984 944 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER 985 945 static int kvm_pm_notifier_call(struct notifier_block *bl, ··· 1025 985 /* This does not remove the slot from struct kvm_memslots data structures */ 1026 986 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) 1027 987 { 988 + if (slot->flags & KVM_MEM_GUEST_MEMFD) 989 + kvm_gmem_unbind(slot); 990 + 1028 991 kvm_destroy_dirty_bitmap(slot); 1029 992 1030 993 kvm_arch_free_memslot(kvm, slot); ··· 1209 1166 spin_lock_init(&kvm->mn_invalidate_lock); 1210 1167 rcuwait_init(&kvm->mn_memslots_update_rcuwait); 1211 1168 xa_init(&kvm->vcpu_array); 1169 + #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES 1170 + xa_init(&kvm->mem_attr_array); 1171 + #endif 1212 1172 1213 1173 INIT_LIST_HEAD(&kvm->gpc_list); 1214 1174 spin_lock_init(&kvm->gpc_lock); ··· 1236 1190 goto out_err_no_irq_srcu; 1237 1191 1238 1192 refcount_set(&kvm->users_count, 1); 1239 - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 1193 + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { 1240 1194 for (j = 0; j < 2; j++) { 1241 1195 slots = &kvm->__memslots[i][j]; 1242 1196 ··· 1268 1222 if (r) 1269 1223 goto out_err_no_disable; 1270 1224 1271 - #ifdef CONFIG_HAVE_KVM_IRQFD 1225 + #ifdef CONFIG_HAVE_KVM_IRQCHIP 1272 1226 INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list); 1273 1227 #endif 1274 1228 ··· 1302 1256 out_err_no_debugfs: 1303 1257 kvm_coalesced_mmio_free(kvm); 1304 1258 out_no_coalesced_mmio: 1305 - #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) 1259 + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER 1306 1260 if (kvm->mmu_notifier.ops) 1307 1261 
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm); 1308 1262 #endif ··· 1361 1315 kvm->buses[i] = NULL; 1362 1316 } 1363 1317 kvm_coalesced_mmio_free(kvm); 1364 - #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) 1318 + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER 1365 1319 mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm); 1366 1320 /* 1367 1321 * At this point, pending calls to invalidate_range_start() ··· 1370 1324 * No threads can be waiting in kvm_swap_active_memslots() as the 1371 1325 * last reference on KVM has been dropped, but freeing 1372 1326 * memslots would deadlock without this manual intervention. 1327 + * 1328 + * If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU 1329 + * notifier between a start() and end(), then there shouldn't be any 1330 + * in-progress invalidations. 1373 1331 */ 1374 1332 WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait)); 1375 - kvm->mn_active_invalidate_count = 0; 1333 + if (kvm->mn_active_invalidate_count) 1334 + kvm->mn_active_invalidate_count = 0; 1335 + else 1336 + WARN_ON(kvm->mmu_invalidate_in_progress); 1376 1337 #else 1377 1338 kvm_flush_shadow_all(kvm); 1378 1339 #endif 1379 1340 kvm_arch_destroy_vm(kvm); 1380 1341 kvm_destroy_devices(kvm); 1381 - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 1342 + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { 1382 1343 kvm_free_memslots(kvm, &kvm->__memslots[i][0]); 1383 1344 kvm_free_memslots(kvm, &kvm->__memslots[i][1]); 1384 1345 } 1385 1346 cleanup_srcu_struct(&kvm->irq_srcu); 1386 1347 cleanup_srcu_struct(&kvm->srcu); 1348 + #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES 1349 + xa_destroy(&kvm->mem_attr_array); 1350 + #endif 1387 1351 kvm_arch_free_vm(kvm); 1388 1352 preempt_notifier_dec(); 1389 1353 hardware_disable_all(); ··· 1594 1538 } 1595 1539 } 1596 1540 1597 - static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) 1541 + /* 1542 + * Flags that do not access any of the extra space of 
struct 1543 + * kvm_userspace_memory_region2. KVM_SET_USER_MEMORY_REGION_V1_FLAGS 1544 + * only allows these. 1545 + */ 1546 + #define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \ 1547 + (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY) 1548 + 1549 + static int check_memory_region_flags(struct kvm *kvm, 1550 + const struct kvm_userspace_memory_region2 *mem) 1598 1551 { 1599 1552 u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; 1553 + 1554 + if (kvm_arch_has_private_mem(kvm)) 1555 + valid_flags |= KVM_MEM_GUEST_MEMFD; 1556 + 1557 + /* Dirty logging private memory is not currently supported. */ 1558 + if (mem->flags & KVM_MEM_GUEST_MEMFD) 1559 + valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES; 1600 1560 1601 1561 #ifdef __KVM_HAVE_READONLY_MEM 1602 1562 valid_flags |= KVM_MEM_READONLY; ··· 1675 1603 * space 0 will use generations 0, 2, 4, ... while address space 1 will 1676 1604 * use generations 1, 3, 5, ... 1677 1605 */ 1678 - gen += KVM_ADDRESS_SPACE_NUM; 1606 + gen += kvm_arch_nr_memslot_as_ids(kvm); 1679 1607 1680 1608 kvm_arch_memslots_updated(kvm, gen); 1681 1609 ··· 2012 1940 * Must be called holding kvm->slots_lock for write. 
2013 1941 */ 2014 1942 int __kvm_set_memory_region(struct kvm *kvm, 2015 - const struct kvm_userspace_memory_region *mem) 1943 + const struct kvm_userspace_memory_region2 *mem) 2016 1944 { 2017 1945 struct kvm_memory_slot *old, *new; 2018 1946 struct kvm_memslots *slots; ··· 2022 1950 int as_id, id; 2023 1951 int r; 2024 1952 2025 - r = check_memory_region_flags(mem); 1953 + r = check_memory_region_flags(kvm, mem); 2026 1954 if (r) 2027 1955 return r; 2028 1956 ··· 2041 1969 !access_ok((void __user *)(unsigned long)mem->userspace_addr, 2042 1970 mem->memory_size)) 2043 1971 return -EINVAL; 2044 - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) 1972 + if (mem->flags & KVM_MEM_GUEST_MEMFD && 1973 + (mem->guest_memfd_offset & (PAGE_SIZE - 1) || 1974 + mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset)) 1975 + return -EINVAL; 1976 + if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM) 2045 1977 return -EINVAL; 2046 1978 if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr) 2047 1979 return -EINVAL; ··· 2083 2007 if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) 2084 2008 return -EINVAL; 2085 2009 } else { /* Modify an existing slot. */ 2010 + /* Private memslots are immutable, they can only be deleted. 
		 */
+		if (mem->flags & KVM_MEM_GUEST_MEMFD)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
···
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_GUEST_MEMFD) {
+		r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
+		if (r)
+			goto out;
+	}
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out_unbind;
+
+	return 0;
+
+out_unbind:
+	if (mem->flags & KVM_MEM_GUEST_MEMFD)
+		kvm_gmem_unbind(new);
+out:
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_userspace_memory_region2 *mem)
 {
 	int r;
···
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_userspace_memory_region2 *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
···
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
···
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 	slots = __kvm_memslots(kvm, as_id);
···
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	if (log->first_page & 63)
···
 	return r;
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attrs)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	unsigned long index;
+	bool has_attrs;
+	void *entry;
+
+	rcu_read_lock();
+
+	if (!attrs) {
+		has_attrs = !xas_find(&xas, end - 1);
+		goto out;
+	}
+
+	has_attrs = true;
+	for (index = start; index < end; index++) {
+		do {
+			entry = xas_next(&xas);
+		} while (xas_retry(&xas, entry));
+
+		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
+			has_attrs = false;
+			break;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+	return has_attrs;
+}
+
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+	if (!kvm || kvm_arch_has_private_mem(kvm))
+		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	return 0;
+}
+
+static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
+						 struct kvm_mmu_notifier_range *range)
+{
+	struct kvm_gfn_range gfn_range;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	bool found_memslot = false;
+	bool ret = false;
+	int i;
+
+	gfn_range.arg = range->arg;
+	gfn_range.may_block = range->may_block;
+
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
+			slot = iter.slot;
+			gfn_range.slot = slot;
+
+			gfn_range.start = max(range->start, slot->base_gfn);
+			gfn_range.end = min(range->end, slot->base_gfn + slot->npages);
+			if (gfn_range.start >= gfn_range.end)
+				continue;
+
+			if (!found_memslot) {
+				found_memslot = true;
+				KVM_MMU_LOCK(kvm);
+				if (!IS_KVM_NULL_FN(range->on_lock))
+					range->on_lock(kvm);
+			}
+
+			ret |= range->handler(kvm, &gfn_range);
+		}
+	}
+
+	if (range->flush_on_ret && ret)
+		kvm_flush_remote_tlbs(kvm);
+
+	if (found_memslot)
+		KVM_MMU_UNLOCK(kvm);
+}
+
+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
+					  struct kvm_gfn_range *range)
+{
+	/*
+	 * Unconditionally add the range to the invalidation set, regardless of
+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
+	 * if KVM supports RWX attributes in the future and the attributes are
+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
+	 * adding the range allows KVM to require that MMU invalidations add at
+	 * least one range between begin() and end(), e.g. allows KVM to detect
+	 * bugs where the add() is missed.  Relaxing the rule *might* be safe,
+	 * but it's not obvious that allowing new mappings while the attributes
+	 * are in flux is desirable or worth the complexity.
+	 */
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+	return kvm_arch_pre_set_memory_attributes(kvm, range);
+}
+
+/* Set @attributes for the gfn range [@start, @end).
+ */
+static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attributes)
+{
+	struct kvm_mmu_notifier_range pre_set_range = {
+		.start = start,
+		.end = end,
+		.handler = kvm_pre_set_memory_attributes,
+		.on_lock = kvm_mmu_invalidate_begin,
+		.flush_on_ret = true,
+		.may_block = true,
+	};
+	struct kvm_mmu_notifier_range post_set_range = {
+		.start = start,
+		.end = end,
+		.arg.attributes = attributes,
+		.handler = kvm_arch_post_set_memory_attributes,
+		.on_lock = kvm_mmu_invalidate_end,
+		.may_block = true,
+	};
+	unsigned long i;
+	void *entry;
+	int r = 0;
+
+	entry = attributes ? xa_mk_value(attributes) : NULL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	/* Nothing to do if the entire range has the desired attributes. */
+	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
+		goto out_unlock;
+
+	/*
+	 * Reserve memory ahead of time to avoid having to deal with failures
+	 * partway through setting the new attributes.
+	 */
+	for (i = start; i < end; i++) {
+		r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
+		if (r)
+			goto out_unlock;
+	}
+
+	kvm_handle_gfn_range(kvm, &pre_set_range);
+
+	for (i = start; i < end; i++) {
+		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+				    GFP_KERNEL_ACCOUNT));
+		KVM_BUG_ON(r, kvm);
+	}
+
+	kvm_handle_gfn_range(kvm, &post_set_range);
+
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+					   struct kvm_memory_attributes *attrs)
+{
+	gfn_t start, end;
+
+	/* flags is currently not used.
+	 */
+	if (attrs->flags)
+		return -EINVAL;
+	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+		return -EINVAL;
+	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
+		return -EINVAL;
+	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
+		return -EINVAL;
+
+	start = attrs->address >> PAGE_SHIFT;
+	end = (attrs->address + attrs->size) >> PAGE_SHIFT;
+
+	/*
+	 * xarray tracks data using "unsigned long", and as a result so does
+	 * KVM.  For simplicity, supports generic attributes only on 64-bit
+	 * architectures.
+	 */
+	BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long));
+
+	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
+}
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
···
 {
 	switch (arg) {
 	case KVM_CAP_USER_MEMORY:
+	case KVM_CAP_USER_MEMORY2:
 	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
 	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
 	case KVM_CAP_INTERNAL_ERROR_DATA:
 #ifdef CONFIG_HAVE_KVM_MSI
 	case KVM_CAP_SIGNAL_MSI:
 #endif
-#ifdef CONFIG_HAVE_KVM_IRQFD
+#ifdef CONFIG_HAVE_KVM_IRQCHIP
 	case KVM_CAP_IRQFD:
 #endif
 	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
···
 	case KVM_CAP_IRQ_ROUTING:
 		return KVM_MAX_IRQ_ROUTES;
 #endif
-#if KVM_ADDRESS_SPACE_NUM > 1
+#if KVM_MAX_NR_ADDRESS_SPACES > 1
 	case KVM_CAP_MULTI_ADDRESS_SPACE:
-		return KVM_ADDRESS_SPACE_NUM;
+		if (kvm)
+			return kvm_arch_nr_memslot_as_ids(kvm);
+		return KVM_MAX_NR_ADDRESS_SPACES;
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
···
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
+	case KVM_CAP_DEVICE_CTRL:
 		return 1;
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	case KVM_CAP_MEMORY_ATTRIBUTES:
+		return kvm_supported_mem_attributes(kvm);
+#endif
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	case KVM_CAP_GUEST_MEMFD:
+		return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
 	default:
 		break;
 	}
···
 	lockdep_assert_held(&kvm->slots_lock);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
 			return false;
 	}
···
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field) \
+do { \
+	BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) !=	\
+		     offsetof(struct kvm_userspace_memory_region2, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) != \
+		     sizeof_field(struct kvm_userspace_memory_region2, field)); \
+} while (0)
+
 static long kvm_vm_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
···
 		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
 		break;
 	}
+	case KVM_SET_USER_MEMORY_REGION2:
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_userspace_memory_region2 mem;
+		unsigned long size;
+
+		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
+			/*
+			 * Fields beyond struct kvm_userspace_memory_region shouldn't be
+			 * accessed, but avoid leaking kernel memory in case of a bug.
+			 */
+			memset(&mem, 0, sizeof(mem));
+			size = sizeof(struct kvm_userspace_memory_region);
+		} else {
+			size = sizeof(struct kvm_userspace_memory_region2);
+		}
+
+		/* Ensure the common parts of the two structs are identical. */
+		SANITY_CHECK_MEM_REGION_FIELD(slot);
+		SANITY_CHECK_MEM_REGION_FIELD(flags);
+		SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+		SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+		SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-				   sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = -EINVAL;
+		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
+		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_V1_FLAGS))
+			goto out;
+
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
···
 			goto out;
 		if (routing.nr) {
 			urouting = argp;
-			entries = vmemdup_user(urouting->entries,
-					       array_size(sizeof(*entries),
-							  routing.nr));
+			entries = vmemdup_array_user(urouting->entries,
+						     routing.nr, sizeof(*entries));
 			if (IS_ERR(entries)) {
 				r = PTR_ERR(entries);
 				goto out;
···
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	case KVM_SET_MEMORY_ATTRIBUTES: {
+		struct kvm_memory_attributes attrs;
+
+		r = -EFAULT;
+		if (copy_from_user(&attrs, argp, sizeof(attrs)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
+		break;
+	}
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
···
 	case KVM_GET_STATS_FD:
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	case KVM_CREATE_GUEST_MEMFD: {
+		struct kvm_create_guest_memfd guest_memfd;
+
+		r = -EFAULT;
+		if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
+			goto out;
+
+		r = kvm_gmem_create(kvm, &guest_memfd);
+		break;
+	}
+#endif
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
···
 #ifdef CONFIG_KVM_MMIO
 		r += PAGE_SIZE;    /* coalesced mmio ring page */
 #endif
-		break;
-	case KVM_TRACE_ENABLE:
-	case KVM_TRACE_PAUSE:
-	case KVM_TRACE_DISABLE:
-		r = -EOPNOTSUPP;
 		break;
 	default:
 		return kvm_arch_dev_ioctl(filp, ioctl, arg);
···
 	r = kvm_vfio_ops_init();
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;
+
+	kvm_gmem_init(module);
 
 	/*
 	 * Registration _must_ be the very last thing done, as this exposes
virt/kvm/kvm_mm.h | 26 +
···
 }
 #endif /* HAVE_KVM_PFNCACHE */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+void kvm_gmem_init(struct module *module);
+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+		  unsigned int fd, loff_t offset);
+void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#else
+static inline void kvm_gmem_init(struct module *module)
+{
+
+}
+
+static inline int kvm_gmem_bind(struct kvm *kvm,
+				struct kvm_memory_slot *slot,
+				unsigned int fd, loff_t offset)
+{
+	WARN_ON_ONCE(1);
+	return -EIO;
+}
+
+static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif /* __KVM_MM_H__ */