Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA

- Use acquire/release semantics in the KVM FF-A proxy to avoid
reading a stale value for the FF-A version

- Fix KVM guest driver to match PV CPUID hypercall ABI

- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work

s390:

- Don't use %pK for debug printing and tracepoints

x86:

- Use a separate subclass when acquiring KVM's per-CPU posted
interrupts wakeup lock in the scheduled out path, i.e. when adding
a vCPU on the list of vCPUs to wake, to work around a false positive
deadlock. The schedule out code runs with a scheduler lock that the
wakeup handler takes in the opposite order; but it does so with
IRQs disabled and cannot run concurrently with a wakeup

- Explicitly zero-initialize on-stack CPUID unions

- Allow building irqbypass.ko as a module when kvm.ko is a module

- Wrap relatively expensive sanity check with KVM_PROVE_MMU

- Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses

selftests:

- Add more scenarios to the MONITOR/MWAIT test

- Add option to rseq test to override /dev/cpu_dma_latency

- Bring list of exit reasons up to date

- Clean up the Makefile so that tests valid on all architectures are
listed only once

Other:

- Documentation fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
KVM: arm64: Use acquire/release to communicate FF-A version negotiation
KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
KVM: arm64: selftests: Introduce and use hardware-definition macros
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
KVM: x86: Explicitly zero-initialize on-stack CPUID unions
KVM: Allow building irqbypass.ko as a module when kvm.ko is a module
KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
Documentation: kvm: remove KVM_CAP_MIPS_TE
Documentation: kvm: organize capabilities in the right section
Documentation: kvm: fix some definition lists
Documentation: kvm: drop "Capability" heading from capabilities
Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
selftests: kvm: list once tests that are valid on all architectures
selftests: kvm: bring list of exit reasons up to date
selftests: kvm: revamp MONITOR/MWAIT tests
KVM: arm64: Don't translate FAR if invalid/unsafe
...

+964 -786
+562 -585
Documentation/virt/kvm/api.rst
··· 7447 7447 7448 7448 This capability connects the vcpu to an in-kernel XIVE device. 7449 7449 7450 + 6.76 KVM_CAP_HYPERV_SYNIC 7451 + ------------------------- 7452 + 7453 + :Architectures: x86 7454 + :Target: vcpu 7455 + 7456 + This capability, if KVM_CHECK_EXTENSION indicates that it is 7457 + available, means that the kernel has an implementation of the 7458 + Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is 7459 + used to support Windows Hyper-V based guest paravirt drivers(VMBus). 7460 + 7461 + In order to use SynIC, it has to be activated by setting this 7462 + capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this 7463 + will disable the use of APIC hardware virtualization even if supported 7464 + by the CPU, as it's incompatible with SynIC auto-EOI behavior. 7465 + 7466 + 6.77 KVM_CAP_HYPERV_SYNIC2 7467 + -------------------------- 7468 + 7469 + :Architectures: x86 7470 + :Target: vcpu 7471 + 7472 + This capability enables a newer version of Hyper-V Synthetic interrupt 7473 + controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM 7474 + doesn't clear SynIC message and event flags pages when they are enabled by 7475 + writing to the respective MSRs. 7476 + 7477 + 6.78 KVM_CAP_HYPERV_DIRECT_TLBFLUSH 7478 + ----------------------------------- 7479 + 7480 + :Architectures: x86 7481 + :Target: vcpu 7482 + 7483 + This capability indicates that KVM running on top of Hyper-V hypervisor 7484 + enables Direct TLB flush for its guests meaning that TLB flush 7485 + hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. 7486 + Due to the different ABI for hypercall parameters between Hyper-V and 7487 + KVM, enabling this capability effectively disables all hypercall 7488 + handling by KVM (as some KVM hypercall may be mistakenly treated as TLB 7489 + flush hypercalls by Hyper-V) so userspace should disable KVM identification 7490 + in CPUID and only exposes Hyper-V identification. 
In this case, guest 7491 + thinks it's running on Hyper-V and only use Hyper-V hypercalls. 7492 + 7493 + 6.79 KVM_CAP_HYPERV_ENFORCE_CPUID 7494 + --------------------------------- 7495 + 7496 + :Architectures: x86 7497 + :Target: vcpu 7498 + 7499 + When enabled, KVM will disable emulated Hyper-V features provided to the 7500 + guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all 7501 + currently implemented Hyper-V features are provided unconditionally when 7502 + Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001) 7503 + leaf. 7504 + 7505 + 6.80 KVM_CAP_ENFORCE_PV_FEATURE_CPUID 7506 + ------------------------------------- 7507 + 7508 + :Architectures: x86 7509 + :Target: vcpu 7510 + 7511 + When enabled, KVM will disable paravirtual features provided to the 7512 + guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf 7513 + (0x40000001). Otherwise, a guest may use the paravirtual features 7514 + regardless of what has actually been exposed through the CPUID leaf. 7515 + 7516 + .. _KVM_CAP_DIRTY_LOG_RING: 7517 + 7518 + 7450 7519 .. _cap_enable_vm: 7451 7520 7452 7521 7. Capabilities that can be enabled on VMs ··· 7996 7927 7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 7997 7928 ------------------------------------- 7998 7929 7999 - Architectures: x86 SEV enabled 8000 - Type: vm 8001 - Parameters: args[0] is the fd of the source vm 8002 - Returns: 0 on success; ENOTTY on error 7930 + :Architectures: x86 SEV enabled 7931 + :Type: vm 7932 + :Parameters: args[0] is the fd of the source vm 7933 + :Returns: 0 on success; ENOTTY on error 8003 7934 8004 7935 This capability enables userspace to copy encryption context from the vm 8005 7936 indicated by the fd to the vm this is called on. ··· 8031 7962 default. 8032 7963 8033 7964 See Documentation/arch/x86/sgx.rst for more details. 
8034 - 8035 - 7.26 KVM_CAP_PPC_RPT_INVALIDATE 8036 - ------------------------------- 8037 - 8038 - :Capability: KVM_CAP_PPC_RPT_INVALIDATE 8039 - :Architectures: ppc 8040 - :Type: vm 8041 - 8042 - This capability indicates that the kernel is capable of handling 8043 - H_RPT_INVALIDATE hcall. 8044 - 8045 - In order to enable the use of H_RPT_INVALIDATE in the guest, 8046 - user space might have to advertise it for the guest. For example, 8047 - IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is 8048 - present in the "ibm,hypertas-functions" device-tree property. 8049 - 8050 - This capability is enabled for hypervisors on platforms like POWER9 8051 - that support radix MMU. 8052 7965 8053 7966 7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE 8054 7967 -------------------------------------- ··· 8089 8038 This is intended to support intra-host migration of VMs between userspace VMMs, 8090 8039 upgrading the VMM process without interrupting the guest. 8091 8040 8092 - 7.30 KVM_CAP_PPC_AIL_MODE_3 8093 - ------------------------------- 8094 - 8095 - :Capability: KVM_CAP_PPC_AIL_MODE_3 8096 - :Architectures: ppc 8097 - :Type: vm 8098 - 8099 - This capability indicates that the kernel supports the mode 3 setting for the 8100 - "Address Translation Mode on Interrupt" aka "Alternate Interrupt Location" 8101 - resource that is controlled with the H_SET_MODE hypercall. 8102 - 8103 - This capability allows a guest kernel to use a better-performance mode for 8104 - handling interrupts and system calls. 8105 - 8106 8041 7.31 KVM_CAP_DISABLE_QUIRKS2 8107 8042 ---------------------------- 8108 8043 8109 - :Capability: KVM_CAP_DISABLE_QUIRKS2 8110 8044 :Parameters: args[0] - set of KVM quirks to disable 8111 8045 :Architectures: x86 8112 8046 :Type: vm ··· 8246 8210 cause CPU stuck (due to event windows don't open up) and make the CPU 8247 8211 unavailable to host or other VMs. 
8248 8212 8249 - 7.34 KVM_CAP_MEMORY_FAULT_INFO 8250 - ------------------------------ 8251 - 8252 - :Architectures: x86 8253 - :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. 8254 - 8255 - The presence of this capability indicates that KVM_RUN will fill 8256 - kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if 8257 - there is a valid memslot but no backing VMA for the corresponding host virtual 8258 - address. 8259 - 8260 - The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns 8261 - an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set 8262 - to KVM_EXIT_MEMORY_FAULT. 8263 - 8264 - Note: Userspaces which attempt to resolve memory faults so that they can retry 8265 - KVM_RUN are encouraged to guard against repeatedly receiving the same 8266 - error/annotated fault. 8267 - 8268 - See KVM_EXIT_MEMORY_FAULT for more information. 8269 - 8270 8213 7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS 8271 8214 ----------------------------------- 8272 8215 ··· 8263 8248 Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the 8264 8249 core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest. 8265 8250 8266 - 7.36 KVM_CAP_X86_GUEST_MODE 8267 - ------------------------------ 8268 - 8269 - :Architectures: x86 8270 - :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. 8271 - 8272 - The presence of this capability indicates that KVM_RUN will update the 8273 - KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the 8274 - vCPU was executing nested guest code when it exited. 8275 - 8276 - KVM exits with the register state of either the L1 or L2 guest 8277 - depending on which executed at the time of an exit. Userspace must 8278 - take care to differentiate between these cases. 
8279 - 8280 - 7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 8281 - ------------------------------------- 8282 - 8283 - :Architectures: arm64 8284 - :Target: VM 8285 - :Parameters: None 8286 - :Returns: 0 on success, -EINVAL if vCPUs have been created before enabling this 8287 - capability. 8288 - 8289 - This capability changes the behavior of the registers that identify a PE 8290 - implementation of the Arm architecture: MIDR_EL1, REVIDR_EL1, and AIDR_EL1. 8291 - By default, these registers are visible to userspace but treated as invariant. 8292 - 8293 - When this capability is enabled, KVM allows userspace to change the 8294 - aforementioned registers before the first KVM_RUN. These registers are VM 8295 - scoped, meaning that the same set of values are presented on all vCPUs in a 8296 - given VM. 8297 - 8298 - 8. Other capabilities. 8299 - ====================== 8300 - 8301 - This section lists capabilities that give information about other 8302 - features of the KVM implementation. 8303 - 8304 - 8.1 KVM_CAP_PPC_HWRNG 8305 - --------------------- 8306 - 8307 - :Architectures: ppc 8308 - 8309 - This capability, if KVM_CHECK_EXTENSION indicates that it is 8310 - available, means that the kernel has an implementation of the 8311 - H_RANDOM hypercall backed by a hardware random-number generator. 8312 - If present, the kernel H_RANDOM handler can be enabled for guest use 8313 - with the KVM_CAP_PPC_ENABLE_HCALL capability. 8314 - 8315 - 8.2 KVM_CAP_HYPERV_SYNIC 8316 - ------------------------ 8317 - 8318 - :Architectures: x86 8319 - 8320 - This capability, if KVM_CHECK_EXTENSION indicates that it is 8321 - available, means that the kernel has an implementation of the 8322 - Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is 8323 - used to support Windows Hyper-V based guest paravirt drivers(VMBus). 8324 - 8325 - In order to use SynIC, it has to be activated by setting this 8326 - capability via KVM_ENABLE_CAP ioctl on the vcpu fd. 
Note that this 8327 - will disable the use of APIC hardware virtualization even if supported 8328 - by the CPU, as it's incompatible with SynIC auto-EOI behavior. 8329 - 8330 - 8.3 KVM_CAP_PPC_MMU_RADIX 8331 - ------------------------- 8332 - 8333 - :Architectures: ppc 8334 - 8335 - This capability, if KVM_CHECK_EXTENSION indicates that it is 8336 - available, means that the kernel can support guests using the 8337 - radix MMU defined in Power ISA V3.00 (as implemented in the POWER9 8338 - processor). 8339 - 8340 - 8.4 KVM_CAP_PPC_MMU_HASH_V3 8341 - --------------------------- 8342 - 8343 - :Architectures: ppc 8344 - 8345 - This capability, if KVM_CHECK_EXTENSION indicates that it is 8346 - available, means that the kernel can support guests using the 8347 - hashed page table MMU defined in Power ISA V3.00 (as implemented in 8348 - the POWER9 processor), including in-memory segment tables. 8349 - 8350 - 8.5 KVM_CAP_MIPS_VZ 8351 - ------------------- 8352 - 8353 - :Architectures: mips 8354 - 8355 - This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that 8356 - it is available, means that full hardware assisted virtualization capabilities 8357 - of the hardware are available for use through KVM. An appropriate 8358 - KVM_VM_MIPS_* type must be passed to KVM_CREATE_VM to create a VM which 8359 - utilises it. 8360 - 8361 - If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is 8362 - available, it means that the VM is using full hardware assisted virtualization 8363 - capabilities of the hardware. This is useful to check after creating a VM with 8364 - KVM_VM_MIPS_DEFAULT. 8365 - 8366 - The value returned by KVM_CHECK_EXTENSION should be compared against known 8367 - values (see below). All other values are reserved. This is to allow for the 8368 - possibility of other hardware assisted virtualization implementations which 8369 - may be incompatible with the MIPS VZ ASE. 
8370 - 8371 - == ========================================================================== 8372 - 0 The trap & emulate implementation is in use to run guest code in user 8373 - mode. Guest virtual memory segments are rearranged to fit the guest in the 8374 - user mode address space. 8375 - 8376 - 1 The MIPS VZ ASE is in use, providing full hardware assisted 8377 - virtualization, including standard guest virtual memory segments. 8378 - == ========================================================================== 8379 - 8380 - 8.6 KVM_CAP_MIPS_TE 8381 - ------------------- 8382 - 8383 - :Architectures: mips 8384 - 8385 - This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that 8386 - it is available, means that the trap & emulate implementation is available to 8387 - run guest code in user mode, even if KVM_CAP_MIPS_VZ indicates that hardware 8388 - assisted virtualisation is also available. KVM_VM_MIPS_TE (0) must be passed 8389 - to KVM_CREATE_VM to create a VM which utilises it. 8390 - 8391 - If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is 8392 - available, it means that the VM is using trap & emulate. 8393 - 8394 - 8.7 KVM_CAP_MIPS_64BIT 8395 - ---------------------- 8396 - 8397 - :Architectures: mips 8398 - 8399 - This capability indicates the supported architecture type of the guest, i.e. the 8400 - supported register and address width. 8401 - 8402 - The values returned when this capability is checked by KVM_CHECK_EXTENSION on a 8403 - kvm VM handle correspond roughly to the CP0_Config.AT register field, and should 8404 - be checked specifically against known values (see below). All other values are 8405 - reserved. 8406 - 8407 - == ======================================================================== 8408 - 0 MIPS32 or microMIPS32. 8409 - Both registers and addresses are 32-bits wide. 8410 - It will only be possible to run 32-bit guest code. 
8411 - 8412 - 1 MIPS64 or microMIPS64 with access only to 32-bit compatibility segments. 8413 - Registers are 64-bits wide, but addresses are 32-bits wide. 8414 - 64-bit guest code may run but cannot access MIPS64 memory segments. 8415 - It will also be possible to run 32-bit guest code. 8416 - 8417 - 2 MIPS64 or microMIPS64 with access to all address segments. 8418 - Both registers and addresses are 64-bits wide. 8419 - It will be possible to run 64-bit or 32-bit guest code. 8420 - == ======================================================================== 8421 - 8422 - 8.9 KVM_CAP_ARM_USER_IRQ 8423 - ------------------------ 8424 - 8425 - :Architectures: arm64 8426 - 8427 - This capability, if KVM_CHECK_EXTENSION indicates that it is available, means 8428 - that if userspace creates a VM without an in-kernel interrupt controller, it 8429 - will be notified of changes to the output level of in-kernel emulated devices, 8430 - which can generate virtual interrupts, presented to the VM. 8431 - For such VMs, on every return to userspace, the kernel 8432 - updates the vcpu's run->s.regs.device_irq_level field to represent the actual 8433 - output level of the device. 8434 - 8435 - Whenever kvm detects a change in the device output level, kvm guarantees at 8436 - least one return to userspace before running the VM. This exit could either 8437 - be a KVM_EXIT_INTR or any other exit event, like KVM_EXIT_MMIO. This way, 8438 - userspace can always sample the device output level and re-compute the state of 8439 - the userspace interrupt controller. Userspace should always check the state 8440 - of run->s.regs.device_irq_level on every kvm exit. 8441 - The value in run->s.regs.device_irq_level can represent both level and edge 8442 - triggered interrupt signals, depending on the device. Edge triggered interrupt 8443 - signals will exit to userspace with the bit in run->s.regs.device_irq_level 8444 - set exactly once per edge signal. 
8445 - 8446 - The field run->s.regs.device_irq_level is available independent of 8447 - run->kvm_valid_regs or run->kvm_dirty_regs bits. 8448 - 8449 - If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a 8450 - number larger than 0 indicating the version of this capability is implemented 8451 - and thereby which bits in run->s.regs.device_irq_level can signal values. 8452 - 8453 - Currently the following bits are defined for the device_irq_level bitmap:: 8454 - 8455 - KVM_CAP_ARM_USER_IRQ >= 1: 8456 - 8457 - KVM_ARM_DEV_EL1_VTIMER - EL1 virtual timer 8458 - KVM_ARM_DEV_EL1_PTIMER - EL1 physical timer 8459 - KVM_ARM_DEV_PMU - ARM PMU overflow interrupt signal 8460 - 8461 - Future versions of kvm may implement additional events. These will get 8462 - indicated by returning a higher number from KVM_CHECK_EXTENSION and will be 8463 - listed above. 8464 - 8465 - 8.10 KVM_CAP_PPC_SMT_POSSIBLE 8466 - ----------------------------- 8467 - 8468 - :Architectures: ppc 8469 - 8470 - Querying this capability returns a bitmap indicating the possible 8471 - virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N 8472 - (counting from the right) is set, then a virtual SMT mode of 2^N is 8473 - available. 8474 - 8475 - 8.11 KVM_CAP_HYPERV_SYNIC2 8476 - -------------------------- 8477 - 8478 - :Architectures: x86 8479 - 8480 - This capability enables a newer version of Hyper-V Synthetic interrupt 8481 - controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM 8482 - doesn't clear SynIC message and event flags pages when they are enabled by 8483 - writing to the respective MSRs. 8484 - 8485 - 8.12 KVM_CAP_HYPERV_VP_INDEX 8486 - ---------------------------- 8487 - 8488 - :Architectures: x86 8489 - 8490 - This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr. Its 8491 - value is used to denote the target vcpu for a SynIC interrupt. For 8492 - compatibility, KVM initializes this msr to KVM's internal vcpu index. 
When this 8493 - capability is absent, userspace can still query this msr's value. 8494 - 8495 - 8.13 KVM_CAP_S390_AIS_MIGRATION 8496 - ------------------------------- 8497 - 8498 - :Architectures: s390 8499 - :Parameters: none 8500 - 8501 - This capability indicates if the flic device will be able to get/set the 8502 - AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows 8503 - to discover this without having to create a flic device. 8504 - 8505 - 8.14 KVM_CAP_S390_PSW 8506 - --------------------- 8507 - 8508 - :Architectures: s390 8509 - 8510 - This capability indicates that the PSW is exposed via the kvm_run structure. 8511 - 8512 - 8.15 KVM_CAP_S390_GMAP 8513 - ---------------------- 8514 - 8515 - :Architectures: s390 8516 - 8517 - This capability indicates that the user space memory used as guest mapping can 8518 - be anywhere in the user memory address space, as long as the memory slots are 8519 - aligned and sized to a segment (1MB) boundary. 8520 - 8521 - 8.16 KVM_CAP_S390_COW 8522 - --------------------- 8523 - 8524 - :Architectures: s390 8525 - 8526 - This capability indicates that the user space memory used as guest mapping can 8527 - use copy-on-write semantics as well as dirty pages tracking via read-only page 8528 - tables. 8529 - 8530 - 8.17 KVM_CAP_S390_BPB 8531 - --------------------- 8532 - 8533 - :Architectures: s390 8534 - 8535 - This capability indicates that kvm will implement the interfaces to handle 8536 - reset, migration and nested KVM for branch prediction blocking. The stfle 8537 - facility 82 should not be provided to the guest without this capability. 8538 - 8539 - 8.18 KVM_CAP_HYPERV_TLBFLUSH 8540 - ---------------------------- 8541 - 8542 - :Architectures: x86 8543 - 8544 - This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush 8545 - hypercalls: 8546 - HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx, 8547 - HvFlushVirtualAddressList, HvFlushVirtualAddressListEx. 
8548 - 8549 - 8.19 KVM_CAP_ARM_INJECT_SERROR_ESR 8550 - ---------------------------------- 8551 - 8552 - :Architectures: arm64 8553 - 8554 - This capability indicates that userspace can specify (via the 8555 - KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it 8556 - takes a virtual SError interrupt exception. 8557 - If KVM advertises this capability, userspace can only specify the ISS field for 8558 - the ESR syndrome. Other parts of the ESR, such as the EC are generated by the 8559 - CPU when the exception is taken. If this virtual SError is taken to EL1 using 8560 - AArch64, this value will be reported in the ISS field of ESR_ELx. 8561 - 8562 - See KVM_CAP_VCPU_EVENTS for more details. 8563 - 8564 - 8.20 KVM_CAP_HYPERV_SEND_IPI 8565 - ---------------------------- 8566 - 8567 - :Architectures: x86 8568 - 8569 - This capability indicates that KVM supports paravirtualized Hyper-V IPI send 8570 - hypercalls: 8571 - HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx. 8572 - 8573 - 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH 8574 - ----------------------------------- 8575 - 8576 - :Architectures: x86 8577 - 8578 - This capability indicates that KVM running on top of Hyper-V hypervisor 8579 - enables Direct TLB flush for its guests meaning that TLB flush 8580 - hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. 8581 - Due to the different ABI for hypercall parameters between Hyper-V and 8582 - KVM, enabling this capability effectively disables all hypercall 8583 - handling by KVM (as some KVM hypercall may be mistakenly treated as TLB 8584 - flush hypercalls by Hyper-V) so userspace should disable KVM identification 8585 - in CPUID and only exposes Hyper-V identification. In this case, guest 8586 - thinks it's running on Hyper-V and only use Hyper-V hypercalls. 
8587 - 8588 - 8.22 KVM_CAP_S390_VCPU_RESETS 8589 - ----------------------------- 8590 - 8591 - :Architectures: s390 8592 - 8593 - This capability indicates that the KVM_S390_NORMAL_RESET and 8594 - KVM_S390_CLEAR_RESET ioctls are available. 8595 - 8596 - 8.23 KVM_CAP_S390_PROTECTED 8597 - --------------------------- 8598 - 8599 - :Architectures: s390 8600 - 8601 - This capability indicates that the Ultravisor has been initialized and 8602 - KVM can therefore start protected VMs. 8603 - This capability governs the KVM_S390_PV_COMMAND ioctl and the 8604 - KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected 8605 - guests when the state change is invalid. 8606 - 8607 - 8.24 KVM_CAP_STEAL_TIME 8608 - ----------------------- 8609 - 8610 - :Architectures: arm64, x86 8611 - 8612 - This capability indicates that KVM supports steal time accounting. 8613 - When steal time accounting is supported it may be enabled with 8614 - architecture-specific interfaces. This capability and the architecture- 8615 - specific interfaces must be consistent, i.e. if one says the feature 8616 - is supported, than the other should as well and vice versa. For arm64 8617 - see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL". 8618 - For x86 see Documentation/virt/kvm/x86/msr.rst "MSR_KVM_STEAL_TIME". 8619 - 8620 - 8.25 KVM_CAP_S390_DIAG318 8621 - ------------------------- 8622 - 8623 - :Architectures: s390 8624 - 8625 - This capability enables a guest to set information about its control program 8626 - (i.e. guest kernel type and version). The information is helpful during 8627 - system/firmware service events, providing additional data about the guest 8628 - environments running on the machine. 8629 - 8630 - The information is associated with the DIAGNOSE 0x318 instruction, which sets 8631 - an 8-byte value consisting of a one-byte Control Program Name Code (CPNC) and 8632 - a 7-byte Control Program Version Code (CPVC). 
The CPNC determines what 8633 - environment the control program is running in (e.g. Linux, z/VM...), and the 8634 - CPVC is used for information specific to OS (e.g. Linux version, Linux 8635 - distribution...) 8636 - 8637 - If this capability is available, then the CPNC and CPVC can be synchronized 8638 - between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318). 8639 - 8640 - 8.26 KVM_CAP_X86_USER_SPACE_MSR 8641 - ------------------------------- 8642 - 8643 - :Architectures: x86 8644 - 8645 - This capability indicates that KVM supports deflection of MSR reads and 8646 - writes to user space. It can be enabled on a VM level. If enabled, MSR 8647 - accesses that would usually trigger a #GP by KVM into the guest will 8648 - instead get bounced to user space through the KVM_EXIT_X86_RDMSR and 8649 - KVM_EXIT_X86_WRMSR exit notifications. 8650 - 8651 - 8.27 KVM_CAP_X86_MSR_FILTER 8652 - --------------------------- 8653 - 8654 - :Architectures: x86 8655 - 8656 - This capability indicates that KVM supports that accesses to user defined MSRs 8657 - may be rejected. With this capability exposed, KVM exports new VM ioctl 8658 - KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR 8659 - ranges that KVM should deny access to. 8660 - 8661 - In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to 8662 - trap and emulate MSRs that are outside of the scope of KVM as well as 8663 - limit the attack surface on KVM's MSR emulation code. 8664 - 8665 - 8.28 KVM_CAP_ENFORCE_PV_FEATURE_CPUID 8666 - ------------------------------------- 8667 - 8668 - Architectures: x86 8669 - 8670 - When enabled, KVM will disable paravirtual features provided to the 8671 - guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf 8672 - (0x40000001). Otherwise, a guest may use the paravirtual features 8673 - regardless of what has actually been exposed through the CPUID leaf. 8674 - 8675 - .. 
_KVM_CAP_DIRTY_LOG_RING: 8676 - 8677 - 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL 8251 + 7.36 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL 8678 8252 ---------------------------------------------------------- 8679 8253 8680 8254 :Architectures: x86, arm64 8255 + :Type: vm 8681 8256 :Parameters: args[0] - size of the dirty log ring 8682 8257 8683 8258 KVM is capable of tracking dirty memory using ring buffers that are ··· 8388 8783 vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES} 8389 8784 command on KVM device "kvm-arm-vgic-v3". 8390 8785 8786 + 7.37 KVM_CAP_PMU_CAPABILITY 8787 + --------------------------- 8788 + 8789 + :Architectures: x86 8790 + :Type: vm 8791 + :Parameters: arg[0] is bitmask of PMU virtualization capabilities. 8792 + :Returns: 0 on success, -EINVAL when arg[0] contains invalid bits 8793 + 8794 + This capability alters PMU virtualization in KVM. 8795 + 8796 + Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of 8797 + PMU virtualization capabilities that can be adjusted on a VM. 8798 + 8799 + The argument to KVM_ENABLE_CAP is also a bitmask and selects specific 8800 + PMU virtualization capabilities to be applied to the VM. This can 8801 + only be invoked on a VM prior to the creation of VCPUs. 8802 + 8803 + At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting 8804 + this capability will disable PMU virtualization for that VM. Usermode 8805 + should adjust CPUID leaf 0xA to reflect that the PMU is disabled. 8806 + 8807 + 7.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES 8808 + ------------------------------------- 8809 + 8810 + :Architectures: x86 8811 + :Type: vm 8812 + :Parameters: arg[0] must be 0. 8813 + :Returns: 0 on success, -EPERM if the userspace process does not 8814 + have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been 8815 + created. 8816 + 8817 + This capability disables the NX huge pages mitigation for iTLB MULTIHIT. 
8818 + 8819 + The capability has no effect if the nx_huge_pages module parameter is not set. 8820 + 8821 + This capability may only be set before any vCPUs are created. 8822 + 8823 + 7.39 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 8824 + --------------------------------------- 8825 + 8826 + :Architectures: arm64 8827 + :Type: vm 8828 + :Parameters: arg[0] is the new split chunk size. 8829 + :Returns: 0 on success, -EINVAL if any memslot was already created. 8830 + 8831 + This capability sets the chunk size used in Eager Page Splitting. 8832 + 8833 + Eager Page Splitting improves the performance of dirty-logging (used 8834 + in live migrations) when guest memory is backed by huge-pages. It 8835 + avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing 8836 + it eagerly when enabling dirty logging (with the 8837 + KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using 8838 + KVM_CLEAR_DIRTY_LOG. 8839 + 8840 + The chunk size specifies how many pages to break at a time, using a 8841 + single allocation for each chunk. Bigger the chunk size, more pages 8842 + need to be allocated ahead of time. 8843 + 8844 + The chunk size needs to be a valid block size. The list of acceptable 8845 + block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 8846 + 64-bit bitmap (each bit describing a block size). The default value is 8847 + 0, to disable the eager page splitting. 8848 + 8849 + 7.40 KVM_CAP_EXIT_HYPERCALL 8850 + --------------------------- 8851 + 8852 + :Architectures: x86 8853 + :Type: vm 8854 + 8855 + This capability, if enabled, will cause KVM to exit to userspace 8856 + with KVM_EXIT_HYPERCALL exit reason to process some hypercalls. 8857 + 8858 + Calling KVM_CHECK_EXTENSION for this capability will return a bitmask 8859 + of hypercalls that can be configured to exit to userspace. 8860 + Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE. 
+
+The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
+of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
+the hypercalls whose corresponding bit is in the argument, and return
+ENOSYS for the others.
+
+7.41 KVM_CAP_ARM_SYSTEM_SUSPEND
+-------------------------------
+
+:Architectures: arm64
+:Type: vm
+
+When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
+type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
+
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+-------------------------------------
+
+:Architectures: arm64
+:Type: vm
+:Parameters: None
+:Returns: 0 on success, -EINVAL if vCPUs have been created before enabling this
+          capability.
+
+This capability changes the behavior of the registers that identify a PE
+implementation of the Arm architecture: MIDR_EL1, REVIDR_EL1, and AIDR_EL1.
+By default, these registers are visible to userspace but treated as invariant.
+
+When this capability is enabled, KVM allows userspace to change the
+aforementioned registers before the first KVM_RUN.  These registers are VM
+scoped, meaning that the same set of values are presented on all vCPUs in a
+given VM.
+
+8. Other capabilities.
+======================
+
+This section lists capabilities that give information about other
+features of the KVM implementation.
+
+8.1 KVM_CAP_PPC_HWRNG
+---------------------
+
+:Architectures: ppc
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that the kernel has an implementation of the
+H_RANDOM hypercall backed by a hardware random-number generator.
+If present, the kernel H_RANDOM handler can be enabled for guest use
+with the KVM_CAP_PPC_ENABLE_HCALL capability.
+
+8.3 KVM_CAP_PPC_MMU_RADIX
+-------------------------
+
+:Architectures: ppc
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that the kernel can support guests using the
+radix MMU defined in Power ISA V3.00 (as implemented in the POWER9
+processor).
+
+8.4 KVM_CAP_PPC_MMU_HASH_V3
+---------------------------
+
+:Architectures: ppc
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that the kernel can support guests using the
+hashed page table MMU defined in Power ISA V3.00 (as implemented in
+the POWER9 processor), including in-memory segment tables.
+
+8.5 KVM_CAP_MIPS_VZ
+-------------------
+
+:Architectures: mips
+
+This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that
+it is available, means that full hardware assisted virtualization capabilities
+of the hardware are available for use through KVM.  An appropriate
+KVM_VM_MIPS_* type must be passed to KVM_CREATE_VM to create a VM which
+utilises it.
+
+If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
+available, it means that the VM is using full hardware assisted virtualization
+capabilities of the hardware.  This is useful to check after creating a VM with
+KVM_VM_MIPS_DEFAULT.
+
+The value returned by KVM_CHECK_EXTENSION should be compared against known
+values (see below).  All other values are reserved.  This is to allow for the
+possibility of other hardware assisted virtualization implementations which
+may be incompatible with the MIPS VZ ASE.
+
+== ==========================================================================
+0  The trap & emulate implementation is in use to run guest code in user
+   mode. Guest virtual memory segments are rearranged to fit the guest in the
+   user mode address space.
+
+1  The MIPS VZ ASE is in use, providing full hardware assisted
+   virtualization, including standard guest virtual memory segments.
+== ==========================================================================
+
+8.7 KVM_CAP_MIPS_64BIT
+----------------------
+
+:Architectures: mips
+
+This capability indicates the supported architecture type of the guest, i.e. the
+supported register and address width.
+
+The values returned when this capability is checked by KVM_CHECK_EXTENSION on a
+kvm VM handle correspond roughly to the CP0_Config.AT register field, and should
+be checked specifically against known values (see below).  All other values are
+reserved.
+
+== ========================================================================
+0  MIPS32 or microMIPS32.
+   Both registers and addresses are 32-bits wide.
+   It will only be possible to run 32-bit guest code.
+
+1  MIPS64 or microMIPS64 with access only to 32-bit compatibility segments.
+   Registers are 64-bits wide, but addresses are 32-bits wide.
+   64-bit guest code may run but cannot access MIPS64 memory segments.
+   It will also be possible to run 32-bit guest code.
+
+2  MIPS64 or microMIPS64 with access to all address segments.
+   Both registers and addresses are 64-bits wide.
+   It will be possible to run 64-bit or 32-bit guest code.
+== ========================================================================
+
+8.9 KVM_CAP_ARM_USER_IRQ
+------------------------
+
+:Architectures: arm64
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
+that if userspace creates a VM without an in-kernel interrupt controller, it
+will be notified of changes to the output level of in-kernel emulated devices,
+which can generate virtual interrupts, presented to the VM.
+For such VMs, on every return to userspace, the kernel
+updates the vcpu's run->s.regs.device_irq_level field to represent the actual
+output level of the device.
+
+Whenever kvm detects a change in the device output level, kvm guarantees at
+least one return to userspace before running the VM.  This exit could either
+be a KVM_EXIT_INTR or any other exit event, like KVM_EXIT_MMIO.  This way,
+userspace can always sample the device output level and re-compute the state of
+the userspace interrupt controller.  Userspace should always check the state
+of run->s.regs.device_irq_level on every kvm exit.
+The value in run->s.regs.device_irq_level can represent both level and edge
+triggered interrupt signals, depending on the device.  Edge triggered interrupt
+signals will exit to userspace with the bit in run->s.regs.device_irq_level
+set exactly once per edge signal.
+
+The field run->s.regs.device_irq_level is available independent of
+run->kvm_valid_regs or run->kvm_dirty_regs bits.
+
+If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a
+number larger than 0 indicating the version of this capability is implemented
+and thereby which bits in run->s.regs.device_irq_level can signal values.
+
+Currently the following bits are defined for the device_irq_level bitmap::
+
+  KVM_CAP_ARM_USER_IRQ >= 1:
+
+    KVM_ARM_DEV_EL1_VTIMER -  EL1 virtual timer
+    KVM_ARM_DEV_EL1_PTIMER -  EL1 physical timer
+    KVM_ARM_DEV_PMU        -  ARM PMU overflow interrupt signal
+
+Future versions of kvm may implement additional events.  These will get
+indicated by returning a higher number from KVM_CHECK_EXTENSION and will be
+listed above.
+
+8.10 KVM_CAP_PPC_SMT_POSSIBLE
+-----------------------------
+
+:Architectures: ppc
+
+Querying this capability returns a bitmap indicating the possible
+virtual SMT modes that can be set using KVM_CAP_PPC_SMT.  If bit N
+(counting from the right) is set, then a virtual SMT mode of 2^N is
+available.
+
+8.12 KVM_CAP_HYPERV_VP_INDEX
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr.  Its
+value is used to denote the target vcpu for a SynIC interrupt.  For
+compatibility, KVM initializes this msr to KVM's internal vcpu index.  When this
+capability is absent, userspace can still query this msr's value.
+
+8.13 KVM_CAP_S390_AIS_MIGRATION
+-------------------------------
+
+:Architectures: s390
+
+This capability indicates if the flic device will be able to get/set the
+AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
+to discover this without having to create a flic device.
+
+8.14 KVM_CAP_S390_PSW
+---------------------
+
+:Architectures: s390
+
+This capability indicates that the PSW is exposed via the kvm_run structure.
+
+8.15 KVM_CAP_S390_GMAP
+----------------------
+
+:Architectures: s390
+
+This capability indicates that the user space memory used as guest mapping can
+be anywhere in the user memory address space, as long as the memory slots are
+aligned and sized to a segment (1MB) boundary.
+
+8.16 KVM_CAP_S390_COW
+---------------------
+
+:Architectures: s390
+
+This capability indicates that the user space memory used as guest mapping can
+use copy-on-write semantics as well as dirty pages tracking via read-only page
+tables.
+
+8.17 KVM_CAP_S390_BPB
+---------------------
+
+:Architectures: s390
+
+This capability indicates that kvm will implement the interfaces to handle
+reset, migration and nested KVM for branch prediction blocking.  The stfle
+facility 82 should not be provided to the guest without this capability.
+
+8.18 KVM_CAP_HYPERV_TLBFLUSH
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush
+hypercalls:
+HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
+HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
+
+8.19 KVM_CAP_ARM_INJECT_SERROR_ESR
+----------------------------------
+
+:Architectures: arm64
+
+This capability indicates that userspace can specify (via the
+KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it
+takes a virtual SError interrupt exception.
+If KVM advertises this capability, userspace can only specify the ISS field for
+the ESR syndrome.  Other parts of the ESR, such as the EC, are generated by the
+CPU when the exception is taken.  If this virtual SError is taken to EL1 using
+AArch64, this value will be reported in the ISS field of ESR_ELx.
+
+See KVM_CAP_VCPU_EVENTS for more details.
+
+8.20 KVM_CAP_HYPERV_SEND_IPI
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports paravirtualized Hyper-V IPI send
+hypercalls:
+HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
+8.22 KVM_CAP_S390_VCPU_RESETS
+-----------------------------
+
+:Architectures: s390
+
+This capability indicates that the KVM_S390_NORMAL_RESET and
+KVM_S390_CLEAR_RESET ioctls are available.
+
+8.23 KVM_CAP_S390_PROTECTED
+---------------------------
+
+:Architectures: s390
+
+This capability indicates that the Ultravisor has been initialized and
+KVM can therefore start protected VMs.
+This capability governs the KVM_S390_PV_COMMAND ioctl and the
+KVM_MP_STATE_LOAD MP_STATE.  KVM_SET_MP_STATE can fail for protected
+guests when the state change is invalid.
+
+8.24 KVM_CAP_STEAL_TIME
+-----------------------
+
+:Architectures: arm64, x86
+
+This capability indicates that KVM supports steal time accounting.
+When steal time accounting is supported it may be enabled with
+architecture-specific interfaces.  This capability and the architecture-
+specific interfaces must be consistent, i.e. if one says the feature
+is supported, then the other should as well and vice versa.  For arm64
+see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL".
+For x86 see Documentation/virt/kvm/x86/msr.rst "MSR_KVM_STEAL_TIME".
+
+8.25 KVM_CAP_S390_DIAG318
+-------------------------
+
+:Architectures: s390
+
+This capability enables a guest to set information about its control program
+(i.e. guest kernel type and version).  The information is helpful during
+system/firmware service events, providing additional data about the guest
+environments running on the machine.
+
+The information is associated with the DIAGNOSE 0x318 instruction, which sets
+an 8-byte value consisting of a one-byte Control Program Name Code (CPNC) and
+a 7-byte Control Program Version Code (CPVC).  The CPNC determines what
+environment the control program is running in (e.g. Linux, z/VM...), and the
+CPVC is used for information specific to the OS (e.g. Linux version, Linux
+distribution...)
+
+If this capability is available, then the CPNC and CPVC can be synchronized
+between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318).
+
+8.26 KVM_CAP_X86_USER_SPACE_MSR
+-------------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports deflection of MSR reads and
+writes to user space.  It can be enabled on a VM level.  If enabled, MSR
+accesses that would usually trigger a #GP by KVM into the guest will
+instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
+KVM_EXIT_X86_WRMSR exit notifications.
+
+8.27 KVM_CAP_X86_MSR_FILTER
+---------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports that accesses to user defined MSRs
+may be rejected.  With this capability exposed, KVM exports new VM ioctl
+KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
+ranges that KVM should deny access to.
+
+In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
+trap and emulate MSRs that are outside of the scope of KVM as well as
+limit the attack surface on KVM's MSR emulation code.
+
 8.30 KVM_CAP_XEN_HVM
 --------------------
···
 done when the KVM_CAP_XEN_HVM ioctl sets the
 KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag.
 
-8.31 KVM_CAP_PPC_MULTITCE
--------------------------
+8.31 KVM_CAP_SPAPR_MULTITCE
+---------------------------
 
-:Capability: KVM_CAP_PPC_MULTITCE
 :Architectures: ppc
 :Type: vm
···
 supported in the host. A VMM can check whether the service is
 available to the guest on migration.
 
-8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
----------------------------------
-
-Architectures: x86
-
-When enabled, KVM will disable emulated Hyper-V features provided to the
-guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all
-currently implemented Hyper-V features are provided unconditionally when
-Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
-leaf.
-
-8.34 KVM_CAP_EXIT_HYPERCALL
----------------------------
-
-:Capability: KVM_CAP_EXIT_HYPERCALL
-:Architectures: x86
-:Type: vm
-
-This capability, if enabled, will cause KVM to exit to userspace
-with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
-
-Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
-of hypercalls that can be configured to exit to userspace.
-Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.
-
-The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
-of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
-the hypercalls whose corresponding bit is in the argument, and return
-ENOSYS for the others.
-
-8.35 KVM_CAP_PMU_CAPABILITY
----------------------------
-
-:Capability: KVM_CAP_PMU_CAPABILITY
-:Architectures: x86
-:Type: vm
-:Parameters: arg[0] is bitmask of PMU virtualization capabilities.
-:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits
-
-This capability alters PMU virtualization in KVM.
-
-Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of
-PMU virtualization capabilities that can be adjusted on a VM.
-
-The argument to KVM_ENABLE_CAP is also a bitmask and selects specific
-PMU virtualization capabilities to be applied to the VM.  This can
-only be invoked on a VM prior to the creation of VCPUs.
-
-At this time, KVM_PMU_CAP_DISABLE is the only capability.  Setting
-this capability will disable PMU virtualization for that VM.  Usermode
-should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
-
-8.36 KVM_CAP_ARM_SYSTEM_SUSPEND
--------------------------------
-
-:Capability: KVM_CAP_ARM_SYSTEM_SUSPEND
-:Architectures: arm64
-:Type: vm
-
-When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
-type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
-
 8.37 KVM_CAP_S390_PROTECTED_DUMP
 --------------------------------
 
-:Capability: KVM_CAP_S390_PROTECTED_DUMP
 :Architectures: s390
 :Type: vm
···
 dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
 available and supports the `KVM_PV_DUMP_CPU` subcommand.
 
-8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
--------------------------------------
-
-:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
-:Architectures: x86
-:Type: vm
-:Parameters: arg[0] must be 0.
-:Returns: 0 on success, -EPERM if the userspace process does not
-          have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
-          created.
-
-This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
-
-The capability has no effect if the nx_huge_pages module parameter is not set.
-
-This capability may only be set before any vCPUs are created.
-
 8.39 KVM_CAP_S390_CPU_TOPOLOGY
 ------------------------------
 
-:Capability: KVM_CAP_S390_CPU_TOPOLOGY
 :Architectures: s390
 :Type: vm
···
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
-8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
----------------------------------------
-
-:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
-:Architectures: arm64
-:Type: vm
-:Parameters: arg[0] is the new split chunk size.
-:Returns: 0 on success, -EINVAL if any memslot was already created.
-
-This capability sets the chunk size used in Eager Page Splitting.
-
-Eager Page Splitting improves the performance of dirty-logging (used
-in live migrations) when guest memory is backed by huge-pages.  It
-avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
-it eagerly when enabling dirty logging (with the
-KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
-KVM_CLEAR_DIRTY_LOG.
-
-The chunk size specifies how many pages to break at a time, using a
-single allocation for each chunk.  Bigger the chunk size, more pages
-need to be allocated ahead of time.
-
-The chunk size needs to be a valid block size.  The list of acceptable
-block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
-64-bit bitmap (each bit describing a block size).  The default value is
-0, to disable the eager page splitting.
-
 8.41 KVM_CAP_VM_TYPES
 ---------------------
 
-:Capability: KVM_CAP_MEMORY_ATTRIBUTES
 :Architectures: x86
 :Type: system ioctl
···
 Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
 production. The behavior and effective ABI for software-protected VMs is
 unstable.
+
+8.42 KVM_CAP_PPC_RPT_INVALIDATE
+-------------------------------
+
+:Architectures: ppc
+
+This capability indicates that the kernel is capable of handling the
+H_RPT_INVALIDATE hcall.
+
+In order to enable the use of H_RPT_INVALIDATE in the guest,
+user space might have to advertise it for the guest.  For example,
+an IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+This capability is enabled for hypervisors on platforms like POWER9
+that support radix MMU.
+
+8.43 KVM_CAP_PPC_AIL_MODE_3
+---------------------------
+
+:Architectures: ppc
+
+This capability indicates that the kernel supports the mode 3 setting for the
+"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location"
+resource that is controlled with the H_SET_MODE hypercall.
+
+This capability allows a guest kernel to use a better-performance mode for
+handling interrupts and system calls.
+
+8.44 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
+8.45 KVM_CAP_X86_GUEST_MODE
+---------------------------
+
+:Architectures: x86
+
+The presence of this capability indicates that KVM_RUN will update the
+KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the
+vCPU was executing nested guest code when it exited.
+
+KVM exits with the register state of either the L1 or L2 guest
+depending on which executed at the time of an exit.  Userspace must
+take care to differentiate between these cases.
 
 9. Known KVM API problems
 =========================
···
 
 The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature.
 
-CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``.
-It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
-has enabled in-kernel emulation of the local APIC.
+On older versions of Linux, CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by
+``KVM_GET_SUPPORTED_CPUID``, but it can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER``
+is present and the kernel has enabled in-kernel emulation of the local APIC.
+On newer versions, ``KVM_GET_SUPPORTED_CPUID`` does report the bit as available.
 
 CPU topology
 ~~~~~~~~~~~~
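The subset rule the KVM_CAP_EXIT_HYPERCALL text above describes (the KVM_ENABLE_CAP argument must be a subset of the KVM_CHECK_EXTENSION result) can be sketched as a standalone check; the function name here is illustrative, not a kernel or KVM API symbol.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative helper (not a kernel symbol): the bitmask passed to
 * KVM_ENABLE_CAP for KVM_CAP_EXIT_HYPERCALL must be a subset of the
 * bitmask returned by KVM_CHECK_EXTENSION, otherwise KVM rejects it.
 */
bool exit_hypercall_arg_valid(uint64_t supported, uint64_t requested)
{
	/* Every requested bit must also be set in the supported mask. */
	return (requested & ~supported) == 0;
}
```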
arch/arm64/include/asm/esr.h (+42 -2)
···
 #define ESR_ELx_FSC_SEA_TTW(n)	(0x14 + (n))
 #define ESR_ELx_FSC_SECC	(0x18)
 #define ESR_ELx_FSC_SECC_TTW(n)	(0x1c + (n))
+#define ESR_ELx_FSC_ADDRSZ	(0x00)
+
+/*
+ * Annoyingly, the negative levels for Address size faults aren't laid out
+ * contiguously (or in the desired order)
+ */
+#define ESR_ELx_FSC_ADDRSZ_nL(n)	((n) == -1 ? 0x25 : 0x2C)
+#define ESR_ELx_FSC_ADDRSZ_L(n)	((n) < 0 ? ESR_ELx_FSC_ADDRSZ_nL(n) : \
+					   (ESR_ELx_FSC_ADDRSZ + (n)))
 
 /* Status codes for individual page table levels */
 #define ESR_ELx_FSC_ACCESS_L(n)	(ESR_ELx_FSC_ACCESS + (n))
···
 #define ESR_ELx_Xs_MASK		(GENMASK_ULL(4, 0))
 
 /* ISS field definitions for exceptions taken in to Hyp */
-#define ESR_ELx_FSC_ADDRSZ	(0x00)
-#define ESR_ELx_FSC_ADDRSZ_L(n)	(ESR_ELx_FSC_ADDRSZ + (n))
 #define ESR_ELx_CV		(UL(1) << 24)
 #define ESR_ELx_COND_SHIFT	(20)
 #define ESR_ELx_COND_MASK	(UL(0xF) << ESR_ELx_COND_SHIFT)
···
 	       (esr == ESR_ELx_FSC_ACCESS_L(2)) ||
 	       (esr == ESR_ELx_FSC_ACCESS_L(1)) ||
 	       (esr == ESR_ELx_FSC_ACCESS_L(0));
 }
+
+static inline bool esr_fsc_is_addr_sz_fault(unsigned long esr)
+{
+	esr &= ESR_ELx_FSC;
+
+	return (esr == ESR_ELx_FSC_ADDRSZ_L(3)) ||
+	       (esr == ESR_ELx_FSC_ADDRSZ_L(2)) ||
+	       (esr == ESR_ELx_FSC_ADDRSZ_L(1)) ||
+	       (esr == ESR_ELx_FSC_ADDRSZ_L(0)) ||
+	       (esr == ESR_ELx_FSC_ADDRSZ_L(-1));
+}
+
+static inline bool esr_fsc_is_sea_ttw(unsigned long esr)
+{
+	esr = esr & ESR_ELx_FSC;
+
+	return (esr == ESR_ELx_FSC_SEA_TTW(3)) ||
+	       (esr == ESR_ELx_FSC_SEA_TTW(2)) ||
+	       (esr == ESR_ELx_FSC_SEA_TTW(1)) ||
+	       (esr == ESR_ELx_FSC_SEA_TTW(0)) ||
+	       (esr == ESR_ELx_FSC_SEA_TTW(-1));
+}
+
+static inline bool esr_fsc_is_secc_ttw(unsigned long esr)
+{
+	esr = esr & ESR_ELx_FSC;
+
+	return (esr == ESR_ELx_FSC_SECC_TTW(3)) ||
+	       (esr == ESR_ELx_FSC_SECC_TTW(2)) ||
+	       (esr == ESR_ELx_FSC_SECC_TTW(1)) ||
+	       (esr == ESR_ELx_FSC_SECC_TTW(0)) ||
+	       (esr == ESR_ELx_FSC_SECC_TTW(-1));
 }
 
 /* Indicate whether ESR.EC==0x1A is for an ERETAx instruction */
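The non-contiguous address-size-fault encodings that ESR_ELx_FSC_ADDRSZ_nL() handles can be checked in isolation; this standalone sketch mirrors the macros above (names shortened, nothing here is a kernel symbol).

```c
/* Standalone mirror of the macros added above: address size fault status
 * codes for non-negative levels are contiguous from 0x00, while the
 * negative levels -1 and -2 use the discontiguous encodings 0x25 and 0x2C. */
#define FSC_ADDRSZ		(0x00)
#define FSC_ADDRSZ_nL(n)	((n) == -1 ? 0x25 : 0x2C)
#define FSC_ADDRSZ_L(n)		((n) < 0 ? FSC_ADDRSZ_nL(n) : (FSC_ADDRSZ + (n)))

int fsc_addrsz_level(int n)
{
	return FSC_ADDRSZ_L(n);
}
```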
arch/arm64/include/asm/kvm_emulate.h (+6 -1)
···
 
 static __always_inline phys_addr_t kvm_vcpu_get_fault_ipa(const struct kvm_vcpu *vcpu)
 {
-	return ((phys_addr_t)vcpu->arch.fault.hpfar_el2 & HPFAR_MASK) << 8;
+	u64 hpfar = vcpu->arch.fault.hpfar_el2;
+
+	if (unlikely(!(hpfar & HPFAR_EL2_NS)))
+		return INVALID_GPA;
+
+	return FIELD_GET(HPFAR_EL2_FIPA, hpfar) << 12;
 }
 
 static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu *vcpu)
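The new kvm_vcpu_get_fault_ipa() arithmetic can be sketched outside the kernel: HPFAR_EL2.FIPA (bits 47:4) holds the fault IPA in units of 4KiB pages, so extracting the field and shifting left by 12 yields the byte address, and the hijacked NS bit acts as a validity marker. The constants below are hand-expanded for the sketch, not the generated kernel definitions.

```c
#include <stdint.h>

#define HPFAR_NS	(1ULL << 63)		/* "valid" marker hijacked by the patch */
#define HPFAR_FIPA	0x0000FFFFFFFFFFF0ULL	/* bits 47:4 */
#define INVALID_GPA	(~(uint64_t)0)

/* Standalone sketch of the new kvm_vcpu_get_fault_ipa() logic: bail out
 * with INVALID_GPA unless the NS bit marks the HPFAR value as valid, then
 * extract the FIPA field and scale it up to a byte address. */
uint64_t fault_ipa_from_hpfar(uint64_t hpfar)
{
	if (!(hpfar & HPFAR_NS))
		return INVALID_GPA;

	return ((hpfar & HPFAR_FIPA) >> 4) << 12;
}
```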
arch/arm64/include/asm/kvm_ras.h (+1 -1)
···
 /*
  * Was this synchronous external abort a RAS notification?
  * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
  */
-static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
+static inline int kvm_handle_guest_sea(void)
 {
 	/* apei_claim_sea(NULL) expects to mask interrupts itself */
 	lockdep_assert_irqs_enabled();
arch/arm64/kvm/hyp/include/hyp/fault.h (+48 -22)
···
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 
+static inline bool __fault_safe_to_translate(u64 esr)
+{
+	u64 fsc = esr & ESR_ELx_FSC;
+
+	if (esr_fsc_is_sea_ttw(esr) || esr_fsc_is_secc_ttw(esr))
+		return false;
+
+	return !(fsc == ESR_ELx_FSC_EXTABT && (esr & ESR_ELx_FnV));
+}
+
 static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
 {
 	int ret;
···
 	return true;
 }
 
+/*
+ * Checks for the conditions when HPFAR_EL2 is written, per ARM ARM R_FKLWR.
+ */
+static inline bool __hpfar_valid(u64 esr)
+{
+	/*
+	 * CPUs affected by ARM erratum #834220 may incorrectly report a
+	 * stage-2 translation fault when a stage-1 permission fault occurs.
+	 *
+	 * Re-walk the page tables to determine if a stage-1 fault actually
+	 * occurred.
+	 */
+	if (cpus_have_final_cap(ARM64_WORKAROUND_834220) &&
+	    esr_fsc_is_translation_fault(esr))
+		return false;
+
+	if (esr_fsc_is_translation_fault(esr) || esr_fsc_is_access_flag_fault(esr))
+		return true;
+
+	if ((esr & ESR_ELx_S1PTW) && esr_fsc_is_permission_fault(esr))
+		return true;
+
+	return esr_fsc_is_addr_sz_fault(esr);
+}
+
 static inline bool __get_fault_info(u64 esr, struct kvm_vcpu_fault_info *fault)
 {
-	u64 hpfar, far;
+	u64 hpfar;
 
-	far = read_sysreg_el2(SYS_FAR);
+	fault->far_el2 = read_sysreg_el2(SYS_FAR);
+	fault->hpfar_el2 = 0;
+
+	if (__hpfar_valid(esr))
+		hpfar = read_sysreg(hpfar_el2);
+	else if (unlikely(!__fault_safe_to_translate(esr)))
+		return true;
+	else if (!__translate_far_to_hpfar(fault->far_el2, &hpfar))
+		return false;
 
 	/*
-	 * The HPFAR can be invalid if the stage 2 fault did not
-	 * happen during a stage 1 page table walk (the ESR_EL2.S1PTW
-	 * bit is clear) and one of the two following cases are true:
-	 * 1. The fault was due to a permission fault
-	 * 2. The processor carries errata 834220
-	 *
-	 * Therefore, for all non S1PTW faults where we either have a
-	 * permission fault or the errata workaround is enabled, we
-	 * resolve the IPA using the AT instruction.
+	 * Hijack HPFAR_EL2.NS (RES0 in Non-secure) to indicate a valid
+	 * HPFAR value.
 	 */
-	if (!(esr & ESR_ELx_S1PTW) &&
-	    (cpus_have_final_cap(ARM64_WORKAROUND_834220) ||
-	     esr_fsc_is_permission_fault(esr))) {
-		if (!__translate_far_to_hpfar(far, &hpfar))
-			return false;
-	} else {
-		hpfar = read_sysreg(hpfar_el2);
-	}
-
-	fault->far_el2 = far;
-	fault->hpfar_el2 = hpfar;
+	fault->hpfar_el2 = hpfar | HPFAR_EL2_NS;
 	return true;
 }
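The decision __hpfar_valid() makes can be reduced to a small truth table; this model uses plain booleans in place of the esr_fsc_is_*() helpers and the ESR_ELx_S1PTW bit, and omits the erratum #834220 special case handled above.

```c
#include <stdbool.h>

/* Simplified model of the __hpfar_valid() rules: the architecture
 * guarantees HPFAR_EL2 is written for translation and access-flag
 * faults, for permission faults taken on a stage-1 walk (S1PTW), and
 * for address size faults. */
bool hpfar_written(bool translation, bool access_flag,
		   bool s1ptw, bool permission, bool addr_sz)
{
	if (translation || access_flag)
		return true;

	/* Permission faults only report the IPA when on a stage-1 walk. */
	if (s1ptw && permission)
		return true;

	return addr_sz;
}
```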
arch/arm64/kvm/hyp/nvhe/ffa.c (+5 -4)
···
 		hyp_ffa_version = ffa_req_version;
 	}
 
-	if (hyp_ffa_post_init())
+	if (hyp_ffa_post_init()) {
 		res->a0 = FFA_RET_NOT_SUPPORTED;
-	else {
-		has_version_negotiated = true;
+	} else {
+		smp_store_release(&has_version_negotiated, true);
 		res->a0 = hyp_ffa_version;
 	}
 unlock:
···
 	if (!is_ffa_call(func_id))
 		return false;
 
-	if (!has_version_negotiated && func_id != FFA_VERSION) {
+	if (func_id != FFA_VERSION &&
+	    !smp_load_acquire(&has_version_negotiated)) {
 		ffa_to_smccc_error(&res, FFA_RET_INVALID_PARAMETERS);
 		goto out_handled;
 	}
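The publication pattern this fix adopts (write the version, then set the flag with release semantics; readers check the flag with acquire semantics) can be sketched with C11 atomics; this is a userspace illustration of the ordering guarantee, not the hypervisor code itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool has_version_negotiated;
static unsigned int hyp_ffa_version;

/* Writer: publish the version *before* the release store of the flag. */
void publish_version(unsigned int ver)
{
	hyp_ffa_version = ver;	/* plain store, ordered by the release below */
	atomic_store_explicit(&has_version_negotiated, true,
			      memory_order_release);
}

/* Reader: an acquire load of the flag guarantees the version store is
 * visible, so a stale version can never be observed once the flag is set. */
int read_version(void)
{
	if (!atomic_load_explicit(&has_version_negotiated,
				  memory_order_acquire))
		return -1;	/* not negotiated yet */
	return (int)hyp_ffa_version;
}
```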
arch/arm64/kvm/hyp/nvhe/mem_protect.c (+8 -1)
···
 		return;
 	}
 
-	addr = (fault.hpfar_el2 & HPFAR_MASK) << 8;
+
+	/*
+	 * Yikes, we couldn't resolve the fault IPA. This should reinject an
+	 * abort into the host when we figure out how to do that.
+	 */
+	BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS));
+	addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12;
+
 	ret = host_stage2_idmap(addr);
 	BUG_ON(ret && ret != -EAGAIN);
 }
arch/arm64/kvm/mmu.c (+19 -12)
···
 	gfn_t gfn;
 	int ret, idx;
 
+	/* Synchronous External Abort? */
+	if (kvm_vcpu_abt_issea(vcpu)) {
+		/*
+		 * For RAS the host kernel may handle this abort.
+		 * There is no need to pass the error into the guest.
+		 */
+		if (kvm_handle_guest_sea())
+			kvm_inject_vabt(vcpu);
+
+		return 1;
+	}
+
 	esr = kvm_vcpu_get_esr(vcpu);
 
+	/*
+	 * The fault IPA should be reliable at this point as we're not dealing
+	 * with an SEA.
+	 */
 	ipa = fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
+	if (KVM_BUG_ON(ipa == INVALID_GPA, vcpu->kvm))
+		return -EFAULT;
+
 	is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
 
 	if (esr_fsc_is_translation_fault(esr)) {
···
 			kvm_inject_dabt(vcpu, fault_ipa);
 			return 1;
 		}
-	}
-
-	/* Synchronous External Abort? */
-	if (kvm_vcpu_abt_issea(vcpu)) {
-		/*
-		 * For RAS the host kernel may handle this abort.
-		 * There is no need to pass the error into the guest.
-		 */
-		if (kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))
-			kvm_inject_vabt(vcpu);
-
-		return 1;
 	}
 
 	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_esr(vcpu),
+7
arch/arm64/tools/sysreg
··· 3536 3536 Field 4 P 3537 3537 Field 3:0 Align 3538 3538 EndSysreg 3539 + 3540 + Sysreg HPFAR_EL2 3 4 6 0 4 3541 + Field 63 NS 3542 + Res0 62:48 3543 + Field 47:4 FIPA 3544 + Res0 3:0 3545 + EndSysreg
+1 -1
arch/s390/kvm/intercept.c
··· 95 95 96 96 vcpu->stat.exit_validity++; 97 97 trace_kvm_s390_intercept_validity(vcpu, viwhy); 98 - KVM_EVENT(3, "validity intercept 0x%x for pid %u (kvm 0x%pK)", viwhy, 98 + KVM_EVENT(3, "validity intercept 0x%x for pid %u (kvm 0x%p)", viwhy, 99 99 current->pid, vcpu->kvm); 100 100 101 101 /* do not warn on invalid runtime instrumentation mode */
+4 -4
arch/s390/kvm/interrupt.c
··· 3161 3161 if (!gi->origin) 3162 3162 return; 3163 3163 gisa_clear_ipm(gi->origin); 3164 - VM_EVENT(kvm, 3, "gisa 0x%pK cleared", gi->origin); 3164 + VM_EVENT(kvm, 3, "gisa 0x%p cleared", gi->origin); 3165 3165 } 3166 3166 3167 3167 void kvm_s390_gisa_init(struct kvm *kvm) ··· 3177 3177 hrtimer_setup(&gi->timer, gisa_vcpu_kicker, CLOCK_MONOTONIC, HRTIMER_MODE_REL); 3178 3178 memset(gi->origin, 0, sizeof(struct kvm_s390_gisa)); 3179 3179 gi->origin->next_alert = (u32)virt_to_phys(gi->origin); 3180 - VM_EVENT(kvm, 3, "gisa 0x%pK initialized", gi->origin); 3180 + VM_EVENT(kvm, 3, "gisa 0x%p initialized", gi->origin); 3181 3181 } 3182 3182 3183 3183 void kvm_s390_gisa_enable(struct kvm *kvm) ··· 3218 3218 process_gib_alert_list(); 3219 3219 hrtimer_cancel(&gi->timer); 3220 3220 gi->origin = NULL; 3221 - VM_EVENT(kvm, 3, "gisa 0x%pK destroyed", gisa); 3221 + VM_EVENT(kvm, 3, "gisa 0x%p destroyed", gisa); 3222 3222 } 3223 3223 3224 3224 void kvm_s390_gisa_disable(struct kvm *kvm) ··· 3467 3467 } 3468 3468 } 3469 3469 3470 - KVM_EVENT(3, "gib 0x%pK (nisc=%d) initialized", gib, gib->nisc); 3470 + KVM_EVENT(3, "gib 0x%p (nisc=%d) initialized", gib, gib->nisc); 3471 3471 goto out; 3472 3472 3473 3473 out_unreg_gal:
+5 -5
arch/s390/kvm/kvm-s390.c
··· 1022 1022 } 1023 1023 mutex_unlock(&kvm->lock); 1024 1024 VM_EVENT(kvm, 3, "SET: max guest address: %lu", new_limit); 1025 - VM_EVENT(kvm, 3, "New guest asce: 0x%pK", 1025 + VM_EVENT(kvm, 3, "New guest asce: 0x%p", 1026 1026 (void *) kvm->arch.gmap->asce); 1027 1027 break; 1028 1028 } ··· 3466 3466 kvm_s390_gisa_init(kvm); 3467 3467 INIT_LIST_HEAD(&kvm->arch.pv.need_cleanup); 3468 3468 kvm->arch.pv.set_aside = NULL; 3469 - KVM_EVENT(3, "vm 0x%pK created by pid %u", kvm, current->pid); 3469 + KVM_EVENT(3, "vm 0x%p created by pid %u", kvm, current->pid); 3470 3470 3471 3471 return 0; 3472 3472 out_err: ··· 3529 3529 kvm_s390_destroy_adapters(kvm); 3530 3530 kvm_s390_clear_float_irqs(kvm); 3531 3531 kvm_s390_vsie_destroy(kvm); 3532 - KVM_EVENT(3, "vm 0x%pK destroyed", kvm); 3532 + KVM_EVENT(3, "vm 0x%p destroyed", kvm); 3533 3533 } 3534 3534 3535 3535 /* Section: vcpu related */ ··· 3650 3650 3651 3651 free_page((unsigned long)old_sca); 3652 3652 3653 - VM_EVENT(kvm, 2, "Switched to ESCA (0x%pK -> 0x%pK)", 3653 + VM_EVENT(kvm, 2, "Switched to ESCA (0x%p -> 0x%p)", 3654 3654 old_sca, kvm->arch.sca); 3655 3655 return 0; 3656 3656 } ··· 4027 4027 goto out_free_sie_block; 4028 4028 } 4029 4029 4030 - VM_EVENT(vcpu->kvm, 3, "create cpu %d at 0x%pK, sie block at 0x%pK", 4030 + VM_EVENT(vcpu->kvm, 3, "create cpu %d at 0x%p, sie block at 0x%p", 4031 4031 vcpu->vcpu_id, vcpu, vcpu->arch.sie_block); 4032 4032 trace_kvm_s390_create_vcpu(vcpu->vcpu_id, vcpu, vcpu->arch.sie_block); 4033 4033
+2 -2
arch/s390/kvm/trace-s390.h
··· 56 56 __entry->sie_block = sie_block; 57 57 ), 58 58 59 - TP_printk("create cpu %d at 0x%pK, sie block at 0x%pK", 59 + TP_printk("create cpu %d at 0x%p, sie block at 0x%p", 60 60 __entry->id, __entry->vcpu, __entry->sie_block) 61 61 ); 62 62 ··· 255 255 __entry->kvm = kvm; 256 256 ), 257 257 258 - TP_printk("enabling channel I/O support (kvm @ %pK)\n", 258 + TP_printk("enabling channel I/O support (kvm @ %p)\n", 259 259 __entry->kvm) 260 260 ); 261 261
+6 -1
arch/x86/include/asm/kvm_host.h
··· 1472 1472 struct once nx_once; 1473 1473 1474 1474 #ifdef CONFIG_X86_64 1475 - /* The number of TDP MMU pages across all roots. */ 1475 + #ifdef CONFIG_KVM_PROVE_MMU 1476 + /* 1477 + * The number of TDP MMU pages across all roots. Used only to sanity 1478 + * check that KVM isn't leaking TDP MMU pages. 1479 + */ 1476 1480 atomic64_t tdp_mmu_pages; 1481 + #endif 1477 1482 1478 1483 /* 1479 1484 * List of struct kvm_mmu_pages being used as roots.
+3 -5
arch/x86/kvm/cpuid.c
··· 1427 1427 } 1428 1428 break; 1429 1429 case 0xa: { /* Architectural Performance Monitoring */ 1430 - union cpuid10_eax eax; 1431 - union cpuid10_edx edx; 1430 + union cpuid10_eax eax = { }; 1431 + union cpuid10_edx edx = { }; 1432 1432 1433 1433 if (!enable_pmu || !static_cpu_has(X86_FEATURE_ARCH_PERFMON)) { 1434 1434 entry->eax = entry->ebx = entry->ecx = entry->edx = 0; ··· 1444 1444 1445 1445 if (kvm_pmu_cap.version) 1446 1446 edx.split.anythread_deprecated = 1; 1447 - edx.split.reserved1 = 0; 1448 - edx.split.reserved2 = 0; 1449 1447 1450 1448 entry->eax = eax.full; 1451 1449 entry->ebx = kvm_pmu_cap.events_mask; ··· 1761 1763 break; 1762 1764 /* AMD Extended Performance Monitoring and Debug */ 1763 1765 case 0x80000022: { 1764 - union cpuid_0x80000022_ebx ebx; 1766 + union cpuid_0x80000022_ebx ebx = { }; 1765 1767 1766 1768 entry->ecx = entry->edx = 0; 1767 1769 if (!enable_pmu || !kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2)) {
+7 -1
arch/x86/kvm/mmu/tdp_mmu.c
··· 40 40 kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS); 41 41 kvm_tdp_mmu_zap_invalidated_roots(kvm, false); 42 42 43 - WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages)); 43 + #ifdef CONFIG_KVM_PROVE_MMU 44 + KVM_MMU_WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages)); 45 + #endif 44 46 WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots)); 45 47 46 48 /* ··· 327 325 static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) 328 326 { 329 327 kvm_account_pgtable_pages((void *)sp->spt, +1); 328 + #ifdef CONFIG_KVM_PROVE_MMU 330 329 atomic64_inc(&kvm->arch.tdp_mmu_pages); 330 + #endif 331 331 } 332 332 333 333 static void tdp_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) 334 334 { 335 335 kvm_account_pgtable_pages((void *)sp->spt, -1); 336 + #ifdef CONFIG_KVM_PROVE_MMU 336 337 atomic64_dec(&kvm->arch.tdp_mmu_pages); 338 + #endif 337 339 } 338 340 339 341 /**
+30 -7
arch/x86/kvm/vmx/posted_intr.c
··· 31 31 */ 32 32 static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock); 33 33 34 + #define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING 35 + 34 36 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu) 35 37 { 36 38 return &(to_vmx(vcpu)->pi_desc); ··· 91 89 * current pCPU if the task was migrated. 92 90 */ 93 91 if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) { 94 - raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); 92 + raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu); 93 + 94 + /* 95 + * In addition to taking the wakeup lock for the regular/IRQ 96 + * context, tell lockdep it is being taken for the "sched out" 97 + * context as well. vCPU loads happens in task context, and 98 + * this is taking the lock of the *previous* CPU, i.e. can race 99 + * with both the scheduler and the wakeup handler. 100 + */ 101 + raw_spin_lock(spinlock); 102 + spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_); 95 103 list_del(&vmx->pi_wakeup_list); 96 - raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); 104 + spin_release(&spinlock->dep_map, _RET_IP_); 105 + raw_spin_unlock(spinlock); 97 106 } 98 107 99 108 dest = cpu_physical_id(cpu); ··· 161 148 struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 162 149 struct vcpu_vmx *vmx = to_vmx(vcpu); 163 150 struct pi_desc old, new; 164 - unsigned long flags; 165 151 166 - local_irq_save(flags); 152 + lockdep_assert_irqs_disabled(); 167 153 168 - raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); 154 + /* 155 + * Acquire the wakeup lock using the "sched out" context to workaround 156 + * a lockdep false positive. When this is called, schedule() holds 157 + * various per-CPU scheduler locks. When the wakeup handler runs, it 158 + * holds this CPU's wakeup lock while calling try_to_wake_up(), which 159 + * can eventually take the aforementioned scheduler locks, which causes 160 + * lockdep to assume there is deadlock. 
161 + * 162 + * Deadlock can't actually occur because IRQs are disabled for the 163 + * entirety of the sched_out critical section, i.e. the wakeup handler 164 + * can't run while the scheduler locks are held. 165 + */ 166 + raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 167 + PI_LOCK_SCHED_OUT); 169 168 list_add_tail(&vmx->pi_wakeup_list, 170 169 &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu)); 171 170 raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); ··· 201 176 */ 202 177 if (pi_test_on(&new)) 203 178 __apic_send_IPI_self(POSTED_INTR_WAKEUP_VECTOR); 204 - 205 - local_irq_restore(flags); 206 179 } 207 180 208 181 static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
+4
arch/x86/kvm/x86.c
··· 11786 11786 if (kvm_mpx_supported()) 11787 11787 kvm_load_guest_fpu(vcpu); 11788 11788 11789 + kvm_vcpu_srcu_read_lock(vcpu); 11790 + 11789 11791 r = kvm_apic_accept_events(vcpu); 11790 11792 if (r < 0) 11791 11793 goto out; ··· 11801 11799 mp_state->mp_state = vcpu->arch.mp_state; 11802 11800 11803 11801 out: 11802 + kvm_vcpu_srcu_read_unlock(vcpu); 11803 + 11804 11804 if (kvm_mpx_supported()) 11805 11805 kvm_put_guest_fpu(vcpu); 11806 11806 vcpu_put(vcpu);
+2 -2
drivers/firmware/smccc/kvm_guest.c
··· 95 95 96 96 for (i = 0; i < max_cpus; i++) { 97 97 arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_CPUS_FUNC_ID, 98 - i, &res); 98 + i, 0, 0, &res); 99 99 if (res.a0 != SMCCC_RET_SUCCESS) { 100 100 pr_warn("Discovering target implementation CPUs failed\n"); 101 101 goto mem_free; ··· 103 103 target[i].midr = res.a1; 104 104 target[i].revidr = res.a2; 105 105 target[i].aidr = res.a3; 106 - }; 106 + } 107 107 108 108 if (!cpu_errata_set_target_impl(max_cpus, target)) { 109 109 pr_warn("Failed to set target implementation CPUs\n");
+1 -1
include/linux/kvm_host.h
··· 2382 2382 struct kvm_vcpu *kvm_get_running_vcpu(void); 2383 2383 struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void); 2384 2384 2385 - #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 2385 + #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS) 2386 2386 bool kvm_arch_has_irq_bypass(void); 2387 2387 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *, 2388 2388 struct irq_bypass_producer *);
+15 -30
tools/testing/selftests/kvm/Makefile.kvm
··· 50 50 # Non-compiled test targets 51 51 TEST_PROGS_x86 += x86/nx_huge_pages_test.sh 52 52 53 + # Compiled test targets valid on all architectures with libkvm support 54 + TEST_GEN_PROGS_COMMON = demand_paging_test 55 + TEST_GEN_PROGS_COMMON += dirty_log_test 56 + TEST_GEN_PROGS_COMMON += guest_print_test 57 + TEST_GEN_PROGS_COMMON += kvm_binary_stats_test 58 + TEST_GEN_PROGS_COMMON += kvm_create_max_vcpus 59 + TEST_GEN_PROGS_COMMON += kvm_page_table_test 60 + TEST_GEN_PROGS_COMMON += set_memory_region_test 61 + 53 62 # Compiled test targets 54 - TEST_GEN_PROGS_x86 = x86/cpuid_test 63 + TEST_GEN_PROGS_x86 = $(TEST_GEN_PROGS_COMMON) 64 + TEST_GEN_PROGS_x86 += x86/cpuid_test 55 65 TEST_GEN_PROGS_x86 += x86/cr4_cpuid_sync_test 56 66 TEST_GEN_PROGS_x86 += x86/dirty_log_page_splitting_test 57 67 TEST_GEN_PROGS_x86 += x86/feature_msrs_test ··· 129 119 TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test 130 120 TEST_GEN_PROGS_x86 += access_tracking_perf_test 131 121 TEST_GEN_PROGS_x86 += coalesced_io_test 132 - TEST_GEN_PROGS_x86 += demand_paging_test 133 - TEST_GEN_PROGS_x86 += dirty_log_test 134 122 TEST_GEN_PROGS_x86 += dirty_log_perf_test 135 123 TEST_GEN_PROGS_x86 += guest_memfd_test 136 - TEST_GEN_PROGS_x86 += guest_print_test 137 124 TEST_GEN_PROGS_x86 += hardware_disable_test 138 - TEST_GEN_PROGS_x86 += kvm_create_max_vcpus 139 - TEST_GEN_PROGS_x86 += kvm_page_table_test 140 125 TEST_GEN_PROGS_x86 += memslot_modification_stress_test 141 126 TEST_GEN_PROGS_x86 += memslot_perf_test 142 127 TEST_GEN_PROGS_x86 += mmu_stress_test 143 128 TEST_GEN_PROGS_x86 += rseq_test 144 - TEST_GEN_PROGS_x86 += set_memory_region_test 145 129 TEST_GEN_PROGS_x86 += steal_time 146 - TEST_GEN_PROGS_x86 += kvm_binary_stats_test 147 130 TEST_GEN_PROGS_x86 += system_counter_offset_test 148 131 TEST_GEN_PROGS_x86 += pre_fault_memory_test 149 132 150 133 # Compiled outputs used by test targets 151 134 TEST_GEN_PROGS_EXTENDED_x86 += x86/nx_huge_pages_test 152 135 136 + TEST_GEN_PROGS_arm64 = 
$(TEST_GEN_PROGS_COMMON) 153 137 TEST_GEN_PROGS_arm64 += arm64/aarch32_id_regs 154 138 TEST_GEN_PROGS_arm64 += arm64/arch_timer_edge_cases 155 139 TEST_GEN_PROGS_arm64 += arm64/debug-exceptions ··· 162 158 TEST_GEN_PROGS_arm64 += access_tracking_perf_test 163 159 TEST_GEN_PROGS_arm64 += arch_timer 164 160 TEST_GEN_PROGS_arm64 += coalesced_io_test 165 - TEST_GEN_PROGS_arm64 += demand_paging_test 166 - TEST_GEN_PROGS_arm64 += dirty_log_test 167 161 TEST_GEN_PROGS_arm64 += dirty_log_perf_test 168 - TEST_GEN_PROGS_arm64 += guest_print_test 169 162 TEST_GEN_PROGS_arm64 += get-reg-list 170 - TEST_GEN_PROGS_arm64 += kvm_create_max_vcpus 171 - TEST_GEN_PROGS_arm64 += kvm_page_table_test 172 163 TEST_GEN_PROGS_arm64 += memslot_modification_stress_test 173 164 TEST_GEN_PROGS_arm64 += memslot_perf_test 174 165 TEST_GEN_PROGS_arm64 += mmu_stress_test 175 166 TEST_GEN_PROGS_arm64 += rseq_test 176 - TEST_GEN_PROGS_arm64 += set_memory_region_test 177 167 TEST_GEN_PROGS_arm64 += steal_time 178 - TEST_GEN_PROGS_arm64 += kvm_binary_stats_test 179 168 180 - TEST_GEN_PROGS_s390 = s390/memop 169 + TEST_GEN_PROGS_s390 = $(TEST_GEN_PROGS_COMMON) 170 + TEST_GEN_PROGS_s390 += s390/memop 181 171 TEST_GEN_PROGS_s390 += s390/resets 182 172 TEST_GEN_PROGS_s390 += s390/sync_regs_test 183 173 TEST_GEN_PROGS_s390 += s390/tprot ··· 180 182 TEST_GEN_PROGS_s390 += s390/cpumodel_subfuncs_test 181 183 TEST_GEN_PROGS_s390 += s390/shared_zeropage_test 182 184 TEST_GEN_PROGS_s390 += s390/ucontrol_test 183 - TEST_GEN_PROGS_s390 += demand_paging_test 184 - TEST_GEN_PROGS_s390 += dirty_log_test 185 - TEST_GEN_PROGS_s390 += guest_print_test 186 - TEST_GEN_PROGS_s390 += kvm_create_max_vcpus 187 - TEST_GEN_PROGS_s390 += kvm_page_table_test 188 185 TEST_GEN_PROGS_s390 += rseq_test 189 - TEST_GEN_PROGS_s390 += set_memory_region_test 190 - TEST_GEN_PROGS_s390 += kvm_binary_stats_test 191 186 187 + TEST_GEN_PROGS_riscv = $(TEST_GEN_PROGS_COMMON) 192 188 TEST_GEN_PROGS_riscv += riscv/sbi_pmu_test 193 189 
TEST_GEN_PROGS_riscv += riscv/ebreak_test 194 190 TEST_GEN_PROGS_riscv += arch_timer 195 191 TEST_GEN_PROGS_riscv += coalesced_io_test 196 - TEST_GEN_PROGS_riscv += demand_paging_test 197 - TEST_GEN_PROGS_riscv += dirty_log_test 198 192 TEST_GEN_PROGS_riscv += get-reg-list 199 - TEST_GEN_PROGS_riscv += guest_print_test 200 - TEST_GEN_PROGS_riscv += kvm_binary_stats_test 201 - TEST_GEN_PROGS_riscv += kvm_create_max_vcpus 202 - TEST_GEN_PROGS_riscv += kvm_page_table_test 203 - TEST_GEN_PROGS_riscv += set_memory_region_test 204 193 TEST_GEN_PROGS_riscv += steal_time 205 194 206 195 SPLIT_TESTS += arch_timer
+1 -1
tools/testing/selftests/kvm/arm64/page_fault_test.c
··· 199 199 if (hadbs == 0) 200 200 return false; 201 201 202 - tcr = read_sysreg(tcr_el1) | TCR_EL1_HA; 202 + tcr = read_sysreg(tcr_el1) | TCR_HA; 203 203 write_sysreg(tcr, tcr_el1); 204 204 isb(); 205 205
+61 -6
tools/testing/selftests/kvm/include/arm64/processor.h
··· 62 62 MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL) | \ 63 63 MAIR_ATTRIDX(MAIR_ATTR_NORMAL_WT, MT_NORMAL_WT)) 64 64 65 + /* TCR_EL1 specific flags */ 66 + #define TCR_T0SZ_OFFSET 0 67 + #define TCR_T0SZ(x) ((UL(64) - (x)) << TCR_T0SZ_OFFSET) 68 + 69 + #define TCR_IRGN0_SHIFT 8 70 + #define TCR_IRGN0_MASK (UL(3) << TCR_IRGN0_SHIFT) 71 + #define TCR_IRGN0_NC (UL(0) << TCR_IRGN0_SHIFT) 72 + #define TCR_IRGN0_WBWA (UL(1) << TCR_IRGN0_SHIFT) 73 + #define TCR_IRGN0_WT (UL(2) << TCR_IRGN0_SHIFT) 74 + #define TCR_IRGN0_WBnWA (UL(3) << TCR_IRGN0_SHIFT) 75 + 76 + #define TCR_ORGN0_SHIFT 10 77 + #define TCR_ORGN0_MASK (UL(3) << TCR_ORGN0_SHIFT) 78 + #define TCR_ORGN0_NC (UL(0) << TCR_ORGN0_SHIFT) 79 + #define TCR_ORGN0_WBWA (UL(1) << TCR_ORGN0_SHIFT) 80 + #define TCR_ORGN0_WT (UL(2) << TCR_ORGN0_SHIFT) 81 + #define TCR_ORGN0_WBnWA (UL(3) << TCR_ORGN0_SHIFT) 82 + 83 + #define TCR_SH0_SHIFT 12 84 + #define TCR_SH0_MASK (UL(3) << TCR_SH0_SHIFT) 85 + #define TCR_SH0_INNER (UL(3) << TCR_SH0_SHIFT) 86 + 87 + #define TCR_TG0_SHIFT 14 88 + #define TCR_TG0_MASK (UL(3) << TCR_TG0_SHIFT) 89 + #define TCR_TG0_4K (UL(0) << TCR_TG0_SHIFT) 90 + #define TCR_TG0_64K (UL(1) << TCR_TG0_SHIFT) 91 + #define TCR_TG0_16K (UL(2) << TCR_TG0_SHIFT) 92 + 93 + #define TCR_IPS_SHIFT 32 94 + #define TCR_IPS_MASK (UL(7) << TCR_IPS_SHIFT) 95 + #define TCR_IPS_52_BITS (UL(6) << TCR_IPS_SHIFT) 96 + #define TCR_IPS_48_BITS (UL(5) << TCR_IPS_SHIFT) 97 + #define TCR_IPS_40_BITS (UL(2) << TCR_IPS_SHIFT) 98 + #define TCR_IPS_36_BITS (UL(1) << TCR_IPS_SHIFT) 99 + 100 + #define TCR_HA (UL(1) << 39) 101 + #define TCR_DS (UL(1) << 59) 102 + 103 + /* 104 + * AttrIndx[2:0] encoding (mapping attributes defined in the MAIR* registers). 105 + */ 106 + #define PTE_ATTRINDX(t) ((t) << 2) 107 + #define PTE_ATTRINDX_MASK GENMASK(4, 2) 108 + #define PTE_ATTRINDX_SHIFT 2 109 + 110 + #define PTE_VALID BIT(0) 111 + #define PGD_TYPE_TABLE BIT(1) 112 + #define PUD_TYPE_TABLE BIT(1) 113 + #define PMD_TYPE_TABLE BIT(1) 114 + #define PTE_TYPE_PAGE BIT(1) 115 + 116 + #define PTE_SHARED (UL(3) << 8) /* SH[1:0], inner shareable */ 117 + #define PTE_AF BIT(10) 118 + 119 + #define PTE_ADDR_MASK(page_shift) GENMASK(47, (page_shift)) 120 + #define PTE_ADDR_51_48 GENMASK(15, 12) 121 + #define PTE_ADDR_51_48_SHIFT 12 122 + #define PTE_ADDR_MASK_LPA2(page_shift) GENMASK(49, (page_shift)) 123 + #define PTE_ADDR_51_50_LPA2 GENMASK(9, 8) 124 + #define PTE_ADDR_51_50_LPA2_SHIFT 8 125 + 65 126 void aarch64_vcpu_setup(struct kvm_vcpu *vcpu, struct kvm_vcpu_init *init); 66 127 struct kvm_vcpu *aarch64_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id, 67 128 struct kvm_vcpu_init *init, void *guest_code); ··· 162 101 (v) == VECTOR_SYNC_CURRENT || \ 163 102 (v) == VECTOR_SYNC_LOWER_64 || \ 164 103 (v) == VECTOR_SYNC_LOWER_32) 165 - 166 - /* Access flag */ 167 - #define PTE_AF (1ULL << 10) 168 - 169 - /* Access flag update enable/disable */ 170 - #define TCR_EL1_HA (1ULL << 39) 171 104 172 105 void aarch64_get_supported_page_sizes(uint32_t ipa, uint32_t *ipa4k, 173 106 uint32_t *ipa16k, uint32_t *ipa64k);
+34 -26
tools/testing/selftests/kvm/lib/arm64/processor.c
··· 72 72 uint64_t pte; 73 73 74 74 if (use_lpa2_pte_format(vm)) { 75 - pte = pa & GENMASK(49, vm->page_shift); 76 - pte |= FIELD_GET(GENMASK(51, 50), pa) << 8; 77 - attrs &= ~GENMASK(9, 8); 75 + pte = pa & PTE_ADDR_MASK_LPA2(vm->page_shift); 76 + pte |= FIELD_GET(GENMASK(51, 50), pa) << PTE_ADDR_51_50_LPA2_SHIFT; 77 + attrs &= ~PTE_ADDR_51_50_LPA2; 78 78 } else { 79 - pte = pa & GENMASK(47, vm->page_shift); 79 + pte = pa & PTE_ADDR_MASK(vm->page_shift); 80 80 if (vm->page_shift == 16) 81 - pte |= FIELD_GET(GENMASK(51, 48), pa) << 12; 81 + pte |= FIELD_GET(GENMASK(51, 48), pa) << PTE_ADDR_51_48_SHIFT; 82 82 } 83 83 pte |= attrs; 84 84 ··· 90 90 uint64_t pa; 91 91 92 92 if (use_lpa2_pte_format(vm)) { 93 - pa = pte & GENMASK(49, vm->page_shift); 94 - pa |= FIELD_GET(GENMASK(9, 8), pte) << 50; 93 + pa = pte & PTE_ADDR_MASK_LPA2(vm->page_shift); 94 + pa |= FIELD_GET(PTE_ADDR_51_50_LPA2, pte) << 50; 95 95 } else { 96 - pa = pte & GENMASK(47, vm->page_shift); 96 + pa = pte & PTE_ADDR_MASK(vm->page_shift); 97 97 if (vm->page_shift == 16) 98 - pa |= FIELD_GET(GENMASK(15, 12), pte) << 48; 98 + pa |= FIELD_GET(PTE_ADDR_51_48, pte) << 48; 99 99 } 100 100 101 101 return pa; ··· 128 128 static void _virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, 129 129 uint64_t flags) 130 130 { 131 - uint8_t attr_idx = flags & 7; 131 + uint8_t attr_idx = flags & (PTE_ATTRINDX_MASK >> PTE_ATTRINDX_SHIFT); 132 + uint64_t pg_attr; 132 133 uint64_t *ptep; 133 134 134 135 TEST_ASSERT((vaddr % vm->page_size) == 0, ··· 148 147 149 148 ptep = addr_gpa2hva(vm, vm->pgd) + pgd_index(vm, vaddr) * 8; 150 149 if (!*ptep) 151 - *ptep = addr_pte(vm, vm_alloc_page_table(vm), 3); 150 + *ptep = addr_pte(vm, vm_alloc_page_table(vm), 151 + PGD_TYPE_TABLE | PTE_VALID); 152 152 153 153 switch (vm->pgtable_levels) { 154 154 case 4: 155 155 ptep = addr_gpa2hva(vm, pte_addr(vm, *ptep)) + pud_index(vm, vaddr) * 8; 156 156 if (!*ptep) 157 - *ptep = addr_pte(vm, vm_alloc_page_table(vm), 3); 157 + *ptep = 
addr_pte(vm, vm_alloc_page_table(vm), 158 + PUD_TYPE_TABLE | PTE_VALID); 158 159 /* fall through */ 159 160 case 3: 160 161 ptep = addr_gpa2hva(vm, pte_addr(vm, *ptep)) + pmd_index(vm, vaddr) * 8; 161 162 if (!*ptep) 162 - *ptep = addr_pte(vm, vm_alloc_page_table(vm), 3); 163 + *ptep = addr_pte(vm, vm_alloc_page_table(vm), 164 + PMD_TYPE_TABLE | PTE_VALID); 163 165 /* fall through */ 164 166 case 2: 165 167 ptep = addr_gpa2hva(vm, pte_addr(vm, *ptep)) + pte_index(vm, vaddr) * 8; ··· 171 167 TEST_FAIL("Page table levels must be 2, 3, or 4"); 172 168 } 173 169 174 - *ptep = addr_pte(vm, paddr, (attr_idx << 2) | (1 << 10) | 3); /* AF */ 170 + pg_attr = PTE_AF | PTE_ATTRINDX(attr_idx) | PTE_TYPE_PAGE | PTE_VALID; 171 + if (!use_lpa2_pte_format(vm)) 172 + pg_attr |= PTE_SHARED; 173 + 174 + *ptep = addr_pte(vm, paddr, pg_attr); 175 175 } 176 176 177 177 void virt_arch_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr) ··· 301 293 case VM_MODE_P48V48_64K: 302 294 case VM_MODE_P40V48_64K: 303 295 case VM_MODE_P36V48_64K: 304 - tcr_el1 |= 1ul << 14; /* TG0 = 64KB */ 296 + tcr_el1 |= TCR_TG0_64K; 305 297 break; 306 298 case VM_MODE_P52V48_16K: 307 299 case VM_MODE_P48V48_16K: 308 300 case VM_MODE_P40V48_16K: 309 301 case VM_MODE_P36V48_16K: 310 302 case VM_MODE_P36V47_16K: 311 - tcr_el1 |= 2ul << 14; /* TG0 = 16KB */ 303 + tcr_el1 |= TCR_TG0_16K; 312 304 break; 313 305 case VM_MODE_P52V48_4K: 314 306 case VM_MODE_P48V48_4K: 315 307 case VM_MODE_P40V48_4K: 316 308 case VM_MODE_P36V48_4K: 317 - tcr_el1 |= 0ul << 14; /* TG0 = 4KB */ 309 + tcr_el1 |= TCR_TG0_4K; 318 310 break; 319 311 default: 320 312 TEST_FAIL("Unknown guest mode, mode: 0x%x", vm->mode); ··· 327 319 case VM_MODE_P52V48_4K: 328 320 case VM_MODE_P52V48_16K: 329 321 case VM_MODE_P52V48_64K: 330 - tcr_el1 |= 6ul << 32; /* IPS = 52 bits */ 322 + tcr_el1 |= TCR_IPS_52_BITS; 331 323 ttbr0_el1 |= FIELD_GET(GENMASK(51, 48), vm->pgd) << 2; 332 324 break; 333 325 case VM_MODE_P48V48_4K: 334 326 case 
VM_MODE_P48V48_16K: 335 327 case VM_MODE_P48V48_64K: 336 - tcr_el1 |= 5ul << 32; /* IPS = 48 bits */ 328 + tcr_el1 |= TCR_IPS_48_BITS; 337 329 break; 338 330 case VM_MODE_P40V48_4K: 339 331 case VM_MODE_P40V48_16K: 340 332 case VM_MODE_P40V48_64K: 341 - tcr_el1 |= 2ul << 32; /* IPS = 40 bits */ 333 + tcr_el1 |= TCR_IPS_40_BITS; 342 334 break; 343 335 case VM_MODE_P36V48_4K: 344 336 case VM_MODE_P36V48_16K: 345 337 case VM_MODE_P36V48_64K: 346 338 case VM_MODE_P36V47_16K: 347 - tcr_el1 |= 1ul << 32; /* IPS = 36 bits */ 339 + tcr_el1 |= TCR_IPS_36_BITS; 348 340 break; 349 341 default: 350 342 TEST_FAIL("Unknown guest mode, mode: 0x%x", vm->mode); 351 343 } 352 344 353 - sctlr_el1 |= (1 << 0) | (1 << 2) | (1 << 12) /* M | C | I */; 354 - /* TCR_EL1 |= IRGN0:WBWA | ORGN0:WBWA | SH0:Inner-Shareable */; 355 - tcr_el1 |= (1 << 8) | (1 << 10) | (3 << 12); 356 - tcr_el1 |= (64 - vm->va_bits) /* T0SZ */; 345 + sctlr_el1 |= SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_I; 346 + 347 + tcr_el1 |= TCR_IRGN0_WBWA | TCR_ORGN0_WBWA | TCR_SH0_INNER; 348 + tcr_el1 |= TCR_T0SZ(vm->va_bits); 357 349 if (use_lpa2_pte_format(vm)) 358 - tcr_el1 |= (1ul << 59) /* DS */; 350 + tcr_el1 |= TCR_DS; 359 351 360 352 vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_SCTLR_EL1), sctlr_el1); 361 353 vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_TCR_EL1), tcr_el1);
+2 -3
tools/testing/selftests/kvm/lib/kvm_util.c
··· 2019 2019 KVM_EXIT_STRING(RISCV_SBI), 2020 2020 KVM_EXIT_STRING(RISCV_CSR), 2021 2021 KVM_EXIT_STRING(NOTIFY), 2022 - #ifdef KVM_EXIT_MEMORY_NOT_PRESENT 2023 - KVM_EXIT_STRING(MEMORY_NOT_PRESENT), 2024 - #endif 2022 + KVM_EXIT_STRING(LOONGARCH_IOCSR), 2023 + KVM_EXIT_STRING(MEMORY_FAULT), 2025 2024 }; 2026 2025 2027 2026 /*
+25 -6
tools/testing/selftests/kvm/rseq_test.c
··· 196 196 static void help(const char *name) 197 197 { 198 198 puts(""); 199 - printf("usage: %s [-h] [-u]\n", name); 199 + printf("usage: %s [-h] [-u] [-l latency]\n", name); 200 200 printf(" -u: Don't sanity check the number of successful KVM_RUNs\n"); 201 + printf(" -l: Set /dev/cpu_dma_latency to suppress deep sleep states\n"); 201 202 puts(""); 202 203 exit(0); 203 204 } 204 205 205 206 int main(int argc, char *argv[]) 206 207 { 208 + int r, i, snapshot, opt, fd = -1, latency = -1; 207 209 bool skip_sanity_check = false; 208 - int r, i, snapshot; 209 210 struct kvm_vm *vm; 210 211 struct kvm_vcpu *vcpu; 211 212 u32 cpu, rseq_cpu; 212 - int opt; 213 213 214 - while ((opt = getopt(argc, argv, "hu")) != -1) { 214 + while ((opt = getopt(argc, argv, "hl:u")) != -1) { 215 215 switch (opt) { 216 216 case 'u': 217 217 skip_sanity_check = true; 218 + case 'l': 219 + latency = atoi_paranoid(optarg); 218 220 break; 219 221 case 'h': 220 222 default: ··· 244 242 245 243 pthread_create(&migration_thread, NULL, migration_worker, 246 244 (void *)(unsigned long)syscall(SYS_gettid)); 245 + 246 + if (latency >= 0) { 247 + /* 248 + * Writes to cpu_dma_latency persist only while the file is 249 + * open, i.e. it allows userspace to provide guaranteed latency 250 + * while running a workload. Keep the file open until the test 251 + * completes, otherwise writing cpu_dma_latency is meaningless. 252 + */ 253 + fd = open("/dev/cpu_dma_latency", O_RDWR); 254 + TEST_ASSERT(fd >= 0, __KVM_SYSCALL_ERROR("open() /dev/cpu_dma_latency", fd)); 255 + 256 + r = write(fd, &latency, 4); 257 + TEST_ASSERT(r >= 1, "Error setting /dev/cpu_dma_latency"); 258 + } 247 259 248 260 for (i = 0; !done; i++) { 249 261 vcpu_run(vcpu); ··· 294 278 "rseq CPU = %d, sched CPU = %d", rseq_cpu, cpu); 295 279 } 296 280 281 + if (fd > 0) 282 + close(fd); 283 + 297 284 /* 298 285 * Sanity check that the test was able to enter the guest a reasonable 299 286 * number of times, e.g. didn't get stalled too often/long waiting for ··· 312 293 TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2), 313 294 "Only performed %d KVM_RUNs, task stalled too much?\n\n" 314 295 " Try disabling deep sleep states to reduce CPU wakeup latency,\n" 315 - " e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n" 316 - " or run with -u to disable this sanity check.", i); 296 + " e.g. via cpuidle.off=1 or via -l <latency>, or run with -u to\n" 297 + " disable this sanity check.", i); 317 298 318 299 pthread_join(migration_thread, NULL);  319 300
+57 -51
tools/testing/selftests/kvm/x86/monitor_mwait_test.c
··· 7 7 8 8 #include "kvm_util.h" 9 9 #include "processor.h" 10 + #include "kselftest.h" 10 11 11 12 #define CPUID_MWAIT (1u << 3) 12 13 ··· 15 14 MWAIT_QUIRK_DISABLED = BIT(0), 16 15 MISC_ENABLES_QUIRK_DISABLED = BIT(1), 17 16 MWAIT_DISABLED = BIT(2), 17 + CPUID_DISABLED = BIT(3), 18 + TEST_MAX = CPUID_DISABLED * 2 - 1, 18 19 }; 19 20 20 21 /* ··· 38 35 testcase, vector); \ 39 36 } while (0) 40 37 41 - static void guest_monitor_wait(int testcase) 38 + static void guest_monitor_wait(void *arg) 42 39 { 40 + int testcase = (int) (long) arg; 43 41 u8 vector; 44 42 45 - GUEST_SYNC(testcase); 43 + u64 val = rdmsr(MSR_IA32_MISC_ENABLE) & ~MSR_IA32_MISC_ENABLE_MWAIT; 44 + if (!(testcase & MWAIT_DISABLED)) 45 + val |= MSR_IA32_MISC_ENABLE_MWAIT; 46 + wrmsr(MSR_IA32_MISC_ENABLE, val); 47 + 48 + __GUEST_ASSERT(this_cpu_has(X86_FEATURE_MWAIT) == !(testcase & MWAIT_DISABLED), 49 + "Expected CPUID.MWAIT %s\n", 50 + (testcase & MWAIT_DISABLED) ? "cleared" : "set"); 46 51 47 52 /* 48 53 * Arbitrarily MONITOR this function, SVM performs fault checks before ··· 61 50 62 51 vector = kvm_asm_safe("mwait", "a"(guest_monitor_wait), "c"(0), "d"(0)); 63 52 GUEST_ASSERT_MONITOR_MWAIT("MWAIT", testcase, vector); 64 - } 65 - 66 - static void guest_code(void) 67 - { 68 - guest_monitor_wait(MWAIT_DISABLED); 69 - 70 - guest_monitor_wait(MWAIT_QUIRK_DISABLED | MWAIT_DISABLED); 71 - 72 - guest_monitor_wait(MISC_ENABLES_QUIRK_DISABLED | MWAIT_DISABLED); 73 - guest_monitor_wait(MISC_ENABLES_QUIRK_DISABLED); 74 - 75 - guest_monitor_wait(MISC_ENABLES_QUIRK_DISABLED | MWAIT_QUIRK_DISABLED | MWAIT_DISABLED); 76 - guest_monitor_wait(MISC_ENABLES_QUIRK_DISABLED | MWAIT_QUIRK_DISABLED); 77 53 78 54 GUEST_DONE(); 79 55 } ··· 72 74 struct kvm_vm *vm; 73 75 struct ucall uc; 74 76 int testcase; 77 + char test[80]; 75 78 76 - TEST_REQUIRE(this_cpu_has(X86_FEATURE_MWAIT)); 77 79 TEST_REQUIRE(kvm_has_cap(KVM_CAP_DISABLE_QUIRKS2)); 78 80 79 - vm = vm_create_with_one_vcpu(&vcpu, guest_code); 80 - 
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_MWAIT); 81 + ksft_print_header(); 82 + ksft_set_plan(12); 83 + for (testcase = 0; testcase <= TEST_MAX; testcase++) { 84 + vm = vm_create_with_one_vcpu(&vcpu, guest_monitor_wait); 85 + vcpu_args_set(vcpu, 1, (void *)(long)testcase); 81 86 82 - while (1) { 87 + disabled_quirks = 0; 88 + if (testcase & MWAIT_QUIRK_DISABLED) { 89 + disabled_quirks |= KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS; 90 + strcpy(test, "MWAIT can fault"); 91 + } else { 92 + strcpy(test, "MWAIT never faults"); 93 + } 94 + if (testcase & MISC_ENABLES_QUIRK_DISABLED) { 95 + disabled_quirks |= KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT; 96 + strcat(test, ", MISC_ENABLE updates CPUID"); 97 + } else { 98 + strcat(test, ", no CPUID updates"); 99 + } 100 + 101 + vm_enable_cap(vm, KVM_CAP_DISABLE_QUIRKS2, disabled_quirks); 102 + 103 + if (!(testcase & MISC_ENABLES_QUIRK_DISABLED) && 104 + (!!(testcase & CPUID_DISABLED) ^ !!(testcase & MWAIT_DISABLED))) 105 + continue; 106 + 107 + if (testcase & CPUID_DISABLED) { 108 + strcat(test, ", CPUID clear"); 109 + vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_MWAIT); 110 + } else { 111 + strcat(test, ", CPUID set"); 112 + vcpu_set_cpuid_feature(vcpu, X86_FEATURE_MWAIT); 113 + } 114 + 115 + if (testcase & MWAIT_DISABLED) 116 + strcat(test, ", MWAIT disabled"); 117 + 83 118 vcpu_run(vcpu); 84 119 TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); 85 120 86 121 switch (get_ucall(vcpu, &uc)) { 87 - case UCALL_SYNC: 88 - testcase = uc.args[1]; 89 - break; 90 122 case UCALL_ABORT: 91 - REPORT_GUEST_ASSERT(uc); 92 - goto done; 123 + /* Detected in vcpu_run */ 124 + break; 93 125 case UCALL_DONE: 94 - goto done; 126 + ksft_test_result_pass("%s\n", test); 127 + break; 95 128 default: 96 129 TEST_FAIL("Unknown ucall %lu", uc.cmd); 97 - goto done; 130 + break; 98 131 } 99 - 100 - disabled_quirks = 0; 101 - if (testcase & MWAIT_QUIRK_DISABLED) 102 - disabled_quirks |= KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS; 103 - if (testcase & 
MISC_ENABLES_QUIRK_DISABLED) 104 - disabled_quirks |= KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT; 105 - vm_enable_cap(vm, KVM_CAP_DISABLE_QUIRKS2, disabled_quirks); 106 - 107 - /* 108 - * If the MISC_ENABLES quirk (KVM neglects to update CPUID to 109 - * enable/disable MWAIT) is disabled, toggle the ENABLE_MWAIT 110 - * bit in MISC_ENABLES accordingly. If the quirk is enabled, 111 - * the only valid configuration is MWAIT disabled, as CPUID 112 - * can't be manually changed after running the vCPU. 113 - */ 114 - if (!(testcase & MISC_ENABLES_QUIRK_DISABLED)) { 115 - TEST_ASSERT(testcase & MWAIT_DISABLED, 116 - "Can't toggle CPUID features after running vCPU"); 117 - continue; 118 - } 119 - 120 - vcpu_set_msr(vcpu, MSR_IA32_MISC_ENABLE, 121 - (testcase & MWAIT_DISABLED) ? 0 : MSR_IA32_MISC_ENABLE_MWAIT); 132 + kvm_vm_free(vm); 122 133 } 134 + ksft_finished(); 123 135 124 - done: 125 - kvm_vm_free(vm); 126 136 return 0; 127 137 }
+1 -1
virt/kvm/Kconfig
··· 75 75 depends on KVM && COMPAT && !(S390 || ARM64 || RISCV) 76 76 77 77 config HAVE_KVM_IRQ_BYPASS 78 - bool 78 + tristate 79 79 select IRQ_BYPASS_MANAGER 80 80 81 81 config HAVE_KVM_VCPU_ASYNC_IOCTL
+5 -5
virt/kvm/eventfd.c
··· 149 149 /* 150 150 * It is now safe to release the object's resources 151 151 */ 152 - #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 152 + #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS) 153 153 irq_bypass_unregister_consumer(&irqfd->consumer); 154 154 #endif 155 155 eventfd_ctx_put(irqfd->eventfd); ··· 274 274 write_seqcount_end(&irqfd->irq_entry_sc); 275 275 } 276 276 277 - #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 277 + #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS) 278 278 void __attribute__((weak)) kvm_arch_irq_bypass_stop( 279 279 struct irq_bypass_consumer *cons) 280 280 { ··· 424 424 if (events & EPOLLIN) 425 425 schedule_work(&irqfd->inject); 426 426 427 - #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 427 + #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS) 428 428 if (kvm_arch_has_irq_bypass()) { 429 429 irqfd->consumer.token = (void *)irqfd->eventfd; 430 430 irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer; ··· 609 609 spin_lock_irq(&kvm->irqfds.lock); 610 610 611 611 list_for_each_entry(irqfd, &kvm->irqfds.items, list) { 612 - #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 612 + #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS) 613 613 /* Under irqfds.lock, so can read irq_entry safely */ 614 614 struct kvm_kernel_irq_routing_entry old = irqfd->irq_entry; 615 615 #endif 616 616 617 617 irqfd_update(kvm, irqfd); 618 618 619 - #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 619 + #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS) 620 620 if (irqfd->producer && 621 621 kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) { 622 622 int ret = kvm_arch_update_irqfd_routing(