Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.

Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency by up to 50% in some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). For
now this has to be enabled manually, but the plan is to
auto-tune it in the future.

ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking

s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)

MIPS:
Bugfixes.

x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
and the usual round of emulation fixes.

There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.

Some commits are common between this pull and Catalin's; I see you
have already included his tree.

Powerpc:
Nothing yet.

The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
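The halt polling described above is controlled through the halt_poll_ns module parameter added in this pull (see "kvm: add halt_poll_ns module parameter" in the shortlog). A minimal sketch of tuning it at runtime, assuming the standard sysfs module-parameter layout and a host with the kvm module loaded:

```shell
# Enable roughly 200us of polling after each guest HLT (value in nanoseconds).
# Requires root and a KVM-capable host; path assumes the usual
# /sys/module/<name>/parameters layout.
echo 200000 | sudo tee /sys/module/kvm/parameters/halt_poll_ns

# Setting it back to 0 disables polling again.
echo 0 | sudo tee /sys/module/kvm/parameters/halt_poll_ns
```

As the message notes, this is a manual knob for now; the auto-tuning mentioned above landed in later releases.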

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...

+6045 -1645
+29 -6
Documentation/virtual/kvm/api.txt
···
 Parameters: none
 Returns: 0 on success, -1 on error
 
-Creates an interrupt controller model in the kernel. On x86, creates a virtual
-ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a
-local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23
-only go to the IOAPIC. On ARM/arm64, a GIC is
-created. On s390, a dummy irq routing table is created.
+Creates an interrupt controller model in the kernel.
+On x86, creates a virtual ioapic, a virtual PIC (two PICs, nested), and sets up
+future vcpus to have a local APIC. IRQ routing for GSIs 0-15 is set to both
+PIC and IOAPIC; GSI 16-23 only go to the IOAPIC.
+On ARM/arm64, a GICv2 is created. Any other GIC versions require the usage of
+KVM_CREATE_DEVICE, which also supports creating a GICv2. Using
+KVM_CREATE_DEVICE is preferred over KVM_CREATE_IRQCHIP for GICv2.
+On s390, a dummy irq routing table is created.
 
 Note that on s390 the KVM_CAP_S390_IRQCHIP vm capability needs to be enabled
 before KVM_CREATE_IRQCHIP can be used.
···
 type can be one of the following:
 
-KVM_S390_SIGP_STOP (vcpu) - sigp restart
+KVM_S390_SIGP_STOP (vcpu) - sigp stop; optional flags in parm
 KVM_S390_PROGRAM_INT (vcpu) - program check; code in parm
 KVM_S390_SIGP_SET_PREFIX (vcpu) - sigp set prefix; prefix address in parm
 KVM_S390_RESTART (vcpu) - restart
···
 If the hcall number specified is not one that has an in-kernel
 implementation, the KVM_ENABLE_CAP ioctl will fail with an EINVAL
 error.
+
+7.2 KVM_CAP_S390_USER_SIGP
+
+Architectures: s390
+Parameters: none
+
+This capability controls which SIGP orders will be handled completely in user
+space. With this capability enabled, all fast orders will be handled completely
+in the kernel:
+- SENSE
+- SENSE RUNNING
+- EXTERNAL CALL
+- EMERGENCY SIGNAL
+- CONDITIONAL EMERGENCY SIGNAL
+
+All other orders will be handled completely in user space.
+
+Only privileged operation exceptions will be checked for in the kernel (or even
+in the hardware prior to interception). If this capability is not enabled, the
+old way of handling SIGP orders is used (partially in kernel and user space).
+35 -2
Documentation/virtual/kvm/devices/arm-vgic.txt
···
 Device types supported:
   KVM_DEV_TYPE_ARM_VGIC_V2     ARM Generic Interrupt Controller v2.0
+  KVM_DEV_TYPE_ARM_VGIC_V3     ARM Generic Interrupt Controller v3.0
 
 Only one VGIC instance may be instantiated through either this API or the
 legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM interrupt
 controller, requiring emulated user-space devices to inject interrupts to the
 VGIC instead of directly to CPUs.
 
+Creating a guest GICv3 device requires a host GICv3 as well.
+GICv3 implementations with hardware compatibility support allow a guest GICv2
+as well.
+
 Groups:
   KVM_DEV_ARM_VGIC_GRP_ADDR
   Attributes:
     KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
       Base address in the guest physical address space of the GIC distributor
-      register mappings.
+      register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
+      This address needs to be 4K aligned and the region covers 4 KByte.
 
     KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
       Base address in the guest physical address space of the GIC virtual cpu
-      interface register mappings.
+      interface register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
+      This address needs to be 4K aligned and the region covers 4 KByte.
+
+    KVM_VGIC_V3_ADDR_TYPE_DIST (rw, 64-bit)
+      Base address in the guest physical address space of the GICv3 distributor
+      register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+      This address needs to be 64K aligned and the region covers 64 KByte.
+
+    KVM_VGIC_V3_ADDR_TYPE_REDIST (rw, 64-bit)
+      Base address in the guest physical address space of the GICv3
+      redistributor register mappings. There are two 64K pages for each
+      VCPU and all of the redistributor pages are contiguous.
+      Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+      This address needs to be 64K aligned.
+
 
   KVM_DEV_ARM_VGIC_GRP_DIST_REGS
   Attributes:
···
     the register.
   Limitations:
     - Priorities are not implemented, and registers are RAZ/WI
+    - Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2.
   Errors:
     -ENODEV: Getting or setting this register is not yet supported
     -EBUSY: One or more VCPUs are running
···
 
   Limitations:
     - Priorities are not implemented, and registers are RAZ/WI
+    - Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2.
   Errors:
     -ENODEV: Getting or setting this register is not yet supported
     -EBUSY: One or more VCPUs are running
···
     -EINVAL: Value set is out of the expected range
     -EBUSY: Value has already be set, or GIC has already been initialized
             with default values.
+
+  KVM_DEV_ARM_VGIC_GRP_CTRL
+  Attributes:
+    KVM_DEV_ARM_VGIC_CTRL_INIT
+      request the initialization of the VGIC, no additional parameter in
+      kvm_device_attr.addr.
+  Errors:
+    -ENXIO: VGIC not properly configured as required prior to calling
+     this attribute
+    -ENODEV: no online VCPU
+    -ENOMEM: memory shortage when allocating vgic internal data
+59
Documentation/virtual/kvm/devices/vm.txt
···
 
 Clear the CMMA status for all guest pages, so any pages the guest marked
 as unused are again used any may not be reclaimed by the host.
+
+1.3. ATTRIBUTE KVM_S390_VM_MEM_LIMIT_SIZE
+Parameters: in attr->addr the address for the new limit of guest memory
+Returns: -EFAULT if the given address is not accessible
+         -EINVAL if the virtual machine is of type UCONTROL
+         -E2BIG if the given guest memory is to big for that machine
+         -EBUSY if a vcpu is already defined
+         -ENOMEM if not enough memory is available for a new shadow guest mapping
+          0 otherwise
+
+Allows userspace to query the actual limit and set a new limit for
+the maximum guest memory size. The limit will be rounded up to
+2048 MB, 4096 GB, 8192 TB respectively, as this limit is governed by
+the number of page table levels.
+
+2. GROUP: KVM_S390_VM_CPU_MODEL
+Architectures: s390
+
+2.1. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE (r/o)
+
+Allows user space to retrieve machine and kvm specific cpu related information:
+
+struct kvm_s390_vm_cpu_machine {
+       __u64 cpuid;           # CPUID of host
+       __u32 ibc;             # IBC level range offered by host
+       __u8  pad[4];
+       __u64 fac_mask[256];   # set of cpu facilities enabled by KVM
+       __u64 fac_list[256];   # set of cpu facilities offered by host
+}
+
+Parameters: address of buffer to store the machine related cpu data
+            of type struct kvm_s390_vm_cpu_machine*
+Returns:    -EFAULT if the given address is not accessible from kernel space
+            -ENOMEM if not enough memory is available to process the ioctl
+            0 in case of success
+
+2.2. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR (r/w)
+
+Allows user space to retrieve or request to change cpu related information for a vcpu:
+
+struct kvm_s390_vm_cpu_processor {
+       __u64 cpuid;          # CPUID currently (to be) used by this vcpu
+       __u16 ibc;            # IBC level currently (to be) used by this vcpu
+       __u8  pad[6];
+       __u64 fac_list[256];  # set of cpu facilities currently (to be) used
+                             # by this vcpu
+}
+
+KVM does not enforce or limit the cpu model data in any form. Take the information
+retrieved by means of KVM_S390_VM_CPU_MACHINE as hint for reasonable configuration
+setups. Instruction interceptions triggered by additionally set facilitiy bits that
+are not handled by KVM need to by imlemented in the VM driver code.
+
+Parameters: address of buffer to store/set the processor related cpu
+            data of type struct kvm_s390_vm_cpu_processor*.
+Returns:    -EBUSY in case 1 or more vcpus are already activated (only in write case)
+            -EFAULT if the given address is not accessible from kernel space
+            -ENOMEM if not enough memory is available to process the ioctl
+            0 in case of success
+1
arch/arm/include/asm/kvm_asm.h
···
 
 extern void __kvm_flush_vm_context(void);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
+extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 #endif
+3 -2
arch/arm/include/asm/kvm_emulate.h
···
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmio.h>
 #include <asm/kvm_arm.h>
+#include <asm/cputype.h>
 
 unsigned long *vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num);
 unsigned long *vcpu_spsr(struct kvm_vcpu *vcpu);
···
 	return kvm_vcpu_get_hsr(vcpu) & HSR_HVC_IMM_MASK;
 }
 
-static inline unsigned long kvm_vcpu_get_mpidr(struct kvm_vcpu *vcpu)
+static inline unsigned long kvm_vcpu_get_mpidr_aff(struct kvm_vcpu *vcpu)
 {
-	return vcpu->arch.cp15[c0_MPIDR];
+	return vcpu->arch.cp15[c0_MPIDR] & MPIDR_HWID_BITMASK;
 }
 
 static inline void kvm_vcpu_set_be(struct kvm_vcpu *vcpu)
+6
arch/arm/include/asm/kvm_host.h
···
 
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
+	int max_vcpus;
 };
 
 #define KVM_NR_MEM_OBJS     40
···
 };
 
 struct kvm_vcpu_stat {
+	u32 halt_successful_poll;
 	u32 halt_wakeup;
 };
 
···
 
 int kvm_perf_init(void);
 int kvm_perf_teardown(void);
+
+void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
+
+struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
 
 static inline void kvm_arch_hardware_disable(void) {}
 static inline void kvm_arch_hardware_unsetup(void) {}
+1
arch/arm/include/asm/kvm_mmio.h
···
 	u8		data[8];
 	u32		len;
 	bool		is_write;
+	void		*private;
 };
 
 static inline void kvm_prepare_mmio(struct kvm_run *run,
+21
arch/arm/include/asm/kvm_mmu.h
···
 	pmd_val(*pmd) |= L_PMD_S2_RDWR;
 }
 
+static inline void kvm_set_s2pte_readonly(pte_t *pte)
+{
+	pte_val(*pte) = (pte_val(*pte) & ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY;
+}
+
+static inline bool kvm_s2pte_readonly(pte_t *pte)
+{
+	return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
+}
+
+static inline void kvm_set_s2pmd_readonly(pmd_t *pmd)
+{
+	pmd_val(*pmd) = (pmd_val(*pmd) & ~L_PMD_S2_RDWR) | L_PMD_S2_RDONLY;
+}
+
+static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
+{
+	return (pmd_val(*pmd) & L_PMD_S2_RDWR) == L_PMD_S2_RDONLY;
+}
+
+
 /* Open coded p*d_addr_end that can deal with 64bit addresses */
 #define kvm_pgd_addr_end(addr, end)					\
 ({	u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
+1
arch/arm/include/asm/pgtable-3level.h
···
 #define L_PTE_S2_RDONLY		(_AT(pteval_t, 1) << 6)   /* HAP[1]   */
 #define L_PTE_S2_RDWR		(_AT(pteval_t, 3) << 6)   /* HAP[2:1] */
 
+#define L_PMD_S2_RDONLY		(_AT(pmdval_t, 1) << 6)   /* HAP[1]   */
 #define L_PMD_S2_RDWR		(_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
 
 /*
+2
arch/arm/include/uapi/asm/kvm.h
···
 #define KVM_DEV_ARM_VGIC_OFFSET_SHIFT	0
 #define KVM_DEV_ARM_VGIC_OFFSET_MASK	(0xffffffffULL << KVM_DEV_ARM_VGIC_OFFSET_SHIFT)
 #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS	3
+#define KVM_DEV_ARM_VGIC_GRP_CTRL	4
+#define KVM_DEV_ARM_VGIC_CTRL_INIT	0
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT		24
+2
arch/arm/kvm/Kconfig
···
 	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
+	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	select KVM_MMIO
 	select KVM_ARM_HOST
+	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select SRCU
 	depends on ARM_VIRT_EXT && ARM_LPAE
 	---help---
+1
arch/arm/kvm/Makefile
···
 obj-y += coproc.o coproc_a15.o coproc_a7.o mmio.o psci.o perf.o
 obj-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic.o
 obj-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic-v2.o
+obj-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic-v2-emul.o
 obj-$(CONFIG_KVM_ARM_TIMER) += $(KVM)/arm/arch_timer.o
+54 -4
arch/arm/kvm/arm.c
···
 	/* Mark the initial VMID generation invalid */
 	kvm->arch.vmid_gen = 0;
 
+	/* The maximum number of VCPUs is limited by the host's GIC model */
+	kvm->arch.max_vcpus = kvm_vgic_get_max_vcpus();
+
 	return ret;
 out_free_stage2_pgd:
 	kvm_free_stage2_pgd(kvm);
···
 		goto out;
 	}
 
+	if (id >= kvm->arch.max_vcpus) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
 	if (!vcpu) {
 		err = -ENOMEM;
···
 	return ERR_PTR(err);
 }
 
-int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 {
-	return 0;
 }
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
···
 	}
 }
 
+/**
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm:	kvm instance
+ * @log:	slot id and address to which we copy the log
+ *
+ * Steps 1-4 below provide general overview of dirty page logging. See
+ * kvm_get_dirty_log_protect() function description for additional details.
+ *
+ * We call kvm_get_dirty_log_protect() to handle steps 1-3, upon return we
+ * always flush the TLB (step 4) even if previous step failed and the dirty
+ * bitmap may be corrupt. Regardless of previous outcome the KVM logging API
+ * does not preclude user space subsequent dirty log read. Flushing TLB ensures
+ * writes will be marked dirty for next log read.
+ *
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Copy the snapshot to the userspace.
+ *   4. Flush TLB's if needed.
+ */
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
-	return -EINVAL;
+	bool is_dirty = false;
+	int r;
+
+	mutex_lock(&kvm->slots_lock);
+
+	r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
+
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
+	mutex_unlock(&kvm->slots_lock);
+	return r;
 }
 
 static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
···
 	switch (ioctl) {
 	case KVM_CREATE_IRQCHIP: {
 		if (vgic_present)
-			return kvm_vgic_create(kvm);
+			return kvm_vgic_create(kvm, KVM_DEV_TYPE_ARM_VGIC_V2);
 		else
 			return -ENXIO;
 	}
···
 static void check_kvm_target_cpu(void *ret)
 {
 	*(int *)ret = kvm_target_cpu();
+}
+
+struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	mpidr &= MPIDR_HWID_BITMASK;
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (mpidr == kvm_vcpu_get_mpidr_aff(vcpu))
+			return vcpu;
+	}
+	return NULL;
 }
 
 /**
+5 -3
arch/arm/kvm/handle_exit.c
···
  */
 static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	trace_kvm_wfi(*vcpu_pc(vcpu));
-	if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
+	if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE) {
+		trace_kvm_wfx(*vcpu_pc(vcpu), true);
 		kvm_vcpu_on_spin(vcpu);
-	else
+	} else {
+		trace_kvm_wfx(*vcpu_pc(vcpu), false);
 		kvm_vcpu_block(vcpu);
+	}
 
 	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
 
+11
arch/arm/kvm/interrupts.S
···
 	bx	lr
 ENDPROC(__kvm_tlb_flush_vmid_ipa)
 
+/**
+ * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs
+ *
+ * Reuses __kvm_tlb_flush_vmid_ipa() for ARMv7, without passing address
+ * parameter
+ */
+ENTRY(__kvm_tlb_flush_vmid)
+	b	__kvm_tlb_flush_vmid_ipa
+ENDPROC(__kvm_tlb_flush_vmid)
+
 /********************************************************************
  * Flush TLBs and instruction caches of all CPUs inside the inner-shareable
  * domain, for all VMIDs
+262 -9
arch/arm/kvm/mmu.c
··· 45 45 #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) 46 46 47 47 #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) 48 + #define kvm_pud_huge(_x) pud_huge(_x) 49 + 50 + #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0) 51 + #define KVM_S2_FLAG_LOGGING_ACTIVE (1UL << 1) 52 + 53 + static bool memslot_is_logging(struct kvm_memory_slot *memslot) 54 + { 55 + return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY); 56 + } 57 + 58 + /** 59 + * kvm_flush_remote_tlbs() - flush all VM TLB entries for v7/8 60 + * @kvm: pointer to kvm structure. 61 + * 62 + * Interface to HYP function to flush all VM TLB entries 63 + */ 64 + void kvm_flush_remote_tlbs(struct kvm *kvm) 65 + { 66 + kvm_call_hyp(__kvm_tlb_flush_vmid, kvm); 67 + } 48 68 49 69 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) 50 70 { ··· 96 76 static void kvm_flush_dcache_pud(pud_t pud) 97 77 { 98 78 __kvm_flush_dcache_pud(pud); 79 + } 80 + 81 + /** 82 + * stage2_dissolve_pmd() - clear and flush huge PMD entry 83 + * @kvm: pointer to kvm structure. 84 + * @addr: IPA 85 + * @pmd: pmd pointer for IPA 86 + * 87 + * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. Marks all 88 + * pages in the range dirty. 
89 + */ 90 + static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd) 91 + { 92 + if (!kvm_pmd_huge(*pmd)) 93 + return; 94 + 95 + pmd_clear(pmd); 96 + kvm_tlb_flush_vmid_ipa(kvm, addr); 97 + put_page(virt_to_page(pmd)); 99 98 } 100 99 101 100 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, ··· 858 819 } 859 820 860 821 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, 861 - phys_addr_t addr, const pte_t *new_pte, bool iomap) 822 + phys_addr_t addr, const pte_t *new_pte, 823 + unsigned long flags) 862 824 { 863 825 pmd_t *pmd; 864 826 pte_t *pte, old_pte; 827 + bool iomap = flags & KVM_S2PTE_FLAG_IS_IOMAP; 828 + bool logging_active = flags & KVM_S2_FLAG_LOGGING_ACTIVE; 829 + 830 + VM_BUG_ON(logging_active && !cache); 865 831 866 832 /* Create stage-2 page table mapping - Levels 0 and 1 */ 867 833 pmd = stage2_get_pmd(kvm, cache, addr); ··· 877 833 */ 878 834 return 0; 879 835 } 836 + 837 + /* 838 + * While dirty page logging - dissolve huge PMD, then continue on to 839 + * allocate page. 
840 + */ 841 + if (logging_active) 842 + stage2_dissolve_pmd(kvm, addr, pmd); 880 843 881 844 /* Create stage-2 page mappings - Level 2 */ 882 845 if (pmd_none(*pmd)) { ··· 941 890 if (ret) 942 891 goto out; 943 892 spin_lock(&kvm->mmu_lock); 944 - ret = stage2_set_pte(kvm, &cache, addr, &pte, true); 893 + ret = stage2_set_pte(kvm, &cache, addr, &pte, 894 + KVM_S2PTE_FLAG_IS_IOMAP); 945 895 spin_unlock(&kvm->mmu_lock); 946 896 if (ret) 947 897 goto out; ··· 1009 957 return !pfn_valid(pfn); 1010 958 } 1011 959 960 + /** 961 + * stage2_wp_ptes - write protect PMD range 962 + * @pmd: pointer to pmd entry 963 + * @addr: range start address 964 + * @end: range end address 965 + */ 966 + static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end) 967 + { 968 + pte_t *pte; 969 + 970 + pte = pte_offset_kernel(pmd, addr); 971 + do { 972 + if (!pte_none(*pte)) { 973 + if (!kvm_s2pte_readonly(pte)) 974 + kvm_set_s2pte_readonly(pte); 975 + } 976 + } while (pte++, addr += PAGE_SIZE, addr != end); 977 + } 978 + 979 + /** 980 + * stage2_wp_pmds - write protect PUD range 981 + * @pud: pointer to pud entry 982 + * @addr: range start address 983 + * @end: range end address 984 + */ 985 + static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end) 986 + { 987 + pmd_t *pmd; 988 + phys_addr_t next; 989 + 990 + pmd = pmd_offset(pud, addr); 991 + 992 + do { 993 + next = kvm_pmd_addr_end(addr, end); 994 + if (!pmd_none(*pmd)) { 995 + if (kvm_pmd_huge(*pmd)) { 996 + if (!kvm_s2pmd_readonly(pmd)) 997 + kvm_set_s2pmd_readonly(pmd); 998 + } else { 999 + stage2_wp_ptes(pmd, addr, next); 1000 + } 1001 + } 1002 + } while (pmd++, addr = next, addr != end); 1003 + } 1004 + 1005 + /** 1006 + * stage2_wp_puds - write protect PGD range 1007 + * @pgd: pointer to pgd entry 1008 + * @addr: range start address 1009 + * @end: range end address 1010 + * 1011 + * Process PUD entries, for a huge PUD we cause a panic. 
1012 + */ 1013 + static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end) 1014 + { 1015 + pud_t *pud; 1016 + phys_addr_t next; 1017 + 1018 + pud = pud_offset(pgd, addr); 1019 + do { 1020 + next = kvm_pud_addr_end(addr, end); 1021 + if (!pud_none(*pud)) { 1022 + /* TODO:PUD not supported, revisit later if supported */ 1023 + BUG_ON(kvm_pud_huge(*pud)); 1024 + stage2_wp_pmds(pud, addr, next); 1025 + } 1026 + } while (pud++, addr = next, addr != end); 1027 + } 1028 + 1029 + /** 1030 + * stage2_wp_range() - write protect stage2 memory region range 1031 + * @kvm: The KVM pointer 1032 + * @addr: Start address of range 1033 + * @end: End address of range 1034 + */ 1035 + static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end) 1036 + { 1037 + pgd_t *pgd; 1038 + phys_addr_t next; 1039 + 1040 + pgd = kvm->arch.pgd + pgd_index(addr); 1041 + do { 1042 + /* 1043 + * Release kvm_mmu_lock periodically if the memory region is 1044 + * large. Otherwise, we may see kernel panics with 1045 + * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR, 1046 + * CONFIG_LOCKDEP. Additionally, holding the lock too long 1047 + * will also starve other vCPUs. 1048 + */ 1049 + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) 1050 + cond_resched_lock(&kvm->mmu_lock); 1051 + 1052 + next = kvm_pgd_addr_end(addr, end); 1053 + if (pgd_present(*pgd)) 1054 + stage2_wp_puds(pgd, addr, next); 1055 + } while (pgd++, addr = next, addr != end); 1056 + } 1057 + 1058 + /** 1059 + * kvm_mmu_wp_memory_region() - write protect stage 2 entries for memory slot 1060 + * @kvm: The KVM pointer 1061 + * @slot: The memory slot to write protect 1062 + * 1063 + * Called to start logging dirty pages after memory region 1064 + * KVM_MEM_LOG_DIRTY_PAGES operation is called. After this function returns 1065 + * all present PMD and PTEs are write protected in the memory region. 1066 + * Afterwards read of dirty page log can be called. 1067 + * 1068 + * Acquires kvm_mmu_lock. 
Called with kvm->slots_lock mutex acquired, 1069 + * serializing operations for VM memory regions. 1070 + */ 1071 + void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot) 1072 + { 1073 + struct kvm_memory_slot *memslot = id_to_memslot(kvm->memslots, slot); 1074 + phys_addr_t start = memslot->base_gfn << PAGE_SHIFT; 1075 + phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT; 1076 + 1077 + spin_lock(&kvm->mmu_lock); 1078 + stage2_wp_range(kvm, start, end); 1079 + spin_unlock(&kvm->mmu_lock); 1080 + kvm_flush_remote_tlbs(kvm); 1081 + } 1082 + 1083 + /** 1084 + * kvm_mmu_write_protect_pt_masked() - write protect dirty pages 1085 + * @kvm: The KVM pointer 1086 + * @slot: The memory slot associated with mask 1087 + * @gfn_offset: The gfn offset in memory slot 1088 + * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory 1089 + * slot to be write protected 1090 + * 1091 + * Walks bits set in mask write protects the associated pte's. Caller must 1092 + * acquire kvm_mmu_lock. 1093 + */ 1094 + static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, 1095 + struct kvm_memory_slot *slot, 1096 + gfn_t gfn_offset, unsigned long mask) 1097 + { 1098 + phys_addr_t base_gfn = slot->base_gfn + gfn_offset; 1099 + phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT; 1100 + phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; 1101 + 1102 + stage2_wp_range(kvm, start, end); 1103 + } 1104 + 1105 + /* 1106 + * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected 1107 + * dirty pages. 1108 + * 1109 + * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to 1110 + * enable dirty logging for them. 
1111 + */ 1112 + void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, 1113 + struct kvm_memory_slot *slot, 1114 + gfn_t gfn_offset, unsigned long mask) 1115 + { 1116 + kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask); 1117 + } 1118 + 1012 1119 static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn, 1013 1120 unsigned long size, bool uncached) 1014 1121 { ··· 1188 977 pfn_t pfn; 1189 978 pgprot_t mem_type = PAGE_S2; 1190 979 bool fault_ipa_uncached; 980 + bool logging_active = memslot_is_logging(memslot); 981 + unsigned long flags = 0; 1191 982 1192 983 write_fault = kvm_is_write_fault(vcpu); 1193 984 if (fault_status == FSC_PERM && !write_fault) { ··· 1206 993 return -EFAULT; 1207 994 } 1208 995 1209 - if (is_vm_hugetlb_page(vma)) { 996 + if (is_vm_hugetlb_page(vma) && !logging_active) { 1210 997 hugetlb = true; 1211 998 gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT; 1212 999 } else { ··· 1247 1034 if (is_error_pfn(pfn)) 1248 1035 return -EFAULT; 1249 1036 1250 - if (kvm_is_device_pfn(pfn)) 1037 + if (kvm_is_device_pfn(pfn)) { 1251 1038 mem_type = PAGE_S2_DEVICE; 1039 + flags |= KVM_S2PTE_FLAG_IS_IOMAP; 1040 + } else if (logging_active) { 1041 + /* 1042 + * Faults on pages in a memslot with logging enabled 1043 + * should not be mapped with huge pages (it introduces churn 1044 + * and performance degradation), so force a pte mapping. 1045 + */ 1046 + force_pte = true; 1047 + flags |= KVM_S2_FLAG_LOGGING_ACTIVE; 1048 + 1049 + /* 1050 + * Only actually map the page as writable if this was a write 1051 + * fault. 
1052 + */ 1053 + if (!write_fault) 1054 + writable = false; 1055 + } 1252 1056 1253 1057 spin_lock(&kvm->mmu_lock); 1254 1058 if (mmu_notifier_retry(kvm, mmu_seq)) 1255 1059 goto out_unlock; 1060 + 1256 1061 if (!hugetlb && !force_pte) 1257 1062 hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa); 1258 1063 ··· 1287 1056 ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd); 1288 1057 } else { 1289 1058 pte_t new_pte = pfn_pte(pfn, mem_type); 1059 + 1290 1060 if (writable) { 1291 1061 kvm_set_s2pte_writable(&new_pte); 1292 1062 kvm_set_pfn_dirty(pfn); 1063 + mark_page_dirty(kvm, gfn); 1293 1064 } 1294 1065 coherent_cache_guest_page(vcpu, pfn, PAGE_SIZE, fault_ipa_uncached); 1295 - ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, 1296 - pgprot_val(mem_type) == pgprot_val(PAGE_S2_DEVICE)); 1066 + ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, flags); 1297 1067 } 1298 - 1299 1068 1300 1069 out_unlock: 1301 1070 spin_unlock(&kvm->mmu_lock); ··· 1446 1215 { 1447 1216 pte_t *pte = (pte_t *)data; 1448 1217 1449 - stage2_set_pte(kvm, NULL, gpa, pte, false); 1218 + /* 1219 + * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE 1220 + * flag clear because MMU notifiers will have unmapped a huge PMD before 1221 + * calling ->change_pte() (which in turn calls kvm_set_spte_hva()) and 1222 + * therefore stage2_set_pte() never needs to clear out a huge PMD 1223 + * through this calling path. 1224 + */ 1225 + stage2_set_pte(kvm, NULL, gpa, pte, 0); 1450 1226 } 1451 1227 1452 1228 ··· 1586 1348 const struct kvm_memory_slot *old, 1587 1349 enum kvm_mr_change change) 1588 1350 { 1351 + /* 1352 + * At this point memslot has been committed and there is an 1353 + * allocated dirty_bitmap[], dirty pages will be be tracked while the 1354 + * memory slot is write protected. 
1355 + */ 1356 + if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) 1357 + kvm_mmu_wp_memory_region(kvm, mem->slot); 1589 1358 } 1590 1359 1591 1360 int kvm_arch_prepare_memory_region(struct kvm *kvm, ··· 1605 1360 bool writable = !(mem->flags & KVM_MEM_READONLY); 1606 1361 int ret = 0; 1607 1362 1608 - if (change != KVM_MR_CREATE && change != KVM_MR_MOVE) 1363 + if (change != KVM_MR_CREATE && change != KVM_MR_MOVE && 1364 + change != KVM_MR_FLAGS_ONLY) 1609 1365 return 0; 1610 1366 1611 1367 /* ··· 1657 1411 phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + 1658 1412 vm_start - vma->vm_start; 1659 1413 1414 + /* IO region dirty page logging not allowed */ 1415 + if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) 1416 + return -EINVAL; 1417 + 1660 1418 ret = kvm_phys_addr_ioremap(kvm, gpa, pa, 1661 1419 vm_end - vm_start, 1662 1420 writable); ··· 1669 1419 } 1670 1420 hva = vm_end; 1671 1421 } while (hva < reg_end); 1422 + 1423 + if (change == KVM_MR_FLAGS_ONLY) 1424 + return ret; 1672 1425 1673 1426 spin_lock(&kvm->mmu_lock); 1674 1427 if (ret)
+5 -12
arch/arm/kvm/psci.c
···
 #include <asm/cputype.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_psci.h>
+#include <asm/kvm_host.h>
 
 /*
  * This is an implementation of the Power State Coordination Interface
···
 static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
 {
 	struct kvm *kvm = source_vcpu->kvm;
-	struct kvm_vcpu *vcpu = NULL, *tmp;
+	struct kvm_vcpu *vcpu = NULL;
 	wait_queue_head_t *wq;
 	unsigned long cpu_id;
 	unsigned long context_id;
-	unsigned long mpidr;
 	phys_addr_t target_pc;
-	int i;
 
-	cpu_id = *vcpu_reg(source_vcpu, 1);
+	cpu_id = *vcpu_reg(source_vcpu, 1) & MPIDR_HWID_BITMASK;
 	if (vcpu_mode_is_32bit(source_vcpu))
 		cpu_id &= ~((u32) 0);
 
-	kvm_for_each_vcpu(i, tmp, kvm) {
-		mpidr = kvm_vcpu_get_mpidr(tmp);
-		if ((mpidr & MPIDR_HWID_BITMASK) == (cpu_id & MPIDR_HWID_BITMASK)) {
-			vcpu = tmp;
-			break;
-		}
-	}
+	vcpu = kvm_mpidr_to_vcpu(kvm, cpu_id);
 
 	/*
 	 * Make sure the caller requested a valid CPU and that the CPU is
···
 	 * then ON else OFF
 	 */
 	kvm_for_each_vcpu(i, tmp, kvm) {
-		mpidr = kvm_vcpu_get_mpidr(tmp);
+		mpidr = kvm_vcpu_get_mpidr_aff(tmp);
 		if (((mpidr & target_affinity_mask) == target_affinity) &&
 		    !tmp->arch.pause) {
 			return PSCI_0_2_AFFINITY_LEVEL_ON;
+7 -4
arch/arm/kvm/trace.h
··· 140 140 __entry->CRm, __entry->Op2) 141 141 ); 142 142 143 - TRACE_EVENT(kvm_wfi, 144 - TP_PROTO(unsigned long vcpu_pc), 145 - TP_ARGS(vcpu_pc), 143 + TRACE_EVENT(kvm_wfx, 144 + TP_PROTO(unsigned long vcpu_pc, bool is_wfe), 145 + TP_ARGS(vcpu_pc, is_wfe), 146 146 147 147 TP_STRUCT__entry( 148 148 __field( unsigned long, vcpu_pc ) 149 + __field( bool, is_wfe ) 149 150 ), 150 151 151 152 TP_fast_assign( 152 153 __entry->vcpu_pc = vcpu_pc; 154 + __entry->is_wfe = is_wfe; 153 155 ), 154 156 155 - TP_printk("guest executed wfi at: 0x%08lx", __entry->vcpu_pc) 157 + TP_printk("guest executed wf%c at: 0x%08lx", 158 + __entry->is_wfe ? 'e' : 'i', __entry->vcpu_pc) 156 159 ); 157 160 158 161 TRACE_EVENT(kvm_unmap_hva,
+1
arch/arm64/include/asm/esr.h
··· 96 96 #define ESR_ELx_COND_SHIFT (20) 97 97 #define ESR_ELx_COND_MASK (UL(0xF) << ESR_ELx_COND_SHIFT) 98 98 #define ESR_ELx_WFx_ISS_WFE (UL(1) << 0) 99 + #define ESR_ELx_xVC_IMM_MASK ((1UL << 16) - 1) 99 100 100 101 #ifndef __ASSEMBLY__ 101 102 #include <asm/types.h>
+1
arch/arm64/include/asm/kvm_asm.h
··· 126 126 127 127 extern void __kvm_flush_vm_context(void); 128 128 extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa); 129 + extern void __kvm_tlb_flush_vmid(struct kvm *kvm); 129 130 130 131 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu); 131 132
+8 -2
arch/arm64/include/asm/kvm_emulate.h
··· 29 29 #include <asm/kvm_asm.h> 30 30 #include <asm/kvm_mmio.h> 31 31 #include <asm/ptrace.h> 32 + #include <asm/cputype.h> 32 33 33 34 unsigned long *vcpu_reg32(const struct kvm_vcpu *vcpu, u8 reg_num); 34 35 unsigned long *vcpu_spsr32(const struct kvm_vcpu *vcpu); ··· 141 140 return ((phys_addr_t)vcpu->arch.fault.hpfar_el2 & HPFAR_MASK) << 8; 142 141 } 143 142 143 + static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu) 144 + { 145 + return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK; 146 + } 147 + 144 148 static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu) 145 149 { 146 150 return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV); ··· 207 201 return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC_TYPE; 208 202 } 209 203 210 - static inline unsigned long kvm_vcpu_get_mpidr(struct kvm_vcpu *vcpu) 204 + static inline unsigned long kvm_vcpu_get_mpidr_aff(struct kvm_vcpu *vcpu) 211 205 { 212 - return vcpu_sys_reg(vcpu, MPIDR_EL1); 206 + return vcpu_sys_reg(vcpu, MPIDR_EL1) & MPIDR_HWID_BITMASK; 213 207 } 214 208 215 209 static inline void kvm_vcpu_set_be(struct kvm_vcpu *vcpu)
+7
arch/arm64/include/asm/kvm_host.h
··· 59 59 /* VTTBR value associated with above pgd and vmid */ 60 60 u64 vttbr; 61 61 62 + /* The maximum number of vCPUs depends on the used GIC model */ 63 + int max_vcpus; 64 + 62 65 /* Interrupt controller */ 63 66 struct vgic_dist vgic; 64 67 ··· 162 159 }; 163 160 164 161 struct kvm_vcpu_stat { 162 + u32 halt_successful_poll; 165 163 u32 halt_wakeup; 166 164 }; 167 165 ··· 200 196 201 197 u64 kvm_call_hyp(void *hypfn, ...); 202 198 void force_vm_exit(const cpumask_t *mask); 199 + void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot); 203 200 204 201 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, 205 202 int exception_index); 206 203 207 204 int kvm_perf_init(void); 208 205 int kvm_perf_teardown(void); 206 + 207 + struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr); 209 208 210 209 static inline void __cpu_init_hyp_mode(phys_addr_t boot_pgd_ptr, 211 210 phys_addr_t pgd_ptr,
+1
arch/arm64/include/asm/kvm_mmio.h
··· 40 40 u8 data[8]; 41 41 u32 len; 42 42 bool is_write; 43 + void *private; 43 44 }; 44 45 45 46 static inline void kvm_prepare_mmio(struct kvm_run *run,
+21
arch/arm64/include/asm/kvm_mmu.h
··· 118 118 pmd_val(*pmd) |= PMD_S2_RDWR; 119 119 } 120 120 121 + static inline void kvm_set_s2pte_readonly(pte_t *pte) 122 + { 123 + pte_val(*pte) = (pte_val(*pte) & ~PTE_S2_RDWR) | PTE_S2_RDONLY; 124 + } 125 + 126 + static inline bool kvm_s2pte_readonly(pte_t *pte) 127 + { 128 + return (pte_val(*pte) & PTE_S2_RDWR) == PTE_S2_RDONLY; 129 + } 130 + 131 + static inline void kvm_set_s2pmd_readonly(pmd_t *pmd) 132 + { 133 + pmd_val(*pmd) = (pmd_val(*pmd) & ~PMD_S2_RDWR) | PMD_S2_RDONLY; 134 + } 135 + 136 + static inline bool kvm_s2pmd_readonly(pmd_t *pmd) 137 + { 138 + return (pmd_val(*pmd) & PMD_S2_RDWR) == PMD_S2_RDONLY; 139 + } 140 + 141 + 121 142 #define kvm_pgd_addr_end(addr, end) pgd_addr_end(addr, end) 122 143 #define kvm_pud_addr_end(addr, end) pud_addr_end(addr, end) 123 144 #define kvm_pmd_addr_end(addr, end) pmd_addr_end(addr, end)
+1
arch/arm64/include/asm/pgtable-hwdef.h
··· 119 119 #define PTE_S2_RDONLY (_AT(pteval_t, 1) << 6) /* HAP[2:1] */ 120 120 #define PTE_S2_RDWR (_AT(pteval_t, 3) << 6) /* HAP[2:1] */ 121 121 122 + #define PMD_S2_RDONLY (_AT(pmdval_t, 1) << 6) /* HAP[2:1] */ 122 123 #define PMD_S2_RDWR (_AT(pmdval_t, 3) << 6) /* HAP[2:1] */ 123 124 124 125 /*
+9
arch/arm64/include/uapi/asm/kvm.h
··· 78 78 #define KVM_VGIC_V2_DIST_SIZE 0x1000 79 79 #define KVM_VGIC_V2_CPU_SIZE 0x2000 80 80 81 + /* Supported VGICv3 address types */ 82 + #define KVM_VGIC_V3_ADDR_TYPE_DIST 2 83 + #define KVM_VGIC_V3_ADDR_TYPE_REDIST 3 84 + 85 + #define KVM_VGIC_V3_DIST_SIZE SZ_64K 86 + #define KVM_VGIC_V3_REDIST_SIZE (2 * SZ_64K) 87 + 81 88 #define KVM_ARM_VCPU_POWER_OFF 0 /* CPU is started in OFF state */ 82 89 #define KVM_ARM_VCPU_EL1_32BIT 1 /* CPU running a 32bit VM */ 83 90 #define KVM_ARM_VCPU_PSCI_0_2 2 /* CPU uses PSCI v0.2 */ ··· 168 161 #define KVM_DEV_ARM_VGIC_OFFSET_SHIFT 0 169 162 #define KVM_DEV_ARM_VGIC_OFFSET_MASK (0xffffffffULL << KVM_DEV_ARM_VGIC_OFFSET_SHIFT) 170 163 #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS 3 164 + #define KVM_DEV_ARM_VGIC_GRP_CTRL 4 165 + #define KVM_DEV_ARM_VGIC_CTRL_INIT 0 171 166 172 167 /* KVM_IRQ_LINE irq field index values */ 173 168 #define KVM_ARM_IRQ_TYPE_SHIFT 24
+1
arch/arm64/kernel/asm-offsets.c
··· 140 140 DEFINE(VGIC_V2_CPU_ELRSR, offsetof(struct vgic_cpu, vgic_v2.vgic_elrsr)); 141 141 DEFINE(VGIC_V2_CPU_APR, offsetof(struct vgic_cpu, vgic_v2.vgic_apr)); 142 142 DEFINE(VGIC_V2_CPU_LR, offsetof(struct vgic_cpu, vgic_v2.vgic_lr)); 143 + DEFINE(VGIC_V3_CPU_SRE, offsetof(struct vgic_cpu, vgic_v3.vgic_sre)); 143 144 DEFINE(VGIC_V3_CPU_HCR, offsetof(struct vgic_cpu, vgic_v3.vgic_hcr)); 144 145 DEFINE(VGIC_V3_CPU_VMCR, offsetof(struct vgic_cpu, vgic_v3.vgic_vmcr)); 145 146 DEFINE(VGIC_V3_CPU_MISR, offsetof(struct vgic_cpu, vgic_v3.vgic_misr));
+2
arch/arm64/kvm/Kconfig
··· 22 22 select PREEMPT_NOTIFIERS 23 23 select ANON_INODES 24 24 select HAVE_KVM_CPU_RELAX_INTERCEPT 25 + select HAVE_KVM_ARCH_TLB_FLUSH_ALL 25 26 select KVM_MMIO 26 27 select KVM_ARM_HOST 27 28 select KVM_ARM_VGIC 28 29 select KVM_ARM_TIMER 30 + select KVM_GENERIC_DIRTYLOG_READ_PROTECT 29 31 select SRCU 30 32 ---help--- 31 33 Support hosting virtualized guest machines.
+2
arch/arm64/kvm/Makefile
··· 21 21 22 22 kvm-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic.o 23 23 kvm-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic-v2.o 24 + kvm-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic-v2-emul.o 24 25 kvm-$(CONFIG_KVM_ARM_VGIC) += vgic-v2-switch.o 25 26 kvm-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic-v3.o 27 + kvm-$(CONFIG_KVM_ARM_VGIC) += $(KVM)/arm/vgic-v3-emul.o 26 28 kvm-$(CONFIG_KVM_ARM_VGIC) += vgic-v3-switch.o 27 29 kvm-$(CONFIG_KVM_ARM_TIMER) += $(KVM)/arm/arch_timer.o
+11 -2
arch/arm64/kvm/handle_exit.c
··· 28 28 #include <asm/kvm_mmu.h> 29 29 #include <asm/kvm_psci.h> 30 30 31 + #define CREATE_TRACE_POINTS 32 + #include "trace.h" 33 + 31 34 typedef int (*exit_handle_fn)(struct kvm_vcpu *, struct kvm_run *); 32 35 33 36 static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run) 34 37 { 35 38 int ret; 39 + 40 + trace_kvm_hvc_arm64(*vcpu_pc(vcpu), *vcpu_reg(vcpu, 0), 41 + kvm_vcpu_hvc_get_imm(vcpu)); 36 42 37 43 ret = kvm_psci_call(vcpu); 38 44 if (ret < 0) { ··· 69 63 */ 70 64 static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run) 71 65 { 72 - if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) 66 + if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) { 67 + trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true); 73 68 kvm_vcpu_on_spin(vcpu); 74 - else 69 + } else { 70 + trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false); 75 71 kvm_vcpu_block(vcpu); 72 + } 76 73 77 74 kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu)); 78 75
+22
arch/arm64/kvm/hyp.S
··· 1032 1032 ret 1033 1033 ENDPROC(__kvm_tlb_flush_vmid_ipa) 1034 1034 1035 + /** 1036 + * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs 1037 + * @struct kvm *kvm - pointer to kvm structure 1038 + * 1039 + * Invalidates all Stage 1 and 2 TLB entries for current VMID. 1040 + */ 1041 + ENTRY(__kvm_tlb_flush_vmid) 1042 + dsb ishst 1043 + 1044 + kern_hyp_va x0 1045 + ldr x2, [x0, #KVM_VTTBR] 1046 + msr vttbr_el2, x2 1047 + isb 1048 + 1049 + tlbi vmalls12e1is 1050 + dsb ish 1051 + isb 1052 + 1053 + msr vttbr_el2, xzr 1054 + ret 1055 + ENDPROC(__kvm_tlb_flush_vmid) 1056 + 1035 1057 ENTRY(__kvm_flush_vm_context) 1036 1058 dsb ishst 1037 1059 tlbi alle1is
+38 -2
arch/arm64/kvm/sys_regs.c
··· 113 113 return true; 114 114 } 115 115 116 + /* 117 + * Trap handler for the GICv3 SGI generation system register. 118 + * Forward the request to the VGIC emulation. 119 + * The cp15_64 code makes sure this automatically works 120 + * for both AArch64 and AArch32 accesses. 121 + */ 122 + static bool access_gic_sgi(struct kvm_vcpu *vcpu, 123 + const struct sys_reg_params *p, 124 + const struct sys_reg_desc *r) 125 + { 126 + u64 val; 127 + 128 + if (!p->is_write) 129 + return read_from_write_only(vcpu, p); 130 + 131 + val = *vcpu_reg(vcpu, p->Rt); 132 + vgic_v3_dispatch_sgi(vcpu, val); 133 + 134 + return true; 135 + } 136 + 116 137 static bool trap_raz_wi(struct kvm_vcpu *vcpu, 117 138 const struct sys_reg_params *p, 118 139 const struct sys_reg_desc *r) ··· 221 200 222 201 static void reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 223 202 { 203 + u64 mpidr; 204 + 224 205 /* 225 - * Simply map the vcpu_id into the Aff0 field of the MPIDR. 206 + * Map the vcpu_id into the first three affinity level fields of 207 + * the MPIDR. We limit the number of VCPUs in level 0 due to a 208 + * limitation to 16 CPUs in that level in the ICC_SGIxR registers 209 + * of the GICv3 to be able to address each CPU directly when 210 + * sending IPIs. 226 211 */ 227 - vcpu_sys_reg(vcpu, MPIDR_EL1) = (1UL << 31) | (vcpu->vcpu_id & 0xff); 212 + mpidr = (vcpu->vcpu_id & 0x0f) << MPIDR_LEVEL_SHIFT(0); 213 + mpidr |= ((vcpu->vcpu_id >> 4) & 0xff) << MPIDR_LEVEL_SHIFT(1); 214 + mpidr |= ((vcpu->vcpu_id >> 12) & 0xff) << MPIDR_LEVEL_SHIFT(2); 215 + vcpu_sys_reg(vcpu, MPIDR_EL1) = (1ULL << 31) | mpidr; 228 216 } 229 217 230 218 /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */ ··· 403 373 { Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b0000), Op2(0b000), 404 374 NULL, reset_val, VBAR_EL1, 0 }, 405 375 376 + /* ICC_SGI1R_EL1 */ 377 + { Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b1011), Op2(0b101), 378 + access_gic_sgi }, 406 379 /* ICC_SRE_EL1 */ 407 380 { Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b1100), Op2(0b101), 408 381 trap_raz_wi }, ··· 638 605 * register). 639 606 */ 640 607 static const struct sys_reg_desc cp15_regs[] = { 608 + { Op1( 0), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, 609 + 641 610 { Op1( 0), CRn( 1), CRm( 0), Op2( 0), access_vm_reg, NULL, c1_SCTLR }, 642 611 { Op1( 0), CRn( 2), CRm( 0), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, 643 612 { Op1( 0), CRn( 2), CRm( 0), Op2( 1), access_vm_reg, NULL, c2_TTBR1 }, ··· 687 652 688 653 static const struct sys_reg_desc cp15_64_regs[] = { 689 654 { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, 655 + { Op1( 0), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, 690 656 { Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR1 }, 691 657 }; 692 658
+55
arch/arm64/kvm/trace.h
··· 1 + #if !defined(_TRACE_ARM64_KVM_H) || defined(TRACE_HEADER_MULTI_READ) 2 + #define _TRACE_ARM64_KVM_H 3 + 4 + #include <linux/tracepoint.h> 5 + 6 + #undef TRACE_SYSTEM 7 + #define TRACE_SYSTEM kvm 8 + 9 + TRACE_EVENT(kvm_wfx_arm64, 10 + TP_PROTO(unsigned long vcpu_pc, bool is_wfe), 11 + TP_ARGS(vcpu_pc, is_wfe), 12 + 13 + TP_STRUCT__entry( 14 + __field(unsigned long, vcpu_pc) 15 + __field(bool, is_wfe) 16 + ), 17 + 18 + TP_fast_assign( 19 + __entry->vcpu_pc = vcpu_pc; 20 + __entry->is_wfe = is_wfe; 21 + ), 22 + 23 + TP_printk("guest executed wf%c at: 0x%08lx", 24 + __entry->is_wfe ? 'e' : 'i', __entry->vcpu_pc) 25 + ); 26 + 27 + TRACE_EVENT(kvm_hvc_arm64, 28 + TP_PROTO(unsigned long vcpu_pc, unsigned long r0, unsigned long imm), 29 + TP_ARGS(vcpu_pc, r0, imm), 30 + 31 + TP_STRUCT__entry( 32 + __field(unsigned long, vcpu_pc) 33 + __field(unsigned long, r0) 34 + __field(unsigned long, imm) 35 + ), 36 + 37 + TP_fast_assign( 38 + __entry->vcpu_pc = vcpu_pc; 39 + __entry->r0 = r0; 40 + __entry->imm = imm; 41 + ), 42 + 43 + TP_printk("HVC at 0x%08lx (r0: 0x%08lx, imm: 0x%lx)", 44 + __entry->vcpu_pc, __entry->r0, __entry->imm) 45 + ); 46 + 47 + #endif /* _TRACE_ARM64_KVM_H */ 48 + 49 + #undef TRACE_INCLUDE_PATH 50 + #define TRACE_INCLUDE_PATH . 51 + #undef TRACE_INCLUDE_FILE 52 + #define TRACE_INCLUDE_FILE trace 53 + 54 + /* This part must be outside protection */ 55 + #include <trace/define_trace.h>
+9 -5
arch/arm64/kvm/vgic-v3-switch.S
··· 148 148 * x0: Register pointing to VCPU struct 149 149 */ 150 150 .macro restore_vgic_v3_state 151 - // Disable SRE_EL1 access. Necessary, otherwise 152 - // ICH_VMCR_EL2.VFIQEn becomes one, and FIQ happens... 153 - msr_s ICC_SRE_EL1, xzr 154 - isb 155 - 156 151 // Compute the address of struct vgic_cpu 157 152 add x3, x0, #VCPU_VGIC_CPU 158 153 159 154 // Restore all interesting registers 160 155 ldr w4, [x3, #VGIC_V3_CPU_HCR] 161 156 ldr w5, [x3, #VGIC_V3_CPU_VMCR] 157 + ldr w25, [x3, #VGIC_V3_CPU_SRE] 158 + 159 + msr_s ICC_SRE_EL1, x25 160 + 161 + // make sure SRE is valid before writing the other registers 162 + isb 162 163 163 164 msr_s ICH_HCR_EL2, x4 164 165 msr_s ICH_VMCR_EL2, x5 ··· 245 244 dsb sy 246 245 247 246 // Prevent the guest from touching the GIC system registers 247 + // if SRE isn't enabled for GICv3 emulation 248 + cbnz x25, 1f 248 249 mrs_s x5, ICC_SRE_EL2 249 250 and x5, x5, #~ICC_SRE_EL2_ENABLE 250 251 msr_s ICC_SRE_EL2, x5 252 + 1: 251 253 .endm 252 254 253 255 ENTRY(__save_vgic_v3_state)
-1
arch/ia64/include/uapi/asm/Kbuild
··· 18 18 header-y += ioctl.h 19 19 header-y += ioctls.h 20 20 header-y += ipcbuf.h 21 - header-y += kvm.h 22 21 header-y += kvm_para.h 23 22 header-y += mman.h 24 23 header-y += msgbuf.h
+1
arch/mips/include/asm/kvm_host.h
··· 120 120 u32 resvd_inst_exits; 121 121 u32 break_inst_exits; 122 122 u32 flush_dcache_exits; 123 + u32 halt_successful_poll; 123 124 u32 halt_wakeup; 124 125 }; 125 126
+1 -1
arch/mips/kvm/locore.S
··· 434 434 /* Setup status register for running guest in UM */ 435 435 .set at 436 436 or v1, v1, (ST0_EXL | KSU_USER | ST0_IE) 437 - and v1, v1, ~ST0_CU0 437 + and v1, v1, ~(ST0_CU0 | ST0_MX) 438 438 .set noat 439 439 mtc0 v1, CP0_STATUS 440 440 ehb
+18 -5
arch/mips/kvm/mips.c
··· 15 15 #include <linux/vmalloc.h> 16 16 #include <linux/fs.h> 17 17 #include <linux/bootmem.h> 18 + #include <asm/fpu.h> 18 19 #include <asm/page.h> 19 20 #include <asm/cacheflush.h> 20 21 #include <asm/mmu_context.h> 22 + #include <asm/pgtable.h> 21 23 22 24 #include <linux/kvm_host.h> 23 25 ··· 49 47 { "resvd_inst", VCPU_STAT(resvd_inst_exits), KVM_STAT_VCPU }, 50 48 { "break_inst", VCPU_STAT(break_inst_exits), KVM_STAT_VCPU }, 51 49 { "flush_dcache", VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU }, 50 + { "halt_successful_poll", VCPU_STAT(halt_successful_poll), KVM_STAT_VCPU }, 52 51 { "halt_wakeup", VCPU_STAT(halt_wakeup), KVM_STAT_VCPU }, 53 52 {NULL} 54 53 }; ··· 381 378 vcpu->mmio_needed = 0; 382 379 } 383 380 381 + lose_fpu(1); 382 + 384 383 local_irq_disable(); 385 384 /* Check if we have any exceptions/interrupts pending */ 386 385 kvm_mips_deliver_interrupts(vcpu, ··· 390 385 391 386 kvm_guest_enter(); 392 387 388 + /* Disable hardware page table walking while in guest */ 389 + htw_stop(); 390 + 393 391 r = __kvm_mips_vcpu_run(run, vcpu); 392 + 393 + /* Re-enable HTW before enabling interrupts */ 394 + htw_start(); 394 395 395 396 kvm_guest_exit(); 396 397 local_irq_enable(); ··· 843 832 return -ENOIOCTLCMD; 844 833 } 845 834 846 - int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 835 + void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 847 836 { 848 - return 0; 849 837 } 850 838 851 839 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) ··· 990 980 { 991 981 uint32_t status = read_c0_status(); 992 982 993 - if (cpu_has_fpu) 994 - status |= (ST0_CU1); 995 - 996 983 if (cpu_has_dsp) 997 984 status |= (ST0_MX); 998 985 ··· 1008 1001 unsigned long badvaddr = vcpu->arch.host_cp0_badvaddr; 1009 1002 enum emulation_result er = EMULATE_DONE; 1010 1003 int ret = RESUME_GUEST; 1004 + 1005 + /* re-enable HTW before enabling interrupts */ 1006 + htw_start(); 1011 1007 1012 1008 /* Set a default exit reason */ 1013 1009 run->exit_reason = KVM_EXIT_UNKNOWN; ··· 1145 1135 trace_kvm_exit(vcpu, SIGNAL_EXITS); 1146 1136 } 1147 1137 } 1138 + 1139 + /* Disable HTW before returning to guest or host */ 1140 + htw_stop(); 1148 1141 1149 1142 return ret; 1150 1143 }
+1
arch/powerpc/include/asm/kvm_host.h
··· 107 107 u32 emulated_inst_exits; 108 108 u32 dec_exits; 109 109 u32 ext_intr_exits; 110 + u32 halt_successful_poll; 110 111 u32 halt_wakeup; 111 112 u32 dbell_exits; 112 113 u32 gdbell_exits;
+1
arch/powerpc/kvm/book3s.c
··· 52 52 { "dec", VCPU_STAT(dec_exits) }, 53 53 { "ext_intr", VCPU_STAT(ext_intr_exits) }, 54 54 { "queue_intr", VCPU_STAT(queue_intr) }, 55 + { "halt_successful_poll", VCPU_STAT(halt_successful_poll), }, 55 56 { "halt_wakeup", VCPU_STAT(halt_wakeup) }, 56 57 { "pf_storage", VCPU_STAT(pf_storage) }, 57 58 { "sp_storage", VCPU_STAT(sp_storage) },
+1
arch/powerpc/kvm/booke.c
··· 62 62 { "inst_emu", VCPU_STAT(emulated_inst_exits) }, 63 63 { "dec", VCPU_STAT(dec_exits) }, 64 64 { "ext_intr", VCPU_STAT(ext_intr_exits) }, 65 + { "halt_successful_poll", VCPU_STAT(halt_successful_poll) }, 65 66 { "halt_wakeup", VCPU_STAT(halt_wakeup) }, 66 67 { "doorbell", VCPU_STAT(dbell_exits) }, 67 68 { "guest doorbell", VCPU_STAT(gdbell_exits) },
+1 -2
arch/powerpc/kvm/powerpc.c
··· 623 623 return vcpu; 624 624 } 625 625 626 - int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 626 + void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 627 627 { 628 - return 0; 629 628 } 630 629 631 630 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
+44 -12
arch/s390/include/asm/kvm_host.h
··· 35 35 #define KVM_NR_IRQCHIPS 1 36 36 #define KVM_IRQCHIP_NUM_PINS 4096 37 37 38 - #define SIGP_CTRL_C 0x00800000 38 + #define SIGP_CTRL_C 0x80 39 + #define SIGP_CTRL_SCN_MASK 0x3f 39 40 40 41 struct sca_entry { 41 - atomic_t ctrl; 42 - __u32 reserved; 42 + __u8 reserved0; 43 + __u8 sigp_ctrl; 44 + __u16 reserved[3]; 43 45 __u64 sda; 44 46 __u64 reserved2[2]; 45 47 } __attribute__((packed)); ··· 89 87 atomic_t cpuflags; /* 0x0000 */ 90 88 __u32 : 1; /* 0x0004 */ 91 89 __u32 prefix : 18; 92 - __u32 : 13; 90 + __u32 : 1; 91 + __u32 ibc : 12; 93 92 __u8 reserved08[4]; /* 0x0008 */ 94 93 #define PROG_IN_SIE (1<<0) 95 94 __u32 prog0c; /* 0x000c */ ··· 135 132 __u8 reserved60; /* 0x0060 */ 136 133 __u8 ecb; /* 0x0061 */ 137 134 __u8 ecb2; /* 0x0062 */ 138 - __u8 reserved63[1]; /* 0x0063 */ 135 + #define ECB3_AES 0x04 136 + #define ECB3_DEA 0x08 137 + __u8 ecb3; /* 0x0063 */ 139 138 __u32 scaol; /* 0x0064 */ 140 139 __u8 reserved68[4]; /* 0x0068 */ 141 140 __u32 todpr; /* 0x006c */ ··· 164 159 __u64 tecmc; /* 0x00e8 */ 165 160 __u8 reservedf0[12]; /* 0x00f0 */ 166 161 #define CRYCB_FORMAT1 0x00000001 162 + #define CRYCB_FORMAT2 0x00000003 167 163 __u32 crycbd; /* 0x00fc */ 168 164 __u64 gcr[16]; /* 0x0100 */ 169 165 __u64 gbea; /* 0x0180 */ ··· 198 192 u32 exit_stop_request; 199 193 u32 exit_validity; 200 194 u32 exit_instruction; 195 + u32 halt_successful_poll; 201 196 u32 halt_wakeup; 202 197 u32 instruction_lctl; 203 198 u32 instruction_lctlg; ··· 385 378 struct kvm_s390_emerg_info emerg; 386 379 struct kvm_s390_extcall_info extcall; 387 380 struct kvm_s390_prefix_info prefix; 381 + struct kvm_s390_stop_info stop; 388 382 struct kvm_s390_mchk_info mchk; 389 383 }; 390 384 }; 391 - 392 - /* for local_interrupt.action_flags */ 393 - #define ACTION_STORE_ON_STOP (1<<0) 394 - #define ACTION_STOP_ON_STOP (1<<1) 396 386 struct kvm_s390_irq_payload { 397 387 struct kvm_s390_io_info io; ··· 397 393 struct kvm_s390_emerg_info emerg; 398 394 struct kvm_s390_extcall_info extcall; 399 395 struct kvm_s390_prefix_info prefix; 396 + struct kvm_s390_stop_info stop; 400 397 struct kvm_s390_mchk_info mchk; 401 398 }; 402 399 ··· 406 401 struct kvm_s390_float_interrupt *float_int; 407 402 wait_queue_head_t *wq; 408 403 atomic_t *cpuflags; 409 - unsigned int action_bits; 410 404 DECLARE_BITMAP(sigp_emerg_pending, KVM_MAX_VCPUS); 411 405 struct kvm_s390_irq_payload irq; 412 406 unsigned long pending_irqs; ··· 474 470 }; 475 471 struct gmap *gmap; 476 472 struct kvm_guestdbg_info_arch guestdbg; 477 - #define KVM_S390_PFAULT_TOKEN_INVALID (-1UL) 478 473 unsigned long pfault_token; 479 474 unsigned long pfault_select; 480 475 unsigned long pfault_compare; ··· 507 504 #define MAX_S390_IO_ADAPTERS ((MAX_ISC + 1) * 8) 508 505 #define MAX_S390_ADAPTER_MAPS 256 509 506 507 + /* maximum size of facilities and facility mask is 2k bytes */ 508 + #define S390_ARCH_FAC_LIST_SIZE_BYTE (1<<11) 509 + #define S390_ARCH_FAC_LIST_SIZE_U64 \ 510 + (S390_ARCH_FAC_LIST_SIZE_BYTE / sizeof(u64)) 511 + #define S390_ARCH_FAC_MASK_SIZE_BYTE S390_ARCH_FAC_LIST_SIZE_BYTE 512 + #define S390_ARCH_FAC_MASK_SIZE_U64 \ 513 + (S390_ARCH_FAC_MASK_SIZE_BYTE / sizeof(u64)) 514 + 515 + struct s390_model_fac { 516 + /* facilities used in SIE context */ 517 + __u64 sie[S390_ARCH_FAC_LIST_SIZE_U64]; 518 + /* subset enabled by kvm */ 519 + __u64 kvm[S390_ARCH_FAC_LIST_SIZE_U64]; 520 + }; 521 + 522 + struct kvm_s390_cpu_model { 523 + struct s390_model_fac *fac; 524 + struct cpuid cpu_id; 525 + unsigned short ibc; 526 + }; 527 + 510 528 struct kvm_s390_crypto { 511 529 struct kvm_s390_crypto_cb *crycb; 512 530 __u32 crycbd; 531 + __u8 aes_kw; 532 + __u8 dea_kw; 513 533 }; 514 534 515 535 struct kvm_s390_crypto_cb { 516 - __u8 reserved00[128]; /* 0x0000 */ 536 + __u8 reserved00[72]; /* 0x0000 */ 537 + __u8 dea_wrapping_key_mask[24]; /* 0x0048 */ 538 + __u8 aes_wrapping_key_mask[32]; /* 0x0060 */ 539 + __u8 reserved80[128]; /* 0x0080 */ 517 540 }; 518 541 struct kvm_arch{ ··· 552 523 int use_irqchip; 553 524 int use_cmma; 554 525 int user_cpu_state_ctrl; 526 + int user_sigp; 555 527 struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS]; 556 528 wait_queue_head_t ipte_wq; 557 529 int ipte_lock_count; 558 530 struct mutex ipte_mutex; 559 531 spinlock_t start_stop_lock; 532 + struct kvm_s390_cpu_model model; 560 533 struct kvm_s390_crypto crypto; 534 + u64 epoch; 561 535 }; 562 536 563 537 #define KVM_HVA_ERR_BAD (-1UL)
+3 -1
arch/s390/include/asm/sclp.h
··· 31 31 u8 reserved0[2]; 32 32 u8 : 3; 33 33 u8 siif : 1; 34 - u8 : 4; 34 + u8 sigpif : 1; 35 + u8 : 3; 35 36 u8 reserved2[10]; 36 37 u8 type; 37 38 u8 reserved1; ··· 70 69 unsigned long sclp_get_hsa_size(void); 71 70 void sclp_early_detect(void); 72 71 int sclp_has_siif(void); 72 + int sclp_has_sigpif(void); 73 73 unsigned int sclp_get_ibc(void); 74 74 75 75 long _sclp_print_early(const char *);
+7 -3
arch/s390/include/asm/sysinfo.h
··· 15 15 #define __ASM_S390_SYSINFO_H 16 16 17 17 #include <asm/bitsperlong.h> 18 + #include <linux/uuid.h> 18 19 19 20 struct sysinfo_1_1_1 { 20 21 unsigned char p:1; ··· 117 116 char name[8]; 118 117 unsigned int caf; 119 118 char cpi[16]; 120 - char reserved_1[24]; 121 - 119 + char reserved_1[3]; 120 + char ext_name_encoding; 121 + unsigned int reserved_2; 122 + uuid_be uuid; 122 123 } vm[8]; 123 - char reserved_544[3552]; 124 + char reserved_3[1504]; 125 + char ext_names[8][256]; 124 126 }; 125 127 126 128 extern int topology_max_mnest;
+37
arch/s390/include/uapi/asm/kvm.h
··· 57 57 58 58 /* kvm attr_group on vm fd */ 59 59 #define KVM_S390_VM_MEM_CTRL 0 60 + #define KVM_S390_VM_TOD 1 61 + #define KVM_S390_VM_CRYPTO 2 62 + #define KVM_S390_VM_CPU_MODEL 3 60 63 61 64 /* kvm attributes for mem_ctrl */ 62 65 #define KVM_S390_VM_MEM_ENABLE_CMMA 0 63 66 #define KVM_S390_VM_MEM_CLR_CMMA 1 67 + #define KVM_S390_VM_MEM_LIMIT_SIZE 2 68 + 69 + /* kvm attributes for KVM_S390_VM_TOD */ 70 + #define KVM_S390_VM_TOD_LOW 0 71 + #define KVM_S390_VM_TOD_HIGH 1 72 + 73 + /* kvm attributes for KVM_S390_VM_CPU_MODEL */ 74 + /* processor related attributes are r/w */ 75 + #define KVM_S390_VM_CPU_PROCESSOR 0 76 + struct kvm_s390_vm_cpu_processor { 77 + __u64 cpuid; 78 + __u16 ibc; 79 + __u8 pad[6]; 80 + __u64 fac_list[256]; 81 + }; 82 + 83 + /* machine related attributes are r/o */ 84 + #define KVM_S390_VM_CPU_MACHINE 1 85 + struct kvm_s390_vm_cpu_machine { 86 + __u64 cpuid; 87 + __u32 ibc; 88 + __u8 pad[4]; 89 + __u64 fac_mask[256]; 90 + __u64 fac_list[256]; 91 + }; 92 + 93 + /* kvm attributes for crypto */ 94 + #define KVM_S390_VM_CRYPTO_ENABLE_AES_KW 0 95 + #define KVM_S390_VM_CRYPTO_ENABLE_DEA_KW 1 96 + #define KVM_S390_VM_CRYPTO_DISABLE_AES_KW 2 97 + #define KVM_S390_VM_CRYPTO_DISABLE_DEA_KW 3 64 98 65 99 /* for KVM_GET_REGS and KVM_SET_REGS */ 66 100 struct kvm_regs { ··· 140 106 __u32 pad; /* Should be set to 0 */ 141 107 struct kvm_hw_breakpoint __user *hw_bp; 142 108 }; 109 + 110 + /* for KVM_SYNC_PFAULT and KVM_REG_S390_PFTOKEN */ 111 + #define KVM_S390_PFAULT_TOKEN_INVALID 0xffffffffffffffffULL 143 112 144 113 #define KVM_SYNC_PREFIX (1UL << 0) 145 114 #define KVM_SYNC_GPRS (1UL << 1)
+29
arch/s390/kernel/sysinfo.c
··· 204 204 } 205 205 } 206 206 207 + static void print_ext_name(struct seq_file *m, int lvl, 208 + struct sysinfo_3_2_2 *info) 209 + { 210 + if (info->vm[lvl].ext_name_encoding == 0) 211 + return; 212 + if (info->ext_names[lvl][0] == 0) 213 + return; 214 + switch (info->vm[lvl].ext_name_encoding) { 215 + case 1: /* EBCDIC */ 216 + EBCASC(info->ext_names[lvl], sizeof(info->ext_names[lvl])); 217 + break; 218 + case 2: /* UTF-8 */ 219 + break; 220 + default: 221 + return; 222 + } 223 + seq_printf(m, "VM%02d Extended Name: %-.256s\n", lvl, 224 + info->ext_names[lvl]); 225 + } 226 + 227 + static void print_uuid(struct seq_file *m, int i, struct sysinfo_3_2_2 *info) 228 + { 229 + if (!memcmp(&info->vm[i].uuid, &NULL_UUID_BE, sizeof(uuid_be))) 230 + return; 231 + seq_printf(m, "VM%02d UUID: %pUb\n", i, &info->vm[i].uuid); 232 + } 233 + 207 234 static void stsi_3_2_2(struct seq_file *m, struct sysinfo_3_2_2 *info) 208 235 { 209 236 int i; ··· 248 221 seq_printf(m, "VM%02d CPUs Configured: %d\n", i, info->vm[i].cpus_configured); 249 222 seq_printf(m, "VM%02d CPUs Standby: %d\n", i, info->vm[i].cpus_standby); 250 223 seq_printf(m, "VM%02d CPUs Reserved: %d\n", i, info->vm[i].cpus_reserved); 224 + print_ext_name(m, i, info); 225 + print_uuid(m, i, info); 251 226 } 252 227 } 253 228
+2 -2
arch/s390/kvm/gaccess.c
··· 357 357 union asce asce; 358 358 359 359 ctlreg0.val = vcpu->arch.sie_block->gcr[0]; 360 - edat1 = ctlreg0.edat && test_vfacility(8); 361 - edat2 = edat1 && test_vfacility(78); 360 + edat1 = ctlreg0.edat && test_kvm_facility(vcpu->kvm, 8); 361 + edat2 = edat1 && test_kvm_facility(vcpu->kvm, 78); 362 362 asce.val = get_vcpu_asce(vcpu); 363 363 if (asce.r) 364 364 goto real_address;
+28 -15
arch/s390/kvm/intercept.c
··· 68 68 69 69 static int handle_stop(struct kvm_vcpu *vcpu) 70 70 { 71 + struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 71 72 int rc = 0; 72 - unsigned int action_bits; 73 + uint8_t flags, stop_pending; 73 74 74 75 vcpu->stat.exit_stop_request++; 75 - trace_kvm_s390_stop_request(vcpu->arch.local_int.action_bits); 76 76 77 - action_bits = vcpu->arch.local_int.action_bits; 78 - 79 - if (!(action_bits & ACTION_STOP_ON_STOP)) 77 + /* delay the stop if any non-stop irq is pending */ 78 + if (kvm_s390_vcpu_has_irq(vcpu, 1)) 80 79 return 0; 81 80 82 - if (action_bits & ACTION_STORE_ON_STOP) { 81 + /* avoid races with the injection/SIGP STOP code */ 82 + spin_lock(&li->lock); 83 + flags = li->irq.stop.flags; 84 + stop_pending = kvm_s390_is_stop_irq_pending(vcpu); 85 + spin_unlock(&li->lock); 86 + 87 + trace_kvm_s390_stop_request(stop_pending, flags); 88 + if (!stop_pending) 89 + return 0; 90 + 91 + if (flags & KVM_S390_STOP_FLAG_STORE_STATUS) { 83 92 rc = kvm_s390_vcpu_store_status(vcpu, 84 93 KVM_S390_STORE_STATUS_NOADDR); 85 94 if (rc) ··· 288 279 irq.type = KVM_S390_INT_CPU_TIMER; 289 280 break; 290 281 case EXT_IRQ_EXTERNAL_CALL: 291 - if (kvm_s390_si_ext_call_pending(vcpu)) 292 - return 0; 293 282 irq.type = KVM_S390_INT_EXTERNAL_CALL; 294 283 irq.u.extcall.code = vcpu->arch.sie_block->extcpuaddr; 295 - break; 284 + rc = kvm_s390_inject_vcpu(vcpu, &irq); 285 + /* ignore if another external call is already pending */ 286 + if (rc == -EBUSY) 287 + return 0; 288 + return rc; 296 289 default: 297 290 return -EOPNOTSUPP; 298 291 } ··· 318 307 kvm_s390_get_regs_rre(vcpu, &reg1, &reg2); 319 308 320 309 /* Make sure that the source is paged-in */ 321 - srcaddr = kvm_s390_real_to_abs(vcpu, vcpu->run->s.regs.gprs[reg2]); 322 - if (kvm_is_error_gpa(vcpu->kvm, srcaddr)) 323 - return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING); 310 + rc = guest_translate_address(vcpu, vcpu->run->s.regs.gprs[reg2], 311 + &srcaddr, 0); 312 + if (rc) 313 + return kvm_s390_inject_prog_cond(vcpu, rc); 324 314 rc = kvm_arch_fault_in_page(vcpu, srcaddr, 0); 325 315 if (rc != 0) 326 316 return rc; 327 317 328 318 /* Make sure that the destination is paged-in */ 329 - dstaddr = kvm_s390_real_to_abs(vcpu, vcpu->run->s.regs.gprs[reg1]); 330 - if (kvm_is_error_gpa(vcpu->kvm, dstaddr)) 331 - return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING); 319 + rc = guest_translate_address(vcpu, vcpu->run->s.regs.gprs[reg1], 320 + &dstaddr, 1); 321 + if (rc) 322 + return kvm_s390_inject_prog_cond(vcpu, rc); 332 323 rc = kvm_arch_fault_in_page(vcpu, dstaddr, 1); 333 324 if (rc != 0) 334 325 return rc;
+134 -57
arch/s390/kvm/interrupt.c
··· 19 19 #include <linux/bitmap.h> 20 20 #include <asm/asm-offsets.h> 21 21 #include <asm/uaccess.h> 22 + #include <asm/sclp.h> 22 23 #include "kvm-s390.h" 23 24 #include "gaccess.h" 24 25 #include "trace-s390.h" ··· 160 159 if (psw_mchk_disabled(vcpu)) 161 160 active_mask &= ~IRQ_PEND_MCHK_MASK; 162 161 162 + /* 163 + * STOP irqs will never be actively delivered. They are triggered via 164 + * intercept requests and cleared when the stop intercept is performed. 165 + */ 166 + __clear_bit(IRQ_PEND_SIGP_STOP, &active_mask); 167 + 163 168 return active_mask; 164 169 } 165 170 ··· 193 186 LCTL_CR10 | LCTL_CR11); 194 187 vcpu->arch.sie_block->ictl |= (ICTL_STCTL | ICTL_PINT); 195 188 } 196 - 197 - if (vcpu->arch.local_int.action_bits & ACTION_STOP_ON_STOP) 198 - atomic_set_mask(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags); 199 189 } 200 190 201 191 static void __set_cpuflag(struct kvm_vcpu *vcpu, u32 flag) ··· 220 216 vcpu->arch.sie_block->lctl |= LCTL_CR14; 221 217 } 222 218 219 + static void set_intercept_indicators_stop(struct kvm_vcpu *vcpu) 220 + { 221 + if (kvm_s390_is_stop_irq_pending(vcpu)) 222 + __set_cpuflag(vcpu, CPUSTAT_STOP_INT); 223 + } 224 + 223 225 /* Set interception request for non-deliverable local interrupts */ 224 226 static void set_intercept_indicators_local(struct kvm_vcpu *vcpu) 225 227 { 226 228 set_intercept_indicators_ext(vcpu); 227 229 set_intercept_indicators_mchk(vcpu); 230 + set_intercept_indicators_stop(vcpu); 228 231 } 229 232 230 233 static void __set_intercept_indicator(struct kvm_vcpu *vcpu, ··· 401 390 &vcpu->arch.sie_block->gpsw, sizeof(psw_t)); 402 391 clear_bit(IRQ_PEND_RESTART, &li->pending_irqs); 403 392 return rc ? -EFAULT : 0; 404 - } 405 - 406 - static int __must_check __deliver_stop(struct kvm_vcpu *vcpu) 407 - { 408 - VCPU_EVENT(vcpu, 4, "%s", "interrupt: cpu stop"); 409 - vcpu->stat.deliver_stop_signal++; 410 - trace_kvm_s390_deliver_interrupt(vcpu->vcpu_id, KVM_S390_SIGP_STOP, 411 - 0, 0); 412 - 413 - __set_cpuflag(vcpu, CPUSTAT_STOP_INT); 414 - clear_bit(IRQ_PEND_SIGP_STOP, &vcpu->arch.local_int.pending_irqs); 415 - return 0; 416 393 } 417 394 418 395 static int __must_check __deliver_set_prefix(struct kvm_vcpu *vcpu) ··· 704 705 [IRQ_PEND_EXT_CLOCK_COMP] = __deliver_ckc, 705 706 [IRQ_PEND_EXT_CPU_TIMER] = __deliver_cpu_timer, 706 707 [IRQ_PEND_RESTART] = __deliver_restart, 707 - [IRQ_PEND_SIGP_STOP] = __deliver_stop, 708 708 [IRQ_PEND_SET_PREFIX] = __deliver_set_prefix, 709 709 [IRQ_PEND_PFAULT_INIT] = __deliver_pfault_init, 710 710 }; ··· 736 738 return rc; 737 739 } 738 740 739 - /* Check whether SIGP interpretation facility has an external call pending */ 740 - int kvm_s390_si_ext_call_pending(struct kvm_vcpu *vcpu) 741 + /* Check whether an external call is pending (deliverable or not) */ 742 + int kvm_s390_ext_call_pending(struct kvm_vcpu *vcpu) 741 743 { 742 - atomic_t *sigp_ctrl = &vcpu->kvm->arch.sca->cpu[vcpu->vcpu_id].ctrl; 744 + struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 745 + uint8_t sigp_ctrl = vcpu->kvm->arch.sca->cpu[vcpu->vcpu_id].sigp_ctrl; 743 746 744 - if (!psw_extint_disabled(vcpu) && 745 - (vcpu->arch.sie_block->gcr[0] & 0x2000ul) && 746 - (atomic_read(sigp_ctrl) & SIGP_CTRL_C) && 747 - (atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_ECALL_PEND)) 748 - return 1; 747 + if (!sclp_has_sigpif()) 748 + return test_bit(IRQ_PEND_EXT_EXTERNAL, &li->pending_irqs); 749 749 750 - return 0; 750 + return (sigp_ctrl & SIGP_CTRL_C) && 751 + (atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_ECALL_PEND); 751 752 } 752 753 753 - int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu) 754 + int kvm_s390_vcpu_has_irq(struct kvm_vcpu *vcpu, int exclude_stop) 754 755 { 755 756 struct kvm_s390_float_interrupt *fi = vcpu->arch.local_int.float_int; 756 757 struct kvm_s390_interrupt_info *inti; ··· 770 773 if (!rc && kvm_cpu_has_pending_timer(vcpu)) 771 774 rc = 1; 772 775 773 - if (!rc && kvm_s390_si_ext_call_pending(vcpu)) 776 + /* external call pending and deliverable */ 777 + if (!rc && kvm_s390_ext_call_pending(vcpu) && 778 + !psw_extint_disabled(vcpu) && 779 + (vcpu->arch.sie_block->gcr[0] & 0x2000ul)) 780 + rc = 1; 781 + 782 + if (!rc && !exclude_stop && kvm_s390_is_stop_irq_pending(vcpu)) 774 783 rc = 1; 775 784 776 785 return rc; ··· 807 804 return -EOPNOTSUPP; /* disabled wait */ 808 805 } 809 806 810 - __set_cpu_idle(vcpu); 811 807 if (!ckc_interrupts_enabled(vcpu)) { 812 808 VCPU_EVENT(vcpu, 3, "%s", "enabled wait w/o timer"); 809 + __set_cpu_idle(vcpu); 813 810 goto no_timer; 814 811 } 815 812 816 813 now = get_tod_clock_fast() + vcpu->arch.sie_block->epoch; 817 814 sltime = tod_to_ns(vcpu->arch.sie_block->ckc - now); 815 + 816 + /* underflow */ 817 + if (vcpu->arch.sie_block->ckc < now) 818 + return 0; 819 + 820 + __set_cpu_idle(vcpu); 818 821 hrtimer_start(&vcpu->arch.ckc_timer, ktime_set (0, sltime) , HRTIMER_MODE_REL); 819 822 VCPU_EVENT(vcpu, 5, "enabled wait via clock comparator: %llx ns", sltime); 820 823 no_timer: ··· 829 820 __unset_cpu_idle(vcpu); 830 821 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); 831 822 832 - hrtimer_try_to_cancel(&vcpu->arch.ckc_timer); 823 + hrtimer_cancel(&vcpu->arch.ckc_timer); 833 824 return 0; 834 825 } 835 826 ··· 849 840 enum hrtimer_restart kvm_s390_idle_wakeup(struct hrtimer *timer) 850 841 { 851 842 struct kvm_vcpu *vcpu; 843 + u64 now, sltime; 852 844 853 845 vcpu = container_of(timer, struct kvm_vcpu, arch.ckc_timer); 854 - kvm_s390_vcpu_wakeup(vcpu); 846 + now = get_tod_clock_fast() + vcpu->arch.sie_block->epoch; 847 + sltime = tod_to_ns(vcpu->arch.sie_block->ckc - now); 855 848 849 + /* 850 + * If the monotonic clock runs faster than the tod
clock we might be 851 + * woken up too early and have to go back to sleep to avoid deadlocks. 852 + */ 853 + if (vcpu->arch.sie_block->ckc > now && 854 + hrtimer_forward_now(timer, ns_to_ktime(sltime))) 855 + return HRTIMER_RESTART; 856 + kvm_s390_vcpu_wakeup(vcpu); 856 857 return HRTIMER_NORESTART; 857 858 } 858 859 ··· 878 859 879 860 /* clear pending external calls set by sigp interpretation facility */ 880 861 atomic_clear_mask(CPUSTAT_ECALL_PEND, li->cpuflags); 881 - atomic_clear_mask(SIGP_CTRL_C, 882 - &vcpu->kvm->arch.sca->cpu[vcpu->vcpu_id].ctrl); 862 + vcpu->kvm->arch.sca->cpu[vcpu->vcpu_id].sigp_ctrl = 0; 883 863 } 884 864 885 865 int __must_check kvm_s390_deliver_pending_interrupts(struct kvm_vcpu *vcpu) ··· 1002 984 return 0; 1003 985 } 1004 986 1005 - int __inject_extcall(struct kvm_vcpu *vcpu, struct kvm_s390_irq *irq) 987 + static int __inject_extcall_sigpif(struct kvm_vcpu *vcpu, uint16_t src_id) 988 + { 989 + unsigned char new_val, old_val; 990 + uint8_t *sigp_ctrl = &vcpu->kvm->arch.sca->cpu[vcpu->vcpu_id].sigp_ctrl; 991 + 992 + new_val = SIGP_CTRL_C | (src_id & SIGP_CTRL_SCN_MASK); 993 + old_val = *sigp_ctrl & ~SIGP_CTRL_C; 994 + if (cmpxchg(sigp_ctrl, old_val, new_val) != old_val) { 995 + /* another external call is pending */ 996 + return -EBUSY; 997 + } 998 + atomic_set_mask(CPUSTAT_ECALL_PEND, &vcpu->arch.sie_block->cpuflags); 999 + return 0; 1000 + } 1001 + 1002 + static int __inject_extcall(struct kvm_vcpu *vcpu, struct kvm_s390_irq *irq) 1006 1003 { 1007 1004 struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 1008 1005 struct kvm_s390_extcall_info *extcall = &li->irq.extcall; 1006 + uint16_t src_id = irq->u.extcall.code; 1009 1007 1010 1008 VCPU_EVENT(vcpu, 3, "inject: external call source-cpu:%u", 1011 - irq->u.extcall.code); 1009 + src_id); 1012 1010 trace_kvm_s390_inject_vcpu(vcpu->vcpu_id, KVM_S390_INT_EXTERNAL_CALL, 1013 - irq->u.extcall.code, 0, 2); 1011 + src_id, 0, 2); 1014 1012 1013 + /* sending vcpu invalid */ 1014 + if 
(src_id >= KVM_MAX_VCPUS || 1015 + kvm_get_vcpu(vcpu->kvm, src_id) == NULL) 1016 + return -EINVAL; 1017 + 1018 + if (sclp_has_sigpif()) 1019 + return __inject_extcall_sigpif(vcpu, src_id); 1020 + 1021 + if (!test_and_set_bit(IRQ_PEND_EXT_EXTERNAL, &li->pending_irqs)) 1022 + return -EBUSY; 1015 1023 *extcall = irq->u.extcall; 1016 - set_bit(IRQ_PEND_EXT_EXTERNAL, &li->pending_irqs); 1017 1024 atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags); 1018 1025 return 0; 1019 1026 } ··· 1049 1006 struct kvm_s390_prefix_info *prefix = &li->irq.prefix; 1050 1007 1051 1008 VCPU_EVENT(vcpu, 3, "inject: set prefix to %x (from user)", 1052 - prefix->address); 1009 + irq->u.prefix.address); 1053 1010 trace_kvm_s390_inject_vcpu(vcpu->vcpu_id, KVM_S390_SIGP_SET_PREFIX, 1054 - prefix->address, 0, 2); 1011 + irq->u.prefix.address, 0, 2); 1012 + 1013 + if (!is_vcpu_stopped(vcpu)) 1014 + return -EBUSY; 1055 1015 1056 1016 *prefix = irq->u.prefix; 1057 1017 set_bit(IRQ_PEND_SET_PREFIX, &li->pending_irqs); 1058 1018 return 0; 1059 1019 } 1060 1020 1021 + #define KVM_S390_STOP_SUPP_FLAGS (KVM_S390_STOP_FLAG_STORE_STATUS) 1061 1022 static int __inject_sigp_stop(struct kvm_vcpu *vcpu, struct kvm_s390_irq *irq) 1062 1023 { 1063 1024 struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 1025 + struct kvm_s390_stop_info *stop = &li->irq.stop; 1026 + int rc = 0; 1064 1027 1065 1028 trace_kvm_s390_inject_vcpu(vcpu->vcpu_id, KVM_S390_SIGP_STOP, 0, 0, 2); 1066 1029 1067 - li->action_bits |= ACTION_STOP_ON_STOP; 1068 - set_bit(IRQ_PEND_SIGP_STOP, &li->pending_irqs); 1030 + if (irq->u.stop.flags & ~KVM_S390_STOP_SUPP_FLAGS) 1031 + return -EINVAL; 1032 + 1033 + if (is_vcpu_stopped(vcpu)) { 1034 + if (irq->u.stop.flags & KVM_S390_STOP_FLAG_STORE_STATUS) 1035 + rc = kvm_s390_store_status_unloaded(vcpu, 1036 + KVM_S390_STORE_STATUS_NOADDR); 1037 + return rc; 1038 + } 1039 + 1040 + if (test_and_set_bit(IRQ_PEND_SIGP_STOP, &li->pending_irqs)) 1041 + return -EBUSY; 1042 + stop->flags = irq->u.stop.flags; 
1043 + __set_cpuflag(vcpu, CPUSTAT_STOP_INT); 1069 1044 return 0; 1070 1045 } 1071 1046 ··· 1103 1042 struct kvm_s390_irq *irq) 1104 1043 { 1105 1044 struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 1106 - struct kvm_s390_emerg_info *emerg = &li->irq.emerg; 1107 1045 1108 1046 VCPU_EVENT(vcpu, 3, "inject: emergency %u\n", 1109 1047 irq->u.emerg.code); 1110 1048 trace_kvm_s390_inject_vcpu(vcpu->vcpu_id, KVM_S390_INT_EMERGENCY, 1111 - emerg->code, 0, 2); 1049 + irq->u.emerg.code, 0, 2); 1112 1050 1113 - set_bit(emerg->code, li->sigp_emerg_pending); 1051 + set_bit(irq->u.emerg.code, li->sigp_emerg_pending); 1114 1052 set_bit(IRQ_PEND_EXT_EMERGENCY, &li->pending_irqs); 1115 1053 atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags); 1116 1054 return 0; ··· 1121 1061 struct kvm_s390_mchk_info *mchk = &li->irq.mchk; 1122 1062 1123 1063 VCPU_EVENT(vcpu, 5, "inject: machine check parm64:%llx", 1124 - mchk->mcic); 1064 + irq->u.mchk.mcic); 1125 1065 trace_kvm_s390_inject_vcpu(vcpu->vcpu_id, KVM_S390_MCHK, 0, 1126 - mchk->mcic, 2); 1066 + irq->u.mchk.mcic, 2); 1127 1067 1128 1068 /* 1129 1069 * Because repressible machine checks can be indicated along with ··· 1181 1121 1182 1122 if ((!schid && !cr6) || (schid && cr6)) 1183 1123 return NULL; 1184 - mutex_lock(&kvm->lock); 1185 1124 fi = &kvm->arch.float_int; 1186 1125 spin_lock(&fi->lock); 1187 1126 inti = NULL; ··· 1208 1149 if (list_empty(&fi->list)) 1209 1150 atomic_set(&fi->active, 0); 1210 1151 spin_unlock(&fi->lock); 1211 - mutex_unlock(&kvm->lock); 1212 1152 return inti; 1213 1153 } 1214 1154 ··· 1220 1162 int sigcpu; 1221 1163 int rc = 0; 1222 1164 1223 - mutex_lock(&kvm->lock); 1224 1165 fi = &kvm->arch.float_int; 1225 1166 spin_lock(&fi->lock); 1226 1167 if (fi->irq_count >= KVM_S390_MAX_FLOAT_IRQS) { ··· 1244 1187 list_add_tail(&inti->list, &iter->list); 1245 1188 } 1246 1189 atomic_set(&fi->active, 1); 1190 + if (atomic_read(&kvm->online_vcpus) == 0) 1191 + goto unlock_fi; 1247 1192 sigcpu = 
find_first_bit(fi->idle_mask, KVM_MAX_VCPUS); 1248 1193 if (sigcpu == KVM_MAX_VCPUS) { 1249 1194 do { ··· 1272 1213 kvm_s390_vcpu_wakeup(kvm_get_vcpu(kvm, sigcpu)); 1273 1214 unlock_fi: 1274 1215 spin_unlock(&fi->lock); 1275 - mutex_unlock(&kvm->lock); 1276 1216 return rc; 1277 1217 } 1278 1218 ··· 1279 1221 struct kvm_s390_interrupt *s390int) 1280 1222 { 1281 1223 struct kvm_s390_interrupt_info *inti; 1224 + int rc; 1282 1225 1283 1226 inti = kzalloc(sizeof(*inti), GFP_KERNEL); 1284 1227 if (!inti) ··· 1298 1239 inti->ext.ext_params = s390int->parm; 1299 1240 break; 1300 1241 case KVM_S390_INT_PFAULT_DONE: 1301 - inti->type = s390int->type; 1302 1242 inti->ext.ext_params2 = s390int->parm64; 1303 1243 break; 1304 1244 case KVM_S390_MCHK: ··· 1326 1268 trace_kvm_s390_inject_vm(s390int->type, s390int->parm, s390int->parm64, 1327 1269 2); 1328 1270 1329 - return __inject_vm(kvm, inti); 1271 + rc = __inject_vm(kvm, inti); 1272 + if (rc) 1273 + kfree(inti); 1274 + return rc; 1330 1275 } 1331 1276 1332 1277 void kvm_s390_reinject_io_int(struct kvm *kvm, ··· 1351 1290 case KVM_S390_SIGP_SET_PREFIX: 1352 1291 irq->u.prefix.address = s390int->parm; 1353 1292 break; 1293 + case KVM_S390_SIGP_STOP: 1294 + irq->u.stop.flags = s390int->parm; 1295 + break; 1354 1296 case KVM_S390_INT_EXTERNAL_CALL: 1355 - if (irq->u.extcall.code & 0xffff0000) 1297 + if (s390int->parm & 0xffff0000) 1356 1298 return -EINVAL; 1357 1299 irq->u.extcall.code = s390int->parm; 1358 1300 break; 1359 1301 case KVM_S390_INT_EMERGENCY: 1360 - if (irq->u.emerg.code & 0xffff0000) 1302 + if (s390int->parm & 0xffff0000) 1361 1303 return -EINVAL; 1362 1304 irq->u.emerg.code = s390int->parm; 1363 1305 break; ··· 1369 1305 break; 1370 1306 } 1371 1307 return 0; 1308 + } 1309 + 1310 + int kvm_s390_is_stop_irq_pending(struct kvm_vcpu *vcpu) 1311 + { 1312 + struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 1313 + 1314 + return test_bit(IRQ_PEND_SIGP_STOP, &li->pending_irqs); 1315 + } 1316 + 1317 + void 
kvm_s390_clear_stop_irq(struct kvm_vcpu *vcpu) 1318 + { 1319 + struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 1320 + 1321 + spin_lock(&li->lock); 1322 + li->irq.stop.flags = 0; 1323 + clear_bit(IRQ_PEND_SIGP_STOP, &li->pending_irqs); 1324 + spin_unlock(&li->lock); 1372 1325 } 1373 1326 1374 1327 int kvm_s390_inject_vcpu(struct kvm_vcpu *vcpu, struct kvm_s390_irq *irq) ··· 1444 1363 struct kvm_s390_float_interrupt *fi; 1445 1364 struct kvm_s390_interrupt_info *n, *inti = NULL; 1446 1365 1447 - mutex_lock(&kvm->lock); 1448 1366 fi = &kvm->arch.float_int; 1449 1367 spin_lock(&fi->lock); 1450 1368 list_for_each_entry_safe(inti, n, &fi->list, list) { ··· 1453 1373 fi->irq_count = 0; 1454 1374 atomic_set(&fi->active, 0); 1455 1375 spin_unlock(&fi->lock); 1456 - mutex_unlock(&kvm->lock); 1457 1376 } 1458 1377 1459 1378 static inline int copy_irq_to_user(struct kvm_s390_interrupt_info *inti, ··· 1492 1413 int ret = 0; 1493 1414 int n = 0; 1494 1415 1495 - mutex_lock(&kvm->lock); 1496 1416 fi = &kvm->arch.float_int; 1497 1417 spin_lock(&fi->lock); 1498 1418 ··· 1510 1432 } 1511 1433 1512 1434 spin_unlock(&fi->lock); 1513 - mutex_unlock(&kvm->lock); 1514 1435 1515 1436 return ret < 0 ? ret : n; 1516 1437 }
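The interesting trick in the interrupt.c changes above is `__inject_extcall_sigpif`: it claims the per-cpu SIGP control byte with a single compare-and-swap, so two vcpus racing to deliver an external call cannot both succeed. A minimal userspace sketch of the same claim protocol follows; it uses the GCC `__atomic` builtin rather than the kernel's `cmpxchg`, and the function name and `-16` (`-EBUSY`) return are illustrative, not the kernel's exact API:

```c
#include <stdint.h>
#include <stdbool.h>

#define SIGP_CTRL_C        0x80  /* "external call pending" bit */
#define SIGP_CTRL_SCN_MASK 0x3f  /* source cpu number */

/* Try to flag an external call from src_id.  The expected old value has
 * the C bit cleared, so if a call is already pending the compare fails
 * and the caller gets -EBUSY, mirroring __inject_extcall_sigpif(). */
static int inject_extcall_sigpif(uint8_t *sigp_ctrl, uint16_t src_id)
{
    uint8_t new_val = SIGP_CTRL_C | (src_id & SIGP_CTRL_SCN_MASK);
    uint8_t old_val = *sigp_ctrl & ~SIGP_CTRL_C;

    if (!__atomic_compare_exchange_n(sigp_ctrl, &old_val, new_val,
                                     false, __ATOMIC_SEQ_CST,
                                     __ATOMIC_SEQ_CST))
        return -16; /* another external call is pending (-EBUSY) */
    return 0;
}
```

The losing sender sees `-EBUSY` and leaves the control byte untouched, which is exactly why the kernel can clear the whole byte on vcpu reset without worrying about lost updates.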
arch/s390/kvm/kvm-s390.c (+536, -60)
··· 22 22 #include <linux/kvm.h> 23 23 #include <linux/kvm_host.h> 24 24 #include <linux/module.h> 25 + #include <linux/random.h> 25 26 #include <linux/slab.h> 26 27 #include <linux/timer.h> 27 28 #include <asm/asm-offsets.h> ··· 30 29 #include <asm/pgtable.h> 31 30 #include <asm/nmi.h> 32 31 #include <asm/switch_to.h> 33 - #include <asm/facility.h> 34 32 #include <asm/sclp.h> 35 33 #include "kvm-s390.h" 36 34 #include "gaccess.h" ··· 50 50 { "exit_instruction", VCPU_STAT(exit_instruction) }, 51 51 { "exit_program_interruption", VCPU_STAT(exit_program_interruption) }, 52 52 { "exit_instr_and_program_int", VCPU_STAT(exit_instr_and_program) }, 53 + { "halt_successful_poll", VCPU_STAT(halt_successful_poll) }, 53 54 { "halt_wakeup", VCPU_STAT(halt_wakeup) }, 54 55 { "instruction_lctlg", VCPU_STAT(instruction_lctlg) }, 55 56 { "instruction_lctl", VCPU_STAT(instruction_lctl) }, ··· 99 98 { NULL } 100 99 }; 101 100 102 - unsigned long *vfacilities; 103 - static struct gmap_notifier gmap_notifier; 101 + /* upper facilities limit for kvm */ 102 + unsigned long kvm_s390_fac_list_mask[] = { 103 + 0xff82fffbf4fc2000UL, 104 + 0x005c000000000000UL, 105 + }; 104 106 105 - /* test availability of vfacility */ 106 - int test_vfacility(unsigned long nr) 107 + unsigned long kvm_s390_fac_list_mask_size(void) 107 108 { 108 - return __test_facility(nr, (void *) vfacilities); 109 + BUILD_BUG_ON(ARRAY_SIZE(kvm_s390_fac_list_mask) > S390_ARCH_FAC_MASK_SIZE_U64); 110 + return ARRAY_SIZE(kvm_s390_fac_list_mask); 109 111 } 112 + 113 + static struct gmap_notifier gmap_notifier; 110 114 111 115 /* Section: not file related */ 112 116 int kvm_arch_hardware_enable(void) ··· 172 166 case KVM_CAP_S390_IRQCHIP: 173 167 case KVM_CAP_VM_ATTRIBUTES: 174 168 case KVM_CAP_MP_STATE: 169 + case KVM_CAP_S390_USER_SIGP: 175 170 r = 1; 176 171 break; 177 172 case KVM_CAP_NR_VCPUS: ··· 261 254 kvm->arch.use_irqchip = 1; 262 255 r = 0; 263 256 break; 257 + case KVM_CAP_S390_USER_SIGP: 258 + kvm->arch.user_sigp 
= 1; 259 + r = 0; 260 + break; 264 261 default: 265 262 r = -EINVAL; 266 263 break; ··· 272 261 return r; 273 262 } 274 263 275 - static int kvm_s390_mem_control(struct kvm *kvm, struct kvm_device_attr *attr) 264 + static int kvm_s390_get_mem_control(struct kvm *kvm, struct kvm_device_attr *attr) 265 + { 266 + int ret; 267 + 268 + switch (attr->attr) { 269 + case KVM_S390_VM_MEM_LIMIT_SIZE: 270 + ret = 0; 271 + if (put_user(kvm->arch.gmap->asce_end, (u64 __user *)attr->addr)) 272 + ret = -EFAULT; 273 + break; 274 + default: 275 + ret = -ENXIO; 276 + break; 277 + } 278 + return ret; 279 + } 280 + 281 + static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *attr) 276 282 { 277 283 int ret; 278 284 unsigned int idx; ··· 311 283 mutex_unlock(&kvm->lock); 312 284 ret = 0; 313 285 break; 286 + case KVM_S390_VM_MEM_LIMIT_SIZE: { 287 + unsigned long new_limit; 288 + 289 + if (kvm_is_ucontrol(kvm)) 290 + return -EINVAL; 291 + 292 + if (get_user(new_limit, (u64 __user *)attr->addr)) 293 + return -EFAULT; 294 + 295 + if (new_limit > kvm->arch.gmap->asce_end) 296 + return -E2BIG; 297 + 298 + ret = -EBUSY; 299 + mutex_lock(&kvm->lock); 300 + if (atomic_read(&kvm->online_vcpus) == 0) { 301 + /* gmap_alloc will round the limit up */ 302 + struct gmap *new = gmap_alloc(current->mm, new_limit); 303 + 304 + if (!new) { 305 + ret = -ENOMEM; 306 + } else { 307 + gmap_free(kvm->arch.gmap); 308 + new->private = kvm; 309 + kvm->arch.gmap = new; 310 + ret = 0; 311 + } 312 + } 313 + mutex_unlock(&kvm->lock); 314 + break; 315 + } 314 316 default: 315 317 ret = -ENXIO; 318 + break; 319 + } 320 + return ret; 321 + } 322 + 323 + static void kvm_s390_vcpu_crypto_setup(struct kvm_vcpu *vcpu); 324 + 325 + static int kvm_s390_vm_set_crypto(struct kvm *kvm, struct kvm_device_attr *attr) 326 + { 327 + struct kvm_vcpu *vcpu; 328 + int i; 329 + 330 + if (!test_kvm_facility(kvm, 76)) 331 + return -EINVAL; 332 + 333 + mutex_lock(&kvm->lock); 334 + switch (attr->attr) { 335 + case 
KVM_S390_VM_CRYPTO_ENABLE_AES_KW: 336 + get_random_bytes( 337 + kvm->arch.crypto.crycb->aes_wrapping_key_mask, 338 + sizeof(kvm->arch.crypto.crycb->aes_wrapping_key_mask)); 339 + kvm->arch.crypto.aes_kw = 1; 340 + break; 341 + case KVM_S390_VM_CRYPTO_ENABLE_DEA_KW: 342 + get_random_bytes( 343 + kvm->arch.crypto.crycb->dea_wrapping_key_mask, 344 + sizeof(kvm->arch.crypto.crycb->dea_wrapping_key_mask)); 345 + kvm->arch.crypto.dea_kw = 1; 346 + break; 347 + case KVM_S390_VM_CRYPTO_DISABLE_AES_KW: 348 + kvm->arch.crypto.aes_kw = 0; 349 + memset(kvm->arch.crypto.crycb->aes_wrapping_key_mask, 0, 350 + sizeof(kvm->arch.crypto.crycb->aes_wrapping_key_mask)); 351 + break; 352 + case KVM_S390_VM_CRYPTO_DISABLE_DEA_KW: 353 + kvm->arch.crypto.dea_kw = 0; 354 + memset(kvm->arch.crypto.crycb->dea_wrapping_key_mask, 0, 355 + sizeof(kvm->arch.crypto.crycb->dea_wrapping_key_mask)); 356 + break; 357 + default: 358 + mutex_unlock(&kvm->lock); 359 + return -ENXIO; 360 + } 361 + 362 + kvm_for_each_vcpu(i, vcpu, kvm) { 363 + kvm_s390_vcpu_crypto_setup(vcpu); 364 + exit_sie(vcpu); 365 + } 366 + mutex_unlock(&kvm->lock); 367 + return 0; 368 + } 369 + 370 + static int kvm_s390_set_tod_high(struct kvm *kvm, struct kvm_device_attr *attr) 371 + { 372 + u8 gtod_high; 373 + 374 + if (copy_from_user(&gtod_high, (void __user *)attr->addr, 375 + sizeof(gtod_high))) 376 + return -EFAULT; 377 + 378 + if (gtod_high != 0) 379 + return -EINVAL; 380 + 381 + return 0; 382 + } 383 + 384 + static int kvm_s390_set_tod_low(struct kvm *kvm, struct kvm_device_attr *attr) 385 + { 386 + struct kvm_vcpu *cur_vcpu; 387 + unsigned int vcpu_idx; 388 + u64 host_tod, gtod; 389 + int r; 390 + 391 + if (copy_from_user(&gtod, (void __user *)attr->addr, sizeof(gtod))) 392 + return -EFAULT; 393 + 394 + r = store_tod_clock(&host_tod); 395 + if (r) 396 + return r; 397 + 398 + mutex_lock(&kvm->lock); 399 + kvm->arch.epoch = gtod - host_tod; 400 + kvm_for_each_vcpu(vcpu_idx, cur_vcpu, kvm) { 401 + 
cur_vcpu->arch.sie_block->epoch = kvm->arch.epoch; 402 + exit_sie(cur_vcpu); 403 + } 404 + mutex_unlock(&kvm->lock); 405 + return 0; 406 + } 407 + 408 + static int kvm_s390_set_tod(struct kvm *kvm, struct kvm_device_attr *attr) 409 + { 410 + int ret; 411 + 412 + if (attr->flags) 413 + return -EINVAL; 414 + 415 + switch (attr->attr) { 416 + case KVM_S390_VM_TOD_HIGH: 417 + ret = kvm_s390_set_tod_high(kvm, attr); 418 + break; 419 + case KVM_S390_VM_TOD_LOW: 420 + ret = kvm_s390_set_tod_low(kvm, attr); 421 + break; 422 + default: 423 + ret = -ENXIO; 424 + break; 425 + } 426 + return ret; 427 + } 428 + 429 + static int kvm_s390_get_tod_high(struct kvm *kvm, struct kvm_device_attr *attr) 430 + { 431 + u8 gtod_high = 0; 432 + 433 + if (copy_to_user((void __user *)attr->addr, &gtod_high, 434 + sizeof(gtod_high))) 435 + return -EFAULT; 436 + 437 + return 0; 438 + } 439 + 440 + static int kvm_s390_get_tod_low(struct kvm *kvm, struct kvm_device_attr *attr) 441 + { 442 + u64 host_tod, gtod; 443 + int r; 444 + 445 + r = store_tod_clock(&host_tod); 446 + if (r) 447 + return r; 448 + 449 + gtod = host_tod + kvm->arch.epoch; 450 + if (copy_to_user((void __user *)attr->addr, &gtod, sizeof(gtod))) 451 + return -EFAULT; 452 + 453 + return 0; 454 + } 455 + 456 + static int kvm_s390_get_tod(struct kvm *kvm, struct kvm_device_attr *attr) 457 + { 458 + int ret; 459 + 460 + if (attr->flags) 461 + return -EINVAL; 462 + 463 + switch (attr->attr) { 464 + case KVM_S390_VM_TOD_HIGH: 465 + ret = kvm_s390_get_tod_high(kvm, attr); 466 + break; 467 + case KVM_S390_VM_TOD_LOW: 468 + ret = kvm_s390_get_tod_low(kvm, attr); 469 + break; 470 + default: 471 + ret = -ENXIO; 472 + break; 473 + } 474 + return ret; 475 + } 476 + 477 + static int kvm_s390_set_processor(struct kvm *kvm, struct kvm_device_attr *attr) 478 + { 479 + struct kvm_s390_vm_cpu_processor *proc; 480 + int ret = 0; 481 + 482 + mutex_lock(&kvm->lock); 483 + if (atomic_read(&kvm->online_vcpus)) { 484 + ret = -EBUSY; 485 + goto out; 486 + 
} 487 + proc = kzalloc(sizeof(*proc), GFP_KERNEL); 488 + if (!proc) { 489 + ret = -ENOMEM; 490 + goto out; 491 + } 492 + if (!copy_from_user(proc, (void __user *)attr->addr, 493 + sizeof(*proc))) { 494 + memcpy(&kvm->arch.model.cpu_id, &proc->cpuid, 495 + sizeof(struct cpuid)); 496 + kvm->arch.model.ibc = proc->ibc; 497 + memcpy(kvm->arch.model.fac->kvm, proc->fac_list, 498 + S390_ARCH_FAC_LIST_SIZE_BYTE); 499 + } else 500 + ret = -EFAULT; 501 + kfree(proc); 502 + out: 503 + mutex_unlock(&kvm->lock); 504 + return ret; 505 + } 506 + 507 + static int kvm_s390_set_cpu_model(struct kvm *kvm, struct kvm_device_attr *attr) 508 + { 509 + int ret = -ENXIO; 510 + 511 + switch (attr->attr) { 512 + case KVM_S390_VM_CPU_PROCESSOR: 513 + ret = kvm_s390_set_processor(kvm, attr); 514 + break; 515 + } 516 + return ret; 517 + } 518 + 519 + static int kvm_s390_get_processor(struct kvm *kvm, struct kvm_device_attr *attr) 520 + { 521 + struct kvm_s390_vm_cpu_processor *proc; 522 + int ret = 0; 523 + 524 + proc = kzalloc(sizeof(*proc), GFP_KERNEL); 525 + if (!proc) { 526 + ret = -ENOMEM; 527 + goto out; 528 + } 529 + memcpy(&proc->cpuid, &kvm->arch.model.cpu_id, sizeof(struct cpuid)); 530 + proc->ibc = kvm->arch.model.ibc; 531 + memcpy(&proc->fac_list, kvm->arch.model.fac->kvm, S390_ARCH_FAC_LIST_SIZE_BYTE); 532 + if (copy_to_user((void __user *)attr->addr, proc, sizeof(*proc))) 533 + ret = -EFAULT; 534 + kfree(proc); 535 + out: 536 + return ret; 537 + } 538 + 539 + static int kvm_s390_get_machine(struct kvm *kvm, struct kvm_device_attr *attr) 540 + { 541 + struct kvm_s390_vm_cpu_machine *mach; 542 + int ret = 0; 543 + 544 + mach = kzalloc(sizeof(*mach), GFP_KERNEL); 545 + if (!mach) { 546 + ret = -ENOMEM; 547 + goto out; 548 + } 549 + get_cpu_id((struct cpuid *) &mach->cpuid); 550 + mach->ibc = sclp_get_ibc(); 551 + memcpy(&mach->fac_mask, kvm_s390_fac_list_mask, 552 + kvm_s390_fac_list_mask_size() * sizeof(u64)); 553 + memcpy((unsigned long *)&mach->fac_list, 
S390_lowcore.stfle_fac_list, 554 + S390_ARCH_FAC_LIST_SIZE_U64); 555 + if (copy_to_user((void __user *)attr->addr, mach, sizeof(*mach))) 556 + ret = -EFAULT; 557 + kfree(mach); 558 + out: 559 + return ret; 560 + } 561 + 562 + static int kvm_s390_get_cpu_model(struct kvm *kvm, struct kvm_device_attr *attr) 563 + { 564 + int ret = -ENXIO; 565 + 566 + switch (attr->attr) { 567 + case KVM_S390_VM_CPU_PROCESSOR: 568 + ret = kvm_s390_get_processor(kvm, attr); 569 + break; 570 + case KVM_S390_VM_CPU_MACHINE: 571 + ret = kvm_s390_get_machine(kvm, attr); 316 572 break; 317 573 } 318 574 return ret; ··· 608 296 609 297 switch (attr->group) { 610 298 case KVM_S390_VM_MEM_CTRL: 611 - ret = kvm_s390_mem_control(kvm, attr); 299 + ret = kvm_s390_set_mem_control(kvm, attr); 300 + break; 301 + case KVM_S390_VM_TOD: 302 + ret = kvm_s390_set_tod(kvm, attr); 303 + break; 304 + case KVM_S390_VM_CPU_MODEL: 305 + ret = kvm_s390_set_cpu_model(kvm, attr); 306 + break; 307 + case KVM_S390_VM_CRYPTO: 308 + ret = kvm_s390_vm_set_crypto(kvm, attr); 612 309 break; 613 310 default: 614 311 ret = -ENXIO; ··· 629 308 630 309 static int kvm_s390_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr) 631 310 { 632 - return -ENXIO; 311 + int ret; 312 + 313 + switch (attr->group) { 314 + case KVM_S390_VM_MEM_CTRL: 315 + ret = kvm_s390_get_mem_control(kvm, attr); 316 + break; 317 + case KVM_S390_VM_TOD: 318 + ret = kvm_s390_get_tod(kvm, attr); 319 + break; 320 + case KVM_S390_VM_CPU_MODEL: 321 + ret = kvm_s390_get_cpu_model(kvm, attr); 322 + break; 323 + default: 324 + ret = -ENXIO; 325 + break; 326 + } 327 + 328 + return ret; 633 329 } 634 330 635 331 static int kvm_s390_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr) ··· 658 320 switch (attr->attr) { 659 321 case KVM_S390_VM_MEM_ENABLE_CMMA: 660 322 case KVM_S390_VM_MEM_CLR_CMMA: 323 + case KVM_S390_VM_MEM_LIMIT_SIZE: 324 + ret = 0; 325 + break; 326 + default: 327 + ret = -ENXIO; 328 + break; 329 + } 330 + break; 331 + case 
KVM_S390_VM_TOD: 332 + switch (attr->attr) { 333 + case KVM_S390_VM_TOD_LOW: 334 + case KVM_S390_VM_TOD_HIGH: 335 + ret = 0; 336 + break; 337 + default: 338 + ret = -ENXIO; 339 + break; 340 + } 341 + break; 342 + case KVM_S390_VM_CPU_MODEL: 343 + switch (attr->attr) { 344 + case KVM_S390_VM_CPU_PROCESSOR: 345 + case KVM_S390_VM_CPU_MACHINE: 346 + ret = 0; 347 + break; 348 + default: 349 + ret = -ENXIO; 350 + break; 351 + } 352 + break; 353 + case KVM_S390_VM_CRYPTO: 354 + switch (attr->attr) { 355 + case KVM_S390_VM_CRYPTO_ENABLE_AES_KW: 356 + case KVM_S390_VM_CRYPTO_ENABLE_DEA_KW: 357 + case KVM_S390_VM_CRYPTO_DISABLE_AES_KW: 358 + case KVM_S390_VM_CRYPTO_DISABLE_DEA_KW: 661 359 ret = 0; 662 360 break; 663 361 default: ··· 775 401 return r; 776 402 } 777 403 404 + static int kvm_s390_query_ap_config(u8 *config) 405 + { 406 + u32 fcn_code = 0x04000000UL; 407 + u32 cc; 408 + 409 + asm volatile( 410 + "lgr 0,%1\n" 411 + "lgr 2,%2\n" 412 + ".long 0xb2af0000\n" /* PQAP(QCI) */ 413 + "ipm %0\n" 414 + "srl %0,28\n" 415 + : "=r" (cc) 416 + : "r" (fcn_code), "r" (config) 417 + : "cc", "0", "2", "memory" 418 + ); 419 + 420 + return cc; 421 + } 422 + 423 + static int kvm_s390_apxa_installed(void) 424 + { 425 + u8 config[128]; 426 + int cc; 427 + 428 + if (test_facility(2) && test_facility(12)) { 429 + cc = kvm_s390_query_ap_config(config); 430 + 431 + if (cc) 432 + pr_err("PQAP(QCI) failed with cc=%d", cc); 433 + else 434 + return config[0] & 0x40; 435 + } 436 + 437 + return 0; 438 + } 439 + 440 + static void kvm_s390_set_crycb_format(struct kvm *kvm) 441 + { 442 + kvm->arch.crypto.crycbd = (__u32)(unsigned long) kvm->arch.crypto.crycb; 443 + 444 + if (kvm_s390_apxa_installed()) 445 + kvm->arch.crypto.crycbd |= CRYCB_FORMAT2; 446 + else 447 + kvm->arch.crypto.crycbd |= CRYCB_FORMAT1; 448 + } 449 + 450 + static void kvm_s390_get_cpu_id(struct cpuid *cpu_id) 451 + { 452 + get_cpu_id(cpu_id); 453 + cpu_id->version = 0xff; 454 + } 455 + 778 456 static int 
kvm_s390_crypto_init(struct kvm *kvm) 779 457 { 780 - if (!test_vfacility(76)) 458 + if (!test_kvm_facility(kvm, 76)) 781 459 return 0; 782 460 783 461 kvm->arch.crypto.crycb = kzalloc(sizeof(*kvm->arch.crypto.crycb), ··· 837 411 if (!kvm->arch.crypto.crycb) 838 412 return -ENOMEM; 839 413 840 - kvm->arch.crypto.crycbd = (__u32) (unsigned long) kvm->arch.crypto.crycb | 841 - CRYCB_FORMAT1; 414 + kvm_s390_set_crycb_format(kvm); 415 + 416 + /* Disable AES/DEA protected key functions by default */ 417 + kvm->arch.crypto.aes_kw = 0; 418 + kvm->arch.crypto.dea_kw = 0; 842 419 843 420 return 0; 844 421 } 845 422 846 423 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) 847 424 { 848 - int rc; 425 + int i, rc; 849 426 char debug_name[16]; 850 427 static unsigned long sca_offset; 851 428 ··· 883 454 if (!kvm->arch.dbf) 884 455 goto out_nodbf; 885 456 457 + /* 458 + * The architectural maximum amount of facilities is 16 kbit. To store 459 + * this amount, 2 kbyte of memory is required. Thus we need a full 460 + * page to hold the active copy (arch.model.fac->sie) and the current 461 + * facilities set (arch.model.fac->kvm). Its address size has to be 462 + * 31 bits and word aligned. 463 + */ 464 + kvm->arch.model.fac = 465 + (struct s390_model_fac *) get_zeroed_page(GFP_KERNEL | GFP_DMA); 466 + if (!kvm->arch.model.fac) 467 + goto out_nofac; 468 + 469 + memcpy(kvm->arch.model.fac->kvm, S390_lowcore.stfle_fac_list, 470 + S390_ARCH_FAC_LIST_SIZE_U64); 471 + 472 + /* 473 + * If this KVM host runs *not* in a LPAR, relax the facility bits 474 + * of the kvm facility mask by all missing facilities. This will allow 475 + * to determine the right CPU model by means of the remaining facilities. 476 + * Live guest migration must prohibit the migration of KVMs running in 477 + * a LPAR to non LPAR hosts. 
478 + */ 479 + if (!MACHINE_IS_LPAR) 480 + for (i = 0; i < kvm_s390_fac_list_mask_size(); i++) 481 + kvm_s390_fac_list_mask[i] &= kvm->arch.model.fac->kvm[i]; 482 + 483 + /* 484 + * Apply the kvm facility mask to limit the kvm supported/tolerated 485 + * facility list. 486 + */ 487 + for (i = 0; i < S390_ARCH_FAC_LIST_SIZE_U64; i++) { 488 + if (i < kvm_s390_fac_list_mask_size()) 489 + kvm->arch.model.fac->kvm[i] &= kvm_s390_fac_list_mask[i]; 490 + else 491 + kvm->arch.model.fac->kvm[i] = 0UL; 492 + } 493 + 494 + kvm_s390_get_cpu_id(&kvm->arch.model.cpu_id); 495 + kvm->arch.model.ibc = sclp_get_ibc() & 0x0fff; 496 + 886 497 if (kvm_s390_crypto_init(kvm) < 0) 887 498 goto out_crypto; 888 499 ··· 946 477 947 478 kvm->arch.css_support = 0; 948 479 kvm->arch.use_irqchip = 0; 480 + kvm->arch.epoch = 0; 949 481 950 482 spin_lock_init(&kvm->arch.start_stop_lock); 951 483 ··· 954 484 out_nogmap: 955 485 kfree(kvm->arch.crypto.crycb); 956 486 out_crypto: 487 + free_page((unsigned long)kvm->arch.model.fac); 488 + out_nofac: 957 489 debug_unregister(kvm->arch.dbf); 958 490 out_nodbf: 959 491 free_page((unsigned long)(kvm->arch.sca)); ··· 1008 536 void kvm_arch_destroy_vm(struct kvm *kvm) 1009 537 { 1010 538 kvm_free_vcpus(kvm); 539 + free_page((unsigned long)kvm->arch.model.fac); 1011 540 free_page((unsigned long)(kvm->arch.sca)); 1012 541 debug_unregister(kvm->arch.dbf); 1013 542 kfree(kvm->arch.crypto.crycb); ··· 1019 546 } 1020 547 1021 548 /* Section: vcpu related */ 549 + static int __kvm_ucontrol_vcpu_init(struct kvm_vcpu *vcpu) 550 + { 551 + vcpu->arch.gmap = gmap_alloc(current->mm, -1UL); 552 + if (!vcpu->arch.gmap) 553 + return -ENOMEM; 554 + vcpu->arch.gmap->private = vcpu->kvm; 555 + 556 + return 0; 557 + } 558 + 1022 559 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) 1023 560 { 1024 561 vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID; 1025 562 kvm_clear_async_pf_completion_queue(vcpu); 1026 - if (kvm_is_ucontrol(vcpu->kvm)) { 1027 - vcpu->arch.gmap = 
gmap_alloc(current->mm, -1UL); 1028 - if (!vcpu->arch.gmap) 1029 - return -ENOMEM; 1030 - vcpu->arch.gmap->private = vcpu->kvm; 1031 - return 0; 1032 - } 1033 - 1034 - vcpu->arch.gmap = vcpu->kvm->arch.gmap; 1035 563 vcpu->run->kvm_valid_regs = KVM_SYNC_PREFIX | 1036 564 KVM_SYNC_GPRS | 1037 565 KVM_SYNC_ACRS | 1038 566 KVM_SYNC_CRS | 1039 567 KVM_SYNC_ARCH0 | 1040 568 KVM_SYNC_PFAULT; 569 + 570 + if (kvm_is_ucontrol(vcpu->kvm)) 571 + return __kvm_ucontrol_vcpu_init(vcpu); 572 + 1041 573 return 0; 1042 574 } 1043 575 ··· 1093 615 kvm_s390_clear_local_irqs(vcpu); 1094 616 } 1095 617 1096 - int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 618 + void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 1097 619 { 1098 - return 0; 620 + mutex_lock(&vcpu->kvm->lock); 621 + vcpu->arch.sie_block->epoch = vcpu->kvm->arch.epoch; 622 + mutex_unlock(&vcpu->kvm->lock); 623 + if (!kvm_is_ucontrol(vcpu->kvm)) 624 + vcpu->arch.gmap = vcpu->kvm->arch.gmap; 1099 625 } 1100 626 1101 627 static void kvm_s390_vcpu_crypto_setup(struct kvm_vcpu *vcpu) 1102 628 { 1103 - if (!test_vfacility(76)) 629 + if (!test_kvm_facility(vcpu->kvm, 76)) 1104 630 return; 631 + 632 + vcpu->arch.sie_block->ecb3 &= ~(ECB3_AES | ECB3_DEA); 633 + 634 + if (vcpu->kvm->arch.crypto.aes_kw) 635 + vcpu->arch.sie_block->ecb3 |= ECB3_AES; 636 + if (vcpu->kvm->arch.crypto.dea_kw) 637 + vcpu->arch.sie_block->ecb3 |= ECB3_DEA; 1105 638 1106 639 vcpu->arch.sie_block->crycbd = vcpu->kvm->arch.crypto.crycbd; 1107 640 } ··· 1143 654 CPUSTAT_STOPPED | 1144 655 CPUSTAT_GED); 1145 656 vcpu->arch.sie_block->ecb = 6; 1146 - if (test_vfacility(50) && test_vfacility(73)) 657 + if (test_kvm_facility(vcpu->kvm, 50) && test_kvm_facility(vcpu->kvm, 73)) 1147 658 vcpu->arch.sie_block->ecb |= 0x10; 1148 659 1149 660 vcpu->arch.sie_block->ecb2 = 8; 1150 - vcpu->arch.sie_block->eca = 0xD1002000U; 661 + vcpu->arch.sie_block->eca = 0xC1002000U; 1151 662 if (sclp_has_siif()) 1152 663 vcpu->arch.sie_block->eca |= 1; 1153 - 
vcpu->arch.sie_block->fac = (int) (long) vfacilities; 664 + if (sclp_has_sigpif()) 665 + vcpu->arch.sie_block->eca |= 0x10000000U; 1154 666 vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE | 1155 667 ICTL_TPROT; 1156 668 ··· 1160 670 if (rc) 1161 671 return rc; 1162 672 } 1163 - hrtimer_init(&vcpu->arch.ckc_timer, CLOCK_REALTIME, HRTIMER_MODE_ABS); 673 + hrtimer_init(&vcpu->arch.ckc_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); 1164 674 vcpu->arch.ckc_timer.function = kvm_s390_idle_wakeup; 1165 - get_cpu_id(&vcpu->arch.cpu_id); 1166 - vcpu->arch.cpu_id.version = 0xff; 675 + 676 + mutex_lock(&vcpu->kvm->lock); 677 + vcpu->arch.cpu_id = vcpu->kvm->arch.model.cpu_id; 678 + memcpy(vcpu->kvm->arch.model.fac->sie, vcpu->kvm->arch.model.fac->kvm, 679 + S390_ARCH_FAC_LIST_SIZE_BYTE); 680 + vcpu->arch.sie_block->ibc = vcpu->kvm->arch.model.ibc; 681 + mutex_unlock(&vcpu->kvm->lock); 1167 682 1168 683 kvm_s390_vcpu_crypto_setup(vcpu); 1169 684 ··· 1212 717 vcpu->arch.sie_block->scaol = (__u32)(__u64)kvm->arch.sca; 1213 718 set_bit(63 - id, (unsigned long *) &kvm->arch.sca->mcn); 1214 719 } 720 + vcpu->arch.sie_block->fac = (int) (long) kvm->arch.model.fac->sie; 1215 721 1216 722 spin_lock_init(&vcpu->arch.local_int.lock); 1217 723 vcpu->arch.local_int.float_int = &kvm->arch.float_int; ··· 1237 741 1238 742 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) 1239 743 { 1240 - return kvm_cpu_has_interrupt(vcpu); 744 + return kvm_s390_vcpu_has_irq(vcpu, 0); 1241 745 } 1242 746 1243 747 void s390_vcpu_block(struct kvm_vcpu *vcpu) ··· 1365 869 case KVM_REG_S390_PFTOKEN: 1366 870 r = get_user(vcpu->arch.pfault_token, 1367 871 (u64 __user *)reg->addr); 872 + if (vcpu->arch.pfault_token == KVM_S390_PFAULT_TOKEN_INVALID) 873 + kvm_clear_async_pf_completion_queue(vcpu); 1368 874 break; 1369 875 case KVM_REG_S390_PFCOMPARE: 1370 876 r = get_user(vcpu->arch.pfault_compare, ··· 1674 1176 return 0; 1675 1177 if (psw_extint_disabled(vcpu)) 1676 1178 return 0; 1677 - if 
(kvm_cpu_has_interrupt(vcpu)) 1179 + if (kvm_s390_vcpu_has_irq(vcpu, 0)) 1678 1180 return 0; 1679 1181 if (!(vcpu->arch.sie_block->gcr[0] & 0x200ul)) 1680 1182 return 0; ··· 1839 1341 vcpu->arch.pfault_token = kvm_run->s.regs.pft; 1840 1342 vcpu->arch.pfault_select = kvm_run->s.regs.pfs; 1841 1343 vcpu->arch.pfault_compare = kvm_run->s.regs.pfc; 1344 + if (vcpu->arch.pfault_token == KVM_S390_PFAULT_TOKEN_INVALID) 1345 + kvm_clear_async_pf_completion_queue(vcpu); 1842 1346 } 1843 1347 kvm_run->kvm_dirty_regs = 0; 1844 1348 } ··· 2059 1559 spin_lock(&vcpu->kvm->arch.start_stop_lock); 2060 1560 online_vcpus = atomic_read(&vcpu->kvm->online_vcpus); 2061 1561 2062 - /* Need to lock access to action_bits to avoid a SIGP race condition */ 2063 - spin_lock(&vcpu->arch.local_int.lock); 2064 - atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags); 2065 - 2066 1562 /* SIGP STOP and SIGP STOP AND STORE STATUS has been fully processed */ 2067 - vcpu->arch.local_int.action_bits &= 2068 - ~(ACTION_STOP_ON_STOP | ACTION_STORE_ON_STOP); 2069 - spin_unlock(&vcpu->arch.local_int.lock); 1563 + kvm_s390_clear_stop_irq(vcpu); 2070 1564 1565 + atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags); 2071 1566 __disable_ibs_on_vcpu(vcpu); 2072 1567 2073 1568 for (i = 0; i < online_vcpus; i++) { ··· 2278 1783 2279 1784 static int __init kvm_s390_init(void) 2280 1785 { 2281 - int ret; 2282 - ret = kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE); 2283 - if (ret) 2284 - return ret; 2285 - 2286 - /* 2287 - * guests can ask for up to 255+1 double words, we need a full page 2288 - * to hold the maximum amount of facilities. On the other hand, we 2289 - * only set facilities that are known to work in KVM. 
2290 - */ 2291 - vfacilities = (unsigned long *) get_zeroed_page(GFP_KERNEL|GFP_DMA); 2292 - if (!vfacilities) { 2293 - kvm_exit(); 2294 - return -ENOMEM; 2295 - } 2296 - memcpy(vfacilities, S390_lowcore.stfle_fac_list, 16); 2297 - vfacilities[0] &= 0xff82fffbf47c2000UL; 2298 - vfacilities[1] &= 0x005c000000000000UL; 2299 - return 0; 1786 + return kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE); 2300 1787 } 2301 1788 2302 1789 static void __exit kvm_s390_exit(void) 2303 1790 { 2304 - free_page((unsigned long) vfacilities); 2305 1791 kvm_exit(); 2306 1792 } 2307 1793
+13 -6
arch/s390/kvm/kvm-s390.h
··· 18 18 #include <linux/hrtimer.h> 19 19 #include <linux/kvm.h> 20 20 #include <linux/kvm_host.h> 21 + #include <asm/facility.h> 21 22 22 23 typedef int (*intercept_handler_t)(struct kvm_vcpu *vcpu); 23 - 24 - /* declare vfacilities extern */ 25 - extern unsigned long *vfacilities; 26 24 27 25 /* Transactional Memory Execution related macros */ 28 26 #define IS_TE_ENABLED(vcpu) ((vcpu->arch.sie_block->ecb & 0x10)) ··· 125 127 vcpu->arch.sie_block->gpsw.mask |= cc << 44; 126 128 } 127 129 130 + /* test availability of facility in a kvm instance */ 131 + static inline int test_kvm_facility(struct kvm *kvm, unsigned long nr) 132 + { 133 + return __test_facility(nr, kvm->arch.model.fac->kvm); 134 + } 135 + 128 136 /* are cpu states controlled by user space */ 129 137 static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm) 130 138 { ··· 187 183 void kvm_s390_vcpu_unsetup_cmma(struct kvm_vcpu *vcpu); 188 184 /* is cmma enabled */ 189 185 bool kvm_s390_cmma_enabled(struct kvm *kvm); 190 - int test_vfacility(unsigned long nr); 186 + unsigned long kvm_s390_fac_list_mask_size(void); 187 + extern unsigned long kvm_s390_fac_list_mask[]; 191 188 192 189 /* implemented in diag.c */ 193 190 int kvm_s390_handle_diag(struct kvm_vcpu *vcpu); ··· 233 228 struct kvm_s390_irq *s390irq); 234 229 235 230 /* implemented in interrupt.c */ 236 - int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu); 231 + int kvm_s390_vcpu_has_irq(struct kvm_vcpu *vcpu, int exclude_stop); 237 232 int psw_extint_disabled(struct kvm_vcpu *vcpu); 238 233 void kvm_s390_destroy_adapters(struct kvm *kvm); 239 - int kvm_s390_si_ext_call_pending(struct kvm_vcpu *vcpu); 234 + int kvm_s390_ext_call_pending(struct kvm_vcpu *vcpu); 240 235 extern struct kvm_device_ops kvm_flic_ops; 236 + int kvm_s390_is_stop_irq_pending(struct kvm_vcpu *vcpu); 237 + void kvm_s390_clear_stop_irq(struct kvm_vcpu *vcpu); 241 238 242 239 /* implemented in guestdbg.c */ 243 240 void kvm_s390_backup_guest_per_regs(struct kvm_vcpu *vcpu);
+9 -4
arch/s390/kvm/priv.c
··· 337 337 static int handle_stfl(struct kvm_vcpu *vcpu) 338 338 { 339 339 int rc; 340 + unsigned int fac; 340 341 341 342 vcpu->stat.instruction_stfl++; 342 343 343 344 if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) 344 345 return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); 345 346 347 + /* 348 + * We need to shift the lower 32 facility bits (bit 0-31) from a u64 349 + * into a u32 memory representation. They will remain bits 0-31. 350 + */ 351 + fac = *vcpu->kvm->arch.model.fac->sie >> 32; 346 352 rc = write_guest_lc(vcpu, offsetof(struct _lowcore, stfl_fac_list), 347 - vfacilities, 4); 353 + &fac, sizeof(fac)); 348 354 if (rc) 349 355 return rc; 350 - VCPU_EVENT(vcpu, 5, "store facility list value %x", 351 - *(unsigned int *) vfacilities); 352 - trace_kvm_s390_handle_stfl(vcpu, *(unsigned int *) vfacilities); 356 + VCPU_EVENT(vcpu, 5, "store facility list value %x", fac); 357 + trace_kvm_s390_handle_stfl(vcpu, fac); 353 358 return 0; 354 359 } 355 360
+93 -67
arch/s390/kvm/sigp.c
··· 26 26 struct kvm_s390_local_interrupt *li; 27 27 int cpuflags; 28 28 int rc; 29 + int ext_call_pending; 29 30 30 31 li = &dst_vcpu->arch.local_int; 31 32 32 33 cpuflags = atomic_read(li->cpuflags); 33 - if (!(cpuflags & (CPUSTAT_ECALL_PEND | CPUSTAT_STOPPED))) 34 + ext_call_pending = kvm_s390_ext_call_pending(dst_vcpu); 35 + if (!(cpuflags & CPUSTAT_STOPPED) && !ext_call_pending) 34 36 rc = SIGP_CC_ORDER_CODE_ACCEPTED; 35 37 else { 36 38 *reg &= 0xffffffff00000000UL; 37 - if (cpuflags & CPUSTAT_ECALL_PEND) 39 + if (ext_call_pending) 38 40 *reg |= SIGP_STATUS_EXT_CALL_PENDING; 39 41 if (cpuflags & CPUSTAT_STOPPED) 40 42 *reg |= SIGP_STATUS_STOPPED; ··· 98 96 } 99 97 100 98 static int __sigp_external_call(struct kvm_vcpu *vcpu, 101 - struct kvm_vcpu *dst_vcpu) 99 + struct kvm_vcpu *dst_vcpu, u64 *reg) 102 100 { 103 101 struct kvm_s390_irq irq = { 104 102 .type = KVM_S390_INT_EXTERNAL_CALL, ··· 107 105 int rc; 108 106 109 107 rc = kvm_s390_inject_vcpu(dst_vcpu, &irq); 110 - if (!rc) 108 + if (rc == -EBUSY) { 109 + *reg &= 0xffffffff00000000UL; 110 + *reg |= SIGP_STATUS_EXT_CALL_PENDING; 111 + return SIGP_CC_STATUS_STORED; 112 + } else if (rc == 0) { 111 113 VCPU_EVENT(vcpu, 4, "sent sigp ext call to cpu %x", 112 114 dst_vcpu->vcpu_id); 115 + } 113 116 114 117 return rc ? 
rc : SIGP_CC_ORDER_CODE_ACCEPTED; 115 118 } 116 119 117 - static int __inject_sigp_stop(struct kvm_vcpu *dst_vcpu, int action) 118 - { 119 - struct kvm_s390_local_interrupt *li = &dst_vcpu->arch.local_int; 120 - int rc = SIGP_CC_ORDER_CODE_ACCEPTED; 121 - 122 - spin_lock(&li->lock); 123 - if (li->action_bits & ACTION_STOP_ON_STOP) { 124 - /* another SIGP STOP is pending */ 125 - rc = SIGP_CC_BUSY; 126 - goto out; 127 - } 128 - if ((atomic_read(li->cpuflags) & CPUSTAT_STOPPED)) { 129 - if ((action & ACTION_STORE_ON_STOP) != 0) 130 - rc = -ESHUTDOWN; 131 - goto out; 132 - } 133 - set_bit(IRQ_PEND_SIGP_STOP, &li->pending_irqs); 134 - li->action_bits |= action; 135 - atomic_set_mask(CPUSTAT_STOP_INT, li->cpuflags); 136 - kvm_s390_vcpu_wakeup(dst_vcpu); 137 - out: 138 - spin_unlock(&li->lock); 139 - 140 - return rc; 141 - } 142 - 143 120 static int __sigp_stop(struct kvm_vcpu *vcpu, struct kvm_vcpu *dst_vcpu) 144 121 { 122 + struct kvm_s390_irq irq = { 123 + .type = KVM_S390_SIGP_STOP, 124 + }; 145 125 int rc; 146 126 147 - rc = __inject_sigp_stop(dst_vcpu, ACTION_STOP_ON_STOP); 148 - VCPU_EVENT(vcpu, 4, "sent sigp stop to cpu %x", dst_vcpu->vcpu_id); 127 + rc = kvm_s390_inject_vcpu(dst_vcpu, &irq); 128 + if (rc == -EBUSY) 129 + rc = SIGP_CC_BUSY; 130 + else if (rc == 0) 131 + VCPU_EVENT(vcpu, 4, "sent sigp stop to cpu %x", 132 + dst_vcpu->vcpu_id); 149 133 150 134 return rc; 151 135 } ··· 139 151 static int __sigp_stop_and_store_status(struct kvm_vcpu *vcpu, 140 152 struct kvm_vcpu *dst_vcpu, u64 *reg) 141 153 { 154 + struct kvm_s390_irq irq = { 155 + .type = KVM_S390_SIGP_STOP, 156 + .u.stop.flags = KVM_S390_STOP_FLAG_STORE_STATUS, 157 + }; 142 158 int rc; 143 159 144 - rc = __inject_sigp_stop(dst_vcpu, ACTION_STOP_ON_STOP | 145 - ACTION_STORE_ON_STOP); 146 - VCPU_EVENT(vcpu, 4, "sent sigp stop and store status to cpu %x", 147 - dst_vcpu->vcpu_id); 148 - 149 - if (rc == -ESHUTDOWN) { 150 - /* If the CPU has already been stopped, we still have 151 - * to save the 
status when doing stop-and-store. This 152 - * has to be done after unlocking all spinlocks. */ 153 - rc = kvm_s390_store_status_unloaded(dst_vcpu, 154 - KVM_S390_STORE_STATUS_NOADDR); 155 - } 160 + rc = kvm_s390_inject_vcpu(dst_vcpu, &irq); 161 + if (rc == -EBUSY) 162 + rc = SIGP_CC_BUSY; 163 + else if (rc == 0) 164 + VCPU_EVENT(vcpu, 4, "sent sigp stop and store status to cpu %x", 165 + dst_vcpu->vcpu_id); 156 166 157 167 return rc; 158 168 } ··· 183 197 static int __sigp_set_prefix(struct kvm_vcpu *vcpu, struct kvm_vcpu *dst_vcpu, 184 198 u32 address, u64 *reg) 185 199 { 186 - struct kvm_s390_local_interrupt *li; 200 + struct kvm_s390_irq irq = { 201 + .type = KVM_S390_SIGP_SET_PREFIX, 202 + .u.prefix.address = address & 0x7fffe000u, 203 + }; 187 204 int rc; 188 - 189 - li = &dst_vcpu->arch.local_int; 190 205 191 206 /* 192 207 * Make sure the new value is valid memory. We only need to check the 193 208 * first page, since address is 8k aligned and memory pieces are always 194 209 * at least 1MB aligned and have at least a size of 1MB. 
195 210 */ 196 - address &= 0x7fffe000u; 197 - if (kvm_is_error_gpa(vcpu->kvm, address)) { 211 + if (kvm_is_error_gpa(vcpu->kvm, irq.u.prefix.address)) { 198 212 *reg &= 0xffffffff00000000UL; 199 213 *reg |= SIGP_STATUS_INVALID_PARAMETER; 200 214 return SIGP_CC_STATUS_STORED; 201 215 } 202 216 203 - spin_lock(&li->lock); 204 - /* cpu must be in stopped state */ 205 - if (!(atomic_read(li->cpuflags) & CPUSTAT_STOPPED)) { 217 + rc = kvm_s390_inject_vcpu(dst_vcpu, &irq); 218 + if (rc == -EBUSY) { 206 219 *reg &= 0xffffffff00000000UL; 207 220 *reg |= SIGP_STATUS_INCORRECT_STATE; 208 - rc = SIGP_CC_STATUS_STORED; 209 - goto out_li; 221 + return SIGP_CC_STATUS_STORED; 222 + } else if (rc == 0) { 223 + VCPU_EVENT(vcpu, 4, "set prefix of cpu %02x to %x", 224 + dst_vcpu->vcpu_id, irq.u.prefix.address); 210 225 } 211 226 212 - li->irq.prefix.address = address; 213 - set_bit(IRQ_PEND_SET_PREFIX, &li->pending_irqs); 214 - kvm_s390_vcpu_wakeup(dst_vcpu); 215 - rc = SIGP_CC_ORDER_CODE_ACCEPTED; 216 - 217 - VCPU_EVENT(vcpu, 4, "set prefix of cpu %02x to %x", dst_vcpu->vcpu_id, 218 - address); 219 - out_li: 220 - spin_unlock(&li->lock); 221 227 return rc; 222 228 } 223 229 ··· 220 242 int flags; 221 243 int rc; 222 244 223 - spin_lock(&dst_vcpu->arch.local_int.lock); 224 245 flags = atomic_read(dst_vcpu->arch.local_int.cpuflags); 225 - spin_unlock(&dst_vcpu->arch.local_int.lock); 226 246 if (!(flags & CPUSTAT_STOPPED)) { 227 247 *reg &= 0xffffffff00000000UL; 228 248 *reg |= SIGP_STATUS_INCORRECT_STATE; ··· 267 291 /* handle (RE)START in user space */ 268 292 int rc = -EOPNOTSUPP; 269 293 294 + /* make sure we don't race with STOP irq injection */ 270 295 spin_lock(&li->lock); 271 - if (li->action_bits & ACTION_STOP_ON_STOP) 296 + if (kvm_s390_is_stop_irq_pending(dst_vcpu)) 272 297 rc = SIGP_CC_BUSY; 273 298 spin_unlock(&li->lock); 274 299 ··· 310 333 break; 311 334 case SIGP_EXTERNAL_CALL: 312 335 vcpu->stat.instruction_sigp_external_call++; 313 - rc = __sigp_external_call(vcpu, 
dst_vcpu); 336 + rc = __sigp_external_call(vcpu, dst_vcpu, status_reg); 314 337 break; 315 338 case SIGP_EMERGENCY_SIGNAL: 316 339 vcpu->stat.instruction_sigp_emergency++; ··· 371 394 return rc; 372 395 } 373 396 397 + static int handle_sigp_order_in_user_space(struct kvm_vcpu *vcpu, u8 order_code) 398 + { 399 + if (!vcpu->kvm->arch.user_sigp) 400 + return 0; 401 + 402 + switch (order_code) { 403 + case SIGP_SENSE: 404 + case SIGP_EXTERNAL_CALL: 405 + case SIGP_EMERGENCY_SIGNAL: 406 + case SIGP_COND_EMERGENCY_SIGNAL: 407 + case SIGP_SENSE_RUNNING: 408 + return 0; 409 + /* update counters as we're directly dropping to user space */ 410 + case SIGP_STOP: 411 + vcpu->stat.instruction_sigp_stop++; 412 + break; 413 + case SIGP_STOP_AND_STORE_STATUS: 414 + vcpu->stat.instruction_sigp_stop_store_status++; 415 + break; 416 + case SIGP_STORE_STATUS_AT_ADDRESS: 417 + vcpu->stat.instruction_sigp_store_status++; 418 + break; 419 + case SIGP_SET_PREFIX: 420 + vcpu->stat.instruction_sigp_prefix++; 421 + break; 422 + case SIGP_START: 423 + vcpu->stat.instruction_sigp_start++; 424 + break; 425 + case SIGP_RESTART: 426 + vcpu->stat.instruction_sigp_restart++; 427 + break; 428 + case SIGP_INITIAL_CPU_RESET: 429 + vcpu->stat.instruction_sigp_init_cpu_reset++; 430 + break; 431 + case SIGP_CPU_RESET: 432 + vcpu->stat.instruction_sigp_cpu_reset++; 433 + break; 434 + default: 435 + vcpu->stat.instruction_sigp_unknown++; 436 + } 437 + 438 + VCPU_EVENT(vcpu, 4, "sigp order %u: completely handled in user space", 439 + order_code); 440 + 441 + return 1; 442 + } 443 + 374 444 int kvm_s390_handle_sigp(struct kvm_vcpu *vcpu) 375 445 { 376 446 int r1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4; ··· 432 408 return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); 433 409 434 410 order_code = kvm_s390_get_base_disp_rs(vcpu); 411 + if (handle_sigp_order_in_user_space(vcpu, order_code)) 412 + return -EOPNOTSUPP; 435 413 436 414 if (r1 % 2) 437 415 parameter = vcpu->run->s.regs.gprs[r1];
+8 -6
arch/s390/kvm/trace-s390.h
··· 209 209 * Trace point for a vcpu's stop requests. 210 210 */ 211 211 TRACE_EVENT(kvm_s390_stop_request, 212 - TP_PROTO(unsigned int action_bits), 213 - TP_ARGS(action_bits), 212 + TP_PROTO(unsigned char stop_irq, unsigned char flags), 213 + TP_ARGS(stop_irq, flags), 214 214 215 215 TP_STRUCT__entry( 216 - __field(unsigned int, action_bits) 216 + __field(unsigned char, stop_irq) 217 + __field(unsigned char, flags) 217 218 ), 218 219 219 220 TP_fast_assign( 220 - __entry->action_bits = action_bits; 221 + __entry->stop_irq = stop_irq; 222 + __entry->flags = flags; 221 223 ), 222 224 223 - TP_printk("stop request, action_bits = %08x", 224 - __entry->action_bits) 225 + TP_printk("stop request, stop irq = %u, flags = %08x", 226 + __entry->stop_irq, __entry->flags) 225 227 ); 226 228 227 229
+1
arch/x86/include/asm/kvm_emulate.h
··· 208 208 209 209 void (*get_cpuid)(struct x86_emulate_ctxt *ctxt, 210 210 u32 *eax, u32 *ebx, u32 *ecx, u32 *edx); 211 + void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked); 211 212 }; 212 213 213 214 typedef u32 __attribute__((vector_size(16))) sse128_t;
+52 -7
arch/x86/include/asm/kvm_host.h
··· 38 38 #define KVM_PRIVATE_MEM_SLOTS 3 39 39 #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS) 40 40 41 - #define KVM_MMIO_SIZE 16 42 - 43 41 #define KVM_PIO_PAGE_OFFSET 1 44 42 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 45 43 ··· 49 51 | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG)) 50 52 51 53 #define CR3_L_MODE_RESERVED_BITS 0xFFFFFF0000000000ULL 52 - #define CR3_PCID_INVD (1UL << 63) 54 + #define CR3_PCID_INVD BIT_64(63) 53 55 #define CR4_RESERVED_BITS \ 54 56 (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\ 55 57 | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \ ··· 157 159 #define DR7_GD (1 << 13) 158 160 #define DR7_FIXED_1 0x00000400 159 161 #define DR7_VOLATILE 0xffff2bff 162 + 163 + #define PFERR_PRESENT_BIT 0 164 + #define PFERR_WRITE_BIT 1 165 + #define PFERR_USER_BIT 2 166 + #define PFERR_RSVD_BIT 3 167 + #define PFERR_FETCH_BIT 4 168 + 169 + #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT) 170 + #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT) 171 + #define PFERR_USER_MASK (1U << PFERR_USER_BIT) 172 + #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT) 173 + #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT) 160 174 161 175 /* apic attention bits */ 162 176 #define KVM_APIC_CHECK_VAPIC 0 ··· 625 615 #ifdef CONFIG_KVM_MMU_AUDIT 626 616 int audit_point; 627 617 #endif 618 + 619 + bool boot_vcpu_runs_old_kvmclock; 628 620 }; 629 621 630 622 struct kvm_vm_stat { ··· 655 643 u32 irq_window_exits; 656 644 u32 nmi_window_exits; 657 645 u32 halt_exits; 646 + u32 halt_successful_poll; 658 647 u32 halt_wakeup; 659 648 u32 request_irq_exits; 660 649 u32 irq_exits; ··· 800 787 int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); 801 788 802 789 void (*sched_in)(struct kvm_vcpu *kvm, int cpu); 790 + 791 + /* 792 + * Arch-specific dirty logging hooks. These hooks are only supposed to 793 + * be valid if the specific arch has hardware-accelerated dirty logging 794 + * mechanism. Currently only for PML on VMX. 
795 + * 796 + * - slot_enable_log_dirty: 797 + * called when enabling log dirty mode for the slot. 798 + * - slot_disable_log_dirty: 799 + * called when disabling log dirty mode for the slot. 800 + * also called when slot is created with log dirty disabled. 801 + * - flush_log_dirty: 802 + * called before reporting dirty_bitmap to userspace. 803 + * - enable_log_dirty_pt_masked: 804 + * called when reenabling log dirty for the GFNs in the mask after 805 + * corresponding bits are cleared in slot->dirty_bitmap. 806 + */ 807 + void (*slot_enable_log_dirty)(struct kvm *kvm, 808 + struct kvm_memory_slot *slot); 809 + void (*slot_disable_log_dirty)(struct kvm *kvm, 810 + struct kvm_memory_slot *slot); 811 + void (*flush_log_dirty)(struct kvm *kvm); 812 + void (*enable_log_dirty_pt_masked)(struct kvm *kvm, 813 + struct kvm_memory_slot *slot, 814 + gfn_t offset, unsigned long mask); 803 815 }; 804 816 805 817 struct kvm_arch_async_pf { ··· 857 819 u64 dirty_mask, u64 nx_mask, u64 x_mask); 858 820 859 821 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); 860 - void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); 861 - void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, 862 - struct kvm_memory_slot *slot, 863 - gfn_t gfn_offset, unsigned long mask); 822 + void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 823 + struct kvm_memory_slot *memslot); 824 + void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, 825 + struct kvm_memory_slot *memslot); 826 + void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm, 827 + struct kvm_memory_slot *memslot); 828 + void kvm_mmu_slot_set_dirty(struct kvm *kvm, 829 + struct kvm_memory_slot *memslot); 830 + void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm, 831 + struct kvm_memory_slot *slot, 832 + gfn_t gfn_offset, unsigned long mask); 864 833 void kvm_mmu_zap_all(struct kvm *kvm); 865 834 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm); 866 835 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
+4
arch/x86/include/asm/vmx.h
··· 69 69 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING 0x00000400 70 70 #define SECONDARY_EXEC_ENABLE_INVPCID 0x00001000 71 71 #define SECONDARY_EXEC_SHADOW_VMCS 0x00004000 72 + #define SECONDARY_EXEC_ENABLE_PML 0x00020000 72 73 #define SECONDARY_EXEC_XSAVES 0x00100000 73 74 74 75 ··· 122 121 GUEST_LDTR_SELECTOR = 0x0000080c, 123 122 GUEST_TR_SELECTOR = 0x0000080e, 124 123 GUEST_INTR_STATUS = 0x00000810, 124 + GUEST_PML_INDEX = 0x00000812, 125 125 HOST_ES_SELECTOR = 0x00000c00, 126 126 HOST_CS_SELECTOR = 0x00000c02, 127 127 HOST_SS_SELECTOR = 0x00000c04, ··· 142 140 VM_EXIT_MSR_LOAD_ADDR_HIGH = 0x00002009, 143 141 VM_ENTRY_MSR_LOAD_ADDR = 0x0000200a, 144 142 VM_ENTRY_MSR_LOAD_ADDR_HIGH = 0x0000200b, 143 + PML_ADDRESS = 0x0000200e, 144 + PML_ADDRESS_HIGH = 0x0000200f, 145 145 TSC_OFFSET = 0x00002010, 146 146 TSC_OFFSET_HIGH = 0x00002011, 147 147 VIRTUAL_APIC_PAGE_ADDR = 0x00002012,
+3
arch/x86/include/uapi/asm/msr-index.h
··· 364 364 #define MSR_IA32_UCODE_WRITE 0x00000079 365 365 #define MSR_IA32_UCODE_REV 0x0000008b 366 366 367 + #define MSR_IA32_SMM_MONITOR_CTL 0x0000009b 368 + #define MSR_IA32_SMBASE 0x0000009e 369 + 367 370 #define MSR_IA32_PERF_STATUS 0x00000198 368 371 #define MSR_IA32_PERF_CTL 0x00000199 369 372 #define INTEL_PERF_CTL_MASK 0xffff
+6
arch/x86/include/uapi/asm/vmx.h
··· 56 56 #define EXIT_REASON_MSR_READ 31 57 57 #define EXIT_REASON_MSR_WRITE 32 58 58 #define EXIT_REASON_INVALID_STATE 33 59 + #define EXIT_REASON_MSR_LOAD_FAIL 34 59 60 #define EXIT_REASON_MWAIT_INSTRUCTION 36 60 61 #define EXIT_REASON_MONITOR_INSTRUCTION 39 61 62 #define EXIT_REASON_PAUSE_INSTRUCTION 40 ··· 73 72 #define EXIT_REASON_XSETBV 55 74 73 #define EXIT_REASON_APIC_WRITE 56 75 74 #define EXIT_REASON_INVPCID 58 75 + #define EXIT_REASON_PML_FULL 62 76 76 #define EXIT_REASON_XSAVES 63 77 77 #define EXIT_REASON_XRSTORS 64 78 78 ··· 118 116 { EXIT_REASON_APIC_WRITE, "APIC_WRITE" }, \ 119 117 { EXIT_REASON_EOI_INDUCED, "EOI_INDUCED" }, \ 120 118 { EXIT_REASON_INVALID_STATE, "INVALID_STATE" }, \ 119 + { EXIT_REASON_MSR_LOAD_FAIL, "MSR_LOAD_FAIL" }, \ 121 120 { EXIT_REASON_INVD, "INVD" }, \ 122 121 { EXIT_REASON_INVVPID, "INVVPID" }, \ 123 122 { EXIT_REASON_INVPCID, "INVPCID" }, \ 124 123 { EXIT_REASON_XSAVES, "XSAVES" }, \ 125 124 { EXIT_REASON_XRSTORS, "XRSTORS" } 125 + 126 + #define VMX_ABORT_SAVE_GUEST_MSR_FAIL 1 127 + #define VMX_ABORT_LOAD_HOST_MSR_FAIL 4 126 128 127 129 #endif /* _UAPIVMX_H */
+1
arch/x86/kvm/Kconfig
··· 39 39 select PERF_EVENTS 40 40 select HAVE_KVM_MSI 41 41 select HAVE_KVM_CPU_RELAX_INTERCEPT 42 + select KVM_GENERIC_DIRTYLOG_READ_PROTECT 42 43 select KVM_VFIO 43 44 select SRCU 44 45 ---help---
+165 -83
arch/x86/kvm/emulate.c
··· 86 86 #define DstAcc (OpAcc << DstShift) 87 87 #define DstDI (OpDI << DstShift) 88 88 #define DstMem64 (OpMem64 << DstShift) 89 + #define DstMem16 (OpMem16 << DstShift) 89 90 #define DstImmUByte (OpImmUByte << DstShift) 90 91 #define DstDX (OpDX << DstShift) 91 92 #define DstAccLo (OpAccLo << DstShift) ··· 125 124 #define RMExt (4<<15) /* Opcode extension in ModRM r/m if mod == 3 */ 126 125 #define Escape (5<<15) /* Escape to coprocessor instruction */ 127 126 #define InstrDual (6<<15) /* Alternate instruction decoding of mod == 3 */ 127 + #define ModeDual (7<<15) /* Different instruction for 32/64 bit */ 128 128 #define Sse (1<<18) /* SSE Vector instruction */ 129 129 /* Generic ModRM decode. */ 130 130 #define ModRM (1<<19) ··· 167 165 #define NoMod ((u64)1 << 47) /* Mod field is ignored */ 168 166 #define Intercept ((u64)1 << 48) /* Has valid intercept field */ 169 167 #define CheckPerm ((u64)1 << 49) /* Has valid check_perm field */ 170 - #define NoBigReal ((u64)1 << 50) /* No big real mode */ 171 168 #define PrivUD ((u64)1 << 51) /* #UD instead of #GP on CPL > 0 */ 172 169 #define NearBranch ((u64)1 << 52) /* Near branches */ 173 170 #define No16 ((u64)1 << 53) /* No 16 bit operand */ 171 + #define IncSP ((u64)1 << 54) /* SP is incremented before ModRM calc */ 174 172 175 173 #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) 176 174 ··· 215 213 const struct gprefix *gprefix; 216 214 const struct escape *esc; 217 215 const struct instr_dual *idual; 216 + const struct mode_dual *mdual; 218 217 void (*fastop)(struct fastop *fake); 219 218 } u; 220 219 int (*check_perm)(struct x86_emulate_ctxt *ctxt); ··· 243 240 struct opcode mod3; 244 241 }; 245 242 243 + struct mode_dual { 244 + struct opcode mode32; 245 + struct opcode mode64; 246 + }; 247 + 246 248 /* EFLAGS bit definitions. 
*/ 247 249 #define EFLG_ID (1<<21) 248 250 #define EFLG_VIP (1<<20) ··· 269 261 270 262 #define EFLG_RESERVED_ZEROS_MASK 0xffc0802a 271 263 #define EFLG_RESERVED_ONE_MASK 2 264 + 265 + enum x86_transfer_type { 266 + X86_TRANSFER_NONE, 267 + X86_TRANSFER_CALL_JMP, 268 + X86_TRANSFER_RET, 269 + X86_TRANSFER_TASK_SWITCH, 270 + }; 272 271 273 272 static ulong reg_read(struct x86_emulate_ctxt *ctxt, unsigned nr) 274 273 { ··· 684 669 } 685 670 if (addr.ea > lim) 686 671 goto bad; 687 - *max_size = min_t(u64, ~0u, (u64)lim + 1 - addr.ea); 688 - if (size > *max_size) 689 - goto bad; 672 + if (lim == 0xffffffff) 673 + *max_size = ~0u; 674 + else { 675 + *max_size = (u64)lim + 1 - addr.ea; 676 + if (size > *max_size) 677 + goto bad; 678 + } 690 679 la &= (u32)-1; 691 680 break; 692 681 } ··· 741 722 const struct desc_struct *cs_desc) 742 723 { 743 724 enum x86emul_mode mode = ctxt->mode; 725 + int rc; 744 726 745 727 #ifdef CONFIG_X86_64 746 - if (ctxt->mode >= X86EMUL_MODE_PROT32 && cs_desc->l) { 747 - u64 efer = 0; 728 + if (ctxt->mode >= X86EMUL_MODE_PROT16) { 729 + if (cs_desc->l) { 730 + u64 efer = 0; 748 731 749 - ctxt->ops->get_msr(ctxt, MSR_EFER, &efer); 750 - if (efer & EFER_LMA) 751 - mode = X86EMUL_MODE_PROT64; 732 + ctxt->ops->get_msr(ctxt, MSR_EFER, &efer); 733 + if (efer & EFER_LMA) 734 + mode = X86EMUL_MODE_PROT64; 735 + } else 736 + mode = X86EMUL_MODE_PROT32; /* temporary value */ 752 737 } 753 738 #endif 754 739 if (mode == X86EMUL_MODE_PROT16 || mode == X86EMUL_MODE_PROT32) 755 740 mode = cs_desc->d ? 
X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16; 756 - return assign_eip(ctxt, dst, mode); 741 + rc = assign_eip(ctxt, dst, mode); 742 + if (rc == X86EMUL_CONTINUE) 743 + ctxt->mode = mode; 744 + return rc; 757 745 } 758 746 759 747 static inline int jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) ··· 1083 1057 asm volatile("fnstcw %0": "+m"(fcw)); 1084 1058 ctxt->ops->put_fpu(ctxt); 1085 1059 1086 - /* force 2 byte destination */ 1087 - ctxt->dst.bytes = 2; 1088 1060 ctxt->dst.val = fcw; 1089 1061 1090 1062 return X86EMUL_CONTINUE; ··· 1099 1075 asm volatile("fnstsw %0": "+m"(fsw)); 1100 1076 ctxt->ops->put_fpu(ctxt); 1101 1077 1102 - /* force 2 byte destination */ 1103 - ctxt->dst.bytes = 2; 1104 1078 ctxt->dst.val = fsw; 1105 1079 1106 1080 return X86EMUL_CONTINUE; ··· 1245 1223 else { 1246 1224 modrm_ea += reg_read(ctxt, base_reg); 1247 1225 adjust_modrm_seg(ctxt, base_reg); 1226 + /* Increment ESP on POP [ESP] */ 1227 + if ((ctxt->d & IncSP) && 1228 + base_reg == VCPU_REGS_RSP) 1229 + modrm_ea += ctxt->op_bytes; 1248 1230 } 1249 1231 if (index_reg != 4) 1250 1232 modrm_ea += reg_read(ctxt, index_reg) << scale; ··· 1461 1435 ops->get_gdt(ctxt, dt); 1462 1436 } 1463 1437 1464 - /* allowed just for 8 bytes segments */ 1465 - static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, 1466 - u16 selector, struct desc_struct *desc, 1467 - ulong *desc_addr_p) 1468 - { 1469 - struct desc_ptr dt; 1470 - u16 index = selector >> 3; 1471 - ulong addr; 1472 - 1473 - get_descriptor_table_ptr(ctxt, selector, &dt); 1474 - 1475 - if (dt.size < index * 8 + 7) 1476 - return emulate_gp(ctxt, selector & 0xfffc); 1477 - 1478 - *desc_addr_p = addr = dt.address + index * 8; 1479 - return ctxt->ops->read_std(ctxt, addr, desc, sizeof *desc, 1480 - &ctxt->exception); 1481 - } 1482 - 1483 - /* allowed just for 8 bytes segments */ 1484 - static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt, 1485 - u16 selector, struct desc_struct *desc) 1438 + static int 
get_descriptor_ptr(struct x86_emulate_ctxt *ctxt, 1439 + u16 selector, ulong *desc_addr_p) 1486 1440 { 1487 1441 struct desc_ptr dt; 1488 1442 u16 index = selector >> 3; ··· 1474 1468 return emulate_gp(ctxt, selector & 0xfffc); 1475 1469 1476 1470 addr = dt.address + index * 8; 1471 + 1472 + #ifdef CONFIG_X86_64 1473 + if (addr >> 32 != 0) { 1474 + u64 efer = 0; 1475 + 1476 + ctxt->ops->get_msr(ctxt, MSR_EFER, &efer); 1477 + if (!(efer & EFER_LMA)) 1478 + addr &= (u32)-1; 1479 + } 1480 + #endif 1481 + 1482 + *desc_addr_p = addr; 1483 + return X86EMUL_CONTINUE; 1484 + } 1485 + 1486 + /* allowed just for 8 bytes segments */ 1487 + static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, 1488 + u16 selector, struct desc_struct *desc, 1489 + ulong *desc_addr_p) 1490 + { 1491 + int rc; 1492 + 1493 + rc = get_descriptor_ptr(ctxt, selector, desc_addr_p); 1494 + if (rc != X86EMUL_CONTINUE) 1495 + return rc; 1496 + 1497 + return ctxt->ops->read_std(ctxt, *desc_addr_p, desc, sizeof(*desc), 1498 + &ctxt->exception); 1499 + } 1500 + 1501 + /* allowed just for 8 bytes segments */ 1502 + static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt, 1503 + u16 selector, struct desc_struct *desc) 1504 + { 1505 + int rc; 1506 + ulong addr; 1507 + 1508 + rc = get_descriptor_ptr(ctxt, selector, &addr); 1509 + if (rc != X86EMUL_CONTINUE) 1510 + return rc; 1511 + 1477 1512 return ctxt->ops->write_std(ctxt, addr, desc, sizeof *desc, 1478 1513 &ctxt->exception); 1479 1514 } ··· 1522 1475 /* Does not support long mode */ 1523 1476 static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt, 1524 1477 u16 selector, int seg, u8 cpl, 1525 - bool in_task_switch, 1478 + enum x86_transfer_type transfer, 1526 1479 struct desc_struct *desc) 1527 1480 { 1528 1481 struct desc_struct seg_desc, old_desc; ··· 1576 1529 return ret; 1577 1530 1578 1531 err_code = selector & 0xfffc; 1579 - err_vec = in_task_switch ? 
-						TS_VECTOR : GP_VECTOR;
+		err_vec = (transfer == X86_TRANSFER_TASK_SWITCH) ? TS_VECTOR :
+						GP_VECTOR;

 	/* can't load system descriptor into segment selector */
-	if (seg <= VCPU_SREG_GS && !seg_desc.s)
+	if (seg <= VCPU_SREG_GS && !seg_desc.s) {
+		if (transfer == X86_TRANSFER_CALL_JMP)
+			return X86EMUL_UNHANDLEABLE;
 		goto exception;
+	}

 	if (!seg_desc.p) {
 		err_vec = (seg == VCPU_SREG_SS) ? SS_VECTOR : NP_VECTOR;
···

 	if (seg_desc.s) {
 		/* mark segment as accessed */
-		seg_desc.type |= 1;
-		ret = write_segment_descriptor(ctxt, selector, &seg_desc);
-		if (ret != X86EMUL_CONTINUE)
-			return ret;
+		if (!(seg_desc.type & 1)) {
+			seg_desc.type |= 1;
+			ret = write_segment_descriptor(ctxt, selector,
+						       &seg_desc);
+			if (ret != X86EMUL_CONTINUE)
+				return ret;
+		}
 	} else if (ctxt->mode == X86EMUL_MODE_PROT64) {
 		ret = ctxt->ops->read_std(ctxt, desc_addr+8, &base3,
 					  sizeof(base3), &ctxt->exception);
···
 				   u16 selector, int seg)
 {
 	u8 cpl = ctxt->ops->cpl(ctxt);
-	return __load_segment_descriptor(ctxt, selector, seg, cpl, false, NULL);
+	return __load_segment_descriptor(ctxt, selector, seg, cpl,
+					 X86_TRANSFER_NONE, NULL);
 }

 static void write_register_operand(struct operand *op)
···
 	unsigned long selector;
 	int rc;

-	rc = emulate_pop(ctxt, &selector, ctxt->op_bytes);
+	rc = emulate_pop(ctxt, &selector, 2);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;

 	if (ctxt->modrm_reg == VCPU_SREG_SS)
 		ctxt->interruptibility = KVM_X86_SHADOW_INT_MOV_SS;
+	if (ctxt->op_bytes > 2)
+		rsp_increment(ctxt, ctxt->op_bytes - 2);

 	rc = load_segment_descriptor(ctxt, (u16)selector, seg);
 	return rc;
···

 	ctxt->eflags &= ~EFLG_RESERVED_ZEROS_MASK; /* Clear reserved zeros */
 	ctxt->eflags |= EFLG_RESERVED_ONE_MASK;
+	ctxt->ops->set_nmi_mask(ctxt, false);

 	return rc;
 }
···

 	memcpy(&sel, ctxt->src.valptr + ctxt->op_bytes, 2);

-	rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl, false,
+	rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl,
+				       X86_TRANSFER_CALL_JMP,
 				       &new_desc);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
···
 	/* Outer-privilege level return is not implemented */
 	if (ctxt->mode >= X86EMUL_MODE_PROT16 && (cs & 3) > cpl)
 		return X86EMUL_UNHANDLEABLE;
-	rc = __load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS, cpl, false,
+	rc = __load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS, cpl,
+				       X86_TRANSFER_RET,
 				       &new_desc);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
···
 	fastop(ctxt, em_cmp);

 	if (ctxt->eflags & EFLG_ZF) {
-		/* Success: write back to memory. */
+		/* Success: write back to memory; no update of EAX */
+		ctxt->src.type = OP_NONE;
 		ctxt->dst.val = ctxt->src.orig_val;
 	} else {
 		/* Failure: write the value we saw to EAX. */
-		ctxt->dst.type = OP_REG;
-		ctxt->dst.addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX);
+		ctxt->src.type = OP_REG;
+		ctxt->src.addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX);
+		ctxt->src.val = ctxt->dst.orig_val;
+		/* Create write-cycle to dest by writing the same value */
 		ctxt->dst.val = ctxt->dst.orig_val;
 	}
 	return X86EMUL_CONTINUE;
···
 	 * it is handled in a context of new task
 	 */
 	ret = __load_segment_descriptor(ctxt, tss->ldt, VCPU_SREG_LDTR, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->es, VCPU_SREG_ES, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->cs, VCPU_SREG_CS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->ss, VCPU_SREG_SS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->ds, VCPU_SREG_DS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;

···
 	 * it is handled in a context of new task
 	 */
 	ret = __load_segment_descriptor(ctxt, tss->ldt_selector, VCPU_SREG_LDTR,
-					cpl, true, NULL);
+					cpl, X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->es, VCPU_SREG_ES, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->cs, VCPU_SREG_CS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->ss, VCPU_SREG_SS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->ds, VCPU_SREG_DS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->fs, VCPU_SREG_FS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;
 	ret = __load_segment_descriptor(ctxt, tss->gs, VCPU_SREG_GS, cpl,
-					true, NULL);
+					X86_TRANSFER_TASK_SWITCH, NULL);
 	if (ret != X86EMUL_CONTINUE)
 		return ret;

···
 	ret = ops->read_std(ctxt, old_tss_base, &tss_seg, sizeof tss_seg,
 			    &ctxt->exception);
 	if (ret != X86EMUL_CONTINUE)
-		/* FIXME: need to provide precise fault address */
 		return ret;

 	save_state_to_tss32(ctxt, &tss_seg);
···
 	ret = ops->write_std(ctxt, old_tss_base + eip_offset, &tss_seg.eip,
 			     ldt_sel_offset - eip_offset, &ctxt->exception);
 	if (ret != X86EMUL_CONTINUE)
-		/* FIXME: need to provide precise fault address */
 		return ret;

 	ret = ops->read_std(ctxt, new_tss_base, &tss_seg, sizeof tss_seg,
 			    &ctxt->exception);
 	if (ret != X86EMUL_CONTINUE)
-		/* FIXME: need to provide precise fault address */
 		return ret;

 	if (old_tss_sel != 0xffff) {
···
 				     sizeof tss_seg.prev_task_link,
 				     &ctxt->exception);
 		if (ret != X86EMUL_CONTINUE)
-			/* FIXME: need to provide precise fault address */
 			return ret;
 	}

···
 	struct desc_struct old_desc, new_desc;
 	const struct x86_emulate_ops *ops = ctxt->ops;
 	int cpl = ctxt->ops->cpl(ctxt);
+	enum x86emul_mode prev_mode = ctxt->mode;

 	old_eip = ctxt->_eip;
 	ops->get_segment(ctxt, &old_cs, &old_desc, NULL, VCPU_SREG_CS);

 	memcpy(&sel, ctxt->src.valptr + ctxt->op_bytes, 2);
-	rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl, false,
-				       &new_desc);
+	rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl,
+				       X86_TRANSFER_CALL_JMP, &new_desc);
 	if (rc != X86EMUL_CONTINUE)
-		return X86EMUL_CONTINUE;
+		return rc;

 	rc = assign_eip_far(ctxt, ctxt->src.val, &new_desc);
 	if (rc != X86EMUL_CONTINUE)
···
 	rc = em_push(ctxt);
 	/* If we failed, we tainted the memory, but the very least we should
 	   restore cs */
-	if (rc != X86EMUL_CONTINUE)
+	if (rc != X86EMUL_CONTINUE) {
+		pr_warn_once("faulting far call emulation tainted memory\n");
 		goto fail;
+	}
 	return rc;
 fail:
 	ops->set_segment(ctxt, old_cs, &old_desc, 0, VCPU_SREG_CS);
+	ctxt->mode = prev_mode;
 	return rc;

 }
···
 	return X86EMUL_CONTINUE;
 }

+static int em_movsxd(struct x86_emulate_ctxt *ctxt)
+{
+	ctxt->dst.val = (s32) ctxt->src.val;
+	return X86EMUL_CONTINUE;
+}
+
 static bool valid_cr(int nr)
 {
 	switch (nr) {
···
 #define G(_f, _g) { .flags = ((_f) | Group | ModRM), .u.group = (_g) }
 #define GD(_f, _g) { .flags = ((_f) | GroupDual | ModRM), .u.gdual = (_g) }
 #define ID(_f, _i) { .flags = ((_f) | InstrDual | ModRM), .u.idual = (_i) }
+#define MD(_f, _m) { .flags = ((_f) | ModeDual), .u.mdual = (_m) }
 #define E(_f, _e) { .flags = ((_f) | Escape | ModRM), .u.esc = (_e) }
 #define I(_f, _e) { .flags = (_f), .u.execute = (_e) }
 #define F(_f, _e) { .flags = (_f) | Fastop, .u.fastop = (_e) }
···

 static const struct opcode group1A[] = {
-	I(DstMem | SrcNone | Mov | Stack, em_pop), N, N, N, N, N, N, N,
+	I(DstMem | SrcNone | Mov | Stack | IncSP, em_pop), N, N, N, N, N, N, N,
 };

 static const struct opcode group2[] = {
···

 static const struct escape escape_d9 = { {
-	N, N, N, N, N, N, N, I(DstMem, em_fnstcw),
+	N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstcw),
 }, {
 	/* 0xC0 - 0xC7 */
 	N, N, N, N, N, N, N, N,
···

 static const struct escape escape_dd = { {
-	N, N, N, N, N, N, N, I(DstMem, em_fnstsw),
+	N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstsw),
 }, {
 	/* 0xC0 - 0xC7 */
 	N, N, N, N, N, N, N, N,
···

 static const struct instr_dual instr_dual_0f_c3 = {
 	I(DstMem | SrcReg | ModRM | No16 | Mov, em_mov), N
+};
+
+static const struct mode_dual mode_dual_63 = {
+	N, I(DstReg | SrcMem32 | ModRM | Mov, em_movsxd)
 };

 static const struct opcode opcode_table[256] = {
···
 	/* 0x60 - 0x67 */
 	I(ImplicitOps | Stack | No64, em_pusha),
 	I(ImplicitOps | Stack | No64, em_popa),
-	N, D(DstReg | SrcMem32 | ModRM | Mov) /* movsxd (x86/64) */ ,
+	N, MD(ModRM, &mode_dual_63),
 	N, N, N, N,
 	/* 0x68 - 0x6F */
 	I(SrcImm | Mov | Stack, em_push),
···
 	G(ByteOp, group11), G(0, group11),
 	/* 0xC8 - 0xCF */
 	I(Stack | SrcImmU16 | Src2ImmByte, em_enter), I(Stack, em_leave),
-	I(ImplicitOps | Stack | SrcImmU16, em_ret_far_imm),
-	I(ImplicitOps | Stack, em_ret_far),
+	I(ImplicitOps | SrcImmU16, em_ret_far_imm),
+	I(ImplicitOps, em_ret_far),
 	D(ImplicitOps), DI(SrcImmByte, intn),
 	D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret),
 	/* 0xD0 - 0xD7 */
···
 	F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
 	GD(0, &group15), F(DstReg | SrcMem | ModRM, em_imul),
 	/* 0xB0 - 0xB7 */
-	I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
+	I2bv(DstMem | SrcReg | ModRM | Lock | PageTable | SrcWrite, em_cmpxchg),
 	I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
 	F(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
 	I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
···
 #undef I
 #undef GP
 #undef EXT
+#undef MD
+#undef ID

 #undef D2bv
 #undef D2bvIP
···
 		else
 			opcode = opcode.u.idual->mod012;
 		break;
+	case ModeDual:
+		if (ctxt->mode == X86EMUL_MODE_PROT64)
+			opcode = opcode.u.mdual->mode64;
+		else
+			opcode = opcode.u.mdual->mode32;
+		break;
 	default:
 		return EMULATION_FAILED;
 	}
···
 		/* optimisation - avoid slow emulated read if Mov */
 		rc = segmented_read(ctxt, ctxt->dst.addr.mem,
 				   &ctxt->dst.val, ctxt->dst.bytes);
-		if (rc != X86EMUL_CONTINUE)
+		if (rc != X86EMUL_CONTINUE) {
+			if (!(ctxt->d & NoWrite) &&
+			    rc == X86EMUL_PROPAGATE_FAULT &&
+			    ctxt->exception.vector == PF_VECTOR)
+				ctxt->exception.error_code |= PFERR_WRITE_MASK;
 			goto done;
+		}
 	}
 	ctxt->dst.orig_val = ctxt->dst.val;
···
 		goto threebyte_insn;

 	switch (ctxt->b) {
-	case 0x63:		/* movsxd */
-		if (ctxt->mode != X86EMUL_MODE_PROT64)
-			goto cannot_emulate;
-		ctxt->dst.val = (s32) ctxt->src.val;
-		break;
 	case 0x70 ... 0x7f: /* jcc (short) */
 		if (test_cc(ctxt->b, ctxt->eflags))
 			rc = jmp_rel(ctxt, ctxt->src.val);
+1 -1
arch/x86/kvm/ioapic.h
···
 }

 void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu);
-int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
+bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 		int short_hand, unsigned int dest, int dest_mode);
 int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2);
 void kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, int vector,
+3 -1
arch/x86/kvm/iommu.c
···

 		gfn += page_size >> PAGE_SHIFT;

-
+		cond_resched();
 	}

 	return 0;
···
 		kvm_unpin_pages(kvm, pfn, unmap_pages);

 		gfn += unmap_pages;
+
+		cond_resched();
 	}
 }
+96 -51
arch/x86/kvm/lapic.c
···
 #include <asm/page.h>
 #include <asm/current.h>
 #include <asm/apicdef.h>
+#include <asm/delay.h>
 #include <linux/atomic.h>
 #include <linux/jump_label.h>
 #include "kvm_cache_regs.h"
···
 	return count;
 }

-void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+void __kvm_apic_update_irr(u32 *pir, void *regs)
 {
 	u32 i, pir_val;
-	struct kvm_lapic *apic = vcpu->arch.apic;

 	for (i = 0; i <= 7; i++) {
 		pir_val = xchg(&pir[i], 0);
 		if (pir_val)
-			*((u32 *)(apic->regs + APIC_IRR + i * 0x10)) |= pir_val;
+			*((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
 	}
+}
+EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
+
+void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	__kvm_apic_update_irr(pir, apic->regs);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
···
 	 * because the processor can modify ISR under the hood. Instead
 	 * just set SVI.
 	 */
-	if (unlikely(kvm_apic_vid_enabled(vcpu->kvm)))
+	if (unlikely(kvm_x86_ops->hwapic_isr_update))
 		kvm_x86_ops->hwapic_isr_update(vcpu->kvm, vec);
 	else {
 		++apic->isr_count;
···
 	 * on the other hand isr_count and highest_isr_cache are unused
 	 * and must be left alone.
 	 */
-	if (unlikely(kvm_apic_vid_enabled(vcpu->kvm)))
+	if (unlikely(kvm_x86_ops->hwapic_isr_update))
 		kvm_x86_ops->hwapic_isr_update(vcpu->kvm,
 				apic_find_highest_isr(apic));
 	else {
···
 	apic_update_ppr(apic);
 }

-static int kvm_apic_broadcast(struct kvm_lapic *apic, u32 dest)
+static bool kvm_apic_broadcast(struct kvm_lapic *apic, u32 dest)
 {
 	return dest == (apic_x2apic_mode(apic) ?
 			X2APIC_BROADCAST : APIC_BROADCAST);
 }

-int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest)
+static bool kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest)
 {
 	return kvm_apic_id(apic) == dest || kvm_apic_broadcast(apic, dest);
 }

-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
+static bool kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
 {
-	int result = 0;
 	u32 logical_id;

 	if (kvm_apic_broadcast(apic, mda))
-		return 1;
+		return true;

-	if (apic_x2apic_mode(apic)) {
-		logical_id = kvm_apic_get_reg(apic, APIC_LDR);
-		return logical_id & mda;
-	}
+	logical_id = kvm_apic_get_reg(apic, APIC_LDR);

-	logical_id = GET_APIC_LOGICAL_ID(kvm_apic_get_reg(apic, APIC_LDR));
+	if (apic_x2apic_mode(apic))
+		return ((logical_id >> 16) == (mda >> 16))
+		       && (logical_id & mda & 0xffff) != 0;
+
+	logical_id = GET_APIC_LOGICAL_ID(logical_id);

 	switch (kvm_apic_get_reg(apic, APIC_DFR)) {
 	case APIC_DFR_FLAT:
-		if (logical_id & mda)
-			result = 1;
-		break;
+		return (logical_id & mda) != 0;
 	case APIC_DFR_CLUSTER:
-		if (((logical_id >> 4) == (mda >> 0x4))
-		    && (logical_id & mda & 0xf))
-			result = 1;
-		break;
+		return ((logical_id >> 4) == (mda >> 4))
+		       && (logical_id & mda & 0xf) != 0;
 	default:
 		apic_debug("Bad DFR vcpu %d: %08x\n",
 			   apic->vcpu->vcpu_id, kvm_apic_get_reg(apic, APIC_DFR));
-		break;
+		return false;
 	}
-
-	return result;
 }

-int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
+bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 		   int short_hand, unsigned int dest, int dest_mode)
 {
-	int result = 0;
 	struct kvm_lapic *target = vcpu->arch.apic;

 	apic_debug("target %p, source %p, dest 0x%x, "
···
 	ASSERT(target);
 	switch (short_hand) {
 	case APIC_DEST_NOSHORT:
-		if (dest_mode == 0)
-			/* Physical mode. */
-			result = kvm_apic_match_physical_addr(target, dest);
+		if (dest_mode == APIC_DEST_PHYSICAL)
+			return kvm_apic_match_physical_addr(target, dest);
 		else
-			/* Logical mode. */
-			result = kvm_apic_match_logical_addr(target, dest);
-		break;
+			return kvm_apic_match_logical_addr(target, dest);
 	case APIC_DEST_SELF:
-		result = (target == source);
-		break;
+		return target == source;
 	case APIC_DEST_ALLINC:
-		result = 1;
-		break;
+		return true;
 	case APIC_DEST_ALLBUT:
-		result = (target != source);
-		break;
+		return target != source;
 	default:
 		apic_debug("kvm: apic: Bad dest shorthand value %x\n",
 			   short_hand);
-		break;
+		return false;
 	}
-
-	return result;
 }

 bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
···

 	ret = true;

-	if (irq->dest_mode == 0) { /* physical mode */
+	if (irq->dest_mode == APIC_DEST_PHYSICAL) {
 		if (irq->dest_id >= ARRAY_SIZE(map->phys_map))
 			goto out;
···
 {
 	struct kvm_vcpu *vcpu = apic->vcpu;
 	wait_queue_head_t *q = &vcpu->wq;
+	struct kvm_timer *ktimer = &apic->lapic_timer;

-	/*
-	 * Note: KVM_REQ_PENDING_TIMER is implicitly checked in
-	 * vcpu_enter_guest.
-	 */
 	if (atomic_read(&apic->lapic_timer.pending))
 		return;

 	atomic_inc(&apic->lapic_timer.pending);
-	/* FIXME: this code should not know anything about vcpus */
-	kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
+	kvm_set_pending_timer(vcpu);

 	if (waitqueue_active(q))
 		wake_up_interruptible(q);
+
+	if (apic_lvtt_tscdeadline(apic))
+		ktimer->expired_tscdeadline = ktimer->tscdeadline;
+}
+
+/*
+ * On APICv, this test will cause a busy wait
+ * during a higher-priority task.
+ */
+
+static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u32 reg = kvm_apic_get_reg(apic, APIC_LVTT);
+
+	if (kvm_apic_hw_enabled(apic)) {
+		int vec = reg & APIC_VECTOR_MASK;
+		void *bitmap = apic->regs + APIC_ISR;
+
+		if (kvm_x86_ops->deliver_posted_interrupt)
+			bitmap = apic->regs + APIC_IRR;
+
+		if (apic_test_vector(vec, bitmap))
+			return true;
+	}
+	return false;
+}
+
+void wait_lapic_expire(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 guest_tsc, tsc_deadline;
+
+	if (!kvm_vcpu_has_lapic(vcpu))
+		return;
+
+	if (apic->lapic_timer.expired_tscdeadline == 0)
+		return;
+
+	if (!lapic_timer_int_injected(vcpu))
+		return;
+
+	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
+	apic->lapic_timer.expired_tscdeadline = 0;
+	guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
+	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
+
+	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
+	if (guest_tsc < tsc_deadline)
+		__delay(tsc_deadline - guest_tsc);
 }

 static void start_apic_timer(struct kvm_lapic *apic)
 {
 	ktime_t now;
+
 	atomic_set(&apic->lapic_timer.pending, 0);

 	if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) {
···
 		/* lapic timer in tsc deadline mode */
 		u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
 		u64 ns = 0;
+		ktime_t expire;
 		struct kvm_vcpu *vcpu = apic->vcpu;
 		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
 		unsigned long flags;
···
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
+			expire = ktime_add_ns(now, ns);
+			expire = ktime_sub_ns(expire, lapic_timer_advance_ns);
 			hrtimer_start(&apic->lapic_timer.timer,
-				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+				      expire, HRTIMER_MODE_ABS);
 		} else
 			apic_timer_expired(apic);
···
 	if (kvm_x86_ops->hwapic_irr_update)
 		kvm_x86_ops->hwapic_irr_update(vcpu,
 				apic_find_highest_irr(apic));
-	kvm_x86_ops->hwapic_isr_update(vcpu->kvm, apic_find_highest_isr(apic));
+	if (unlikely(kvm_x86_ops->hwapic_isr_update))
+		kvm_x86_ops->hwapic_isr_update(vcpu->kvm,
+				apic_find_highest_isr(apic));
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	kvm_rtc_eoi_tracking_restore_one(vcpu);
 }
+4 -2
arch/x86/kvm/lapic.h
···
 	u32 timer_mode;
 	u32 timer_mode_mask;
 	u64 tscdeadline;
+	u64 expired_tscdeadline;
 	atomic_t pending;			/* accumulated triggered timers */
 };
···
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);

 void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
+void __kvm_apic_update_irr(u32 *pir, void *regs);
 void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
-int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest);
-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 		unsigned long *dest_map);
 int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
···
 }

 bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
+
+void wait_lapic_expire(struct kvm_vcpu *vcpu);

 #endif
+287 -64
arch/x86/kvm/mmu.c
···
 #undef MMU_DEBUG

 #ifdef MMU_DEBUG
+static bool dbg = 0;
+module_param(dbg, bool, 0644);

 #define pgprintk(x...) do { if (dbg) printk(x); } while (0)
 #define rmap_printk(x...) do { if (dbg) printk(x); } while (0)
-
+#define MMU_WARN_ON(x) WARN_ON(x)
 #else
-
 #define pgprintk(x...) do { } while (0)
 #define rmap_printk(x...) do { } while (0)
-
-#endif
-
-#ifdef MMU_DEBUG
-static bool dbg = 0;
-module_param(dbg, bool, 0644);
-#endif
-
-#ifndef MMU_DEBUG
-#define ASSERT(x) do { } while (0)
-#else
-#define ASSERT(x) \
-	if (!(x)) { \
-		printk(KERN_WARNING "assertion failed %s:%d: %s\n", \
-		       __FILE__, __LINE__, #x); \
-	}
+#define MMU_WARN_ON(x) do { } while (0)
 #endif

 #define PTE_PREFETCH_NUM		8
···
 	return (old_spte & bit_mask) && !(new_spte & bit_mask);
 }

+static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+	return (old_spte & bit_mask) != (new_spte & bit_mask);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
···

 	if (!shadow_accessed_mask)
 		return ret;
+
+	/*
+	 * Flush TLB when accessed/dirty bits are changed in the page tables,
+	 * to guarantee consistency between TLB and page tables.
+	 */
+	if (spte_is_bit_changed(old_spte, new_spte,
+				shadow_accessed_mask | shadow_dirty_mask))
+		ret = true;

 	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
···
 	return flush;
 }

+static bool spte_clear_dirty(struct kvm *kvm, u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_clear_dirty: spte %p %llx\n", sptep, *sptep);
+
+	spte &= ~shadow_dirty_mask;
+
+	return mmu_spte_update(sptep, spte);
+}
+
+static bool __rmap_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		flush |= spte_clear_dirty(kvm, sptep);
+		sptep = rmap_get_next(&iter);
+	}
+
+	return flush;
+}
+
+static bool spte_set_dirty(struct kvm *kvm, u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep);
+
+	spte |= shadow_dirty_mask;
+
+	return mmu_spte_update(sptep, spte);
+}
+
+static bool __rmap_set_dirty(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		flush |= spte_set_dirty(kvm, sptep);
+		sptep = rmap_get_next(&iter);
+	}
+
+	return flush;
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
···
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 				     struct kvm_memory_slot *slot,
 				     gfn_t gfn_offset, unsigned long mask)
 {
···
 		/* clear the first set bit */
 		mask &= mask - 1;
 	}
+}
+
+/**
+ * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to clear D-bit
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should clear D-bit
+ *
+ * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap.
+ */
+void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+				   struct kvm_memory_slot *slot,
+				   gfn_t gfn_offset, unsigned long mask)
+{
+	unsigned long *rmapp;
+
+	while (mask) {
+		rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+				      PT_PAGE_TABLE_LEVEL, slot);
+		__rmap_clear_dirty(kvm, rmapp);
+
+		/* clear the first set bit */
+		mask &= mask - 1;
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_clear_dirty_pt_masked);
+
+/**
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * PT level pages.
+ *
+ * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
+ * enable dirty logging for them.
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+				struct kvm_memory_slot *slot,
+				gfn_t gfn_offset, unsigned long mask)
+{
+	if (kvm_x86_ops->enable_log_dirty_pt_masked)
+		kvm_x86_ops->enable_log_dirty_pt_masked(kvm, slot, gfn_offset,
+				mask);
+	else
+		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }

 static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
···

 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
-	ASSERT(is_empty_shadow_page(sp->spt));
+	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
···
 		}
 	}

-	if (pte_access & ACC_WRITE_MASK)
+	if (pte_access & ACC_WRITE_MASK) {
 		mark_page_dirty(vcpu->kvm, gfn);
+		spte |= shadow_dirty_mask;
+	}

 set_pte:
 	if (mmu_spte_update(sptep, spte))
···
 	 */
 	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);

+	/*
+	 * Theoretically we could also set dirty bit (and flush TLB) here in
+	 * order to eliminate unnecessary PML logging. See comments in
+	 * set_spte. But fast_page_fault is very unlikely to happen with PML
+	 * enabled, so we do not do this. This might result in the same GPA
+	 * to be logged in PML buffer again when the write really happens, and
+	 * eventually to be called by mark_page_dirty twice. But it's also no
+	 * harm. This also avoids the TLB flush needed after setting dirty bit
+	 * so non-PML cases won't be impacted.
+	 *
+	 * Compare with set_spte where instead shadow_dirty_mask is set.
+	 */
 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
 		mark_page_dirty(vcpu->kvm, gfn);
···
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];

-		ASSERT(!VALID_PAGE(root));
+		MMU_WARN_ON(VALID_PAGE(root));
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
 		sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
···
 	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;

-		ASSERT(!VALID_PAGE(root));
+		MMU_WARN_ON(VALID_PAGE(root));

 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
···
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];

-		ASSERT(!VALID_PAGE(root));
+		MMU_WARN_ON(VALID_PAGE(root));
 		if (vcpu->arch.mmu.root_level == PT32E_ROOT_LEVEL) {
 			pdptr = vcpu->arch.mmu.get_pdptr(vcpu, i);
 			if (!is_present_gpte(pdptr)) {
···
 	if (r)
 		return r;

-	ASSERT(vcpu);
-	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
+	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu.root_hpa));

 	gfn = gva >> PAGE_SHIFT;
···
 	int write = error_code & PFERR_WRITE_MASK;
 	bool map_writable;

-	ASSERT(vcpu);
-	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
+	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu.root_hpa));

 	if (unlikely(error_code & PFERR_RSVD_MASK)) {
 		r = handle_mmio_page_fault(vcpu, gpa, error_code, true);
···
 	update_permission_bitmask(vcpu, context, false);
 	update_last_pte_bitmap(vcpu, context);

-	ASSERT(is_pae(vcpu));
+	MMU_WARN_ON(!is_pae(vcpu));
 	context->page_fault = paging64_page_fault;
 	context->gva_to_gpa = paging64_gva_to_gpa;
 	context->sync_page = paging64_sync_page;
···

 static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 {
-	struct kvm_mmu *context = vcpu->arch.walk_mmu;
+	struct kvm_mmu *context = &vcpu->arch.mmu;

 	context->base_role.word = 0;
 	context->page_fault = tdp_page_fault;
···
 	update_last_pte_bitmap(vcpu, context);
 }

-void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
 {
 	bool smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
-	ASSERT(vcpu);
-	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+
+	MMU_WARN_ON(VALID_PAGE(context->root_hpa));

 	if (!is_paging(vcpu))
 		nonpaging_init_context(vcpu, context);
···
 	else
 		paging32_init_context(vcpu, context);

-	vcpu->arch.mmu.base_role.nxe = is_nx(vcpu);
-	vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
-	vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
-	vcpu->arch.mmu.base_role.smep_andnot_wp
+	context->base_role.nxe = is_nx(vcpu);
+	context->base_role.cr4_pae = !!is_pae(vcpu);
+	context->base_role.cr0_wp = is_write_protection(vcpu);
+	context->base_role.smep_andnot_wp
 		= smep && !is_write_protection(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);

-void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
-		bool execonly)
+void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly)
 {
-	ASSERT(vcpu);
-	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+
+	MMU_WARN_ON(VALID_PAGE(context->root_hpa));

 	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
···

 static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
-	kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
-	vcpu->arch.walk_mmu->set_cr3 = kvm_x86_ops->set_cr3;
-	vcpu->arch.walk_mmu->get_cr3 = get_cr3;
-	vcpu->arch.walk_mmu->get_pdptr = kvm_pdptr_read;
-	vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+
+	kvm_init_shadow_mmu(vcpu);
+	context->set_cr3 = kvm_x86_ops->set_cr3;
+	context->get_cr3 = get_cr3;
+	context->get_pdptr = kvm_pdptr_read;
+	context->inject_page_fault = kvm_inject_page_fault;
 }

 static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
···
 static void init_kvm_mmu(struct kvm_vcpu *vcpu)
 {
 	if (mmu_is_nested(vcpu))
-		return init_kvm_nested_mmu(vcpu);
+		init_kvm_nested_mmu(vcpu);
 	else if (tdp_enabled)
-		return init_kvm_tdp_mmu(vcpu);
+		init_kvm_tdp_mmu(vcpu);
 	else
-		return init_kvm_softmmu(vcpu);
+		init_kvm_softmmu(vcpu);
 }

 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
 {
-	ASSERT(vcpu);
-
 	kvm_mmu_unload(vcpu);
 	init_kvm_mmu(vcpu);
 }
···
 	struct page *page;
 	int i;

-	ASSERT(vcpu);
-
 	/*
 	 * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
 	 * Therefore we need to allocate shadow page tables in the first
···

 int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
-	ASSERT(vcpu);
-
 	vcpu->arch.walk_mmu = &vcpu->arch.mmu;
 	vcpu->arch.mmu.root_hpa = INVALID_PAGE;
 	vcpu->arch.mmu.translate_gpa = translate_gpa;
···
 void kvm_mmu_setup(struct kvm_vcpu *vcpu)
 {
-	ASSERT(vcpu);
-	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+	MMU_WARN_ON(VALID_PAGE(vcpu->arch.mmu.root_hpa));

 	init_kvm_mmu(vcpu);
 }

-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+				      struct kvm_memory_slot *memslot)
 {
-	struct kvm_memory_slot *memslot;
 	gfn_t last_gfn;
 	int i;
+	bool flush = false;

-	memslot = id_to_memslot(kvm->memslots, slot);
 	last_gfn = memslot->base_gfn + memslot->npages - 1;

 	spin_lock(&kvm->mmu_lock);
···

 		for (index = 0; index <= last_index; ++index, ++rmapp) {
 			if (*rmapp)
-				__rmap_write_protect(kvm, rmapp, false);
+				flush |= __rmap_write_protect(kvm, rmapp,
+						false);

 			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
 				cond_resched_lock(&kvm->mmu_lock);
···
 	 * instead of PT_WRITABLE_MASK, that means it does not depend
 	 * on PT_WRITABLE_MASK anymore.
 	 */
-	kvm_flush_remote_tlbs(kvm);
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
 }
+
+void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
+				   struct kvm_memory_slot *memslot)
+{
+	gfn_t last_gfn;
+	unsigned long *rmapp;
+	unsigned long last_index, index;
+	bool flush = false;
+
+	last_gfn = memslot->base_gfn + memslot->npages - 1;
+
+	spin_lock(&kvm->mmu_lock);
+
+	rmapp = memslot->arch.rmap[PT_PAGE_TABLE_LEVEL - 1];
+	last_index = gfn_to_index(last_gfn, memslot->base_gfn,
+			PT_PAGE_TABLE_LEVEL);
+
+	for (index = 0; index <= last_index; ++index, ++rmapp) {
+		if (*rmapp)
+			flush |= __rmap_clear_dirty(kvm, rmapp);
+
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+			cond_resched_lock(&kvm->mmu_lock);
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	/*
+	 * It's also safe to flush TLBs out of mmu lock here as currently this
+	 * function is only used for dirty logging, in which case flushing TLB
+	 * out of mmu lock also guarantees no dirty pages will be lost in
+	 * dirty_bitmap.
4392 + */ 4393 + if (flush) 4394 + kvm_flush_remote_tlbs(kvm); 4395 + } 4396 + EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty); 4397 + 4398 + void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm, 4399 + struct kvm_memory_slot *memslot) 4400 + { 4401 + gfn_t last_gfn; 4402 + int i; 4403 + bool flush = false; 4404 + 4405 + last_gfn = memslot->base_gfn + memslot->npages - 1; 4406 + 4407 + spin_lock(&kvm->mmu_lock); 4408 + 4409 + for (i = PT_PAGE_TABLE_LEVEL + 1; /* skip rmap for 4K page */ 4410 + i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) { 4411 + unsigned long *rmapp; 4412 + unsigned long last_index, index; 4413 + 4414 + rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL]; 4415 + last_index = gfn_to_index(last_gfn, memslot->base_gfn, i); 4416 + 4417 + for (index = 0; index <= last_index; ++index, ++rmapp) { 4418 + if (*rmapp) 4419 + flush |= __rmap_write_protect(kvm, rmapp, 4420 + false); 4421 + 4422 + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) 4423 + cond_resched_lock(&kvm->mmu_lock); 4424 + } 4425 + } 4426 + spin_unlock(&kvm->mmu_lock); 4427 + 4428 + /* see kvm_mmu_slot_remove_write_access */ 4429 + lockdep_assert_held(&kvm->slots_lock); 4430 + 4431 + if (flush) 4432 + kvm_flush_remote_tlbs(kvm); 4433 + } 4434 + EXPORT_SYMBOL_GPL(kvm_mmu_slot_largepage_remove_write_access); 4435 + 4436 + void kvm_mmu_slot_set_dirty(struct kvm *kvm, 4437 + struct kvm_memory_slot *memslot) 4438 + { 4439 + gfn_t last_gfn; 4440 + int i; 4441 + bool flush = false; 4442 + 4443 + last_gfn = memslot->base_gfn + memslot->npages - 1; 4444 + 4445 + spin_lock(&kvm->mmu_lock); 4446 + 4447 + for (i = PT_PAGE_TABLE_LEVEL; 4448 + i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) { 4449 + unsigned long *rmapp; 4450 + unsigned long last_index, index; 4451 + 4452 + rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL]; 4453 + last_index = gfn_to_index(last_gfn, memslot->base_gfn, i); 4454 + 4455 + for (index = 0; index <= last_index; ++index, ++rmapp) { 4456 + if (*rmapp) 
4457 + flush |= __rmap_set_dirty(kvm, rmapp); 4458 + 4459 + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) 4460 + cond_resched_lock(&kvm->mmu_lock); 4461 + } 4462 + } 4463 + 4464 + spin_unlock(&kvm->mmu_lock); 4465 + 4466 + lockdep_assert_held(&kvm->slots_lock); 4467 + 4468 + /* see kvm_mmu_slot_leaf_clear_dirty */ 4469 + if (flush) 4470 + kvm_flush_remote_tlbs(kvm); 4471 + } 4472 + EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty); 4466 4473 4467 4474 #define BATCH_ZAP_PAGES 10 4468 4475 static void kvm_zap_obsolete_pages(struct kvm *kvm) ··· 4831 4606 4832 4607 void kvm_mmu_destroy(struct kvm_vcpu *vcpu) 4833 4608 { 4834 - ASSERT(vcpu); 4835 - 4836 4609 kvm_mmu_unload(vcpu); 4837 4610 free_mmu_pages(vcpu); 4838 4611 mmu_free_memory_caches(vcpu);
+2 -15
arch/x86/kvm/mmu.h
···
 44  44	#define PT_DIRECTORY_LEVEL 2
 45  45	#define PT_PAGE_TABLE_LEVEL 1
 46  46
 47	-	#define PFERR_PRESENT_BIT 0
 48	-	#define PFERR_WRITE_BIT 1
 49	-	#define PFERR_USER_BIT 2
 50	-	#define PFERR_RSVD_BIT 3
 51	-	#define PFERR_FETCH_BIT 4
 52	-
 53	-	#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 54	-	#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
 55	-	#define PFERR_USER_MASK (1U << PFERR_USER_BIT)
 56	-	#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
 57	-	#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
 58	-
 59  47	static inline u64 rsvd_bits(int s, int e)
 60  48	{
 61  49		return ((1ULL << (e - s + 1)) - 1) << s;
···
 69  81	};
 70  82
 71  83	int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 72	-	void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 73	-	void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
 74	-			bool execonly);
	 84 +	void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu);
	 85 +	void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly);
 75  86	void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 76  87			bool ept);
 77  88
+2 -2
arch/x86/kvm/svm.c
···
2003 2003
2004 2004	static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
2005 2005	{
2006	-		kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
2007	-
	2006 +		WARN_ON(mmu_is_nested(vcpu));
	2007 +		kvm_init_shadow_mmu(vcpu);
2008 2008		vcpu->arch.mmu.set_cr3 = nested_svm_set_tdp_cr3;
2009 2009		vcpu->arch.mmu.get_cr3 = nested_svm_get_tdp_cr3;
2010 2010		vcpu->arch.mmu.get_pdptr = nested_svm_get_tdp_pdptr;
+38
arch/x86/kvm/trace.h
···
 848  848
 849  849	#endif /* CONFIG_X86_64 */
 850  850
	 851 +	/*
	 852 +	 * Tracepoint for PML full VMEXIT.
	 853 +	 */
	 854 +	TRACE_EVENT(kvm_pml_full,
	 855 +		TP_PROTO(unsigned int vcpu_id),
	 856 +		TP_ARGS(vcpu_id),
	 857 +
	 858 +		TP_STRUCT__entry(
	 859 +			__field( unsigned int, vcpu_id )
	 860 +		),
	 861 +
	 862 +		TP_fast_assign(
	 863 +			__entry->vcpu_id = vcpu_id;
	 864 +		),
	 865 +
	 866 +		TP_printk("vcpu %d: PML full", __entry->vcpu_id)
	 867 +	);
	 868 +
 851  869	TRACE_EVENT(kvm_ple_window,
 852  870		TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
 853  871		TP_ARGS(grow, vcpu_id, new, old),
···
 930  912			  __entry->tsc_to_system_mul,
 931  913			  __entry->tsc_shift,
 932  914			  __entry->flags)
	 915 +	);
	 916 +
	 917 +	TRACE_EVENT(kvm_wait_lapic_expire,
	 918 +		TP_PROTO(unsigned int vcpu_id, s64 delta),
	 919 +		TP_ARGS(vcpu_id, delta),
	 920 +
	 921 +		TP_STRUCT__entry(
	 922 +			__field( unsigned int, vcpu_id )
	 923 +			__field( s64, delta )
	 924 +		),
	 925 +
	 926 +		TP_fast_assign(
	 927 +			__entry->vcpu_id = vcpu_id;
	 928 +			__entry->delta = delta;
	 929 +		),
	 930 +
	 931 +		TP_printk("vcpu %u: delta %lld (%s)",
	 932 +			  __entry->vcpu_id,
	 933 +			  __entry->delta,
	 934 +			  __entry->delta < 0 ? "early" : "late")
 933  935	);
 934  936
 935  937	#endif /* _TRACE_KVM_H */
+956 -136
arch/x86/kvm/vmx.c
··· 45 45 #include <asm/perf_event.h> 46 46 #include <asm/debugreg.h> 47 47 #include <asm/kexec.h> 48 + #include <asm/apic.h> 48 49 49 50 #include "trace.h" 50 51 ··· 101 100 module_param(nested, bool, S_IRUGO); 102 101 103 102 static u64 __read_mostly host_xss; 103 + 104 + static bool __read_mostly enable_pml = 1; 105 + module_param_named(pml, enable_pml, bool, S_IRUGO); 104 106 105 107 #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD) 106 108 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE) ··· 219 215 u64 tsc_offset; 220 216 u64 virtual_apic_page_addr; 221 217 u64 apic_access_addr; 218 + u64 posted_intr_desc_addr; 222 219 u64 ept_pointer; 220 + u64 eoi_exit_bitmap0; 221 + u64 eoi_exit_bitmap1; 222 + u64 eoi_exit_bitmap2; 223 + u64 eoi_exit_bitmap3; 223 224 u64 xss_exit_bitmap; 224 225 u64 guest_physical_address; 225 226 u64 vmcs_link_pointer; ··· 339 330 u32 vmx_preemption_timer_value; 340 331 u32 padding32[7]; /* room for future expansion */ 341 332 u16 virtual_processor_id; 333 + u16 posted_intr_nv; 342 334 u16 guest_es_selector; 343 335 u16 guest_cs_selector; 344 336 u16 guest_ss_selector; ··· 348 338 u16 guest_gs_selector; 349 339 u16 guest_ldtr_selector; 350 340 u16 guest_tr_selector; 341 + u16 guest_intr_status; 351 342 u16 host_es_selector; 352 343 u16 host_cs_selector; 353 344 u16 host_ss_selector; ··· 412 401 */ 413 402 struct page *apic_access_page; 414 403 struct page *virtual_apic_page; 404 + struct page *pi_desc_page; 405 + struct pi_desc *pi_desc; 406 + bool pi_pending; 407 + u16 posted_intr_nv; 415 408 u64 msr_ia32_feature_control; 416 409 417 410 struct hrtimer preemption_timer; ··· 423 408 424 409 /* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */ 425 410 u64 vmcs01_debugctl; 411 + 412 + u32 nested_vmx_procbased_ctls_low; 413 + u32 nested_vmx_procbased_ctls_high; 414 + u32 nested_vmx_true_procbased_ctls_low; 415 + u32 nested_vmx_secondary_ctls_low; 416 + u32 nested_vmx_secondary_ctls_high; 417 + u32 
nested_vmx_pinbased_ctls_low; 418 + u32 nested_vmx_pinbased_ctls_high; 419 + u32 nested_vmx_exit_ctls_low; 420 + u32 nested_vmx_exit_ctls_high; 421 + u32 nested_vmx_true_exit_ctls_low; 422 + u32 nested_vmx_entry_ctls_low; 423 + u32 nested_vmx_entry_ctls_high; 424 + u32 nested_vmx_true_entry_ctls_low; 425 + u32 nested_vmx_misc_low; 426 + u32 nested_vmx_misc_high; 427 + u32 nested_vmx_ept_caps; 426 428 }; 427 429 428 430 #define POSTED_INTR_ON 0 ··· 543 511 /* Dynamic PLE window. */ 544 512 int ple_window; 545 513 bool ple_window_dirty; 514 + 515 + /* Support for PML */ 516 + #define PML_ENTITY_NUM 512 517 + struct page *pml_pg; 546 518 }; 547 519 548 520 enum segment_cache_field { ··· 630 594 631 595 static const unsigned short vmcs_field_to_offset_table[] = { 632 596 FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id), 597 + FIELD(POSTED_INTR_NV, posted_intr_nv), 633 598 FIELD(GUEST_ES_SELECTOR, guest_es_selector), 634 599 FIELD(GUEST_CS_SELECTOR, guest_cs_selector), 635 600 FIELD(GUEST_SS_SELECTOR, guest_ss_selector), ··· 639 602 FIELD(GUEST_GS_SELECTOR, guest_gs_selector), 640 603 FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector), 641 604 FIELD(GUEST_TR_SELECTOR, guest_tr_selector), 605 + FIELD(GUEST_INTR_STATUS, guest_intr_status), 642 606 FIELD(HOST_ES_SELECTOR, host_es_selector), 643 607 FIELD(HOST_CS_SELECTOR, host_cs_selector), 644 608 FIELD(HOST_SS_SELECTOR, host_ss_selector), ··· 656 618 FIELD64(TSC_OFFSET, tsc_offset), 657 619 FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr), 658 620 FIELD64(APIC_ACCESS_ADDR, apic_access_addr), 621 + FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr), 659 622 FIELD64(EPT_POINTER, ept_pointer), 623 + FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0), 624 + FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1), 625 + FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2), 626 + FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3), 660 627 FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap), 661 628 FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address), 662 629 
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer), ··· 809 766 static void kvm_cpu_vmxoff(void); 810 767 static bool vmx_mpx_supported(void); 811 768 static bool vmx_xsaves_supported(void); 769 + static int vmx_vm_has_apicv(struct kvm *kvm); 812 770 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); 813 771 static void vmx_set_segment(struct kvm_vcpu *vcpu, 814 772 struct kvm_segment *var, int seg); ··· 837 793 static unsigned long *vmx_msr_bitmap_longmode; 838 794 static unsigned long *vmx_msr_bitmap_legacy_x2apic; 839 795 static unsigned long *vmx_msr_bitmap_longmode_x2apic; 796 + static unsigned long *vmx_msr_bitmap_nested; 840 797 static unsigned long *vmx_vmread_bitmap; 841 798 static unsigned long *vmx_vmwrite_bitmap; 842 799 ··· 1004 959 return vmx_capability.ept & VMX_EPT_EXECUTE_ONLY_BIT; 1005 960 } 1006 961 1007 - static inline bool cpu_has_vmx_eptp_uncacheable(void) 1008 - { 1009 - return vmx_capability.ept & VMX_EPTP_UC_BIT; 1010 - } 1011 - 1012 - static inline bool cpu_has_vmx_eptp_writeback(void) 1013 - { 1014 - return vmx_capability.ept & VMX_EPTP_WB_BIT; 1015 - } 1016 - 1017 962 static inline bool cpu_has_vmx_ept_2m_page(void) 1018 963 { 1019 964 return vmx_capability.ept & VMX_EPT_2MB_PAGE_BIT; ··· 1108 1073 SECONDARY_EXEC_SHADOW_VMCS; 1109 1074 } 1110 1075 1076 + static inline bool cpu_has_vmx_pml(void) 1077 + { 1078 + return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_PML; 1079 + } 1080 + 1111 1081 static inline bool report_flexpriority(void) 1112 1082 { 1113 1083 return flexpriority_enabled; ··· 1150 1110 { 1151 1111 return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES) && 1152 1112 vmx_xsaves_supported(); 1113 + } 1114 + 1115 + static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12) 1116 + { 1117 + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); 1118 + } 1119 + 1120 + static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12) 1121 + { 1122 + return 
nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT); 1123 + } 1124 + 1125 + static inline bool nested_cpu_has_vid(struct vmcs12 *vmcs12) 1126 + { 1127 + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); 1128 + } 1129 + 1130 + static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12) 1131 + { 1132 + return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR; 1153 1133 } 1154 1134 1155 1135 static inline bool is_exception(u32 intr_info) ··· 2344 2284 * if the corresponding bit in the (32-bit) control field *must* be on, and a 2345 2285 * bit in the high half is on if the corresponding bit in the control field 2346 2286 * may be on. See also vmx_control_verify(). 2347 - * TODO: allow these variables to be modified (downgraded) by module options 2348 - * or other means. 2349 2287 */ 2350 - static u32 nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high; 2351 - static u32 nested_vmx_true_procbased_ctls_low; 2352 - static u32 nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high; 2353 - static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high; 2354 - static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high; 2355 - static u32 nested_vmx_true_exit_ctls_low; 2356 - static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high; 2357 - static u32 nested_vmx_true_entry_ctls_low; 2358 - static u32 nested_vmx_misc_low, nested_vmx_misc_high; 2359 - static u32 nested_vmx_ept_caps; 2360 - static __init void nested_vmx_setup_ctls_msrs(void) 2288 + static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) 2361 2289 { 2362 2290 /* 2363 2291 * Note that as a general rule, the high half of the MSRs (bits in ··· 2364 2316 2365 2317 /* pin-based controls */ 2366 2318 rdmsr(MSR_IA32_VMX_PINBASED_CTLS, 2367 - nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high); 2368 - nested_vmx_pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; 2369 - nested_vmx_pinbased_ctls_high &= 
PIN_BASED_EXT_INTR_MASK | 2370 - PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS; 2371 - nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | 2319 + vmx->nested.nested_vmx_pinbased_ctls_low, 2320 + vmx->nested.nested_vmx_pinbased_ctls_high); 2321 + vmx->nested.nested_vmx_pinbased_ctls_low |= 2322 + PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; 2323 + vmx->nested.nested_vmx_pinbased_ctls_high &= 2324 + PIN_BASED_EXT_INTR_MASK | 2325 + PIN_BASED_NMI_EXITING | 2326 + PIN_BASED_VIRTUAL_NMIS; 2327 + vmx->nested.nested_vmx_pinbased_ctls_high |= 2328 + PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | 2372 2329 PIN_BASED_VMX_PREEMPTION_TIMER; 2330 + if (vmx_vm_has_apicv(vmx->vcpu.kvm)) 2331 + vmx->nested.nested_vmx_pinbased_ctls_high |= 2332 + PIN_BASED_POSTED_INTR; 2373 2333 2374 2334 /* exit controls */ 2375 2335 rdmsr(MSR_IA32_VMX_EXIT_CTLS, 2376 - nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high); 2377 - nested_vmx_exit_ctls_low = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; 2336 + vmx->nested.nested_vmx_exit_ctls_low, 2337 + vmx->nested.nested_vmx_exit_ctls_high); 2338 + vmx->nested.nested_vmx_exit_ctls_low = 2339 + VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; 2378 2340 2379 - nested_vmx_exit_ctls_high &= 2341 + vmx->nested.nested_vmx_exit_ctls_high &= 2380 2342 #ifdef CONFIG_X86_64 2381 2343 VM_EXIT_HOST_ADDR_SPACE_SIZE | 2382 2344 #endif 2383 2345 VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; 2384 - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | 2346 + vmx->nested.nested_vmx_exit_ctls_high |= 2347 + VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | 2385 2348 VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER | 2386 2349 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT; 2387 2350 2388 2351 if (vmx_mpx_supported()) 2389 - nested_vmx_exit_ctls_high |= VM_EXIT_CLEAR_BNDCFGS; 2352 + vmx->nested.nested_vmx_exit_ctls_high |= VM_EXIT_CLEAR_BNDCFGS; 2390 2353 2391 2354 /* We support free control of debug control saving. 
*/ 2392 - nested_vmx_true_exit_ctls_low = nested_vmx_exit_ctls_low & 2355 + vmx->nested.nested_vmx_true_exit_ctls_low = 2356 + vmx->nested.nested_vmx_exit_ctls_low & 2393 2357 ~VM_EXIT_SAVE_DEBUG_CONTROLS; 2394 2358 2395 2359 /* entry controls */ 2396 2360 rdmsr(MSR_IA32_VMX_ENTRY_CTLS, 2397 - nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high); 2398 - nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; 2399 - nested_vmx_entry_ctls_high &= 2361 + vmx->nested.nested_vmx_entry_ctls_low, 2362 + vmx->nested.nested_vmx_entry_ctls_high); 2363 + vmx->nested.nested_vmx_entry_ctls_low = 2364 + VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; 2365 + vmx->nested.nested_vmx_entry_ctls_high &= 2400 2366 #ifdef CONFIG_X86_64 2401 2367 VM_ENTRY_IA32E_MODE | 2402 2368 #endif 2403 2369 VM_ENTRY_LOAD_IA32_PAT; 2404 - nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | 2405 - VM_ENTRY_LOAD_IA32_EFER); 2370 + vmx->nested.nested_vmx_entry_ctls_high |= 2371 + (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER); 2406 2372 if (vmx_mpx_supported()) 2407 - nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_BNDCFGS; 2373 + vmx->nested.nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_BNDCFGS; 2408 2374 2409 2375 /* We support free control of debug control loading. 
*/ 2410 - nested_vmx_true_entry_ctls_low = nested_vmx_entry_ctls_low & 2376 + vmx->nested.nested_vmx_true_entry_ctls_low = 2377 + vmx->nested.nested_vmx_entry_ctls_low & 2411 2378 ~VM_ENTRY_LOAD_DEBUG_CONTROLS; 2412 2379 2413 2380 /* cpu-based controls */ 2414 2381 rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, 2415 - nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); 2416 - nested_vmx_procbased_ctls_low = CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR; 2417 - nested_vmx_procbased_ctls_high &= 2382 + vmx->nested.nested_vmx_procbased_ctls_low, 2383 + vmx->nested.nested_vmx_procbased_ctls_high); 2384 + vmx->nested.nested_vmx_procbased_ctls_low = 2385 + CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR; 2386 + vmx->nested.nested_vmx_procbased_ctls_high &= 2418 2387 CPU_BASED_VIRTUAL_INTR_PENDING | 2419 2388 CPU_BASED_VIRTUAL_NMI_PENDING | CPU_BASED_USE_TSC_OFFSETING | 2420 2389 CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING | ··· 2451 2386 * can use it to avoid exits to L1 - even when L0 runs L2 2452 2387 * without MSR bitmaps. 2453 2388 */ 2454 - nested_vmx_procbased_ctls_high |= CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | 2389 + vmx->nested.nested_vmx_procbased_ctls_high |= 2390 + CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | 2455 2391 CPU_BASED_USE_MSR_BITMAPS; 2456 2392 2457 2393 /* We support free control of CR3 access interception. 
*/ 2458 - nested_vmx_true_procbased_ctls_low = nested_vmx_procbased_ctls_low & 2394 + vmx->nested.nested_vmx_true_procbased_ctls_low = 2395 + vmx->nested.nested_vmx_procbased_ctls_low & 2459 2396 ~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING); 2460 2397 2461 2398 /* secondary cpu-based controls */ 2462 2399 rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2, 2463 - nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high); 2464 - nested_vmx_secondary_ctls_low = 0; 2465 - nested_vmx_secondary_ctls_high &= 2400 + vmx->nested.nested_vmx_secondary_ctls_low, 2401 + vmx->nested.nested_vmx_secondary_ctls_high); 2402 + vmx->nested.nested_vmx_secondary_ctls_low = 0; 2403 + vmx->nested.nested_vmx_secondary_ctls_high &= 2466 2404 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 2405 + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | 2406 + SECONDARY_EXEC_APIC_REGISTER_VIRT | 2407 + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 2467 2408 SECONDARY_EXEC_WBINVD_EXITING | 2468 2409 SECONDARY_EXEC_XSAVES; 2469 2410 2470 2411 if (enable_ept) { 2471 2412 /* nested EPT: emulate EPT also to L1 */ 2472 - nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT | 2413 + vmx->nested.nested_vmx_secondary_ctls_high |= 2414 + SECONDARY_EXEC_ENABLE_EPT | 2473 2415 SECONDARY_EXEC_UNRESTRICTED_GUEST; 2474 - nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT | 2416 + vmx->nested.nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT | 2475 2417 VMX_EPTP_WB_BIT | VMX_EPT_2MB_PAGE_BIT | 2476 2418 VMX_EPT_INVEPT_BIT; 2477 - nested_vmx_ept_caps &= vmx_capability.ept; 2419 + vmx->nested.nested_vmx_ept_caps &= vmx_capability.ept; 2478 2420 /* 2479 2421 * For nested guests, we don't do anything specific 2480 2422 * for single context invalidation. Hence, only advertise 2481 2423 * support for global context invalidation. 
2482 2424 */ 2483 - nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT; 2425 + vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT; 2484 2426 } else 2485 - nested_vmx_ept_caps = 0; 2427 + vmx->nested.nested_vmx_ept_caps = 0; 2486 2428 2487 2429 /* miscellaneous data */ 2488 - rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high); 2489 - nested_vmx_misc_low &= VMX_MISC_SAVE_EFER_LMA; 2490 - nested_vmx_misc_low |= VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE | 2430 + rdmsr(MSR_IA32_VMX_MISC, 2431 + vmx->nested.nested_vmx_misc_low, 2432 + vmx->nested.nested_vmx_misc_high); 2433 + vmx->nested.nested_vmx_misc_low &= VMX_MISC_SAVE_EFER_LMA; 2434 + vmx->nested.nested_vmx_misc_low |= 2435 + VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE | 2491 2436 VMX_MISC_ACTIVITY_HLT; 2492 - nested_vmx_misc_high = 0; 2437 + vmx->nested.nested_vmx_misc_high = 0; 2493 2438 } 2494 2439 2495 2440 static inline bool vmx_control_verify(u32 control, u32 low, u32 high) ··· 2518 2443 /* Returns 0 on success, non-0 otherwise. 
*/ 2519 2444 static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata) 2520 2445 { 2446 + struct vcpu_vmx *vmx = to_vmx(vcpu); 2447 + 2521 2448 switch (msr_index) { 2522 2449 case MSR_IA32_VMX_BASIC: 2523 2450 /* ··· 2534 2457 break; 2535 2458 case MSR_IA32_VMX_TRUE_PINBASED_CTLS: 2536 2459 case MSR_IA32_VMX_PINBASED_CTLS: 2537 - *pdata = vmx_control_msr(nested_vmx_pinbased_ctls_low, 2538 - nested_vmx_pinbased_ctls_high); 2460 + *pdata = vmx_control_msr( 2461 + vmx->nested.nested_vmx_pinbased_ctls_low, 2462 + vmx->nested.nested_vmx_pinbased_ctls_high); 2539 2463 break; 2540 2464 case MSR_IA32_VMX_TRUE_PROCBASED_CTLS: 2541 - *pdata = vmx_control_msr(nested_vmx_true_procbased_ctls_low, 2542 - nested_vmx_procbased_ctls_high); 2465 + *pdata = vmx_control_msr( 2466 + vmx->nested.nested_vmx_true_procbased_ctls_low, 2467 + vmx->nested.nested_vmx_procbased_ctls_high); 2543 2468 break; 2544 2469 case MSR_IA32_VMX_PROCBASED_CTLS: 2545 - *pdata = vmx_control_msr(nested_vmx_procbased_ctls_low, 2546 - nested_vmx_procbased_ctls_high); 2470 + *pdata = vmx_control_msr( 2471 + vmx->nested.nested_vmx_procbased_ctls_low, 2472 + vmx->nested.nested_vmx_procbased_ctls_high); 2547 2473 break; 2548 2474 case MSR_IA32_VMX_TRUE_EXIT_CTLS: 2549 - *pdata = vmx_control_msr(nested_vmx_true_exit_ctls_low, 2550 - nested_vmx_exit_ctls_high); 2475 + *pdata = vmx_control_msr( 2476 + vmx->nested.nested_vmx_true_exit_ctls_low, 2477 + vmx->nested.nested_vmx_exit_ctls_high); 2551 2478 break; 2552 2479 case MSR_IA32_VMX_EXIT_CTLS: 2553 - *pdata = vmx_control_msr(nested_vmx_exit_ctls_low, 2554 - nested_vmx_exit_ctls_high); 2480 + *pdata = vmx_control_msr( 2481 + vmx->nested.nested_vmx_exit_ctls_low, 2482 + vmx->nested.nested_vmx_exit_ctls_high); 2555 2483 break; 2556 2484 case MSR_IA32_VMX_TRUE_ENTRY_CTLS: 2557 - *pdata = vmx_control_msr(nested_vmx_true_entry_ctls_low, 2558 - nested_vmx_entry_ctls_high); 2485 + *pdata = vmx_control_msr( 2486 + vmx->nested.nested_vmx_true_entry_ctls_low, 
2487 + vmx->nested.nested_vmx_entry_ctls_high); 2559 2488 break; 2560 2489 case MSR_IA32_VMX_ENTRY_CTLS: 2561 - *pdata = vmx_control_msr(nested_vmx_entry_ctls_low, 2562 - nested_vmx_entry_ctls_high); 2490 + *pdata = vmx_control_msr( 2491 + vmx->nested.nested_vmx_entry_ctls_low, 2492 + vmx->nested.nested_vmx_entry_ctls_high); 2563 2493 break; 2564 2494 case MSR_IA32_VMX_MISC: 2565 - *pdata = vmx_control_msr(nested_vmx_misc_low, 2566 - nested_vmx_misc_high); 2495 + *pdata = vmx_control_msr( 2496 + vmx->nested.nested_vmx_misc_low, 2497 + vmx->nested.nested_vmx_misc_high); 2567 2498 break; 2568 2499 /* 2569 2500 * These MSRs specify bits which the guest must keep fixed (on or off) ··· 2596 2511 *pdata = 0x2e; /* highest index: VMX_PREEMPTION_TIMER_VALUE */ 2597 2512 break; 2598 2513 case MSR_IA32_VMX_PROCBASED_CTLS2: 2599 - *pdata = vmx_control_msr(nested_vmx_secondary_ctls_low, 2600 - nested_vmx_secondary_ctls_high); 2514 + *pdata = vmx_control_msr( 2515 + vmx->nested.nested_vmx_secondary_ctls_low, 2516 + vmx->nested.nested_vmx_secondary_ctls_high); 2601 2517 break; 2602 2518 case MSR_IA32_VMX_EPT_VPID_CAP: 2603 2519 /* Currently, no nested vpid support */ 2604 - *pdata = nested_vmx_ept_caps; 2520 + *pdata = vmx->nested.nested_vmx_ept_caps; 2605 2521 break; 2606 2522 default: 2607 2523 return 1; ··· 3015 2929 SECONDARY_EXEC_APIC_REGISTER_VIRT | 3016 2930 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 3017 2931 SECONDARY_EXEC_SHADOW_VMCS | 3018 - SECONDARY_EXEC_XSAVES; 2932 + SECONDARY_EXEC_XSAVES | 2933 + SECONDARY_EXEC_ENABLE_PML; 3019 2934 if (adjust_vmx_controls(min2, opt2, 3020 2935 MSR_IA32_VMX_PROCBASED_CTLS2, 3021 2936 &_cpu_based_2nd_exec_control) < 0) ··· 4246 4159 } 4247 4160 } 4248 4161 4162 + /* 4163 + * If a msr is allowed by L0, we should check whether it is allowed by L1. 4164 + * The corresponding bit will be cleared unless both of L0 and L1 allow it. 
4165 + */ 4166 + static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1, 4167 + unsigned long *msr_bitmap_nested, 4168 + u32 msr, int type) 4169 + { 4170 + int f = sizeof(unsigned long); 4171 + 4172 + if (!cpu_has_vmx_msr_bitmap()) { 4173 + WARN_ON(1); 4174 + return; 4175 + } 4176 + 4177 + /* 4178 + * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals 4179 + * have the write-low and read-high bitmap offsets the wrong way round. 4180 + * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff. 4181 + */ 4182 + if (msr <= 0x1fff) { 4183 + if (type & MSR_TYPE_R && 4184 + !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) 4185 + /* read-low */ 4186 + __clear_bit(msr, msr_bitmap_nested + 0x000 / f); 4187 + 4188 + if (type & MSR_TYPE_W && 4189 + !test_bit(msr, msr_bitmap_l1 + 0x800 / f)) 4190 + /* write-low */ 4191 + __clear_bit(msr, msr_bitmap_nested + 0x800 / f); 4192 + 4193 + } else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) { 4194 + msr &= 0x1fff; 4195 + if (type & MSR_TYPE_R && 4196 + !test_bit(msr, msr_bitmap_l1 + 0x400 / f)) 4197 + /* read-high */ 4198 + __clear_bit(msr, msr_bitmap_nested + 0x400 / f); 4199 + 4200 + if (type & MSR_TYPE_W && 4201 + !test_bit(msr, msr_bitmap_l1 + 0xc00 / f)) 4202 + /* write-high */ 4203 + __clear_bit(msr, msr_bitmap_nested + 0xc00 / f); 4204 + 4205 + } 4206 + } 4207 + 4249 4208 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) 4250 4209 { 4251 4210 if (!longmode_only) ··· 4330 4197 return enable_apicv && irqchip_in_kernel(kvm); 4331 4198 } 4332 4199 4200 + static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) 4201 + { 4202 + struct vcpu_vmx *vmx = to_vmx(vcpu); 4203 + int max_irr; 4204 + void *vapic_page; 4205 + u16 status; 4206 + 4207 + if (vmx->nested.pi_desc && 4208 + vmx->nested.pi_pending) { 4209 + vmx->nested.pi_pending = false; 4210 + if (!pi_test_and_clear_on(vmx->nested.pi_desc)) 4211 + return 0; 4212 + 4213 + max_irr = find_last_bit( 4214 + 
(unsigned long *)vmx->nested.pi_desc->pir, 256); 4215 + 4216 + if (max_irr == 256) 4217 + return 0; 4218 + 4219 + vapic_page = kmap(vmx->nested.virtual_apic_page); 4220 + if (!vapic_page) { 4221 + WARN_ON(1); 4222 + return -ENOMEM; 4223 + } 4224 + __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page); 4225 + kunmap(vmx->nested.virtual_apic_page); 4226 + 4227 + status = vmcs_read16(GUEST_INTR_STATUS); 4228 + if ((u8)max_irr > ((u8)status & 0xff)) { 4229 + status &= ~0xff; 4230 + status |= (u8)max_irr; 4231 + vmcs_write16(GUEST_INTR_STATUS, status); 4232 + } 4233 + } 4234 + return 0; 4235 + } 4236 + 4237 + static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, 4238 + int vector) 4239 + { 4240 + struct vcpu_vmx *vmx = to_vmx(vcpu); 4241 + 4242 + if (is_guest_mode(vcpu) && 4243 + vector == vmx->nested.posted_intr_nv) { 4244 + /* the PIR and ON have been set by L1. */ 4245 + if (vcpu->mode == IN_GUEST_MODE) 4246 + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), 4247 + POSTED_INTR_VECTOR); 4248 + /* 4249 + * If a posted intr is not recognized by hardware, 4250 + * we will accomplish it in the next vmentry. 4251 + */ 4252 + vmx->nested.pi_pending = true; 4253 + kvm_make_request(KVM_REQ_EVENT, vcpu); 4254 + return 0; 4255 + } 4256 + return -1; 4257 + } 4333 4258 /* 4334 4259 * Send interrupt to vcpu via posted interrupt way. 4335 4260 * 1. 
If target vcpu is running(non-root mode), send posted interrupt ··· 4399 4208 { 4400 4209 struct vcpu_vmx *vmx = to_vmx(vcpu); 4401 4210 int r; 4211 + 4212 + r = vmx_deliver_nested_posted_interrupt(vcpu, vector); 4213 + if (!r) 4214 + return; 4402 4215 4403 4216 if (pi_test_and_set_pir(vector, &vmx->pi_desc)) 4404 4217 return; ··· 4555 4360 a current VMCS12 4556 4361 */ 4557 4362 exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS; 4363 + /* PML is enabled/disabled in creating/destorying vcpu */ 4364 + exec_control &= ~SECONDARY_EXEC_ENABLE_PML; 4365 + 4558 4366 return exec_control; 4559 4367 } 4560 4368 ··· 5184 4986 hypercall[2] = 0xc1; 5185 4987 } 5186 4988 5187 - static bool nested_cr0_valid(struct vmcs12 *vmcs12, unsigned long val) 4989 + static bool nested_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val) 5188 4990 { 5189 4991 unsigned long always_on = VMXON_CR0_ALWAYSON; 4992 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); 5190 4993 5191 - if (nested_vmx_secondary_ctls_high & 4994 + if (to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high & 5192 4995 SECONDARY_EXEC_UNRESTRICTED_GUEST && 5193 4996 nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST)) 5194 4997 always_on &= ~(X86_CR0_PE | X86_CR0_PG); ··· 5214 5015 val = (val & ~vmcs12->cr0_guest_host_mask) | 5215 5016 (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask); 5216 5017 5217 - if (!nested_cr0_valid(vmcs12, val)) 5018 + if (!nested_cr0_valid(vcpu, val)) 5218 5019 return 1; 5219 5020 5220 5021 if (kvm_set_cr0(vcpu, val)) ··· 6016 5817 (unsigned long *)__get_free_page(GFP_KERNEL); 6017 5818 if (!vmx_msr_bitmap_longmode_x2apic) 6018 5819 goto out4; 5820 + 5821 + if (nested) { 5822 + vmx_msr_bitmap_nested = 5823 + (unsigned long *)__get_free_page(GFP_KERNEL); 5824 + if (!vmx_msr_bitmap_nested) 5825 + goto out5; 5826 + } 5827 + 6019 5828 vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL); 6020 5829 if (!vmx_vmread_bitmap) 6021 - goto out5; 5830 + goto out6; 6022 5831 6023 5832 vmx_vmwrite_bitmap = 
(unsigned long *)__get_free_page(GFP_KERNEL); 6024 5833 if (!vmx_vmwrite_bitmap) 6025 - goto out6; 5834 + goto out7; 6026 5835 6027 5836 memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); 6028 5837 memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); ··· 6046 5839 6047 5840 memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); 6048 5841 memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); 5842 + if (nested) 5843 + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); 6049 5844 6050 5845 if (setup_vmcs_config(&vmcs_config) < 0) { 6051 5846 r = -EIO; 6052 - goto out7; 5847 + goto out8; 6053 5848 } 6054 5849 6055 5850 if (boot_cpu_has(X86_FEATURE_NX)) ··· 6077 5868 if (!cpu_has_vmx_unrestricted_guest()) 6078 5869 enable_unrestricted_guest = 0; 6079 5870 6080 - if (!cpu_has_vmx_flexpriority()) { 5871 + if (!cpu_has_vmx_flexpriority()) 6081 5872 flexpriority_enabled = 0; 6082 5873 6083 - /* 6084 - * set_apic_access_page_addr() is used to reload apic access 6085 - * page upon invalidation. No need to do anything if the 6086 - * processor does not have the APIC_ACCESS_ADDR VMCS field. 6087 - */ 5874 + /* 5875 + * set_apic_access_page_addr() is used to reload apic access 5876 + * page upon invalidation. No need to do anything if not 5877 + * using the APIC_ACCESS_ADDR VMCS field. 
5878 + */ 5879 + if (!flexpriority_enabled) 6088 5880 kvm_x86_ops->set_apic_access_page_addr = NULL; 6089 - } 6090 5881 6091 5882 if (!cpu_has_vmx_tpr_shadow()) 6092 5883 kvm_x86_ops->update_cr8_intercept = NULL; ··· 6104 5895 kvm_x86_ops->update_cr8_intercept = NULL; 6105 5896 else { 6106 5897 kvm_x86_ops->hwapic_irr_update = NULL; 5898 + kvm_x86_ops->hwapic_isr_update = NULL; 6107 5899 kvm_x86_ops->deliver_posted_interrupt = NULL; 6108 5900 kvm_x86_ops->sync_pir_to_irr = vmx_sync_pir_to_irr_dummy; 6109 5901 } 6110 - 6111 - if (nested) 6112 - nested_vmx_setup_ctls_msrs(); 6113 5902 6114 5903 vmx_disable_intercept_for_msr(MSR_FS_BASE, false); 6115 5904 vmx_disable_intercept_for_msr(MSR_GS_BASE, false); ··· 6152 5945 6153 5946 update_ple_window_actual_max(); 6154 5947 5948 + /* 5949 + * Only enable PML when hardware supports PML feature, and both EPT 5950 + * and EPT A/D bit features are enabled -- PML depends on them to work. 5951 + */ 5952 + if (!enable_ept || !enable_ept_ad_bits || !cpu_has_vmx_pml()) 5953 + enable_pml = 0; 5954 + 5955 + if (!enable_pml) { 5956 + kvm_x86_ops->slot_enable_log_dirty = NULL; 5957 + kvm_x86_ops->slot_disable_log_dirty = NULL; 5958 + kvm_x86_ops->flush_log_dirty = NULL; 5959 + kvm_x86_ops->enable_log_dirty_pt_masked = NULL; 5960 + } 5961 + 6155 5962 return alloc_kvm_area(); 6156 5963 6157 - out7: 5964 + out8: 6158 5965 free_page((unsigned long)vmx_vmwrite_bitmap); 6159 - out6: 5966 + out7: 6160 5967 free_page((unsigned long)vmx_vmread_bitmap); 5968 + out6: 5969 + if (nested) 5970 + free_page((unsigned long)vmx_msr_bitmap_nested); 6161 5971 out5: 6162 5972 free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic); 6163 5973 out4: ··· 6201 5977 free_page((unsigned long)vmx_io_bitmap_a); 6202 5978 free_page((unsigned long)vmx_vmwrite_bitmap); 6203 5979 free_page((unsigned long)vmx_vmread_bitmap); 5980 + if (nested) 5981 + free_page((unsigned long)vmx_msr_bitmap_nested); 6204 5982 6205 5983 free_kvm_area(); 6206 5984 } ··· 6367 6141 * We 
don't need to force a shadow sync because 6368 6142 * VM_INSTRUCTION_ERROR is not shadowed 6369 6143 */ 6144 + } 6145 + 6146 + static void nested_vmx_abort(struct kvm_vcpu *vcpu, u32 indicator) 6147 + { 6148 + /* TODO: not to reset guest simply here. */ 6149 + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); 6150 + pr_warn("kvm: nested vmx abort, indicator %d\n", indicator); 6370 6151 } 6371 6152 6372 6153 static enum hrtimer_restart vmx_preemption_timer_fn(struct hrtimer *timer) ··· 6665 6432 vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 6666 6433 vmcs_write64(VMCS_LINK_POINTER, -1ull); 6667 6434 } 6435 + vmx->nested.posted_intr_nv = -1; 6668 6436 kunmap(vmx->nested.current_vmcs12_page); 6669 6437 nested_release_page(vmx->nested.current_vmcs12_page); 6670 6438 vmx->nested.current_vmptr = -1ull; ··· 6693 6459 if (vmx->nested.virtual_apic_page) { 6694 6460 nested_release_page(vmx->nested.virtual_apic_page); 6695 6461 vmx->nested.virtual_apic_page = NULL; 6462 + } 6463 + if (vmx->nested.pi_desc_page) { 6464 + kunmap(vmx->nested.pi_desc_page); 6465 + nested_release_page(vmx->nested.pi_desc_page); 6466 + vmx->nested.pi_desc_page = NULL; 6467 + vmx->nested.pi_desc = NULL; 6696 6468 } 6697 6469 6698 6470 nested_free_all_saved_vmcss(vmx); ··· 7133 6893 /* Emulate the INVEPT instruction */ 7134 6894 static int handle_invept(struct kvm_vcpu *vcpu) 7135 6895 { 6896 + struct vcpu_vmx *vmx = to_vmx(vcpu); 7136 6897 u32 vmx_instruction_info, types; 7137 6898 unsigned long type; 7138 6899 gva_t gva; ··· 7142 6901 u64 eptp, gpa; 7143 6902 } operand; 7144 6903 7145 - if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) || 7146 - !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) { 6904 + if (!(vmx->nested.nested_vmx_secondary_ctls_high & 6905 + SECONDARY_EXEC_ENABLE_EPT) || 6906 + !(vmx->nested.nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) { 7147 6907 kvm_queue_exception(vcpu, UD_VECTOR); 7148 6908 return 1; 7149 6909 } ··· 7160 6918 vmx_instruction_info = 
vmcs_read32(VMX_INSTRUCTION_INFO); 7161 6919 type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf); 7162 6920 7163 - types = (nested_vmx_ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6; 6921 + types = (vmx->nested.nested_vmx_ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6; 7164 6922 7165 6923 if (!(types & (1UL << type))) { 7166 6924 nested_vmx_failValid(vcpu, ··· 7199 6957 static int handle_invvpid(struct kvm_vcpu *vcpu) 7200 6958 { 7201 6959 kvm_queue_exception(vcpu, UD_VECTOR); 6960 + return 1; 6961 + } 6962 + 6963 + static int handle_pml_full(struct kvm_vcpu *vcpu) 6964 + { 6965 + unsigned long exit_qualification; 6966 + 6967 + trace_kvm_pml_full(vcpu->vcpu_id); 6968 + 6969 + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); 6970 + 6971 + /* 6972 + * PML buffer FULL happened while executing iret from NMI, 6973 + * "blocked by NMI" bit has to be set before next VM entry. 6974 + */ 6975 + if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) && 6976 + cpu_has_virtual_nmis() && 6977 + (exit_qualification & INTR_INFO_UNBLOCK_NMI)) 6978 + vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, 6979 + GUEST_INTR_STATE_NMI); 6980 + 6981 + /* 6982 + * PML buffer already flushed at beginning of VMEXIT. Nothing to do 6983 + * here.., and there's no userspace involvement needed for PML. 6984 + */ 7202 6985 return 1; 7203 6986 } 7204 6987 ··· 7275 7008 [EXIT_REASON_INVVPID] = handle_invvpid, 7276 7009 [EXIT_REASON_XSAVES] = handle_xsaves, 7277 7010 [EXIT_REASON_XRSTORS] = handle_xrstors, 7011 + [EXIT_REASON_PML_FULL] = handle_pml_full, 7278 7012 }; 7279 7013 7280 7014 static const int kvm_vmx_max_exit_handlers = ··· 7543 7275 case EXIT_REASON_APIC_ACCESS: 7544 7276 return nested_cpu_has2(vmcs12, 7545 7277 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); 7278 + case EXIT_REASON_APIC_WRITE: 7279 + case EXIT_REASON_EOI_INDUCED: 7280 + /* apic_write and eoi_induced should exit unconditionally. 
*/ 7281 + return 1; 7546 7282 case EXIT_REASON_EPT_VIOLATION: 7547 7283 /* 7548 7284 * L0 always deals with the EPT violation. If nested EPT is ··· 7586 7314 *info2 = vmcs_read32(VM_EXIT_INTR_INFO); 7587 7315 } 7588 7316 7317 + static int vmx_enable_pml(struct vcpu_vmx *vmx) 7318 + { 7319 + struct page *pml_pg; 7320 + u32 exec_control; 7321 + 7322 + pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO); 7323 + if (!pml_pg) 7324 + return -ENOMEM; 7325 + 7326 + vmx->pml_pg = pml_pg; 7327 + 7328 + vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg)); 7329 + vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1); 7330 + 7331 + exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 7332 + exec_control |= SECONDARY_EXEC_ENABLE_PML; 7333 + vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 7334 + 7335 + return 0; 7336 + } 7337 + 7338 + static void vmx_disable_pml(struct vcpu_vmx *vmx) 7339 + { 7340 + u32 exec_control; 7341 + 7342 + ASSERT(vmx->pml_pg); 7343 + __free_page(vmx->pml_pg); 7344 + vmx->pml_pg = NULL; 7345 + 7346 + exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 7347 + exec_control &= ~SECONDARY_EXEC_ENABLE_PML; 7348 + vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 7349 + } 7350 + 7351 + static void vmx_flush_pml_buffer(struct vcpu_vmx *vmx) 7352 + { 7353 + struct kvm *kvm = vmx->vcpu.kvm; 7354 + u64 *pml_buf; 7355 + u16 pml_idx; 7356 + 7357 + pml_idx = vmcs_read16(GUEST_PML_INDEX); 7358 + 7359 + /* Do nothing if PML buffer is empty */ 7360 + if (pml_idx == (PML_ENTITY_NUM - 1)) 7361 + return; 7362 + 7363 + /* PML index always points to next available PML buffer entity */ 7364 + if (pml_idx >= PML_ENTITY_NUM) 7365 + pml_idx = 0; 7366 + else 7367 + pml_idx++; 7368 + 7369 + pml_buf = page_address(vmx->pml_pg); 7370 + for (; pml_idx < PML_ENTITY_NUM; pml_idx++) { 7371 + u64 gpa; 7372 + 7373 + gpa = pml_buf[pml_idx]; 7374 + WARN_ON(gpa & (PAGE_SIZE - 1)); 7375 + mark_page_dirty(kvm, gpa >> PAGE_SHIFT); 7376 + } 7377 + 7378 + /* reset PML index */ 7379 + 
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1); 7380 + } 7381 + 7382 + /* 7383 + * Flush all vcpus' PML buffer and update logged GPAs to dirty_bitmap. 7384 + * Called before reporting dirty_bitmap to userspace. 7385 + */ 7386 + static void kvm_flush_pml_buffers(struct kvm *kvm) 7387 + { 7388 + int i; 7389 + struct kvm_vcpu *vcpu; 7390 + /* 7391 + * We only need to kick vcpu out of guest mode here, as PML buffer 7392 + * is flushed at beginning of all VMEXITs, and it's obvious that only 7393 + * vcpus running in guest are possible to have unflushed GPAs in PML 7394 + * buffer. 7395 + */ 7396 + kvm_for_each_vcpu(i, vcpu, kvm) 7397 + kvm_vcpu_kick(vcpu); 7398 + } 7399 + 7589 7400 /* 7590 7401 * The guest has exited. See if we can fix it or if we need userspace 7591 7402 * assistance. ··· 7678 7323 struct vcpu_vmx *vmx = to_vmx(vcpu); 7679 7324 u32 exit_reason = vmx->exit_reason; 7680 7325 u32 vectoring_info = vmx->idt_vectoring_info; 7326 + 7327 + /* 7328 + * Flush logged GPAs PML buffer, this will make dirty_bitmap more 7329 + * updated. Another good is, in kvm_vm_ioctl_get_dirty_log, before 7330 + * querying dirty_bitmap, we only need to kick all vcpus out of guest 7331 + * mode as if vcpus is in root mode, the PML buffer must has been 7332 + * flushed already. 
7333 + */ 7334 + if (enable_pml) 7335 + vmx_flush_pml_buffer(vmx); 7681 7336 7682 7337 /* If guest state is invalid, start emulating */ 7683 7338 if (vmx->emulation_required) ··· 7835 7470 { 7836 7471 u16 status; 7837 7472 u8 old; 7838 - 7839 - if (!vmx_vm_has_apicv(kvm)) 7840 - return; 7841 7473 7842 7474 if (isr == -1) 7843 7475 isr = 0; ··· 8335 7973 { 8336 7974 struct vcpu_vmx *vmx = to_vmx(vcpu); 8337 7975 7976 + if (enable_pml) 7977 + vmx_disable_pml(vmx); 8338 7978 free_vpid(vmx); 8339 7979 leave_guest_mode(vcpu); 8340 7980 vmx_load_vmcs01(vcpu); ··· 8404 8040 goto free_vmcs; 8405 8041 } 8406 8042 8043 + if (nested) 8044 + nested_vmx_setup_ctls_msrs(vmx); 8045 + 8046 + vmx->nested.posted_intr_nv = -1; 8407 8047 vmx->nested.current_vmptr = -1ull; 8408 8048 vmx->nested.current_vmcs12 = NULL; 8049 + 8050 + /* 8051 + * If PML is turned on, failure on enabling PML just results in failure 8052 + * of creating the vcpu, therefore we can simplify PML logic (by 8053 + * avoiding dealing with cases, such as enabling PML partially on vcpus 8054 + * for the guest, etc. 
8055 + */ 8056 + if (enable_pml) { 8057 + err = vmx_enable_pml(vmx); 8058 + if (err) 8059 + goto free_vmcs; 8060 + } 8409 8061 8410 8062 return &vmx->vcpu; 8411 8063 ··· 8564 8184 8565 8185 static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) 8566 8186 { 8567 - kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu, 8568 - nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT); 8569 - 8187 + WARN_ON(mmu_is_nested(vcpu)); 8188 + kvm_init_shadow_ept_mmu(vcpu, 8189 + to_vmx(vcpu)->nested.nested_vmx_ept_caps & 8190 + VMX_EPT_EXECUTE_ONLY_BIT); 8570 8191 vcpu->arch.mmu.set_cr3 = vmx_set_cr3; 8571 8192 vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3; 8572 8193 vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault; ··· 8580 8199 vcpu->arch.walk_mmu = &vcpu->arch.mmu; 8581 8200 } 8582 8201 8202 + static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, 8203 + u16 error_code) 8204 + { 8205 + bool inequality, bit; 8206 + 8207 + bit = (vmcs12->exception_bitmap & (1u << PF_VECTOR)) != 0; 8208 + inequality = 8209 + (error_code & vmcs12->page_fault_error_code_mask) != 8210 + vmcs12->page_fault_error_code_match; 8211 + return inequality ^ bit; 8212 + } 8213 + 8583 8214 static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu, 8584 8215 struct x86_exception *fault) 8585 8216 { ··· 8599 8206 8600 8207 WARN_ON(!is_guest_mode(vcpu)); 8601 8208 8602 - /* TODO: also check PFEC_MATCH/MASK, not just EB.PF. 
*/ 8603 - if (vmcs12->exception_bitmap & (1u << PF_VECTOR)) 8209 + if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code)) 8604 8210 nested_vmx_vmexit(vcpu, to_vmx(vcpu)->exit_reason, 8605 8211 vmcs_read32(VM_EXIT_INTR_INFO), 8606 8212 vmcs_readl(EXIT_QUALIFICATION)); ··· 8653 8261 return false; 8654 8262 } 8655 8263 8264 + if (nested_cpu_has_posted_intr(vmcs12)) { 8265 + if (!IS_ALIGNED(vmcs12->posted_intr_desc_addr, 64)) 8266 + return false; 8267 + 8268 + if (vmx->nested.pi_desc_page) { /* shouldn't happen */ 8269 + kunmap(vmx->nested.pi_desc_page); 8270 + nested_release_page(vmx->nested.pi_desc_page); 8271 + } 8272 + vmx->nested.pi_desc_page = 8273 + nested_get_page(vcpu, vmcs12->posted_intr_desc_addr); 8274 + if (!vmx->nested.pi_desc_page) 8275 + return false; 8276 + 8277 + vmx->nested.pi_desc = 8278 + (struct pi_desc *)kmap(vmx->nested.pi_desc_page); 8279 + if (!vmx->nested.pi_desc) { 8280 + nested_release_page_clean(vmx->nested.pi_desc_page); 8281 + return false; 8282 + } 8283 + vmx->nested.pi_desc = 8284 + (struct pi_desc *)((void *)vmx->nested.pi_desc + 8285 + (unsigned long)(vmcs12->posted_intr_desc_addr & 8286 + (PAGE_SIZE - 1))); 8287 + } 8288 + 8656 8289 return true; 8657 8290 } 8658 8291 ··· 8701 8284 do_div(preemption_timeout, vcpu->arch.virtual_tsc_khz); 8702 8285 hrtimer_start(&vmx->nested.preemption_timer, 8703 8286 ns_to_ktime(preemption_timeout), HRTIMER_MODE_REL); 8287 + } 8288 + 8289 + static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, 8290 + struct vmcs12 *vmcs12) 8291 + { 8292 + int maxphyaddr; 8293 + u64 addr; 8294 + 8295 + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS)) 8296 + return 0; 8297 + 8298 + if (vmcs12_read_any(vcpu, MSR_BITMAP, &addr)) { 8299 + WARN_ON(1); 8300 + return -EINVAL; 8301 + } 8302 + maxphyaddr = cpuid_maxphyaddr(vcpu); 8303 + 8304 + if (!PAGE_ALIGNED(vmcs12->msr_bitmap) || 8305 + ((addr + PAGE_SIZE) >> maxphyaddr)) 8306 + return -EINVAL; 8307 + 8308 + return 0; 8309 + } 8310 + 8311 + 
/* 8312 + * Merge L0's and L1's MSR bitmap, return false to indicate that 8313 + * we do not use the hardware. 8314 + */ 8315 + static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, 8316 + struct vmcs12 *vmcs12) 8317 + { 8318 + int msr; 8319 + struct page *page; 8320 + unsigned long *msr_bitmap; 8321 + 8322 + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) 8323 + return false; 8324 + 8325 + page = nested_get_page(vcpu, vmcs12->msr_bitmap); 8326 + if (!page) { 8327 + WARN_ON(1); 8328 + return false; 8329 + } 8330 + msr_bitmap = (unsigned long *)kmap(page); 8331 + if (!msr_bitmap) { 8332 + nested_release_page_clean(page); 8333 + WARN_ON(1); 8334 + return false; 8335 + } 8336 + 8337 + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) { 8338 + if (nested_cpu_has_apic_reg_virt(vmcs12)) 8339 + for (msr = 0x800; msr <= 0x8ff; msr++) 8340 + nested_vmx_disable_intercept_for_msr( 8341 + msr_bitmap, 8342 + vmx_msr_bitmap_nested, 8343 + msr, MSR_TYPE_R); 8344 + /* TPR is allowed */ 8345 + nested_vmx_disable_intercept_for_msr(msr_bitmap, 8346 + vmx_msr_bitmap_nested, 8347 + APIC_BASE_MSR + (APIC_TASKPRI >> 4), 8348 + MSR_TYPE_R | MSR_TYPE_W); 8349 + if (nested_cpu_has_vid(vmcs12)) { 8350 + /* EOI and self-IPI are allowed */ 8351 + nested_vmx_disable_intercept_for_msr( 8352 + msr_bitmap, 8353 + vmx_msr_bitmap_nested, 8354 + APIC_BASE_MSR + (APIC_EOI >> 4), 8355 + MSR_TYPE_W); 8356 + nested_vmx_disable_intercept_for_msr( 8357 + msr_bitmap, 8358 + vmx_msr_bitmap_nested, 8359 + APIC_BASE_MSR + (APIC_SELF_IPI >> 4), 8360 + MSR_TYPE_W); 8361 + } 8362 + } else { 8363 + /* 8364 + * Enable reading intercept of all the x2apic 8365 + * MSRs. We should not rely on vmcs12 to do any 8366 + * optimizations here, it may have been modified 8367 + * by L1. 
8368 + */ 8369 + for (msr = 0x800; msr <= 0x8ff; msr++) 8370 + __vmx_enable_intercept_for_msr( 8371 + vmx_msr_bitmap_nested, 8372 + msr, 8373 + MSR_TYPE_R); 8374 + 8375 + __vmx_enable_intercept_for_msr( 8376 + vmx_msr_bitmap_nested, 8377 + APIC_BASE_MSR + (APIC_TASKPRI >> 4), 8378 + MSR_TYPE_W); 8379 + __vmx_enable_intercept_for_msr( 8380 + vmx_msr_bitmap_nested, 8381 + APIC_BASE_MSR + (APIC_EOI >> 4), 8382 + MSR_TYPE_W); 8383 + __vmx_enable_intercept_for_msr( 8384 + vmx_msr_bitmap_nested, 8385 + APIC_BASE_MSR + (APIC_SELF_IPI >> 4), 8386 + MSR_TYPE_W); 8387 + } 8388 + kunmap(page); 8389 + nested_release_page_clean(page); 8390 + 8391 + return true; 8392 + } 8393 + 8394 + static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, 8395 + struct vmcs12 *vmcs12) 8396 + { 8397 + if (!nested_cpu_has_virt_x2apic_mode(vmcs12) && 8398 + !nested_cpu_has_apic_reg_virt(vmcs12) && 8399 + !nested_cpu_has_vid(vmcs12) && 8400 + !nested_cpu_has_posted_intr(vmcs12)) 8401 + return 0; 8402 + 8403 + /* 8404 + * If virtualize x2apic mode is enabled, 8405 + * virtualize apic access must be disabled. 8406 + */ 8407 + if (nested_cpu_has_virt_x2apic_mode(vmcs12) && 8408 + nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) 8409 + return -EINVAL; 8410 + 8411 + /* 8412 + * If virtual interrupt delivery is enabled, 8413 + * we must exit on external interrupts. 8414 + */ 8415 + if (nested_cpu_has_vid(vmcs12) && 8416 + !nested_exit_on_intr(vcpu)) 8417 + return -EINVAL; 8418 + 8419 + /* 8420 + * bits 15:8 should be zero in posted_intr_nv, 8421 + * the descriptor address has been already checked 8422 + * in nested_get_vmcs12_pages. 8423 + */ 8424 + if (nested_cpu_has_posted_intr(vmcs12) && 8425 + (!nested_cpu_has_vid(vmcs12) || 8426 + !nested_exit_intr_ack_set(vcpu) || 8427 + vmcs12->posted_intr_nv & 0xff00)) 8428 + return -EINVAL; 8429 + 8430 + /* tpr shadow is needed by all apicv features. 
*/ 8431 + if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) 8432 + return -EINVAL; 8433 + 8434 + return 0; 8435 + } 8436 + 8437 + static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, 8438 + unsigned long count_field, 8439 + unsigned long addr_field, 8440 + int maxphyaddr) 8441 + { 8442 + u64 count, addr; 8443 + 8444 + if (vmcs12_read_any(vcpu, count_field, &count) || 8445 + vmcs12_read_any(vcpu, addr_field, &addr)) { 8446 + WARN_ON(1); 8447 + return -EINVAL; 8448 + } 8449 + if (count == 0) 8450 + return 0; 8451 + if (!IS_ALIGNED(addr, 16) || addr >> maxphyaddr || 8452 + (addr + count * sizeof(struct vmx_msr_entry) - 1) >> maxphyaddr) { 8453 + pr_warn_ratelimited( 8454 + "nVMX: invalid MSR switch (0x%lx, %d, %llu, 0x%08llx)", 8455 + addr_field, maxphyaddr, count, addr); 8456 + return -EINVAL; 8457 + } 8458 + return 0; 8459 + } 8460 + 8461 + static int nested_vmx_check_msr_switch_controls(struct kvm_vcpu *vcpu, 8462 + struct vmcs12 *vmcs12) 8463 + { 8464 + int maxphyaddr; 8465 + 8466 + if (vmcs12->vm_exit_msr_load_count == 0 && 8467 + vmcs12->vm_exit_msr_store_count == 0 && 8468 + vmcs12->vm_entry_msr_load_count == 0) 8469 + return 0; /* Fast path */ 8470 + maxphyaddr = cpuid_maxphyaddr(vcpu); 8471 + if (nested_vmx_check_msr_switch(vcpu, VM_EXIT_MSR_LOAD_COUNT, 8472 + VM_EXIT_MSR_LOAD_ADDR, maxphyaddr) || 8473 + nested_vmx_check_msr_switch(vcpu, VM_EXIT_MSR_STORE_COUNT, 8474 + VM_EXIT_MSR_STORE_ADDR, maxphyaddr) || 8475 + nested_vmx_check_msr_switch(vcpu, VM_ENTRY_MSR_LOAD_COUNT, 8476 + VM_ENTRY_MSR_LOAD_ADDR, maxphyaddr)) 8477 + return -EINVAL; 8478 + return 0; 8479 + } 8480 + 8481 + static int nested_vmx_msr_check_common(struct kvm_vcpu *vcpu, 8482 + struct vmx_msr_entry *e) 8483 + { 8484 + /* x2APIC MSR accesses are not allowed */ 8485 + if (apic_x2apic_mode(vcpu->arch.apic) && e->index >> 8 == 0x8) 8486 + return -EINVAL; 8487 + if (e->index == MSR_IA32_UCODE_WRITE || /* SDM Table 35-2 */ 8488 + e->index == MSR_IA32_UCODE_REV) 8489 + return -EINVAL; 8490 
+ if (e->reserved != 0) 8491 + return -EINVAL; 8492 + return 0; 8493 + } 8494 + 8495 + static int nested_vmx_load_msr_check(struct kvm_vcpu *vcpu, 8496 + struct vmx_msr_entry *e) 8497 + { 8498 + if (e->index == MSR_FS_BASE || 8499 + e->index == MSR_GS_BASE || 8500 + e->index == MSR_IA32_SMM_MONITOR_CTL || /* SMM is not supported */ 8501 + nested_vmx_msr_check_common(vcpu, e)) 8502 + return -EINVAL; 8503 + return 0; 8504 + } 8505 + 8506 + static int nested_vmx_store_msr_check(struct kvm_vcpu *vcpu, 8507 + struct vmx_msr_entry *e) 8508 + { 8509 + if (e->index == MSR_IA32_SMBASE || /* SMM is not supported */ 8510 + nested_vmx_msr_check_common(vcpu, e)) 8511 + return -EINVAL; 8512 + return 0; 8513 + } 8514 + 8515 + /* 8516 + * Load guest's/host's msr at nested entry/exit. 8517 + * return 0 for success, entry index for failure. 8518 + */ 8519 + static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count) 8520 + { 8521 + u32 i; 8522 + struct vmx_msr_entry e; 8523 + struct msr_data msr; 8524 + 8525 + msr.host_initiated = false; 8526 + for (i = 0; i < count; i++) { 8527 + if (kvm_read_guest(vcpu->kvm, gpa + i * sizeof(e), 8528 + &e, sizeof(e))) { 8529 + pr_warn_ratelimited( 8530 + "%s cannot read MSR entry (%u, 0x%08llx)\n", 8531 + __func__, i, gpa + i * sizeof(e)); 8532 + goto fail; 8533 + } 8534 + if (nested_vmx_load_msr_check(vcpu, &e)) { 8535 + pr_warn_ratelimited( 8536 + "%s check failed (%u, 0x%x, 0x%x)\n", 8537 + __func__, i, e.index, e.reserved); 8538 + goto fail; 8539 + } 8540 + msr.index = e.index; 8541 + msr.data = e.value; 8542 + if (kvm_set_msr(vcpu, &msr)) { 8543 + pr_warn_ratelimited( 8544 + "%s cannot write MSR (%u, 0x%x, 0x%llx)\n", 8545 + __func__, i, e.index, e.value); 8546 + goto fail; 8547 + } 8548 + } 8549 + return 0; 8550 + fail: 8551 + return i + 1; 8552 + } 8553 + 8554 + static int nested_vmx_store_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count) 8555 + { 8556 + u32 i; 8557 + struct vmx_msr_entry e; 8558 + 8559 + for (i = 0; i < count; 
i++) { 8560 + if (kvm_read_guest(vcpu->kvm, 8561 + gpa + i * sizeof(e), 8562 + &e, 2 * sizeof(u32))) { 8563 + pr_warn_ratelimited( 8564 + "%s cannot read MSR entry (%u, 0x%08llx)\n", 8565 + __func__, i, gpa + i * sizeof(e)); 8566 + return -EINVAL; 8567 + } 8568 + if (nested_vmx_store_msr_check(vcpu, &e)) { 8569 + pr_warn_ratelimited( 8570 + "%s check failed (%u, 0x%x, 0x%x)\n", 8571 + __func__, i, e.index, e.reserved); 8572 + return -EINVAL; 8573 + } 8574 + if (kvm_get_msr(vcpu, e.index, &e.value)) { 8575 + pr_warn_ratelimited( 8576 + "%s cannot read MSR (%u, 0x%x)\n", 8577 + __func__, i, e.index); 8578 + return -EINVAL; 8579 + } 8580 + if (kvm_write_guest(vcpu->kvm, 8581 + gpa + i * sizeof(e) + 8582 + offsetof(struct vmx_msr_entry, value), 8583 + &e.value, sizeof(e.value))) { 8584 + pr_warn_ratelimited( 8585 + "%s cannot write MSR (%u, 0x%x, 0x%llx)\n", 8586 + __func__, i, e.index, e.value); 8587 + return -EINVAL; 8588 + } 8589 + } 8590 + return 0; 8704 8591 } 8705 8592 8706 8593 /* ··· 9086 8365 9087 8366 exec_control = vmcs12->pin_based_vm_exec_control; 9088 8367 exec_control |= vmcs_config.pin_based_exec_ctrl; 9089 - exec_control &= ~(PIN_BASED_VMX_PREEMPTION_TIMER | 9090 - PIN_BASED_POSTED_INTR); 8368 + exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER; 8369 + 8370 + if (nested_cpu_has_posted_intr(vmcs12)) { 8371 + /* 8372 + * Note that we use L0's vector here and in 8373 + * vmx_deliver_nested_posted_interrupt. 
8374 + */ 8375 + vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv; 8376 + vmx->nested.pi_pending = false; 8377 + vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); 8378 + vmcs_write64(POSTED_INTR_DESC_ADDR, 8379 + page_to_phys(vmx->nested.pi_desc_page) + 8380 + (unsigned long)(vmcs12->posted_intr_desc_addr & 8381 + (PAGE_SIZE - 1))); 8382 + } else 8383 + exec_control &= ~PIN_BASED_POSTED_INTR; 8384 + 9091 8385 vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control); 9092 8386 9093 8387 vmx->nested.preemption_timer_expired = false; ··· 9159 8423 else 9160 8424 vmcs_write64(APIC_ACCESS_ADDR, 9161 8425 page_to_phys(vmx->nested.apic_access_page)); 9162 - } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { 8426 + } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) && 8427 + (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) { 9163 8428 exec_control |= 9164 8429 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; 9165 8430 kvm_vcpu_reload_apic_access_page(vcpu); 8431 + } 8432 + 8433 + if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) { 8434 + vmcs_write64(EOI_EXIT_BITMAP0, 8435 + vmcs12->eoi_exit_bitmap0); 8436 + vmcs_write64(EOI_EXIT_BITMAP1, 8437 + vmcs12->eoi_exit_bitmap1); 8438 + vmcs_write64(EOI_EXIT_BITMAP2, 8439 + vmcs12->eoi_exit_bitmap2); 8440 + vmcs_write64(EOI_EXIT_BITMAP3, 8441 + vmcs12->eoi_exit_bitmap3); 8442 + vmcs_write16(GUEST_INTR_STATUS, 8443 + vmcs12->guest_intr_status); 9166 8444 } 9167 8445 9168 8446 vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); ··· 9212 8462 vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold); 9213 8463 } 9214 8464 8465 + if (cpu_has_vmx_msr_bitmap() && 8466 + exec_control & CPU_BASED_USE_MSR_BITMAPS && 8467 + nested_vmx_merge_msr_bitmap(vcpu, vmcs12)) { 8468 + vmcs_write64(MSR_BITMAP, __pa(vmx_msr_bitmap_nested)); 8469 + } else 8470 + exec_control &= ~CPU_BASED_USE_MSR_BITMAPS; 8471 + 9215 8472 /* 9216 - * Merging of IO and MSR bitmaps not currently supported. 8473 + * Merging of IO bitmap not currently supported. 
9217 8474 * Rather, exit every time. 9218 8475 */ 9219 - exec_control &= ~CPU_BASED_USE_MSR_BITMAPS; 9220 8476 exec_control &= ~CPU_BASED_USE_IO_BITMAPS; 9221 8477 exec_control |= CPU_BASED_UNCOND_IO_EXITING; 9222 8478 ··· 9338 8582 int cpu; 9339 8583 struct loaded_vmcs *vmcs02; 9340 8584 bool ia32e; 8585 + u32 msr_entry_idx; 9341 8586 9342 8587 if (!nested_vmx_check_permission(vcpu) || 9343 8588 !nested_vmx_check_vmcs12(vcpu)) ··· 9373 8616 return 1; 9374 8617 } 9375 8618 9376 - if ((vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_MSR_BITMAPS) && 9377 - !PAGE_ALIGNED(vmcs12->msr_bitmap)) { 9378 - /*TODO: Also verify bits beyond physical address width are 0*/ 9379 - nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); 9380 - return 1; 9381 - } 9382 - 9383 8619 if (!nested_get_vmcs12_pages(vcpu, vmcs12)) { 9384 8620 /*TODO: Also verify bits beyond physical address width are 0*/ 9385 8621 nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); 9386 8622 return 1; 9387 8623 } 9388 8624 9389 - if (vmcs12->vm_entry_msr_load_count > 0 || 9390 - vmcs12->vm_exit_msr_load_count > 0 || 9391 - vmcs12->vm_exit_msr_store_count > 0) { 9392 - pr_warn_ratelimited("%s: VMCS MSR_{LOAD,STORE} unsupported\n", 9393 - __func__); 8625 + if (nested_vmx_check_msr_bitmap_controls(vcpu, vmcs12)) { 8626 + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); 8627 + return 1; 8628 + } 8629 + 8630 + if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) { 8631 + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); 8632 + return 1; 8633 + } 8634 + 8635 + if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) { 9394 8636 nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); 9395 8637 return 1; 9396 8638 } 9397 8639 9398 8640 if (!vmx_control_verify(vmcs12->cpu_based_vm_exec_control, 9399 - nested_vmx_true_procbased_ctls_low, 9400 - nested_vmx_procbased_ctls_high) || 8641 + vmx->nested.nested_vmx_true_procbased_ctls_low, 8642 + 
 				  vmx->nested.nested_vmx_procbased_ctls_high) ||
 	    !vmx_control_verify(vmcs12->secondary_vm_exec_control,
-				nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high) ||
+				vmx->nested.nested_vmx_secondary_ctls_low,
+				vmx->nested.nested_vmx_secondary_ctls_high) ||
 	    !vmx_control_verify(vmcs12->pin_based_vm_exec_control,
-				nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high) ||
+				vmx->nested.nested_vmx_pinbased_ctls_low,
+				vmx->nested.nested_vmx_pinbased_ctls_high) ||
 	    !vmx_control_verify(vmcs12->vm_exit_controls,
-				nested_vmx_true_exit_ctls_low,
-				nested_vmx_exit_ctls_high) ||
+				vmx->nested.nested_vmx_true_exit_ctls_low,
+				vmx->nested.nested_vmx_exit_ctls_high) ||
 	    !vmx_control_verify(vmcs12->vm_entry_controls,
-				nested_vmx_true_entry_ctls_low,
-				nested_vmx_entry_ctls_high))
+				vmx->nested.nested_vmx_true_entry_ctls_low,
+				vmx->nested.nested_vmx_entry_ctls_high))
 	{
 		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 		return 1;
···
 		return 1;
 	}
 
-	if (!nested_cr0_valid(vmcs12, vmcs12->guest_cr0) ||
+	if (!nested_cr0_valid(vcpu, vmcs12->guest_cr0) ||
 	    ((vmcs12->guest_cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)) {
 		nested_vmx_entry_failure(vcpu, vmcs12,
 			EXIT_REASON_INVALID_STATE, ENTRY_FAIL_DEFAULT);
···
 
 	vmx_segment_cache_clear(vmx);
 
-	vmcs12->launch_state = 1;
-
 	prepare_vmcs02(vcpu, vmcs12);
+
+	msr_entry_idx = nested_vmx_load_msr(vcpu,
+					    vmcs12->vm_entry_msr_load_addr,
+					    vmcs12->vm_entry_msr_load_count);
+	if (msr_entry_idx) {
+		leave_guest_mode(vcpu);
+		vmx_load_vmcs01(vcpu);
+		nested_vmx_entry_failure(vcpu, vmcs12,
+				EXIT_REASON_MSR_LOAD_FAIL, msr_entry_idx);
+		return 1;
+	}
+
+	vmcs12->launch_state = 1;
 
 	if (vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT)
 		return kvm_emulate_halt(vcpu);
···
 	if (vmx->nested.nested_run_pending)
 		return -EBUSY;
 	nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
+	return 0;
 	}
 
-	return 0;
+	return vmx_complete_nested_posted_interrupt(vcpu);
 }
 
 static u32 vmx_get_preemption_timer_value(struct kvm_vcpu *vcpu)
···
 		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
 		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
 	}
+
+	if (nested_cpu_has_vid(vmcs12))
+		vmcs12->guest_intr_status = vmcs_read16(GUEST_INTR_STATUS);
 
 	vmcs12->vm_entry_controls =
 		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
···
 
 	kvm_set_dr(vcpu, 7, 0x400);
 	vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx_set_msr_bitmap(vcpu);
+
+	if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
+				vmcs12->vm_exit_msr_load_count))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
 }
 
 /*
···
 	leave_guest_mode(vcpu);
 	prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
 		       exit_qualification);
+
+	if (nested_vmx_store_msr(vcpu, vmcs12->vm_exit_msr_store_addr,
+				 vmcs12->vm_exit_msr_store_count))
+		nested_vmx_abort(vcpu, VMX_ABORT_SAVE_GUEST_MSR_FAIL);
 
 	vmx_load_vmcs01(vcpu);
···
 	if (vmx->nested.virtual_apic_page) {
 		nested_release_page(vmx->nested.virtual_apic_page);
 		vmx->nested.virtual_apic_page = NULL;
+	}
+	if (vmx->nested.pi_desc_page) {
+		kunmap(vmx->nested.pi_desc_page);
+		nested_release_page(vmx->nested.pi_desc_page);
+		vmx->nested.pi_desc_page = NULL;
+		vmx->nested.pi_desc = NULL;
 	}
 
 	/*
···
 {
 	if (ple_gap)
 		shrink_ple_window(vcpu);
+}
+
+static void vmx_slot_enable_log_dirty(struct kvm *kvm,
+				      struct kvm_memory_slot *slot)
+{
+	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
+	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+}
+
+static void vmx_slot_disable_log_dirty(struct kvm *kvm,
+				       struct kvm_memory_slot *slot)
+{
+	kvm_mmu_slot_set_dirty(kvm, slot);
+}
+
+static void vmx_flush_log_dirty(struct kvm *kvm)
+{
+	kvm_flush_pml_buffers(kvm);
+}
+
+static void vmx_enable_log_dirty_pt_masked(struct kvm *kvm,
+					   struct kvm_memory_slot *memslot,
+					   gfn_t offset, unsigned long mask)
+{
+	kvm_mmu_clear_dirty_pt_masked(kvm, memslot, offset, mask);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
···
 	.check_nested_events = vmx_check_nested_events,
 
 	.sched_in = vmx_sched_in,
+
+	.slot_enable_log_dirty = vmx_slot_enable_log_dirty,
+	.slot_disable_log_dirty = vmx_slot_disable_log_dirty,
+	.flush_log_dirty = vmx_flush_log_dirty,
+	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
 };
 
 static int __init vmx_init(void)
+127 -82
arch/x86/kvm/x86.c
···
 static u32 tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 
+/* lapic timer advance (tscdeadline mode only) in nanoseconds */
+unsigned int lapic_timer_advance_ns = 0;
+module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
+
 static bool backwards_tsc_observed = false;
 
 #define KVM_NR_SHARED_MSRS 16
···
 	{ "irq_window", VCPU_STAT(irq_window_exits) },
 	{ "nmi_window", VCPU_STAT(nmi_window_exits) },
 	{ "halt_exits", VCPU_STAT(halt_exits) },
+	{ "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
 	{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
 	{ "hypercalls", VCPU_STAT(hypercalls) },
 	{ "request_irq", VCPU_STAT(request_irq_exits) },
···
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page_mmu);
 
-int kvm_read_nested_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+static int kvm_read_nested_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			       void *data, int offset, int len, u32 access)
 {
 	return kvm_read_guest_page_mmu(vcpu, vcpu->arch.walk_mmu, gfn,
···
 	}
 }
 
-int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
+static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
 	u64 xcr0 = xcr;
 	u64 old_xcr0 = vcpu->arch.xcr0;
···
 }
 #endif
 
+void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Note: KVM_REQ_PENDING_TIMER is implicitly checked in
+	 * vcpu_enter_guest. This function is only called from
+	 * the physical CPU that is running vcpu.
+	 */
+	kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
+}
 
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 {
···
 #endif
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
-unsigned long max_tsc_khz;
+static unsigned long max_tsc_khz;
 
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
···
 	return tsc;
 }
 
-void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
 	bool vcpus_matched;
···
 			&ka->master_cycle_now);
 
 	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-				&& !backwards_tsc_observed;
+				&& !backwards_tsc_observed
+				&& !ka->boot_vcpu_runs_old_kvmclock;
 
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
···
 	case MSR_KVM_SYSTEM_TIME_NEW:
 	case MSR_KVM_SYSTEM_TIME: {
 		u64 gpa_offset;
+		struct kvm_arch *ka = &vcpu->kvm->arch;
+
 		kvmclock_reset(vcpu);
+
+		if (vcpu->vcpu_id == 0 && !msr_info->host_initiated) {
+			bool tmp = (msr == MSR_KVM_SYSTEM_TIME);
+
+			if (ka->boot_vcpu_runs_old_kvmclock != tmp)
+				set_bit(KVM_REQ_MASTERCLOCK_UPDATE,
+					&vcpu->requests);
+
+			ka->boot_vcpu_runs_old_kvmclock = tmp;
+		}
 
 		vcpu->arch.time = data;
 		kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
···
 {
 	return kvm_x86_ops->get_msr(vcpu, msr_index, pdata);
 }
+EXPORT_SYMBOL_GPL(kvm_get_msr);
 
 static int get_msr_mtrr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 {
···
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_HYPERV_TIME:
 	case KVM_CAP_IOAPIC_POLARITY_IGNORED:
+	case KVM_CAP_TSC_DEADLINE_TIMER:
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
 	case KVM_CAP_ASSIGN_DEV_IRQ:
 	case KVM_CAP_PCI_2_3:
···
 		break;
 	case KVM_CAP_TSC_CONTROL:
 		r = kvm_has_tsc_control;
-		break;
-	case KVM_CAP_TSC_DEADLINE_TIMER:
-		r = boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER);
 		break;
 	default:
 		r = 0;
···
 * @kvm:	kvm instance
 * @log:	slot id and address to which we copy the log
 *
- * We need to keep it in mind that VCPU threads can write to the bitmap
- * concurrently. So, to avoid losing data, we keep the following order for
- * each bit:
+ * Steps 1-4 below provide general overview of dirty page logging. See
+ * kvm_get_dirty_log_protect() function description for additional details.
+ *
+ * We call kvm_get_dirty_log_protect() to handle steps 1-3, upon return we
+ * always flush the TLB (step 4) even if previous step failed and the dirty
+ * bitmap may be corrupt. Regardless of previous outcome the KVM logging API
+ * does not preclude user space subsequent dirty log read. Flushing TLB ensures
+ * writes will be marked dirty for next log read.
 *
 *   1. Take a snapshot of the bit and clear it if needed.
 *   2. Write protect the corresponding page.
- *   3. Flush TLB's if needed.
- *   4. Copy the snapshot to the userspace.
- *
- * Between 2 and 3, the guest may write to the page using the remaining TLB
- * entry. This is not a problem because the page will be reported dirty at
- * step 4 using the snapshot taken before and step 3 ensures that successive
- * writes will be logged for the next call.
+ *   3. Copy the snapshot to the userspace.
+ *   4. Flush TLB's if needed.
 */
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
-	int r;
-	struct kvm_memory_slot *memslot;
-	unsigned long n, i;
-	unsigned long *dirty_bitmap;
-	unsigned long *dirty_bitmap_buffer;
 	bool is_dirty = false;
+	int r;
 
 	mutex_lock(&kvm->slots_lock);
 
-	r = -EINVAL;
-	if (log->slot >= KVM_USER_MEM_SLOTS)
-		goto out;
+	/*
+	 * Flush potentially hardware-cached dirty pages to dirty_bitmap.
+	 */
+	if (kvm_x86_ops->flush_log_dirty)
+		kvm_x86_ops->flush_log_dirty(kvm);
 
-	memslot = id_to_memslot(kvm->memslots, log->slot);
-
-	dirty_bitmap = memslot->dirty_bitmap;
-	r = -ENOENT;
-	if (!dirty_bitmap)
-		goto out;
-
-	n = kvm_dirty_bitmap_bytes(memslot);
-
-	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
-	memset(dirty_bitmap_buffer, 0, n);
-
-	spin_lock(&kvm->mmu_lock);
-
-	for (i = 0; i < n / sizeof(long); i++) {
-		unsigned long mask;
-		gfn_t offset;
-
-		if (!dirty_bitmap[i])
-			continue;
-
-		is_dirty = true;
-
-		mask = xchg(&dirty_bitmap[i], 0);
-		dirty_bitmap_buffer[i] = mask;
-
-		offset = i * BITS_PER_LONG;
-		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
-	}
-
-	spin_unlock(&kvm->mmu_lock);
-
-	/* See the comments in kvm_mmu_slot_remove_write_access(). */
-	lockdep_assert_held(&kvm->slots_lock);
+	r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
 
 	/*
 	 * All the TLBs can be flushed out of mmu lock, see the comments in
 	 * kvm_mmu_slot_remove_write_access().
 	 */
+	lockdep_assert_held(&kvm->slots_lock);
 	if (is_dirty)
 		kvm_flush_remote_tlbs(kvm);
 
-	r = -EFAULT;
-	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
-		goto out;
-
-	r = 0;
-out:
 	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
···
 		if (rc != X86EMUL_CONTINUE)
 			return rc;
 		addr += now;
+		if (ctxt->mode != X86EMUL_MODE_PROT64)
+			addr = (u32)addr;
 		val += now;
 		bytes -= now;
···
 	kvm_register_write(emul_to_vcpu(ctxt), reg, val);
 }
 
+static void emulator_set_nmi_mask(struct x86_emulate_ctxt *ctxt, bool masked)
+{
+	kvm_x86_ops->set_nmi_mask(emul_to_vcpu(ctxt), masked);
+}
+
 static const struct x86_emulate_ops emulate_ops = {
 	.read_gpr            = emulator_read_gpr,
 	.write_gpr           = emulator_write_gpr,
···
 	.put_fpu             = emulator_put_fpu,
 	.intercept           = emulator_intercept,
 	.get_cpuid           = emulator_get_cpuid,
+	.set_nmi_mask        = emulator_set_nmi_mask,
 };
 
 static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
···
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);
+	wait_lapic_expire(vcpu);
 	kvm_x86_ops->run(vcpu);
 
 	/*
···
 	return r;
 }
 
-int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 {
-	int r;
 	struct msr_data msr;
 	struct kvm *kvm = vcpu->kvm;
 
-	r = vcpu_load(vcpu);
-	if (r)
-		return r;
+	if (vcpu_load(vcpu))
+		return;
 	msr.data = 0x0;
 	msr.index = MSR_IA32_TSC;
 	msr.host_initiated = true;
···
 
 	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 					KVMCLOCK_SYNC_PERIOD);
-
-	return r;
 }
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
···
 	return 0;
 }
 
+static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
+				     struct kvm_memory_slot *new)
+{
+	/* Still write protect RO slot */
+	if (new->flags & KVM_MEM_READONLY) {
+		kvm_mmu_slot_remove_write_access(kvm, new);
+		return;
+	}
+
+	/*
+	 * Call kvm_x86_ops dirty logging hooks when they are valid.
+	 *
+	 * kvm_x86_ops->slot_disable_log_dirty is called when:
+	 *
+	 *  - KVM_MR_CREATE with dirty logging is disabled
+	 *  - KVM_MR_FLAGS_ONLY with dirty logging is disabled in new flag
+	 *
+	 * The reason is, in case of PML, we need to set D-bit for any slots
+	 * with dirty logging disabled in order to eliminate unnecessary GPA
+	 * logging in PML buffer (and potential PML buffer full VMEXT). This
+	 * guarantees leaving PML enabled during guest's lifetime won't have
+	 * any additonal overhead from PML when guest is running with dirty
+	 * logging disabled for memory slots.
+	 *
+	 * kvm_x86_ops->slot_enable_log_dirty is called when switching new slot
+	 * to dirty logging mode.
+	 *
+	 * If kvm_x86_ops dirty logging hooks are invalid, use write protect.
+	 *
+	 * In case of write protect:
+	 *
+	 * Write protect all pages for dirty logging.
+	 *
+	 * All the sptes including the large sptes which point to this
+	 * slot are set to readonly. We can not create any new large
+	 * spte on this slot until the end of the logging.
+	 *
+	 * See the comments in fast_page_fault().
+	 */
+	if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+		if (kvm_x86_ops->slot_enable_log_dirty)
+			kvm_x86_ops->slot_enable_log_dirty(kvm, new);
+		else
+			kvm_mmu_slot_remove_write_access(kvm, new);
+	} else {
+		if (kvm_x86_ops->slot_disable_log_dirty)
+			kvm_x86_ops->slot_disable_log_dirty(kvm, new);
+	}
+}
+
 void kvm_arch_commit_memory_region(struct kvm *kvm,
 				struct kvm_userspace_memory_region *mem,
 				const struct kvm_memory_slot *old,
 				enum kvm_mr_change change)
 {
-
+	struct kvm_memory_slot *new;
 	int nr_mmu_pages = 0;
 
 	if ((mem->slot >= KVM_USER_MEM_SLOTS) && (change == KVM_MR_DELETE)) {
···
 
 	if (nr_mmu_pages)
 		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+
+	/* It's OK to get 'new' slot here as it has already been installed */
+	new = id_to_memslot(kvm->memslots, mem->slot);
+
 	/*
-	 * Write protect all pages for dirty logging.
+	 * Set up write protection and/or dirty logging for the new slot.
 	 *
-	 * All the sptes including the large sptes which point to this
-	 * slot are set to readonly. We can not create any new large
-	 * spte on this slot until the end of the logging.
-	 *
-	 * See the comments in fast_page_fault().
+	 * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of old slot have
+	 * been zapped so no dirty logging staff is needed for old slot. For
+	 * KVM_MR_FLAGS_ONLY, the old slot is essentially the same one as the
+	 * new and it's also covered when dealing with the new slot.
 	 */
-	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
-		kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+	if (change != KVM_MR_DELETE)
+		kvm_mmu_slot_apply_flags(kvm, new);
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
···
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
+3
arch/x86/kvm/x86.h
···
 
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
+void kvm_set_pending_timer(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr);
···
 extern u64 kvm_supported_xcr0(void);
 
 extern unsigned int min_timer_period_us;
+
+extern unsigned int lapic_timer_advance_ns;
 
 extern struct static_key kvm_no_apic_vcpu;
 #endif
+9 -5
drivers/irqchip/irq-gic-v3.c
···
 	return tlist;
 }
 
+#define MPIDR_TO_SGI_AFFINITY(cluster_id, level) \
+	(MPIDR_AFFINITY_LEVEL(cluster_id, level) \
+		<< ICC_SGI1R_AFFINITY_## level ##_SHIFT)
+
 static void gic_send_sgi(u64 cluster_id, u16 tlist, unsigned int irq)
 {
 	u64 val;
 
-	val = (MPIDR_AFFINITY_LEVEL(cluster_id, 3) << 48	|
-	       MPIDR_AFFINITY_LEVEL(cluster_id, 2) << 32	|
-	       irq << 24					|
-	       MPIDR_AFFINITY_LEVEL(cluster_id, 1) << 16	|
-	       tlist);
+	val = (MPIDR_TO_SGI_AFFINITY(cluster_id, 3)	|
+	       MPIDR_TO_SGI_AFFINITY(cluster_id, 2)	|
+	       irq << ICC_SGI1R_SGI_ID_SHIFT		|
+	       MPIDR_TO_SGI_AFFINITY(cluster_id, 1)	|
+	       tlist << ICC_SGI1R_TARGET_LIST_SHIFT);
 
 	pr_debug("CPU%d: ICC_SGI1R_EL1 %llx\n", smp_processor_id(), val);
 	gic_write_sgi1r(val);
+8
drivers/s390/char/sclp_early.c
···
 static unsigned int sclp_max_cpu;
 static struct sclp_ipl_info sclp_ipl_info;
 static unsigned char sclp_siif;
+static unsigned char sclp_sigpif;
 static u32 sclp_ibc;
 static unsigned int sclp_mtid;
 static unsigned int sclp_mtid_cp;
···
 		if (boot_cpu_address != cpue->core_id)
 			continue;
 		sclp_siif = cpue->siif;
+		sclp_sigpif = cpue->sigpif;
 		break;
 	}
 
···
 	return sclp_siif;
 }
 EXPORT_SYMBOL(sclp_has_siif);
+
+int sclp_has_sigpif(void)
+{
+	return sclp_sigpif;
+}
+EXPORT_SYMBOL(sclp_has_sigpif);
 
 unsigned int sclp_get_ibc(void)
 {
+38 -5
include/kvm/arm_vgic.h
···
 #define VGIC_V2_MAX_LRS		(1 << 6)
 #define VGIC_V3_MAX_LRS		16
 #define VGIC_MAX_IRQS		1024
+#define VGIC_V2_MAX_CPUS	8
 
 /* Sanity checks... */
-#if (KVM_MAX_VCPUS > 8)
-#error	Invalid number of CPU interfaces
+#if (KVM_MAX_VCPUS > 255)
+#error	Too many KVM VCPUs, the VGIC only supports up to 255 VCPUs for now
 #endif
 
 #if (VGIC_NR_IRQS_LEGACY & 31)
···
 	unsigned int	maint_irq;
 	/* Virtual control interface base address */
 	void __iomem	*vctrl_base;
+	int		max_gic_vcpus;
+	/* Only needed for the legacy KVM_CREATE_IRQCHIP */
+	bool		can_emulate_gicv2;
+};
+
+struct vgic_vm_ops {
+	bool	(*handle_mmio)(struct kvm_vcpu *, struct kvm_run *,
+			       struct kvm_exit_mmio *);
+	bool	(*queue_sgi)(struct kvm_vcpu *, int irq);
+	void	(*add_sgi_source)(struct kvm_vcpu *, int irq, int source);
+	int	(*init_model)(struct kvm *);
+	int	(*map_resources)(struct kvm *, const struct vgic_params *);
 };
 
 struct vgic_dist {
···
 	spinlock_t		lock;
 	bool			in_kernel;
 	bool			ready;
+
+	/* vGIC model the kernel emulates for the guest (GICv2 or GICv3) */
+	u32			vgic_model;
 
 	int			nr_cpus;
 	int			nr_irqs;
···
 
 	/* Distributor and vcpu interface mapping in the guest */
 	phys_addr_t		vgic_dist_base;
-	phys_addr_t		vgic_cpu_base;
+	/* GICv2 and GICv3 use different mapped register blocks */
+	union {
+		phys_addr_t		vgic_cpu_base;
+		phys_addr_t		vgic_redist_base;
+	};
 
 	/* Distributor enabled */
 	u32			enabled;
···
 	 */
 	struct vgic_bitmap	*irq_spi_target;
 
+	/* Target MPIDR for each IRQ (needed for GICv3 IROUTERn) only */
+	u32			*irq_spi_mpidr;
+
 	/* Bitmap indicating which CPU has something pending */
 	unsigned long		*irq_pending_on_cpu;
+
+	struct vgic_vm_ops	vm_ops;
 #endif
 };
···
 #ifdef CONFIG_ARM_GIC_V3
 	u32		vgic_hcr;
 	u32		vgic_vmcr;
+	u32		vgic_sre;	/* Restored only, change ignored */
 	u32		vgic_misr;	/* Saved only */
 	u32		vgic_eisr;	/* Saved only */
 	u32		vgic_elrsr;	/* Saved only */
···
 int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool write);
 int kvm_vgic_hyp_init(void);
 int kvm_vgic_map_resources(struct kvm *kvm);
-int kvm_vgic_create(struct kvm *kvm);
+int kvm_vgic_get_max_vcpus(void);
+int kvm_vgic_create(struct kvm *kvm, u32 type);
 void kvm_vgic_destroy(struct kvm *kvm);
 void kvm_vgic_vcpu_destroy(struct kvm_vcpu *vcpu);
 void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu);
 void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu);
 int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
 			bool level);
+void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg);
 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
 bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		      struct kvm_exit_mmio *mmio);
···
 	return 0;
 }
 
-static inline int kvm_vgic_create(struct kvm *kvm)
+static inline int kvm_vgic_create(struct kvm *kvm, u32 type)
 {
 	return 0;
 }
···
 static inline bool vgic_ready(struct kvm *kvm)
 {
 	return true;
+}
+
+static inline int kvm_vgic_get_max_vcpus(void)
+{
+	return KVM_MAX_VCPUS;
 }
 #endif
+44
include/linux/irqchip/arm-gic-v3.h
···
 #define GICD_SETSPI_SR			0x0050
 #define GICD_CLRSPI_SR			0x0058
 #define GICD_SEIR			0x0068
+#define GICD_IGROUPR			0x0080
 #define GICD_ISENABLER			0x0100
 #define GICD_ICENABLER			0x0180
 #define GICD_ISPENDR			0x0200
···
 #define GICD_ICACTIVER			0x0380
 #define GICD_IPRIORITYR			0x0400
 #define GICD_ICFGR			0x0C00
+#define GICD_IGRPMODR			0x0D00
+#define GICD_NSACR			0x0E00
 #define GICD_IROUTER			0x6000
+#define GICD_IDREGS			0xFFD0
 #define GICD_PIDR2			0xFFE8
 
+/*
+ * Those registers are actually from GICv2, but the spec demands that they
+ * are implemented as RES0 if ARE is 1 (which we do in KVM's emulated GICv3).
+ */
+#define GICD_ITARGETSR			0x0800
+#define GICD_SGIR			0x0F00
+#define GICD_CPENDSGIR			0x0F10
+#define GICD_SPENDSGIR			0x0F20
+
 #define GICD_CTLR_RWP			(1U << 31)
+#define GICD_CTLR_DS			(1U << 6)
 #define GICD_CTLR_ARE_NS		(1U << 4)
 #define GICD_CTLR_ENABLE_G1A		(1U << 1)
 #define GICD_CTLR_ENABLE_G1		(1U << 0)
+
+/*
+ * In systems with a single security state (what we emulate in KVM)
+ * the meaning of the interrupt group enable bits is slightly different
+ */
+#define GICD_CTLR_ENABLE_SS_G1		(1U << 1)
+#define GICD_CTLR_ENABLE_SS_G0		(1U << 0)
+
+#define GICD_TYPER_LPIS			(1U << 17)
+#define GICD_TYPER_MBIS			(1U << 16)
 
 #define GICD_TYPER_ID_BITS(typer)	((((typer) >> 19) & 0x1f) + 1)
 #define GICD_TYPER_IRQS(typer)		((((typer) & 0x1f) + 1) * 32)
···
 #define GIC_PIDR2_ARCH_MASK		0xf0
 #define GIC_PIDR2_ARCH_GICv3		0x30
 #define GIC_PIDR2_ARCH_GICv4		0x40
+
+#define GIC_V3_DIST_SIZE		0x10000
 
 /*
  * Re-Distributor registers, offsets from RD_base
···
 #define GICR_SYNCR			0x00C0
 #define GICR_MOVLPIR			0x0100
 #define GICR_MOVALLR			0x0110
+#define GICR_IDREGS			GICD_IDREGS
 #define GICR_PIDR2			GICD_PIDR2
 
 #define GICR_CTLR_ENABLE_LPIS		(1UL << 0)
···
 /*
  * Re-Distributor registers, offsets from SGI_base
  */
+#define GICR_IGROUPR0			GICD_IGROUPR
 #define GICR_ISENABLER0			GICD_ISENABLER
 #define GICR_ICENABLER0			GICD_ICENABLER
 #define GICR_ISPENDR0			GICD_ISPENDR
···
 #define GICR_ICACTIVER0			GICD_ICACTIVER
 #define GICR_IPRIORITYR0		GICD_IPRIORITYR
 #define GICR_ICFGR0			GICD_ICFGR
+#define GICR_IGRPMODR0			GICD_IGRPMODR
+#define GICR_NSACR			GICD_NSACR
 
 #define GICR_TYPER_PLPIS		(1U << 0)
 #define GICR_TYPER_VLPIS		(1U << 1)
 #define GICR_TYPER_LAST			(1U << 4)
+
+#define GIC_V3_REDIST_SIZE		0x20000
 
 #define LPI_PROP_GROUP1			(1 << 1)
 #define LPI_PROP_ENABLED		(1 << 0)
···
 
 #define ICC_SRE_EL2_SRE			(1 << 0)
 #define ICC_SRE_EL2_ENABLE		(1 << 3)
+
+#define ICC_SGI1R_TARGET_LIST_SHIFT	0
+#define ICC_SGI1R_TARGET_LIST_MASK	(0xffff << ICC_SGI1R_TARGET_LIST_SHIFT)
+#define ICC_SGI1R_AFFINITY_1_SHIFT	16
+#define ICC_SGI1R_AFFINITY_1_MASK	(0xff << ICC_SGI1R_AFFINITY_1_SHIFT)
+#define ICC_SGI1R_SGI_ID_SHIFT		24
+#define ICC_SGI1R_SGI_ID_MASK		(0xff << ICC_SGI1R_SGI_ID_SHIFT)
+#define ICC_SGI1R_AFFINITY_2_SHIFT	32
+#define ICC_SGI1R_AFFINITY_2_MASK	(0xffULL << ICC_SGI1R_AFFINITY_1_SHIFT)
+#define ICC_SGI1R_IRQ_ROUTING_MODE_BIT	40
+#define ICC_SGI1R_AFFINITY_3_SHIFT	48
+#define ICC_SGI1R_AFFINITY_3_MASK	(0xffULL << ICC_SGI1R_AFFINITY_1_SHIFT)
 
 /*
  * System register definitions
+12 -5
include/linux/kvm_host.h
···
 
 #include <asm/kvm_host.h>
 
-#ifndef KVM_MMIO_SIZE
-#define KVM_MMIO_SIZE 8
-#endif
-
 /*
  * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
  * in kvm, other bits are visible for userspace which are defined in
···
 
 int kvm_get_dirty_log(struct kvm *kvm,
 			struct kvm_dirty_log *log, int *is_dirty);
+
+int kvm_get_dirty_log_protect(struct kvm *kvm,
+			struct kvm_dirty_log *log, bool *is_dirty);
+
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					gfn_t gfn_offset,
+					unsigned long mask);
+
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 				struct kvm_dirty_log *log);
···
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
-int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 int kvm_arch_hardware_enable(void);
···
 
 extern struct kvm_device_ops kvm_mpic_ops;
 extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvm_arm_vgic_v2_ops;
+extern struct kvm_device_ops kvm_arm_vgic_v3_ops;
 
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
+19
include/trace/events/kvm.h
···
 		  __entry->errno < 0 ? -__entry->errno : __entry->reason)
 );
 
+TRACE_EVENT(kvm_vcpu_wakeup,
+	    TP_PROTO(__u64 ns, bool waited),
+	    TP_ARGS(ns, waited),
+
+	TP_STRUCT__entry(
+		__field(	__u64,		ns		)
+		__field(	bool,		waited		)
+	),
+
+	TP_fast_assign(
+		__entry->ns	= ns;
+		__entry->waited	= waited;
+	),
+
+	TP_printk("%s time %lld ns",
+		  __entry->waited ? "wait" : "poll",
+		  __entry->ns)
+);
+
 #if defined(CONFIG_HAVE_KVM_IRQFD)
 TRACE_EVENT(kvm_set_irq,
 	TP_PROTO(unsigned int gsi, int level, int irq_source_id),
+9
include/uapi/linux/kvm.h
···
 	__u16 code;
 };
 
+#define KVM_S390_STOP_FLAG_STORE_STATUS	0x01
+struct kvm_s390_stop_info {
+	__u32 flags;
+};
+
 struct kvm_s390_mchk_info {
 	__u64 cr14;
 	__u64 mcic;
···
 		struct kvm_s390_emerg_info emerg;
 		struct kvm_s390_extcall_info extcall;
 		struct kvm_s390_prefix_info prefix;
+		struct kvm_s390_stop_info stop;
 		struct kvm_s390_mchk_info mchk;
 		char reserved[64];
 	} u;
···
 #define KVM_CAP_PPC_FIXUP_HCALL 103
 #define KVM_CAP_PPC_ENABLE_HCALL 104
 #define KVM_CAP_CHECK_EXTENSION_VM 105
+#define KVM_CAP_S390_USER_SIGP 106
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
···
 #define KVM_DEV_TYPE_ARM_VGIC_V2	KVM_DEV_TYPE_ARM_VGIC_V2
 	KVM_DEV_TYPE_FLIC,
 #define KVM_DEV_TYPE_FLIC		KVM_DEV_TYPE_FLIC
+	KVM_DEV_TYPE_ARM_VGIC_V3,
+#define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
 	KVM_DEV_TYPE_MAX,
 };
+10
virt/kvm/Kconfig
···
 
 config KVM_VFIO
        bool
+
+config HAVE_KVM_ARCH_TLB_FLUSH_ALL
+       bool
+
+config KVM_GENERIC_DIRTYLOG_READ_PROTECT
+       bool
+
+config KVM_COMPAT
+       def_bool y
+       depends on COMPAT && !S390
+847
virt/kvm/arm/vgic-v2-emul.c
···
+/*
+ * Contains GICv2 specific emulation code, was in vgic.c before.
+ *
+ * Copyright (C) 2012 ARM Ltd.
+ * Author: Marc Zyngier <marc.zyngier@arm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/cpu.h>
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/uaccess.h>
+
+#include <linux/irqchip/arm-gic.h>
+
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_arm.h>
+#include <asm/kvm_mmu.h>
+
+#include "vgic.h"
+
+#define GICC_ARCH_VERSION_V2		0x2
+
+static void vgic_dispatch_sgi(struct kvm_vcpu *vcpu, u32 reg);
+static u8 *vgic_get_sgi_sources(struct vgic_dist *dist, int vcpu_id, int sgi)
+{
+	return dist->irq_sgi_sources + vcpu_id * VGIC_NR_SGIS + sgi;
+}
+
+static bool handle_mmio_misc(struct kvm_vcpu *vcpu,
+			     struct kvm_exit_mmio *mmio, phys_addr_t offset)
+{
+	u32 reg;
+	u32 word_offset = offset & 3;
+
+	switch (offset & ~3) {
+	case 0:			/* GICD_CTLR */
+		reg = vcpu->kvm->arch.vgic.enabled;
+		vgic_reg_access(mmio, &reg, word_offset,
+				ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
+		if (mmio->is_write) {
+			vcpu->kvm->arch.vgic.enabled = reg & 1;
+			vgic_update_state(vcpu->kvm);
+			return true;
+		}
+		break;
+
+	case 4:			/* GICD_TYPER */
+		reg  = (atomic_read(&vcpu->kvm->online_vcpus) - 1) << 5;
+		reg |= (vcpu->kvm->arch.vgic.nr_irqs >> 5) - 1;
+		vgic_reg_access(mmio, &reg, word_offset,
+				ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED);
+		break;
+
+	case 8:			/* GICD_IIDR */
+		reg = (PRODUCT_ID_KVM << 24) | (IMPLEMENTER_ARM << 0);
+		vgic_reg_access(mmio, &reg, word_offset,
+				ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED);
+		break;
+	}
+
+	return false;
+}
+
+static bool handle_mmio_set_enable_reg(struct kvm_vcpu *vcpu,
+				       struct kvm_exit_mmio *mmio,
+				       phys_addr_t offset)
+{
+	return vgic_handle_enable_reg(vcpu->kvm, mmio, offset,
+				      vcpu->vcpu_id, ACCESS_WRITE_SETBIT);
+}
+
+static bool handle_mmio_clear_enable_reg(struct kvm_vcpu *vcpu,
+					 struct kvm_exit_mmio *mmio,
+					 phys_addr_t offset)
+{
+	return vgic_handle_enable_reg(vcpu->kvm, mmio, offset,
+				      vcpu->vcpu_id, ACCESS_WRITE_CLEARBIT);
+}
+
+static bool handle_mmio_set_pending_reg(struct kvm_vcpu *vcpu,
+					struct kvm_exit_mmio *mmio,
+					phys_addr_t offset)
+{
+	return vgic_handle_set_pending_reg(vcpu->kvm, mmio, offset,
+					   vcpu->vcpu_id);
+}
+
+static bool handle_mmio_clear_pending_reg(struct kvm_vcpu *vcpu,
+					  struct kvm_exit_mmio *mmio,
+					  phys_addr_t offset)
+{
+	return vgic_handle_clear_pending_reg(vcpu->kvm, mmio, offset,
+					     vcpu->vcpu_id);
+}
+
+static bool handle_mmio_priority_reg(struct kvm_vcpu *vcpu,
+				     struct kvm_exit_mmio *mmio,
+				     phys_addr_t offset)
+{
+	u32 *reg = vgic_bytemap_get_reg(&vcpu->kvm->arch.vgic.irq_priority,
+					vcpu->vcpu_id, offset);
+	vgic_reg_access(mmio, reg, offset,
+			ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
+	return false;
+}
+
+#define GICD_ITARGETSR_SIZE	32
+#define GICD_CPUTARGETS_BITS	8
+#define GICD_IRQS_PER_ITARGETSR	(GICD_ITARGETSR_SIZE / GICD_CPUTARGETS_BITS)
+static u32 vgic_get_target_reg(struct kvm *kvm, int irq)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	int i;
+	u32 val = 0;
+
+	irq -= VGIC_NR_PRIVATE_IRQS;
+
+	for (i = 0; i < GICD_IRQS_PER_ITARGETSR; i++)
+		val |= 1 << (dist->irq_spi_cpu[irq + i] + i * 8);
+
+	return val;
+}
+
+static void vgic_set_target_reg(struct kvm *kvm, u32 val, int irq)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	struct kvm_vcpu *vcpu;
+	int i, c;
+	unsigned long *bmap;
+	u32 target;
+
+	irq -= VGIC_NR_PRIVATE_IRQS;
+
+	/*
+	 * Pick the LSB in each byte. This ensures we target exactly
+	 * one vcpu per IRQ. If the byte is null, assume we target
+	 * CPU0.
+	 */
+	for (i = 0; i < GICD_IRQS_PER_ITARGETSR; i++) {
+		int shift = i * GICD_CPUTARGETS_BITS;
+
+		target = ffs((val >> shift) & 0xffU);
+		target = target ? (target - 1) : 0;
+		dist->irq_spi_cpu[irq + i] = target;
+		kvm_for_each_vcpu(c, vcpu, kvm) {
+			bmap = vgic_bitmap_get_shared_map(&dist->irq_spi_target[c]);
+			if (c == target)
+				set_bit(irq + i, bmap);
+			else
+				clear_bit(irq + i, bmap);
+		}
+	}
+}
+
+static bool handle_mmio_target_reg(struct kvm_vcpu *vcpu,
+				   struct kvm_exit_mmio *mmio,
+				   phys_addr_t offset)
+{
+	u32 reg;
+
+	/* We treat the banked interrupts targets as read-only */
+	if (offset < 32) {
+		u32 roreg;
+
+		roreg = 1 << vcpu->vcpu_id;
+		roreg |= roreg << 8;
+		roreg |= roreg << 16;
+
+		vgic_reg_access(mmio, &roreg, offset,
+				ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED);
+		return false;
+	}
+
+	reg = vgic_get_target_reg(vcpu->kvm, offset & ~3U);
+	vgic_reg_access(mmio, &reg, offset,
+			ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
+	if (mmio->is_write) {
+		vgic_set_target_reg(vcpu->kvm, reg, offset & ~3U);
+		vgic_update_state(vcpu->kvm);
+		return true;
+	}
+
+	return false;
+}
+
+static bool handle_mmio_cfg_reg(struct kvm_vcpu *vcpu,
+				struct kvm_exit_mmio *mmio, phys_addr_t offset)
+{
+	u32 *reg;
+
+	reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_cfg,
+				  vcpu->vcpu_id, offset >> 1);
+
+	return vgic_handle_cfg_reg(reg, mmio, offset);
+}
+
+static bool handle_mmio_sgi_reg(struct kvm_vcpu *vcpu,
+				struct kvm_exit_mmio *mmio, phys_addr_t offset)
+{
+	u32 reg;
+
+	vgic_reg_access(mmio, &reg, offset,
+			ACCESS_READ_RAZ | ACCESS_WRITE_VALUE);
+	if (mmio->is_write) {
+		vgic_dispatch_sgi(vcpu, reg);
+		vgic_update_state(vcpu->kvm);
+		return true;
+	}
+
+	return false;
+}
+
+/* Handle reads of GICD_CPENDSGIRn and GICD_SPENDSGIRn */
+static bool read_set_clear_sgi_pend_reg(struct kvm_vcpu *vcpu,
+					struct kvm_exit_mmio *mmio,
+					phys_addr_t offset)
+{
+	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+	int sgi;
+	int min_sgi = (offset & ~0x3);
+	int max_sgi = min_sgi + 3;
+	int vcpu_id = vcpu->vcpu_id;
+	u32 reg = 0;
+
+	/* Copy source SGIs from distributor side */
+	for (sgi = min_sgi; sgi <= max_sgi; sgi++) {
+		u8 sources = *vgic_get_sgi_sources(dist, vcpu_id, sgi);
+
+		reg |= ((u32)sources) << (8 * (sgi - min_sgi));
+	}
+
+	mmio_data_write(mmio, ~0, reg);
+	return false;
+}
+
+static bool write_set_clear_sgi_pend_reg(struct kvm_vcpu *vcpu,
+					 struct kvm_exit_mmio *mmio,
+					 phys_addr_t offset, bool set)
+{
+	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+	int sgi;
+	int min_sgi = (offset & ~0x3);
+	int max_sgi = min_sgi + 3;
+	int vcpu_id = vcpu->vcpu_id;
+	u32 reg;
+	bool updated = false;
+
+	reg = mmio_data_read(mmio, ~0);
+
+	/* Clear pending SGIs on the distributor */
+	for (sgi = min_sgi; sgi <= max_sgi; sgi++) {
+		u8 mask = reg >> (8 * (sgi - min_sgi));
+		u8 *src = vgic_get_sgi_sources(dist, vcpu_id, sgi);
+
+		if (set) {
+			if ((*src & mask) != mask)
+				updated = true;
+			*src |= mask;
+		} else {
+			if (*src & mask)
+				updated = true;
+			*src &= ~mask;
+		}
+	}
+
+	if (updated)
+		vgic_update_state(vcpu->kvm);
+
+	return updated;
+}
+
+static bool handle_mmio_sgi_set(struct kvm_vcpu *vcpu,
+				struct kvm_exit_mmio *mmio,
+				phys_addr_t offset)
+{
+	if (!mmio->is_write)
+		return read_set_clear_sgi_pend_reg(vcpu, mmio, offset);
+	else
+		return write_set_clear_sgi_pend_reg(vcpu, mmio, offset, true);
+}
+
+static bool handle_mmio_sgi_clear(struct kvm_vcpu *vcpu,
+				  struct kvm_exit_mmio *mmio,
+				  phys_addr_t offset)
+{
+	if (!mmio->is_write)
+		return read_set_clear_sgi_pend_reg(vcpu, mmio, offset);
+	else
+		return write_set_clear_sgi_pend_reg(vcpu, mmio, offset, false);
+}
+
+static const struct kvm_mmio_range vgic_dist_ranges[] = {
+	{
+		.base		= GIC_DIST_CTRL,
+		.len		= 12,
+		.bits_per_irq	= 0,
+		.handle_mmio	= handle_mmio_misc,
+	},
+	{
+		.base		= GIC_DIST_IGROUP,
+		.len		= VGIC_MAX_IRQS / 8,
+		.bits_per_irq	= 1,
+		.handle_mmio	= handle_mmio_raz_wi,
+	},
+	{
+		.base		= GIC_DIST_ENABLE_SET,
+		.len		= VGIC_MAX_IRQS / 8,
+		.bits_per_irq	= 1,
+		.handle_mmio	= handle_mmio_set_enable_reg,
+	},
+	{
+		.base		= GIC_DIST_ENABLE_CLEAR,
+		.len		= VGIC_MAX_IRQS / 8,
+		.bits_per_irq	= 1,
+		.handle_mmio	= handle_mmio_clear_enable_reg,
+	},
+	{
+		.base		= GIC_DIST_PENDING_SET,
+		.len		= VGIC_MAX_IRQS / 8,
+		.bits_per_irq	= 1,
+		.handle_mmio	= handle_mmio_set_pending_reg,
+	},
+	{
+		.base		= GIC_DIST_PENDING_CLEAR,
+		.len		= VGIC_MAX_IRQS / 8,
+		.bits_per_irq	= 1,
+		.handle_mmio	= handle_mmio_clear_pending_reg,
+	},
+	{
+		.base		= GIC_DIST_ACTIVE_SET,
345 + .len = VGIC_MAX_IRQS / 8, 346 + .bits_per_irq = 1, 347 + .handle_mmio = handle_mmio_raz_wi, 348 + }, 349 + { 350 + .base = GIC_DIST_ACTIVE_CLEAR, 351 + .len = VGIC_MAX_IRQS / 8, 352 + .bits_per_irq = 1, 353 + .handle_mmio = handle_mmio_raz_wi, 354 + }, 355 + { 356 + .base = GIC_DIST_PRI, 357 + .len = VGIC_MAX_IRQS, 358 + .bits_per_irq = 8, 359 + .handle_mmio = handle_mmio_priority_reg, 360 + }, 361 + { 362 + .base = GIC_DIST_TARGET, 363 + .len = VGIC_MAX_IRQS, 364 + .bits_per_irq = 8, 365 + .handle_mmio = handle_mmio_target_reg, 366 + }, 367 + { 368 + .base = GIC_DIST_CONFIG, 369 + .len = VGIC_MAX_IRQS / 4, 370 + .bits_per_irq = 2, 371 + .handle_mmio = handle_mmio_cfg_reg, 372 + }, 373 + { 374 + .base = GIC_DIST_SOFTINT, 375 + .len = 4, 376 + .handle_mmio = handle_mmio_sgi_reg, 377 + }, 378 + { 379 + .base = GIC_DIST_SGI_PENDING_CLEAR, 380 + .len = VGIC_NR_SGIS, 381 + .handle_mmio = handle_mmio_sgi_clear, 382 + }, 383 + { 384 + .base = GIC_DIST_SGI_PENDING_SET, 385 + .len = VGIC_NR_SGIS, 386 + .handle_mmio = handle_mmio_sgi_set, 387 + }, 388 + {} 389 + }; 390 + 391 + static bool vgic_v2_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, 392 + struct kvm_exit_mmio *mmio) 393 + { 394 + unsigned long base = vcpu->kvm->arch.vgic.vgic_dist_base; 395 + 396 + if (!is_in_range(mmio->phys_addr, mmio->len, base, 397 + KVM_VGIC_V2_DIST_SIZE)) 398 + return false; 399 + 400 + /* GICv2 does not support accesses wider than 32 bits */ 401 + if (mmio->len > 4) { 402 + kvm_inject_dabt(vcpu, mmio->phys_addr); 403 + return true; 404 + } 405 + 406 + return vgic_handle_mmio_range(vcpu, run, mmio, vgic_dist_ranges, base); 407 + } 408 + 409 + static void vgic_dispatch_sgi(struct kvm_vcpu *vcpu, u32 reg) 410 + { 411 + struct kvm *kvm = vcpu->kvm; 412 + struct vgic_dist *dist = &kvm->arch.vgic; 413 + int nrcpus = atomic_read(&kvm->online_vcpus); 414 + u8 target_cpus; 415 + int sgi, mode, c, vcpu_id; 416 + 417 + vcpu_id = vcpu->vcpu_id; 418 + 419 + sgi = reg & 0xf; 420 + 
target_cpus = (reg >> 16) & 0xff; 421 + mode = (reg >> 24) & 3; 422 + 423 + switch (mode) { 424 + case 0: 425 + if (!target_cpus) 426 + return; 427 + break; 428 + 429 + case 1: 430 + target_cpus = ((1 << nrcpus) - 1) & ~(1 << vcpu_id) & 0xff; 431 + break; 432 + 433 + case 2: 434 + target_cpus = 1 << vcpu_id; 435 + break; 436 + } 437 + 438 + kvm_for_each_vcpu(c, vcpu, kvm) { 439 + if (target_cpus & 1) { 440 + /* Flag the SGI as pending */ 441 + vgic_dist_irq_set_pending(vcpu, sgi); 442 + *vgic_get_sgi_sources(dist, c, sgi) |= 1 << vcpu_id; 443 + kvm_debug("SGI%d from CPU%d to CPU%d\n", 444 + sgi, vcpu_id, c); 445 + } 446 + 447 + target_cpus >>= 1; 448 + } 449 + } 450 + 451 + static bool vgic_v2_queue_sgi(struct kvm_vcpu *vcpu, int irq) 452 + { 453 + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 454 + unsigned long sources; 455 + int vcpu_id = vcpu->vcpu_id; 456 + int c; 457 + 458 + sources = *vgic_get_sgi_sources(dist, vcpu_id, irq); 459 + 460 + for_each_set_bit(c, &sources, dist->nr_cpus) { 461 + if (vgic_queue_irq(vcpu, c, irq)) 462 + clear_bit(c, &sources); 463 + } 464 + 465 + *vgic_get_sgi_sources(dist, vcpu_id, irq) = sources; 466 + 467 + /* 468 + * If the sources bitmap has been cleared it means that we 469 + * could queue all the SGIs onto link registers (see the 470 + * clear_bit above), and therefore we are done with them in 471 + * our emulated gic and can get rid of them. 472 + */ 473 + if (!sources) { 474 + vgic_dist_irq_clear_pending(vcpu, irq); 475 + vgic_cpu_irq_clear(vcpu, irq); 476 + return true; 477 + } 478 + 479 + return false; 480 + } 481 + 482 + /** 483 + * kvm_vgic_map_resources - Configure global VGIC state before running any VCPUs 484 + * @kvm: pointer to the kvm struct 485 + * 486 + * Map the virtual CPU interface into the VM before running any VCPUs. We 487 + * can't do this at creation time, because user space must first set the 488 + * virtual CPU interface address in the guest physical address space. 
489 + */ 490 + static int vgic_v2_map_resources(struct kvm *kvm, 491 + const struct vgic_params *params) 492 + { 493 + int ret = 0; 494 + 495 + if (!irqchip_in_kernel(kvm)) 496 + return 0; 497 + 498 + mutex_lock(&kvm->lock); 499 + 500 + if (vgic_ready(kvm)) 501 + goto out; 502 + 503 + if (IS_VGIC_ADDR_UNDEF(kvm->arch.vgic.vgic_dist_base) || 504 + IS_VGIC_ADDR_UNDEF(kvm->arch.vgic.vgic_cpu_base)) { 505 + kvm_err("Need to set vgic cpu and dist addresses first\n"); 506 + ret = -ENXIO; 507 + goto out; 508 + } 509 + 510 + /* 511 + * Initialize the vgic if this hasn't already been done on demand by 512 + * accessing the vgic state from userspace. 513 + */ 514 + ret = vgic_init(kvm); 515 + if (ret) { 516 + kvm_err("Unable to allocate maps\n"); 517 + goto out; 518 + } 519 + 520 + ret = kvm_phys_addr_ioremap(kvm, kvm->arch.vgic.vgic_cpu_base, 521 + params->vcpu_base, KVM_VGIC_V2_CPU_SIZE, 522 + true); 523 + if (ret) { 524 + kvm_err("Unable to remap VGIC CPU to VCPU\n"); 525 + goto out; 526 + } 527 + 528 + kvm->arch.vgic.ready = true; 529 + out: 530 + if (ret) 531 + kvm_vgic_destroy(kvm); 532 + mutex_unlock(&kvm->lock); 533 + return ret; 534 + } 535 + 536 + static void vgic_v2_add_sgi_source(struct kvm_vcpu *vcpu, int irq, int source) 537 + { 538 + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 539 + 540 + *vgic_get_sgi_sources(dist, vcpu->vcpu_id, irq) |= 1 << source; 541 + } 542 + 543 + static int vgic_v2_init_model(struct kvm *kvm) 544 + { 545 + int i; 546 + 547 + for (i = VGIC_NR_PRIVATE_IRQS; i < kvm->arch.vgic.nr_irqs; i += 4) 548 + vgic_set_target_reg(kvm, 0, i); 549 + 550 + return 0; 551 + } 552 + 553 + void vgic_v2_init_emulation(struct kvm *kvm) 554 + { 555 + struct vgic_dist *dist = &kvm->arch.vgic; 556 + 557 + dist->vm_ops.handle_mmio = vgic_v2_handle_mmio; 558 + dist->vm_ops.queue_sgi = vgic_v2_queue_sgi; 559 + dist->vm_ops.add_sgi_source = vgic_v2_add_sgi_source; 560 + dist->vm_ops.init_model = vgic_v2_init_model; 561 + dist->vm_ops.map_resources = 
vgic_v2_map_resources; 562 + 563 + kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS; 564 + } 565 + 566 + static bool handle_cpu_mmio_misc(struct kvm_vcpu *vcpu, 567 + struct kvm_exit_mmio *mmio, phys_addr_t offset) 568 + { 569 + bool updated = false; 570 + struct vgic_vmcr vmcr; 571 + u32 *vmcr_field; 572 + u32 reg; 573 + 574 + vgic_get_vmcr(vcpu, &vmcr); 575 + 576 + switch (offset & ~0x3) { 577 + case GIC_CPU_CTRL: 578 + vmcr_field = &vmcr.ctlr; 579 + break; 580 + case GIC_CPU_PRIMASK: 581 + vmcr_field = &vmcr.pmr; 582 + break; 583 + case GIC_CPU_BINPOINT: 584 + vmcr_field = &vmcr.bpr; 585 + break; 586 + case GIC_CPU_ALIAS_BINPOINT: 587 + vmcr_field = &vmcr.abpr; 588 + break; 589 + default: 590 + BUG(); 591 + } 592 + 593 + if (!mmio->is_write) { 594 + reg = *vmcr_field; 595 + mmio_data_write(mmio, ~0, reg); 596 + } else { 597 + reg = mmio_data_read(mmio, ~0); 598 + if (reg != *vmcr_field) { 599 + *vmcr_field = reg; 600 + vgic_set_vmcr(vcpu, &vmcr); 601 + updated = true; 602 + } 603 + } 604 + return updated; 605 + } 606 + 607 + static bool handle_mmio_abpr(struct kvm_vcpu *vcpu, 608 + struct kvm_exit_mmio *mmio, phys_addr_t offset) 609 + { 610 + return handle_cpu_mmio_misc(vcpu, mmio, GIC_CPU_ALIAS_BINPOINT); 611 + } 612 + 613 + static bool handle_cpu_mmio_ident(struct kvm_vcpu *vcpu, 614 + struct kvm_exit_mmio *mmio, 615 + phys_addr_t offset) 616 + { 617 + u32 reg; 618 + 619 + if (mmio->is_write) 620 + return false; 621 + 622 + /* GICC_IIDR */ 623 + reg = (PRODUCT_ID_KVM << 20) | 624 + (GICC_ARCH_VERSION_V2 << 16) | 625 + (IMPLEMENTER_ARM << 0); 626 + mmio_data_write(mmio, ~0, reg); 627 + return false; 628 + } 629 + 630 + /* 631 + * CPU Interface Register accesses - these are not accessed by the VM, but by 632 + * user space for saving and restoring VGIC state. 
633 + */ 634 + static const struct kvm_mmio_range vgic_cpu_ranges[] = { 635 + { 636 + .base = GIC_CPU_CTRL, 637 + .len = 12, 638 + .handle_mmio = handle_cpu_mmio_misc, 639 + }, 640 + { 641 + .base = GIC_CPU_ALIAS_BINPOINT, 642 + .len = 4, 643 + .handle_mmio = handle_mmio_abpr, 644 + }, 645 + { 646 + .base = GIC_CPU_ACTIVEPRIO, 647 + .len = 16, 648 + .handle_mmio = handle_mmio_raz_wi, 649 + }, 650 + { 651 + .base = GIC_CPU_IDENT, 652 + .len = 4, 653 + .handle_mmio = handle_cpu_mmio_ident, 654 + }, 655 + }; 656 + 657 + static int vgic_attr_regs_access(struct kvm_device *dev, 658 + struct kvm_device_attr *attr, 659 + u32 *reg, bool is_write) 660 + { 661 + const struct kvm_mmio_range *r = NULL, *ranges; 662 + phys_addr_t offset; 663 + int ret, cpuid, c; 664 + struct kvm_vcpu *vcpu, *tmp_vcpu; 665 + struct vgic_dist *vgic; 666 + struct kvm_exit_mmio mmio; 667 + 668 + offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK; 669 + cpuid = (attr->attr & KVM_DEV_ARM_VGIC_CPUID_MASK) >> 670 + KVM_DEV_ARM_VGIC_CPUID_SHIFT; 671 + 672 + mutex_lock(&dev->kvm->lock); 673 + 674 + ret = vgic_init(dev->kvm); 675 + if (ret) 676 + goto out; 677 + 678 + if (cpuid >= atomic_read(&dev->kvm->online_vcpus)) { 679 + ret = -EINVAL; 680 + goto out; 681 + } 682 + 683 + vcpu = kvm_get_vcpu(dev->kvm, cpuid); 684 + vgic = &dev->kvm->arch.vgic; 685 + 686 + mmio.len = 4; 687 + mmio.is_write = is_write; 688 + if (is_write) 689 + mmio_data_write(&mmio, ~0, *reg); 690 + switch (attr->group) { 691 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 692 + mmio.phys_addr = vgic->vgic_dist_base + offset; 693 + ranges = vgic_dist_ranges; 694 + break; 695 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: 696 + mmio.phys_addr = vgic->vgic_cpu_base + offset; 697 + ranges = vgic_cpu_ranges; 698 + break; 699 + default: 700 + BUG(); 701 + } 702 + r = vgic_find_range(ranges, &mmio, offset); 703 + 704 + if (unlikely(!r || !r->handle_mmio)) { 705 + ret = -ENXIO; 706 + goto out; 707 + } 708 + 709 + 710 + spin_lock(&vgic->lock); 711 + 712 + /* 
713 + * Ensure that no other VCPU is running by checking the vcpu->cpu 714 + * field. If no other VCPUs are running we can safely access the VGIC 715 + * state, because even if another VCPU is run after this point, that 716 + * VCPU will not touch the vgic state, because it will block on 717 + * getting the vgic->lock in kvm_vgic_sync_hwstate(). 718 + */ 719 + kvm_for_each_vcpu(c, tmp_vcpu, dev->kvm) { 720 + if (unlikely(tmp_vcpu->cpu != -1)) { 721 + ret = -EBUSY; 722 + goto out_vgic_unlock; 723 + } 724 + } 725 + 726 + /* 727 + * Move all pending IRQs from the LRs on all VCPUs so the pending 728 + * state can be properly represented in the register state accessible 729 + * through this API. 730 + */ 731 + kvm_for_each_vcpu(c, tmp_vcpu, dev->kvm) 732 + vgic_unqueue_irqs(tmp_vcpu); 733 + 734 + offset -= r->base; 735 + r->handle_mmio(vcpu, &mmio, offset); 736 + 737 + if (!is_write) 738 + *reg = mmio_data_read(&mmio, ~0); 739 + 740 + ret = 0; 741 + out_vgic_unlock: 742 + spin_unlock(&vgic->lock); 743 + out: 744 + mutex_unlock(&dev->kvm->lock); 745 + return ret; 746 + } 747 + 748 + static int vgic_v2_create(struct kvm_device *dev, u32 type) 749 + { 750 + return kvm_vgic_create(dev->kvm, type); 751 + } 752 + 753 + static void vgic_v2_destroy(struct kvm_device *dev) 754 + { 755 + kfree(dev); 756 + } 757 + 758 + static int vgic_v2_set_attr(struct kvm_device *dev, 759 + struct kvm_device_attr *attr) 760 + { 761 + int ret; 762 + 763 + ret = vgic_set_common_attr(dev, attr); 764 + if (ret != -ENXIO) 765 + return ret; 766 + 767 + switch (attr->group) { 768 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 769 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: { 770 + u32 __user *uaddr = (u32 __user *)(long)attr->addr; 771 + u32 reg; 772 + 773 + if (get_user(reg, uaddr)) 774 + return -EFAULT; 775 + 776 + return vgic_attr_regs_access(dev, attr, &reg, true); 777 + } 778 + 779 + } 780 + 781 + return -ENXIO; 782 + } 783 + 784 + static int vgic_v2_get_attr(struct kvm_device *dev, 785 + struct kvm_device_attr 
*attr) 786 + { 787 + int ret; 788 + 789 + ret = vgic_get_common_attr(dev, attr); 790 + if (ret != -ENXIO) 791 + return ret; 792 + 793 + switch (attr->group) { 794 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 795 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: { 796 + u32 __user *uaddr = (u32 __user *)(long)attr->addr; 797 + u32 reg = 0; 798 + 799 + ret = vgic_attr_regs_access(dev, attr, &reg, false); 800 + if (ret) 801 + return ret; 802 + return put_user(reg, uaddr); 803 + } 804 + 805 + } 806 + 807 + return -ENXIO; 808 + } 809 + 810 + static int vgic_v2_has_attr(struct kvm_device *dev, 811 + struct kvm_device_attr *attr) 812 + { 813 + phys_addr_t offset; 814 + 815 + switch (attr->group) { 816 + case KVM_DEV_ARM_VGIC_GRP_ADDR: 817 + switch (attr->attr) { 818 + case KVM_VGIC_V2_ADDR_TYPE_DIST: 819 + case KVM_VGIC_V2_ADDR_TYPE_CPU: 820 + return 0; 821 + } 822 + break; 823 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 824 + offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK; 825 + return vgic_has_attr_regs(vgic_dist_ranges, offset); 826 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: 827 + offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK; 828 + return vgic_has_attr_regs(vgic_cpu_ranges, offset); 829 + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: 830 + return 0; 831 + case KVM_DEV_ARM_VGIC_GRP_CTRL: 832 + switch (attr->attr) { 833 + case KVM_DEV_ARM_VGIC_CTRL_INIT: 834 + return 0; 835 + } 836 + } 837 + return -ENXIO; 838 + } 839 + 840 + struct kvm_device_ops kvm_arm_vgic_v2_ops = { 841 + .name = "kvm-arm-vgic-v2", 842 + .create = vgic_v2_create, 843 + .destroy = vgic_v2_destroy, 844 + .set_attr = vgic_v2_set_attr, 845 + .get_attr = vgic_v2_get_attr, 846 + .has_attr = vgic_v2_has_attr, 847 + };
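The GICD_SGIR decoding in vgic_dispatch_sgi above follows the GICv2 register layout: the SGI number lives in bits [3:0], the CPU target list in bits [23:16], and the target-list filter in bits [25:24]. A minimal standalone sketch of that decoding (hypothetical helper and struct names, not kernel code):

```c
#include <stdint.h>

struct sgi_request {
	int sgi;             /* SGI number, GICD_SGIR bits [3:0] */
	uint8_t target_cpus; /* bitmap of destination vCPUs */
};

/*
 * Rebuild the target-CPU bitmap the way vgic_dispatch_sgi does, given
 * the raw register value, the number of online vCPUs, and the sender's
 * vcpu_id.
 */
static struct sgi_request decode_sgir(uint32_t reg, int nrcpus, int vcpu_id)
{
	struct sgi_request req;
	int mode = (reg >> 24) & 3;

	req.sgi = reg & 0xf;
	req.target_cpus = (reg >> 16) & 0xff;

	switch (mode) {
	case 1: /* forward to all online CPUs except the sender */
		req.target_cpus = ((1u << nrcpus) - 1) & ~(1u << vcpu_id) & 0xff;
		break;
	case 2: /* forward to the sender only */
		req.target_cpus = 1u << vcpu_id;
		break;
	default: /* mode 0 (and reserved 3): use the written target list */
		break;
	}
	return req;
}
```

With four online vCPUs and vCPU 1 as the sender, a mode-1 write yields the bitmap 0x0d (everyone but the sender), mirroring the computation in the handler.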
virt/kvm/arm/vgic-v2.c (+4)
··· 229 229 goto out_unmap; 230 230 } 231 231 232 + vgic->can_emulate_gicv2 = true; 233 + kvm_register_device_ops(&kvm_arm_vgic_v2_ops, KVM_DEV_TYPE_ARM_VGIC_V2); 234 + 232 235 vgic->vcpu_base = vcpu_res.start; 233 236 234 237 kvm_info("%s@%llx IRQ%d\n", vgic_node->name, 235 238 vctrl_res.start, vgic->maint_irq); 236 239 237 240 vgic->type = VGIC_V2; 241 + vgic->max_gic_vcpus = VGIC_V2_MAX_CPUS; 238 242 *ops = &vgic_v2_ops; 239 243 *params = vgic; 240 244 goto out;
virt/kvm/arm/vgic-v3-emul.c (+1036)
··· 1 + /* 2 + * GICv3 distributor and redistributor emulation 3 + * 4 + * GICv3 emulation is currently only supported on a GICv3 host (because 5 + * we rely on the hardware's CPU interface virtualization support), but 6 + * supports hardware both with and without the optional GICv2 backwards 7 + * compatibility features. 8 + * 9 + * Limitations of the emulation: 10 + * (RAZ/WI: read as zero, write ignore; RAO/WI: read as one, write ignore) 11 + * - We do not support LPIs (yet). TYPER.LPIS is reported as 0 and is RAZ/WI. 12 + * - We do not support the message based interrupts (MBIs) triggered by 13 + * writes to the GICD_{SET,CLR}SPI_* registers. TYPER.MBIS is reported as 0. 14 + * - We do not support the (optional) backwards compatibility feature. 15 + * GICD_CTLR.ARE resets to 1 and is RAO/WI. If the _host_ GIC supports 16 + * the compatibility feature, you can use a GICv2 in the guest, though. 17 + * - We only support a single security state. GICD_CTLR.DS is 1 and is RAO/WI. 18 + * - Priorities are not emulated (same as the GICv2 emulation). Linux 19 + * as a guest is fine with this, because it does not use priorities. 20 + * - We only support Group1 interrupts. Again Linux uses only those. 21 + * 22 + * Copyright (C) 2014 ARM Ltd. 23 + * Author: Andre Przywara <andre.przywara@arm.com> 24 + * 25 + * This program is free software; you can redistribute it and/or modify 26 + * it under the terms of the GNU General Public License version 2 as 27 + * published by the Free Software Foundation. 28 + * 29 + * This program is distributed in the hope that it will be useful, 30 + * but WITHOUT ANY WARRANTY; without even the implied warranty of 31 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 32 + * GNU General Public License for more details. 33 + * 34 + * You should have received a copy of the GNU General Public License 35 + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
36 + */ 37 + 38 + #include <linux/cpu.h> 39 + #include <linux/kvm.h> 40 + #include <linux/kvm_host.h> 41 + #include <linux/interrupt.h> 42 + 43 + #include <linux/irqchip/arm-gic-v3.h> 44 + #include <kvm/arm_vgic.h> 45 + 46 + #include <asm/kvm_emulate.h> 47 + #include <asm/kvm_arm.h> 48 + #include <asm/kvm_mmu.h> 49 + 50 + #include "vgic.h" 51 + 52 + static bool handle_mmio_rao_wi(struct kvm_vcpu *vcpu, 53 + struct kvm_exit_mmio *mmio, phys_addr_t offset) 54 + { 55 + u32 reg = 0xffffffff; 56 + 57 + vgic_reg_access(mmio, &reg, offset, 58 + ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 59 + 60 + return false; 61 + } 62 + 63 + static bool handle_mmio_ctlr(struct kvm_vcpu *vcpu, 64 + struct kvm_exit_mmio *mmio, phys_addr_t offset) 65 + { 66 + u32 reg = 0; 67 + 68 + /* 69 + * Force ARE and DS to 1, the guest cannot change this. 70 + * For the time being we only support Group1 interrupts. 71 + */ 72 + if (vcpu->kvm->arch.vgic.enabled) 73 + reg = GICD_CTLR_ENABLE_SS_G1; 74 + reg |= GICD_CTLR_ARE_NS | GICD_CTLR_DS; 75 + 76 + vgic_reg_access(mmio, &reg, offset, 77 + ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 78 + if (mmio->is_write) { 79 + if (reg & GICD_CTLR_ENABLE_SS_G0) 80 + kvm_info("guest tried to enable unsupported Group0 interrupts\n"); 81 + vcpu->kvm->arch.vgic.enabled = !!(reg & GICD_CTLR_ENABLE_SS_G1); 82 + vgic_update_state(vcpu->kvm); 83 + return true; 84 + } 85 + return false; 86 + } 87 + 88 + /* 89 + * As this implementation does not provide compatibility 90 + * with GICv2 (ARE==1), we report zero CPUs in bits [5..7]. 91 + * Also LPIs and MBIs are not supported, so we set the respective bits to 0. 92 + * Also we report at most 2**10=1024 interrupt IDs (to match 1024 SPIs). 
93 + */ 94 + #define INTERRUPT_ID_BITS 10 95 + static bool handle_mmio_typer(struct kvm_vcpu *vcpu, 96 + struct kvm_exit_mmio *mmio, phys_addr_t offset) 97 + { 98 + u32 reg; 99 + 100 + reg = (min(vcpu->kvm->arch.vgic.nr_irqs, 1024) >> 5) - 1; 101 + 102 + reg |= (INTERRUPT_ID_BITS - 1) << 19; 103 + 104 + vgic_reg_access(mmio, &reg, offset, 105 + ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 106 + 107 + return false; 108 + } 109 + 110 + static bool handle_mmio_iidr(struct kvm_vcpu *vcpu, 111 + struct kvm_exit_mmio *mmio, phys_addr_t offset) 112 + { 113 + u32 reg; 114 + 115 + reg = (PRODUCT_ID_KVM << 24) | (IMPLEMENTER_ARM << 0); 116 + vgic_reg_access(mmio, &reg, offset, 117 + ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 118 + 119 + return false; 120 + } 121 + 122 + static bool handle_mmio_set_enable_reg_dist(struct kvm_vcpu *vcpu, 123 + struct kvm_exit_mmio *mmio, 124 + phys_addr_t offset) 125 + { 126 + if (likely(offset >= VGIC_NR_PRIVATE_IRQS / 8)) 127 + return vgic_handle_enable_reg(vcpu->kvm, mmio, offset, 128 + vcpu->vcpu_id, 129 + ACCESS_WRITE_SETBIT); 130 + 131 + vgic_reg_access(mmio, NULL, offset, 132 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 133 + return false; 134 + } 135 + 136 + static bool handle_mmio_clear_enable_reg_dist(struct kvm_vcpu *vcpu, 137 + struct kvm_exit_mmio *mmio, 138 + phys_addr_t offset) 139 + { 140 + if (likely(offset >= VGIC_NR_PRIVATE_IRQS / 8)) 141 + return vgic_handle_enable_reg(vcpu->kvm, mmio, offset, 142 + vcpu->vcpu_id, 143 + ACCESS_WRITE_CLEARBIT); 144 + 145 + vgic_reg_access(mmio, NULL, offset, 146 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 147 + return false; 148 + } 149 + 150 + static bool handle_mmio_set_pending_reg_dist(struct kvm_vcpu *vcpu, 151 + struct kvm_exit_mmio *mmio, 152 + phys_addr_t offset) 153 + { 154 + if (likely(offset >= VGIC_NR_PRIVATE_IRQS / 8)) 155 + return vgic_handle_set_pending_reg(vcpu->kvm, mmio, offset, 156 + vcpu->vcpu_id); 157 + 158 + vgic_reg_access(mmio, NULL, offset, 159 + ACCESS_READ_RAZ | 
ACCESS_WRITE_IGNORED); 160 + return false; 161 + } 162 + 163 + static bool handle_mmio_clear_pending_reg_dist(struct kvm_vcpu *vcpu, 164 + struct kvm_exit_mmio *mmio, 165 + phys_addr_t offset) 166 + { 167 + if (likely(offset >= VGIC_NR_PRIVATE_IRQS / 8)) 168 + return vgic_handle_clear_pending_reg(vcpu->kvm, mmio, offset, 169 + vcpu->vcpu_id); 170 + 171 + vgic_reg_access(mmio, NULL, offset, 172 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 173 + return false; 174 + } 175 + 176 + static bool handle_mmio_priority_reg_dist(struct kvm_vcpu *vcpu, 177 + struct kvm_exit_mmio *mmio, 178 + phys_addr_t offset) 179 + { 180 + u32 *reg; 181 + 182 + if (unlikely(offset < VGIC_NR_PRIVATE_IRQS)) { 183 + vgic_reg_access(mmio, NULL, offset, 184 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 185 + return false; 186 + } 187 + 188 + reg = vgic_bytemap_get_reg(&vcpu->kvm->arch.vgic.irq_priority, 189 + vcpu->vcpu_id, offset); 190 + vgic_reg_access(mmio, reg, offset, 191 + ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 192 + return false; 193 + } 194 + 195 + static bool handle_mmio_cfg_reg_dist(struct kvm_vcpu *vcpu, 196 + struct kvm_exit_mmio *mmio, 197 + phys_addr_t offset) 198 + { 199 + u32 *reg; 200 + 201 + if (unlikely(offset < VGIC_NR_PRIVATE_IRQS / 4)) { 202 + vgic_reg_access(mmio, NULL, offset, 203 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 204 + return false; 205 + } 206 + 207 + reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_cfg, 208 + vcpu->vcpu_id, offset >> 1); 209 + 210 + return vgic_handle_cfg_reg(reg, mmio, offset); 211 + } 212 + 213 + /* 214 + * We use a compressed version of the MPIDR (all 32 bits in one 32-bit word) 215 + * when we store the target MPIDR written by the guest. 
216 + */ 217 + static u32 compress_mpidr(unsigned long mpidr) 218 + { 219 + u32 ret; 220 + 221 + ret = MPIDR_AFFINITY_LEVEL(mpidr, 0); 222 + ret |= MPIDR_AFFINITY_LEVEL(mpidr, 1) << 8; 223 + ret |= MPIDR_AFFINITY_LEVEL(mpidr, 2) << 16; 224 + ret |= MPIDR_AFFINITY_LEVEL(mpidr, 3) << 24; 225 + 226 + return ret; 227 + } 228 + 229 + static unsigned long uncompress_mpidr(u32 value) 230 + { 231 + unsigned long mpidr; 232 + 233 + mpidr = ((value >> 0) & 0xFF) << MPIDR_LEVEL_SHIFT(0); 234 + mpidr |= ((value >> 8) & 0xFF) << MPIDR_LEVEL_SHIFT(1); 235 + mpidr |= ((value >> 16) & 0xFF) << MPIDR_LEVEL_SHIFT(2); 236 + mpidr |= (u64)((value >> 24) & 0xFF) << MPIDR_LEVEL_SHIFT(3); 237 + 238 + return mpidr; 239 + } 240 + 241 + /* 242 + * Lookup the given MPIDR value to get the vcpu_id (if there is one) 243 + * and store that in the irq_spi_cpu[] array. 244 + * This limits the number of VCPUs to 255 for now, extending the data 245 + * type (or storing kvm_vcpu pointers) should lift the limit. 246 + * Store the original MPIDR value in an extra array to support read-as-written. 247 + * Unallocated MPIDRs are translated to a special value and caught 248 + * before any array accesses. 249 + */ 250 + static bool handle_mmio_route_reg(struct kvm_vcpu *vcpu, 251 + struct kvm_exit_mmio *mmio, 252 + phys_addr_t offset) 253 + { 254 + struct kvm *kvm = vcpu->kvm; 255 + struct vgic_dist *dist = &kvm->arch.vgic; 256 + int spi; 257 + u32 reg; 258 + int vcpu_id; 259 + unsigned long *bmap, mpidr; 260 + 261 + /* 262 + * The upper 32 bits of each 64 bit register are zero, 263 + * as we don't support Aff3. 264 + */ 265 + if ((offset & 4)) { 266 + vgic_reg_access(mmio, NULL, offset, 267 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 268 + return false; 269 + } 270 + 271 + /* This region only covers SPIs, so no handling of private IRQs here. 
*/ 272 + spi = offset / 8; 273 + 274 + /* get the stored MPIDR for this IRQ */ 275 + mpidr = uncompress_mpidr(dist->irq_spi_mpidr[spi]); 276 + reg = mpidr; 277 + 278 + vgic_reg_access(mmio, &reg, offset, 279 + ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 280 + 281 + if (!mmio->is_write) 282 + return false; 283 + 284 + /* 285 + * Now clear the currently assigned vCPU from the map, making room 286 + * for the new one to be written below 287 + */ 288 + vcpu = kvm_mpidr_to_vcpu(kvm, mpidr); 289 + if (likely(vcpu)) { 290 + vcpu_id = vcpu->vcpu_id; 291 + bmap = vgic_bitmap_get_shared_map(&dist->irq_spi_target[vcpu_id]); 292 + __clear_bit(spi, bmap); 293 + } 294 + 295 + dist->irq_spi_mpidr[spi] = compress_mpidr(reg); 296 + vcpu = kvm_mpidr_to_vcpu(kvm, reg & MPIDR_HWID_BITMASK); 297 + 298 + /* 299 + * The spec says that non-existent MPIDR values should not be 300 + * forwarded to any existent (v)CPU, but should be able to become 301 + * pending anyway. We simply keep the irq_spi_target[] array empty, so 302 + * the interrupt will never be injected. 303 + * irq_spi_cpu[irq] gets a magic value in this case. 304 + */ 305 + if (likely(vcpu)) { 306 + vcpu_id = vcpu->vcpu_id; 307 + dist->irq_spi_cpu[spi] = vcpu_id; 308 + bmap = vgic_bitmap_get_shared_map(&dist->irq_spi_target[vcpu_id]); 309 + __set_bit(spi, bmap); 310 + } else { 311 + dist->irq_spi_cpu[spi] = VCPU_NOT_ALLOCATED; 312 + } 313 + 314 + vgic_update_state(kvm); 315 + 316 + return true; 317 + } 318 + 319 + /* 320 + * We should be careful about promising too much when a guest reads 321 + * this register. Don't claim to be like any hardware implementation, 322 + * but just report the GIC as version 3 - which is what a Linux guest 323 + * would check. 
324 + */ 325 + static bool handle_mmio_idregs(struct kvm_vcpu *vcpu, 326 + struct kvm_exit_mmio *mmio, 327 + phys_addr_t offset) 328 + { 329 + u32 reg = 0; 330 + 331 + switch (offset + GICD_IDREGS) { 332 + case GICD_PIDR2: 333 + reg = 0x3b; 334 + break; 335 + } 336 + 337 + vgic_reg_access(mmio, &reg, offset, 338 + ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 339 + 340 + return false; 341 + } 342 + 343 + static const struct kvm_mmio_range vgic_v3_dist_ranges[] = { 344 + { 345 + .base = GICD_CTLR, 346 + .len = 0x04, 347 + .bits_per_irq = 0, 348 + .handle_mmio = handle_mmio_ctlr, 349 + }, 350 + { 351 + .base = GICD_TYPER, 352 + .len = 0x04, 353 + .bits_per_irq = 0, 354 + .handle_mmio = handle_mmio_typer, 355 + }, 356 + { 357 + .base = GICD_IIDR, 358 + .len = 0x04, 359 + .bits_per_irq = 0, 360 + .handle_mmio = handle_mmio_iidr, 361 + }, 362 + { 363 + /* this register is optional, it is RAZ/WI if not implemented */ 364 + .base = GICD_STATUSR, 365 + .len = 0x04, 366 + .bits_per_irq = 0, 367 + .handle_mmio = handle_mmio_raz_wi, 368 + }, 369 + { 370 + /* this write only register is WI when TYPER.MBIS=0 */ 371 + .base = GICD_SETSPI_NSR, 372 + .len = 0x04, 373 + .bits_per_irq = 0, 374 + .handle_mmio = handle_mmio_raz_wi, 375 + }, 376 + { 377 + /* this write only register is WI when TYPER.MBIS=0 */ 378 + .base = GICD_CLRSPI_NSR, 379 + .len = 0x04, 380 + .bits_per_irq = 0, 381 + .handle_mmio = handle_mmio_raz_wi, 382 + }, 383 + { 384 + /* this is RAZ/WI when DS=1 */ 385 + .base = GICD_SETSPI_SR, 386 + .len = 0x04, 387 + .bits_per_irq = 0, 388 + .handle_mmio = handle_mmio_raz_wi, 389 + }, 390 + { 391 + /* this is RAZ/WI when DS=1 */ 392 + .base = GICD_CLRSPI_SR, 393 + .len = 0x04, 394 + .bits_per_irq = 0, 395 + .handle_mmio = handle_mmio_raz_wi, 396 + }, 397 + { 398 + .base = GICD_IGROUPR, 399 + .len = 0x80, 400 + .bits_per_irq = 1, 401 + .handle_mmio = handle_mmio_rao_wi, 402 + }, 403 + { 404 + .base = GICD_ISENABLER, 405 + .len = 0x80, 406 + .bits_per_irq = 1, 407 + 
.handle_mmio = handle_mmio_set_enable_reg_dist, 408 + }, 409 + { 410 + .base = GICD_ICENABLER, 411 + .len = 0x80, 412 + .bits_per_irq = 1, 413 + .handle_mmio = handle_mmio_clear_enable_reg_dist, 414 + }, 415 + { 416 + .base = GICD_ISPENDR, 417 + .len = 0x80, 418 + .bits_per_irq = 1, 419 + .handle_mmio = handle_mmio_set_pending_reg_dist, 420 + }, 421 + { 422 + .base = GICD_ICPENDR, 423 + .len = 0x80, 424 + .bits_per_irq = 1, 425 + .handle_mmio = handle_mmio_clear_pending_reg_dist, 426 + }, 427 + { 428 + .base = GICD_ISACTIVER, 429 + .len = 0x80, 430 + .bits_per_irq = 1, 431 + .handle_mmio = handle_mmio_raz_wi, 432 + }, 433 + { 434 + .base = GICD_ICACTIVER, 435 + .len = 0x80, 436 + .bits_per_irq = 1, 437 + .handle_mmio = handle_mmio_raz_wi, 438 + }, 439 + { 440 + .base = GICD_IPRIORITYR, 441 + .len = 0x400, 442 + .bits_per_irq = 8, 443 + .handle_mmio = handle_mmio_priority_reg_dist, 444 + }, 445 + { 446 + /* TARGETSRn is RES0 when ARE=1 */ 447 + .base = GICD_ITARGETSR, 448 + .len = 0x400, 449 + .bits_per_irq = 8, 450 + .handle_mmio = handle_mmio_raz_wi, 451 + }, 452 + { 453 + .base = GICD_ICFGR, 454 + .len = 0x100, 455 + .bits_per_irq = 2, 456 + .handle_mmio = handle_mmio_cfg_reg_dist, 457 + }, 458 + { 459 + /* this is RAZ/WI when DS=1 */ 460 + .base = GICD_IGRPMODR, 461 + .len = 0x80, 462 + .bits_per_irq = 1, 463 + .handle_mmio = handle_mmio_raz_wi, 464 + }, 465 + { 466 + /* this is RAZ/WI when DS=1 */ 467 + .base = GICD_NSACR, 468 + .len = 0x100, 469 + .bits_per_irq = 2, 470 + .handle_mmio = handle_mmio_raz_wi, 471 + }, 472 + { 473 + /* this is RAZ/WI when ARE=1 */ 474 + .base = GICD_SGIR, 475 + .len = 0x04, 476 + .handle_mmio = handle_mmio_raz_wi, 477 + }, 478 + { 479 + /* this is RAZ/WI when ARE=1 */ 480 + .base = GICD_CPENDSGIR, 481 + .len = 0x10, 482 + .handle_mmio = handle_mmio_raz_wi, 483 + }, 484 + { 485 + /* this is RAZ/WI when ARE=1 */ 486 + .base = GICD_SPENDSGIR, 487 + .len = 0x10, 488 + .handle_mmio = handle_mmio_raz_wi, 489 + }, 490 + { 491 + .base = 
GICD_IROUTER + 0x100, 492 + .len = 0x1ee0, 493 + .bits_per_irq = 64, 494 + .handle_mmio = handle_mmio_route_reg, 495 + }, 496 + { 497 + .base = GICD_IDREGS, 498 + .len = 0x30, 499 + .bits_per_irq = 0, 500 + .handle_mmio = handle_mmio_idregs, 501 + }, 502 + {}, 503 + }; 504 + 505 + static bool handle_mmio_set_enable_reg_redist(struct kvm_vcpu *vcpu, 506 + struct kvm_exit_mmio *mmio, 507 + phys_addr_t offset) 508 + { 509 + struct kvm_vcpu *redist_vcpu = mmio->private; 510 + 511 + return vgic_handle_enable_reg(vcpu->kvm, mmio, offset, 512 + redist_vcpu->vcpu_id, 513 + ACCESS_WRITE_SETBIT); 514 + } 515 + 516 + static bool handle_mmio_clear_enable_reg_redist(struct kvm_vcpu *vcpu, 517 + struct kvm_exit_mmio *mmio, 518 + phys_addr_t offset) 519 + { 520 + struct kvm_vcpu *redist_vcpu = mmio->private; 521 + 522 + return vgic_handle_enable_reg(vcpu->kvm, mmio, offset, 523 + redist_vcpu->vcpu_id, 524 + ACCESS_WRITE_CLEARBIT); 525 + } 526 + 527 + static bool handle_mmio_set_pending_reg_redist(struct kvm_vcpu *vcpu, 528 + struct kvm_exit_mmio *mmio, 529 + phys_addr_t offset) 530 + { 531 + struct kvm_vcpu *redist_vcpu = mmio->private; 532 + 533 + return vgic_handle_set_pending_reg(vcpu->kvm, mmio, offset, 534 + redist_vcpu->vcpu_id); 535 + } 536 + 537 + static bool handle_mmio_clear_pending_reg_redist(struct kvm_vcpu *vcpu, 538 + struct kvm_exit_mmio *mmio, 539 + phys_addr_t offset) 540 + { 541 + struct kvm_vcpu *redist_vcpu = mmio->private; 542 + 543 + return vgic_handle_clear_pending_reg(vcpu->kvm, mmio, offset, 544 + redist_vcpu->vcpu_id); 545 + } 546 + 547 + static bool handle_mmio_priority_reg_redist(struct kvm_vcpu *vcpu, 548 + struct kvm_exit_mmio *mmio, 549 + phys_addr_t offset) 550 + { 551 + struct kvm_vcpu *redist_vcpu = mmio->private; 552 + u32 *reg; 553 + 554 + reg = vgic_bytemap_get_reg(&vcpu->kvm->arch.vgic.irq_priority, 555 + redist_vcpu->vcpu_id, offset); 556 + vgic_reg_access(mmio, reg, offset, 557 + ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 558 + return false; 
559 + } 560 + 561 + static bool handle_mmio_cfg_reg_redist(struct kvm_vcpu *vcpu, 562 + struct kvm_exit_mmio *mmio, 563 + phys_addr_t offset) 564 + { 565 + struct kvm_vcpu *redist_vcpu = mmio->private; 566 + 567 + u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_cfg, 568 + redist_vcpu->vcpu_id, offset >> 1); 569 + 570 + return vgic_handle_cfg_reg(reg, mmio, offset); 571 + } 572 + 573 + static const struct kvm_mmio_range vgic_redist_sgi_ranges[] = { 574 + { 575 + .base = GICR_IGROUPR0, 576 + .len = 0x04, 577 + .bits_per_irq = 1, 578 + .handle_mmio = handle_mmio_rao_wi, 579 + }, 580 + { 581 + .base = GICR_ISENABLER0, 582 + .len = 0x04, 583 + .bits_per_irq = 1, 584 + .handle_mmio = handle_mmio_set_enable_reg_redist, 585 + }, 586 + { 587 + .base = GICR_ICENABLER0, 588 + .len = 0x04, 589 + .bits_per_irq = 1, 590 + .handle_mmio = handle_mmio_clear_enable_reg_redist, 591 + }, 592 + { 593 + .base = GICR_ISPENDR0, 594 + .len = 0x04, 595 + .bits_per_irq = 1, 596 + .handle_mmio = handle_mmio_set_pending_reg_redist, 597 + }, 598 + { 599 + .base = GICR_ICPENDR0, 600 + .len = 0x04, 601 + .bits_per_irq = 1, 602 + .handle_mmio = handle_mmio_clear_pending_reg_redist, 603 + }, 604 + { 605 + .base = GICR_ISACTIVER0, 606 + .len = 0x04, 607 + .bits_per_irq = 1, 608 + .handle_mmio = handle_mmio_raz_wi, 609 + }, 610 + { 611 + .base = GICR_ICACTIVER0, 612 + .len = 0x04, 613 + .bits_per_irq = 1, 614 + .handle_mmio = handle_mmio_raz_wi, 615 + }, 616 + { 617 + .base = GICR_IPRIORITYR0, 618 + .len = 0x20, 619 + .bits_per_irq = 8, 620 + .handle_mmio = handle_mmio_priority_reg_redist, 621 + }, 622 + { 623 + .base = GICR_ICFGR0, 624 + .len = 0x08, 625 + .bits_per_irq = 2, 626 + .handle_mmio = handle_mmio_cfg_reg_redist, 627 + }, 628 + { 629 + .base = GICR_IGRPMODR0, 630 + .len = 0x04, 631 + .bits_per_irq = 1, 632 + .handle_mmio = handle_mmio_raz_wi, 633 + }, 634 + { 635 + .base = GICR_NSACR, 636 + .len = 0x04, 637 + .handle_mmio = handle_mmio_raz_wi, 638 + }, 639 + {}, 640 + }; 641 + 642 
+ static bool handle_mmio_ctlr_redist(struct kvm_vcpu *vcpu, 643 + struct kvm_exit_mmio *mmio, 644 + phys_addr_t offset) 645 + { 646 + /* since we don't support LPIs, this register is zero for now */ 647 + vgic_reg_access(mmio, NULL, offset, 648 + ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 649 + return false; 650 + } 651 + 652 + static bool handle_mmio_typer_redist(struct kvm_vcpu *vcpu, 653 + struct kvm_exit_mmio *mmio, 654 + phys_addr_t offset) 655 + { 656 + u32 reg; 657 + u64 mpidr; 658 + struct kvm_vcpu *redist_vcpu = mmio->private; 659 + int target_vcpu_id = redist_vcpu->vcpu_id; 660 + 661 + /* the upper 32 bits contain the affinity value */ 662 + if ((offset & ~3) == 4) { 663 + mpidr = kvm_vcpu_get_mpidr_aff(redist_vcpu); 664 + reg = compress_mpidr(mpidr); 665 + 666 + vgic_reg_access(mmio, &reg, offset, 667 + ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 668 + return false; 669 + } 670 + 671 + reg = redist_vcpu->vcpu_id << 8; 672 + if (target_vcpu_id == atomic_read(&vcpu->kvm->online_vcpus) - 1) 673 + reg |= GICR_TYPER_LAST; 674 + vgic_reg_access(mmio, &reg, offset, 675 + ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 676 + return false; 677 + } 678 + 679 + static const struct kvm_mmio_range vgic_redist_ranges[] = { 680 + { 681 + .base = GICR_CTLR, 682 + .len = 0x04, 683 + .bits_per_irq = 0, 684 + .handle_mmio = handle_mmio_ctlr_redist, 685 + }, 686 + { 687 + .base = GICR_TYPER, 688 + .len = 0x08, 689 + .bits_per_irq = 0, 690 + .handle_mmio = handle_mmio_typer_redist, 691 + }, 692 + { 693 + .base = GICR_IIDR, 694 + .len = 0x04, 695 + .bits_per_irq = 0, 696 + .handle_mmio = handle_mmio_iidr, 697 + }, 698 + { 699 + .base = GICR_WAKER, 700 + .len = 0x04, 701 + .bits_per_irq = 0, 702 + .handle_mmio = handle_mmio_raz_wi, 703 + }, 704 + { 705 + .base = GICR_IDREGS, 706 + .len = 0x30, 707 + .bits_per_irq = 0, 708 + .handle_mmio = handle_mmio_idregs, 709 + }, 710 + {}, 711 + }; 712 + 713 + /* 714 + * This function splits accesses between the distributor and the two 715 + * 
redistributor parts (private/SPI). As each redistributor is accessible 716 + * from any CPU, we have to determine the affected VCPU by taking the faulting 717 + * address into account. We then pass this VCPU to the handler function via 718 + * the private parameter. 719 + */ 720 + #define SGI_BASE_OFFSET SZ_64K 721 + static bool vgic_v3_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, 722 + struct kvm_exit_mmio *mmio) 723 + { 724 + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 725 + unsigned long dbase = dist->vgic_dist_base; 726 + unsigned long rdbase = dist->vgic_redist_base; 727 + int nrcpus = atomic_read(&vcpu->kvm->online_vcpus); 728 + int vcpu_id; 729 + const struct kvm_mmio_range *mmio_range; 730 + 731 + if (is_in_range(mmio->phys_addr, mmio->len, dbase, GIC_V3_DIST_SIZE)) { 732 + return vgic_handle_mmio_range(vcpu, run, mmio, 733 + vgic_v3_dist_ranges, dbase); 734 + } 735 + 736 + if (!is_in_range(mmio->phys_addr, mmio->len, rdbase, 737 + GIC_V3_REDIST_SIZE * nrcpus)) 738 + return false; 739 + 740 + vcpu_id = (mmio->phys_addr - rdbase) / GIC_V3_REDIST_SIZE; 741 + rdbase += (vcpu_id * GIC_V3_REDIST_SIZE); 742 + mmio->private = kvm_get_vcpu(vcpu->kvm, vcpu_id); 743 + 744 + if (mmio->phys_addr >= rdbase + SGI_BASE_OFFSET) { 745 + rdbase += SGI_BASE_OFFSET; 746 + mmio_range = vgic_redist_sgi_ranges; 747 + } else { 748 + mmio_range = vgic_redist_ranges; 749 + } 750 + return vgic_handle_mmio_range(vcpu, run, mmio, mmio_range, rdbase); 751 + } 752 + 753 + static bool vgic_v3_queue_sgi(struct kvm_vcpu *vcpu, int irq) 754 + { 755 + if (vgic_queue_irq(vcpu, 0, irq)) { 756 + vgic_dist_irq_clear_pending(vcpu, irq); 757 + vgic_cpu_irq_clear(vcpu, irq); 758 + return true; 759 + } 760 + 761 + return false; 762 + } 763 + 764 + static int vgic_v3_map_resources(struct kvm *kvm, 765 + const struct vgic_params *params) 766 + { 767 + int ret = 0; 768 + struct vgic_dist *dist = &kvm->arch.vgic; 769 + 770 + if (!irqchip_in_kernel(kvm)) 771 + return 0; 772 + 773 + 
mutex_lock(&kvm->lock); 774 + 775 + if (vgic_ready(kvm)) 776 + goto out; 777 + 778 + if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base) || 779 + IS_VGIC_ADDR_UNDEF(dist->vgic_redist_base)) { 780 + kvm_err("Need to set vgic distributor addresses first\n"); 781 + ret = -ENXIO; 782 + goto out; 783 + } 784 + 785 + /* 786 + * For a VGICv3 we require the userland to explicitly initialize 787 + * the VGIC before we need to use it. 788 + */ 789 + if (!vgic_initialized(kvm)) { 790 + ret = -EBUSY; 791 + goto out; 792 + } 793 + 794 + kvm->arch.vgic.ready = true; 795 + out: 796 + if (ret) 797 + kvm_vgic_destroy(kvm); 798 + mutex_unlock(&kvm->lock); 799 + return ret; 800 + } 801 + 802 + static int vgic_v3_init_model(struct kvm *kvm) 803 + { 804 + int i; 805 + u32 mpidr; 806 + struct vgic_dist *dist = &kvm->arch.vgic; 807 + int nr_spis = dist->nr_irqs - VGIC_NR_PRIVATE_IRQS; 808 + 809 + dist->irq_spi_mpidr = kcalloc(nr_spis, sizeof(dist->irq_spi_mpidr[0]), 810 + GFP_KERNEL); 811 + 812 + if (!dist->irq_spi_mpidr) 813 + return -ENOMEM; 814 + 815 + /* Initialize the target VCPUs for each IRQ to VCPU 0 */ 816 + mpidr = compress_mpidr(kvm_vcpu_get_mpidr_aff(kvm_get_vcpu(kvm, 0))); 817 + for (i = VGIC_NR_PRIVATE_IRQS; i < dist->nr_irqs; i++) { 818 + dist->irq_spi_cpu[i - VGIC_NR_PRIVATE_IRQS] = 0; 819 + dist->irq_spi_mpidr[i - VGIC_NR_PRIVATE_IRQS] = mpidr; 820 + vgic_bitmap_set_irq_val(dist->irq_spi_target, 0, i, 1); 821 + } 822 + 823 + return 0; 824 + } 825 + 826 + /* GICv3 does not keep track of SGI sources anymore. 
*/ 827 + static void vgic_v3_add_sgi_source(struct kvm_vcpu *vcpu, int irq, int source) 828 + { 829 + } 830 + 831 + void vgic_v3_init_emulation(struct kvm *kvm) 832 + { 833 + struct vgic_dist *dist = &kvm->arch.vgic; 834 + 835 + dist->vm_ops.handle_mmio = vgic_v3_handle_mmio; 836 + dist->vm_ops.queue_sgi = vgic_v3_queue_sgi; 837 + dist->vm_ops.add_sgi_source = vgic_v3_add_sgi_source; 838 + dist->vm_ops.init_model = vgic_v3_init_model; 839 + dist->vm_ops.map_resources = vgic_v3_map_resources; 840 + 841 + kvm->arch.max_vcpus = KVM_MAX_VCPUS; 842 + } 843 + 844 + /* 845 + * Compare a given affinity (level 1-3 and a level 0 mask, from the SGI 846 + * generation register ICC_SGI1R_EL1) with a given VCPU. 847 + * If the VCPU's MPIDR matches, return the level0 affinity, otherwise 848 + * return -1. 849 + */ 850 + static int match_mpidr(u64 sgi_aff, u16 sgi_cpu_mask, struct kvm_vcpu *vcpu) 851 + { 852 + unsigned long affinity; 853 + int level0; 854 + 855 + /* 856 + * Split the current VCPU's MPIDR into affinity level 0 and the 857 + * rest as this is what we have to compare against. 858 + */ 859 + affinity = kvm_vcpu_get_mpidr_aff(vcpu); 860 + level0 = MPIDR_AFFINITY_LEVEL(affinity, 0); 861 + affinity &= ~MPIDR_LEVEL_MASK; 862 + 863 + /* bail out if the upper three levels don't match */ 864 + if (sgi_aff != affinity) 865 + return -1; 866 + 867 + /* Is this VCPU's bit set in the mask ? */ 868 + if (!(sgi_cpu_mask & BIT(level0))) 869 + return -1; 870 + 871 + return level0; 872 + } 873 + 874 + #define SGI_AFFINITY_LEVEL(reg, level) \ 875 + ((((reg) & ICC_SGI1R_AFFINITY_## level ##_MASK) \ 876 + >> ICC_SGI1R_AFFINITY_## level ##_SHIFT) << MPIDR_LEVEL_SHIFT(level)) 877 + 878 + /** 879 + * vgic_v3_dispatch_sgi - handle SGI requests from VCPUs 880 + * @vcpu: The VCPU requesting a SGI 881 + * @reg: The value written into the ICC_SGI1R_EL1 register by that VCPU 882 + * 883 + * With GICv3 (and ARE=1) CPUs trigger SGIs by writing to a system register. 
884 + * This will trap in sys_regs.c and call this function. 885 + * The ICC_SGI1R_EL1 register contains the upper three affinity levels of the 886 + * target processors as well as a bitmask of 16 Aff0 CPUs. 887 + * If the interrupt routing mode bit is not set, we iterate over all VCPUs to 888 + * check for matching ones. If this bit is set, we signal all, but not the 889 + * calling VCPU. 890 + */ 891 + void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg) 892 + { 893 + struct kvm *kvm = vcpu->kvm; 894 + struct kvm_vcpu *c_vcpu; 895 + struct vgic_dist *dist = &kvm->arch.vgic; 896 + u16 target_cpus; 897 + u64 mpidr; 898 + int sgi, c; 899 + int vcpu_id = vcpu->vcpu_id; 900 + bool broadcast; 901 + int updated = 0; 902 + 903 + sgi = (reg & ICC_SGI1R_SGI_ID_MASK) >> ICC_SGI1R_SGI_ID_SHIFT; 904 + broadcast = reg & BIT(ICC_SGI1R_IRQ_ROUTING_MODE_BIT); 905 + target_cpus = (reg & ICC_SGI1R_TARGET_LIST_MASK) >> ICC_SGI1R_TARGET_LIST_SHIFT; 906 + mpidr = SGI_AFFINITY_LEVEL(reg, 3); 907 + mpidr |= SGI_AFFINITY_LEVEL(reg, 2); 908 + mpidr |= SGI_AFFINITY_LEVEL(reg, 1); 909 + 910 + /* 911 + * We take the dist lock here, because we come from the sysregs 912 + * code path and not from the MMIO one (which already takes the lock). 913 + */ 914 + spin_lock(&dist->lock); 915 + 916 + /* 917 + * We iterate over all VCPUs to find the MPIDRs matching the request. 918 + * If we have handled one CPU, we clear its bit to detect early 919 + * if we are already finished. This avoids iterating through all 920 + * VCPUs when most of the time we just signal a single VCPU.
921 + */ 922 + kvm_for_each_vcpu(c, c_vcpu, kvm) { 923 + 924 + /* Exit early if we have dealt with all requested CPUs */ 925 + if (!broadcast && target_cpus == 0) 926 + break; 927 + 928 + /* Don't signal the calling VCPU */ 929 + if (broadcast && c == vcpu_id) 930 + continue; 931 + 932 + if (!broadcast) { 933 + int level0; 934 + 935 + level0 = match_mpidr(mpidr, target_cpus, c_vcpu); 936 + if (level0 == -1) 937 + continue; 938 + 939 + /* remove this matching VCPU from the mask */ 940 + target_cpus &= ~BIT(level0); 941 + } 942 + 943 + /* Flag the SGI as pending */ 944 + vgic_dist_irq_set_pending(c_vcpu, sgi); 945 + updated = 1; 946 + kvm_debug("SGI%d from CPU%d to CPU%d\n", sgi, vcpu_id, c); 947 + } 948 + if (updated) 949 + vgic_update_state(vcpu->kvm); 950 + spin_unlock(&dist->lock); 951 + if (updated) 952 + vgic_kick_vcpus(vcpu->kvm); 953 + } 954 + 955 + static int vgic_v3_create(struct kvm_device *dev, u32 type) 956 + { 957 + return kvm_vgic_create(dev->kvm, type); 958 + } 959 + 960 + static void vgic_v3_destroy(struct kvm_device *dev) 961 + { 962 + kfree(dev); 963 + } 964 + 965 + static int vgic_v3_set_attr(struct kvm_device *dev, 966 + struct kvm_device_attr *attr) 967 + { 968 + int ret; 969 + 970 + ret = vgic_set_common_attr(dev, attr); 971 + if (ret != -ENXIO) 972 + return ret; 973 + 974 + switch (attr->group) { 975 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 976 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: 977 + return -ENXIO; 978 + } 979 + 980 + return -ENXIO; 981 + } 982 + 983 + static int vgic_v3_get_attr(struct kvm_device *dev, 984 + struct kvm_device_attr *attr) 985 + { 986 + int ret; 987 + 988 + ret = vgic_get_common_attr(dev, attr); 989 + if (ret != -ENXIO) 990 + return ret; 991 + 992 + switch (attr->group) { 993 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 994 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: 995 + return -ENXIO; 996 + } 997 + 998 + return -ENXIO; 999 + } 1000 + 1001 + static int vgic_v3_has_attr(struct kvm_device *dev, 1002 + struct kvm_device_attr *attr) 1003 + 
{ 1004 + switch (attr->group) { 1005 + case KVM_DEV_ARM_VGIC_GRP_ADDR: 1006 + switch (attr->attr) { 1007 + case KVM_VGIC_V2_ADDR_TYPE_DIST: 1008 + case KVM_VGIC_V2_ADDR_TYPE_CPU: 1009 + return -ENXIO; 1010 + case KVM_VGIC_V3_ADDR_TYPE_DIST: 1011 + case KVM_VGIC_V3_ADDR_TYPE_REDIST: 1012 + return 0; 1013 + } 1014 + break; 1015 + case KVM_DEV_ARM_VGIC_GRP_DIST_REGS: 1016 + case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: 1017 + return -ENXIO; 1018 + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: 1019 + return 0; 1020 + case KVM_DEV_ARM_VGIC_GRP_CTRL: 1021 + switch (attr->attr) { 1022 + case KVM_DEV_ARM_VGIC_CTRL_INIT: 1023 + return 0; 1024 + } 1025 + } 1026 + return -ENXIO; 1027 + } 1028 + 1029 + struct kvm_device_ops kvm_arm_vgic_v3_ops = { 1030 + .name = "kvm-arm-vgic-v3", 1031 + .create = vgic_v3_create, 1032 + .destroy = vgic_v3_destroy, 1033 + .set_attr = vgic_v3_set_attr, 1034 + .get_attr = vgic_v3_get_attr, 1035 + .has_attr = vgic_v3_has_attr, 1036 + };
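The address arithmetic in vgic_v3_handle_mmio() above can be pulled out into a standalone sketch. The constants are assumptions mirroring the patch (each redistributor occupies two 64K frames, an RD_base page followed by an SGI_base page), and the helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define SZ_64K             0x10000UL
#define GIC_V3_REDIST_SIZE (2 * SZ_64K)  /* assumption: RD page + SGI page */
#define SGI_BASE_OFFSET    SZ_64K

/* Map a faulting physical address to the VCPU whose redistributor it hits. */
static int redist_vcpu_id(uint64_t phys_addr, uint64_t rdbase)
{
	return (int)((phys_addr - rdbase) / GIC_V3_REDIST_SIZE);
}

/* Does the access land in that redistributor's SGI (private IRQ) page? */
static int redist_is_sgi_page(uint64_t phys_addr, uint64_t rdbase, int vcpu_id)
{
	uint64_t this_rd = rdbase + (uint64_t)vcpu_id * GIC_V3_REDIST_SIZE;

	return phys_addr >= this_rd + SGI_BASE_OFFSET;
}
```

This mirrors why the handler first divides by the per-VCPU stride to pick the target redistributor, then compares against SGI_BASE_OFFSET to choose between vgic_redist_ranges and vgic_redist_sgi_ranges.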
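The field extraction in vgic_v3_dispatch_sgi() can likewise be sketched on its own. The bit positions follow the GICv3 architecture's ICC_SGI1R_EL1 layout (TargetList at [15:0], SGI INTID at [27:24], IRM at bit 40); the struct and helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of an ICC_SGI1R_EL1 write, ignoring the affinity levels. */
struct sgi1r_fields {
	int      sgi;          /* which SGI to raise (0..15) */
	uint16_t target_list;  /* bitmask of Aff0 values 0..15 */
	int      broadcast;    /* IRM set: signal all VCPUs but the caller */
};

static struct sgi1r_fields sgi1r_decode(uint64_t reg)
{
	struct sgi1r_fields f;

	f.sgi         = (int)((reg >> 24) & 0xf);
	f.target_list = (uint16_t)(reg & 0xffff);
	f.broadcast   = (int)((reg >> 40) & 1);
	return f;
}
```

With broadcast clear, the dispatch loop clears one target_list bit per matched VCPU so it can stop as soon as the mask is empty.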
+57 -25
virt/kvm/arm/vgic-v3.c
··· 34 34 #define GICH_LR_VIRTUALID (0x3ffUL << 0) 35 35 #define GICH_LR_PHYSID_CPUID_SHIFT (10) 36 36 #define GICH_LR_PHYSID_CPUID (7UL << GICH_LR_PHYSID_CPUID_SHIFT) 37 + #define ICH_LR_VIRTUALID_MASK (BIT_ULL(32) - 1) 37 38 38 39 /* 39 40 * LRs are stored in reverse order in memory. make sure we index them ··· 49 48 struct vgic_lr lr_desc; 50 49 u64 val = vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)]; 51 50 52 - lr_desc.irq = val & GICH_LR_VIRTUALID; 53 - if (lr_desc.irq <= 15) 54 - lr_desc.source = (val >> GICH_LR_PHYSID_CPUID_SHIFT) & 0x7; 51 + if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) 52 + lr_desc.irq = val & ICH_LR_VIRTUALID_MASK; 55 53 else 56 - lr_desc.source = 0; 57 - lr_desc.state = 0; 54 + lr_desc.irq = val & GICH_LR_VIRTUALID; 55 + 56 + lr_desc.source = 0; 57 + if (lr_desc.irq <= 15 && 58 + vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V2) 59 + lr_desc.source = (val >> GICH_LR_PHYSID_CPUID_SHIFT) & 0x7; 60 + 61 + lr_desc.state = 0; 58 62 59 63 if (val & ICH_LR_PENDING_BIT) 60 64 lr_desc.state |= LR_STATE_PENDING; ··· 74 68 static void vgic_v3_set_lr(struct kvm_vcpu *vcpu, int lr, 75 69 struct vgic_lr lr_desc) 76 70 { 77 - u64 lr_val = (((u32)lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT) | 78 - lr_desc.irq); 71 + u64 lr_val; 72 + 73 + lr_val = lr_desc.irq; 74 + 75 + /* 76 + * Currently all guest IRQs are Group1, as Group0 would result 77 + * in a FIQ in the guest, which it wouldn't expect. 78 + * Eventually we want to make this configurable, so we may revisit 79 + * this in the future. 
80 + */ 81 + if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) 82 + lr_val |= ICH_LR_GROUP; 83 + else 84 + lr_val |= (u32)lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT; 79 85 80 86 if (lr_desc.state & LR_STATE_PENDING) 81 87 lr_val |= ICH_LR_PENDING_BIT; ··· 163 145 164 146 static void vgic_v3_enable(struct kvm_vcpu *vcpu) 165 147 { 148 + struct vgic_v3_cpu_if *vgic_v3 = &vcpu->arch.vgic_cpu.vgic_v3; 149 + 166 150 /* 167 151 * By forcing VMCR to zero, the GIC will restore the binary 168 152 * points to their reset values. Anything else resets to zero 169 153 * anyway. 170 154 */ 171 - vcpu->arch.vgic_cpu.vgic_v3.vgic_vmcr = 0; 155 + vgic_v3->vgic_vmcr = 0; 156 + 157 + /* 158 + * If we are emulating a GICv3, we do it in a non-GICv2-compatible 159 + * way, so we force SRE to 1 to demonstrate this to the guest. 160 + * This goes with the spec allowing the value to be RAO/WI. 161 + */ 162 + if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) 163 + vgic_v3->vgic_sre = ICC_SRE_EL1_SRE; 164 + else 165 + vgic_v3->vgic_sre = 0; 172 166 173 167 /* Get the show on the road... */ 174 - vcpu->arch.vgic_cpu.vgic_v3.vgic_hcr = ICH_HCR_EN; 168 + vgic_v3->vgic_hcr = ICH_HCR_EN; 175 169 } 176 170 177 171 static const struct vgic_ops vgic_v3_ops = { ··· 235 205 * maximum of 16 list registers. Just ignore bit 4
236 206 */ 237 207 vgic->nr_lr = (ich_vtr_el2 & 0xf) + 1; 208 + vgic->can_emulate_gicv2 = false; 238 209 239 210 if (of_property_read_u32(vgic_node, "#redistributor-regions", &gicv_idx)) 240 211 gicv_idx = 1; 241 212 242 213 gicv_idx += 3; /* Also skip GICD, GICC, GICH */ 243 214 if (of_address_to_resource(vgic_node, gicv_idx, &vcpu_res)) { 244 - kvm_err("Cannot obtain GICV region\n"); 245 - ret = -ENXIO; 246 - goto out; 247 - } 248 - 249 - if (!PAGE_ALIGNED(vcpu_res.start)) { 250 - kvm_err("GICV physical address 0x%llx not page aligned\n", 215 + kvm_info("GICv3: no GICV resource entry\n"); 216 + vgic->vcpu_base = 0; 217 + } else if (!PAGE_ALIGNED(vcpu_res.start)) { 218 + pr_warn("GICV physical address 0x%llx not page aligned\n", 251 219 (unsigned long long)vcpu_res.start); 252 - ret = -ENXIO; 253 - goto out; 254 - } 255 - 256 - if (!PAGE_ALIGNED(resource_size(&vcpu_res))) { 257 - kvm_err("GICV size 0x%llx not a multiple of page size 0x%lx\n", 220 + vgic->vcpu_base = 0; 221 + } else if (!PAGE_ALIGNED(resource_size(&vcpu_res))) { 222 + pr_warn("GICV size 0x%llx not a multiple of page size 0x%lx\n", 258 223 (unsigned long long)resource_size(&vcpu_res), 259 224 PAGE_SIZE); 260 - ret = -ENXIO; 261 - goto out; 225 + vgic->vcpu_base = 0; 226 + } else { 227 + vgic->vcpu_base = vcpu_res.start; 228 + vgic->can_emulate_gicv2 = true; 229 + kvm_register_device_ops(&kvm_arm_vgic_v2_ops, 230 + KVM_DEV_TYPE_ARM_VGIC_V2); 262 231 } 232 + if (vgic->vcpu_base == 0) 233 + kvm_info("disabling GICv2 emulation\n"); 234 + kvm_register_device_ops(&kvm_arm_vgic_v3_ops, KVM_DEV_TYPE_ARM_VGIC_V3); 263 235 264 - vgic->vcpu_base = vcpu_res.start; 265 236 vgic->vctrl_base = NULL; 266 237 vgic->type = VGIC_V3; 238 + vgic->max_gic_vcpus = KVM_MAX_VCPUS; 267 239 268 240 kvm_info("%s@%llx IRQ%d\n", vgic_node->name, 269 241 vcpu_res.start, vgic->maint_irq);
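The list-register layout that vgic_v3_set_lr()/vgic_v3_get_lr() switch on can be shown in isolation. The masks below follow the patch (10-bit virtual ID plus SGI source CPU for a GICv2 guest, 32-bit virtual ID for a GICv3 guest) and the GICv3 architecture's Group bit position; the enum and helper names are invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

#define GICV2_LR_VIRTUALID_MASK 0x3ffULL          /* 10-bit virtual ID */
#define GICV3_LR_VIRTUALID_MASK ((1ULL << 32) - 1) /* 32-bit virtual ID */
#define LR_PHYSID_CPUID_SHIFT   10
#define ICH_LR_GROUP_BIT        (1ULL << 60)       /* per the GICv3 spec */

enum guest_model { GUEST_GICV2, GUEST_GICV3 };

/* GICv3 guests always get Group1 IRQs; GICv2 SGIs carry the source CPU. */
static uint64_t lr_encode(enum guest_model m, uint32_t irq, uint8_t source)
{
	uint64_t val = irq;

	if (m == GUEST_GICV3)
		val |= ICH_LR_GROUP_BIT;
	else
		val |= (uint64_t)source << LR_PHYSID_CPUID_SHIFT;
	return val;
}

static uint32_t lr_decode_irq(enum guest_model m, uint64_t val)
{
	return (uint32_t)(val & (m == GUEST_GICV3 ? GICV3_LR_VIRTUALID_MASK
						  : GICV2_LR_VIRTUALID_MASK));
}
```

Forcing Group1 matches the comment in the hunk: Group0 would surface as a FIQ, which a Linux guest does not expect.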
+286 -853
virt/kvm/arm/vgic.c
··· 75 75 * inactive as long as the external input line is held high. 76 76 */ 77 77 78 - #define VGIC_ADDR_UNDEF (-1) 79 - #define IS_VGIC_ADDR_UNDEF(_x) ((_x) == VGIC_ADDR_UNDEF) 78 + #include "vgic.h" 80 79 81 - #define PRODUCT_ID_KVM 0x4b /* ASCII code K */ 82 - #define IMPLEMENTER_ARM 0x43b 83 - #define GICC_ARCH_VERSION_V2 0x2 84 - 85 - #define ACCESS_READ_VALUE (1 << 0) 86 - #define ACCESS_READ_RAZ (0 << 0) 87 - #define ACCESS_READ_MASK(x) ((x) & (1 << 0)) 88 - #define ACCESS_WRITE_IGNORED (0 << 1) 89 - #define ACCESS_WRITE_SETBIT (1 << 1) 90 - #define ACCESS_WRITE_CLEARBIT (2 << 1) 91 - #define ACCESS_WRITE_VALUE (3 << 1) 92 - #define ACCESS_WRITE_MASK(x) ((x) & (3 << 1)) 93 - 94 - static int vgic_init(struct kvm *kvm); 95 80 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu); 96 81 static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu); 97 - static void vgic_update_state(struct kvm *kvm); 98 - static void vgic_kick_vcpus(struct kvm *kvm); 99 - static u8 *vgic_get_sgi_sources(struct vgic_dist *dist, int vcpu_id, int sgi); 100 - static void vgic_dispatch_sgi(struct kvm_vcpu *vcpu, u32 reg); 101 82 static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr); 102 83 static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr lr_desc); 103 - static void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 104 - static void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 105 84 106 85 static const struct vgic_ops *vgic_ops; 107 86 static const struct vgic_params *vgic; 87 + 88 + static void add_sgi_source(struct kvm_vcpu *vcpu, int irq, int source) 89 + { 90 + vcpu->kvm->arch.vgic.vm_ops.add_sgi_source(vcpu, irq, source); 91 + } 92 + 93 + static bool queue_sgi(struct kvm_vcpu *vcpu, int irq) 94 + { 95 + return vcpu->kvm->arch.vgic.vm_ops.queue_sgi(vcpu, irq); 96 + } 97 + 98 + int kvm_vgic_map_resources(struct kvm *kvm) 99 + { 100 + return kvm->arch.vgic.vm_ops.map_resources(kvm, vgic); 101 + } 
108 102 109 103 /* 110 104 * struct vgic_bitmap contains a bitmap made of unsigned longs, but ··· 154 160 return (unsigned long *)val; 155 161 } 156 162 157 - static u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, 158 - int cpuid, u32 offset) 163 + u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, int cpuid, u32 offset) 159 164 { 160 165 offset >>= 2; 161 166 if (!offset) ··· 172 179 return test_bit(irq - VGIC_NR_PRIVATE_IRQS, x->shared); 173 180 } 174 181 175 - static void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, int cpuid, 176 - int irq, int val) 182 + void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, int cpuid, 183 + int irq, int val) 177 184 { 178 185 unsigned long *reg; 179 186 ··· 195 202 return x->private + cpuid; 196 203 } 197 204 198 - static unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x) 205 + unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x) 199 206 { 200 207 return x->shared; 201 208 } ··· 222 229 b->shared = NULL; 223 230 } 224 231 225 - static u32 *vgic_bytemap_get_reg(struct vgic_bytemap *x, int cpuid, u32 offset) 232 + u32 *vgic_bytemap_get_reg(struct vgic_bytemap *x, int cpuid, u32 offset) 226 233 { 227 234 u32 *reg; 228 235 ··· 319 326 return vgic_bitmap_get_irq_val(&dist->irq_pending, vcpu->vcpu_id, irq); 320 327 } 321 328 322 - static void vgic_dist_irq_set_pending(struct kvm_vcpu *vcpu, int irq) 329 + void vgic_dist_irq_set_pending(struct kvm_vcpu *vcpu, int irq) 323 330 { 324 331 struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 325 332 326 333 vgic_bitmap_set_irq_val(&dist->irq_pending, vcpu->vcpu_id, irq, 1); 327 334 } 328 335 329 - static void vgic_dist_irq_clear_pending(struct kvm_vcpu *vcpu, int irq) 336 + void vgic_dist_irq_clear_pending(struct kvm_vcpu *vcpu, int irq) 330 337 { 331 338 struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 332 339 ··· 342 349 vcpu->arch.vgic_cpu.pending_shared); 343 350 } 344 351 345 - static void vgic_cpu_irq_clear(struct kvm_vcpu *vcpu, int irq) 352 + void 
vgic_cpu_irq_clear(struct kvm_vcpu *vcpu, int irq) 346 353 { 347 354 if (irq < VGIC_NR_PRIVATE_IRQS) 348 355 clear_bit(irq, vcpu->arch.vgic_cpu.pending_percpu); ··· 356 363 return vgic_irq_is_edge(vcpu, irq) || !vgic_irq_is_queued(vcpu, irq); 357 364 } 358 365 359 - static u32 mmio_data_read(struct kvm_exit_mmio *mmio, u32 mask) 360 - { 361 - return le32_to_cpu(*((u32 *)mmio->data)) & mask; 362 - } 363 - 364 - static void mmio_data_write(struct kvm_exit_mmio *mmio, u32 mask, u32 value) 365 - { 366 - *((u32 *)mmio->data) = cpu_to_le32(value) & mask; 367 - } 368 - 369 366 /** 370 367 * vgic_reg_access - access vgic register 371 368 * @mmio: pointer to the data describing the mmio access ··· 367 384 * modes defined for vgic register access 368 385 * (read,raz,write-ignored,setbit,clearbit,write) 369 386 */ 370 - static void vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg, 371 - phys_addr_t offset, int mode) 387 + void vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg, 388 + phys_addr_t offset, int mode) 372 389 { 373 390 int word_offset = (offset & 3) * 8; 374 391 u32 mask = (1UL << (mmio->len * 8)) - 1; ··· 417 434 } 418 435 } 419 436 420 - static bool handle_mmio_misc(struct kvm_vcpu *vcpu, 421 - struct kvm_exit_mmio *mmio, phys_addr_t offset) 422 - { 423 - u32 reg; 424 - u32 word_offset = offset & 3; 425 - 426 - switch (offset & ~3) { 427 - case 0: /* GICD_CTLR */ 428 - reg = vcpu->kvm->arch.vgic.enabled; 429 - vgic_reg_access(mmio, &reg, word_offset, 430 - ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 431 - if (mmio->is_write) { 432 - vcpu->kvm->arch.vgic.enabled = reg & 1; 433 - vgic_update_state(vcpu->kvm); 434 - return true; 435 - } 436 - break; 437 - 438 - case 4: /* GICD_TYPER */ 439 - reg = (atomic_read(&vcpu->kvm->online_vcpus) - 1) << 5; 440 - reg |= (vcpu->kvm->arch.vgic.nr_irqs >> 5) - 1; 441 - vgic_reg_access(mmio, &reg, word_offset, 442 - ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 443 - break; 444 - 445 - case 8: /* GICD_IIDR */ 446 - reg = 
(PRODUCT_ID_KVM << 24) | (IMPLEMENTER_ARM << 0); 447 - vgic_reg_access(mmio, &reg, word_offset, 448 - ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 449 - break; 450 - } 451 - 452 - return false; 453 - } 454 - 455 - static bool handle_mmio_raz_wi(struct kvm_vcpu *vcpu, 456 - struct kvm_exit_mmio *mmio, phys_addr_t offset) 437 + bool handle_mmio_raz_wi(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio, 438 + phys_addr_t offset) 457 439 { 458 440 vgic_reg_access(mmio, NULL, offset, 459 441 ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 460 442 return false; 461 443 } 462 444 463 - static bool handle_mmio_set_enable_reg(struct kvm_vcpu *vcpu, 464 - struct kvm_exit_mmio *mmio, 465 - phys_addr_t offset) 445 + bool vgic_handle_enable_reg(struct kvm *kvm, struct kvm_exit_mmio *mmio, 446 + phys_addr_t offset, int vcpu_id, int access) 466 447 { 467 - u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_enabled, 468 - vcpu->vcpu_id, offset); 469 - vgic_reg_access(mmio, reg, offset, 470 - ACCESS_READ_VALUE | ACCESS_WRITE_SETBIT); 448 + u32 *reg; 449 + int mode = ACCESS_READ_VALUE | access; 450 + struct kvm_vcpu *target_vcpu = kvm_get_vcpu(kvm, vcpu_id); 451 + 452 + reg = vgic_bitmap_get_reg(&kvm->arch.vgic.irq_enabled, vcpu_id, offset); 453 + vgic_reg_access(mmio, reg, offset, mode); 471 454 if (mmio->is_write) { 472 - vgic_update_state(vcpu->kvm); 455 + if (access & ACCESS_WRITE_CLEARBIT) { 456 + if (offset < 4) /* Force SGI enabled */ 457 + *reg |= 0xffff; 458 + vgic_retire_disabled_irqs(target_vcpu); 459 + } 460 + vgic_update_state(kvm); 473 461 return true; 474 462 } 475 463 476 464 return false; 477 465 } 478 466 479 - static bool handle_mmio_clear_enable_reg(struct kvm_vcpu *vcpu, 480 - struct kvm_exit_mmio *mmio, 481 - phys_addr_t offset) 482 - { 483 - u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_enabled, 484 - vcpu->vcpu_id, offset); 485 - vgic_reg_access(mmio, reg, offset, 486 - ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT); 487 - if (mmio->is_write) { 488 - if 
(offset < 4) /* Force SGI enabled */ 489 - *reg |= 0xffff; 490 - vgic_retire_disabled_irqs(vcpu); 491 - vgic_update_state(vcpu->kvm); 492 - return true; 493 - } 494 - 495 - return false; 496 - } 497 - 498 - static bool handle_mmio_set_pending_reg(struct kvm_vcpu *vcpu, 499 - struct kvm_exit_mmio *mmio, 500 - phys_addr_t offset) 467 + bool vgic_handle_set_pending_reg(struct kvm *kvm, 468 + struct kvm_exit_mmio *mmio, 469 + phys_addr_t offset, int vcpu_id) 501 470 { 502 471 u32 *reg, orig; 503 472 u32 level_mask; 504 - struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 473 + int mode = ACCESS_READ_VALUE | ACCESS_WRITE_SETBIT; 474 + struct vgic_dist *dist = &kvm->arch.vgic; 505 475 506 - reg = vgic_bitmap_get_reg(&dist->irq_cfg, vcpu->vcpu_id, offset); 476 + reg = vgic_bitmap_get_reg(&dist->irq_cfg, vcpu_id, offset); 507 477 level_mask = (~(*reg)); 508 478 509 479 /* Mark both level and edge triggered irqs as pending */ 510 - reg = vgic_bitmap_get_reg(&dist->irq_pending, vcpu->vcpu_id, offset); 480 + reg = vgic_bitmap_get_reg(&dist->irq_pending, vcpu_id, offset); 511 481 orig = *reg; 512 - vgic_reg_access(mmio, reg, offset, 513 - ACCESS_READ_VALUE | ACCESS_WRITE_SETBIT); 482 + vgic_reg_access(mmio, reg, offset, mode); 514 483 515 484 if (mmio->is_write) { 516 485 /* Set the soft-pending flag only for level-triggered irqs */ 517 486 reg = vgic_bitmap_get_reg(&dist->irq_soft_pend, 518 - vcpu->vcpu_id, offset); 519 - vgic_reg_access(mmio, reg, offset, 520 - ACCESS_READ_VALUE | ACCESS_WRITE_SETBIT); 487 + vcpu_id, offset); 488 + vgic_reg_access(mmio, reg, offset, mode); 521 489 *reg &= level_mask; 522 490 523 491 /* Ignore writes to SGIs */ ··· 477 543 *reg |= orig & 0xffff; 478 544 } 479 545 480 - vgic_update_state(vcpu->kvm); 546 + vgic_update_state(kvm); 481 547 return true; 482 548 } 483 549 484 550 return false; 485 551 } 486 552 487 - static bool handle_mmio_clear_pending_reg(struct kvm_vcpu *vcpu, 488 - struct kvm_exit_mmio *mmio, 489 - phys_addr_t offset) 553 + bool 
vgic_handle_clear_pending_reg(struct kvm *kvm, 554 + struct kvm_exit_mmio *mmio, 555 + phys_addr_t offset, int vcpu_id) 490 556 { 491 557 u32 *level_active; 492 558 u32 *reg, orig; 493 - struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 559 + int mode = ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT; 560 + struct vgic_dist *dist = &kvm->arch.vgic; 494 561 495 - reg = vgic_bitmap_get_reg(&dist->irq_pending, vcpu->vcpu_id, offset); 562 + reg = vgic_bitmap_get_reg(&dist->irq_pending, vcpu_id, offset); 496 563 orig = *reg; 497 - vgic_reg_access(mmio, reg, offset, 498 - ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT); 564 + vgic_reg_access(mmio, reg, offset, mode); 499 565 if (mmio->is_write) { 500 566 /* Re-set level triggered level-active interrupts */ 501 567 level_active = vgic_bitmap_get_reg(&dist->irq_level, 502 - vcpu->vcpu_id, offset); 503 - reg = vgic_bitmap_get_reg(&dist->irq_pending, 504 - vcpu->vcpu_id, offset); 568 + vcpu_id, offset); 569 + reg = vgic_bitmap_get_reg(&dist->irq_pending, vcpu_id, offset); 505 570 *reg |= *level_active; 506 571 507 572 /* Ignore writes to SGIs */ ··· 511 578 512 579 /* Clear soft-pending flags */ 513 580 reg = vgic_bitmap_get_reg(&dist->irq_soft_pend, 514 - vcpu->vcpu_id, offset); 515 - vgic_reg_access(mmio, reg, offset, 516 - ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT); 581 + vcpu_id, offset); 582 + vgic_reg_access(mmio, reg, offset, mode); 517 583 518 - vgic_update_state(vcpu->kvm); 584 + vgic_update_state(kvm); 519 585 return true; 520 586 } 521 - 522 - return false; 523 - } 524 - 525 - static bool handle_mmio_priority_reg(struct kvm_vcpu *vcpu, 526 - struct kvm_exit_mmio *mmio, 527 - phys_addr_t offset) 528 - { 529 - u32 *reg = vgic_bytemap_get_reg(&vcpu->kvm->arch.vgic.irq_priority, 530 - vcpu->vcpu_id, offset); 531 - vgic_reg_access(mmio, reg, offset, 532 - ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 533 - return false; 534 - } 535 - 536 - #define GICD_ITARGETSR_SIZE 32 537 - #define GICD_CPUTARGETS_BITS 8 538 - #define 
GICD_IRQS_PER_ITARGETSR (GICD_ITARGETSR_SIZE / GICD_CPUTARGETS_BITS) 539 - static u32 vgic_get_target_reg(struct kvm *kvm, int irq) 540 - { 541 - struct vgic_dist *dist = &kvm->arch.vgic; 542 - int i; 543 - u32 val = 0; 544 - 545 - irq -= VGIC_NR_PRIVATE_IRQS; 546 - 547 - for (i = 0; i < GICD_IRQS_PER_ITARGETSR; i++) 548 - val |= 1 << (dist->irq_spi_cpu[irq + i] + i * 8); 549 - 550 - return val; 551 - } 552 - 553 - static void vgic_set_target_reg(struct kvm *kvm, u32 val, int irq) 554 - { 555 - struct vgic_dist *dist = &kvm->arch.vgic; 556 - struct kvm_vcpu *vcpu; 557 - int i, c; 558 - unsigned long *bmap; 559 - u32 target; 560 - 561 - irq -= VGIC_NR_PRIVATE_IRQS; 562 - 563 - /* 564 - * Pick the LSB in each byte. This ensures we target exactly 565 - * one vcpu per IRQ. If the byte is null, assume we target 566 - * CPU0. 567 - */ 568 - for (i = 0; i < GICD_IRQS_PER_ITARGETSR; i++) { 569 - int shift = i * GICD_CPUTARGETS_BITS; 570 - target = ffs((val >> shift) & 0xffU); 571 - target = target ? 
(target - 1) : 0; 572 - dist->irq_spi_cpu[irq + i] = target; 573 - kvm_for_each_vcpu(c, vcpu, kvm) { 574 - bmap = vgic_bitmap_get_shared_map(&dist->irq_spi_target[c]); 575 - if (c == target) 576 - set_bit(irq + i, bmap); 577 - else 578 - clear_bit(irq + i, bmap); 579 - } 580 - } 581 - } 582 - 583 - static bool handle_mmio_target_reg(struct kvm_vcpu *vcpu, 584 - struct kvm_exit_mmio *mmio, 585 - phys_addr_t offset) 586 - { 587 - u32 reg; 588 - 589 - /* We treat the banked interrupts targets as read-only */ 590 - if (offset < 32) { 591 - u32 roreg = 1 << vcpu->vcpu_id; 592 - roreg |= roreg << 8; 593 - roreg |= roreg << 16; 594 - 595 - vgic_reg_access(mmio, &roreg, offset, 596 - ACCESS_READ_VALUE | ACCESS_WRITE_IGNORED); 597 - return false; 598 - } 599 - 600 - reg = vgic_get_target_reg(vcpu->kvm, offset & ~3U); 601 - vgic_reg_access(mmio, &reg, offset, 602 - ACCESS_READ_VALUE | ACCESS_WRITE_VALUE); 603 - if (mmio->is_write) { 604 - vgic_set_target_reg(vcpu->kvm, reg, offset & ~3U); 605 - vgic_update_state(vcpu->kvm); 606 - return true; 607 - } 608 - 609 587 return false; 610 588 } 611 589 ··· 555 711 * LSB is always 0. 
As such, we only keep the upper bit, and use the 556 712 * two above functions to compress/expand the bits 557 713 */ 558 - static bool handle_mmio_cfg_reg(struct kvm_vcpu *vcpu, 559 - struct kvm_exit_mmio *mmio, phys_addr_t offset) 714 + bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio *mmio, 715 + phys_addr_t offset) 560 716 { 561 717 u32 val; 562 - u32 *reg; 563 - 564 - reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_cfg, 565 - vcpu->vcpu_id, offset >> 1); 566 718 567 719 if (offset & 4) 568 720 val = *reg >> 16; ··· 587 747 return false; 588 748 } 589 749 590 - static bool handle_mmio_sgi_reg(struct kvm_vcpu *vcpu, 591 - struct kvm_exit_mmio *mmio, phys_addr_t offset) 592 - { 593 - u32 reg; 594 - vgic_reg_access(mmio, &reg, offset, 595 - ACCESS_READ_RAZ | ACCESS_WRITE_VALUE); 596 - if (mmio->is_write) { 597 - vgic_dispatch_sgi(vcpu, reg); 598 - vgic_update_state(vcpu->kvm); 599 - return true; 600 - } 601 - 602 - return false; 603 - } 604 - 605 750 /** 606 751 * vgic_unqueue_irqs - move pending IRQs from LRs to the distributor 607 752 * @vgic_cpu: Pointer to the vgic_cpu struct holding the LRs ··· 599 774 * to the distributor but the active state stays in the LRs, because we don't 600 775 * track the active state on the distributor side. 
601 776 */ 602 - static void vgic_unqueue_irqs(struct kvm_vcpu *vcpu) 777 + void vgic_unqueue_irqs(struct kvm_vcpu *vcpu) 603 778 { 604 - struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 605 779 struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 606 - int vcpu_id = vcpu->vcpu_id; 607 780 int i; 608 781 609 782 for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) { ··· 628 805 */ 629 806 vgic_dist_irq_set_pending(vcpu, lr.irq); 630 807 if (lr.irq < VGIC_NR_SGIS) 631 - *vgic_get_sgi_sources(dist, vcpu_id, lr.irq) |= 1 << lr.source; 808 + add_sgi_source(vcpu, lr.irq, lr.source); 632 809 lr.state &= ~LR_STATE_PENDING; 633 810 vgic_set_lr(vcpu, i, lr); 634 811 ··· 647 824 } 648 825 } 649 826 650 - /* Handle reads of GICD_CPENDSGIRn and GICD_SPENDSGIRn */ 651 - static bool read_set_clear_sgi_pend_reg(struct kvm_vcpu *vcpu, 652 - struct kvm_exit_mmio *mmio, 653 - phys_addr_t offset) 654 - { 655 - struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 656 - int sgi; 657 - int min_sgi = (offset & ~0x3); 658 - int max_sgi = min_sgi + 3; 659 - int vcpu_id = vcpu->vcpu_id; 660 - u32 reg = 0; 661 - 662 - /* Copy source SGIs from distributor side */ 663 - for (sgi = min_sgi; sgi <= max_sgi; sgi++) { 664 - int shift = 8 * (sgi - min_sgi); 665 - reg |= ((u32)*vgic_get_sgi_sources(dist, vcpu_id, sgi)) << shift; 666 - } 667 - 668 - mmio_data_write(mmio, ~0, reg); 669 - return false; 670 - } 671 - 672 - static bool write_set_clear_sgi_pend_reg(struct kvm_vcpu *vcpu, 673 - struct kvm_exit_mmio *mmio, 674 - phys_addr_t offset, bool set) 675 - { 676 - struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 677 - int sgi; 678 - int min_sgi = (offset & ~0x3); 679 - int max_sgi = min_sgi + 3; 680 - int vcpu_id = vcpu->vcpu_id; 681 - u32 reg; 682 - bool updated = false; 683 - 684 - reg = mmio_data_read(mmio, ~0); 685 - 686 - /* Clear pending SGIs on the distributor */ 687 - for (sgi = min_sgi; sgi <= max_sgi; sgi++) { 688 - u8 mask = reg >> (8 * (sgi - min_sgi)); 689 - u8 *src = 
vgic_get_sgi_sources(dist, vcpu_id, sgi); 690 - if (set) { 691 - if ((*src & mask) != mask) 692 - updated = true; 693 - *src |= mask; 694 - } else { 695 - if (*src & mask) 696 - updated = true; 697 - *src &= ~mask; 698 - } 699 - } 700 - 701 - if (updated) 702 - vgic_update_state(vcpu->kvm); 703 - 704 - return updated; 705 - } 706 - 707 - static bool handle_mmio_sgi_set(struct kvm_vcpu *vcpu, 708 - struct kvm_exit_mmio *mmio, 709 - phys_addr_t offset) 710 - { 711 - if (!mmio->is_write) 712 - return read_set_clear_sgi_pend_reg(vcpu, mmio, offset); 713 - else 714 - return write_set_clear_sgi_pend_reg(vcpu, mmio, offset, true); 715 - } 716 - 717 - static bool handle_mmio_sgi_clear(struct kvm_vcpu *vcpu, 718 - struct kvm_exit_mmio *mmio, 719 - phys_addr_t offset) 720 - { 721 - if (!mmio->is_write) 722 - return read_set_clear_sgi_pend_reg(vcpu, mmio, offset); 723 - else 724 - return write_set_clear_sgi_pend_reg(vcpu, mmio, offset, false); 725 - } 726 - 727 - /* 728 - * I would have liked to use the kvm_bus_io_*() API instead, but it 729 - * cannot cope with banked registers (only the VM pointer is passed 730 - * around, and we need the vcpu). One of these days, someone please 731 - * fix it! 
732 - */ 733 - struct mmio_range { 734 - phys_addr_t base; 735 - unsigned long len; 736 - int bits_per_irq; 737 - bool (*handle_mmio)(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio, 738 - phys_addr_t offset); 739 - }; 740 - 741 - static const struct mmio_range vgic_dist_ranges[] = { 742 - { 743 - .base = GIC_DIST_CTRL, 744 - .len = 12, 745 - .bits_per_irq = 0, 746 - .handle_mmio = handle_mmio_misc, 747 - }, 748 - { 749 - .base = GIC_DIST_IGROUP, 750 - .len = VGIC_MAX_IRQS / 8, 751 - .bits_per_irq = 1, 752 - .handle_mmio = handle_mmio_raz_wi, 753 - }, 754 - { 755 - .base = GIC_DIST_ENABLE_SET, 756 - .len = VGIC_MAX_IRQS / 8, 757 - .bits_per_irq = 1, 758 - .handle_mmio = handle_mmio_set_enable_reg, 759 - }, 760 - { 761 - .base = GIC_DIST_ENABLE_CLEAR, 762 - .len = VGIC_MAX_IRQS / 8, 763 - .bits_per_irq = 1, 764 - .handle_mmio = handle_mmio_clear_enable_reg, 765 - }, 766 - { 767 - .base = GIC_DIST_PENDING_SET, 768 - .len = VGIC_MAX_IRQS / 8, 769 - .bits_per_irq = 1, 770 - .handle_mmio = handle_mmio_set_pending_reg, 771 - }, 772 - { 773 - .base = GIC_DIST_PENDING_CLEAR, 774 - .len = VGIC_MAX_IRQS / 8, 775 - .bits_per_irq = 1, 776 - .handle_mmio = handle_mmio_clear_pending_reg, 777 - }, 778 - { 779 - .base = GIC_DIST_ACTIVE_SET, 780 - .len = VGIC_MAX_IRQS / 8, 781 - .bits_per_irq = 1, 782 - .handle_mmio = handle_mmio_raz_wi, 783 - }, 784 - { 785 - .base = GIC_DIST_ACTIVE_CLEAR, 786 - .len = VGIC_MAX_IRQS / 8, 787 - .bits_per_irq = 1, 788 - .handle_mmio = handle_mmio_raz_wi, 789 - }, 790 - { 791 - .base = GIC_DIST_PRI, 792 - .len = VGIC_MAX_IRQS, 793 - .bits_per_irq = 8, 794 - .handle_mmio = handle_mmio_priority_reg, 795 - }, 796 - { 797 - .base = GIC_DIST_TARGET, 798 - .len = VGIC_MAX_IRQS, 799 - .bits_per_irq = 8, 800 - .handle_mmio = handle_mmio_target_reg, 801 - }, 802 - { 803 - .base = GIC_DIST_CONFIG, 804 - .len = VGIC_MAX_IRQS / 4, 805 - .bits_per_irq = 2, 806 - .handle_mmio = handle_mmio_cfg_reg, 807 - }, 808 - { 809 - .base = GIC_DIST_SOFTINT, 810 - .len = 4, 
811 - .handle_mmio = handle_mmio_sgi_reg, 812 - }, 813 - { 814 - .base = GIC_DIST_SGI_PENDING_CLEAR, 815 - .len = VGIC_NR_SGIS, 816 - .handle_mmio = handle_mmio_sgi_clear, 817 - }, 818 - { 819 - .base = GIC_DIST_SGI_PENDING_SET, 820 - .len = VGIC_NR_SGIS, 821 - .handle_mmio = handle_mmio_sgi_set, 822 - }, 823 - {} 824 - }; 825 - 826 - static const 827 - struct mmio_range *find_matching_range(const struct mmio_range *ranges, 827 + const 828 + struct kvm_mmio_range *vgic_find_range(const struct kvm_mmio_range *ranges, 828 829 struct kvm_exit_mmio *mmio, 829 830 phys_addr_t offset) 830 831 { 831 - const struct mmio_range *r = ranges; 832 + const struct kvm_mmio_range *r = ranges; 832 833 833 834 while (r->len) { 834 835 if (offset >= r->base && ··· 665 1018 } 666 1019 667 1020 static bool vgic_validate_access(const struct vgic_dist *dist, 668 - const struct mmio_range *range, 1021 + const struct kvm_mmio_range *range, 669 1022 unsigned long offset) 670 1023 { 671 1024 int irq; ··· 680 1033 return true; 681 1034 } 682 1035 1036 + /* 1037 + * Call the respective handler function for the given range. 1038 + * We split up any 64 bit accesses into two consecutive 32 bit 1039 + * handler calls and merge the result afterwards. 1040 + * We do this in a little endian fashion regardless of the host's 1041 + * or guest's endianness, because the GIC is always LE and the rest of 1042 + * the code (vgic_reg_access) also puts it in a LE fashion already. 1043 + * At this point we have already identified the handle function, so 1044 + * range points to that one entry and offset is relative to this. 
1045 + */ 1046 + static bool call_range_handler(struct kvm_vcpu *vcpu, 1047 + struct kvm_exit_mmio *mmio, 1048 + unsigned long offset, 1049 + const struct kvm_mmio_range *range) 1050 + { 1051 + u32 *data32 = (void *)mmio->data; 1052 + struct kvm_exit_mmio mmio32; 1053 + bool ret; 1054 + 1055 + if (likely(mmio->len <= 4)) 1056 + return range->handle_mmio(vcpu, mmio, offset); 1057 + 1058 + /* 1059 + * Any access bigger than 4 bytes (that we currently handle in KVM) 1060 + * is actually 8 bytes long, caused by a 64-bit access 1061 + */ 1062 + 1063 + mmio32.len = 4; 1064 + mmio32.is_write = mmio->is_write; 1065 + mmio32.private = mmio->private; 1066 + 1067 + mmio32.phys_addr = mmio->phys_addr + 4; 1068 + if (mmio->is_write) 1069 + *(u32 *)mmio32.data = data32[1]; 1070 + ret = range->handle_mmio(vcpu, &mmio32, offset + 4); 1071 + if (!mmio->is_write) 1072 + data32[1] = *(u32 *)mmio32.data; 1073 + 1074 + mmio32.phys_addr = mmio->phys_addr; 1075 + if (mmio->is_write) 1076 + *(u32 *)mmio32.data = data32[0]; 1077 + ret |= range->handle_mmio(vcpu, &mmio32, offset); 1078 + if (!mmio->is_write) 1079 + data32[0] = *(u32 *)mmio32.data; 1080 + 1081 + return ret; 1082 + } 1083 + 683 1084 /** 684 - * vgic_handle_mmio - handle an in-kernel MMIO access 1085 + * vgic_handle_mmio_range - handle an in-kernel MMIO access 685 1086 * @vcpu: pointer to the vcpu performing the access 686 1087 * @run: pointer to the kvm_run structure 687 1088 * @mmio: pointer to the data describing the access 1089 + * @ranges: array of MMIO ranges in a given region 1090 + * @mmio_base: base address of that region 688 1091 * 689 - * returns true if the MMIO access has been performed in kernel space, 690 - * and false if it needs to be emulated in user space. 
1092 + * returns true if the MMIO access could be performed 691 1093 */ 692 - bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, 693 - struct kvm_exit_mmio *mmio) 1094 + bool vgic_handle_mmio_range(struct kvm_vcpu *vcpu, struct kvm_run *run, 1095 + struct kvm_exit_mmio *mmio, 1096 + const struct kvm_mmio_range *ranges, 1097 + unsigned long mmio_base) 694 1098 { 695 - const struct mmio_range *range; 1099 + const struct kvm_mmio_range *range; 696 1100 struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 697 - unsigned long base = dist->vgic_dist_base; 698 1101 bool updated_state; 699 1102 unsigned long offset; 700 1103 701 - if (!irqchip_in_kernel(vcpu->kvm) || 702 - mmio->phys_addr < base || 703 - (mmio->phys_addr + mmio->len) > (base + KVM_VGIC_V2_DIST_SIZE)) 704 - return false; 705 - 706 - /* We don't support ldrd / strd or ldm / stm to the emulated vgic */ 707 - if (mmio->len > 4) { 708 - kvm_inject_dabt(vcpu, mmio->phys_addr); 709 - return true; 710 - } 711 - 712 - offset = mmio->phys_addr - base; 713 - range = find_matching_range(vgic_dist_ranges, mmio, offset); 1104 + offset = mmio->phys_addr - mmio_base; 1105 + range = vgic_find_range(ranges, mmio, offset); 714 1106 if (unlikely(!range || !range->handle_mmio)) { 715 1107 pr_warn("Unhandled access %d %08llx %d\n", 716 1108 mmio->is_write, mmio->phys_addr, mmio->len); ··· 757 1071 } 758 1072 759 1073 spin_lock(&vcpu->kvm->arch.vgic.lock); 760 - offset = mmio->phys_addr - range->base - base; 1074 + offset -= range->base; 761 1075 if (vgic_validate_access(dist, range, offset)) { 762 - updated_state = range->handle_mmio(vcpu, mmio, offset); 1076 + updated_state = call_range_handler(vcpu, mmio, offset, range); 763 1077 } else { 764 - vgic_reg_access(mmio, NULL, offset, 765 - ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED); 1078 + if (!mmio->is_write) 1079 + memset(mmio->data, 0, mmio->len); 766 1080 updated_state = false; 767 1081 } 768 1082 spin_unlock(&vcpu->kvm->arch.vgic.lock); ··· 775 1089 return true; 776 
1090 } 777 1091 778 - static u8 *vgic_get_sgi_sources(struct vgic_dist *dist, int vcpu_id, int sgi) 1092 + /** 1093 + * vgic_handle_mmio - handle an in-kernel MMIO access for the GIC emulation 1094 + * @vcpu: pointer to the vcpu performing the access 1095 + * @run: pointer to the kvm_run structure 1096 + * @mmio: pointer to the data describing the access 1097 + * 1098 + * returns true if the MMIO access has been performed in kernel space, 1099 + * and false if it needs to be emulated in user space. 1100 + * Calls the actual handling routine for the selected VGIC model. 1101 + */ 1102 + bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, 1103 + struct kvm_exit_mmio *mmio) 779 1104 { 780 - return dist->irq_sgi_sources + vcpu_id * VGIC_NR_SGIS + sgi; 781 - } 1105 + if (!irqchip_in_kernel(vcpu->kvm)) 1106 + return false; 782 1107 783 - static void vgic_dispatch_sgi(struct kvm_vcpu *vcpu, u32 reg) 784 - { 785 - struct kvm *kvm = vcpu->kvm; 786 - struct vgic_dist *dist = &kvm->arch.vgic; 787 - int nrcpus = atomic_read(&kvm->online_vcpus); 788 - u8 target_cpus; 789 - int sgi, mode, c, vcpu_id; 790 - 791 - vcpu_id = vcpu->vcpu_id; 792 - 793 - sgi = reg & 0xf; 794 - target_cpus = (reg >> 16) & 0xff; 795 - mode = (reg >> 24) & 3; 796 - 797 - switch (mode) { 798 - case 0: 799 - if (!target_cpus) 800 - return; 801 - break; 802 - 803 - case 1: 804 - target_cpus = ((1 << nrcpus) - 1) & ~(1 << vcpu_id) & 0xff; 805 - break; 806 - 807 - case 2: 808 - target_cpus = 1 << vcpu_id; 809 - break; 810 - } 811 - 812 - kvm_for_each_vcpu(c, vcpu, kvm) { 813 - if (target_cpus & 1) { 814 - /* Flag the SGI as pending */ 815 - vgic_dist_irq_set_pending(vcpu, sgi); 816 - *vgic_get_sgi_sources(dist, c, sgi) |= 1 << vcpu_id; 817 - kvm_debug("SGI%d from CPU%d to CPU%d\n", sgi, vcpu_id, c); 818 - } 819 - 820 - target_cpus >>= 1; 821 - } 1108 + /* 1109 + * This will currently call either vgic_v2_handle_mmio() or 1110 + * vgic_v3_handle_mmio(), which in turn will call 1111 + * 
vgic_handle_mmio_range() defined above. 1112 + */ 1113 + return vcpu->kvm->arch.vgic.vm_ops.handle_mmio(vcpu, run, mmio); 822 1114 } 823 1115 824 1116 static int vgic_nr_shared_irqs(struct vgic_dist *dist) ··· 837 1173 * Update the interrupt state and determine which CPUs have pending 838 1174 * interrupts. Must be called with distributor lock held. 839 1175 */ 840 - static void vgic_update_state(struct kvm *kvm) 1176 + void vgic_update_state(struct kvm *kvm) 841 1177 { 842 1178 struct vgic_dist *dist = &kvm->arch.vgic; 843 1179 struct kvm_vcpu *vcpu; ··· 898 1234 vgic_ops->disable_underflow(vcpu); 899 1235 } 900 1236 901 - static inline void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr) 1237 + void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr) 902 1238 { 903 1239 vgic_ops->get_vmcr(vcpu, vmcr); 904 1240 } 905 1241 906 - static void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr) 1242 + void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr) 907 1243 { 908 1244 vgic_ops->set_vmcr(vcpu, vmcr); 909 1245 } ··· 952 1288 /* 953 1289 * Queue an interrupt to a CPU virtual interface. Return true on success, 954 1290 * or false if it wasn't possible to queue it. 1291 + * sgi_source must be zero for any non-SGI interrupts. 
955 1292 */ 956 - static bool vgic_queue_irq(struct kvm_vcpu *vcpu, u8 sgi_source_id, int irq) 1293 + bool vgic_queue_irq(struct kvm_vcpu *vcpu, u8 sgi_source_id, int irq) 957 1294 { 958 1295 struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 959 1296 struct vgic_dist *dist = &vcpu->kvm->arch.vgic; ··· 1003 1338 return true; 1004 1339 } 1005 1340 1006 - static bool vgic_queue_sgi(struct kvm_vcpu *vcpu, int irq) 1007 - { 1008 - struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 1009 - unsigned long sources; 1010 - int vcpu_id = vcpu->vcpu_id; 1011 - int c; 1012 - 1013 - sources = *vgic_get_sgi_sources(dist, vcpu_id, irq); 1014 - 1015 - for_each_set_bit(c, &sources, dist->nr_cpus) { 1016 - if (vgic_queue_irq(vcpu, c, irq)) 1017 - clear_bit(c, &sources); 1018 - } 1019 - 1020 - *vgic_get_sgi_sources(dist, vcpu_id, irq) = sources; 1021 - 1022 - /* 1023 - * If the sources bitmap has been cleared it means that we 1024 - * could queue all the SGIs onto link registers (see the 1025 - * clear_bit above), and therefore we are done with them in 1026 - * our emulated gic and can get rid of them. 
1027 - */ 1028 - if (!sources) { 1029 - vgic_dist_irq_clear_pending(vcpu, irq); 1030 - vgic_cpu_irq_clear(vcpu, irq); 1031 - return true; 1032 - } 1033 - 1034 - return false; 1035 - } 1036 - 1037 1341 static bool vgic_queue_hwirq(struct kvm_vcpu *vcpu, int irq) 1038 1342 { 1039 1343 if (!vgic_can_sample_irq(vcpu, irq)) ··· 1047 1413 1048 1414 /* SGIs */ 1049 1415 for_each_set_bit(i, vgic_cpu->pending_percpu, VGIC_NR_SGIS) { 1050 - if (!vgic_queue_sgi(vcpu, i)) 1416 + if (!queue_sgi(vcpu, i)) 1051 1417 overflow = 1; 1052 1418 } 1053 1419 ··· 1209 1575 return test_bit(vcpu->vcpu_id, dist->irq_pending_on_cpu); 1210 1576 } 1211 1577 1212 - static void vgic_kick_vcpus(struct kvm *kvm) 1578 + void vgic_kick_vcpus(struct kvm *kvm) 1213 1579 { 1214 1580 struct kvm_vcpu *vcpu; 1215 1581 int c; ··· 1249 1615 struct kvm_vcpu *vcpu; 1250 1616 int edge_triggered, level_triggered; 1251 1617 int enabled; 1252 - bool ret = true; 1618 + bool ret = true, can_inject = true; 1253 1619 1254 1620 spin_lock(&dist->lock); 1255 1621 ··· 1264 1630 1265 1631 if (irq_num >= VGIC_NR_PRIVATE_IRQS) { 1266 1632 cpuid = dist->irq_spi_cpu[irq_num - VGIC_NR_PRIVATE_IRQS]; 1633 + if (cpuid == VCPU_NOT_ALLOCATED) { 1634 + /* Pretend we use CPU0, and prevent injection */ 1635 + cpuid = 0; 1636 + can_inject = false; 1637 + } 1267 1638 vcpu = kvm_get_vcpu(kvm, cpuid); 1268 1639 } 1269 1640 ··· 1291 1652 1292 1653 enabled = vgic_irq_is_enabled(vcpu, irq_num); 1293 1654 1294 - if (!enabled) { 1655 + if (!enabled || !can_inject) { 1295 1656 ret = false; 1296 1657 goto out; 1297 1658 } ··· 1337 1698 int vcpu_id; 1338 1699 1339 1700 if (unlikely(!vgic_initialized(kvm))) { 1701 + /* 1702 + * We only provide the automatic initialization of the VGIC 1703 + * for the legacy case of a GICv2. Any other type must 1704 + * be explicitly initialized once setup with the respective 1705 + * KVM device call. 
1706 + */ 1707 + if (kvm->arch.vgic.vgic_model != KVM_DEV_TYPE_ARM_VGIC_V2) { 1708 + ret = -EBUSY; 1709 + goto out; 1710 + } 1340 1711 mutex_lock(&kvm->lock); 1341 1712 ret = vgic_init(kvm); 1342 1713 mutex_unlock(&kvm->lock); ··· 1411 1762 return 0; 1412 1763 } 1413 1764 1765 + /** 1766 + * kvm_vgic_get_max_vcpus - Get the maximum number of VCPUs allowed by HW 1767 + * 1768 + * The host's GIC naturally limits the maximum amount of VCPUs a guest 1769 + * can use. 1770 + */ 1771 + int kvm_vgic_get_max_vcpus(void) 1772 + { 1773 + return vgic->max_gic_vcpus; 1774 + } 1775 + 1414 1776 void kvm_vgic_destroy(struct kvm *kvm) 1415 1777 { 1416 1778 struct vgic_dist *dist = &kvm->arch.vgic; ··· 1444 1784 } 1445 1785 kfree(dist->irq_sgi_sources); 1446 1786 kfree(dist->irq_spi_cpu); 1787 + kfree(dist->irq_spi_mpidr); 1447 1788 kfree(dist->irq_spi_target); 1448 1789 kfree(dist->irq_pending_on_cpu); 1449 1790 dist->irq_sgi_sources = NULL; ··· 1458 1797 * Allocate and initialize the various data structures. Must be called 1459 1798 * with kvm->lock held! 1460 1799 */ 1461 - static int vgic_init(struct kvm *kvm) 1800 + int vgic_init(struct kvm *kvm) 1462 1801 { 1463 1802 struct vgic_dist *dist = &kvm->arch.vgic; 1464 1803 struct kvm_vcpu *vcpu; ··· 1470 1809 1471 1810 nr_cpus = dist->nr_cpus = atomic_read(&kvm->online_vcpus); 1472 1811 if (!nr_cpus) /* No vcpus? Can't be good... 
*/ 1473 - return -EINVAL; 1812 + return -ENODEV; 1474 1813 1475 1814 /* 1476 1815 * If nobody configured the number of interrupts, use the ··· 1513 1852 if (ret) 1514 1853 goto out; 1515 1854 1516 - for (i = VGIC_NR_PRIVATE_IRQS; i < dist->nr_irqs; i += 4) 1517 - vgic_set_target_reg(kvm, 0, i); 1855 + ret = kvm->arch.vgic.vm_ops.init_model(kvm); 1856 + if (ret) 1857 + goto out; 1518 1858 1519 1859 kvm_for_each_vcpu(vcpu_id, vcpu, kvm) { 1520 1860 ret = vgic_vcpu_init_maps(vcpu, nr_irqs); ··· 1544 1882 return ret; 1545 1883 } 1546 1884 1547 - /** 1548 - * kvm_vgic_map_resources - Configure global VGIC state before running any VCPUs 1549 - * @kvm: pointer to the kvm struct 1550 - * 1551 - * Map the virtual CPU interface into the VM before running any VCPUs. We 1552 - * can't do this at creation time, because user space must first set the 1553 - * virtual CPU interface address in the guest physical address space. 1554 - */ 1555 - int kvm_vgic_map_resources(struct kvm *kvm) 1885 + static int init_vgic_model(struct kvm *kvm, int type) 1556 1886 { 1557 - int ret = 0; 1558 - 1559 - if (!irqchip_in_kernel(kvm)) 1560 - return 0; 1561 - 1562 - mutex_lock(&kvm->lock); 1563 - 1564 - if (vgic_ready(kvm)) 1565 - goto out; 1566 - 1567 - if (IS_VGIC_ADDR_UNDEF(kvm->arch.vgic.vgic_dist_base) || 1568 - IS_VGIC_ADDR_UNDEF(kvm->arch.vgic.vgic_cpu_base)) { 1569 - kvm_err("Need to set vgic cpu and dist addresses first\n"); 1570 - ret = -ENXIO; 1571 - goto out; 1887 + switch (type) { 1888 + case KVM_DEV_TYPE_ARM_VGIC_V2: 1889 + vgic_v2_init_emulation(kvm); 1890 + break; 1891 + #ifdef CONFIG_ARM_GIC_V3 1892 + case KVM_DEV_TYPE_ARM_VGIC_V3: 1893 + vgic_v3_init_emulation(kvm); 1894 + break; 1895 + #endif 1896 + default: 1897 + return -ENODEV; 1572 1898 } 1573 1899 1574 - /* 1575 - * Initialize the vgic if this hasn't already been done on demand by 1576 - * accessing the vgic state from userspace. 
1577 - */ 1578 - ret = vgic_init(kvm); 1579 - if (ret) { 1580 - kvm_err("Unable to allocate maps\n"); 1581 - goto out; 1582 - } 1900 + if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) 1901 + return -E2BIG; 1583 1902 1584 - ret = kvm_phys_addr_ioremap(kvm, kvm->arch.vgic.vgic_cpu_base, 1585 - vgic->vcpu_base, KVM_VGIC_V2_CPU_SIZE, 1586 - true); 1587 - if (ret) { 1588 - kvm_err("Unable to remap VGIC CPU to VCPU\n"); 1589 - goto out; 1590 - } 1591 - 1592 - kvm->arch.vgic.ready = true; 1593 - out: 1594 - if (ret) 1595 - kvm_vgic_destroy(kvm); 1596 - mutex_unlock(&kvm->lock); 1597 - return ret; 1903 + return 0; 1598 1904 } 1599 1905 1600 - int kvm_vgic_create(struct kvm *kvm) 1906 + int kvm_vgic_create(struct kvm *kvm, u32 type) 1601 1907 { 1602 1908 int i, vcpu_lock_idx = -1, ret; 1603 1909 struct kvm_vcpu *vcpu; 1604 1910 1605 1911 mutex_lock(&kvm->lock); 1606 1912 1607 - if (kvm->arch.vgic.vctrl_base) { 1913 + if (irqchip_in_kernel(kvm)) { 1608 1914 ret = -EEXIST; 1609 1915 goto out; 1610 1916 } 1917 + 1918 + /* 1919 + * This function is also called by the KVM_CREATE_IRQCHIP handler, 1920 + * which had no chance yet to check the availability of the GICv2 1921 + * emulation. So check this here again. KVM_CREATE_DEVICE does 1922 + * the proper checks already. 
1923 + */ 1924 + if (type == KVM_DEV_TYPE_ARM_VGIC_V2 && !vgic->can_emulate_gicv2) 1925 + return -ENODEV; 1611 1926 1612 1927 /* 1613 1928 * Any time a vcpu is run, vcpu_load is called which tries to grab the ··· 1604 1965 } 1605 1966 ret = 0; 1606 1967 1968 + ret = init_vgic_model(kvm, type); 1969 + if (ret) 1970 + goto out_unlock; 1971 + 1607 1972 spin_lock_init(&kvm->arch.vgic.lock); 1608 1973 kvm->arch.vgic.in_kernel = true; 1974 + kvm->arch.vgic.vgic_model = type; 1609 1975 kvm->arch.vgic.vctrl_base = vgic->vctrl_base; 1610 1976 kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF; 1611 1977 kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF; 1978 + kvm->arch.vgic.vgic_redist_base = VGIC_ADDR_UNDEF; 1612 1979 1613 1980 out_unlock: 1614 1981 for (; vcpu_lock_idx >= 0; vcpu_lock_idx--) { ··· 1667 2022 /** 1668 2023 * kvm_vgic_addr - set or get vgic VM base addresses 1669 2024 * @kvm: pointer to the vm struct 1670 - * @type: the VGIC addr type, one of KVM_VGIC_V2_ADDR_TYPE_XXX 2025 + * @type: the VGIC addr type, one of KVM_VGIC_V[23]_ADDR_TYPE_XXX 1671 2026 * @addr: pointer to address value 1672 2027 * @write: if true set the address in the VM address space, if false read the 1673 2028 * address ··· 1681 2036 { 1682 2037 int r = 0; 1683 2038 struct vgic_dist *vgic = &kvm->arch.vgic; 2039 + int type_needed; 2040 + phys_addr_t *addr_ptr, block_size; 2041 + phys_addr_t alignment; 1684 2042 1685 2043 mutex_lock(&kvm->lock); 1686 2044 switch (type) { 1687 2045 case KVM_VGIC_V2_ADDR_TYPE_DIST: 1688 - if (write) { 1689 - r = vgic_ioaddr_assign(kvm, &vgic->vgic_dist_base, 1690 - *addr, KVM_VGIC_V2_DIST_SIZE); 1691 - } else { 1692 - *addr = vgic->vgic_dist_base; 1693 - } 2046 + type_needed = KVM_DEV_TYPE_ARM_VGIC_V2; 2047 + addr_ptr = &vgic->vgic_dist_base; 2048 + block_size = KVM_VGIC_V2_DIST_SIZE; 2049 + alignment = SZ_4K; 1694 2050 break; 1695 2051 case KVM_VGIC_V2_ADDR_TYPE_CPU: 1696 - if (write) { 1697 - r = vgic_ioaddr_assign(kvm, &vgic->vgic_cpu_base, 1698 - *addr, 
KVM_VGIC_V2_CPU_SIZE); 1699 - } else { 1700 - *addr = vgic->vgic_cpu_base; 1701 - } 2052 + type_needed = KVM_DEV_TYPE_ARM_VGIC_V2; 2053 + addr_ptr = &vgic->vgic_cpu_base; 2054 + block_size = KVM_VGIC_V2_CPU_SIZE; 2055 + alignment = SZ_4K; 1702 2056 break; 2057 + #ifdef CONFIG_ARM_GIC_V3 2058 + case KVM_VGIC_V3_ADDR_TYPE_DIST: 2059 + type_needed = KVM_DEV_TYPE_ARM_VGIC_V3; 2060 + addr_ptr = &vgic->vgic_dist_base; 2061 + block_size = KVM_VGIC_V3_DIST_SIZE; 2062 + alignment = SZ_64K; 2063 + break; 2064 + case KVM_VGIC_V3_ADDR_TYPE_REDIST: 2065 + type_needed = KVM_DEV_TYPE_ARM_VGIC_V3; 2066 + addr_ptr = &vgic->vgic_redist_base; 2067 + block_size = KVM_VGIC_V3_REDIST_SIZE; 2068 + alignment = SZ_64K; 2069 + break; 2070 + #endif 1703 2071 default: 1704 2072 r = -ENODEV; 2073 + goto out; 1705 2074 } 1706 2075 2076 + if (vgic->vgic_model != type_needed) { 2077 + r = -ENODEV; 2078 + goto out; 2079 + } 2080 + 2081 + if (write) { 2082 + if (!IS_ALIGNED(*addr, alignment)) 2083 + r = -EINVAL; 2084 + else 2085 + r = vgic_ioaddr_assign(kvm, addr_ptr, *addr, 2086 + block_size); 2087 + } else { 2088 + *addr = *addr_ptr; 2089 + } 2090 + 2091 + out: 1707 2092 mutex_unlock(&kvm->lock); 1708 2093 return r; 1709 2094 } 1710 2095 1711 - static bool handle_cpu_mmio_misc(struct kvm_vcpu *vcpu, 1712 - struct kvm_exit_mmio *mmio, phys_addr_t offset) 1713 - { 1714 - bool updated = false; 1715 - struct vgic_vmcr vmcr; 1716 - u32 *vmcr_field; 1717 - u32 reg; 1718 - 1719 - vgic_get_vmcr(vcpu, &vmcr); 1720 - 1721 - switch (offset & ~0x3) { 1722 - case GIC_CPU_CTRL: 1723 - vmcr_field = &vmcr.ctlr; 1724 - break; 1725 - case GIC_CPU_PRIMASK: 1726 - vmcr_field = &vmcr.pmr; 1727 - break; 1728 - case GIC_CPU_BINPOINT: 1729 - vmcr_field = &vmcr.bpr; 1730 - break; 1731 - case GIC_CPU_ALIAS_BINPOINT: 1732 - vmcr_field = &vmcr.abpr; 1733 - break; 1734 - default: 1735 - BUG(); 1736 - } 1737 - 1738 - if (!mmio->is_write) { 1739 - reg = *vmcr_field; 1740 - mmio_data_write(mmio, ~0, reg); 1741 - } else { 1742 - 
-		reg = mmio_data_read(mmio, ~0);
-		if (reg != *vmcr_field) {
-			*vmcr_field = reg;
-			vgic_set_vmcr(vcpu, &vmcr);
-			updated = true;
-		}
-	}
-	return updated;
-}
-
-static bool handle_mmio_abpr(struct kvm_vcpu *vcpu,
-			     struct kvm_exit_mmio *mmio, phys_addr_t offset)
-{
-	return handle_cpu_mmio_misc(vcpu, mmio, GIC_CPU_ALIAS_BINPOINT);
-}
-
-static bool handle_cpu_mmio_ident(struct kvm_vcpu *vcpu,
-				  struct kvm_exit_mmio *mmio,
-				  phys_addr_t offset)
-{
-	u32 reg;
-
-	if (mmio->is_write)
-		return false;
-
-	/* GICC_IIDR */
-	reg = (PRODUCT_ID_KVM << 20) |
-	      (GICC_ARCH_VERSION_V2 << 16) |
-	      (IMPLEMENTER_ARM << 0);
-	mmio_data_write(mmio, ~0, reg);
-	return false;
-}
-
-/*
- * CPU Interface Register accesses - these are not accessed by the VM, but by
- * user space for saving and restoring VGIC state.
- */
-static const struct mmio_range vgic_cpu_ranges[] = {
-	{
-		.base		= GIC_CPU_CTRL,
-		.len		= 12,
-		.handle_mmio	= handle_cpu_mmio_misc,
-	},
-	{
-		.base		= GIC_CPU_ALIAS_BINPOINT,
-		.len		= 4,
-		.handle_mmio	= handle_mmio_abpr,
-	},
-	{
-		.base		= GIC_CPU_ACTIVEPRIO,
-		.len		= 16,
-		.handle_mmio	= handle_mmio_raz_wi,
-	},
-	{
-		.base		= GIC_CPU_IDENT,
-		.len		= 4,
-		.handle_mmio	= handle_cpu_mmio_ident,
-	},
-};
-
-static int vgic_attr_regs_access(struct kvm_device *dev,
-				 struct kvm_device_attr *attr,
-				 u32 *reg, bool is_write)
-{
-	const struct mmio_range *r = NULL, *ranges;
-	phys_addr_t offset;
-	int ret, cpuid, c;
-	struct kvm_vcpu *vcpu, *tmp_vcpu;
-	struct vgic_dist *vgic;
-	struct kvm_exit_mmio mmio;
-
-	offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
-	cpuid = (attr->attr & KVM_DEV_ARM_VGIC_CPUID_MASK) >>
-		KVM_DEV_ARM_VGIC_CPUID_SHIFT;
-
-	mutex_lock(&dev->kvm->lock);
-
-	ret = vgic_init(dev->kvm);
-	if (ret)
-		goto out;
-
-	if (cpuid >= atomic_read(&dev->kvm->online_vcpus)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
-	vcpu = kvm_get_vcpu(dev->kvm, cpuid);
-	vgic = &dev->kvm->arch.vgic;
-
-	mmio.len = 4;
-	mmio.is_write = is_write;
-	if (is_write)
-		mmio_data_write(&mmio, ~0, *reg);
-	switch (attr->group) {
-	case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
-		mmio.phys_addr = vgic->vgic_dist_base + offset;
-		ranges = vgic_dist_ranges;
-		break;
-	case KVM_DEV_ARM_VGIC_GRP_CPU_REGS:
-		mmio.phys_addr = vgic->vgic_cpu_base + offset;
-		ranges = vgic_cpu_ranges;
-		break;
-	default:
-		BUG();
-	}
-	r = find_matching_range(ranges, &mmio, offset);
-
-	if (unlikely(!r || !r->handle_mmio)) {
-		ret = -ENXIO;
-		goto out;
-	}
-
-	spin_lock(&vgic->lock);
-
-	/*
-	 * Ensure that no other VCPU is running by checking the vcpu->cpu
-	 * field. If no other VCPUs are running we can safely access the VGIC
-	 * state, because even if another VCPU is run after this point, that
-	 * VCPU will not touch the vgic state, because it will block on
-	 * getting the vgic->lock in kvm_vgic_sync_hwstate().
-	 */
-	kvm_for_each_vcpu(c, tmp_vcpu, dev->kvm) {
-		if (unlikely(tmp_vcpu->cpu != -1)) {
-			ret = -EBUSY;
-			goto out_vgic_unlock;
-		}
-	}
-
-	/*
-	 * Move all pending IRQs from the LRs on all VCPUs so the pending
-	 * state can be properly represented in the register state accessible
-	 * through this API.
-	 */
-	kvm_for_each_vcpu(c, tmp_vcpu, dev->kvm)
-		vgic_unqueue_irqs(tmp_vcpu);
-
-	offset -= r->base;
-	r->handle_mmio(vcpu, &mmio, offset);
-
-	if (!is_write)
-		*reg = mmio_data_read(&mmio, ~0);
-
-	ret = 0;
-out_vgic_unlock:
-	spin_unlock(&vgic->lock);
-out:
-	mutex_unlock(&dev->kvm->lock);
-	return ret;
-}
-
-static int vgic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+int vgic_set_common_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 {
 	int r;
 
···
 
 		r = kvm_vgic_addr(dev->kvm, type, &addr, true);
 		return (r == -ENODEV) ? -ENXIO : r;
-	}
-
-	case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
-	case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: {
-		u32 __user *uaddr = (u32 __user *)(long)attr->addr;
-		u32 reg;
-
-		if (get_user(reg, uaddr))
-			return -EFAULT;
-
-		return vgic_attr_regs_access(dev, attr, &reg, true);
 	}
 	case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: {
 		u32 __user *uaddr = (u32 __user *)(long)attr->addr;
···
 
 		return ret;
 	}
-
+	case KVM_DEV_ARM_VGIC_GRP_CTRL: {
+		switch (attr->attr) {
+		case KVM_DEV_ARM_VGIC_CTRL_INIT:
+			r = vgic_init(dev->kvm);
+			return r;
+		}
+		break;
+	}
 	}
 
 	return -ENXIO;
 }
 
-static int vgic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+int vgic_get_common_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 {
 	int r = -ENXIO;
 
···
 			return -EFAULT;
 		break;
 	}
-
-	case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
-	case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: {
-		u32 __user *uaddr = (u32 __user *)(long)attr->addr;
-		u32 reg = 0;
-
-		r = vgic_attr_regs_access(dev, attr, &reg, false);
-		if (r)
-			return r;
-		r = put_user(reg, uaddr);
-		break;
-	}
 	case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: {
 		u32 __user *uaddr = (u32 __user *)(long)attr->addr;
+
 		r = put_user(dev->kvm->arch.vgic.nr_irqs, uaddr);
 		break;
 	}
···
 	return r;
 }
 
-static int vgic_has_attr_regs(const struct mmio_range *ranges,
-			      phys_addr_t offset)
+int vgic_has_attr_regs(const struct kvm_mmio_range *ranges, phys_addr_t offset)
 {
 	struct kvm_exit_mmio dev_attr_mmio;
 
 	dev_attr_mmio.len = 4;
-	if (find_matching_range(ranges, &dev_attr_mmio, offset))
+	if (vgic_find_range(ranges, &dev_attr_mmio, offset))
 		return 0;
 	else
 		return -ENXIO;
 }
-
-static int vgic_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
-{
-	phys_addr_t offset;
-
-	switch (attr->group) {
-	case KVM_DEV_ARM_VGIC_GRP_ADDR:
-		switch (attr->attr) {
-		case KVM_VGIC_V2_ADDR_TYPE_DIST:
-		case KVM_VGIC_V2_ADDR_TYPE_CPU:
-			return 0;
-		}
-		break;
-	case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
-		offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
-		return vgic_has_attr_regs(vgic_dist_ranges, offset);
-	case KVM_DEV_ARM_VGIC_GRP_CPU_REGS:
-		offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
-		return vgic_has_attr_regs(vgic_cpu_ranges, offset);
-	case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
-		return 0;
-	}
-	return -ENXIO;
-}
-
-static void vgic_destroy(struct kvm_device *dev)
-{
-	kfree(dev);
-}
-
-static int vgic_create(struct kvm_device *dev, u32 type)
-{
-	return kvm_vgic_create(dev->kvm);
-}
-
-static struct kvm_device_ops kvm_arm_vgic_v2_ops = {
-	.name = "kvm-arm-vgic",
-	.create = vgic_create,
-	.destroy = vgic_destroy,
-	.set_attr = vgic_set_attr,
-	.get_attr = vgic_get_attr,
-	.has_attr = vgic_has_attr,
-};
 
 static void vgic_init_maintenance_interrupt(void *info)
 {
···
 
 	on_each_cpu(vgic_init_maintenance_interrupt, NULL, 1);
 
-	return kvm_register_device_ops(&kvm_arm_vgic_v2_ops,
-				       KVM_DEV_TYPE_ARM_VGIC_V2);
+	return 0;
 
 out_free_irq:
 	free_percpu_irq(vgic->maint_irq, kvm_get_running_vcpus());
virt/kvm/arm/vgic.h (new file, +123 lines)
+/*
+ * Copyright (C) 2012-2014 ARM Ltd.
+ * Author: Marc Zyngier <marc.zyngier@arm.com>
+ *
+ * Derived from virt/kvm/arm/vgic.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __KVM_VGIC_H__
+#define __KVM_VGIC_H__
+
+#define VGIC_ADDR_UNDEF		(-1)
+#define IS_VGIC_ADDR_UNDEF(_x)	((_x) == VGIC_ADDR_UNDEF)
+
+#define PRODUCT_ID_KVM		0x4b	/* ASCII code K */
+#define IMPLEMENTER_ARM		0x43b
+
+#define ACCESS_READ_VALUE	(1 << 0)
+#define ACCESS_READ_RAZ		(0 << 0)
+#define ACCESS_READ_MASK(x)	((x) & (1 << 0))
+#define ACCESS_WRITE_IGNORED	(0 << 1)
+#define ACCESS_WRITE_SETBIT	(1 << 1)
+#define ACCESS_WRITE_CLEARBIT	(2 << 1)
+#define ACCESS_WRITE_VALUE	(3 << 1)
+#define ACCESS_WRITE_MASK(x)	((x) & (3 << 1))
+
+#define VCPU_NOT_ALLOCATED	((u8)-1)
+
+unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x);
+
+void vgic_update_state(struct kvm *kvm);
+int vgic_init_common_maps(struct kvm *kvm);
+
+u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, int cpuid, u32 offset);
+u32 *vgic_bytemap_get_reg(struct vgic_bytemap *x, int cpuid, u32 offset);
+
+void vgic_dist_irq_set_pending(struct kvm_vcpu *vcpu, int irq);
+void vgic_dist_irq_clear_pending(struct kvm_vcpu *vcpu, int irq);
+void vgic_cpu_irq_clear(struct kvm_vcpu *vcpu, int irq);
+void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, int cpuid,
+			     int irq, int val);
+
+void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr);
+void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr);
+
+bool vgic_queue_irq(struct kvm_vcpu *vcpu, u8 sgi_source_id, int irq);
+void vgic_unqueue_irqs(struct kvm_vcpu *vcpu);
+
+void vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg,
+		     phys_addr_t offset, int mode);
+bool handle_mmio_raz_wi(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio,
+			phys_addr_t offset);
+
+static inline
+u32 mmio_data_read(struct kvm_exit_mmio *mmio, u32 mask)
+{
+	return le32_to_cpu(*((u32 *)mmio->data)) & mask;
+}
+
+static inline
+void mmio_data_write(struct kvm_exit_mmio *mmio, u32 mask, u32 value)
+{
+	*((u32 *)mmio->data) = cpu_to_le32(value) & mask;
+}
+
+struct kvm_mmio_range {
+	phys_addr_t base;
+	unsigned long len;
+	int bits_per_irq;
+	bool (*handle_mmio)(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio,
+			    phys_addr_t offset);
+};
+
+static inline bool is_in_range(phys_addr_t addr, unsigned long len,
+			       phys_addr_t baseaddr, unsigned long size)
+{
+	return (addr >= baseaddr) && (addr + len <= baseaddr + size);
+}
+
+const
+struct kvm_mmio_range *vgic_find_range(const struct kvm_mmio_range *ranges,
+				       struct kvm_exit_mmio *mmio,
+				       phys_addr_t offset);
+
+bool vgic_handle_mmio_range(struct kvm_vcpu *vcpu, struct kvm_run *run,
+			    struct kvm_exit_mmio *mmio,
+			    const struct kvm_mmio_range *ranges,
+			    unsigned long mmio_base);
+
+bool vgic_handle_enable_reg(struct kvm *kvm, struct kvm_exit_mmio *mmio,
+			    phys_addr_t offset, int vcpu_id, int access);
+
+bool vgic_handle_set_pending_reg(struct kvm *kvm, struct kvm_exit_mmio *mmio,
+				 phys_addr_t offset, int vcpu_id);
+
+bool vgic_handle_clear_pending_reg(struct kvm *kvm, struct kvm_exit_mmio *mmio,
+				   phys_addr_t offset, int vcpu_id);
+
+bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio *mmio,
+			 phys_addr_t offset);
+
+void vgic_kick_vcpus(struct kvm *kvm);
+
+int vgic_has_attr_regs(const struct kvm_mmio_range *ranges, phys_addr_t offset);
+int vgic_set_common_attr(struct kvm_device *dev, struct kvm_device_attr *attr);
+int vgic_get_common_attr(struct kvm_device *dev, struct kvm_device_attr *attr);
+
+int vgic_init(struct kvm *kvm);
+void vgic_v2_init_emulation(struct kvm *kvm);
+void vgic_v3_init_emulation(struct kvm *kvm);
+
+#endif
virt/kvm/kvm_main.c (+131, -13)
···
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
 
+unsigned int halt_poll_ns = 0;
+module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
+
 /*
  * Ordering of locks:
  *
···
 
 static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
 			   unsigned long arg);
-#ifdef CONFIG_COMPAT
+#ifdef CONFIG_KVM_COMPAT
 static long kvm_vcpu_compat_ioctl(struct file *file, unsigned int ioctl,
 				  unsigned long arg);
 #endif
···
 	return called;
 }
 
+#ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
 	long dirty_count = kvm->tlbs_dirty;
···
 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
+#endif
 
 void kvm_reload_remote_mmus(struct kvm *kvm)
 {
···
 	if (!new->npages) {
 		WARN_ON(!mslots[i].npages);
 		new->base_gfn = 0;
+		new->flags = 0;
 		if (mslots[i].npages)
 			slots->used_slots--;
 	} else {
···
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_get_dirty_log);
+
+#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
+/**
+ * kvm_get_dirty_log_protect - get a snapshot of dirty pages, and if any pages
+ *	are dirty write protect them for next write.
+ * @kvm:	pointer to kvm instance
+ * @log:	slot id and address to which we copy the log
+ * @is_dirty:	flag set if any page is dirty
+ *
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing track of dirty pages we keep the
+ * following order:
+ *
+ *    1. Take a snapshot of the bit and clear it if needed.
+ *    2. Write protect the corresponding page.
+ *    3. Copy the snapshot to the userspace.
+ *    4. Upon return caller flushes TLB's if needed.
+ *
+ * Between 2 and 4, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page is reported dirty using
+ * the snapshot taken before and step 4 ensures that writes done after
+ * exiting to userspace will be logged for the next call.
+ *
+ */
+int kvm_get_dirty_log_protect(struct kvm *kvm,
+			struct kvm_dirty_log *log, bool *is_dirty)
+{
+	struct kvm_memory_slot *memslot;
+	int r, i;
+	unsigned long n;
+	unsigned long *dirty_bitmap;
+	unsigned long *dirty_bitmap_buffer;
+
+	r = -EINVAL;
+	if (log->slot >= KVM_USER_MEM_SLOTS)
+		goto out;
+
+	memslot = id_to_memslot(kvm->memslots, log->slot);
+
+	dirty_bitmap = memslot->dirty_bitmap;
+	r = -ENOENT;
+	if (!dirty_bitmap)
+		goto out;
+
+	n = kvm_dirty_bitmap_bytes(memslot);
+
+	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+	memset(dirty_bitmap_buffer, 0, n);
+
+	spin_lock(&kvm->mmu_lock);
+	*is_dirty = false;
+	for (i = 0; i < n / sizeof(long); i++) {
+		unsigned long mask;
+		gfn_t offset;
+
+		if (!dirty_bitmap[i])
+			continue;
+
+		*is_dirty = true;
+
+		mask = xchg(&dirty_bitmap[i], 0);
+		dirty_bitmap_buffer[i] = mask;
+
+		offset = i * BITS_PER_LONG;
+		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
+							mask);
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+
+	r = -EFAULT;
+	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+		goto out;
+
+	r = 0;
+out:
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_get_dirty_log_protect);
+#endif
 
 bool kvm_largepages_enabled(void)
 {
···
 	}
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest);
 
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			      gpa_t gpa, unsigned long len)
···
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
 
+static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
+{
+	if (kvm_arch_vcpu_runnable(vcpu)) {
+		kvm_make_request(KVM_REQ_UNHALT, vcpu);
+		return -EINTR;
+	}
+	if (kvm_cpu_has_pending_timer(vcpu))
+		return -EINTR;
+	if (signal_pending(current))
+		return -EINTR;
+
+	return 0;
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
 void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 {
+	ktime_t start, cur;
 	DEFINE_WAIT(wait);
+	bool waited = false;
+
+	start = cur = ktime_get();
+	if (halt_poll_ns) {
+		ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
+		do {
+			/*
+			 * This sets KVM_REQ_UNHALT if an interrupt
+			 * arrives.
+			 */
+			if (kvm_vcpu_check_block(vcpu) < 0) {
+				++vcpu->stat.halt_successful_poll;
+				goto out;
+			}
+			cur = ktime_get();
+		} while (single_task_running() && ktime_before(cur, stop));
+	}
 
 	for (;;) {
 		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
 
-		if (kvm_arch_vcpu_runnable(vcpu)) {
-			kvm_make_request(KVM_REQ_UNHALT, vcpu);
-			break;
-		}
-		if (kvm_cpu_has_pending_timer(vcpu))
-			break;
-		if (signal_pending(current))
+		if (kvm_vcpu_check_block(vcpu) < 0)
 			break;
 
+		waited = true;
 		schedule();
 	}
 
 	finish_wait(&vcpu->wq, &wait);
+	cur = ktime_get();
+
+out:
+	trace_kvm_vcpu_wakeup(ktime_to_ns(cur) - ktime_to_ns(start), waited);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_block);
 
···
 static struct file_operations kvm_vcpu_fops = {
 	.release        = kvm_vcpu_release,
 	.unlocked_ioctl = kvm_vcpu_ioctl,
-#ifdef CONFIG_COMPAT
+#ifdef CONFIG_KVM_COMPAT
 	.compat_ioctl   = kvm_vcpu_compat_ioctl,
 #endif
 	.mmap           = kvm_vcpu_mmap,
···
 	return r;
 }
 
-#ifdef CONFIG_COMPAT
+#ifdef CONFIG_KVM_COMPAT
 static long kvm_vcpu_compat_ioctl(struct file *filp,
 				  unsigned int ioctl, unsigned long arg)
 {
···
 
 static const struct file_operations kvm_device_fops = {
 	.unlocked_ioctl = kvm_device_ioctl,
-#ifdef CONFIG_COMPAT
+#ifdef CONFIG_KVM_COMPAT
 	.compat_ioctl = kvm_device_ioctl,
 #endif
 	.release = kvm_device_release,
···
 	return r;
 }
 
-#ifdef CONFIG_COMPAT
+#ifdef CONFIG_KVM_COMPAT
 struct compat_kvm_dirty_log {
 	__u32 slot;
 	__u32 padding1;
···
 static struct file_operations kvm_vm_fops = {
 	.release        = kvm_vm_release,
 	.unlocked_ioctl = kvm_vm_ioctl,
-#ifdef CONFIG_COMPAT
+#ifdef CONFIG_KVM_COMPAT
 	.compat_ioctl   = kvm_vm_compat_ioctl,
 #endif
 	.llseek		= noop_llseek,