Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.

s390:
A bunch of fixes and optimizations for interrupt and time handling.

PPC:
Mostly bug fixes.

ARM:
No big features, but many small fixes and prerequisites including:

- a number of fixes for the arch-timer

- introducing proper level-triggered semantics for the arch-timers

- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)

- some tracepoint improvements

- a tweak for the EL2 panic handlers

- some more VGIC cleanups getting rid of redundant state

x86:
Quite a few changes:

- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.

- more Hyper-V features, though the main one (Hyper-V synthetic
interrupt controller) will have to wait for 4.5. These will let
KVM expose Hyper-V devices.

- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster

- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit

- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor

- obligatory smattering of SMM fixes

- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...

+2956 -1029
+1
Documentation/kernel-parameters.txt
··· 1585 1585 nosid disable Source ID checking 1586 1586 no_x2apic_optout 1587 1587 BIOS x2APIC opt-out request will be ignored 1588 + nopost disable Interrupt Posting 1588 1589 1589 1590 iomem= Disable strict checking of access to MMIO memory 1590 1591 strict regions from userspace.
+47 -5
Documentation/virtual/kvm/api.txt
··· 401 401 Architectures: x86, ppc, mips 402 402 Type: vcpu ioctl 403 403 Parameters: struct kvm_interrupt (in) 404 - Returns: 0 on success, -1 on error 404 + Returns: 0 on success, negative on failure. 405 405 406 - Queues a hardware interrupt vector to be injected. This is only 407 - useful if in-kernel local APIC or equivalent is not used. 406 + Queues a hardware interrupt vector to be injected. 408 407 409 408 /* for KVM_INTERRUPT */ 410 409 struct kvm_interrupt { ··· 413 414 414 415 X86: 415 416 416 - Note 'irq' is an interrupt vector, not an interrupt pin or line. 417 + Returns: 0 on success, 418 + -EEXIST if an interrupt is already enqueued 419 + -EINVAL if the irq number is invalid 420 + -ENXIO if the PIC is in the kernel 421 + -EFAULT if the pointer is invalid 422 + 423 + Note 'irq' is an interrupt vector, not an interrupt pin or line. This 424 + ioctl is useful if the in-kernel PIC is not used. 417 425 418 426 PPC: 419 427 ··· 1604 1598 struct kvm_ioeventfd { 1605 1599 __u64 datamatch; 1606 1600 __u64 addr; /* legal pio/mmio address */ 1607 - __u32 len; /* 1, 2, 4, or 8 bytes */ 1601 + __u32 len; /* 0, 1, 2, 4, or 8 bytes */ 1608 1602 __s32 fd; 1609 1603 __u32 flags; 1610 1604 __u8 pad[36]; ··· 1627 1621 For virtio-ccw devices, addr contains the subchannel id and datamatch the 1628 1622 virtqueue index. 1629 1623 1624 + With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero-length ioeventfd is allowed, and 1625 + the kernel will ignore the length of the guest write and may get a faster vmexit. 1626 + The speedup may only apply to specific architectures, but the ioeventfd will 1627 + work anyway. 1630 1628 1631 1629 4.60 KVM_DIRTY_TLB 1632 1630 ··· 3319 3309 to ignore the request, or to gather VM memory core dump and/or 3320 3310 reset/shutdown of the VM. 3321 3311 3312 + /* KVM_EXIT_IOAPIC_EOI */ 3313 + struct { 3314 + __u8 vector; 3315 + } eoi; 3316 + 3317 + Indicates that the VCPU's in-kernel local APIC received an EOI for a 3318 + level-triggered IOAPIC interrupt.
This exit only triggers when the 3319 + IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled); 3320 + the userspace IOAPIC should process the EOI and retrigger the interrupt if 3321 + it is still asserted. Vector is the LAPIC interrupt vector for which the 3322 + EOI was received. 3323 + 3322 3324 /* Fix the size of the union. */ 3323 3325 char padding[256]; 3324 3326 }; ··· 3648 3626 @ar - access register number 3649 3627 3650 3628 KVM handlers should exit to userspace with rc = -EREMOTE. 3629 + 3630 + 7.5 KVM_CAP_SPLIT_IRQCHIP 3631 + 3632 + Architectures: x86 3633 + Parameters: args[0] - number of routes reserved for userspace IOAPICs 3634 + Returns: 0 on success, -1 on error 3635 + 3636 + Create a local apic for each processor in the kernel. This can be used 3637 + instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the 3638 + IOAPIC and PIC (and also the PIT, even though this has to be enabled 3639 + separately). 3640 + 3641 + This capability also enables in-kernel routing of interrupt requests; 3642 + when KVM_CAP_SPLIT_IRQCHIP is enabled, only routes of KVM_IRQ_ROUTING_MSI 3643 + type are used in the IRQ routing table. The first args[0] MSI routes are 3644 + reserved for the IOAPIC pins. Whenever the LAPIC receives an EOI for these 3645 + routes, a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace. 3646 + 3647 + Fails if a VCPU has already been created, or if the irqchip is already in 3648 + the kernel (i.e. KVM_CREATE_IRQCHIP has already been called). 3651 3649 3652 3650 3653 3651 8. Other capabilities.
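The KVM_EXIT_IOAPIC_EOI flow described above (userspace sees the EOI and must retrigger the interrupt if the level-triggered line is still asserted) can be sketched with a toy userspace IOAPIC model. Everything below — the struct layout, helper names, and pin count — is invented for illustration; it is not code from KVM or any real VMM:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy userspace-IOAPIC model for the split-irqchip EOI handling
 * described above.  Illustrative only; all names are invented. */
#define TOY_IOAPIC_PINS 24

struct toy_ioapic {
	bool level_asserted[TOY_IOAPIC_PINS]; /* state of the virtual line */
	bool remote_irr[TOY_IOAPIC_PINS];     /* interrupt in flight to LAPIC */
	uint8_t vector_for_pin[TOY_IOAPIC_PINS];
	int injected;                         /* interrupts (re)delivered */
};

/* Deliver the pin's vector to the in-kernel LAPIC (stand-in for the
 * MSI route a real VMM would use). */
static void toy_inject(struct toy_ioapic *s, int pin)
{
	s->remote_irr[pin] = true;
	s->injected++;
}

/* A device asserts or deasserts a level-triggered line. */
static void toy_set_irq(struct toy_ioapic *s, int pin, bool level)
{
	s->level_asserted[pin] = level;
	if (level && !s->remote_irr[pin])
		toy_inject(s, pin);
}

/* Handle a KVM_EXIT_IOAPIC_EOI for 'vector': clear the in-flight
 * state and, as the text says, retrigger the interrupt if the line
 * is still asserted. */
static void toy_handle_eoi(struct toy_ioapic *s, uint8_t vector)
{
	for (int pin = 0; pin < TOY_IOAPIC_PINS; pin++) {
		if (s->vector_for_pin[pin] != vector || !s->remote_irr[pin])
			continue;
		s->remote_irr[pin] = false;
		if (s->level_asserted[pin])
			toy_inject(s, pin); /* line still high: re-deliver */
	}
}
```

A real VMM would drive this from its KVM_RUN loop when run->exit_reason indicates the EOI exit; the model only captures the retrigger decision.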
+187
Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
··· 1 + KVM/ARM VGIC Forwarded Physical Interrupts 2 + ========================================== 3 + 4 + The KVM/ARM code implements software support for the ARM Generic 5 + Interrupt Controller's (GIC's) hardware support for virtualization by 6 + allowing software to inject virtual interrupts to a VM, which the guest 7 + OS sees as regular interrupts. The code is famously known as the VGIC. 8 + 9 + Some of these virtual interrupts, however, correspond to physical 10 + interrupts from real physical devices. One example could be the 11 + architected timer, which itself supports virtualization, and therefore 12 + lets a guest OS program the hardware device directly to raise an 13 + interrupt at some point in time. When such an interrupt is raised, the 14 + host OS initially handles the interrupt and must somehow signal this 15 + event as a virtual interrupt to the guest. Another example could be a 16 + passthrough device, where the physical interrupts are initially handled 17 + by the host, but the device driver for the device lives in the guest OS 18 + and KVM must therefore somehow inject a virtual interrupt on behalf of 19 + the physical one to the guest OS. 20 + 21 + These virtual interrupts corresponding to a physical interrupt on the 22 + host are called forwarded physical interrupts, but are also sometimes 23 + referred to as 'virtualized physical interrupts' and 'mapped interrupts'. 24 + 25 + Forwarded physical interrupts are handled slightly differently compared 26 + to virtual interrupts generated purely by a software emulated device. 27 + 28 + 29 + The HW bit 30 + ---------- 31 + Virtual interrupts are signalled to the guest by programming the List 32 + Registers (LRs) on the GIC before running a VCPU. The LR is programmed 33 + with the virtual IRQ number and the state of the interrupt (Pending, 34 + Active, or Pending+Active). When the guest ACKs and EOIs a virtual 35 + interrupt, the LR state moves from Pending to Active, and finally to 36 + inactive. 
37 + 38 + The LRs include an extra bit, called the HW bit. When this bit is set, 39 + KVM must also program an additional field in the LR, the physical IRQ 40 + number, to link the virtual with the physical IRQ. 41 + 42 + When the HW bit is set, KVM must EITHER set the Pending OR the Active 43 + bit, never both at the same time. 44 + 45 + Setting the HW bit causes the hardware to deactivate the physical 46 + interrupt on the physical distributor when the guest deactivates the 47 + corresponding virtual interrupt. 48 + 49 + 50 + Forwarded Physical Interrupts Life Cycle 51 + ---------------------------------------- 52 + 53 + The state of forwarded physical interrupts is managed in the following way: 54 + 55 + - The physical interrupt is acked by the host, and becomes active on 56 + the physical distributor (*). 57 + - KVM sets the LR.Pending bit, because this is the only way the GICV 58 + interface is going to present it to the guest. 59 + - LR.Pending will stay set as long as the guest has not acked the interrupt. 60 + - LR.Pending transitions to LR.Active on the guest read of the IAR, as 61 + expected. 62 + - On guest EOI, the *physical distributor* active bit gets cleared, 63 + but the LR.Active is left untouched (set). 64 + - KVM clears the LR on VM exits when the physical distributor 65 + active state has been cleared. 66 + 67 + (*): The host handling is slightly more complicated. For some forwarded 68 + interrupts (shared), KVM directly sets the active state on the physical 69 + distributor before entering the guest, because the interrupt is never actually 70 + handled on the host (see details on the timer as an example below). For other 71 + forwarded interrupts (non-shared) the host does not deactivate the interrupt 72 + when the host ISR completes, but leaves the interrupt active until the guest 73 + deactivates it. 
Leaving the interrupt active is allowed, because Linux 74 + configures the physical GIC with EOIMode=1, which causes EOI operations to 75 + perform a priority drop allowing the GIC to receive other interrupts of the 76 + default priority. 77 + 78 + 79 + Forwarded Edge and Level Triggered PPIs and SPIs 80 + ------------------------------------------------ 81 + Forwarded physical interrupts injected should always be active on the 82 + physical distributor when injected to a guest. 83 + 84 + Level-triggered interrupts will keep the interrupt line to the GIC 85 + asserted, typically until the guest programs the device to deassert the 86 + line. This means that the interrupt will remain pending on the physical 87 + distributor until the guest has reprogrammed the device. Since we 88 + always run the VM with interrupts enabled on the CPU, a pending 89 + interrupt will exit the guest as soon as we switch into the guest, 90 + preventing the guest from ever making progress as the process repeats 91 + over and over. Therefore, the active state on the physical distributor 92 + must be set when entering the guest, preventing the GIC from forwarding 93 + the pending interrupt to the CPU. As soon as the guest deactivates the 94 + interrupt, the physical line is sampled by the hardware again and the host 95 + takes a new interrupt if and only if the physical line is still asserted. 96 + 97 + Edge-triggered interrupts do not exhibit the same problem with 98 + preventing guest execution that level-triggered interrupts do. One 99 + option is to not use HW bit at all, and inject edge-triggered interrupts 100 + from a physical device as pure virtual interrupts. But that would 101 + potentially slow down handling of the interrupt in the guest, because a 102 + physical interrupt occurring in the middle of the guest ISR would 103 + preempt the guest for the host to handle the interrupt. 
Additionally, 104 + if you configure the system to handle interrupts on a separate physical 105 + core from that running your VCPU, you still have to interrupt the VCPU 106 + to queue the pending state onto the LR, even though the guest won't use 107 + this information until the guest ISR completes. Therefore, the HW 108 + bit should always be set for forwarded edge-triggered interrupts. With 109 + the HW bit set, the virtual interrupt is injected and additional 110 + physical interrupts occurring before the guest deactivates the interrupt 111 + simply mark the state on the physical distributor as Pending+Active. As 112 + soon as the guest deactivates the interrupt, the host takes another 113 + interrupt if and only if there was a physical interrupt between injecting 114 + the forwarded interrupt to the guest and the guest deactivating the 115 + interrupt. 116 + 117 + Consequently, whenever we schedule a VCPU with one or more LRs with the 118 + HW bit set, the interrupt must also be active on the physical 119 + distributor. 120 + 121 + 122 + Forwarded LPIs 123 + -------------- 124 + LPIs, introduced in GICv3, are always edge-triggered and do not have an 125 + active state. They become pending when a device signal them, and as 126 + soon as they are acked by the CPU, they are inactive again. 127 + 128 + It therefore doesn't make sense, and is not supported, to set the HW bit 129 + for physical LPIs that are forwarded to a VM as virtual interrupts, 130 + typically virtual SPIs. 131 + 132 + For LPIs, there is no other choice than to preempt the VCPU thread if 133 + necessary, and queue the pending state onto the LR. 134 + 135 + 136 + Putting It Together: The Architected Timer 137 + ------------------------------------------ 138 + The architected timer is a device that signals interrupts with level 139 + triggered semantics. The timer hardware is directly accessed by VCPUs 140 + which program the timer to fire at some point in time. 
Each VCPU on a 141 + system programs the timer to fire at different times, and therefore the 142 + hardware is multiplexed between multiple VCPUs. This is implemented by 143 + context-switching the timer state along with each VCPU thread. 144 + 145 + However, this means that a scenario like the following is entirely 146 + possible, and in fact, typical: 147 + 148 + 1. KVM runs the VCPU 149 + 2. The guest programs the timer to fire in T+100 150 + 3. The guest is idle and calls WFI (wait-for-interrupts) 151 + 4. The hardware traps to the host 152 + 5. KVM stores the timer state to memory and disables the hardware timer 153 + 6. KVM schedules a soft timer to fire in T+(100 - time since step 2) 154 + 7. KVM puts the VCPU thread to sleep (on a waitqueue) 155 + 8. The soft timer fires, waking up the VCPU thread 156 + 9. KVM reprograms the timer hardware with the VCPU's values 157 + 10. KVM marks the timer interrupt as active on the physical distributor 158 + 11. KVM injects a forwarded physical interrupt to the guest 159 + 12. KVM runs the VCPU 160 + 161 + Notice that KVM injects a forwarded physical interrupt in step 11 without 162 + the corresponding interrupt having actually fired on the host. That is 163 + exactly why we mark the timer interrupt as active in step 10, because 164 + the active state on the physical distributor is part of the state 165 + belonging to the timer hardware, which is context-switched along with 166 + the VCPU thread. 167 + 168 + If the guest does not idle because it is busy, the flow looks like this 169 + instead: 170 + 171 + 1. KVM runs the VCPU 172 + 2. The guest programs the timer to fire in T+100 173 + 3. At T+100 the timer fires and a physical IRQ causes the VM to exit 174 + (note that this initially only traps to EL2 and does not run the host ISR 175 + until KVM has returned to the host). 176 + 4.
With interrupts still disabled on the CPU coming back from the guest, KVM 177 + stores the virtual timer state to memory and disables the virtual hw timer. 178 + 5. KVM looks at the timer state (in memory) and injects a forwarded physical 179 + interrupt because it concludes the timer has expired. 180 + 6. KVM marks the timer interrupt as active on the physical distributor 181 + 7. KVM enables the timer, enables interrupts, and runs the VCPU 182 + 183 + Notice that again the forwarded physical interrupt is injected to the 184 + guest without having actually been handled on the host. In this case it 185 + is because the physical interrupt is never actually seen by the host: the 186 + timer is disabled upon guest return, and the virtual forwarded interrupt is 187 + injected on the KVM guest entry path.
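The forwarded-interrupt life cycle laid out above (LR.Pending on injection, Pending to Active on the guest's IAR read, physical deactivation on guest EOI with LR.Active left set, and the LR recycled once the physical distributor is inactive) can be restated as a toy state machine. The struct and all names below are invented for illustration; this is not the kernel's vgic code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state machine for the forwarded-interrupt life cycle above. */
enum lr_state { LR_EMPTY, LR_PENDING, LR_ACTIVE };

struct fwd_irq {
	enum lr_state lr;
	bool hw_bit;       /* LR links a physical IRQ */
	bool phys_active;  /* active state on the physical distributor */
};

/* Host acks the IRQ (or KVM sets the active state for shared IRQs)
 * and injects it: with the HW bit, only Pending OR Active is legal. */
static void fwd_inject(struct fwd_irq *irq)
{
	irq->phys_active = true;
	irq->hw_bit = true;
	irq->lr = LR_PENDING;
}

/* Guest reads the IAR: LR.Pending -> LR.Active. */
static void guest_ack(struct fwd_irq *irq)
{
	if (irq->lr == LR_PENDING)
		irq->lr = LR_ACTIVE;
}

/* Guest deactivates: the HW bit makes the GIC clear the physical
 * distributor's active state, while LR.Active is left untouched. */
static void guest_deactivate(struct fwd_irq *irq)
{
	if (irq->lr == LR_ACTIVE && irq->hw_bit)
		irq->phys_active = false;
}

/* On VM exit, KVM may only recycle the LR once the physical
 * distributor's active state has been cleared. */
static void kvm_maintain_lr(struct fwd_irq *irq)
{
	if (!irq->phys_active)
		irq->lr = LR_EMPTY;
}
```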
+10 -8
Documentation/virtual/kvm/devices/arm-vgic.txt
··· 44 44 Attributes: 45 45 The attr field of kvm_device_attr encodes two values: 46 46 bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | 47 - values: | reserved | cpu id | offset | 47 + values: | reserved | vcpu_index | offset | 48 48 49 49 All distributor regs are (rw, 32-bit) 50 50 51 51 The offset is relative to the "Distributor base address" as defined in the 52 52 GICv2 specs. Getting or setting such a register has the same effect as 53 - reading or writing the register on the actual hardware from the cpu 54 - specified with cpu id field. Note that most distributor fields are not 55 - banked, but return the same value regardless of the cpu id used to access 56 - the register. 53 + reading or writing the register on the actual hardware from the cpu whose 54 + index is specified with the vcpu_index field. Note that most distributor 55 + fields are not banked, but return the same value regardless of the 56 + vcpu_index used to access the register. 57 57 Limitations: 58 58 - Priorities are not implemented, and registers are RAZ/WI 59 59 - Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2. 60 60 Errors: 61 - -ENODEV: Getting or setting this register is not yet supported 61 + -ENXIO: Getting or setting this register is not yet supported 62 62 -EBUSY: One or more VCPUs are running 63 + -EINVAL: Invalid vcpu_index supplied 63 64 64 65 KVM_DEV_ARM_VGIC_GRP_CPU_REGS 65 66 Attributes: 66 67 The attr field of kvm_device_attr encodes two values: 67 68 bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | 68 - values: | reserved | cpu id | offset | 69 + values: | reserved | vcpu_index | offset | 69 70 70 71 All CPU interface regs are (rw, 32-bit) 71 72 ··· 92 91 - Priorities are not implemented, and registers are RAZ/WI 93 92 - Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2. 
94 93 Errors: 95 - -ENODEV: Getting or setting this register is not yet supported 94 + -ENXIO: Getting or setting this register is not yet supported 96 95 -EBUSY: One or more VCPUs are running 96 + -EINVAL: Invalid vcpu_index supplied 97 97 98 98 KVM_DEV_ARM_VGIC_GRP_NR_IRQS 99 99 Attributes:
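The attr layout quoted above (bits 63..40 reserved, 39..32 vcpu_index, 31..0 register offset) is easy to get wrong by hand. A small helper can make it explicit — the function names here are ours, only the bit layout comes from the documented ABI:

```c
#include <assert.h>
#include <stdint.h>

/* Encode the KVM_DEV_ARM_VGIC_GRP_*_REGS attr field:
 * bits 63..40 reserved, 39..32 vcpu_index, 31..0 offset.
 * Helper names are illustrative, not part of the kernel ABI. */
static uint64_t vgic_attr(uint8_t vcpu_index, uint32_t offset)
{
	return ((uint64_t)vcpu_index << 32) | offset;
}

static uint8_t vgic_attr_vcpu(uint64_t attr)
{
	return (attr >> 32) & 0xff;
}

static uint32_t vgic_attr_offset(uint64_t attr)
{
	return attr & 0xffffffffu;
}
```

The encoded value would go in the .attr field of struct kvm_device_attr passed to KVM_GET_DEVICE_ATTR/KVM_SET_DEVICE_ATTR with the appropriate group.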
+12
Documentation/virtual/kvm/locking.txt
··· 166 166 MMIO/PIO address->device structure mapping (kvm->buses). 167 167 The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu 168 168 if it is needed by multiple functions. 169 + 170 + Name: blocked_vcpu_on_cpu_lock 171 + Type: spinlock_t 172 + Arch: x86 173 + Protects: blocked_vcpu_on_cpu 174 + Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts. 175 + When VT-d posted interrupts are supported and the VM has assigned 176 + devices, we put each blocked vCPU on the list blocked_vcpu_on_cpu, 177 + protected by blocked_vcpu_on_cpu_lock. When the VT-d hardware issues 178 + a wakeup notification event (because an external interrupt from an 179 + assigned device has arrived), we find the vCPU on that list and 180 + wake it up.
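The blocked_vcpu_on_cpu bookkeeping described in that comment can be sketched single-threaded, with the spinlock elided. In the kernel the list is per physical CPU and every access is serialized by blocked_vcpu_on_cpu_lock; every name below is an illustrative stand-in, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy, single-threaded model of the per-CPU blocked-vCPU list used
 * for VT-d posted interrupts.  Locking is elided; in the kernel each
 * access is under blocked_vcpu_on_cpu_lock. */
#define MAX_BLOCKED 16

struct toy_vcpu {
	int id;
	int woken;
};

struct toy_pcpu {
	struct toy_vcpu *blocked[MAX_BLOCKED]; /* blocked_vcpu_on_cpu */
	int nr;
};

/* A vCPU with assigned devices blocks on this physical CPU. */
static void toy_block(struct toy_pcpu *p, struct toy_vcpu *v)
{
	if (p->nr < MAX_BLOCKED)
		p->blocked[p->nr++] = v;
}

/* Wakeup-notification handler: a posted interrupt arrived for some
 * blocked vCPU, so scan the list and wake the parked vCPUs. */
static void toy_wakeup_handler(struct toy_pcpu *p)
{
	for (int i = 0; i < p->nr; i++)
		p->blocked[i]->woken = 1;
	p->nr = 0;
}
```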
+7
MAINTAINERS
··· 11348 11348 S: Maintained 11349 11349 F: drivers/net/ethernet/via/via-velocity.* 11350 11350 11351 + VIRT LIB 11352 + M: Alex Williamson <alex.williamson@redhat.com> 11353 + M: Paolo Bonzini <pbonzini@redhat.com> 11354 + L: kvm@vger.kernel.org 11355 + S: Supported 11356 + F: virt/lib/ 11357 + 11351 11358 VIVID VIRTUAL VIDEO DRIVER 11352 11359 M: Hans Verkuil <hverkuil@xs4all.nl> 11353 11360 L: linux-media@vger.kernel.org
+6 -4
Makefile
··· 550 550 net-y := net/ 551 551 libs-y := lib/ 552 552 core-y := usr/ 553 + virt-y := virt/ 553 554 endif # KBUILD_EXTMOD 554 555 555 556 ifeq ($(dot-config),1) ··· 883 882 884 883 vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ 885 884 $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ 886 - $(net-y) $(net-m) $(libs-y) $(libs-m))) 885 + $(net-y) $(net-m) $(libs-y) $(libs-m) $(virt-y))) 887 886 888 887 vmlinux-alldirs := $(sort $(vmlinux-dirs) $(patsubst %/,%,$(filter %/, \ 889 - $(init-) $(core-) $(drivers-) $(net-) $(libs-)))) 888 + $(init-) $(core-) $(drivers-) $(net-) $(libs-) $(virt-)))) 890 889 891 890 init-y := $(patsubst %/, %/built-in.o, $(init-y)) 892 891 core-y := $(patsubst %/, %/built-in.o, $(core-y)) ··· 895 894 libs-y1 := $(patsubst %/, %/lib.a, $(libs-y)) 896 895 libs-y2 := $(patsubst %/, %/built-in.o, $(libs-y)) 897 896 libs-y := $(libs-y1) $(libs-y2) 897 + virt-y := $(patsubst %/, %/built-in.o, $(virt-y)) 898 898 899 899 # Externally visible symbols (used by link-vmlinux.sh) 900 900 export KBUILD_VMLINUX_INIT := $(head-y) $(init-y) 901 - export KBUILD_VMLINUX_MAIN := $(core-y) $(libs-y) $(drivers-y) $(net-y) 901 + export KBUILD_VMLINUX_MAIN := $(core-y) $(libs-y) $(drivers-y) $(net-y) $(virt-y) 902 902 export KBUILD_LDS := arch/$(SRCARCH)/kernel/vmlinux.lds 903 903 export LDFLAGS_vmlinux 904 904 # used by scripts/package/Makefile 905 - export KBUILD_ALLDIRS := $(sort $(filter-out arch/%,$(vmlinux-alldirs)) arch Documentation include samples scripts tools virt) 905 + export KBUILD_ALLDIRS := $(sort $(filter-out arch/%,$(vmlinux-alldirs)) arch Documentation include samples scripts tools) 906 906 907 907 vmlinux-deps := $(KBUILD_LDS) $(KBUILD_VMLINUX_INIT) $(KBUILD_VMLINUX_MAIN) 908 908
+20
arch/arm/include/asm/kvm_arm.h
··· 218 218 #define HSR_DABT_CM (1U << 8) 219 219 #define HSR_DABT_EA (1U << 9) 220 220 221 + #define kvm_arm_exception_type \ 222 + {0, "RESET" }, \ 223 + {1, "UNDEFINED" }, \ 224 + {2, "SOFTWARE" }, \ 225 + {3, "PREF_ABORT" }, \ 226 + {4, "DATA_ABORT" }, \ 227 + {5, "IRQ" }, \ 228 + {6, "FIQ" }, \ 229 + {7, "HVC" } 230 + 231 + #define HSRECN(x) { HSR_EC_##x, #x } 232 + 233 + #define kvm_arm_exception_class \ 234 + HSRECN(UNKNOWN), HSRECN(WFI), HSRECN(CP15_32), HSRECN(CP15_64), \ 235 + HSRECN(CP14_MR), HSRECN(CP14_LS), HSRECN(CP_0_13), HSRECN(CP10_ID), \ 236 + HSRECN(JAZELLE), HSRECN(BXJ), HSRECN(CP14_64), HSRECN(SVC_HYP), \ 237 + HSRECN(HVC), HSRECN(SMC), HSRECN(IABT), HSRECN(IABT_HYP), \ 238 + HSRECN(DABT), HSRECN(DABT_HYP) 239 + 240 + 221 241 #endif /* __ARM_KVM_ARM_H__ */
+4 -1
arch/arm/include/asm/kvm_host.h
··· 126 126 * here. 127 127 */ 128 128 129 - /* Don't run the guest on this vcpu */ 129 + /* vcpu power-off state */ 130 + bool power_off; 131 + 132 + /* Don't run the guest (internal implementation need) */ 130 133 bool pause; 131 134 132 135 /* IO related fields */
+2
arch/arm/kvm/Kconfig
··· 46 46 ---help--- 47 47 Provides host support for ARM processors. 48 48 49 + source drivers/vhost/Kconfig 50 + 49 51 endif # VIRTUALIZATION
+60 -16
arch/arm/kvm/arm.c
··· 271 271 return kvm_timer_should_fire(vcpu); 272 272 } 273 273 274 + void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) 275 + { 276 + kvm_timer_schedule(vcpu); 277 + } 278 + 279 + void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) 280 + { 281 + kvm_timer_unschedule(vcpu); 282 + } 283 + 274 284 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) 275 285 { 276 286 /* Force users to call KVM_ARM_VCPU_INIT */ ··· 318 308 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, 319 309 struct kvm_mp_state *mp_state) 320 310 { 321 - if (vcpu->arch.pause) 311 + if (vcpu->arch.power_off) 322 312 mp_state->mp_state = KVM_MP_STATE_STOPPED; 323 313 else 324 314 mp_state->mp_state = KVM_MP_STATE_RUNNABLE; ··· 331 321 { 332 322 switch (mp_state->mp_state) { 333 323 case KVM_MP_STATE_RUNNABLE: 334 - vcpu->arch.pause = false; 324 + vcpu->arch.power_off = false; 335 325 break; 336 326 case KVM_MP_STATE_STOPPED: 337 - vcpu->arch.pause = true; 327 + vcpu->arch.power_off = true; 338 328 break; 339 329 default: 340 330 return -EINVAL; ··· 352 342 */ 353 343 int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) 354 344 { 355 - return !!v->arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); 345 + return ((!!v->arch.irq_lines || kvm_vgic_vcpu_pending_irq(v)) 346 + && !v->arch.power_off && !v->arch.pause); 356 347 } 357 348 358 349 /* Just ensure a guest exit from a particular CPU */ ··· 479 468 return vgic_initialized(kvm); 480 469 } 481 470 482 - static void vcpu_pause(struct kvm_vcpu *vcpu) 471 + static void kvm_arm_halt_guest(struct kvm *kvm) __maybe_unused; 472 + static void kvm_arm_resume_guest(struct kvm *kvm) __maybe_unused; 473 + 474 + static void kvm_arm_halt_guest(struct kvm *kvm) 475 + { 476 + int i; 477 + struct kvm_vcpu *vcpu; 478 + 479 + kvm_for_each_vcpu(i, vcpu, kvm) 480 + vcpu->arch.pause = true; 481 + force_vm_exit(cpu_all_mask); 482 + } 483 + 484 + static void kvm_arm_resume_guest(struct kvm *kvm) 485 + { 486 + int i; 487 + struct kvm_vcpu *vcpu; 488 + 489 + kvm_for_each_vcpu(i, 
vcpu, kvm) { 490 + wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu); 491 + 492 + vcpu->arch.pause = false; 493 + wake_up_interruptible(wq); 494 + } 495 + } 496 + 497 + static void vcpu_sleep(struct kvm_vcpu *vcpu) 483 498 { 484 499 wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu); 485 500 486 - wait_event_interruptible(*wq, !vcpu->arch.pause); 501 + wait_event_interruptible(*wq, ((!vcpu->arch.power_off) && 502 + (!vcpu->arch.pause))); 487 503 } 488 504 489 505 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu) ··· 560 522 561 523 update_vttbr(vcpu->kvm); 562 524 563 - if (vcpu->arch.pause) 564 - vcpu_pause(vcpu); 525 + if (vcpu->arch.power_off || vcpu->arch.pause) 526 + vcpu_sleep(vcpu); 565 527 566 528 /* 567 529 * Disarming the background timer must be done in a ··· 587 549 run->exit_reason = KVM_EXIT_INTR; 588 550 } 589 551 590 - if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) { 552 + if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) || 553 + vcpu->arch.power_off || vcpu->arch.pause) { 591 554 local_irq_enable(); 555 + kvm_timer_sync_hwstate(vcpu); 592 556 kvm_vgic_sync_hwstate(vcpu); 593 557 preempt_enable(); 594 - kvm_timer_sync_hwstate(vcpu); 595 558 continue; 596 559 } 597 560 ··· 635 596 * guest time. 636 597 */ 637 598 kvm_guest_exit(); 638 - trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu)); 599 + trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu)); 600 + 601 + /* 602 + * We must sync the timer state before the vgic state so that 603 + * the vgic can properly sample the updated state of the 604 + * interrupt line. 605 + */ 606 + kvm_timer_sync_hwstate(vcpu); 639 607 640 608 kvm_vgic_sync_hwstate(vcpu); 641 609 642 610 preempt_enable(); 643 - 644 - kvm_timer_sync_hwstate(vcpu); 645 611 646 612 ret = handle_exit(vcpu, run, ret); 647 613 } ··· 809 765 vcpu_reset_hcr(vcpu); 810 766 811 767 /* 812 - * Handle the "start in power-off" case by marking the VCPU as paused. 768 + * Handle the "start in power-off" case. 
813 769 */ 814 770 if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features)) 815 - vcpu->arch.pause = true; 771 + vcpu->arch.power_off = true; 816 772 else 817 - vcpu->arch.pause = false; 773 + vcpu->arch.power_off = false; 818 774 819 775 return 0; 820 776 }
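The new runnable condition in the arm.c diff above — an interrupt must be pending, and neither power_off (the guest-visible PSCI state) nor pause (the host-internal synchronous halt) may be set — can be restated as a tiny predicate. The struct and helper here are illustrative stand-ins, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy restatement of kvm_arch_vcpu_runnable() after the split of the
 * old 'pause' flag into power_off and pause.  Illustrative only. */
struct toy_vcpu {
	bool irq_pending;  /* stands in for irq_lines / vgic pending */
	bool power_off;    /* PSCI CPU_OFF or start-in-power-off */
	bool pause;        /* kvm_arm_halt_guest() in flight */
};

static bool toy_vcpu_runnable(const struct toy_vcpu *v)
{
	return v->irq_pending && !v->power_off && !v->pause;
}
```

This is the predicate vcpu_sleep() in the diff waits on (inverted): the thread stays on the waitqueue until both power_off and pause are clear.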
+5 -5
arch/arm/kvm/psci.c
··· 63 63 64 64 static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu) 65 65 { 66 - vcpu->arch.pause = true; 66 + vcpu->arch.power_off = true; 67 67 } 68 68 69 69 static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) ··· 87 87 */ 88 88 if (!vcpu) 89 89 return PSCI_RET_INVALID_PARAMS; 90 - if (!vcpu->arch.pause) { 90 + if (!vcpu->arch.power_off) { 91 91 if (kvm_psci_version(source_vcpu) != KVM_ARM_PSCI_0_1) 92 92 return PSCI_RET_ALREADY_ON; 93 93 else ··· 115 115 * the general purpose registers are undefined upon CPU_ON. 116 116 */ 117 117 *vcpu_reg(vcpu, 0) = context_id; 118 - vcpu->arch.pause = false; 118 + vcpu->arch.power_off = false; 119 119 smp_mb(); /* Make sure the above is visible */ 120 120 121 121 wq = kvm_arch_vcpu_wq(vcpu); ··· 153 153 mpidr = kvm_vcpu_get_mpidr_aff(tmp); 154 154 if ((mpidr & target_affinity_mask) == target_affinity) { 155 155 matching_cpus++; 156 - if (!tmp->arch.pause) 156 + if (!tmp->arch.power_off) 157 157 return PSCI_0_2_AFFINITY_LEVEL_ON; 158 158 } 159 159 } ··· 179 179 * re-initialized. 180 180 */ 181 181 kvm_for_each_vcpu(i, tmp, vcpu->kvm) { 182 - tmp->arch.pause = true; 182 + tmp->arch.power_off = true; 183 183 kvm_vcpu_kick(tmp); 184 184 } 185 185
+7 -3
arch/arm/kvm/trace.h
··· 25 25 ); 26 26 27 27 TRACE_EVENT(kvm_exit, 28 - TP_PROTO(unsigned int exit_reason, unsigned long vcpu_pc), 29 - TP_ARGS(exit_reason, vcpu_pc), 28 + TP_PROTO(int idx, unsigned int exit_reason, unsigned long vcpu_pc), 29 + TP_ARGS(idx, exit_reason, vcpu_pc), 30 30 31 31 TP_STRUCT__entry( 32 + __field( int, idx ) 32 33 __field( unsigned int, exit_reason ) 33 34 __field( unsigned long, vcpu_pc ) 34 35 ), 35 36 36 37 TP_fast_assign( 38 + __entry->idx = idx; 37 39 __entry->exit_reason = exit_reason; 38 40 __entry->vcpu_pc = vcpu_pc; 39 41 ), 40 42 41 - TP_printk("HSR_EC: 0x%04x, PC: 0x%08lx", 43 + TP_printk("%s: HSR_EC: 0x%04x (%s), PC: 0x%08lx", 44 + __print_symbolic(__entry->idx, kvm_arm_exception_type), 42 45 __entry->exit_reason, 46 + __print_symbolic(__entry->exit_reason, kvm_arm_exception_class), 43 47 __entry->vcpu_pc) 44 48 ); 45 49
+16
arch/arm64/include/asm/kvm_arm.h
··· 200 200 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */ 201 201 #define HPFAR_MASK (~UL(0xf)) 202 202 203 + #define kvm_arm_exception_type \ 204 + {0, "IRQ" }, \ 205 + {1, "TRAP" } 206 + 207 + #define ECN(x) { ESR_ELx_EC_##x, #x } 208 + 209 + #define kvm_arm_exception_class \ 210 + ECN(UNKNOWN), ECN(WFx), ECN(CP15_32), ECN(CP15_64), ECN(CP14_MR), \ 211 + ECN(CP14_LS), ECN(FP_ASIMD), ECN(CP10_ID), ECN(CP14_64), ECN(SVC64), \ 212 + ECN(HVC64), ECN(SMC64), ECN(SYS64), ECN(IMP_DEF), ECN(IABT_LOW), \ 213 + ECN(IABT_CUR), ECN(PC_ALIGN), ECN(DABT_LOW), ECN(DABT_CUR), \ 214 + ECN(SP_ALIGN), ECN(FP_EXC32), ECN(FP_EXC64), ECN(SERROR), \ 215 + ECN(BREAKPT_LOW), ECN(BREAKPT_CUR), ECN(SOFTSTP_LOW), \ 216 + ECN(SOFTSTP_CUR), ECN(WATCHPT_LOW), ECN(WATCHPT_CUR), \ 217 + ECN(BKPT32), ECN(VECTOR32), ECN(BRK64) 218 + 203 219 #endif /* __ARM64_KVM_ARM_H__ */
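The tables added above are consumed by __print_symbolic() in the tracepoint output (see the trace.h diff below). A plain-C toy equivalent of that value-to-name lookup — not the tracing infrastructure itself, just the mapping it performs — looks like:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy equivalent of __print_symbolic() applied to the arm64
 * kvm_arm_exception_type table above.  Illustrative only. */
struct sym {
	int val;
	const char *name;
};

static const struct sym kvm_arm64_exception_type[] = {
	{ 0, "IRQ" },
	{ 1, "TRAP" },
};

/* Linear scan of the table; unknown values fall back to "UNKNOWN",
 * much as the trace output prints the raw value. */
static const char *sym_name(const struct sym *tbl, size_t n, int val)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].val == val)
			return tbl[i].name;
	return "UNKNOWN";
}
```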
+4 -1
arch/arm64/include/asm/kvm_host.h
··· 149 149 u32 mdscr_el1; 150 150 } guest_debug_preserved; 151 151 152 - /* Don't run the guest */ 152 + /* vcpu power-off state */ 153 + bool power_off; 154 + 155 + /* Don't run the guest (internal implementation need) */ 153 156 bool pause; 154 157 155 158 /* IO related fields */
+2
arch/arm64/kvm/Kconfig
··· 48 48 ---help--- 49 49 Provides host support for ARM processors. 50 50 51 + source drivers/vhost/Kconfig 52 + 51 53 endif # VIRTUALIZATION
+8
arch/arm64/kvm/hyp.S
··· 880 880 881 881 bl __restore_sysregs 882 882 883 + /* 884 + * Make sure we have a valid host stack, and don't leave junk in the 885 + * frame pointer that will give us a misleading host stack unwinding. 886 + */ 887 + ldr x22, [x2, #CPU_GP_REG_OFFSET(CPU_SP_EL1)] 888 + msr sp_el1, x22 889 + mov x29, xzr 890 + 883 891 1: adr x0, __hyp_panic_str 884 892 adr x1, 2f 885 893 ldp x2, x3, [x1]
+2
arch/mips/include/asm/kvm_host.h
··· 847 847 struct kvm_memory_slot *slot) {} 848 848 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} 849 849 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} 850 + static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} 851 + static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} 850 852 851 853 #endif /* __MIPS_KVM_HOST_H__ */
+5
arch/powerpc/include/asm/disassemble.h
··· 42 42 return ((inst >> 16) & 0x1f) | ((inst >> 6) & 0x3e0); 43 43 } 44 44 45 + static inline unsigned int get_tmrn(u32 inst) 46 + { 47 + return ((inst >> 16) & 0x1f) | ((inst >> 6) & 0x3e0); 48 + } 49 + 45 50 static inline unsigned int get_rt(u32 inst) 46 51 { 47 52 return (inst >> 21) & 0x1f;
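The TMR number in an mftmr instruction is encoded like an SPR number: a 10-bit field stored as two swapped 5-bit halves. A standalone sketch of the same decode as the get_tmrn() added above (the `decode_tmrn` name is just for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Bits 16..20 of the instruction give the low 5 bits of the TMR
 * number; bits 11..15 give the high 5 bits. */
static unsigned int decode_tmrn(uint32_t inst)
{
	return ((inst >> 16) & 0x1f) | ((inst >> 6) & 0x3e0);
}
```

TMRN_TMCFG0 (16) fits entirely in the low half, so an mftmr of TMCFG0 only sets bits 16..20 of the instruction word.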
+2
arch/powerpc/include/asm/kvm_host.h
··· 716 716 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {} 717 717 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} 718 718 static inline void kvm_arch_exit(void) {} 719 + static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} 720 + static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} 719 721 720 722 #endif /* __POWERPC_KVM_HOST_H__ */
+6
arch/powerpc/include/asm/reg_booke.h
··· 742 742 #define MMUBE1_VBE4 0x00000002 743 743 #define MMUBE1_VBE5 0x00000001 744 744 745 + #define TMRN_TMCFG0 16 /* Thread Management Configuration Register 0 */ 746 + #define TMRN_TMCFG0_NPRIBITS 0x003f0000 /* Bits of thread priority */ 747 + #define TMRN_TMCFG0_NPRIBITS_SHIFT 16 748 + #define TMRN_TMCFG0_NATHRD 0x00003f00 /* Number of active threads */ 749 + #define TMRN_TMCFG0_NATHRD_SHIFT 8 750 + #define TMRN_TMCFG0_NTHRD 0x0000003f /* Number of threads */ 745 751 #define TMRN_IMSR0 0x120 /* Initial MSR Register 0 (e6500) */ 746 752 #define TMRN_IMSR1 0x121 /* Initial MSR Register 1 (e6500) */ 747 753 #define TMRN_INIA0 0x140 /* Next Instruction Address Register 0 */
+2 -1
arch/powerpc/kvm/book3s_64_mmu_hv.c
··· 70 70 } 71 71 72 72 /* Lastly try successively smaller sizes from the page allocator */ 73 - while (!hpt && order > PPC_MIN_HPT_ORDER) { 73 + /* Only do this if userspace didn't specify a size via ioctl */ 74 + while (!hpt && order > PPC_MIN_HPT_ORDER && !htab_orderp) { 74 75 hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT| 75 76 __GFP_NOWARN, order - PAGE_SHIFT); 76 77 if (!hpt)
+2
arch/powerpc/kvm/book3s_hv_rm_mmu.c
··· 470 470 note_hpte_modification(kvm, rev); 471 471 unlock_hpte(hpte, 0); 472 472 473 + if (v & HPTE_V_ABSENT) 474 + v = (v & ~HPTE_V_ABSENT) | HPTE_V_VALID; 473 475 hpret[0] = v; 474 476 hpret[1] = r; 475 477 return H_SUCCESS;
+22 -7
arch/powerpc/kvm/book3s_hv_rmhandlers.S
··· 150 150 cmpwi cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK 151 151 cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL 152 152 beq 11f 153 + cmpwi r12, BOOK3S_INTERRUPT_H_DOORBELL 154 + beq 15f /* Invoke the H_DOORBELL handler */ 153 155 cmpwi cr2, r12, BOOK3S_INTERRUPT_HMI 154 156 beq cr2, 14f /* HMI check */ ··· 175 173 14: mtspr SPRN_HSRR0, r8 176 174 mtspr SPRN_HSRR1, r7 177 175 b hmi_exception_after_realmode 176 + 177 + 15: mtspr SPRN_HSRR0, r8 178 + mtspr SPRN_HSRR1, r7 179 + ba 0xe80 178 180 179 181 kvmppc_primary_no_guest: 180 182 /* We handle this much like a ceded vcpu */ ··· 2383 2377 mr r3, r9 /* get vcpu pointer */ 2384 2378 bl kvmppc_realmode_machine_check 2385 2379 nop 2386 - cmpdi r3, 0 /* Did we handle MCE ? */ 2387 2380 ld r9, HSTATE_KVM_VCPU(r13) 2388 2381 li r12, BOOK3S_INTERRUPT_MACHINE_CHECK 2389 2382 /* ··· 2395 2390 * The old code used to return to host for unhandled errors which 2396 2391 * was causing guest to hang with soft lockups inside guest and 2397 2392 * makes it difficult to recover guest instance. 2393 + * 2394 + * if we receive machine check with MSR(RI=0) then deliver it to 2395 + * guest as machine check causing guest to crash. 2398 2396 */ 2399 - ld r10, VCPU_PC(r9) 2400 2397 ld r11, VCPU_MSR(r9) 2398 + andi. r10, r11, MSR_RI /* check for unrecoverable exception */ 2399 + beq 1f /* Deliver a machine check to guest */ 2400 + ld r10, VCPU_PC(r9) 2401 + cmpdi r3, 0 /* Did we handle MCE ? */ 2401 2402 bne 2f /* Continue guest execution. */ 2402 2403 /* If not, deliver a machine check. SRR0/1 are already set */ 2403 - li r10, BOOK3S_INTERRUPT_MACHINE_CHECK 2404 - ld r11, VCPU_MSR(r9) 2404 + 1: li r10, BOOK3S_INTERRUPT_MACHINE_CHECK 2405 2405 bl kvmppc_msr_interrupt 2406 2406 2: b fast_interrupt_c_return 2407 2407 ··· 2446 2436 2447 2437 /* hypervisor doorbell */ 2448 2438 3: li r12, BOOK3S_INTERRUPT_H_DOORBELL 2439 + 2440 + /* 2441 + * Clear the doorbell as we will invoke the handler 2442 + * explicitly in the guest exit path. 2443 + */ 2444 + lis r6, (PPC_DBELL_SERVER << (63-36))@h 2445 + PPC_MSGCLR(6) 2449 2446 /* see if it's a host IPI */ 2450 2447 li r3, 1 2451 2448 lbz r0, HSTATE_HOST_IPI(r13) 2452 2449 cmpwi r0, 0 2453 2450 bnelr 2454 - /* if not, clear it and return -1 */ 2455 - lis r6, (PPC_DBELL_SERVER << (63-36))@h 2456 - PPC_MSGCLR(6) 2451 + /* if not, return -1 */ 2457 2452 li r3, -1 2458 2453 blr 2459 2454
+2 -1
arch/powerpc/kvm/e500.c
··· 237 237 struct kvm_book3e_206_tlb_entry *gtlbe) 238 238 { 239 239 struct vcpu_id_table *idt = vcpu_e500->idt; 240 - unsigned int pr, tid, ts, pid; 240 + unsigned int pr, tid, ts; 241 + int pid; 241 242 u32 val, eaddr; 242 243 unsigned long flags; 243 244
+19
arch/powerpc/kvm/e500_emulate.c
··· 15 15 #include <asm/kvm_ppc.h> 16 16 #include <asm/disassemble.h> 17 17 #include <asm/dbell.h> 18 + #include <asm/reg_booke.h> 18 19 19 20 #include "booke.h" 20 21 #include "e500.h" ··· 23 22 #define XOP_DCBTLS 166 24 23 #define XOP_MSGSND 206 25 24 #define XOP_MSGCLR 238 25 + #define XOP_MFTMR 366 26 26 #define XOP_TLBIVAX 786 27 27 #define XOP_TLBSX 914 28 28 #define XOP_TLBRE 946 ··· 115 113 return EMULATE_DONE; 116 114 } 117 115 116 + static int kvmppc_e500_emul_mftmr(struct kvm_vcpu *vcpu, unsigned int inst, 117 + int rt) 118 + { 119 + /* Expose one thread per vcpu */ 120 + if (get_tmrn(inst) == TMRN_TMCFG0) { 121 + kvmppc_set_gpr(vcpu, rt, 122 + 1 | (1 << TMRN_TMCFG0_NATHRD_SHIFT)); 123 + return EMULATE_DONE; 124 + } 125 + 126 + return EMULATE_FAIL; 127 + } 128 + 118 129 int kvmppc_core_emulate_op_e500(struct kvm_run *run, struct kvm_vcpu *vcpu, 119 130 unsigned int inst, int *advance) 120 131 { ··· 178 163 case XOP_TLBIVAX: 179 164 ea = kvmppc_get_ea_indexed(vcpu, ra, rb); 180 165 emulated = kvmppc_e500_emul_tlbivax(vcpu, ea); 166 + break; 167 + 168 + case XOP_MFTMR: 169 + emulated = kvmppc_e500_emul_mftmr(vcpu, inst, rt); 181 170 break; 182 171 183 172 case XOP_EHPRIV:
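The TMCFG0 value synthesized for the guest by kvmppc_e500_emul_mftmr() is `1 | (1 << TMRN_TMCFG0_NATHRD_SHIFT)`: one thread total, one active thread. A sketch decoding it with the masks added to reg_booke.h in this series (constants copied here so the example is self-contained):

```c
#include <assert.h>
#include <stdint.h>

/* Masks/shifts as added to reg_booke.h above */
#define TMRN_TMCFG0_NATHRD       0x00003f00
#define TMRN_TMCFG0_NATHRD_SHIFT 8
#define TMRN_TMCFG0_NTHRD        0x0000003f

/* The value the emulation hands back for a guest mftmr of TMCFG0 */
static uint32_t guest_tmcfg0(void)
{
	return 1 | (1 << TMRN_TMCFG0_NATHRD_SHIFT);
}
```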
+2 -2
arch/powerpc/kvm/e500_mmu_host.c
··· 406 406 407 407 for (; tsize > BOOK3E_PAGESZ_4K; tsize -= 2) { 408 408 unsigned long gfn_start, gfn_end; 409 - tsize_pages = 1 << (tsize - 2); 409 + tsize_pages = 1UL << (tsize - 2); 410 410 411 411 gfn_start = gfn & ~(tsize_pages - 1); 412 412 gfn_end = gfn_start + tsize_pages; ··· 447 447 } 448 448 449 449 if (likely(!pfnmap)) { 450 - tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT); 450 + tsize_pages = 1UL << (tsize + 10 - PAGE_SHIFT); 451 451 pfn = gfn_to_pfn_memslot(slot, gfn); 452 452 if (is_error_noslot_pfn(pfn)) { 453 453 if (printk_ratelimit())
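The `1` -> `1UL` change matters because `tsize + 10 - PAGE_SHIFT` can reach 32 or more for large page sizes; shifting a plain 32-bit `int` that far is undefined behaviour, while `1UL` keeps the shift well defined on 64-bit builds. A minimal demonstration (assumes a 64-bit `unsigned long`):

```c
#include <assert.h>

/* With a plain int, 1 << 32 would be undefined behaviour; with a
 * 64-bit unsigned long the page count stays well defined. */
static unsigned long pages_for_shift(int shift)
{
	return 1UL << shift;
}
```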
+3
arch/powerpc/kvm/powerpc.c
··· 559 559 else 560 560 r = num_online_cpus(); 561 561 break; 562 + case KVM_CAP_NR_MEMSLOTS: 563 + r = KVM_USER_MEM_SLOTS; 564 + break; 562 565 case KVM_CAP_MAX_VCPUS: 563 566 r = KVM_MAX_VCPUS; 564 567 break;
+2
arch/s390/include/asm/kvm_host.h
··· 644 644 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {} 645 645 static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm, 646 646 struct kvm_memory_slot *slot) {} 647 + static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} 648 + static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} 647 649 648 650 #endif
+21 -21
arch/s390/kvm/intercept.c
··· 336 336 return -EOPNOTSUPP; 337 337 } 338 338 339 - static const intercept_handler_t intercept_funcs[] = { 340 - [0x00 >> 2] = handle_noop, 341 - [0x04 >> 2] = handle_instruction, 342 - [0x08 >> 2] = handle_prog, 343 - [0x10 >> 2] = handle_noop, 344 - [0x14 >> 2] = handle_external_interrupt, 345 - [0x18 >> 2] = handle_noop, 346 - [0x1C >> 2] = kvm_s390_handle_wait, 347 - [0x20 >> 2] = handle_validity, 348 - [0x28 >> 2] = handle_stop, 349 - [0x38 >> 2] = handle_partial_execution, 350 - }; 351 - 352 339 int kvm_handle_sie_intercept(struct kvm_vcpu *vcpu) 353 340 { 354 - intercept_handler_t func; 355 - u8 code = vcpu->arch.sie_block->icptcode; 356 - 357 - if (code & 3 || (code >> 2) >= ARRAY_SIZE(intercept_funcs)) 341 + switch (vcpu->arch.sie_block->icptcode) { 342 + case 0x00: 343 + case 0x10: 344 + case 0x18: 345 + return handle_noop(vcpu); 346 + case 0x04: 347 + return handle_instruction(vcpu); 348 + case 0x08: 349 + return handle_prog(vcpu); 350 + case 0x14: 351 + return handle_external_interrupt(vcpu); 352 + case 0x1c: 353 + return kvm_s390_handle_wait(vcpu); 354 + case 0x20: 355 + return handle_validity(vcpu); 356 + case 0x28: 357 + return handle_stop(vcpu); 358 + case 0x38: 359 + return handle_partial_execution(vcpu); 360 + default: 358 361 return -EOPNOTSUPP; 359 - func = intercept_funcs[code >> 2]; 360 - if (func) 361 - return func(vcpu); 362 - return -EOPNOTSUPP; 362 + } 363 363 }
+43 -73
arch/s390/kvm/interrupt.c
··· 51 51 52 52 static int psw_interrupts_disabled(struct kvm_vcpu *vcpu) 53 53 { 54 - if ((vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PER) || 55 - (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_IO) || 56 - (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_EXT)) 57 - return 0; 58 - return 1; 54 + return psw_extint_disabled(vcpu) && 55 + psw_ioint_disabled(vcpu) && 56 + psw_mchk_disabled(vcpu); 59 57 } 60 58 61 59 static int ckc_interrupts_enabled(struct kvm_vcpu *vcpu) ··· 69 71 70 72 static int ckc_irq_pending(struct kvm_vcpu *vcpu) 71 73 { 72 - preempt_disable(); 73 - if (!(vcpu->arch.sie_block->ckc < 74 - get_tod_clock_fast() + vcpu->arch.sie_block->epoch)) { 75 - preempt_enable(); 74 + if (vcpu->arch.sie_block->ckc >= kvm_s390_get_tod_clock_fast(vcpu->kvm)) 76 75 return 0; 77 - } 78 - preempt_enable(); 79 76 return ckc_interrupts_enabled(vcpu); 80 77 } ··· 102 109 return (int_word & 0x38000000) >> 27; 103 110 } 104 111 105 - static inline unsigned long pending_floating_irqs(struct kvm_vcpu *vcpu) 112 + static inline unsigned long pending_irqs(struct kvm_vcpu *vcpu) 106 113 { 107 - return vcpu->kvm->arch.float_int.pending_irqs; 108 - } 109 - 110 - static inline unsigned long pending_local_irqs(struct kvm_vcpu *vcpu) 111 - { 112 - return vcpu->arch.local_int.pending_irqs; 114 + return vcpu->kvm->arch.float_int.pending_irqs | 115 + vcpu->arch.local_int.pending_irqs; 113 116 } 114 117 115 118 static unsigned long disable_iscs(struct kvm_vcpu *vcpu, ··· 124 135 { 125 136 unsigned long active_mask; 126 137 127 - active_mask = pending_local_irqs(vcpu); 128 - active_mask |= pending_floating_irqs(vcpu); 138 + active_mask = pending_irqs(vcpu); 129 139 if (!active_mask) 130 140 return 0; 131 141 ··· 192 204 193 205 static void set_intercept_indicators_io(struct kvm_vcpu *vcpu) 194 206 { 195 - if (!(pending_floating_irqs(vcpu) & IRQ_PEND_IO_MASK)) 207 + if (!(pending_irqs(vcpu) & IRQ_PEND_IO_MASK)) 196 208 return; 197 209 else if (psw_ioint_disabled(vcpu)) 198 210 __set_cpuflag(vcpu, CPUSTAT_IO_INT); ··· 202 214 203 215 static void set_intercept_indicators_ext(struct kvm_vcpu *vcpu) 204 216 { 205 - if (!(pending_local_irqs(vcpu) & IRQ_PEND_EXT_MASK)) 217 + if (!(pending_irqs(vcpu) & IRQ_PEND_EXT_MASK)) 206 218 return; 207 219 if (psw_extint_disabled(vcpu)) 208 220 __set_cpuflag(vcpu, CPUSTAT_EXT_INT); ··· 212 224 213 225 static void set_intercept_indicators_mchk(struct kvm_vcpu *vcpu) 214 226 { 215 - if (!(pending_local_irqs(vcpu) & IRQ_PEND_MCHK_MASK)) 227 + if (!(pending_irqs(vcpu) & IRQ_PEND_MCHK_MASK)) 216 228 return; 217 229 if (psw_mchk_disabled(vcpu)) 218 230 vcpu->arch.sie_block->ictl |= ICTL_LPSW; ··· 803 815 804 816 int kvm_s390_vcpu_has_irq(struct kvm_vcpu *vcpu, int exclude_stop) 805 817 { 806 - int rc; 818 + if (deliverable_irqs(vcpu)) 819 + return 1; 807 820 808 - rc = !!deliverable_irqs(vcpu); 809 - 810 - if (!rc && kvm_cpu_has_pending_timer(vcpu)) 811 - rc = 1; 821 + if (kvm_cpu_has_pending_timer(vcpu)) 822 + return 1; 812 823 813 824 /* external call pending and deliverable */ 814 - if (!rc && kvm_s390_ext_call_pending(vcpu) && 825 + if (kvm_s390_ext_call_pending(vcpu) && 815 826 !psw_extint_disabled(vcpu) && 816 827 (vcpu->arch.sie_block->gcr[0] & 0x2000ul)) 817 - rc = 1; 828 + return 1; 818 829 819 - if (!rc && !exclude_stop && kvm_s390_is_stop_irq_pending(vcpu)) 820 - rc = 1; 821 - 822 - return rc; 830 + if (!exclude_stop && kvm_s390_is_stop_irq_pending(vcpu)) 831 + return 1; 832 + return 0; 823 833 } 824 834 825 835 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu) ··· 832 846 vcpu->stat.exit_wait_state++; 833 847 834 848 /* fast path */ 835 - if (kvm_cpu_has_pending_timer(vcpu) || kvm_arch_vcpu_runnable(vcpu)) 849 + if (kvm_arch_vcpu_runnable(vcpu)) 836 850 return 0; 837 851 838 852 if (psw_interrupts_disabled(vcpu)) { ··· 846 860 goto no_timer; 847 861 } 848 862 849 - preempt_disable(); 850 - now = get_tod_clock_fast() + vcpu->arch.sie_block->epoch; 851 - preempt_enable(); 863 + now = kvm_s390_get_tod_clock_fast(vcpu->kvm); 852 864 sltime = tod_to_ns(vcpu->arch.sie_block->ckc - now); 853 865 854 866 /* underflow */ ··· 885 901 u64 now, sltime; 886 902 887 903 vcpu = container_of(timer, struct kvm_vcpu, arch.ckc_timer); 888 - preempt_disable(); 889 - now = get_tod_clock_fast() + vcpu->arch.sie_block->epoch; 890 - preempt_enable(); 904 + now = kvm_s390_get_tod_clock_fast(vcpu->kvm); 891 905 sltime = tod_to_ns(vcpu->arch.sie_block->ckc - now); 892 906 893 907 /* ··· 963 981 trace_kvm_s390_inject_vcpu(vcpu->vcpu_id, KVM_S390_PROGRAM_INT, 964 982 irq->u.pgm.code, 0); 965 983 966 - li->irq.pgm = irq->u.pgm; 984 + if (irq->u.pgm.code == PGM_PER) { 985 + li->irq.pgm.code |= PGM_PER; 986 + /* only modify PER related information */ 987 + li->irq.pgm.per_address = irq->u.pgm.per_address; 988 + li->irq.pgm.per_code = irq->u.pgm.per_code; 989 + li->irq.pgm.per_atmid = irq->u.pgm.per_atmid; 990 + li->irq.pgm.per_access_id = irq->u.pgm.per_access_id; 991 + } else if (!(irq->u.pgm.code & PGM_PER)) { 992 + li->irq.pgm.code = (li->irq.pgm.code & PGM_PER) | 993 + irq->u.pgm.code; 994 + /* only modify non-PER information */ 995 + li->irq.pgm.trans_exc_code = irq->u.pgm.trans_exc_code; 996 + li->irq.pgm.mon_code = irq->u.pgm.mon_code; 997 + li->irq.pgm.data_exc_code = irq->u.pgm.data_exc_code; 998 + li->irq.pgm.mon_class_nr = irq->u.pgm.mon_class_nr; 999 + li->irq.pgm.exc_access_id = irq->u.pgm.exc_access_id; 1000 + li->irq.pgm.op_access_id = irq->u.pgm.op_access_id; 1001 + } else { 1002 + li->irq.pgm = irq->u.pgm; 1003 + } 967 1004 set_bit(IRQ_PEND_PROG, &li->pending_irqs); 968 1005 return 0; 969 - } 970 - 971 - int kvm_s390_inject_program_int(struct kvm_vcpu *vcpu, u16 code) 972 - { 973 - struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 974 - struct kvm_s390_irq irq; 975 - 976 - spin_lock(&li->lock); 977 - irq.u.pgm.code = code; 978 - __inject_prog(vcpu, &irq); 979 - BUG_ON(waitqueue_active(li->wq)); 980 - spin_unlock(&li->lock); 981 - return 0; 982 - } 983 - 984 - int kvm_s390_inject_prog_irq(struct kvm_vcpu *vcpu, 985 - struct kvm_s390_pgm_info *pgm_info) 986 - { 987 - struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 988 - struct kvm_s390_irq irq; 989 - int rc; 990 - 991 - spin_lock(&li->lock); 992 - irq.u.pgm = *pgm_info; 993 - rc = __inject_prog(vcpu, &irq); 994 - BUG_ON(waitqueue_active(li->wq)); 995 - spin_unlock(&li->lock); 996 - return rc; 997 1006 } 998 1007 999 1008 static int __inject_pfault_init(struct kvm_vcpu *vcpu, struct kvm_s390_irq *irq) ··· 1363 1390 1364 1391 static int __inject_vm(struct kvm *kvm, struct kvm_s390_interrupt_info *inti) 1365 1392 { 1366 - struct kvm_s390_float_interrupt *fi; 1367 1393 u64 type = READ_ONCE(inti->type); 1368 1394 int rc; 1369 - 1370 - fi = &kvm->arch.float_int; 1371 1395 1372 1396 switch (type) { 1373 1397 case KVM_S390_MCHK:
+27 -31
arch/s390/kvm/kvm-s390.c
··· 514 514 515 515 if (gtod_high != 0) 516 516 return -EINVAL; 517 - VM_EVENT(kvm, 3, "SET: TOD extension: 0x%x\n", gtod_high); 517 + VM_EVENT(kvm, 3, "SET: TOD extension: 0x%x", gtod_high); 518 518 519 519 return 0; 520 520 } 521 521 522 522 static int kvm_s390_set_tod_low(struct kvm *kvm, struct kvm_device_attr *attr) 523 523 { 524 - struct kvm_vcpu *cur_vcpu; 525 - unsigned int vcpu_idx; 526 - u64 host_tod, gtod; 527 - int r; 524 + u64 gtod; 528 525 529 526 if (copy_from_user(&gtod, (void __user *)attr->addr, sizeof(gtod))) 530 527 return -EFAULT; 531 528 532 - r = store_tod_clock(&host_tod); 533 - if (r) 534 - return r; 535 - 536 - mutex_lock(&kvm->lock); 537 - preempt_disable(); 538 - kvm->arch.epoch = gtod - host_tod; 539 - kvm_s390_vcpu_block_all(kvm); 540 - kvm_for_each_vcpu(vcpu_idx, cur_vcpu, kvm) 541 - cur_vcpu->arch.sie_block->epoch = kvm->arch.epoch; 542 - kvm_s390_vcpu_unblock_all(kvm); 543 - preempt_enable(); 544 - mutex_unlock(&kvm->lock); 545 - VM_EVENT(kvm, 3, "SET: TOD base: 0x%llx\n", gtod); 529 + kvm_s390_set_tod_clock(kvm, gtod); 530 + VM_EVENT(kvm, 3, "SET: TOD base: 0x%llx", gtod); 546 531 return 0; 547 532 } 548 533 ··· 559 574 if (copy_to_user((void __user *)attr->addr, &gtod_high, 560 575 sizeof(gtod_high))) 561 576 return -EFAULT; 562 - VM_EVENT(kvm, 3, "QUERY: TOD extension: 0x%x\n", gtod_high); 577 + VM_EVENT(kvm, 3, "QUERY: TOD extension: 0x%x", gtod_high); 563 578 564 579 return 0; 565 580 } 566 581 567 582 static int kvm_s390_get_tod_low(struct kvm *kvm, struct kvm_device_attr *attr) 568 583 { 569 - u64 host_tod, gtod; 570 - int r; 584 + u64 gtod; 571 585 572 - r = store_tod_clock(&host_tod); 573 - if (r) 574 - return r; 575 - 576 - preempt_disable(); 577 - gtod = host_tod + kvm->arch.epoch; 578 - preempt_enable(); 586 + gtod = kvm_s390_get_tod_clock_fast(kvm); 579 587 if (copy_to_user((void __user *)attr->addr, &gtod, sizeof(gtod))) 580 588 return -EFAULT; 581 - VM_EVENT(kvm, 3, "QUERY: TOD base: 0x%llx\n", gtod); 589 + VM_EVENT(kvm, 3, "QUERY: TOD base: 0x%llx", gtod); 582 590 583 591 return 0; 584 592 } ··· 1098 1120 if (!kvm->arch.sca) 1099 1121 goto out_err; 1100 1122 spin_lock(&kvm_lock); 1101 - sca_offset = (sca_offset + 16) & 0x7f0; 1123 + sca_offset += 16; 1124 + if (sca_offset + sizeof(struct sca_block) > PAGE_SIZE) 1125 + sca_offset = 0; 1102 1126 kvm->arch.sca = (struct sca_block *) ((char *) kvm->arch.sca + sca_offset); 1103 1127 spin_unlock(&kvm_lock); 1104 1128 ··· 1889 1909 clear_bit(KVM_REQ_UNHALT, &vcpu->requests); 1890 1910 1891 1911 return 0; 1912 + } 1913 + 1914 + void kvm_s390_set_tod_clock(struct kvm *kvm, u64 tod) 1915 + { 1916 + struct kvm_vcpu *vcpu; 1917 + int i; 1918 + 1919 + mutex_lock(&kvm->lock); 1920 + preempt_disable(); 1921 + kvm->arch.epoch = tod - get_tod_clock(); 1922 + kvm_s390_vcpu_block_all(kvm); 1923 + kvm_for_each_vcpu(i, vcpu, kvm) 1924 + vcpu->arch.sie_block->epoch = kvm->arch.epoch; 1925 + kvm_s390_vcpu_unblock_all(kvm); 1926 + preempt_enable(); 1927 + mutex_unlock(&kvm->lock); 1892 1928 } 1893 1929 1894 1930 /**
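The epoch arithmetic behind the new kvm_s390_set_tod_clock(): setting stores `epoch = guest_tod - host_tod`, and reading returns `host_tod + epoch`, so a get right after a set returns the value that was set, and the guest clock then advances in lockstep with the host. A sketch with a fake host clock (all names here are illustrative, not the kernel helpers):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t host_tod;   /* stands in for get_tod_clock() */
static uint64_t epoch;      /* signed delta, wraps naturally on u64 */

/* Sketch of kvm_s390_set_tod_clock(): record the guest/host delta */
static void set_guest_tod(uint64_t tod)
{
	epoch = tod - host_tod;
}

/* Sketch of kvm_s390_get_tod_clock_fast(): host clock plus delta */
static uint64_t get_guest_tod(void)
{
	return host_tod + epoch;
}
```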
+31 -4
arch/s390/kvm/kvm-s390.h
··· 175 175 return kvm->arch.user_cpu_state_ctrl != 0; 176 176 } 177 177 178 + /* implemented in interrupt.c */ 178 179 int kvm_s390_handle_wait(struct kvm_vcpu *vcpu); 179 180 void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu); 180 181 enum hrtimer_restart kvm_s390_idle_wakeup(struct hrtimer *timer); ··· 186 185 struct kvm_s390_interrupt *s390int); 187 186 int __must_check kvm_s390_inject_vcpu(struct kvm_vcpu *vcpu, 188 187 struct kvm_s390_irq *irq); 189 - int __must_check kvm_s390_inject_program_int(struct kvm_vcpu *vcpu, u16 code); 188 + static inline int kvm_s390_inject_prog_irq(struct kvm_vcpu *vcpu, 189 + struct kvm_s390_pgm_info *pgm_info) 190 + { 191 + struct kvm_s390_irq irq = { 192 + .type = KVM_S390_PROGRAM_INT, 193 + .u.pgm = *pgm_info, 194 + }; 195 + 196 + return kvm_s390_inject_vcpu(vcpu, &irq); 197 + } 198 + static inline int kvm_s390_inject_program_int(struct kvm_vcpu *vcpu, u16 code) 199 + { 200 + struct kvm_s390_irq irq = { 201 + .type = KVM_S390_PROGRAM_INT, 202 + .u.pgm.code = code, 203 + }; 204 + 205 + return kvm_s390_inject_vcpu(vcpu, &irq); 206 + } 190 207 struct kvm_s390_interrupt_info *kvm_s390_get_io_int(struct kvm *kvm, 191 208 u64 isc_mask, u32 schid); 192 209 int kvm_s390_reinject_io_int(struct kvm *kvm, ··· 231 212 int kvm_s390_handle_sigp_pei(struct kvm_vcpu *vcpu); 232 213 233 214 /* implemented in kvm-s390.c */ 215 + void kvm_s390_set_tod_clock(struct kvm *kvm, u64 tod); 234 216 long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable); 235 217 int kvm_s390_store_status_unloaded(struct kvm_vcpu *vcpu, unsigned long addr); 236 218 int kvm_s390_store_adtl_status_unloaded(struct kvm_vcpu *vcpu, ··· 251 231 252 232 /* implemented in diag.c */ 253 233 int kvm_s390_handle_diag(struct kvm_vcpu *vcpu); 254 - /* implemented in interrupt.c */ 255 - int kvm_s390_inject_prog_irq(struct kvm_vcpu *vcpu, 256 - struct kvm_s390_pgm_info *pgm_info); 257 234 258 235 static inline void kvm_s390_vcpu_block_all(struct kvm *kvm) 259 236 { ··· 269 252 270 253 kvm_for_each_vcpu(i, vcpu, kvm) 271 254 kvm_s390_vcpu_unblock(vcpu); 255 + } 256 + 257 + static inline u64 kvm_s390_get_tod_clock_fast(struct kvm *kvm) 258 + { 259 + u64 rc; 260 + 261 + preempt_disable(); 262 + rc = get_tod_clock_fast() + kvm->arch.epoch; 263 + preempt_enable(); 264 + return rc; 272 265 } 273 266 274 267 /**
+3 -16
arch/s390/kvm/priv.c
··· 33 33 /* Handle SCK (SET CLOCK) interception */ 34 34 static int handle_set_clock(struct kvm_vcpu *vcpu) 35 35 { 36 - struct kvm_vcpu *cpup; 37 - s64 hostclk, val; 38 - int i, rc; 36 + int rc; 39 37 ar_t ar; 40 - u64 op2; 38 + u64 op2, val; 41 39 42 40 if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) 43 41 return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); ··· 47 49 if (rc) 48 50 return kvm_s390_inject_prog_cond(vcpu, rc); 49 51 50 - if (store_tod_clock(&hostclk)) { 51 - kvm_s390_set_psw_cc(vcpu, 3); 52 - return 0; 53 - } 54 52 VCPU_EVENT(vcpu, 3, "SCK: setting guest TOD to 0x%llx", val); 55 - val = (val - hostclk) & ~0x3fUL; 56 - 57 - mutex_lock(&vcpu->kvm->lock); 58 - preempt_disable(); 59 - kvm_for_each_vcpu(i, cpup, vcpu->kvm) 60 - cpup->arch.sie_block->epoch = val; 61 - preempt_enable(); 62 - mutex_unlock(&vcpu->kvm->lock); 53 + kvm_s390_set_tod_clock(vcpu->kvm, val); 63 54 64 55 kvm_s390_set_psw_cc(vcpu, 0); 65 56 return 0;
+5 -5
arch/x86/include/asm/irq_remapping.h
··· 33 33 IRQ_POSTING_CAP = 0, 34 34 }; 35 35 36 + struct vcpu_data { 37 + u64 pi_desc_addr; /* Physical address of PI Descriptor */ 38 + u32 vector; /* Guest vector of the interrupt */ 39 + }; 40 + 36 41 #ifdef CONFIG_IRQ_REMAP 37 42 38 43 extern bool irq_remapping_cap(enum irq_remap_cap cap); ··· 62 57 { 63 58 return x86_vector_domain; 64 59 } 65 - 66 - struct vcpu_data { 67 - u64 pi_desc_addr; /* Physical address of PI Descriptor */ 68 - u32 vector; /* Guest vector of the interrupt */ 69 - }; 70 60 71 61 #else /* CONFIG_IRQ_REMAP */ 72 62
+10
arch/x86/include/asm/kvm_emulate.h
··· 112 112 struct x86_exception *fault); 113 113 114 114 /* 115 + * read_phys: Read bytes of standard (non-emulated/special) memory. 116 + * Used for descriptor reading. 117 + * @addr: [IN ] Physical address from which to read. 118 + * @val: [OUT] Value read from memory. 119 + * @bytes: [IN ] Number of bytes to read from memory. 120 + */ 121 + int (*read_phys)(struct x86_emulate_ctxt *ctxt, unsigned long addr, 122 + void *val, unsigned int bytes); 123 + 124 + /* 115 125 * write_std: Write bytes of standard (non-emulated/special) memory. 116 126 * Used for descriptor writing. 117 127 * @addr: [IN ] Linear address to which to write.
+36 -2
arch/x86/include/asm/kvm_host.h
··· 24 24 #include <linux/perf_event.h> 25 25 #include <linux/pvclock_gtod.h> 26 26 #include <linux/clocksource.h> 27 + #include <linux/irqbypass.h> 27 28 28 29 #include <asm/pvclock-abi.h> 29 30 #include <asm/desc.h> ··· 176 175 * See the implementation in apic_update_pv_eoi. 177 176 */ 178 177 #define KVM_APIC_PV_EOI_PENDING 1 178 + 179 + struct kvm_kernel_irq_routing_entry; 179 180 180 181 /* 181 182 * We don't want allocation failures within the mmu code, so we preallocate ··· 377 374 /* Hyper-V per vcpu emulation context */ 378 375 struct kvm_vcpu_hv { 379 376 u64 hv_vapic; 377 + s64 runtime_offset; 380 378 }; 381 379 382 380 struct kvm_vcpu_arch { ··· 400 396 u64 efer; 401 397 u64 apic_base; 402 398 struct kvm_lapic *apic; /* kernel irqchip context */ 399 + u64 eoi_exit_bitmap[4]; 403 400 unsigned long apic_attention; 404 401 int32_t apic_arb_prio; 405 402 int mp_state; ··· 578 573 struct { 579 574 bool pv_unhalted; 580 575 } pv; 576 + 577 + int pending_ioapic_eoi; 578 + int pending_external_vector; 581 579 }; 582 580 583 581 struct kvm_lpage_info { ··· 691 683 u32 bsp_vcpu_id; 692 684 693 685 u64 disabled_quirks; 686 + 687 + bool irqchip_split; 688 + u8 nr_reserved_ioapic_pins; 694 689 }; 695 690 696 691 struct kvm_vm_stat { ··· 830 819 void (*enable_nmi_window)(struct kvm_vcpu *vcpu); 831 820 void (*enable_irq_window)(struct kvm_vcpu *vcpu); 832 821 void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); 833 - int (*vm_has_apicv)(struct kvm *kvm); 822 + int (*cpu_uses_apicv)(struct kvm_vcpu *vcpu); 834 823 void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr); 835 824 void (*hwapic_isr_update)(struct kvm *kvm, int isr); 836 - void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); 825 + void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu); 837 826 void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); 838 827 void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa); 839 828 void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); ··· 898 887 gfn_t offset, unsigned long mask); 899 888 /* pmu operations of sub-arch */ 900 889 const struct kvm_pmu_ops *pmu_ops; 890 + 891 + /* 892 + * Architecture specific hooks for vCPU blocking due to 893 + * HLT instruction. 894 + * Returns for .pre_block(): 895 + * - 0 means continue to block the vCPU. 896 + * - 1 means we cannot block the vCPU since some event 897 + * happens during this period, such as, 'ON' bit in 898 + * posted-interrupts descriptor is set. 899 + */ 900 + int (*pre_block)(struct kvm_vcpu *vcpu); 901 + void (*post_block)(struct kvm_vcpu *vcpu); 902 + int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq, 903 + uint32_t guest_irq, bool set); 901 904 }; 902 905 903 906 struct kvm_arch_async_pf { ··· 1255 1230 int x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size); 1256 1231 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu); 1257 1232 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu); 1233 + 1234 + bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq, 1235 + struct kvm_vcpu **dest_vcpu); 1236 + 1237 + void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e, 1238 + struct kvm_lapic_irq *irq); 1239 + 1240 + static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} 1241 + static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} 1258 1242 1259 1243 #endif /* _ASM_X86_KVM_HOST_H */
+2 -1
arch/x86/include/asm/vmx.h
··· 72 72 #define SECONDARY_EXEC_SHADOW_VMCS 0x00004000 73 73 #define SECONDARY_EXEC_ENABLE_PML 0x00020000 74 74 #define SECONDARY_EXEC_XSAVES 0x00100000 75 - 75 + #define SECONDARY_EXEC_PCOMMIT 0x00200000 76 76 77 77 #define PIN_BASED_EXT_INTR_MASK 0x00000001 78 78 #define PIN_BASED_NMI_EXITING 0x00000008 ··· 416 416 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25) 417 417 #define VMX_EPT_EXTENT_GLOBAL_BIT (1ull << 26) 418 418 419 + #define VMX_VPID_INVVPID_BIT (1ull << 0) /* (32 - 32) */ 419 420 #define VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT (1ull << 9) /* (41 - 32) */ 420 421 #define VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT (1ull << 10) /* (42 - 32) */ 421 422
+18
arch/x86/include/uapi/asm/hyperv.h
··· 153 153 /* MSR used to provide vcpu index */ 154 154 #define HV_X64_MSR_VP_INDEX 0x40000002 155 155 156 + /* MSR used to reset the guest OS. */ 157 + #define HV_X64_MSR_RESET 0x40000003 158 + 159 + /* MSR used to provide vcpu runtime in 100ns units */ 160 + #define HV_X64_MSR_VP_RUNTIME 0x40000010 161 + 156 162 /* MSR used to read the per-partition time reference counter */ 157 163 #define HV_X64_MSR_TIME_REF_COUNT 0x40000020 158 164 ··· 256 250 __u64 tsc_scale; 257 251 __s64 tsc_offset; 258 252 } HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE; 253 + 254 + /* Define the number of synthetic interrupt sources. */ 255 + #define HV_SYNIC_SINT_COUNT (16) 256 + /* Define the expected SynIC version. */ 257 + #define HV_SYNIC_VERSION_1 (0x1) 258 + 259 + #define HV_SYNIC_CONTROL_ENABLE (1ULL << 0) 260 + #define HV_SYNIC_SIMP_ENABLE (1ULL << 0) 261 + #define HV_SYNIC_SIEFP_ENABLE (1ULL << 0) 262 + #define HV_SYNIC_SINT_MASKED (1ULL << 16) 263 + #define HV_SYNIC_SINT_AUTO_EOI (1ULL << 17) 264 + #define HV_SYNIC_SINT_VECTOR_MASK (0xFF) 259 265 260 266 #endif
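The SynIC SINT MSR layout implied by the new defines: the low byte is the interrupt vector, bit 16 masks the source, bit 17 requests auto-EOI. A decode sketch using those constants (copied here so the example is self-contained; the helper names are just for this example):

```c
#include <assert.h>
#include <stdint.h>

/* As added to hyperv.h above */
#define HV_SYNIC_SINT_MASKED      (1ULL << 16)
#define HV_SYNIC_SINT_AUTO_EOI    (1ULL << 17)
#define HV_SYNIC_SINT_VECTOR_MASK (0xFF)

/* Extract the vector from a SINT MSR value */
static int sint_vector(uint64_t sint)
{
	return sint & HV_SYNIC_SINT_VECTOR_MASK;
}

/* Is this synthetic interrupt source masked? */
static int sint_masked(uint64_t sint)
{
	return (sint & HV_SYNIC_SINT_MASKED) != 0;
}
```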
+3 -1
arch/x86/include/uapi/asm/vmx.h
··· 78 78 #define EXIT_REASON_PML_FULL 62 79 79 #define EXIT_REASON_XSAVES 63 80 80 #define EXIT_REASON_XRSTORS 64 81 + #define EXIT_REASON_PCOMMIT 65 81 82 82 83 #define VMX_EXIT_REASONS \ 83 84 { EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \ ··· 127 126 { EXIT_REASON_INVVPID, "INVVPID" }, \ 128 127 { EXIT_REASON_INVPCID, "INVPCID" }, \ 129 128 { EXIT_REASON_XSAVES, "XSAVES" }, \ 130 - { EXIT_REASON_XRSTORS, "XRSTORS" } 129 + { EXIT_REASON_XRSTORS, "XRSTORS" }, \ 130 + { EXIT_REASON_PCOMMIT, "PCOMMIT" } 131 131 132 132 #define VMX_ABORT_SAVE_GUEST_MSR_FAIL 1 133 133 #define VMX_ABORT_LOAD_HOST_MSR_FAIL 4
+35 -11
arch/x86/kernel/kvmclock.c
··· 32 32 static int kvmclock = 1; 33 33 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; 34 34 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; 35 + static cycle_t kvm_sched_clock_offset; 35 36 36 37 static int parse_no_kvmclock(char *arg) 37 38 { ··· 91 90 static cycle_t kvm_clock_get_cycles(struct clocksource *cs) 92 91 { 93 92 return kvm_clock_read(); 93 + } 94 + 95 + static cycle_t kvm_sched_clock_read(void) 96 + { 97 + return kvm_clock_read() - kvm_sched_clock_offset; 98 + } 99 + 100 + static inline void kvm_sched_clock_init(bool stable) 101 + { 102 + if (!stable) { 103 + pv_time_ops.sched_clock = kvm_clock_read; 104 + return; 105 + } 106 + 107 + kvm_sched_clock_offset = kvm_clock_read(); 108 + pv_time_ops.sched_clock = kvm_sched_clock_read; 109 + set_sched_clock_stable(); 110 + 111 + printk(KERN_INFO "kvm-clock: using sched offset of %llu cycles\n", 112 + kvm_sched_clock_offset); 113 + 114 + BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) > 115 + sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time)); 94 116 } 95 117 96 118 /* ··· 272 248 memblock_free(mem, size); 273 249 return; 274 250 } 275 - pv_time_ops.sched_clock = kvm_clock_read; 251 + 252 + if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) 253 + pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); 254 + 255 + cpu = get_cpu(); 256 + vcpu_time = &hv_clock[cpu].pvti; 257 + flags = pvclock_read_flags(vcpu_time); 258 + 259 + kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT); 260 + put_cpu(); 261 + 276 262 x86_platform.calibrate_tsc = kvm_get_tsc_khz; 277 263 x86_platform.get_wallclock = kvm_get_wallclock; 278 264 x86_platform.set_wallclock = kvm_set_wallclock; ··· 299 265 kvm_get_preset_lpj(); 300 266 clocksource_register_hz(&kvm_clock, NSEC_PER_SEC); 301 267 pv_info.name = "KVM"; 302 - 303 - if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) 304 - pvclock_set_flags(~0); 305 - 306 - cpu = get_cpu(); 307 - vcpu_time = &hv_clock[cpu].pvti; 308 - flags = pvclock_read_flags(vcpu_time); 309 - if (flags & PVCLOCK_COUNTS_FROM_ZERO) 310 - set_sched_clock_stable(); 311 - put_cpu(); 312 268 } 313 269 314 270 int __init kvm_setup_vsyscall_timeinfo(void)
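kvm_sched_clock_read() subtracts an offset sampled once at init, so the scheduler clock starts at zero even though kvmclock itself does not; this is how the guest-side stable scheduler clock avoids needing help from the hypervisor. A sketch of the arithmetic with a fake clock source (names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t fake_kvm_clock;      /* stands in for kvm_clock_read() */
static uint64_t sched_clock_offset;

/* Sample the raw clock once, like kvm_sched_clock_init() above */
static void sched_clock_init(void)
{
	sched_clock_offset = fake_kvm_clock;
}

/* sched_clock counts from zero relative to the sampled offset */
static uint64_t sched_clock_read(void)
{
	return fake_kvm_clock - sched_clock_offset;
}
```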
+2
arch/x86/kvm/Kconfig
··· 28 28 select ANON_INODES 29 29 select HAVE_KVM_IRQCHIP 30 30 select HAVE_KVM_IRQFD 31 + select IRQ_BYPASS_MANAGER 32 + select HAVE_KVM_IRQ_BYPASS 31 33 select HAVE_KVM_IRQ_ROUTING 32 34 select HAVE_KVM_EVENTFD 33 35 select KVM_APIC_ARCHITECTURE
+37 -25
arch/x86/kvm/assigned-dev.c
··· 21 21 #include <linux/fs.h> 22 22 #include "irq.h" 23 23 #include "assigned-dev.h" 24 + #include "trace/events/kvm.h" 24 25 25 26 struct kvm_assigned_dev_kernel { 26 27 struct kvm_irq_ack_notifier ack_notifier; ··· 132 131 return IRQ_HANDLED; 133 132 } 134 133 135 - #ifdef __KVM_HAVE_MSI 134 + /* 135 + * Deliver an IRQ in an atomic context if we can, or return a failure, 136 + * user can retry in a process context. 137 + * Return value: 138 + * -EWOULDBLOCK - Can't deliver in atomic context: retry in a process context. 139 + * Other values - No need to retry. 140 + */ 141 + static int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, 142 + int level) 143 + { 144 + struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS]; 145 + struct kvm_kernel_irq_routing_entry *e; 146 + int ret = -EINVAL; 147 + int idx; 148 + 149 + trace_kvm_set_irq(irq, level, irq_source_id); 150 + 151 + /* 152 + * Injection into either PIC or IOAPIC might need to scan all CPUs, 153 + * which would need to be retried from thread context; when same GSI 154 + * is connected to both PIC and IOAPIC, we'd have to report a 155 + * partial failure here. 156 + * Since there's no easy way to do this, we only support injecting MSI 157 + * which is limited to 1:1 GSI mapping. 158 + */ 159 + idx = srcu_read_lock(&kvm->irq_srcu); 160 + if (kvm_irq_map_gsi(kvm, entries, irq) > 0) { 161 + e = &entries[0]; 162 + ret = kvm_arch_set_irq_inatomic(e, kvm, irq_source_id, 163 + irq, level); 164 + } 165 + srcu_read_unlock(&kvm->irq_srcu, idx); 166 + return ret; 167 + } 168 + 169 + 136 170 static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) 137 171 { 138 172 struct kvm_assigned_dev_kernel *assigned_dev = dev_id; ··· 186 150 187 151 return IRQ_HANDLED; 188 152 } 189 - #endif 190 153 191 - #ifdef __KVM_HAVE_MSIX 192 154 static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id) 193 155 { 194 156 struct kvm_assigned_dev_kernel *assigned_dev = dev_id; ··· 217 183 218 184 return IRQ_HANDLED; 219 185 } 220 - #endif 221 186 222 187 /* Ack the irq line for an assigned device */ 223 188 static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian) ··· 419 386 return 0; 420 387 } 421 388 422 - #ifdef __KVM_HAVE_MSI 423 389 static int assigned_device_enable_host_msi(struct kvm *kvm, 424 390 struct kvm_assigned_dev_kernel *dev) 425 391 { ··· 440 408 441 409 return 0; 442 410 } 443 - #endif 444 411 445 - #ifdef __KVM_HAVE_MSIX 446 412 static int assigned_device_enable_host_msix(struct kvm *kvm, 447 413 struct kvm_assigned_dev_kernel *dev) 448 414 { ··· 473 443 return r; 474 444 } 475 445 476 - #endif 477 - 478 446 static int assigned_device_enable_guest_intx(struct kvm *kvm, 479 447 struct kvm_assigned_dev_kernel *dev, 480 448 struct kvm_assigned_irq *irq) ··· 482 454 return 0; 483 455 } 484 456 485 - #ifdef __KVM_HAVE_MSI 486 457 static int assigned_device_enable_guest_msi(struct kvm *kvm, 487 458 struct kvm_assigned_dev_kernel *dev, 488 459 struct kvm_assigned_irq *irq) ··· 490 463 dev->ack_notifier.gsi = -1; 491 464 return 0; 492 465 } 493 - #endif 494 466 495 - #ifdef __KVM_HAVE_MSIX 496 467 static int assigned_device_enable_guest_msix(struct kvm *kvm, 497 468 struct kvm_assigned_dev_kernel *dev, 498 469 struct kvm_assigned_irq *irq) ··· 499 474 dev->ack_notifier.gsi = -1; 500 475 return 0; 501 476 } 502 - #endif 503 477 504 478 static int assign_host_irq(struct kvm *kvm, 505 479 struct kvm_assigned_dev_kernel *dev, ··· 516 492 case KVM_DEV_IRQ_HOST_INTX: 517 493 r = assigned_device_enable_host_intx(kvm, dev); 518 494 break; 519 - #ifdef __KVM_HAVE_MSI 520 495 case KVM_DEV_IRQ_HOST_MSI: 521 496 r = assigned_device_enable_host_msi(kvm, dev); 522 497 break; 523 - #endif 524 - #ifdef __KVM_HAVE_MSIX 525 498 case KVM_DEV_IRQ_HOST_MSIX: 526 499 r = assigned_device_enable_host_msix(kvm, dev); 527 500 break; 528 - #endif 529 501 default: 530 502 r = -EINVAL; 531 503 } ··· 554 534 case KVM_DEV_IRQ_GUEST_INTX: 555 535 r = assigned_device_enable_guest_intx(kvm, dev, irq); 556 536 break; 557 - #ifdef __KVM_HAVE_MSI 558 537 case KVM_DEV_IRQ_GUEST_MSI: 559 538 r = assigned_device_enable_guest_msi(kvm, dev, irq); 560 539 break; 561 - #endif 562 - #ifdef __KVM_HAVE_MSIX 563 540 case KVM_DEV_IRQ_GUEST_MSIX: 564 541 r = assigned_device_enable_guest_msix(kvm, dev, irq); 565 542 break; 566 - #endif 567 543 default: 568 544 r = -EINVAL; 569 545 } ··· 842 826 } 843 827 844 828 845 - #ifdef __KVM_HAVE_MSIX 846 829 static int kvm_vm_ioctl_set_msix_nr(struct kvm *kvm, 847 830 struct kvm_assigned_msix_nr *entry_nr) 848 831 { ··· 921 906 922 907 return r; 923 908 } 924 - #endif 925 909 926 910 static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm, 927 911 struct kvm_assigned_pci_dev *assigned_dev) ··· 1026 1012 goto out; 1027 1013 break; 1028 1014 } 1029 - #ifdef __KVM_HAVE_MSIX 1030 1015 case KVM_ASSIGN_SET_MSIX_NR: { 1031 1016 struct kvm_assigned_msix_nr entry_nr; 1032 1017 r = -EFAULT; ··· 1046 1033 goto out; 1047 1034 break; 1048 1035 } 1049 - #endif 1050 1036 case KVM_ASSIGN_SET_INTX_MASK: { 1051 1037 struct kvm_assigned_pci_dev assigned_dev; 1052 1038
+1 -1
arch/x86/kvm/cpuid.c
··· 348 348 F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | 349 349 F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | f_mpx | F(RDSEED) | 350 350 F(ADX) | F(SMAP) | F(AVX512F) | F(AVX512PF) | F(AVX512ER) | 351 - F(AVX512CD); 351 + F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(PCOMMIT); 352 352 353 353 /* cpuid 0xD.1.eax */ 354 354 const u32 kvm_supported_word10_x86_features =
+37
arch/x86/kvm/cpuid.h
··· 133 133 best = kvm_find_cpuid_entry(vcpu, 7, 0); 134 134 return best && (best->ebx & bit(X86_FEATURE_MPX)); 135 135 } 136 + 137 + static inline bool guest_cpuid_has_pcommit(struct kvm_vcpu *vcpu) 138 + { 139 + struct kvm_cpuid_entry2 *best; 140 + 141 + best = kvm_find_cpuid_entry(vcpu, 7, 0); 142 + return best && (best->ebx & bit(X86_FEATURE_PCOMMIT)); 143 + } 144 + 145 + static inline bool guest_cpuid_has_rdtscp(struct kvm_vcpu *vcpu) 146 + { 147 + struct kvm_cpuid_entry2 *best; 148 + 149 + best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0); 150 + return best && (best->edx & bit(X86_FEATURE_RDTSCP)); 151 + } 152 + 153 + /* 154 + * NRIPS is provided through cpuidfn 0x8000000a.edx bit 3 155 + */ 156 + #define BIT_NRIPS 3 157 + 158 + static inline bool guest_cpuid_has_nrips(struct kvm_vcpu *vcpu) 159 + { 160 + struct kvm_cpuid_entry2 *best; 161 + 162 + best = kvm_find_cpuid_entry(vcpu, 0x8000000a, 0); 163 + 164 + /* 165 + * NRIPS is a scattered cpuid feature, so we can't use 166 + * X86_FEATURE_NRIPS here (X86_FEATURE_NRIPS would be bit 167 + * position 8, not 3). 168 + */ 169 + return best && (best->edx & bit(BIT_NRIPS)); 170 + } 171 + #undef BIT_NRIPS 172 + 136 173 #endif
+27 -8
arch/x86/kvm/emulate.c
··· 2272 2272 #define GET_SMSTATE(type, smbase, offset) \ 2273 2273 ({ \ 2274 2274 type __val; \ 2275 - int r = ctxt->ops->read_std(ctxt, smbase + offset, &__val, \ 2276 - sizeof(__val), NULL); \ 2275 + int r = ctxt->ops->read_phys(ctxt, smbase + offset, &__val, \ 2276 + sizeof(__val)); \ 2277 2277 if (r != X86EMUL_CONTINUE) \ 2278 2278 return X86EMUL_UNHANDLEABLE; \ 2279 2279 __val; \ ··· 2484 2484 2485 2485 /* 2486 2486 * Get back to real mode, to prepare a safe state in which to load 2487 - * CR0/CR3/CR4/EFER. Also this will ensure that addresses passed 2488 - * to read_std/write_std are not virtual. 2489 - * 2490 - * CR4.PCIDE must be zero, because it is a 64-bit mode only feature. 2487 + * CR0/CR3/CR4/EFER. It's all a bit more complicated if the vCPU 2488 + * supports long mode. 2491 2489 */ 2490 + cr4 = ctxt->ops->get_cr(ctxt, 4); 2491 + if (emulator_has_longmode(ctxt)) { 2492 + struct desc_struct cs_desc; 2493 + 2494 + /* Zero CR4.PCIDE before CR0.PG. */ 2495 + if (cr4 & X86_CR4_PCIDE) { 2496 + ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE); 2497 + cr4 &= ~X86_CR4_PCIDE; 2498 + } 2499 + 2500 + /* A 32-bit code segment is required to clear EFER.LMA. */ 2501 + memset(&cs_desc, 0, sizeof(cs_desc)); 2502 + cs_desc.type = 0xb; 2503 + cs_desc.s = cs_desc.g = cs_desc.p = 1; 2504 + ctxt->ops->set_segment(ctxt, 0, &cs_desc, 0, VCPU_SREG_CS); 2505 + } 2506 + 2507 + /* For the 64-bit case, this will clear EFER.LMA. */ 2492 2508 cr0 = ctxt->ops->get_cr(ctxt, 0); 2493 2509 if (cr0 & X86_CR0_PE) 2494 2510 ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE)); 2495 - cr4 = ctxt->ops->get_cr(ctxt, 4); 2511 + 2512 + /* Now clear CR4.PAE (which must be done before clearing EFER.LME). */ 2496 2513 if (cr4 & X86_CR4_PAE) 2497 2514 ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE); 2515 + 2516 + /* And finally go back to 32-bit mode. 
*/ 2498 2517 efer = 0; 2499 2518 ctxt->ops->set_msr(ctxt, MSR_EFER, efer); 2500 2519 ··· 4474 4455 F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N, 4475 4456 /* 0xA8 - 0xAF */ 4476 4457 I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg), 4477 - II(No64 | EmulateOnUD | ImplicitOps, em_rsm, rsm), 4458 + II(EmulateOnUD | ImplicitOps, em_rsm, rsm), 4478 4459 F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), 4479 4460 F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), 4480 4461 F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
+29 -2
arch/x86/kvm/hyperv.c
··· 41 41 case HV_X64_MSR_TIME_REF_COUNT: 42 42 case HV_X64_MSR_CRASH_CTL: 43 43 case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4: 44 + case HV_X64_MSR_RESET: 44 45 r = true; 45 46 break; 46 47 } ··· 164 163 data); 165 164 case HV_X64_MSR_CRASH_CTL: 166 165 return kvm_hv_msr_set_crash_ctl(vcpu, data, host); 166 + case HV_X64_MSR_RESET: 167 + if (data == 1) { 168 + vcpu_debug(vcpu, "hyper-v reset requested\n"); 169 + kvm_make_request(KVM_REQ_HV_RESET, vcpu); 170 + } 171 + break; 167 172 default: 168 173 vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n", 169 174 msr, data); ··· 178 171 return 0; 179 172 } 180 173 181 - static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data) 174 + /* Calculate cpu time spent by current task in 100ns units */ 175 + static u64 current_task_runtime_100ns(void) 176 + { 177 + cputime_t utime, stime; 178 + 179 + task_cputime_adjusted(current, &utime, &stime); 180 + return div_u64(cputime_to_nsecs(utime + stime), 100); 181 + } 182 + 183 + static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host) 182 184 { 183 185 struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv; 184 186 ··· 221 205 return kvm_hv_vapic_msr_write(vcpu, APIC_ICR, data); 222 206 case HV_X64_MSR_TPR: 223 207 return kvm_hv_vapic_msr_write(vcpu, APIC_TASKPRI, data); 208 + case HV_X64_MSR_VP_RUNTIME: 209 + if (!host) 210 + return 1; 211 + hv->runtime_offset = data - current_task_runtime_100ns(); 212 + break; 224 213 default: 225 214 vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n", 226 215 msr, data); ··· 262 241 pdata); 263 242 case HV_X64_MSR_CRASH_CTL: 264 243 return kvm_hv_msr_get_crash_ctl(vcpu, pdata); 244 + case HV_X64_MSR_RESET: 245 + data = 0; 246 + break; 265 247 default: 266 248 vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr); 267 249 return 1; ··· 301 277 case HV_X64_MSR_APIC_ASSIST_PAGE: 302 278 data = hv->hv_vapic; 303 279 break; 280 + case HV_X64_MSR_VP_RUNTIME: 281 + data = current_task_runtime_100ns() 
+ hv->runtime_offset; 282 + break; 304 283 default: 305 284 vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr); 306 285 return 1; ··· 322 295 mutex_unlock(&vcpu->kvm->lock); 323 296 return r; 324 297 } else 325 - return kvm_hv_set_msr(vcpu, msr, data); 298 + return kvm_hv_set_msr(vcpu, msr, data, host); 326 299 } 327 300 328 301 int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
+3 -1
arch/x86/kvm/i8254.c
··· 35 35 #include <linux/kvm_host.h> 36 36 #include <linux/slab.h> 37 37 38 + #include "ioapic.h" 38 39 #include "irq.h" 39 40 #include "i8254.h" 40 41 #include "x86.h" ··· 334 333 struct kvm_kpit_state *ps = &kvm->arch.vpit->pit_state; 335 334 s64 interval; 336 335 337 - if (!irqchip_in_kernel(kvm) || ps->flags & KVM_PIT_FLAGS_HPET_LEGACY) 336 + if (!ioapic_in_kernel(kvm) || 337 + ps->flags & KVM_PIT_FLAGS_HPET_LEGACY) 338 338 return; 339 339 340 340 interval = muldiv64(val, NSEC_PER_SEC, KVM_PIT_FREQ);
+6 -23
arch/x86/kvm/ioapic.c
··· 233 233 } 234 234 235 235 236 - static void update_handled_vectors(struct kvm_ioapic *ioapic) 237 - { 238 - DECLARE_BITMAP(handled_vectors, 256); 239 - int i; 240 - 241 - memset(handled_vectors, 0, sizeof(handled_vectors)); 242 - for (i = 0; i < IOAPIC_NUM_PINS; ++i) 243 - __set_bit(ioapic->redirtbl[i].fields.vector, handled_vectors); 244 - memcpy(ioapic->handled_vectors, handled_vectors, 245 - sizeof(handled_vectors)); 246 - smp_wmb(); 247 - } 248 - 249 - void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, 250 - u32 *tmr) 236 + void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) 251 237 { 252 238 struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic; 253 239 union kvm_ioapic_redirect_entry *e; ··· 246 260 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) || 247 261 index == RTC_GSI) { 248 262 if (kvm_apic_match_dest(vcpu, NULL, 0, 249 - e->fields.dest_id, e->fields.dest_mode)) { 263 + e->fields.dest_id, e->fields.dest_mode) || 264 + (e->fields.trig_mode == IOAPIC_EDGE_TRIG && 265 + kvm_apic_pending_eoi(vcpu, e->fields.vector))) 250 266 __set_bit(e->fields.vector, 251 267 (unsigned long *)eoi_exit_bitmap); 252 - if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG) 253 - __set_bit(e->fields.vector, 254 - (unsigned long *)tmr); 255 - } 256 268 } 257 269 } 258 270 spin_unlock(&ioapic->lock); ··· 299 315 e->bits |= (u32) val; 300 316 e->fields.remote_irr = 0; 301 317 } 302 - update_handled_vectors(ioapic); 303 318 mask_after = e->fields.mask; 304 319 if (mask_before != mask_after) 305 320 kvm_fire_mask_notifiers(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index, mask_after); ··· 582 599 ioapic->id = 0; 583 600 memset(ioapic->irq_eoi, 0x00, IOAPIC_NUM_PINS); 584 601 rtc_irq_eoi_tracking_reset(ioapic); 585 - update_handled_vectors(ioapic); 586 602 } 587 603 588 604 static const struct kvm_io_device_ops ioapic_mmio_ops = { ··· 610 628 if (ret < 0) { 611 629 kvm->arch.vioapic = NULL; 612 630 kfree(ioapic); 631 + return ret; 613 632 } 614 
633 634 + kvm_vcpu_request_scan_ioapic(kvm); 615 635 return ret; 616 636 } 617 637 ··· 650 666 memcpy(ioapic, state, sizeof(struct kvm_ioapic_state)); 651 667 ioapic->irr = 0; 652 668 ioapic->irr_delivered = 0; 653 - update_handled_vectors(ioapic); 654 669 kvm_vcpu_request_scan_ioapic(kvm); 655 670 kvm_ioapic_inject_all(ioapic, state->irr); 656 671 spin_unlock(&ioapic->lock);
+8 -7
arch/x86/kvm/ioapic.h
··· 9 9 struct kvm_vcpu; 10 10 11 11 #define IOAPIC_NUM_PINS KVM_IOAPIC_NUM_PINS 12 + #define MAX_NR_RESERVED_IOAPIC_PINS KVM_MAX_IRQ_ROUTES 12 13 #define IOAPIC_VERSION_ID 0x11 /* IOAPIC version */ 13 14 #define IOAPIC_EDGE_TRIG 0 14 15 #define IOAPIC_LEVEL_TRIG 1 ··· 74 73 struct kvm *kvm; 75 74 void (*ack_notifier)(void *opaque, int irq); 76 75 spinlock_t lock; 77 - DECLARE_BITMAP(handled_vectors, 256); 78 76 struct rtc_status rtc_status; 79 77 struct delayed_work eoi_inject; 80 78 u32 irq_eoi[IOAPIC_NUM_PINS]; ··· 98 98 return kvm->arch.vioapic; 99 99 } 100 100 101 - static inline bool kvm_ioapic_handles_vector(struct kvm *kvm, int vector) 101 + static inline int ioapic_in_kernel(struct kvm *kvm) 102 102 { 103 - struct kvm_ioapic *ioapic = kvm->arch.vioapic; 104 - smp_rmb(); 105 - return test_bit(vector, ioapic->handled_vectors); 103 + int ret; 104 + 105 + ret = (ioapic_irqchip(kvm) != NULL); 106 + return ret; 106 107 } 107 108 108 109 void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu); ··· 121 120 struct kvm_lapic_irq *irq, unsigned long *dest_map); 122 121 int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state); 123 122 int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state); 124 - void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, 125 - u32 *tmr); 123 + void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); 124 + void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); 126 125 127 126 #endif
+30 -10
arch/x86/kvm/irq.c
··· 38 38 EXPORT_SYMBOL(kvm_cpu_has_pending_timer); 39 39 40 40 /* 41 + * check if there is a pending userspace external interrupt 42 + */ 43 + static int pending_userspace_extint(struct kvm_vcpu *v) 44 + { 45 + return v->arch.pending_external_vector != -1; 46 + } 47 + 48 + /* 41 49 * check if there is pending interrupt from 42 50 * non-APIC source without intack. 43 51 */ 44 52 static int kvm_cpu_has_extint(struct kvm_vcpu *v) 45 53 { 46 - if (kvm_apic_accept_pic_intr(v)) 47 - return pic_irqchip(v->kvm)->output; /* PIC */ 48 - else 54 + u8 accept = kvm_apic_accept_pic_intr(v); 55 + 56 + if (accept) { 57 + if (irqchip_split(v->kvm)) 58 + return pending_userspace_extint(v); 59 + else 60 + return pic_irqchip(v->kvm)->output; 61 + } else 49 62 return 0; 50 63 } 51 64 ··· 70 57 */ 71 58 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v) 72 59 { 73 - if (!irqchip_in_kernel(v->kvm)) 60 + if (!lapic_in_kernel(v)) 74 61 return v->arch.interrupt.pending; 75 62 76 63 if (kvm_cpu_has_extint(v)) 77 64 return 1; 78 65 79 - if (kvm_apic_vid_enabled(v->kvm)) 66 + if (kvm_vcpu_apic_vid_enabled(v)) 80 67 return 0; 81 68 82 69 return kvm_apic_has_interrupt(v) != -1; /* LAPIC */ ··· 88 75 */ 89 76 int kvm_cpu_has_interrupt(struct kvm_vcpu *v) 90 77 { 91 - if (!irqchip_in_kernel(v->kvm)) 78 + if (!lapic_in_kernel(v)) 92 79 return v->arch.interrupt.pending; 93 80 94 81 if (kvm_cpu_has_extint(v)) ··· 104 91 */ 105 92 static int kvm_cpu_get_extint(struct kvm_vcpu *v) 106 93 { 107 - if (kvm_cpu_has_extint(v)) 108 - return kvm_pic_read_irq(v->kvm); /* PIC */ 109 - return -1; 94 + if (kvm_cpu_has_extint(v)) { 95 + if (irqchip_split(v->kvm)) { 96 + int vector = v->arch.pending_external_vector; 97 + 98 + v->arch.pending_external_vector = -1; 99 + return vector; 100 + } else 101 + return kvm_pic_read_irq(v->kvm); /* PIC */ 102 + } else 103 + return -1; 110 104 } 111 105 112 106 /* ··· 123 103 { 124 104 int vector; 125 105 126 - if (!irqchip_in_kernel(v->kvm)) 106 + if (!lapic_in_kernel(v)) 
127 107 return v->arch.interrupt.nr; 128 108 129 109 vector = kvm_cpu_get_extint(v);
+26 -1
arch/x86/kvm/irq.h
··· 83 83 return kvm->arch.vpic; 84 84 } 85 85 86 + static inline int pic_in_kernel(struct kvm *kvm) 87 + { 88 + int ret; 89 + 90 + ret = (pic_irqchip(kvm) != NULL); 91 + return ret; 92 + } 93 + 94 + static inline int irqchip_split(struct kvm *kvm) 95 + { 96 + return kvm->arch.irqchip_split; 97 + } 98 + 86 99 static inline int irqchip_in_kernel(struct kvm *kvm) 87 100 { 88 101 struct kvm_pic *vpic = pic_irqchip(kvm); 102 + bool ret; 103 + 104 + ret = (vpic != NULL); 105 + ret |= irqchip_split(kvm); 89 106 90 107 /* Read vpic before kvm->irq_routing. */ 91 108 smp_rmb(); 92 - return vpic != NULL; 109 + return ret; 110 + } 111 + 112 + static inline int lapic_in_kernel(struct kvm_vcpu *vcpu) 113 + { 114 + /* Same as irqchip_in_kernel(vcpu->kvm), but with less 115 + * pointer chasing and no unnecessary memory barriers. 116 + */ 117 + return vcpu->arch.apic != NULL; 93 118 } 94 119 95 120 void kvm_pic_reset(struct kvm_kpic_state *s);
+88 -41
arch/x86/kvm/irq_comm.c
··· 91 91 return r; 92 92 } 93 93 94 - static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e, 95 - struct kvm_lapic_irq *irq) 94 + void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e, 95 + struct kvm_lapic_irq *irq) 96 96 { 97 97 trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data); 98 98 ··· 108 108 irq->level = 1; 109 109 irq->shorthand = 0; 110 110 } 111 + EXPORT_SYMBOL_GPL(kvm_set_msi_irq); 111 112 112 113 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, 113 114 struct kvm *kvm, int irq_source_id, int level, bool line_status) ··· 124 123 } 125 124 126 125 127 - static int kvm_set_msi_inatomic(struct kvm_kernel_irq_routing_entry *e, 128 - struct kvm *kvm) 126 + int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, 127 + struct kvm *kvm, int irq_source_id, int level, 128 + bool line_status) 129 129 { 130 130 struct kvm_lapic_irq irq; 131 131 int r; 132 + 133 + if (unlikely(e->type != KVM_IRQ_ROUTING_MSI)) 134 + return -EWOULDBLOCK; 132 135 133 136 kvm_set_msi_irq(e, &irq); 134 137 ··· 140 135 return r; 141 136 else 142 137 return -EWOULDBLOCK; 143 - } 144 - 145 - /* 146 - * Deliver an IRQ in an atomic context if we can, or return a failure, 147 - * user can retry in a process context. 148 - * Return value: 149 - * -EWOULDBLOCK - Can't deliver in atomic context: retry in a process context. 150 - * Other values - No need to retry. 151 - */ 152 - int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level) 153 - { 154 - struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS]; 155 - struct kvm_kernel_irq_routing_entry *e; 156 - int ret = -EINVAL; 157 - int idx; 158 - 159 - trace_kvm_set_irq(irq, level, irq_source_id); 160 - 161 - /* 162 - * Injection into either PIC or IOAPIC might need to scan all CPUs, 163 - * which would need to be retried from thread context; when same GSI 164 - * is connected to both PIC and IOAPIC, we'd have to report a 165 - * partial failure here. 
166 - * Since there's no easy way to do this, we only support injecting MSI 167 - * which is limited to 1:1 GSI mapping. 168 - */ 169 - idx = srcu_read_lock(&kvm->irq_srcu); 170 - if (kvm_irq_map_gsi(kvm, entries, irq) > 0) { 171 - e = &entries[0]; 172 - if (likely(e->type == KVM_IRQ_ROUTING_MSI)) 173 - ret = kvm_set_msi_inatomic(e, kvm); 174 - else 175 - ret = -EWOULDBLOCK; 176 - } 177 - srcu_read_unlock(&kvm->irq_srcu, idx); 178 - return ret; 179 138 } 180 139 181 140 int kvm_request_irq_source_id(struct kvm *kvm) ··· 177 208 goto unlock; 178 209 } 179 210 clear_bit(irq_source_id, &kvm->arch.irq_sources_bitmap); 180 - if (!irqchip_in_kernel(kvm)) 211 + if (!ioapic_in_kernel(kvm)) 181 212 goto unlock; 182 213 183 214 kvm_ioapic_clear_all(kvm->arch.vioapic, irq_source_id); ··· 266 297 return r; 267 298 } 268 299 300 + bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq, 301 + struct kvm_vcpu **dest_vcpu) 302 + { 303 + int i, r = 0; 304 + struct kvm_vcpu *vcpu; 305 + 306 + if (kvm_intr_is_single_vcpu_fast(kvm, irq, dest_vcpu)) 307 + return true; 308 + 309 + kvm_for_each_vcpu(i, vcpu, kvm) { 310 + if (!kvm_apic_present(vcpu)) 311 + continue; 312 + 313 + if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand, 314 + irq->dest_id, irq->dest_mode)) 315 + continue; 316 + 317 + if (++r == 2) 318 + return false; 319 + 320 + *dest_vcpu = vcpu; 321 + } 322 + 323 + return r == 1; 324 + } 325 + EXPORT_SYMBOL_GPL(kvm_intr_is_single_vcpu); 326 + 269 327 #define IOAPIC_ROUTING_ENTRY(irq) \ 270 328 { .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \ 271 329 .u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } } ··· 323 327 { 324 328 return kvm_set_irq_routing(kvm, default_routing, 325 329 ARRAY_SIZE(default_routing), 0); 330 + } 331 + 332 + static const struct kvm_irq_routing_entry empty_routing[] = {}; 333 + 334 + int kvm_setup_empty_irq_routing(struct kvm *kvm) 335 + { 336 + return kvm_set_irq_routing(kvm, empty_routing, 0, 0); 337 + } 338 + 339 + void 
kvm_arch_irq_routing_update(struct kvm *kvm) 340 + { 341 + if (ioapic_in_kernel(kvm) || !irqchip_in_kernel(kvm)) 342 + return; 343 + kvm_make_scan_ioapic_request(kvm); 344 + } 345 + 346 + void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) 347 + { 348 + struct kvm *kvm = vcpu->kvm; 349 + struct kvm_kernel_irq_routing_entry *entry; 350 + struct kvm_irq_routing_table *table; 351 + u32 i, nr_ioapic_pins; 352 + int idx; 353 + 354 + /* kvm->irq_routing must be read after clearing 355 + * KVM_SCAN_IOAPIC. */ 356 + smp_mb(); 357 + idx = srcu_read_lock(&kvm->irq_srcu); 358 + table = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); 359 + nr_ioapic_pins = min_t(u32, table->nr_rt_entries, 360 + kvm->arch.nr_reserved_ioapic_pins); 361 + for (i = 0; i < nr_ioapic_pins; ++i) { 362 + hlist_for_each_entry(entry, &table->map[i], link) { 363 + u32 dest_id, dest_mode; 364 + bool level; 365 + 366 + if (entry->type != KVM_IRQ_ROUTING_MSI) 367 + continue; 368 + dest_id = (entry->msi.address_lo >> 12) & 0xff; 369 + dest_mode = (entry->msi.address_lo >> 2) & 0x1; 370 + level = entry->msi.data & MSI_DATA_TRIGGER_LEVEL; 371 + if (level && kvm_apic_match_dest(vcpu, NULL, 0, 372 + dest_id, dest_mode)) { 373 + u32 vector = entry->msi.data & 0xff; 374 + 375 + __set_bit(vector, 376 + (unsigned long *) eoi_exit_bitmap); 377 + } 378 + } 379 + } 380 + srcu_read_unlock(&kvm->irq_srcu, idx); 326 381 }
+104 -23
arch/x86/kvm/lapic.c
··· 209 209 if (old) 210 210 kfree_rcu(old, rcu); 211 211 212 - kvm_vcpu_request_scan_ioapic(kvm); 212 + kvm_make_scan_ioapic_request(kvm); 213 213 } 214 214 215 215 static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val) ··· 348 348 struct kvm_lapic *apic = vcpu->arch.apic; 349 349 350 350 __kvm_apic_update_irr(pir, apic->regs); 351 + 352 + kvm_make_request(KVM_REQ_EVENT, vcpu); 351 353 } 352 354 EXPORT_SYMBOL_GPL(kvm_apic_update_irr); 353 355 ··· 392 390 393 391 vcpu = apic->vcpu; 394 392 395 - if (unlikely(kvm_apic_vid_enabled(vcpu->kvm))) { 393 + if (unlikely(kvm_vcpu_apic_vid_enabled(vcpu))) { 396 394 /* try to update RVI */ 397 395 apic_clear_vector(vec, apic->regs + APIC_IRR); 398 396 kvm_make_request(KVM_REQ_EVENT, vcpu); ··· 551 549 return; 552 550 } 553 551 __clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention); 554 - } 555 - 556 - void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr) 557 - { 558 - struct kvm_lapic *apic = vcpu->arch.apic; 559 - int i; 560 - 561 - for (i = 0; i < 8; i++) 562 - apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]); 563 552 } 564 553 565 554 static void apic_update_ppr(struct kvm_lapic *apic) ··· 757 764 return ret; 758 765 } 759 766 767 + bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq, 768 + struct kvm_vcpu **dest_vcpu) 769 + { 770 + struct kvm_apic_map *map; 771 + bool ret = false; 772 + struct kvm_lapic *dst = NULL; 773 + 774 + if (irq->shorthand) 775 + return false; 776 + 777 + rcu_read_lock(); 778 + map = rcu_dereference(kvm->arch.apic_map); 779 + 780 + if (!map) 781 + goto out; 782 + 783 + if (irq->dest_mode == APIC_DEST_PHYSICAL) { 784 + if (irq->dest_id == 0xFF) 785 + goto out; 786 + 787 + if (irq->dest_id >= ARRAY_SIZE(map->phys_map)) 788 + goto out; 789 + 790 + dst = map->phys_map[irq->dest_id]; 791 + if (dst && kvm_apic_present(dst->vcpu)) 792 + *dest_vcpu = dst->vcpu; 793 + else 794 + goto out; 795 + } else { 796 + u16 cid; 797 + unsigned long bitmap = 1; 798 + int 
i, r = 0; 799 + 800 + if (!kvm_apic_logical_map_valid(map)) 801 + goto out; 802 + 803 + apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap); 804 + 805 + if (cid >= ARRAY_SIZE(map->logical_map)) 806 + goto out; 807 + 808 + for_each_set_bit(i, &bitmap, 16) { 809 + dst = map->logical_map[cid][i]; 810 + if (++r == 2) 811 + goto out; 812 + } 813 + 814 + if (dst && kvm_apic_present(dst->vcpu)) 815 + *dest_vcpu = dst->vcpu; 816 + else 817 + goto out; 818 + } 819 + 820 + ret = true; 821 + out: 822 + rcu_read_unlock(); 823 + return ret; 824 + } 825 + 760 826 /* 761 827 * Add a pending IRQ into lapic. 762 828 * Return 1 if successfully added and 0 if discarded. ··· 833 781 case APIC_DM_LOWEST: 834 782 vcpu->arch.apic_arb_prio++; 835 783 case APIC_DM_FIXED: 784 + if (unlikely(trig_mode && !level)) 785 + break; 786 + 836 787 /* FIXME add logic for vcpu on reset */ 837 788 if (unlikely(!apic_enabled(apic))) 838 789 break; ··· 844 789 845 790 if (dest_map) 846 791 __set_bit(vcpu->vcpu_id, dest_map); 792 + 793 + if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) { 794 + if (trig_mode) 795 + apic_set_vector(vector, apic->regs + APIC_TMR); 796 + else 797 + apic_clear_vector(vector, apic->regs + APIC_TMR); 798 + } 847 799 848 800 if (kvm_x86_ops->deliver_posted_interrupt) 849 801 kvm_x86_ops->deliver_posted_interrupt(vcpu, vector); ··· 930 868 return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio; 931 869 } 932 870 871 + static bool kvm_ioapic_handles_vector(struct kvm_lapic *apic, int vector) 872 + { 873 + return test_bit(vector, (ulong *)apic->vcpu->arch.eoi_exit_bitmap); 874 + } 875 + 933 876 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector) 934 877 { 935 - if (kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) { 936 - int trigger_mode; 937 - if (apic_test_vector(vector, apic->regs + APIC_TMR)) 938 - trigger_mode = IOAPIC_LEVEL_TRIG; 939 - else 940 - trigger_mode = IOAPIC_EDGE_TRIG; 941 - kvm_ioapic_update_eoi(apic->vcpu, vector, 
trigger_mode); 878 + int trigger_mode; 879 + 880 + /* Eoi the ioapic only if the ioapic doesn't own the vector. */ 881 + if (!kvm_ioapic_handles_vector(apic, vector)) 882 + return; 883 + 884 + /* Request a KVM exit to inform the userspace IOAPIC. */ 885 + if (irqchip_split(apic->vcpu->kvm)) { 886 + apic->vcpu->arch.pending_ioapic_eoi = vector; 887 + kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, apic->vcpu); 888 + return; 942 889 } 890 + 891 + if (apic_test_vector(vector, apic->regs + APIC_TMR)) 892 + trigger_mode = IOAPIC_LEVEL_TRIG; 893 + else 894 + trigger_mode = IOAPIC_EDGE_TRIG; 895 + 896 + kvm_ioapic_update_eoi(apic->vcpu, vector, trigger_mode); 943 897 } 944 898 945 899 static int apic_set_eoi(struct kvm_lapic *apic) ··· 1693 1615 apic_set_reg(apic, APIC_ISR + 0x10 * i, 0); 1694 1616 apic_set_reg(apic, APIC_TMR + 0x10 * i, 0); 1695 1617 } 1696 - apic->irr_pending = kvm_apic_vid_enabled(vcpu->kvm); 1618 + apic->irr_pending = kvm_vcpu_apic_vid_enabled(vcpu); 1697 1619 apic->isr_count = kvm_x86_ops->hwapic_isr_update ? 1 : 0; 1698 1620 apic->highest_isr_cache = -1; 1699 1621 update_divide_count(apic); ··· 1916 1838 kvm_x86_ops->hwapic_isr_update(vcpu->kvm, 1917 1839 apic_find_highest_isr(apic)); 1918 1840 kvm_make_request(KVM_REQ_EVENT, vcpu); 1919 - kvm_rtc_eoi_tracking_restore_one(vcpu); 1841 + if (ioapic_in_kernel(vcpu->kvm)) 1842 + kvm_rtc_eoi_tracking_restore_one(vcpu); 1843 + 1844 + vcpu->arch.apic_arb_prio = 0; 1920 1845 } 1921 1846 1922 1847 void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) ··· 2003 1922 /* Cache not set: could be safe but we don't bother. */ 2004 1923 apic->highest_isr_cache == -1 || 2005 1924 /* Need EOI to update ioapic. */ 2006 - kvm_ioapic_handles_vector(vcpu->kvm, apic->highest_isr_cache)) { 1925 + kvm_ioapic_handles_vector(apic, apic->highest_isr_cache)) { 2007 1926 /* 2008 1927 * PV EOI was disabled by apic_sync_pv_eoi_from_guest 2009 1928 * so we need not do anything here. 
··· 2059 1978 struct kvm_lapic *apic = vcpu->arch.apic; 2060 1979 u32 reg = (msr - APIC_BASE_MSR) << 4; 2061 1980 2062 - if (!irqchip_in_kernel(vcpu->kvm) || !apic_x2apic_mode(apic)) 1981 + if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic)) 2063 1982 return 1; 2064 1983 2065 1984 if (reg == APIC_ICR2) ··· 2076 1995 struct kvm_lapic *apic = vcpu->arch.apic; 2077 1996 u32 reg = (msr - APIC_BASE_MSR) << 4, low, high = 0; 2078 1997 2079 - if (!irqchip_in_kernel(vcpu->kvm) || !apic_x2apic_mode(apic)) 1998 + if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic)) 2080 1999 return 1; 2081 2000 2082 2001 if (reg == APIC_DFR || reg == APIC_ICR2) {
+4 -3
arch/x86/kvm/lapic.h
··· 57 57 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); 58 58 void kvm_apic_set_version(struct kvm_vcpu *vcpu); 59 59 60 - void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr); 61 60 void __kvm_apic_update_irr(u32 *pir, void *regs); 62 61 void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir); 63 62 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq, ··· 143 144 return apic->vcpu->arch.apic_base & X2APIC_ENABLE; 144 145 } 145 146 146 - static inline bool kvm_apic_vid_enabled(struct kvm *kvm) 147 + static inline bool kvm_vcpu_apic_vid_enabled(struct kvm_vcpu *vcpu) 147 148 { 148 - return kvm_x86_ops->vm_has_apicv(kvm); 149 + return kvm_x86_ops->cpu_uses_apicv(vcpu); 149 150 } 150 151 151 152 static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) ··· 168 169 169 170 void wait_lapic_expire(struct kvm_vcpu *vcpu); 170 171 172 + bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq, 173 + struct kvm_vcpu **dest_vcpu); 171 174 #endif
+53 -38
arch/x86/kvm/mmu.c
··· 818 818 kvm->arch.indirect_shadow_pages--; 819 819 } 820 820 821 - static int has_wrprotected_page(struct kvm_vcpu *vcpu, 822 - gfn_t gfn, 823 - int level) 821 + static int __has_wrprotected_page(gfn_t gfn, int level, 822 + struct kvm_memory_slot *slot) 824 823 { 825 - struct kvm_memory_slot *slot; 826 824 struct kvm_lpage_info *linfo; 827 825 828 - slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 829 826 if (slot) { 830 827 linfo = lpage_info_slot(gfn, slot, level); 831 828 return linfo->write_count; 832 829 } 833 830 834 831 return 1; 832 + } 833 + 834 + static int has_wrprotected_page(struct kvm_vcpu *vcpu, gfn_t gfn, int level) 835 + { 836 + struct kvm_memory_slot *slot; 837 + 838 + slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 839 + return __has_wrprotected_page(gfn, level, slot); 835 840 } 836 841 837 842 static int host_mapping_level(struct kvm *kvm, gfn_t gfn) ··· 856 851 return ret; 857 852 } 858 853 854 + static inline bool memslot_valid_for_gpte(struct kvm_memory_slot *slot, 855 + bool no_dirty_log) 856 + { 857 + if (!slot || slot->flags & KVM_MEMSLOT_INVALID) 858 + return false; 859 + if (no_dirty_log && slot->dirty_bitmap) 860 + return false; 861 + 862 + return true; 863 + } 864 + 859 865 static struct kvm_memory_slot * 860 866 gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn, 861 867 bool no_dirty_log) ··· 874 858 struct kvm_memory_slot *slot; 875 859 876 860 slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 877 - if (!slot || slot->flags & KVM_MEMSLOT_INVALID || 878 - (no_dirty_log && slot->dirty_bitmap)) 861 + if (!memslot_valid_for_gpte(slot, no_dirty_log)) 879 862 slot = NULL; 880 863 881 864 return slot; 882 865 } 883 866 884 - static bool mapping_level_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t large_gfn) 885 - { 886 - return !gfn_to_memslot_dirty_bitmap(vcpu, large_gfn, true); 887 - } 888 - 889 - static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) 867 + static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn, 868 + bool 
*force_pt_level) 890 869 { 891 870 int host_level, level, max_level; 871 + struct kvm_memory_slot *slot; 872 + 873 + if (unlikely(*force_pt_level)) 874 + return PT_PAGE_TABLE_LEVEL; 875 + 876 + slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn); 877 + *force_pt_level = !memslot_valid_for_gpte(slot, true); 878 + if (unlikely(*force_pt_level)) 879 + return PT_PAGE_TABLE_LEVEL; 892 880 893 881 host_level = host_mapping_level(vcpu->kvm, large_gfn); 894 882 ··· 902 882 max_level = min(kvm_x86_ops->get_lpage_level(), host_level); 903 883 904 884 for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level) 905 - if (has_wrprotected_page(vcpu, large_gfn, level)) 885 + if (__has_wrprotected_page(large_gfn, level, slot)) 906 886 break; 907 887 908 888 return level - 1; ··· 2982 2962 { 2983 2963 int r; 2984 2964 int level; 2985 - int force_pt_level; 2965 + bool force_pt_level = false; 2986 2966 pfn_t pfn; 2987 2967 unsigned long mmu_seq; 2988 2968 bool map_writable, write = error_code & PFERR_WRITE_MASK; 2989 2969 2990 - force_pt_level = mapping_level_dirty_bitmap(vcpu, gfn); 2970 + level = mapping_level(vcpu, gfn, &force_pt_level); 2991 2971 if (likely(!force_pt_level)) { 2992 - level = mapping_level(vcpu, gfn); 2993 2972 /* 2994 2973 * This path builds a PAE pagetable - so we can map 2995 2974 * 2mb pages at maximum. 
Therefore check if the level ··· 2998 2979 level = PT_DIRECTORY_LEVEL; 2999 2980 3000 2981 gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1); 3001 - } else 3002 - level = PT_PAGE_TABLE_LEVEL; 2982 + } 3003 2983 3004 2984 if (fast_page_fault(vcpu, v, level, error_code)) 3005 2985 return 0; ··· 3445 3427 3446 3428 static bool can_do_async_pf(struct kvm_vcpu *vcpu) 3447 3429 { 3448 - if (unlikely(!irqchip_in_kernel(vcpu->kvm) || 3430 + if (unlikely(!lapic_in_kernel(vcpu) || 3449 3431 kvm_event_needs_reinjection(vcpu))) 3450 3432 return false; 3451 3433 ··· 3494 3476 pfn_t pfn; 3495 3477 int r; 3496 3478 int level; 3497 - int force_pt_level; 3479 + bool force_pt_level; 3498 3480 gfn_t gfn = gpa >> PAGE_SHIFT; 3499 3481 unsigned long mmu_seq; 3500 3482 int write = error_code & PFERR_WRITE_MASK; ··· 3513 3495 if (r) 3514 3496 return r; 3515 3497 3516 - if (mapping_level_dirty_bitmap(vcpu, gfn) || 3517 - !check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL)) 3518 - force_pt_level = 1; 3519 - else 3520 - force_pt_level = 0; 3521 - 3498 + force_pt_level = !check_hugepage_cache_consistency(vcpu, gfn, 3499 + PT_DIRECTORY_LEVEL); 3500 + level = mapping_level(vcpu, gfn, &force_pt_level); 3522 3501 if (likely(!force_pt_level)) { 3523 - level = mapping_level(vcpu, gfn); 3524 3502 if (level > PT_DIRECTORY_LEVEL && 3525 3503 !check_hugepage_cache_consistency(vcpu, gfn, level)) 3526 3504 level = PT_DIRECTORY_LEVEL; 3527 3505 gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1); 3528 - } else 3529 - level = PT_PAGE_TABLE_LEVEL; 3506 + } 3530 3507 3531 3508 if (fast_page_fault(vcpu, gpa, level, error_code)) 3532 3509 return 0; ··· 3719 3706 __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check, 3720 3707 int maxphyaddr, bool execonly) 3721 3708 { 3722 - int pte; 3709 + u64 bad_mt_xwr; 3723 3710 3724 3711 rsvd_check->rsvd_bits_mask[0][3] = 3725 3712 rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7); ··· 3737 3724 rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20); 3738 3725 
rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0]; 3739 3726 3740 - for (pte = 0; pte < 64; pte++) { 3741 - int rwx_bits = pte & 7; 3742 - int mt = pte >> 3; 3743 - if (mt == 0x2 || mt == 0x3 || mt == 0x7 || 3744 - rwx_bits == 0x2 || rwx_bits == 0x6 || 3745 - (rwx_bits == 0x4 && !execonly)) 3746 - rsvd_check->bad_mt_xwr |= (1ull << pte); 3727 + bad_mt_xwr = 0xFFull << (2 * 8); /* bits 3..5 must not be 2 */ 3728 + bad_mt_xwr |= 0xFFull << (3 * 8); /* bits 3..5 must not be 3 */ 3729 + bad_mt_xwr |= 0xFFull << (7 * 8); /* bits 3..5 must not be 7 */ 3730 + bad_mt_xwr |= REPEAT_BYTE(1ull << 2); /* bits 0..2 must not be 010 */ 3731 + bad_mt_xwr |= REPEAT_BYTE(1ull << 6); /* bits 0..2 must not be 110 */ 3732 + if (!execonly) { 3733 + /* bits 0..2 must not be 100 unless VMX capabilities allow it */ 3734 + bad_mt_xwr |= REPEAT_BYTE(1ull << 4); 3747 3735 } 3736 + rsvd_check->bad_mt_xwr = bad_mt_xwr; 3748 3737 } 3749 3738 3750 3739 static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+9 -10
arch/x86/kvm/paging_tmpl.h
··· 698 698 int r; 699 699 pfn_t pfn; 700 700 int level = PT_PAGE_TABLE_LEVEL; 701 - int force_pt_level; 701 + bool force_pt_level = false; 702 702 unsigned long mmu_seq; 703 703 bool map_writable, is_self_change_mapping; 704 704 ··· 743 743 is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu, 744 744 &walker, user_fault, &vcpu->arch.write_fault_to_shadow_pgtable); 745 745 746 - if (walker.level >= PT_DIRECTORY_LEVEL) 747 - force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn) 748 - || is_self_change_mapping; 749 - else 750 - force_pt_level = 1; 751 - if (!force_pt_level) { 752 - level = min(walker.level, mapping_level(vcpu, walker.gfn)); 753 - walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1); 754 - } 746 + if (walker.level >= PT_DIRECTORY_LEVEL && !is_self_change_mapping) { 747 + level = mapping_level(vcpu, walker.gfn, &force_pt_level); 748 + if (likely(!force_pt_level)) { 749 + level = min(walker.level, level); 750 + walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1); 751 + } 752 + } else 753 + force_pt_level = true; 755 754 756 755 mmu_seq = vcpu->kvm->mmu_notifier_seq; 757 756 smp_rmb();
+19 -24
arch/x86/kvm/svm.c
··· 159 159 u32 apf_reason; 160 160 161 161 u64 tsc_ratio; 162 + 163 + /* cached guest cpuid flags for faster access */ 164 + bool nrips_enabled : 1; 162 165 }; 163 166 164 167 static DEFINE_PER_CPU(u64, current_tsc_ratio); ··· 1089 1086 return target_tsc - tsc; 1090 1087 } 1091 1088 1092 - static void init_vmcb(struct vcpu_svm *svm, bool init_event) 1089 + static void init_vmcb(struct vcpu_svm *svm) 1093 1090 { 1094 1091 struct vmcb_control_area *control = &svm->vmcb->control; 1095 1092 struct vmcb_save_area *save = &svm->vmcb->save; ··· 1160 1157 init_sys_seg(&save->ldtr, SEG_TYPE_LDT); 1161 1158 init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16); 1162 1159 1163 - if (!init_event) 1164 - svm_set_efer(&svm->vcpu, 0); 1160 + svm_set_efer(&svm->vcpu, 0); 1165 1161 save->dr6 = 0xffff0ff0; 1166 1162 kvm_set_rflags(&svm->vcpu, 2); 1167 1163 save->rip = 0x0000fff0; ··· 1214 1212 if (kvm_vcpu_is_reset_bsp(&svm->vcpu)) 1215 1213 svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; 1216 1214 } 1217 - init_vmcb(svm, init_event); 1215 + init_vmcb(svm); 1218 1216 1219 1217 kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy); 1220 1218 kvm_register_write(vcpu, VCPU_REGS_RDX, eax); ··· 1270 1268 clear_page(svm->vmcb); 1271 1269 svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT; 1272 1270 svm->asid_generation = 0; 1273 - init_vmcb(svm, false); 1271 + init_vmcb(svm); 1274 1272 1275 1273 svm_init_osvw(&svm->vcpu); 1276 1274 ··· 1892 1890 * so reinitialize it. 
1893 1891 */ 1894 1892 clear_page(svm->vmcb); 1895 - init_vmcb(svm, false); 1893 + init_vmcb(svm); 1896 1894 1897 1895 kvm_run->exit_reason = KVM_EXIT_SHUTDOWN; 1898 1896 return 0; ··· 2367 2365 nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2; 2368 2366 nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info; 2369 2367 nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err; 2370 - nested_vmcb->control.next_rip = vmcb->control.next_rip; 2368 + 2369 + if (svm->nrips_enabled) 2370 + nested_vmcb->control.next_rip = vmcb->control.next_rip; 2371 2371 2372 2372 /* 2373 2373 * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have ··· 3064 3060 u8 cr8_prev = kvm_get_cr8(&svm->vcpu); 3065 3061 /* instruction emulation calls kvm_set_cr8() */ 3066 3062 r = cr_interception(svm); 3067 - if (irqchip_in_kernel(svm->vcpu.kvm)) 3063 + if (lapic_in_kernel(&svm->vcpu)) 3068 3064 return r; 3069 3065 if (cr8_prev <= kvm_get_cr8(&svm->vcpu)) 3070 3066 return r; ··· 3298 3294 3299 3295 static int interrupt_window_interception(struct vcpu_svm *svm) 3300 3296 { 3301 - struct kvm_run *kvm_run = svm->vcpu.run; 3302 - 3303 3297 kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); 3304 3298 svm_clear_vintr(svm); 3305 3299 svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; 3306 3300 mark_dirty(svm->vmcb, VMCB_INTR); 3307 3301 ++svm->vcpu.stat.irq_window_exits; 3308 - /* 3309 - * If the user space waits to inject interrupts, exit as soon as 3310 - * possible 3311 - */ 3312 - if (!irqchip_in_kernel(svm->vcpu.kvm) && 3313 - kvm_run->request_interrupt_window && 3314 - !kvm_cpu_has_interrupt(&svm->vcpu)) { 3315 - kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN; 3316 - return 0; 3317 - } 3318 - 3319 3302 return 1; 3320 3303 } 3321 3304 ··· 3650 3659 return; 3651 3660 } 3652 3661 3653 - static int svm_vm_has_apicv(struct kvm *kvm) 3662 + static int svm_cpu_uses_apicv(struct kvm_vcpu *vcpu) 3654 3663 { 3655 3664 return 0; 3656 3665 } 3657 3666 3658 - static void 
svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) 3667 + static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu) 3659 3668 { 3660 3669 return; 3661 3670 } ··· 4089 4098 4090 4099 static void svm_cpuid_update(struct kvm_vcpu *vcpu) 4091 4100 { 4101 + struct vcpu_svm *svm = to_svm(vcpu); 4102 + 4103 + /* Update nrips enabled cache */ 4104 + svm->nrips_enabled = !!guest_cpuid_has_nrips(&svm->vcpu); 4092 4105 } 4093 4106 4094 4107 static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) ··· 4420 4425 .enable_irq_window = enable_irq_window, 4421 4426 .update_cr8_intercept = update_cr8_intercept, 4422 4427 .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode, 4423 - .vm_has_apicv = svm_vm_has_apicv, 4428 + .cpu_uses_apicv = svm_cpu_uses_apicv, 4424 4429 .load_eoi_exitmap = svm_load_eoi_exitmap, 4425 4430 .sync_pir_to_irr = svm_sync_pir_to_irr, 4426 4431
+51
arch/x86/kvm/trace.h
··· 129 129 ); 130 130 131 131 /* 132 + * Tracepoint for fast mmio. 133 + */ 134 + TRACE_EVENT(kvm_fast_mmio, 135 + TP_PROTO(u64 gpa), 136 + TP_ARGS(gpa), 137 + 138 + TP_STRUCT__entry( 139 + __field(u64, gpa) 140 + ), 141 + 142 + TP_fast_assign( 143 + __entry->gpa = gpa; 144 + ), 145 + 146 + TP_printk("fast mmio at gpa 0x%llx", __entry->gpa) 147 + ); 148 + 149 + /* 132 150 * Tracepoint for cpuid. 133 151 */ 134 152 TRACE_EVENT(kvm_cpuid, ··· 990 972 __entry->vcpu_id, 991 973 __entry->entering ? "entering" : "leaving", 992 974 __entry->smbase) 975 + ); 976 + 977 + /* 978 + * Tracepoint for VT-d posted-interrupts. 979 + */ 980 + TRACE_EVENT(kvm_pi_irte_update, 981 + TP_PROTO(unsigned int vcpu_id, unsigned int gsi, 982 + unsigned int gvec, u64 pi_desc_addr, bool set), 983 + TP_ARGS(vcpu_id, gsi, gvec, pi_desc_addr, set), 984 + 985 + TP_STRUCT__entry( 986 + __field( unsigned int, vcpu_id ) 987 + __field( unsigned int, gsi ) 988 + __field( unsigned int, gvec ) 989 + __field( u64, pi_desc_addr ) 990 + __field( bool, set ) 991 + ), 992 + 993 + TP_fast_assign( 994 + __entry->vcpu_id = vcpu_id; 995 + __entry->gsi = gsi; 996 + __entry->gvec = gvec; 997 + __entry->pi_desc_addr = pi_desc_addr; 998 + __entry->set = set; 999 + ), 1000 + 1001 + TP_printk("VT-d PI is %s for this irq, vcpu %u, gsi: 0x%x, " 1002 + "gvec: 0x%x, pi_desc_addr: 0x%llx", 1003 + __entry->set ? "enabled and being updated" : "disabled", 1004 + __entry->vcpu_id, 1005 + __entry->gsi, 1006 + __entry->gvec, 1007 + __entry->pi_desc_addr) 993 1008 ); 994 1009 995 1010 #endif /* _TRACE_KVM_H */
+605 -145
arch/x86/kvm/vmx.c
··· 35 35 #include "kvm_cache_regs.h" 36 36 #include "x86.h" 37 37 38 + #include <asm/cpu.h> 38 39 #include <asm/io.h> 39 40 #include <asm/desc.h> 40 41 #include <asm/vmx.h> ··· 46 45 #include <asm/debugreg.h> 47 46 #include <asm/kexec.h> 48 47 #include <asm/apic.h> 48 + #include <asm/irq_remapping.h> 49 49 50 50 #include "trace.h" 51 51 #include "pmu.h" ··· 426 424 /* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */ 427 425 u64 vmcs01_debugctl; 428 426 427 + u16 vpid02; 428 + u16 last_vpid; 429 + 429 430 u32 nested_vmx_procbased_ctls_low; 430 431 u32 nested_vmx_procbased_ctls_high; 431 432 u32 nested_vmx_true_procbased_ctls_low; ··· 445 440 u32 nested_vmx_misc_low; 446 441 u32 nested_vmx_misc_high; 447 442 u32 nested_vmx_ept_caps; 443 + u32 nested_vmx_vpid_caps; 448 444 }; 449 445 450 446 #define POSTED_INTR_ON 0 447 + #define POSTED_INTR_SN 1 448 + 451 449 /* Posted-Interrupt Descriptor */ 452 450 struct pi_desc { 453 451 u32 pir[8]; /* Posted interrupt requested */ 454 - u32 control; /* bit 0 of control is outstanding notification bit */ 455 - u32 rsvd[7]; 452 + union { 453 + struct { 454 + /* bit 256 - Outstanding Notification */ 455 + u16 on : 1, 456 + /* bit 257 - Suppress Notification */ 457 + sn : 1, 458 + /* bit 271:258 - Reserved */ 459 + rsvd_1 : 14; 460 + /* bit 279:272 - Notification Vector */ 461 + u8 nv; 462 + /* bit 287:280 - Reserved */ 463 + u8 rsvd_2; 464 + /* bit 319:288 - Notification Destination */ 465 + u32 ndst; 466 + }; 467 + u64 control; 468 + }; 469 + u32 rsvd[6]; 456 470 } __aligned(64); 457 471 458 472 static bool pi_test_and_set_on(struct pi_desc *pi_desc) ··· 489 465 static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc) 490 466 { 491 467 return test_and_set_bit(vector, (unsigned long *)pi_desc->pir); 468 + } 469 + 470 + static inline void pi_clear_sn(struct pi_desc *pi_desc) 471 + { 472 + return clear_bit(POSTED_INTR_SN, 473 + (unsigned long *)&pi_desc->control); 474 + } 475 + 476 + static inline void 
pi_set_sn(struct pi_desc *pi_desc) 477 + { 478 + return set_bit(POSTED_INTR_SN, 479 + (unsigned long *)&pi_desc->control); 480 + } 481 + 482 + static inline int pi_test_on(struct pi_desc *pi_desc) 483 + { 484 + return test_bit(POSTED_INTR_ON, 485 + (unsigned long *)&pi_desc->control); 486 + } 487 + 488 + static inline int pi_test_sn(struct pi_desc *pi_desc) 489 + { 490 + return test_bit(POSTED_INTR_SN, 491 + (unsigned long *)&pi_desc->control); 492 492 } 493 493 494 494 struct vcpu_vmx { ··· 580 532 s64 vnmi_blocked_time; 581 533 u32 exit_reason; 582 534 583 - bool rdtscp_enabled; 584 - 585 535 /* Posted interrupt descriptor */ 586 536 struct pi_desc pi_desc; 587 537 ··· 607 561 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) 608 562 { 609 563 return container_of(vcpu, struct vcpu_vmx, vcpu); 564 + } 565 + 566 + static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu) 567 + { 568 + return &(to_vmx(vcpu)->pi_desc); 610 569 } 611 570 612 571 #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x) ··· 860 809 static void kvm_cpu_vmxoff(void); 861 810 static bool vmx_mpx_supported(void); 862 811 static bool vmx_xsaves_supported(void); 863 - static int vmx_vm_has_apicv(struct kvm *kvm); 812 + static int vmx_cpu_uses_apicv(struct kvm_vcpu *vcpu); 864 813 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); 865 814 static void vmx_set_segment(struct kvm_vcpu *vcpu, 866 815 struct kvm_segment *var, int seg); ··· 881 830 */ 882 831 static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu); 883 832 static DEFINE_PER_CPU(struct desc_ptr, host_gdt); 833 + 834 + /* 835 + * We maintain a per-CPU linked-list of vCPU, so in wakeup_handler() we 836 + * can find which vCPU should be woken up. 
837 + */ 838 + static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu); 839 + static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); 884 840 885 841 static unsigned long *vmx_io_bitmap_a; 886 842 static unsigned long *vmx_io_bitmap_b; ··· 1004 946 return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW; 1005 947 } 1006 948 1007 - static inline bool vm_need_tpr_shadow(struct kvm *kvm) 949 + static inline bool cpu_need_tpr_shadow(struct kvm_vcpu *vcpu) 1008 950 { 1009 - return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm)); 951 + return cpu_has_vmx_tpr_shadow() && lapic_in_kernel(vcpu); 1010 952 } 1011 953 1012 954 static inline bool cpu_has_secondary_exec_ctrls(void) ··· 1041 983 1042 984 static inline bool cpu_has_vmx_posted_intr(void) 1043 985 { 1044 - return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR; 986 + return IS_ENABLED(CONFIG_X86_LOCAL_APIC) && 987 + vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR; 1045 988 } 1046 989 1047 990 static inline bool cpu_has_vmx_apicv(void) ··· 1121 1062 SECONDARY_EXEC_PAUSE_LOOP_EXITING; 1122 1063 } 1123 1064 1124 - static inline bool vm_need_virtualize_apic_accesses(struct kvm *kvm) 1065 + static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu) 1125 1066 { 1126 - return flexpriority_enabled && irqchip_in_kernel(kvm); 1067 + return flexpriority_enabled && lapic_in_kernel(vcpu); 1127 1068 } 1128 1069 1129 1070 static inline bool cpu_has_vmx_vpid(void) ··· 1214 1155 static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12) 1215 1156 { 1216 1157 return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); 1158 + } 1159 + 1160 + static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12) 1161 + { 1162 + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID); 1217 1163 } 1218 1164 1219 1165 static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12) ··· 1401 1337 __loaded_vmcs_clear, loaded_vmcs, 1); 1402 1338 } 1403 1339 
1404 - static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx) 1340 + static inline void vpid_sync_vcpu_single(int vpid) 1405 1341 { 1406 - if (vmx->vpid == 0) 1342 + if (vpid == 0) 1407 1343 return; 1408 1344 1409 1345 if (cpu_has_vmx_invvpid_single()) 1410 - __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vmx->vpid, 0); 1346 + __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vpid, 0); 1411 1347 } 1412 1348 1413 1349 static inline void vpid_sync_vcpu_global(void) ··· 1416 1352 __invvpid(VMX_VPID_EXTENT_ALL_CONTEXT, 0, 0); 1417 1353 } 1418 1354 1419 - static inline void vpid_sync_context(struct vcpu_vmx *vmx) 1355 + static inline void vpid_sync_context(int vpid) 1420 1356 { 1421 1357 if (cpu_has_vmx_invvpid_single()) 1422 - vpid_sync_vcpu_single(vmx); 1358 + vpid_sync_vcpu_single(vpid); 1423 1359 else 1424 1360 vpid_sync_vcpu_global(); 1425 1361 } ··· 1959 1895 preempt_enable(); 1960 1896 } 1961 1897 1898 + static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu) 1899 + { 1900 + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 1901 + struct pi_desc old, new; 1902 + unsigned int dest; 1903 + 1904 + if (!kvm_arch_has_assigned_device(vcpu->kvm) || 1905 + !irq_remapping_cap(IRQ_POSTING_CAP)) 1906 + return; 1907 + 1908 + do { 1909 + old.control = new.control = pi_desc->control; 1910 + 1911 + /* 1912 + * If 'nv' field is POSTED_INTR_WAKEUP_VECTOR, there 1913 + * are two possible cases: 1914 + * 1. After running 'pre_block', context switch 1915 + * happened. For this case, 'sn' was set in 1916 + * vmx_vcpu_put(), so we need to clear it here. 1917 + * 2. After running 'pre_block', we were blocked, 1918 + * and woken up by some other guy. For this case, 1919 + * we don't need to do anything, 'pi_post_block' 1920 + * will do everything for us. However, we cannot 1921 + * check whether it is case #1 or case #2 here 1922 + * (maybe, not needed), so we also clear sn here, 1923 + * I think it is not a big deal. 
1924 + */ 1925 + if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR) { 1926 + if (vcpu->cpu != cpu) { 1927 + dest = cpu_physical_id(cpu); 1928 + 1929 + if (x2apic_enabled()) 1930 + new.ndst = dest; 1931 + else 1932 + new.ndst = (dest << 8) & 0xFF00; 1933 + } 1934 + 1935 + /* set 'NV' to 'notification vector' */ 1936 + new.nv = POSTED_INTR_VECTOR; 1937 + } 1938 + 1939 + /* Allow posting non-urgent interrupts */ 1940 + new.sn = 0; 1941 + } while (cmpxchg(&pi_desc->control, old.control, 1942 + new.control) != old.control); 1943 + } 1962 1944 /* 1963 1945 * Switches to specified vcpu, until a matching vcpu_put(), but assumes 1964 1946 * vcpu mutex is already taken. ··· 2055 1945 vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */ 2056 1946 vmx->loaded_vmcs->cpu = cpu; 2057 1947 } 1948 + 1949 + vmx_vcpu_pi_load(vcpu, cpu); 1950 + } 1951 + 1952 + static void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu) 1953 + { 1954 + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 1955 + 1956 + if (!kvm_arch_has_assigned_device(vcpu->kvm) || 1957 + !irq_remapping_cap(IRQ_POSTING_CAP)) 1958 + return; 1959 + 1960 + /* Set SN when the vCPU is preempted */ 1961 + if (vcpu->preempted) 1962 + pi_set_sn(pi_desc); 2058 1963 } 2059 1964 2060 1965 static void vmx_vcpu_put(struct kvm_vcpu *vcpu) 2061 1966 { 1967 + vmx_vcpu_pi_put(vcpu); 1968 + 2062 1969 __vmx_load_host_state(to_vmx(vcpu)); 2063 1970 if (!vmm_exclusive) { 2064 1971 __loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs); ··· 2334 2207 if (index >= 0) 2335 2208 move_msr_up(vmx, index, save_nmsrs++); 2336 2209 index = __find_msr_index(vmx, MSR_TSC_AUX); 2337 - if (index >= 0 && vmx->rdtscp_enabled) 2210 + if (index >= 0 && guest_cpuid_has_rdtscp(&vmx->vcpu)) 2338 2211 move_msr_up(vmx, index, save_nmsrs++); 2339 2212 /* 2340 2213 * MSR_STAR is only needed on long mode guests, and only ··· 2504 2377 vmx->nested.nested_vmx_pinbased_ctls_high |= 2505 2378 PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | 2506 2379 PIN_BASED_VMX_PREEMPTION_TIMER; 2507 - if 
(vmx_vm_has_apicv(vmx->vcpu.kvm)) 2380 + if (vmx_cpu_uses_apicv(&vmx->vcpu)) 2508 2381 vmx->nested.nested_vmx_pinbased_ctls_high |= 2509 2382 PIN_BASED_POSTED_INTR; 2510 2383 ··· 2598 2471 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 2599 2472 SECONDARY_EXEC_RDTSCP | 2600 2473 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | 2474 + SECONDARY_EXEC_ENABLE_VPID | 2601 2475 SECONDARY_EXEC_APIC_REGISTER_VIRT | 2602 2476 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 2603 2477 SECONDARY_EXEC_WBINVD_EXITING | 2604 - SECONDARY_EXEC_XSAVES; 2478 + SECONDARY_EXEC_XSAVES | 2479 + SECONDARY_EXEC_PCOMMIT; 2605 2480 2606 2481 if (enable_ept) { 2607 2482 /* nested EPT: emulate EPT also to L1 */ ··· 2621 2492 vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT; 2622 2493 } else 2623 2494 vmx->nested.nested_vmx_ept_caps = 0; 2495 + 2496 + if (enable_vpid) 2497 + vmx->nested.nested_vmx_vpid_caps = VMX_VPID_INVVPID_BIT | 2498 + VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT; 2499 + else 2500 + vmx->nested.nested_vmx_vpid_caps = 0; 2624 2501 2625 2502 if (enable_unrestricted_guest) 2626 2503 vmx->nested.nested_vmx_secondary_ctls_high |= ··· 2743 2608 break; 2744 2609 case MSR_IA32_VMX_EPT_VPID_CAP: 2745 2610 /* Currently, no nested vpid support */ 2746 - *pdata = vmx->nested.nested_vmx_ept_caps; 2611 + *pdata = vmx->nested.nested_vmx_ept_caps | 2612 + ((u64)vmx->nested.nested_vmx_vpid_caps << 32); 2747 2613 break; 2748 2614 default: 2749 2615 return 1; ··· 2809 2673 msr_info->data = vcpu->arch.ia32_xss; 2810 2674 break; 2811 2675 case MSR_TSC_AUX: 2812 - if (!to_vmx(vcpu)->rdtscp_enabled) 2676 + if (!guest_cpuid_has_rdtscp(vcpu)) 2813 2677 return 1; 2814 2678 /* Otherwise falls through */ 2815 2679 default: ··· 2915 2779 clear_atomic_switch_msr(vmx, MSR_IA32_XSS); 2916 2780 break; 2917 2781 case MSR_TSC_AUX: 2918 - if (!vmx->rdtscp_enabled) 2782 + if (!guest_cpuid_has_rdtscp(vcpu)) 2919 2783 return 1; 2920 2784 /* Check reserved bit, higher 32 bits should be zero */ 2921 2785 if ((data >> 32) != 0) ··· 3010 
2874 return -EBUSY; 3011 2875 3012 2876 INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu)); 2877 + INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu)); 2878 + spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); 3013 2879 3014 2880 /* 3015 2881 * Now we can enable the vmclear operation in kdump ··· 3153 3015 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 3154 3016 SECONDARY_EXEC_SHADOW_VMCS | 3155 3017 SECONDARY_EXEC_XSAVES | 3156 - SECONDARY_EXEC_ENABLE_PML; 3018 + SECONDARY_EXEC_ENABLE_PML | 3019 + SECONDARY_EXEC_PCOMMIT; 3157 3020 if (adjust_vmx_controls(min2, opt2, 3158 3021 MSR_IA32_VMX_PROCBASED_CTLS2, 3159 3022 &_cpu_based_2nd_exec_control) < 0) ··· 3580 3441 3581 3442 #endif 3582 3443 3583 - static void vmx_flush_tlb(struct kvm_vcpu *vcpu) 3444 + static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid) 3584 3445 { 3585 - vpid_sync_context(to_vmx(vcpu)); 3446 + vpid_sync_context(vpid); 3586 3447 if (enable_ept) { 3587 3448 if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) 3588 3449 return; 3589 3450 ept_sync_context(construct_eptp(vcpu->arch.mmu.root_hpa)); 3590 3451 } 3452 + } 3453 + 3454 + static void vmx_flush_tlb(struct kvm_vcpu *vcpu) 3455 + { 3456 + __vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid); 3591 3457 } 3592 3458 3593 3459 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu) ··· 3788 3644 if (!is_paging(vcpu)) { 3789 3645 hw_cr4 &= ~X86_CR4_PAE; 3790 3646 hw_cr4 |= X86_CR4_PSE; 3791 - /* 3792 - * SMEP/SMAP is disabled if CPU is in non-paging mode 3793 - * in hardware. However KVM always uses paging mode to 3794 - * emulate guest non-paging mode with TDP. 3795 - * To emulate this behavior, SMEP/SMAP needs to be 3796 - * manually disabled when guest switches to non-paging 3797 - * mode. 
3798 - */ 3799 - hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP); 3800 3647 } else if (!(cr4 & X86_CR4_PAE)) { 3801 3648 hw_cr4 &= ~X86_CR4_PAE; 3802 3649 } 3803 3650 } 3651 + 3652 + if (!enable_unrestricted_guest && !is_paging(vcpu)) 3653 + /* 3654 + * SMEP/SMAP is disabled if CPU is in non-paging mode in 3655 + * hardware. However KVM always uses paging mode without 3656 + * unrestricted guest. 3657 + * To emulate this behavior, SMEP/SMAP needs to be manually 3658 + * disabled when guest switches to non-paging mode. 3659 + */ 3660 + hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP); 3804 3661 3805 3662 vmcs_writel(CR4_READ_SHADOW, cr4); 3806 3663 vmcs_writel(GUEST_CR4, hw_cr4); ··· 4291 4146 return r; 4292 4147 } 4293 4148 4294 - static void allocate_vpid(struct vcpu_vmx *vmx) 4149 + static int allocate_vpid(void) 4295 4150 { 4296 4151 int vpid; 4297 4152 4298 - vmx->vpid = 0; 4299 4153 if (!enable_vpid) 4300 - return; 4154 + return 0; 4301 4155 spin_lock(&vmx_vpid_lock); 4302 4156 vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS); 4303 - if (vpid < VMX_NR_VPIDS) { 4304 - vmx->vpid = vpid; 4157 + if (vpid < VMX_NR_VPIDS) 4305 4158 __set_bit(vpid, vmx_vpid_bitmap); 4306 - } 4159 + else 4160 + vpid = 0; 4307 4161 spin_unlock(&vmx_vpid_lock); 4162 + return vpid; 4308 4163 } 4309 4164 4310 - static void free_vpid(struct vcpu_vmx *vmx) 4165 + static void free_vpid(int vpid) 4311 4166 { 4312 - if (!enable_vpid) 4167 + if (!enable_vpid || vpid == 0) 4313 4168 return; 4314 4169 spin_lock(&vmx_vpid_lock); 4315 - if (vmx->vpid != 0) 4316 - __clear_bit(vmx->vpid, vmx_vpid_bitmap); 4170 + __clear_bit(vpid, vmx_vpid_bitmap); 4317 4171 spin_unlock(&vmx_vpid_lock); 4318 4172 } 4319 4173 ··· 4467 4323 msr, MSR_TYPE_W); 4468 4324 } 4469 4325 4470 - static int vmx_vm_has_apicv(struct kvm *kvm) 4326 + static int vmx_cpu_uses_apicv(struct kvm_vcpu *vcpu) 4471 4327 { 4472 - return enable_apicv && irqchip_in_kernel(kvm); 4328 + return enable_apicv && lapic_in_kernel(vcpu); 4473 4329 } 4474 
4330 4475 4331 static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) ··· 4513 4369 { 4514 4370 #ifdef CONFIG_SMP 4515 4371 if (vcpu->mode == IN_GUEST_MODE) { 4372 + struct vcpu_vmx *vmx = to_vmx(vcpu); 4373 + 4374 + /* 4375 + * Currently, we don't support urgent interrupt, 4376 + * all interrupts are recognized as non-urgent 4377 + * interrupt, so we cannot post interrupts when 4378 + * 'SN' is set. 4379 + * 4380 + * If the vcpu is in guest mode, it means it is 4381 + * running instead of being scheduled out and 4382 + * waiting in the run queue, and that's the only 4383 + * case when 'SN' is set currently, warning if 4384 + * 'SN' is set. 4385 + */ 4386 + WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc)); 4387 + 4516 4388 apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), 4517 4389 POSTED_INTR_VECTOR); 4518 4390 return true; ··· 4665 4505 { 4666 4506 u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl; 4667 4507 4668 - if (!vmx_vm_has_apicv(vmx->vcpu.kvm)) 4508 + if (!vmx_cpu_uses_apicv(&vmx->vcpu)) 4669 4509 pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; 4670 4510 return pin_based_exec_ctrl; 4671 4511 } ··· 4677 4517 if (vmx->vcpu.arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT) 4678 4518 exec_control &= ~CPU_BASED_MOV_DR_EXITING; 4679 4519 4680 - if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) { 4520 + if (!cpu_need_tpr_shadow(&vmx->vcpu)) { 4681 4521 exec_control &= ~CPU_BASED_TPR_SHADOW; 4682 4522 #ifdef CONFIG_X86_64 4683 4523 exec_control |= CPU_BASED_CR8_STORE_EXITING | ··· 4694 4534 static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx) 4695 4535 { 4696 4536 u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl; 4697 - if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) 4537 + if (!cpu_need_virtualize_apic_accesses(&vmx->vcpu)) 4698 4538 exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; 4699 4539 if (vmx->vpid == 0) 4700 4540 exec_control &= ~SECONDARY_EXEC_ENABLE_VPID; ··· 4708 4548 exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST; 4709 4549 
if (!ple_gap) 4710 4550 exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING; 4711 - if (!vmx_vm_has_apicv(vmx->vcpu.kvm)) 4551 + if (!vmx_cpu_uses_apicv(&vmx->vcpu)) 4712 4552 exec_control &= ~(SECONDARY_EXEC_APIC_REGISTER_VIRT | 4713 4553 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); 4714 4554 exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE; ··· 4718 4558 a current VMCS12 4719 4559 */ 4720 4560 exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS; 4721 - /* PML is enabled/disabled in creating/destorying vcpu */ 4722 - exec_control &= ~SECONDARY_EXEC_ENABLE_PML; 4561 + 4562 + if (!enable_pml) 4563 + exec_control &= ~SECONDARY_EXEC_ENABLE_PML; 4564 + 4565 + /* Currently, we allow L1 guest to directly run pcommit instruction. */ 4566 + exec_control &= ~SECONDARY_EXEC_PCOMMIT; 4723 4567 4724 4568 return exec_control; 4725 4569 } ··· 4768 4604 4769 4605 vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx)); 4770 4606 4771 - if (cpu_has_secondary_exec_ctrls()) { 4607 + if (cpu_has_secondary_exec_ctrls()) 4772 4608 vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 4773 4609 vmx_secondary_exec_control(vmx)); 4774 - } 4775 4610 4776 - if (vmx_vm_has_apicv(vmx->vcpu.kvm)) { 4611 + if (vmx_cpu_uses_apicv(&vmx->vcpu)) { 4777 4612 vmcs_write64(EOI_EXIT_BITMAP0, 0); 4778 4613 vmcs_write64(EOI_EXIT_BITMAP1, 0); 4779 4614 vmcs_write64(EOI_EXIT_BITMAP2, 0); ··· 4916 4753 4917 4754 if (cpu_has_vmx_tpr_shadow() && !init_event) { 4918 4755 vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0); 4919 - if (vm_need_tpr_shadow(vcpu->kvm)) 4756 + if (cpu_need_tpr_shadow(vcpu)) 4920 4757 vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 4921 4758 __pa(vcpu->arch.apic->regs)); 4922 4759 vmcs_write32(TPR_THRESHOLD, 0); ··· 4924 4761 4925 4762 kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); 4926 4763 4927 - if (vmx_vm_has_apicv(vcpu->kvm)) 4764 + if (vmx_cpu_uses_apicv(vcpu)) 4928 4765 memset(&vmx->pi_desc, 0, sizeof(struct pi_desc)); 4929 4766 4930 4767 if (vmx->vpid != 0) ··· 4934 4771 vmx_set_cr0(vcpu, cr0); /* enter rmode 
*/ 4935 4772 vmx->vcpu.arch.cr0 = cr0; 4936 4773 vmx_set_cr4(vcpu, 0); 4937 - if (!init_event) 4938 - vmx_set_efer(vcpu, 0); 4774 + vmx_set_efer(vcpu, 0); 4939 4775 vmx_fpu_activate(vcpu); 4940 4776 update_exception_bitmap(vcpu); 4941 4777 4942 - vpid_sync_context(vmx); 4778 + vpid_sync_context(vmx->vpid); 4943 4779 } 4944 4780 4945 4781 /* ··· 5458 5296 u8 cr8 = (u8)val; 5459 5297 err = kvm_set_cr8(vcpu, cr8); 5460 5298 kvm_complete_insn_gp(vcpu, err); 5461 - if (irqchip_in_kernel(vcpu->kvm)) 5299 + if (lapic_in_kernel(vcpu)) 5462 5300 return 1; 5463 5301 if (cr8_prev <= cr8) 5464 5302 return 1; ··· 5672 5510 kvm_make_request(KVM_REQ_EVENT, vcpu); 5673 5511 5674 5512 ++vcpu->stat.irq_window_exits; 5675 - 5676 - /* 5677 - * If the user space waits to inject interrupts, exit as soon as 5678 - * possible 5679 - */ 5680 - if (!irqchip_in_kernel(vcpu->kvm) && 5681 - vcpu->run->request_interrupt_window && 5682 - !kvm_cpu_has_interrupt(vcpu)) { 5683 - vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN; 5684 - return 0; 5685 - } 5686 5513 return 1; 5687 5514 } 5688 5515 ··· 5904 5753 gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); 5905 5754 if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { 5906 5755 skip_emulated_instruction(vcpu); 5756 + trace_kvm_fast_mmio(gpa); 5907 5757 return 1; 5908 5758 } 5909 5759 ··· 6060 5908 ple_window_actual_max = 6061 5909 __shrink_ple_window(max(ple_window_max, ple_window), 6062 5910 ple_window_grow, INT_MIN); 5911 + } 5912 + 5913 + /* 5914 + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR. 
5915 + */ 5916 + static void wakeup_handler(void) 5917 + { 5918 + struct kvm_vcpu *vcpu; 5919 + int cpu = smp_processor_id(); 5920 + 5921 + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); 5922 + list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu), 5923 + blocked_vcpu_list) { 5924 + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 5925 + 5926 + if (pi_test_on(pi_desc) == 1) 5927 + kvm_vcpu_kick(vcpu); 5928 + } 5929 + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); 6063 5930 } 6064 5931 6065 5932 static __init int hardware_setup(void) ··· 6266 6095 kvm_x86_ops->flush_log_dirty = NULL; 6267 6096 kvm_x86_ops->enable_log_dirty_pt_masked = NULL; 6268 6097 } 6098 + 6099 + kvm_set_posted_intr_wakeup_handler(wakeup_handler); 6269 6100 6270 6101 return alloc_kvm_area(); 6271 6102 ··· 6800 6627 6801 6628 static inline void nested_release_vmcs12(struct vcpu_vmx *vmx) 6802 6629 { 6803 - u32 exec_control; 6804 6630 if (vmx->nested.current_vmptr == -1ull) 6805 6631 return; 6806 6632 ··· 6812 6640 they were modified */ 6813 6641 copy_shadow_to_vmcs12(vmx); 6814 6642 vmx->nested.sync_shadow_vmcs = false; 6815 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 6816 - exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS; 6817 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 6643 + vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, 6644 + SECONDARY_EXEC_SHADOW_VMCS); 6818 6645 vmcs_write64(VMCS_LINK_POINTER, -1ull); 6819 6646 } 6820 6647 vmx->nested.posted_intr_nv = -1; ··· 6833 6662 return; 6834 6663 6835 6664 vmx->nested.vmxon = false; 6665 + free_vpid(vmx->nested.vpid02); 6836 6666 nested_release_vmcs12(vmx); 6837 6667 if (enable_shadow_vmcs) 6838 6668 free_vmcs(vmx->nested.current_shadow_vmcs); ··· 7210 7038 { 7211 7039 struct vcpu_vmx *vmx = to_vmx(vcpu); 7212 7040 gpa_t vmptr; 7213 - u32 exec_control; 7214 7041 7215 7042 if (!nested_vmx_check_permission(vcpu)) 7216 7043 return 1; ··· 7241 7070 vmx->nested.current_vmcs12 = new_vmcs12; 7242 7071 
vmx->nested.current_vmcs12_page = page; 7243 7072 if (enable_shadow_vmcs) { 7244 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 7245 - exec_control |= SECONDARY_EXEC_SHADOW_VMCS; 7246 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 7073 + vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, 7074 + SECONDARY_EXEC_SHADOW_VMCS); 7247 7075 vmcs_write64(VMCS_LINK_POINTER, 7248 7076 __pa(vmx->nested.current_shadow_vmcs)); 7249 7077 vmx->nested.sync_shadow_vmcs = true; ··· 7348 7178 7349 7179 static int handle_invvpid(struct kvm_vcpu *vcpu) 7350 7180 { 7351 - kvm_queue_exception(vcpu, UD_VECTOR); 7181 + struct vcpu_vmx *vmx = to_vmx(vcpu); 7182 + u32 vmx_instruction_info; 7183 + unsigned long type, types; 7184 + gva_t gva; 7185 + struct x86_exception e; 7186 + int vpid; 7187 + 7188 + if (!(vmx->nested.nested_vmx_secondary_ctls_high & 7189 + SECONDARY_EXEC_ENABLE_VPID) || 7190 + !(vmx->nested.nested_vmx_vpid_caps & VMX_VPID_INVVPID_BIT)) { 7191 + kvm_queue_exception(vcpu, UD_VECTOR); 7192 + return 1; 7193 + } 7194 + 7195 + if (!nested_vmx_check_permission(vcpu)) 7196 + return 1; 7197 + 7198 + vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); 7199 + type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf); 7200 + 7201 + types = (vmx->nested.nested_vmx_vpid_caps >> 8) & 0x7; 7202 + 7203 + if (!(types & (1UL << type))) { 7204 + nested_vmx_failValid(vcpu, 7205 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); 7206 + return 1; 7207 + } 7208 + 7209 + /* according to the intel vmx instruction reference, the memory 7210 + * operand is read even if it isn't needed (e.g., for type==global) 7211 + */ 7212 + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), 7213 + vmx_instruction_info, false, &gva)) 7214 + return 1; 7215 + if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vpid, 7216 + sizeof(u32), &e)) { 7217 + kvm_inject_page_fault(vcpu, &e); 7218 + return 1; 7219 + } 7220 + 7221 + switch (type) { 7222 + case VMX_VPID_EXTENT_ALL_CONTEXT: 7223 + 
if (get_vmcs12(vcpu)->virtual_processor_id == 0) { 7224 + nested_vmx_failValid(vcpu, 7225 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); 7226 + return 1; 7227 + } 7228 + __vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02); 7229 + nested_vmx_succeed(vcpu); 7230 + break; 7231 + default: 7232 + /* Trap single context invalidation invvpid calls */ 7233 + BUG_ON(1); 7234 + break; 7235 + } 7236 + 7237 + skip_emulated_instruction(vcpu); 7352 7238 return 1; 7353 7239 } 7354 7240 ··· 7430 7204 * PML buffer already flushed at beginning of VMEXIT. Nothing to do 7431 7205 * here.., and there's no userspace involvement needed for PML. 7432 7206 */ 7207 + return 1; 7208 + } 7209 + 7210 + static int handle_pcommit(struct kvm_vcpu *vcpu) 7211 + { 7212 + /* we never catch pcommit instruct for L1 guest. */ 7213 + WARN_ON(1); 7433 7214 return 1; 7434 7215 } 7435 7216 ··· 7490 7257 [EXIT_REASON_XSAVES] = handle_xsaves, 7491 7258 [EXIT_REASON_XRSTORS] = handle_xrstors, 7492 7259 [EXIT_REASON_PML_FULL] = handle_pml_full, 7260 + [EXIT_REASON_PCOMMIT] = handle_pcommit, 7493 7261 }; 7494 7262 7495 7263 static const int kvm_vmx_max_exit_handlers = ··· 7792 7558 * the XSS exit bitmap in vmcs12. 
7793 7559 */ 7794 7560 return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES); 7561 + case EXIT_REASON_PCOMMIT: 7562 + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_PCOMMIT); 7795 7563 default: 7796 7564 return true; 7797 7565 } ··· 7805 7569 *info2 = vmcs_read32(VM_EXIT_INTR_INFO); 7806 7570 } 7807 7571 7808 - static int vmx_enable_pml(struct vcpu_vmx *vmx) 7572 + static int vmx_create_pml_buffer(struct vcpu_vmx *vmx) 7809 7573 { 7810 7574 struct page *pml_pg; 7811 - u32 exec_control; 7812 7575 7813 7576 pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO); 7814 7577 if (!pml_pg) ··· 7818 7583 vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg)); 7819 7584 vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1); 7820 7585 7821 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 7822 - exec_control |= SECONDARY_EXEC_ENABLE_PML; 7823 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 7824 - 7825 7586 return 0; 7826 7587 } 7827 7588 7828 - static void vmx_disable_pml(struct vcpu_vmx *vmx) 7589 + static void vmx_destroy_pml_buffer(struct vcpu_vmx *vmx) 7829 7590 { 7830 - u32 exec_control; 7831 - 7832 - ASSERT(vmx->pml_pg); 7833 - __free_page(vmx->pml_pg); 7834 - vmx->pml_pg = NULL; 7835 - 7836 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 7837 - exec_control &= ~SECONDARY_EXEC_ENABLE_PML; 7838 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 7591 + if (vmx->pml_pg) { 7592 + __free_page(vmx->pml_pg); 7593 + vmx->pml_pg = NULL; 7594 + } 7839 7595 } 7840 7596 7841 7597 static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu) ··· 8150 7924 * apicv 8151 7925 */ 8152 7926 if (!cpu_has_vmx_virtualize_x2apic_mode() || 8153 - !vmx_vm_has_apicv(vcpu->kvm)) 7927 + !vmx_cpu_uses_apicv(vcpu)) 8154 7928 return; 8155 7929 8156 - if (!vm_need_tpr_shadow(vcpu->kvm)) 7930 + if (!cpu_need_tpr_shadow(vcpu)) 8157 7931 return; 8158 7932 8159 7933 sec_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); ··· 8255 8029 } 8256 8030 } 8257 8031 8258 - static void 
vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) 8032 + static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu) 8259 8033 { 8260 - if (!vmx_vm_has_apicv(vcpu->kvm)) 8034 + u64 *eoi_exit_bitmap = vcpu->arch.eoi_exit_bitmap; 8035 + if (!vmx_cpu_uses_apicv(vcpu)) 8261 8036 return; 8262 8037 8263 8038 vmcs_write64(EOI_EXIT_BITMAP0, eoi_exit_bitmap[0]); ··· 8704 8477 struct vcpu_vmx *vmx = to_vmx(vcpu); 8705 8478 8706 8479 if (enable_pml) 8707 - vmx_disable_pml(vmx); 8708 - free_vpid(vmx); 8480 + vmx_destroy_pml_buffer(vmx); 8481 + free_vpid(vmx->vpid); 8709 8482 leave_guest_mode(vcpu); 8710 8483 vmx_load_vmcs01(vcpu); 8711 8484 free_nested(vmx); ··· 8724 8497 if (!vmx) 8725 8498 return ERR_PTR(-ENOMEM); 8726 8499 8727 - allocate_vpid(vmx); 8500 + vmx->vpid = allocate_vpid(); 8728 8501 8729 8502 err = kvm_vcpu_init(&vmx->vcpu, kvm, id); 8730 8503 if (err) ··· 8757 8530 put_cpu(); 8758 8531 if (err) 8759 8532 goto free_vmcs; 8760 - if (vm_need_virtualize_apic_accesses(kvm)) { 8533 + if (cpu_need_virtualize_apic_accesses(&vmx->vcpu)) { 8761 8534 err = alloc_apic_access_page(kvm); 8762 8535 if (err) 8763 8536 goto free_vmcs; ··· 8772 8545 goto free_vmcs; 8773 8546 } 8774 8547 8775 - if (nested) 8548 + if (nested) { 8776 8549 nested_vmx_setup_ctls_msrs(vmx); 8550 + vmx->nested.vpid02 = allocate_vpid(); 8551 + } 8777 8552 8778 8553 vmx->nested.posted_intr_nv = -1; 8779 8554 vmx->nested.current_vmptr = -1ull; ··· 8788 8559 * for the guest, etc. 
8789 8560 */ 8790 8561 if (enable_pml) { 8791 - err = vmx_enable_pml(vmx); 8562 + err = vmx_create_pml_buffer(vmx); 8792 8563 if (err) 8793 8564 goto free_vmcs; 8794 8565 } ··· 8796 8567 return &vmx->vcpu; 8797 8568 8798 8569 free_vmcs: 8570 + free_vpid(vmx->nested.vpid02); 8799 8571 free_loaded_vmcs(vmx->loaded_vmcs); 8800 8572 free_msrs: 8801 8573 kfree(vmx->guest_msrs); 8802 8574 uninit_vcpu: 8803 8575 kvm_vcpu_uninit(&vmx->vcpu); 8804 8576 free_vcpu: 8805 - free_vpid(vmx); 8577 + free_vpid(vmx->vpid); 8806 8578 kmem_cache_free(kvm_vcpu_cache, vmx); 8807 8579 return ERR_PTR(err); 8808 8580 } ··· 8878 8648 return PT_PDPE_LEVEL; 8879 8649 } 8880 8650 8651 + static void vmcs_set_secondary_exec_control(u32 new_ctl) 8652 + { 8653 + /* 8654 + * These bits in the secondary execution controls field 8655 + * are dynamic, the others are mostly based on the hypervisor 8656 + * architecture and the guest's CPUID. Do not touch the 8657 + * dynamic bits. 8658 + */ 8659 + u32 mask = 8660 + SECONDARY_EXEC_SHADOW_VMCS | 8661 + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | 8662 + SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; 8663 + 8664 + u32 cur_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 8665 + 8666 + vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 8667 + (new_ctl & ~mask) | (cur_ctl & mask)); 8668 + } 8669 + 8881 8670 static void vmx_cpuid_update(struct kvm_vcpu *vcpu) 8882 8671 { 8883 8672 struct kvm_cpuid_entry2 *best; 8884 8673 struct vcpu_vmx *vmx = to_vmx(vcpu); 8885 - u32 exec_control; 8674 + u32 secondary_exec_ctl = vmx_secondary_exec_control(vmx); 8886 8675 8887 - vmx->rdtscp_enabled = false; 8888 8676 if (vmx_rdtscp_supported()) { 8889 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 8890 - if (exec_control & SECONDARY_EXEC_RDTSCP) { 8891 - best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0); 8892 - if (best && (best->edx & bit(X86_FEATURE_RDTSCP))) 8893 - vmx->rdtscp_enabled = true; 8894 - else { 8895 - exec_control &= ~SECONDARY_EXEC_RDTSCP; 8896 - 
vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 8897 - exec_control); 8898 - } 8677 + bool rdtscp_enabled = guest_cpuid_has_rdtscp(vcpu); 8678 + if (!rdtscp_enabled) 8679 + secondary_exec_ctl &= ~SECONDARY_EXEC_RDTSCP; 8680 + 8681 + if (nested) { 8682 + if (rdtscp_enabled) 8683 + vmx->nested.nested_vmx_secondary_ctls_high |= 8684 + SECONDARY_EXEC_RDTSCP; 8685 + else 8686 + vmx->nested.nested_vmx_secondary_ctls_high &= 8687 + ~SECONDARY_EXEC_RDTSCP; 8899 8688 } 8900 - if (nested && !vmx->rdtscp_enabled) 8901 - vmx->nested.nested_vmx_secondary_ctls_high &= 8902 - ~SECONDARY_EXEC_RDTSCP; 8903 8689 } 8904 8690 8905 8691 /* Exposing INVPCID only when PCID is exposed */ 8906 8692 best = kvm_find_cpuid_entry(vcpu, 0x7, 0); 8907 8693 if (vmx_invpcid_supported() && 8908 - best && (best->ebx & bit(X86_FEATURE_INVPCID)) && 8909 - guest_cpuid_has_pcid(vcpu)) { 8910 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 8911 - exec_control |= SECONDARY_EXEC_ENABLE_INVPCID; 8912 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 8913 - exec_control); 8914 - } else { 8915 - if (cpu_has_secondary_exec_ctrls()) { 8916 - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 8917 - exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID; 8918 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 8919 - exec_control); 8920 - } 8694 + (!best || !(best->ebx & bit(X86_FEATURE_INVPCID)) || 8695 + !guest_cpuid_has_pcid(vcpu))) { 8696 + secondary_exec_ctl &= ~SECONDARY_EXEC_ENABLE_INVPCID; 8697 + 8921 8698 if (best) 8922 8699 best->ebx &= ~bit(X86_FEATURE_INVPCID); 8700 + } 8701 + 8702 + vmcs_set_secondary_exec_control(secondary_exec_ctl); 8703 + 8704 + if (static_cpu_has(X86_FEATURE_PCOMMIT) && nested) { 8705 + if (guest_cpuid_has_pcommit(vcpu)) 8706 + vmx->nested.nested_vmx_secondary_ctls_high |= 8707 + SECONDARY_EXEC_PCOMMIT; 8708 + else 8709 + vmx->nested.nested_vmx_secondary_ctls_high &= 8710 + ~SECONDARY_EXEC_PCOMMIT; 8923 8711 } 8924 8712 } 8925 8713 ··· 9546 9298 9547 9299 if (cpu_has_secondary_exec_ctrls()) { 9548 
9300 exec_control = vmx_secondary_exec_control(vmx); 9549 - if (!vmx->rdtscp_enabled) 9550 - exec_control &= ~SECONDARY_EXEC_RDTSCP; 9301 + 9551 9302 /* Take the following fields only from vmcs12 */ 9552 9303 exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 9553 9304 SECONDARY_EXEC_RDTSCP | 9554 9305 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 9555 - SECONDARY_EXEC_APIC_REGISTER_VIRT); 9306 + SECONDARY_EXEC_APIC_REGISTER_VIRT | 9307 + SECONDARY_EXEC_PCOMMIT); 9556 9308 if (nested_cpu_has(vmcs12, 9557 9309 CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) 9558 9310 exec_control |= vmcs12->secondary_vm_exec_control; ··· 9571 9323 vmcs_write64(APIC_ACCESS_ADDR, 9572 9324 page_to_phys(vmx->nested.apic_access_page)); 9573 9325 } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) && 9574 - (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) { 9326 + cpu_need_virtualize_apic_accesses(&vmx->vcpu)) { 9575 9327 exec_control |= 9576 9328 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; 9577 9329 kvm_vcpu_reload_apic_access_page(vcpu); ··· 9681 9433 9682 9434 if (enable_vpid) { 9683 9435 /* 9684 - * Trivially support vpid by letting L2s share their parent 9685 - * L1's vpid. TODO: move to a more elaborate solution, giving 9686 - * each L2 its own vpid and exposing the vpid feature to L1. 9436 + * There is no direct mapping between vpid02 and vpid12, the 9437 + * vpid02 is per-vCPU for L0 and reused while the value of 9438 + * vpid12 is changed w/ one invvpid during nested vmentry. 9439 + * The vpid12 is allocated by L1 for L2, so it will not 9440 + * influence global bitmap(for vpid01 and vpid02 allocation) 9441 + * even if spawn a lot of nested vCPUs. 
9687 9442 */ 9688 - vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid); 9689 - vmx_flush_tlb(vcpu); 9443 + if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02) { 9444 + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02); 9445 + if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) { 9446 + vmx->nested.last_vpid = vmcs12->virtual_processor_id; 9447 + __vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02); 9448 + } 9449 + } else { 9450 + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid); 9451 + vmx_flush_tlb(vcpu); 9452 + } 9453 + 9690 9454 } 9691 9455 9692 9456 if (nested_cpu_has_ept(vmcs12)) { ··· 10538 10278 kvm_mmu_clear_dirty_pt_masked(kvm, memslot, offset, mask); 10539 10279 } 10540 10280 10281 + /* 10282 + * This routine does the following things for vCPU which is going 10283 + * to be blocked if VT-d PI is enabled. 10284 + * - Store the vCPU to the wakeup list, so when interrupts happen 10285 + * we can find the right vCPU to wake up. 10286 + * - Change the Posted-interrupt descriptor as below: 10287 + * 'NDST' <-- vcpu->pre_pcpu 10288 + * 'NV' <-- POSTED_INTR_WAKEUP_VECTOR 10289 + * - If 'ON' is set during this process, which means at least one 10290 + * interrupt is posted for this vCPU, we cannot block it, in 10291 + * this case, return 1, otherwise, return 0. 
10292 + * 10293 + */ 10294 + static int vmx_pre_block(struct kvm_vcpu *vcpu) 10295 + { 10296 + unsigned long flags; 10297 + unsigned int dest; 10298 + struct pi_desc old, new; 10299 + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 10300 + 10301 + if (!kvm_arch_has_assigned_device(vcpu->kvm) || 10302 + !irq_remapping_cap(IRQ_POSTING_CAP)) 10303 + return 0; 10304 + 10305 + vcpu->pre_pcpu = vcpu->cpu; 10306 + spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock, 10307 + vcpu->pre_pcpu), flags); 10308 + list_add_tail(&vcpu->blocked_vcpu_list, 10309 + &per_cpu(blocked_vcpu_on_cpu, 10310 + vcpu->pre_pcpu)); 10311 + spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock, 10312 + vcpu->pre_pcpu), flags); 10313 + 10314 + do { 10315 + old.control = new.control = pi_desc->control; 10316 + 10317 + /* 10318 + * We should not block the vCPU if 10319 + * an interrupt is posted for it. 10320 + */ 10321 + if (pi_test_on(pi_desc) == 1) { 10322 + spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock, 10323 + vcpu->pre_pcpu), flags); 10324 + list_del(&vcpu->blocked_vcpu_list); 10325 + spin_unlock_irqrestore( 10326 + &per_cpu(blocked_vcpu_on_cpu_lock, 10327 + vcpu->pre_pcpu), flags); 10328 + vcpu->pre_pcpu = -1; 10329 + 10330 + return 1; 10331 + } 10332 + 10333 + WARN((pi_desc->sn == 1), 10334 + "Warning: SN field of posted-interrupts " 10335 + "is set before blocking\n"); 10336 + 10337 + /* 10338 + * Since vCPU can be preempted during this process, 10339 + * vcpu->cpu could be different with pre_pcpu, we 10340 + * need to set pre_pcpu as the destination of wakeup 10341 + * notification event, then we can find the right vCPU 10342 + * to wakeup in wakeup handler if interrupts happen 10343 + * when the vCPU is in blocked state. 
10344 + */ 10345 + dest = cpu_physical_id(vcpu->pre_pcpu); 10346 + 10347 + if (x2apic_enabled()) 10348 + new.ndst = dest; 10349 + else 10350 + new.ndst = (dest << 8) & 0xFF00; 10351 + 10352 + /* set 'NV' to 'wakeup vector' */ 10353 + new.nv = POSTED_INTR_WAKEUP_VECTOR; 10354 + } while (cmpxchg(&pi_desc->control, old.control, 10355 + new.control) != old.control); 10356 + 10357 + return 0; 10358 + } 10359 + 10360 + static void vmx_post_block(struct kvm_vcpu *vcpu) 10361 + { 10362 + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 10363 + struct pi_desc old, new; 10364 + unsigned int dest; 10365 + unsigned long flags; 10366 + 10367 + if (!kvm_arch_has_assigned_device(vcpu->kvm) || 10368 + !irq_remapping_cap(IRQ_POSTING_CAP)) 10369 + return; 10370 + 10371 + do { 10372 + old.control = new.control = pi_desc->control; 10373 + 10374 + dest = cpu_physical_id(vcpu->cpu); 10375 + 10376 + if (x2apic_enabled()) 10377 + new.ndst = dest; 10378 + else 10379 + new.ndst = (dest << 8) & 0xFF00; 10380 + 10381 + /* Allow posting non-urgent interrupts */ 10382 + new.sn = 0; 10383 + 10384 + /* set 'NV' to 'notification vector' */ 10385 + new.nv = POSTED_INTR_VECTOR; 10386 + } while (cmpxchg(&pi_desc->control, old.control, 10387 + new.control) != old.control); 10388 + 10389 + if(vcpu->pre_pcpu != -1) { 10390 + spin_lock_irqsave( 10391 + &per_cpu(blocked_vcpu_on_cpu_lock, 10392 + vcpu->pre_pcpu), flags); 10393 + list_del(&vcpu->blocked_vcpu_list); 10394 + spin_unlock_irqrestore( 10395 + &per_cpu(blocked_vcpu_on_cpu_lock, 10396 + vcpu->pre_pcpu), flags); 10397 + vcpu->pre_pcpu = -1; 10398 + } 10399 + } 10400 + 10401 + /* 10402 + * vmx_update_pi_irte - set IRTE for Posted-Interrupts 10403 + * 10404 + * @kvm: kvm 10405 + * @host_irq: host irq of the interrupt 10406 + * @guest_irq: gsi of the interrupt 10407 + * @set: set or unset PI 10408 + * returns 0 on success, < 0 on failure 10409 + */ 10410 + static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq, 10411 + uint32_t 
guest_irq, bool set) 10412 + { 10413 + struct kvm_kernel_irq_routing_entry *e; 10414 + struct kvm_irq_routing_table *irq_rt; 10415 + struct kvm_lapic_irq irq; 10416 + struct kvm_vcpu *vcpu; 10417 + struct vcpu_data vcpu_info; 10418 + int idx, ret = -EINVAL; 10419 + 10420 + if (!kvm_arch_has_assigned_device(kvm) || 10421 + !irq_remapping_cap(IRQ_POSTING_CAP)) 10422 + return 0; 10423 + 10424 + idx = srcu_read_lock(&kvm->irq_srcu); 10425 + irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); 10426 + BUG_ON(guest_irq >= irq_rt->nr_rt_entries); 10427 + 10428 + hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) { 10429 + if (e->type != KVM_IRQ_ROUTING_MSI) 10430 + continue; 10431 + /* 10432 + * VT-d PI cannot support posting multicast/broadcast 10433 + * interrupts to a vCPU, we still use interrupt remapping 10434 + * for these kind of interrupts. 10435 + * 10436 + * For lowest-priority interrupts, we only support 10437 + * those with single CPU as the destination, e.g. user 10438 + * configures the interrupts via /proc/irq or uses 10439 + * irqbalance to make the interrupts single-CPU. 10440 + * 10441 + * We will support full lowest-priority interrupt later. 
10442 + */ 10443 + 10444 + kvm_set_msi_irq(e, &irq); 10445 + if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) 10446 + continue; 10447 + 10448 + vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)); 10449 + vcpu_info.vector = irq.vector; 10450 + 10451 + trace_kvm_pi_irte_update(vcpu->vcpu_id, e->gsi, 10452 + vcpu_info.vector, vcpu_info.pi_desc_addr, set); 10453 + 10454 + if (set) 10455 + ret = irq_set_vcpu_affinity(host_irq, &vcpu_info); 10456 + else { 10457 + /* suppress notification event before unposting */ 10458 + pi_set_sn(vcpu_to_pi_desc(vcpu)); 10459 + ret = irq_set_vcpu_affinity(host_irq, NULL); 10460 + pi_clear_sn(vcpu_to_pi_desc(vcpu)); 10461 + } 10462 + 10463 + if (ret < 0) { 10464 + printk(KERN_INFO "%s: failed to update PI IRTE\n", 10465 + __func__); 10466 + goto out; 10467 + } 10468 + } 10469 + 10470 + ret = 0; 10471 + out: 10472 + srcu_read_unlock(&kvm->irq_srcu, idx); 10473 + return ret; 10474 + } 10475 + 10541 10476 static struct kvm_x86_ops vmx_x86_ops = { 10542 10477 .cpu_has_kvm_support = cpu_has_kvm_support, 10543 10478 .disabled_by_bios = vmx_disabled_by_bios, ··· 10802 10347 .update_cr8_intercept = update_cr8_intercept, 10803 10348 .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode, 10804 10349 .set_apic_access_page_addr = vmx_set_apic_access_page_addr, 10805 - .vm_has_apicv = vmx_vm_has_apicv, 10350 + .cpu_uses_apicv = vmx_cpu_uses_apicv, 10806 10351 .load_eoi_exitmap = vmx_load_eoi_exitmap, 10807 10352 .hwapic_irr_update = vmx_hwapic_irr_update, 10808 10353 .hwapic_isr_update = vmx_hwapic_isr_update, ··· 10849 10394 .flush_log_dirty = vmx_flush_log_dirty, 10850 10395 .enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked, 10851 10396 10397 + .pre_block = vmx_pre_block, 10398 + .post_block = vmx_post_block, 10399 + 10852 10400 .pmu_ops = &intel_pmu_ops, 10401 + 10402 + .update_pi_irte = vmx_update_pi_irte, 10853 10403 }; 10854 10404 10855 10405 static int __init vmx_init(void)
+197 -59
arch/x86/kvm/x86.c
··· 51 51 #include <linux/pci.h> 52 52 #include <linux/timekeeper_internal.h> 53 53 #include <linux/pvclock_gtod.h> 54 + #include <linux/kvm_irqfd.h> 55 + #include <linux/irqbypass.h> 54 56 #include <trace/events/kvm.h> 55 57 56 58 #define CREATE_TRACE_POINTS ··· 66 64 #include <asm/fpu/internal.h> /* Ugh! */ 67 65 #include <asm/pvclock.h> 68 66 #include <asm/div64.h> 67 + #include <asm/irq_remapping.h> 69 68 70 69 #define MAX_IO_MSRS 256 71 70 #define KVM_MAX_MCE_BANKS 32 ··· 625 622 if ((cr0 ^ old_cr0) & update_bits) 626 623 kvm_mmu_reset_context(vcpu); 627 624 628 - if ((cr0 ^ old_cr0) & X86_CR0_CD) 625 + if (((cr0 ^ old_cr0) & X86_CR0_CD) && 626 + kvm_arch_has_noncoherent_dma(vcpu->kvm) && 627 + !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) 629 628 kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL); 630 629 631 630 return 0; ··· 794 789 { 795 790 if (cr8 & CR8_RESERVED_BITS) 796 791 return 1; 797 - if (irqchip_in_kernel(vcpu->kvm)) 792 + if (lapic_in_kernel(vcpu)) 798 793 kvm_lapic_set_tpr(vcpu, cr8); 799 794 else 800 795 vcpu->arch.cr8 = cr8; ··· 804 799 805 800 unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu) 806 801 { 807 - if (irqchip_in_kernel(vcpu->kvm)) 802 + if (lapic_in_kernel(vcpu)) 808 803 return kvm_lapic_get_cr8(vcpu); 809 804 else 810 805 return vcpu->arch.cr8; ··· 958 953 HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC, 959 954 HV_X64_MSR_CRASH_P0, HV_X64_MSR_CRASH_P1, HV_X64_MSR_CRASH_P2, 960 955 HV_X64_MSR_CRASH_P3, HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL, 956 + HV_X64_MSR_RESET, 957 + HV_X64_MSR_VP_INDEX, 958 + HV_X64_MSR_VP_RUNTIME, 961 959 HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, 962 960 MSR_KVM_PV_EOI_EN, 963 961 ··· 1906 1898 1907 1899 static void record_steal_time(struct kvm_vcpu *vcpu) 1908 1900 { 1901 + accumulate_steal_time(vcpu); 1902 + 1909 1903 if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) 1910 1904 return; 1911 1905 ··· 2057 2047 2058 2048 if (!(data & KVM_MSR_ENABLED)) 2059 2049 break; 2060 
- 2061 - vcpu->arch.st.last_steal = current->sched_info.run_delay; 2062 - 2063 - preempt_disable(); 2064 - accumulate_steal_time(vcpu); 2065 - preempt_enable(); 2066 2050 2067 2051 kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); 2068 2052 ··· 2453 2449 case KVM_CAP_ENABLE_CAP_VM: 2454 2450 case KVM_CAP_DISABLE_QUIRKS: 2455 2451 case KVM_CAP_SET_BOOT_CPU_ID: 2452 + case KVM_CAP_SPLIT_IRQCHIP: 2456 2453 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT 2457 2454 case KVM_CAP_ASSIGN_DEV_IRQ: 2458 2455 case KVM_CAP_PCI_2_3: ··· 2633 2628 vcpu->cpu = cpu; 2634 2629 } 2635 2630 2636 - accumulate_steal_time(vcpu); 2637 2631 kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); 2638 2632 } 2639 2633 ··· 2666 2662 { 2667 2663 if (irq->irq >= KVM_NR_INTERRUPTS) 2668 2664 return -EINVAL; 2669 - if (irqchip_in_kernel(vcpu->kvm)) 2665 + 2666 + if (!irqchip_in_kernel(vcpu->kvm)) { 2667 + kvm_queue_interrupt(vcpu, irq->irq, false); 2668 + kvm_make_request(KVM_REQ_EVENT, vcpu); 2669 + return 0; 2670 + } 2671 + 2672 + /* 2673 + * With in-kernel LAPIC, we only use this to inject EXTINT, so 2674 + * fail for in-kernel 8259. 
2675 + */ 2676 + if (pic_in_kernel(vcpu->kvm)) 2670 2677 return -ENXIO; 2671 2678 2672 - kvm_queue_interrupt(vcpu, irq->irq, false); 2673 - kvm_make_request(KVM_REQ_EVENT, vcpu); 2679 + if (vcpu->arch.pending_external_vector != -1) 2680 + return -EEXIST; 2674 2681 2682 + vcpu->arch.pending_external_vector = irq->irq; 2675 2683 return 0; 2676 2684 } 2677 2685 ··· 3192 3176 struct kvm_vapic_addr va; 3193 3177 3194 3178 r = -EINVAL; 3195 - if (!irqchip_in_kernel(vcpu->kvm)) 3179 + if (!lapic_in_kernel(vcpu)) 3196 3180 goto out; 3197 3181 r = -EFAULT; 3198 3182 if (copy_from_user(&va, argp, sizeof va)) ··· 3441 3425 3442 3426 static int kvm_vm_ioctl_get_pit(struct kvm *kvm, struct kvm_pit_state *ps) 3443 3427 { 3444 - int r = 0; 3445 - 3446 3428 mutex_lock(&kvm->arch.vpit->pit_state.lock); 3447 3429 memcpy(ps, &kvm->arch.vpit->pit_state, sizeof(struct kvm_pit_state)); 3448 3430 mutex_unlock(&kvm->arch.vpit->pit_state.lock); 3449 - return r; 3431 + return 0; 3450 3432 } 3451 3433 3452 3434 static int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps) 3453 3435 { 3454 - int r = 0; 3455 - 3456 3436 mutex_lock(&kvm->arch.vpit->pit_state.lock); 3457 3437 memcpy(&kvm->arch.vpit->pit_state, ps, sizeof(struct kvm_pit_state)); 3458 3438 kvm_pit_load_count(kvm, 0, ps->channels[0].count, 0); 3459 3439 mutex_unlock(&kvm->arch.vpit->pit_state.lock); 3460 - return r; 3440 + return 0; 3461 3441 } 3462 3442 3463 3443 static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps) 3464 3444 { 3465 - int r = 0; 3466 - 3467 3445 mutex_lock(&kvm->arch.vpit->pit_state.lock); 3468 3446 memcpy(ps->channels, &kvm->arch.vpit->pit_state.channels, 3469 3447 sizeof(ps->channels)); 3470 3448 ps->flags = kvm->arch.vpit->pit_state.flags; 3471 3449 mutex_unlock(&kvm->arch.vpit->pit_state.lock); 3472 3450 memset(&ps->reserved, 0, sizeof(ps->reserved)); 3473 - return r; 3451 + return 0; 3474 3452 } 3475 3453 3476 3454 static int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct 
kvm_pit_state2 *ps) 3477 3455 { 3478 - int r = 0, start = 0; 3456 + int start = 0; 3479 3457 u32 prev_legacy, cur_legacy; 3480 3458 mutex_lock(&kvm->arch.vpit->pit_state.lock); 3481 3459 prev_legacy = kvm->arch.vpit->pit_state.flags & KVM_PIT_FLAGS_HPET_LEGACY; ··· 3481 3471 kvm->arch.vpit->pit_state.flags = ps->flags; 3482 3472 kvm_pit_load_count(kvm, 0, kvm->arch.vpit->pit_state.channels[0].count, start); 3483 3473 mutex_unlock(&kvm->arch.vpit->pit_state.lock); 3484 - return r; 3474 + return 0; 3485 3475 } 3486 3476 3487 3477 static int kvm_vm_ioctl_reinject(struct kvm *kvm, ··· 3566 3556 kvm->arch.disabled_quirks = cap->args[0]; 3567 3557 r = 0; 3568 3558 break; 3559 + case KVM_CAP_SPLIT_IRQCHIP: { 3560 + mutex_lock(&kvm->lock); 3561 + r = -EINVAL; 3562 + if (cap->args[0] > MAX_NR_RESERVED_IOAPIC_PINS) 3563 + goto split_irqchip_unlock; 3564 + r = -EEXIST; 3565 + if (irqchip_in_kernel(kvm)) 3566 + goto split_irqchip_unlock; 3567 + if (atomic_read(&kvm->online_vcpus)) 3568 + goto split_irqchip_unlock; 3569 + r = kvm_setup_empty_irq_routing(kvm); 3570 + if (r) 3571 + goto split_irqchip_unlock; 3572 + /* Pairs with irqchip_in_kernel. 
*/ 3573 + smp_wmb(); 3574 + kvm->arch.irqchip_split = true; 3575 + kvm->arch.nr_reserved_ioapic_pins = cap->args[0]; 3576 + r = 0; 3577 + split_irqchip_unlock: 3578 + mutex_unlock(&kvm->lock); 3579 + break; 3580 + } 3569 3581 default: 3570 3582 r = -EINVAL; 3571 3583 break; ··· 3701 3669 } 3702 3670 3703 3671 r = -ENXIO; 3704 - if (!irqchip_in_kernel(kvm)) 3672 + if (!irqchip_in_kernel(kvm) || irqchip_split(kvm)) 3705 3673 goto get_irqchip_out; 3706 3674 r = kvm_vm_ioctl_get_irqchip(kvm, chip); 3707 3675 if (r) ··· 3725 3693 } 3726 3694 3727 3695 r = -ENXIO; 3728 - if (!irqchip_in_kernel(kvm)) 3696 + if (!irqchip_in_kernel(kvm) || irqchip_split(kvm)) 3729 3697 goto set_irqchip_out; 3730 3698 r = kvm_vm_ioctl_set_irqchip(kvm, chip); 3731 3699 if (r) ··· 4090 4058 { 4091 4059 struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 4092 4060 return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception); 4061 + } 4062 + 4063 + static int kvm_read_guest_phys_system(struct x86_emulate_ctxt *ctxt, 4064 + unsigned long addr, void *val, unsigned int bytes) 4065 + { 4066 + struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 4067 + int r = kvm_vcpu_read_guest(vcpu, addr, val, bytes); 4068 + 4069 + return r < 0 ? 
X86EMUL_IO_NEEDED : X86EMUL_CONTINUE; 4093 4070 } 4094 4071 4095 4072 int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt, ··· 4836 4795 .write_gpr = emulator_write_gpr, 4837 4796 .read_std = kvm_read_guest_virt_system, 4838 4797 .write_std = kvm_write_guest_virt_system, 4798 + .read_phys = kvm_read_guest_phys_system, 4839 4799 .fetch = kvm_fetch_guest_virt, 4840 4800 .read_emulated = emulator_read_emulated, 4841 4801 .write_emulated = emulator_write_emulated, ··· 5709 5667 int kvm_vcpu_halt(struct kvm_vcpu *vcpu) 5710 5668 { 5711 5669 ++vcpu->stat.halt_exits; 5712 - if (irqchip_in_kernel(vcpu->kvm)) { 5670 + if (lapic_in_kernel(vcpu)) { 5713 5671 vcpu->arch.mp_state = KVM_MP_STATE_HALTED; 5714 5672 return 1; 5715 5673 } else { ··· 5816 5774 */ 5817 5775 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu) 5818 5776 { 5819 - return (!irqchip_in_kernel(vcpu->kvm) && !kvm_cpu_has_interrupt(vcpu) && 5820 - vcpu->run->request_interrupt_window && 5821 - kvm_arch_interrupt_allowed(vcpu)); 5777 + if (!vcpu->run->request_interrupt_window || pic_in_kernel(vcpu->kvm)) 5778 + return false; 5779 + 5780 + if (kvm_cpu_has_interrupt(vcpu)) 5781 + return false; 5782 + 5783 + return (irqchip_split(vcpu->kvm) 5784 + ? kvm_apic_accept_pic_intr(vcpu) 5785 + : kvm_arch_interrupt_allowed(vcpu)); 5822 5786 } 5823 5787 5824 5788 static void post_kvm_run_save(struct kvm_vcpu *vcpu) ··· 5835 5787 kvm_run->flags = is_smm(vcpu) ? 
KVM_RUN_X86_SMM : 0; 5836 5788 kvm_run->cr8 = kvm_get_cr8(vcpu); 5837 5789 kvm_run->apic_base = kvm_get_apic_base(vcpu); 5838 - if (irqchip_in_kernel(vcpu->kvm)) 5839 - kvm_run->ready_for_interrupt_injection = 1; 5840 - else 5790 + if (!irqchip_in_kernel(vcpu->kvm)) 5841 5791 kvm_run->ready_for_interrupt_injection = 5842 5792 kvm_arch_interrupt_allowed(vcpu) && 5843 5793 !kvm_cpu_has_interrupt(vcpu) && 5844 5794 !kvm_event_needs_reinjection(vcpu); 5795 + else if (!pic_in_kernel(vcpu->kvm)) 5796 + kvm_run->ready_for_interrupt_injection = 5797 + kvm_apic_accept_pic_intr(vcpu) && 5798 + !kvm_cpu_has_interrupt(vcpu); 5799 + else 5800 + kvm_run->ready_for_interrupt_injection = 1; 5845 5801 } 5846 5802 5847 5803 static void update_cr8_intercept(struct kvm_vcpu *vcpu) ··· 6196 6144 6197 6145 static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) 6198 6146 { 6199 - u64 eoi_exit_bitmap[4]; 6200 - u32 tmr[8]; 6201 - 6202 6147 if (!kvm_apic_hw_enabled(vcpu->arch.apic)) 6203 6148 return; 6204 6149 6205 - memset(eoi_exit_bitmap, 0, 32); 6206 - memset(tmr, 0, 32); 6150 + memset(vcpu->arch.eoi_exit_bitmap, 0, 256 / 8); 6207 6151 6208 - kvm_ioapic_scan_entry(vcpu, eoi_exit_bitmap, tmr); 6209 - kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); 6210 - kvm_apic_update_tmr(vcpu, tmr); 6152 + if (irqchip_split(vcpu->kvm)) 6153 + kvm_scan_ioapic_routes(vcpu, vcpu->arch.eoi_exit_bitmap); 6154 + else { 6155 + kvm_x86_ops->sync_pir_to_irr(vcpu); 6156 + kvm_ioapic_scan_entry(vcpu, vcpu->arch.eoi_exit_bitmap); 6157 + } 6158 + kvm_x86_ops->load_eoi_exitmap(vcpu); 6211 6159 } 6212 6160 6213 6161 static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu) ··· 6220 6168 { 6221 6169 struct page *page = NULL; 6222 6170 6223 - if (!irqchip_in_kernel(vcpu->kvm)) 6171 + if (!lapic_in_kernel(vcpu)) 6224 6172 return; 6225 6173 6226 6174 if (!kvm_x86_ops->set_apic_access_page_addr) ··· 6258 6206 static int vcpu_enter_guest(struct kvm_vcpu *vcpu) 6259 6207 { 6260 6208 int r; 6261 - bool req_int_win = 
!irqchip_in_kernel(vcpu->kvm) && 6209 + bool req_int_win = !lapic_in_kernel(vcpu) && 6262 6210 vcpu->run->request_interrupt_window; 6263 6211 bool req_immediate_exit = false; 6264 6212 ··· 6310 6258 kvm_pmu_handle_event(vcpu); 6311 6259 if (kvm_check_request(KVM_REQ_PMI, vcpu)) 6312 6260 kvm_pmu_deliver_pmi(vcpu); 6261 + if (kvm_check_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu)) { 6262 + BUG_ON(vcpu->arch.pending_ioapic_eoi > 255); 6263 + if (test_bit(vcpu->arch.pending_ioapic_eoi, 6264 + (void *) vcpu->arch.eoi_exit_bitmap)) { 6265 + vcpu->run->exit_reason = KVM_EXIT_IOAPIC_EOI; 6266 + vcpu->run->eoi.vector = 6267 + vcpu->arch.pending_ioapic_eoi; 6268 + r = 0; 6269 + goto out; 6270 + } 6271 + } 6313 6272 if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu)) 6314 6273 vcpu_scan_ioapic(vcpu); 6315 6274 if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu)) ··· 6331 6268 r = 0; 6332 6269 goto out; 6333 6270 } 6271 + if (kvm_check_request(KVM_REQ_HV_RESET, vcpu)) { 6272 + vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT; 6273 + vcpu->run->system_event.type = KVM_SYSTEM_EVENT_RESET; 6274 + r = 0; 6275 + goto out; 6276 + } 6277 + } 6278 + 6279 + /* 6280 + * KVM_REQ_EVENT is not set when posted interrupts are set by 6281 + * VT-d hardware, so we have to update RVI unconditionally. 6282 + */ 6283 + if (kvm_lapic_enabled(vcpu)) { 6284 + /* 6285 + * Update architecture specific hints for APIC 6286 + * virtual interrupt delivery. 6287 + */ 6288 + if (kvm_x86_ops->hwapic_irr_update) 6289 + kvm_x86_ops->hwapic_irr_update(vcpu, 6290 + kvm_lapic_find_highest_irr(vcpu)); 6334 6291 } 6335 6292 6336 6293 if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { ··· 6369 6286 kvm_x86_ops->enable_irq_window(vcpu); 6370 6287 6371 6288 if (kvm_lapic_enabled(vcpu)) { 6372 - /* 6373 - * Update architecture specific hints for APIC 6374 - * virtual interrupt delivery. 
6375 - */ 6376 - if (kvm_x86_ops->hwapic_irr_update) 6377 - kvm_x86_ops->hwapic_irr_update(vcpu, 6378 - kvm_lapic_find_highest_irr(vcpu)); 6379 6289 update_cr8_intercept(vcpu); 6380 6290 kvm_lapic_sync_to_vapic(vcpu); 6381 6291 } ··· 6504 6428 6505 6429 static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu) 6506 6430 { 6507 - if (!kvm_arch_vcpu_runnable(vcpu)) { 6431 + if (!kvm_arch_vcpu_runnable(vcpu) && 6432 + (!kvm_x86_ops->pre_block || kvm_x86_ops->pre_block(vcpu) == 0)) { 6508 6433 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); 6509 6434 kvm_vcpu_block(vcpu); 6510 6435 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); 6436 + 6437 + if (kvm_x86_ops->post_block) 6438 + kvm_x86_ops->post_block(vcpu); 6439 + 6511 6440 if (!kvm_check_request(KVM_REQ_UNHALT, vcpu)) 6512 6441 return 1; 6513 6442 } ··· 6549 6468 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); 6550 6469 6551 6470 for (;;) { 6552 - if (kvm_vcpu_running(vcpu)) 6471 + if (kvm_vcpu_running(vcpu)) { 6553 6472 r = vcpu_enter_guest(vcpu); 6554 - else 6473 + } else { 6555 6474 r = vcpu_block(kvm, vcpu); 6475 + } 6476 + 6556 6477 if (r <= 0) 6557 6478 break; 6558 6479 ··· 6563 6480 kvm_inject_pending_timer_irqs(vcpu); 6564 6481 6565 6482 if (dm_request_for_irq_injection(vcpu)) { 6566 - r = -EINTR; 6567 - vcpu->run->exit_reason = KVM_EXIT_INTR; 6483 + r = 0; 6484 + vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN; 6568 6485 ++vcpu->stat.request_irq_exits; 6569 6486 break; 6570 6487 } ··· 6691 6608 } 6692 6609 6693 6610 /* re-sync apic's tpr */ 6694 - if (!irqchip_in_kernel(vcpu->kvm)) { 6611 + if (!lapic_in_kernel(vcpu)) { 6695 6612 if (kvm_set_cr8(vcpu, kvm_run->cr8) != 0) { 6696 6613 r = -EINVAL; 6697 6614 goto out; ··· 7391 7308 7392 7309 bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) 7393 7310 { 7394 - return irqchip_in_kernel(vcpu->kvm) == (vcpu->arch.apic != NULL); 7311 + return irqchip_in_kernel(vcpu->kvm) == lapic_in_kernel(vcpu); 7395 7312 } 7396 7313 7397 7314 struct static_key kvm_no_apic_vcpu 
__read_mostly; ··· 7460 7377 kvm_async_pf_hash_reset(vcpu); 7461 7378 kvm_pmu_init(vcpu); 7462 7379 7380 + vcpu->arch.pending_external_vector = -1; 7381 + 7463 7382 return 0; 7464 7383 7465 7384 fail_free_mce_banks: ··· 7487 7402 kvm_mmu_destroy(vcpu); 7488 7403 srcu_read_unlock(&vcpu->kvm->srcu, idx); 7489 7404 free_page((unsigned long)vcpu->arch.pio_data); 7490 - if (!irqchip_in_kernel(vcpu->kvm)) 7405 + if (!lapic_in_kernel(vcpu)) 7491 7406 static_key_slow_dec(&kvm_no_apic_vcpu); 7492 7407 } 7493 7408 ··· 8114 8029 } 8115 8030 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma); 8116 8031 8032 + int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons, 8033 + struct irq_bypass_producer *prod) 8034 + { 8035 + struct kvm_kernel_irqfd *irqfd = 8036 + container_of(cons, struct kvm_kernel_irqfd, consumer); 8037 + 8038 + if (kvm_x86_ops->update_pi_irte) { 8039 + irqfd->producer = prod; 8040 + return kvm_x86_ops->update_pi_irte(irqfd->kvm, 8041 + prod->irq, irqfd->gsi, 1); 8042 + } 8043 + 8044 + return -EINVAL; 8045 + } 8046 + 8047 + void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons, 8048 + struct irq_bypass_producer *prod) 8049 + { 8050 + int ret; 8051 + struct kvm_kernel_irqfd *irqfd = 8052 + container_of(cons, struct kvm_kernel_irqfd, consumer); 8053 + 8054 + if (!kvm_x86_ops->update_pi_irte) { 8055 + WARN_ON(irqfd->producer != NULL); 8056 + return; 8057 + } 8058 + 8059 + WARN_ON(irqfd->producer != prod); 8060 + irqfd->producer = NULL; 8061 + 8062 + /* 8063 + * When the producer or the consumer is unregistered, we change back to 8064 + * remapped mode, so we can re-use the current implementation 8065 + * when the irq is masked/disabled or the consumer side (KVM 8066 + * in this case) doesn't want to receive the interrupts. 
8067 + */ 8068 + ret = kvm_x86_ops->update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 0); 8069 + if (ret) 8070 + printk(KERN_INFO "irq bypass consumer (token %p) unregistration" 8071 + " fails: %d\n", irqfd->consumer.token, ret); 8072 + } 8073 + 8074 + int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq, 8075 + uint32_t guest_irq, bool set) 8076 + { 8077 + if (!kvm_x86_ops->update_pi_irte) 8078 + return -EINVAL; 8079 + 8080 + return kvm_x86_ops->update_pi_irte(kvm, host_irq, guest_irq, set); 8081 + } 8082 + 8117 8083 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); 8084 + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); 8118 8085 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); 8119 8086 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault); 8120 8087 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr); ··· 8181 8044 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset); 8182 8045 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window); 8183 8046 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full); 8047 + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
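The add/del producer hooks above recover the enclosing `struct kvm_kernel_irqfd` from the embedded `consumer` field with `container_of()`. A minimal user-space sketch of that pattern (the struct and function names here are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic the kernel macro performs: subtract the member's
 * offset from the member pointer to recover the enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct consumer {
	void *token;
};

/* Stand-in for struct kvm_kernel_irqfd embedding its consumer. */
struct irqfd {
	int gsi;
	struct consumer consumer;
};

static struct irqfd *irqfd_from_consumer(struct consumer *cons)
{
	return container_of(cons, struct irqfd, consumer);
}
```

This is why the bypass callbacks can take only the generic `irq_bypass_consumer` pointer yet still reach KVM-private state such as `irqfd->gsi`.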
-5
drivers/hv/hyperv_vmbus.h
··· 63 63 /* Define version of the synthetic interrupt controller. */ 64 64 #define HV_SYNIC_VERSION (1) 65 65 66 - /* Define the expected SynIC version. */ 67 - #define HV_SYNIC_VERSION_1 (0x1) 68 - 69 66 /* Define synthetic interrupt controller message constants. */ 70 67 #define HV_MESSAGE_SIZE (256) 71 68 #define HV_MESSAGE_PAYLOAD_BYTE_COUNT (240) ··· 102 105 HVMSG_X64_LEGACY_FP_ERROR = 0x80010005 103 106 }; 104 107 105 - /* Define the number of synthetic interrupt sources. */ 106 - #define HV_SYNIC_SINT_COUNT (16) 107 108 #define HV_SYNIC_STIMER_COUNT (4) 108 109 109 110 /* Define invalid partition identifier. */
+8 -4
drivers/iommu/irq_remapping.c
··· 22 22 int disable_sourceid_checking; 23 23 int no_x2apic_optout; 24 24 25 - int disable_irq_post = 1; 25 + int disable_irq_post = 0; 26 26 27 27 static int disable_irq_remap; 28 28 static struct irq_remap_ops *remap_ops; ··· 58 58 return -EINVAL; 59 59 60 60 while (*str) { 61 - if (!strncmp(str, "on", 2)) 61 + if (!strncmp(str, "on", 2)) { 62 62 disable_irq_remap = 0; 63 - else if (!strncmp(str, "off", 3)) 63 + disable_irq_post = 0; 64 + } else if (!strncmp(str, "off", 3)) { 64 65 disable_irq_remap = 1; 65 - else if (!strncmp(str, "nosid", 5)) 66 + disable_irq_post = 1; 67 + } else if (!strncmp(str, "nosid", 5)) 66 68 disable_sourceid_checking = 1; 67 69 else if (!strncmp(str, "no_x2apic_optout", 16)) 68 70 no_x2apic_optout = 1; 71 + else if (!strncmp(str, "nopost", 6)) 72 + disable_irq_post = 1; 69 73 70 74 str += strcspn(str, ","); 71 75 while (*str == ',')
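The irq_remapping.c hunk changes the command-line parser so that "on"/"off" now also toggle posted-interrupt support and a new "nopost" token disables posting alone. A hedged sketch of that parsing loop, with the flags gathered into a local struct for testability (the real code uses file-scope ints):

```c
#include <assert.h>
#include <string.h>

/* Illustrative flags mirroring the ones in irq_remapping.c. */
struct remap_opts {
	int disable_irq_remap;
	int disable_irq_post;
	int disable_sourceid_checking;
};

/* Walk a comma-separated option string the same way the kernel
 * setup function does: strncmp each token, then skip to the comma. */
static void parse_irqremap(const char *str, struct remap_opts *o)
{
	while (*str) {
		if (!strncmp(str, "on", 2)) {
			o->disable_irq_remap = 0;
			o->disable_irq_post = 0;
		} else if (!strncmp(str, "off", 3)) {
			o->disable_irq_remap = 1;
			o->disable_irq_post = 1;
		} else if (!strncmp(str, "nosid", 5))
			o->disable_sourceid_checking = 1;
		else if (!strncmp(str, "nopost", 6))
			o->disable_irq_post = 1;

		str += strcspn(str, ",");
		while (*str == ',')
			str++;
	}
}
```

Because tokens are processed left to right, "on,nopost" enables remapping but still ends with posting disabled.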
+1
drivers/vfio/Kconfig
··· 33 33 34 34 source "drivers/vfio/pci/Kconfig" 35 35 source "drivers/vfio/platform/Kconfig" 36 + source "virt/lib/Kconfig"
+1
drivers/vfio/pci/Kconfig
··· 2 2 tristate "VFIO support for PCI devices" 3 3 depends on VFIO && PCI && EVENTFD 4 4 select VFIO_VIRQFD 5 + select IRQ_BYPASS_MANAGER 5 6 help 6 7 Support for the PCI VFIO bus driver. This is required to make 7 8 use of PCI drivers using the VFIO framework.
+9
drivers/vfio/pci/vfio_pci_intrs.c
··· 319 319 320 320 if (vdev->ctx[vector].trigger) { 321 321 free_irq(irq, vdev->ctx[vector].trigger); 322 + irq_bypass_unregister_producer(&vdev->ctx[vector].producer); 322 323 kfree(vdev->ctx[vector].name); 323 324 eventfd_ctx_put(vdev->ctx[vector].trigger); 324 325 vdev->ctx[vector].trigger = NULL; ··· 360 359 eventfd_ctx_put(trigger); 361 360 return ret; 362 361 } 362 + 363 + vdev->ctx[vector].producer.token = trigger; 364 + vdev->ctx[vector].producer.irq = irq; 365 + ret = irq_bypass_register_producer(&vdev->ctx[vector].producer); 366 + if (unlikely(ret)) 367 + dev_info(&pdev->dev, 368 + "irq bypass producer (token %p) registration fails: %d\n", 369 + vdev->ctx[vector].producer.token, ret); 363 370 364 371 vdev->ctx[vector].trigger = trigger; 365 372
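The vfio_pci_intrs.c hunk registers the bypass producer right after `request_irq()` succeeds and treats registration failure as non-fatal (a `dev_info()`, not an error return), since MSI delivery still works through the ordinary eventfd path. A small sketch of that best-effort ordering, using made-up helper names:

```c
#include <assert.h>
#include <stddef.h>

struct producer {
	void *token;
	int irq;
	int registered;
};

/* Pretend registration fails on a missing token, standing in for
 * whatever reason the real manager might reject a producer. */
static int register_producer(struct producer *p)
{
	if (!p->token)
		return -22; /* -EINVAL */
	p->registered = 1;
	return 0;
}

/* Bypass setup is best-effort: a failure is logged, not returned,
 * because interrupts still flow without the bypass fast path. */
static int setup_vector(struct producer *p, void *trigger, int irq)
{
	int ret;

	p->token = trigger;
	p->irq = irq;
	ret = register_producer(p);
	(void)ret; /* the real code dev_info()s on failure and continues */
	return 0;
}
```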
+2
drivers/vfio/pci/vfio_pci_private.h
··· 13 13 14 14 #include <linux/mutex.h> 15 15 #include <linux/pci.h> 16 + #include <linux/irqbypass.h> 16 17 17 18 #ifndef VFIO_PCI_PRIVATE_H 18 19 #define VFIO_PCI_PRIVATE_H ··· 30 29 struct virqfd *mask; 31 30 char *name; 32 31 bool masked; 32 + struct irq_bypass_producer producer; 33 33 }; 34 34 35 35 struct vfio_pci_device {
+3 -1
include/kvm/arm_arch_timer.h
··· 51 51 bool armed; 52 52 53 53 /* Timer IRQ */ 54 - const struct kvm_irq_level *irq; 54 + struct kvm_irq_level irq; 55 55 56 56 /* VGIC mapping */ 57 57 struct irq_phys_map *map; ··· 71 71 int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); 72 72 73 73 bool kvm_timer_should_fire(struct kvm_vcpu *vcpu); 74 + void kvm_timer_schedule(struct kvm_vcpu *vcpu); 75 + void kvm_timer_unschedule(struct kvm_vcpu *vcpu); 74 76 75 77 #endif
+3 -13
include/kvm/arm_vgic.h
··· 112 112 struct vgic_ops { 113 113 struct vgic_lr (*get_lr)(const struct kvm_vcpu *, int); 114 114 void (*set_lr)(struct kvm_vcpu *, int, struct vgic_lr); 115 - void (*sync_lr_elrsr)(struct kvm_vcpu *, int, struct vgic_lr); 116 115 u64 (*get_elrsr)(const struct kvm_vcpu *vcpu); 117 116 u64 (*get_eisr)(const struct kvm_vcpu *vcpu); 118 117 void (*clear_eisr)(struct kvm_vcpu *vcpu); ··· 158 159 u32 virt_irq; 159 160 u32 phys_irq; 160 161 u32 irq; 161 - bool active; 162 162 }; 163 163 164 164 struct irq_phys_map_entry { ··· 294 296 }; 295 297 296 298 struct vgic_cpu { 297 - /* per IRQ to LR mapping */ 298 - u8 *vgic_irq_lr_map; 299 - 300 299 /* Pending/active/both interrupts on this VCPU */ 301 - DECLARE_BITMAP( pending_percpu, VGIC_NR_PRIVATE_IRQS); 302 - DECLARE_BITMAP( active_percpu, VGIC_NR_PRIVATE_IRQS); 303 - DECLARE_BITMAP( pend_act_percpu, VGIC_NR_PRIVATE_IRQS); 300 + DECLARE_BITMAP(pending_percpu, VGIC_NR_PRIVATE_IRQS); 301 + DECLARE_BITMAP(active_percpu, VGIC_NR_PRIVATE_IRQS); 302 + DECLARE_BITMAP(pend_act_percpu, VGIC_NR_PRIVATE_IRQS); 304 303 305 304 /* Pending/active/both shared interrupts, dynamically sized */ 306 305 unsigned long *pending_shared; 307 306 unsigned long *active_shared; 308 307 unsigned long *pend_act_shared; 309 - 310 - /* Bitmap of used/free list registers */ 311 - DECLARE_BITMAP( lr_used, VGIC_V2_MAX_LRS); 312 308 313 309 /* Number of list registers on this CPU */ 314 310 int nr_lr; ··· 346 354 struct irq_phys_map *kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, 347 355 int virt_irq, int irq); 348 356 int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, struct irq_phys_map *map); 349 - bool kvm_vgic_get_phys_irq_active(struct irq_phys_map *map); 350 - void kvm_vgic_set_phys_irq_active(struct irq_phys_map *map, bool active); 351 357 352 358 #define irqchip_in_kernel(k) (!!((k)->arch.vgic.in_kernel)) 353 359 #define vgic_initialized(k) (!!((k)->arch.vgic.nr_cpus))
+1
include/linux/hyperv.h
··· 26 26 #define _HYPERV_H 27 27 28 28 #include <uapi/linux/hyperv.h> 29 + #include <uapi/asm/hyperv.h> 29 30 30 31 #include <linux/types.h> 31 32 #include <linux/scatterlist.h>
+90
include/linux/irqbypass.h
··· 1 + /* 2 + * IRQ offload/bypass manager 3 + * 4 + * Copyright (C) 2015 Red Hat, Inc. 5 + * Copyright (c) 2015 Linaro Ltd. 6 + * 7 + * This program is free software; you can redistribute it and/or modify 8 + * it under the terms of the GNU General Public License version 2 as 9 + * published by the Free Software Foundation. 10 + */ 11 + #ifndef IRQBYPASS_H 12 + #define IRQBYPASS_H 13 + 14 + #include <linux/list.h> 15 + 16 + struct irq_bypass_consumer; 17 + 18 + /* 19 + * Theory of operation 20 + * 21 + * The IRQ bypass manager is a simple set of lists and callbacks that allows 22 + * IRQ producers (ex. physical interrupt sources) to be matched to IRQ 23 + * consumers (ex. virtualization hardware that allows IRQ bypass or offload) 24 + * via a shared token (ex. eventfd_ctx). Producers and consumers register 25 + * independently. When a token match is found, the optional @stop callback 26 + * will be called for each participant. The pair will then be connected via 27 + * the @add_* callbacks, and finally the optional @start callback will allow 28 + * any final coordination. When either participant is unregistered, the 29 + * process is repeated using the @del_* callbacks in place of the @add_* 30 + * callbacks. Match tokens must be unique per producer/consumer, 1:N pairings 31 + * are not supported. 
32 + */ 33 + 34 + /** 35 + * struct irq_bypass_producer - IRQ bypass producer definition 36 + * @node: IRQ bypass manager private list management 37 + * @token: opaque token to match between producer and consumer 38 + * @irq: Linux IRQ number for the producer device 39 + * @add_consumer: Connect the IRQ producer to an IRQ consumer (optional) 40 + * @del_consumer: Disconnect the IRQ producer from an IRQ consumer (optional) 41 + * @stop: Perform any quiesce operations necessary prior to add/del (optional) 42 + * @start: Perform any startup operations necessary after add/del (optional) 43 + * 44 + * The IRQ bypass producer structure represents an interrupt source for 45 + * participation in possible host bypass, for instance an interrupt vector 46 + * for a physical device assigned to a VM. 47 + */ 48 + struct irq_bypass_producer { 49 + struct list_head node; 50 + void *token; 51 + int irq; 52 + int (*add_consumer)(struct irq_bypass_producer *, 53 + struct irq_bypass_consumer *); 54 + void (*del_consumer)(struct irq_bypass_producer *, 55 + struct irq_bypass_consumer *); 56 + void (*stop)(struct irq_bypass_producer *); 57 + void (*start)(struct irq_bypass_producer *); 58 + }; 59 + 60 + /** 61 + * struct irq_bypass_consumer - IRQ bypass consumer definition 62 + * @node: IRQ bypass manager private list management 63 + * @token: opaque token to match between producer and consumer 64 + * @add_producer: Connect the IRQ consumer to an IRQ producer 65 + * @del_producer: Disconnect the IRQ consumer from an IRQ producer 66 + * @stop: Perform any quiesce operations necessary prior to add/del (optional) 67 + * @start: Perform any startup operations necessary after add/del (optional) 68 + * 69 + * The IRQ bypass consumer structure represents an interrupt sink for 70 + * participation in possible host bypass, for instance a hypervisor may 71 + * support offloads to allow bypassing the host entirely or offload 72 + * portions of the interrupt handling to the VM. 
73 + */ 74 + struct irq_bypass_consumer { 75 + struct list_head node; 76 + void *token; 77 + int (*add_producer)(struct irq_bypass_consumer *, 78 + struct irq_bypass_producer *); 79 + void (*del_producer)(struct irq_bypass_consumer *, 80 + struct irq_bypass_producer *); 81 + void (*stop)(struct irq_bypass_consumer *); 82 + void (*start)(struct irq_bypass_consumer *); 83 + }; 84 + 85 + int irq_bypass_register_producer(struct irq_bypass_producer *); 86 + void irq_bypass_unregister_producer(struct irq_bypass_producer *); 87 + int irq_bypass_register_consumer(struct irq_bypass_consumer *); 88 + void irq_bypass_unregister_consumer(struct irq_bypass_consumer *); 89 + 90 + #endif /* IRQBYPASS_H */
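The "Theory of operation" comment in the new header describes matching producers to consumers through an opaque shared token. A condensed sketch of that match step (structs trimmed to the fields needed; the real manager also handles the reverse direction, the stop/start callbacks, and locking):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for irq_bypass_producer/consumer. */
struct producer {
	void *token;
	int irq;
};

struct consumer {
	void *token;
	int connected_irq; /* set by add_producer */
	int (*add_producer)(struct consumer *, struct producer *);
};

/* Walk the registered consumers and connect the first one whose
 * token matches the producer's, mimicking the manager's match step. */
static int connect_producer(struct producer *p,
			    struct consumer **consumers, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (consumers[i]->token == p->token)
			return consumers[i]->add_producer(consumers[i], p);
	}
	return -1; /* no match yet; pairing happens when the peer registers */
}

/* Example add_producer callback: remember which IRQ we bypassed. */
static int record_irq(struct consumer *c, struct producer *p)
{
	c->connected_irq = p->irq;
	return 0;
}
```

In the KVM/VFIO case the token is the irqfd's `eventfd_ctx`, which both sides can obtain independently, so neither subsystem needs direct knowledge of the other.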
+40 -2
include/linux/kvm_host.h
··· 24 24 #include <linux/err.h> 25 25 #include <linux/irqflags.h> 26 26 #include <linux/context_tracking.h> 27 + #include <linux/irqbypass.h> 27 28 #include <asm/signal.h> 28 29 29 30 #include <linux/kvm.h> ··· 141 140 #define KVM_REQ_APIC_PAGE_RELOAD 25 142 141 #define KVM_REQ_SMI 26 143 142 #define KVM_REQ_HV_CRASH 27 143 + #define KVM_REQ_IOAPIC_EOI_EXIT 28 144 + #define KVM_REQ_HV_RESET 29 144 145 145 146 #define KVM_USERSPACE_IRQ_SOURCE_ID 0 146 147 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1 ··· 233 230 int mode; 234 231 unsigned long requests; 235 232 unsigned long guest_debug; 233 + 234 + int pre_pcpu; 235 + struct list_head blocked_vcpu_list; 236 236 237 237 struct mutex mutex; 238 238 struct kvm_run *run; ··· 334 328 }; 335 329 struct hlist_node link; 336 330 }; 331 + 332 + #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING 333 + struct kvm_irq_routing_table { 334 + int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS]; 335 + u32 nr_rt_entries; 336 + /* 337 + * Array indexed by gsi. Each entry contains list of irq chips 338 + * the gsi is connected to. 
339 + */ 340 + struct hlist_head map[0]; 341 + }; 342 + #endif 337 343 338 344 #ifndef KVM_PRIVATE_MEM_SLOTS 339 345 #define KVM_PRIVATE_MEM_SLOTS 0 ··· 473 455 474 456 #ifdef __KVM_HAVE_IOAPIC 475 457 void kvm_vcpu_request_scan_ioapic(struct kvm *kvm); 458 + void kvm_arch_irq_routing_update(struct kvm *kvm); 476 459 #else 477 460 static inline void kvm_vcpu_request_scan_ioapic(struct kvm *kvm) 461 + { 462 + } 463 + static inline void kvm_arch_irq_routing_update(struct kvm *kvm) 478 464 { 479 465 } 480 466 #endif ··· 647 625 void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn); 648 626 649 627 void kvm_vcpu_block(struct kvm_vcpu *vcpu); 628 + void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu); 629 + void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu); 650 630 void kvm_vcpu_kick(struct kvm_vcpu *vcpu); 651 631 int kvm_vcpu_yield_to(struct kvm_vcpu *target); 652 632 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu); ··· 827 803 828 804 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, 829 805 bool line_status); 830 - int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level); 831 806 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm, 832 807 int irq_source_id, int level, bool line_status); 808 + int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, 809 + struct kvm *kvm, int irq_source_id, 810 + int level, bool line_status); 833 811 bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin); 812 + void kvm_notify_acked_gsi(struct kvm *kvm, int gsi); 834 813 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin); 835 814 void kvm_register_irq_ack_notifier(struct kvm *kvm, 836 815 struct kvm_irq_ack_notifier *kian); ··· 1029 1002 #endif 1030 1003 1031 1004 int kvm_setup_default_irq_routing(struct kvm *kvm); 1005 + int kvm_setup_empty_irq_routing(struct kvm *kvm); 1032 1006 int kvm_set_irq_routing(struct kvm *kvm, 1033 1007 const struct 
kvm_irq_routing_entry *entries, 1034 1008 unsigned nr, ··· 1172 1144 { 1173 1145 } 1174 1146 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */ 1175 - #endif 1176 1147 1148 + #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 1149 + int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *, 1150 + struct irq_bypass_producer *); 1151 + void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *, 1152 + struct irq_bypass_producer *); 1153 + void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *); 1154 + void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *); 1155 + int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq, 1156 + uint32_t guest_irq, bool set); 1157 + #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */ 1158 + #endif
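`struct kvm_irq_routing_table`, moved into kvm_host.h above, sizes its gsi-indexed `map[0]` array at allocation time. A sketch of that header-plus-trailing-array idiom in portable C (the kernel's `map[0]` is the pre-C99 spelling of a flexible array member, and the element type here is simplified to `int`):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for kvm_irq_routing_table: a fixed header
 * followed by an array with one slot per gsi. */
struct routing_table {
	unsigned int nr_rt_entries;
	int map[]; /* flexible array member; kernel writes map[0] */
};

/* Allocate header and array in one block, sized by entry count. */
static struct routing_table *alloc_table(unsigned int n)
{
	struct routing_table *rt;

	rt = malloc(sizeof(*rt) + n * sizeof(rt->map[0]));
	if (!rt)
		return NULL;
	rt->nr_rt_entries = n;
	memset(rt->map, 0, n * sizeof(rt->map[0]));
	return rt;
}
```

The single allocation is what lets RCU swap the whole routing table atomically: readers see either the old block or the new one, never a half-updated mix.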
+71
include/linux/kvm_irqfd.h
··· 1 + /* 2 + * This program is free software; you can redistribute it and/or modify 3 + * it under the terms of the GNU General Public License as published by 4 + * the Free Software Foundation; either version 2 of the License. 5 + * 6 + * This program is distributed in the hope that it will be useful, 7 + * but WITHOUT ANY WARRANTY; without even the implied warranty of 8 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 9 + * GNU General Public License for more details. 10 + * 11 + * irqfd: Allows an fd to be used to inject an interrupt to the guest 12 + * Credit goes to Avi Kivity for the original idea. 13 + */ 14 + 15 + #ifndef __LINUX_KVM_IRQFD_H 16 + #define __LINUX_KVM_IRQFD_H 17 + 18 + #include <linux/kvm_host.h> 19 + #include <linux/poll.h> 20 + 21 + /* 22 + * Resampling irqfds are a special variety of irqfds used to emulate 23 + * level triggered interrupts. The interrupt is asserted on eventfd 24 + * trigger. On acknowledgment through the irq ack notifier, the 25 + * interrupt is de-asserted and userspace is notified through the 26 + * resamplefd. All resamplers on the same gsi are de-asserted 27 + * together, so we don't need to track the state of each individual 28 + * user. We can also therefore share the same irq source ID. 29 + */ 30 + struct kvm_kernel_irqfd_resampler { 31 + struct kvm *kvm; 32 + /* 33 + * List of resampling struct _irqfd objects sharing this gsi. 34 + * RCU list modified under kvm->irqfds.resampler_lock 35 + */ 36 + struct list_head list; 37 + struct kvm_irq_ack_notifier notifier; 38 + /* 39 + * Entry in list of kvm->irqfd.resampler_list. Use for sharing 40 + * resamplers among irqfds on the same gsi. 
41 + * Accessed and modified under kvm->irqfds.resampler_lock 42 + */ 43 + struct list_head link; 44 + }; 45 + 46 + struct kvm_kernel_irqfd { 47 + /* Used for MSI fast-path */ 48 + struct kvm *kvm; 49 + wait_queue_t wait; 50 + /* Update side is protected by irqfds.lock */ 51 + struct kvm_kernel_irq_routing_entry irq_entry; 52 + seqcount_t irq_entry_sc; 53 + /* Used for level IRQ fast-path */ 54 + int gsi; 55 + struct work_struct inject; 56 + /* The resampler used by this irqfd (resampler-only) */ 57 + struct kvm_kernel_irqfd_resampler *resampler; 58 + /* Eventfd notified on resample (resampler-only) */ 59 + struct eventfd_ctx *resamplefd; 60 + /* Entry in list of irqfds for a resampler (resampler-only) */ 61 + struct list_head resampler_link; 62 + /* Used for setup/shutdown */ 63 + struct eventfd_ctx *eventfd; 64 + struct list_head list; 65 + poll_table pt; 66 + struct work_struct shutdown; 67 + struct irq_bypass_consumer consumer; 68 + struct irq_bypass_producer *producer; 69 + }; 70 + 71 + #endif /* __LINUX_KVM_IRQFD_H */
+7
include/uapi/linux/kvm.h
··· 183 183 #define KVM_EXIT_EPR 23 184 184 #define KVM_EXIT_SYSTEM_EVENT 24 185 185 #define KVM_EXIT_S390_STSI 25 186 + #define KVM_EXIT_IOAPIC_EOI 26 186 187 187 188 /* For KVM_EXIT_INTERNAL_ERROR */ 188 189 /* Emulate instruction failed. */ ··· 334 333 __u8 sel1; 335 334 __u16 sel2; 336 335 } s390_stsi; 336 + /* KVM_EXIT_IOAPIC_EOI */ 337 + struct { 338 + __u8 vector; 339 + } eoi; 337 340 /* Fix the size of the union. */ 338 341 char padding[256]; 339 342 }; ··· 829 824 #define KVM_CAP_MULTI_ADDRESS_SPACE 118 830 825 #define KVM_CAP_GUEST_DEBUG_HW_BPS 119 831 826 #define KVM_CAP_GUEST_DEBUG_HW_WPS 120 827 + #define KVM_CAP_SPLIT_IRQCHIP 121 828 + #define KVM_CAP_IOEVENTFD_ANY_LENGTH 122 832 829 833 830 #ifdef KVM_CAP_IRQ_ROUTING 834 831
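With the split irqchip, a userspace IOAPIC has to observe guest EOIs, which is what the new `KVM_EXIT_IOAPIC_EOI` exit and `eoi.vector` field carry. A hedged sketch of the VMM side, with the exit-reason value copied from the uapi hunk above and `struct kvm_run` trimmed to the fields this example touches:

```c
#include <assert.h>

/* Value from the uapi diff; a real VMM gets it from <linux/kvm.h>. */
#define KVM_EXIT_IOAPIC_EOI 26

/* Trimmed-down stand-in for struct kvm_run. */
struct run_state {
	unsigned int exit_reason;
	struct { unsigned char vector; } eoi;
};

/* Return the vector whose EOI the userspace IOAPIC must process,
 * or -1 for exits this sketch doesn't handle. */
static int handle_exit(const struct run_state *run)
{
	if (run->exit_reason == KVM_EXIT_IOAPIC_EOI)
		return run->eoi.vector;
	return -1;
}
```

In a real VMM this branch would sit in the KVM_RUN loop, de-asserting the level-triggered line behind that vector before re-entering the guest.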
+2
kernel/sched/cputime.c
··· 444 444 *ut = p->utime; 445 445 *st = p->stime; 446 446 } 447 + EXPORT_SYMBOL_GPL(task_cputime_adjusted); 447 448 448 449 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st) 449 450 { ··· 653 652 task_cputime(p, &cputime.utime, &cputime.stime); 654 653 cputime_adjust(&cputime, &p->prev_cputime, ut, st); 655 654 } 655 + EXPORT_SYMBOL_GPL(task_cputime_adjusted); 656 656 657 657 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st) 658 658 {
+1
virt/Makefile
··· 1 + obj-y += lib/
+4 -1
virt/kvm/Kconfig
··· 46 46 47 47 config KVM_COMPAT 48 48 def_bool y 49 - depends on COMPAT && !S390 49 + depends on KVM && COMPAT && !S390 50 + 51 + config HAVE_KVM_IRQ_BYPASS 52 + bool
+117 -56
virt/kvm/arm/arch_timer.c
··· 28 28 #include <kvm/arm_vgic.h> 29 29 #include <kvm/arm_arch_timer.h> 30 30 31 + #include "trace.h" 32 + 31 33 static struct timecounter *timecounter; 32 34 static struct workqueue_struct *wqueue; 33 35 static unsigned int host_vtimer_irq; ··· 59 57 cancel_work_sync(&timer->expired); 60 58 timer->armed = false; 61 59 } 62 - } 63 - 64 - static void kvm_timer_inject_irq(struct kvm_vcpu *vcpu) 65 - { 66 - int ret; 67 - struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 68 - 69 - kvm_vgic_set_phys_irq_active(timer->map, true); 70 - ret = kvm_vgic_inject_mapped_irq(vcpu->kvm, vcpu->vcpu_id, 71 - timer->map, 72 - timer->irq->level); 73 - WARN_ON(ret); 74 60 } 75 61 76 62 static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id) ··· 101 111 return HRTIMER_NORESTART; 102 112 } 103 113 114 + static bool kvm_timer_irq_can_fire(struct kvm_vcpu *vcpu) 115 + { 116 + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 117 + 118 + return !(timer->cntv_ctl & ARCH_TIMER_CTRL_IT_MASK) && 119 + (timer->cntv_ctl & ARCH_TIMER_CTRL_ENABLE); 120 + } 121 + 104 122 bool kvm_timer_should_fire(struct kvm_vcpu *vcpu) 105 123 { 106 124 struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 107 125 cycle_t cval, now; 108 126 109 - if ((timer->cntv_ctl & ARCH_TIMER_CTRL_IT_MASK) || 110 - !(timer->cntv_ctl & ARCH_TIMER_CTRL_ENABLE) || 111 - kvm_vgic_get_phys_irq_active(timer->map)) 127 + if (!kvm_timer_irq_can_fire(vcpu)) 112 128 return false; 113 129 114 130 cval = timer->cntv_cval; ··· 123 127 return cval <= now; 124 128 } 125 129 130 + static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level) 131 + { 132 + int ret; 133 + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 134 + 135 + BUG_ON(!vgic_initialized(vcpu->kvm)); 136 + 137 + timer->irq.level = new_level; 138 + trace_kvm_timer_update_irq(vcpu->vcpu_id, timer->map->virt_irq, 139 + timer->irq.level); 140 + ret = kvm_vgic_inject_mapped_irq(vcpu->kvm, vcpu->vcpu_id, 141 + timer->map, 142 + timer->irq.level); 
143 + WARN_ON(ret); 144 + } 145 + 146 + /* 147 + * Check if there was a change in the timer state (should we raise or lower 148 + * the line level to the GIC). 149 + */ 150 + static void kvm_timer_update_state(struct kvm_vcpu *vcpu) 151 + { 152 + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 153 + 154 + /* 155 + * If userspace modified the timer registers via SET_ONE_REG before 156 + * the vgic was initialized, we mustn't set the timer->irq.level value 157 + * because the guest would never see the interrupt. Instead wait 158 + * until we call this function from kvm_timer_flush_hwstate. 159 + */ 160 + if (!vgic_initialized(vcpu->kvm)) 161 + return; 162 + 163 + if (kvm_timer_should_fire(vcpu) != timer->irq.level) 164 + kvm_timer_update_irq(vcpu, !timer->irq.level); 165 + } 166 + 167 + /* 168 + * Schedule the background timer before calling kvm_vcpu_block, so that this 169 + * thread is removed from its waitqueue and made runnable when there's a timer 170 + * interrupt to handle. 171 + */ 172 + void kvm_timer_schedule(struct kvm_vcpu *vcpu) 173 + { 174 + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 175 + u64 ns; 176 + cycle_t cval, now; 177 + 178 + BUG_ON(timer_is_armed(timer)); 179 + 180 + /* 181 + * No need to schedule a background timer if the guest timer has 182 + * already expired, because kvm_vcpu_block will return before putting 183 + * the thread to sleep. 184 + */ 185 + if (kvm_timer_should_fire(vcpu)) 186 + return; 187 + 188 + /* 189 + * If the timer is not capable of raising interrupts (disabled or 190 + * masked), then there's no more work for us to do. 
+	 */
+	if (!kvm_timer_irq_can_fire(vcpu))
+		return;
+
+	/* The timer has not yet expired, schedule a background timer */
+	cval = timer->cntv_cval;
+	now = kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff;
+
+	ns = cyclecounter_cyc2ns(timecounter->cc,
+				 cval - now,
+				 timecounter->mask,
+				 &timecounter->frac);
+	timer_arm(timer, ns);
+}
+
+void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
+{
+	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
+	timer_disarm(timer);
+}
+
 /**
  * kvm_timer_flush_hwstate - prepare to move the virt timer to the cpu
  * @vcpu: The vcpu pointer
  *
- * Disarm any pending soft timers, since the world-switch code will write the
- * virtual timer state back to the physical CPU.
+ * Check if the virtual timer has expired while we were running in the host,
+ * and inject an interrupt if that was the case.
  */
 void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
 {
···
 	bool phys_active;
 	int ret;
 
-	/*
-	 * We're about to run this vcpu again, so there is no need to
-	 * keep the background timer running, as we're about to
-	 * populate the CPU timer again.
-	 */
-	timer_disarm(timer);
+	kvm_timer_update_state(vcpu);
 
 	/*
-	 * If the timer expired while we were not scheduled, now is the time
-	 * to inject it.
+	 * If we enter the guest with the virtual input level to the VGIC
+	 * asserted, then we have already told the VGIC what we need to, and
+	 * we don't need to exit from the guest until the guest deactivates
+	 * the already injected interrupt, so we should set the hardware
+	 * active state to prevent unnecessary exits from the guest.
+	 *
+	 * Conversely, if the virtual input level is deasserted, then always
+	 * clear the hardware active state to ensure that hardware interrupts
+	 * from the timer trigger a guest exit.
 	 */
-	if (kvm_timer_should_fire(vcpu))
-		kvm_timer_inject_irq(vcpu);
-
-	/*
-	 * We keep track of whether the edge-triggered interrupt has been
-	 * signalled to the vgic/guest, and if so, we mask the interrupt and
-	 * the physical distributor to prevent the timer from raising a
-	 * physical interrupt whenever we run a guest, preventing forward
-	 * VCPU progress.
-	 */
-	if (kvm_vgic_get_phys_irq_active(timer->map))
+	if (timer->irq.level)
 		phys_active = true;
 	else
 		phys_active = false;
···
 /**
  * kvm_timer_sync_hwstate - sync timer state from cpu
  * @vcpu: The vcpu pointer
  *
- * Check if the virtual timer was armed and either schedule a corresponding
- * soft timer or inject directly if already expired.
+ * Check if the virtual timer has expired while we were running in the guest,
+ * and inject an interrupt if that was the case.
  */
 void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
 {
 	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	cycle_t cval, now;
-	u64 ns;
 
 	BUG_ON(timer_is_armed(timer));
 
-	if (kvm_timer_should_fire(vcpu)) {
-		/*
-		 * Timer has already expired while we were not
-		 * looking. Inject the interrupt and carry on.
-		 */
-		kvm_timer_inject_irq(vcpu);
-		return;
-	}
-
-	cval = timer->cntv_cval;
-	now = kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff;
-
-	ns = cyclecounter_cyc2ns(timecounter->cc, cval - now, timecounter->mask,
-				 &timecounter->frac);
-	timer_arm(timer, ns);
+	/*
+	 * The guest could have modified the timer registers or the timer
+	 * could have expired, update the timer state.
+	 */
+	kvm_timer_update_state(vcpu);
 }
 
 int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu,
···
 	 * kvm_vcpu_set_target(). To handle this, we determine
 	 * vcpu timer irq number when the vcpu is reset.
 	 */
-	timer->irq = irq;
+	timer->irq.irq = irq->irq;
 
 	/*
 	 * The bits in CNTV_CTL are architecturally reset to UNKNOWN for ARMv8
···
 	 * the ARMv7 architecture.
 	 */
 	timer->cntv_ctl = 0;
+	kvm_timer_update_state(vcpu);
 
 	/*
 	 * Tell the VGIC that the virtual interrupt is tied to a
···
 	default:
 		return -1;
 	}
+
+	kvm_timer_update_state(vcpu);
 	return 0;
 }
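The background-timer path above converts the remaining cycle delta (cval - now) into nanoseconds with cyclecounter_cyc2ns(), which is a fixed-point multiply-and-shift. A minimal Python sketch of that conversion (not kernel code; the 62.5 MHz frequency and the MULT/SHIFT pair below are illustrative assumptions, not values the kernel uses):

```python
# Sketch of the kernel's cycles-to-nanoseconds conversion:
# ns = (cycles * mult) >> shift, all in integer arithmetic.

def cyc2ns(cycles: int, mult: int, shift: int) -> int:
    """Fixed-point conversion in the style of cyclecounter_cyc2ns()."""
    return (cycles * mult) >> shift

# Hypothetical 62.5 MHz arch timer -> 16 ns per cycle,
# so mult / 2**shift must equal 16.
MULT, SHIFT = 16 << 10, 10

cval = 1_000_100   # compare value programmed by the guest
now = 1_000_000    # current counter, CNTVOFF already subtracted
delta = cval - now # cycles until the timer would fire

print(cyc2ns(delta, MULT, SHIFT))  # 100 cycles * 16 ns = 1600
```

The same delta is what timer_arm() uses as the soft-timer expiry, so the vcpu is woken just as the virtual timer would have fired.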
+63
virt/kvm/arm/trace.h
···
+#if !defined(_TRACE_KVM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KVM_H
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kvm
+
+/*
+ * Tracepoints for vgic
+ */
+TRACE_EVENT(vgic_update_irq_pending,
+	TP_PROTO(unsigned long vcpu_id, __u32 irq, bool level),
+	TP_ARGS(vcpu_id, irq, level),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	vcpu_id	)
+		__field(	__u32,		irq	)
+		__field(	bool,		level	)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu_id;
+		__entry->irq		= irq;
+		__entry->level		= level;
+	),
+
+	TP_printk("VCPU: %ld, IRQ %d, level: %d",
+		  __entry->vcpu_id, __entry->irq, __entry->level)
+);
+
+/*
+ * Tracepoints for arch_timer
+ */
+TRACE_EVENT(kvm_timer_update_irq,
+	TP_PROTO(unsigned long vcpu_id, __u32 irq, int level),
+	TP_ARGS(vcpu_id, irq, level),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	vcpu_id	)
+		__field(	__u32,		irq	)
+		__field(	int,		level	)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu_id;
+		__entry->irq		= irq;
+		__entry->level		= level;
+	),
+
+	TP_printk("VCPU: %ld, IRQ %d, level %d",
+		  __entry->vcpu_id, __entry->irq, __entry->level)
+);
+
+#endif /* _TRACE_KVM_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../../virt/kvm/arm
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+1 -5
virt/kvm/arm/vgic-v2.c
···
 	lr_val |= (lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT);
 
 	vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] = lr_val;
-}
 
-static void vgic_v2_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
-				  struct vgic_lr lr_desc)
-{
 	if (!(lr_desc.state & LR_STATE_MASK))
 		vcpu->arch.vgic_cpu.vgic_v2.vgic_elrsr |= (1ULL << lr);
 	else
···
 	 * anyway.
 	 */
 	vcpu->arch.vgic_cpu.vgic_v2.vgic_vmcr = 0;
+	vcpu->arch.vgic_cpu.vgic_v2.vgic_elrsr = ~0;
 
 	/* Get the show on the road... */
 	vcpu->arch.vgic_cpu.vgic_v2.vgic_hcr = GICH_HCR_EN;
···
 static const struct vgic_ops vgic_v2_ops = {
 	.get_lr			= vgic_v2_get_lr,
 	.set_lr			= vgic_v2_set_lr,
-	.sync_lr_elrsr		= vgic_v2_sync_lr_elrsr,
 	.get_elrsr		= vgic_v2_get_elrsr,
 	.get_eisr		= vgic_v2_get_eisr,
 	.clear_eisr		= vgic_v2_clear_eisr,
+1 -5
virt/kvm/arm/vgic-v3.c
···
 	}
 
 	vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)] = lr_val;
-}
 
-static void vgic_v3_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
-				  struct vgic_lr lr_desc)
-{
 	if (!(lr_desc.state & LR_STATE_MASK))
 		vcpu->arch.vgic_cpu.vgic_v3.vgic_elrsr |= (1U << lr);
 	else
···
 	 * anyway.
 	 */
 	vgic_v3->vgic_vmcr = 0;
+	vgic_v3->vgic_elrsr = ~0;
 
 	/*
 	 * If we are emulating a GICv3, we do it in an non-GICv2-compatible
···
 static const struct vgic_ops vgic_v3_ops = {
 	.get_lr			= vgic_v3_get_lr,
 	.set_lr			= vgic_v3_set_lr,
-	.sync_lr_elrsr		= vgic_v3_sync_lr_elrsr,
 	.get_elrsr		= vgic_v3_get_elrsr,
 	.get_eisr		= vgic_v3_get_eisr,
 	.clear_eisr		= vgic_v3_clear_eisr,
+119 -189
virt/kvm/arm/vgic.c
···
 #include <asm/kvm.h>
 #include <kvm/iodev.h>
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 /*
  * How the whole thing works (courtesy of Christoffer Dall):
  *
···
 #include "vgic.h"
 
 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu);
-static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu);
+static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu);
 static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr);
 static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr lr_desc);
+static u64 vgic_get_elrsr(struct kvm_vcpu *vcpu);
 static struct irq_phys_map *vgic_irq_map_search(struct kvm_vcpu *vcpu,
 						int virt_irq);
+static int compute_pending_for_cpu(struct kvm_vcpu *vcpu);
 
 static const struct vgic_ops *vgic_ops;
 static const struct vgic_params *vgic;
···
 	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
 
 	vgic_bitmap_set_irq_val(&dist->irq_soft_pend, vcpu->vcpu_id, irq, 0);
+	if (!vgic_dist_irq_get_level(vcpu, irq)) {
+		vgic_dist_irq_clear_pending(vcpu, irq);
+		if (!compute_pending_for_cpu(vcpu))
+			clear_bit(vcpu->vcpu_id, dist->irq_pending_on_cpu);
+	}
 }
 
 static int vgic_dist_irq_is_pending(struct kvm_vcpu *vcpu, int irq)
···
 	return false;
 }
 
-/*
- * If a mapped interrupt's state has been modified by the guest such that it
- * is no longer active or pending, without it have gone through the sync path,
- * then the map->active field must be cleared so the interrupt can be taken
- * again.
- */
-static void vgic_handle_clear_mapped_irq(struct kvm_vcpu *vcpu)
-{
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
-	struct list_head *root;
-	struct irq_phys_map_entry *entry;
-	struct irq_phys_map *map;
-
-	rcu_read_lock();
-
-	/* Check for PPIs */
-	root = &vgic_cpu->irq_phys_map_list;
-	list_for_each_entry_rcu(entry, root, entry) {
-		map = &entry->map;
-
-		if (!vgic_dist_irq_is_pending(vcpu, map->virt_irq) &&
-		    !vgic_irq_is_active(vcpu, map->virt_irq))
-			map->active = false;
-	}
-
-	rcu_read_unlock();
-}
-
 bool vgic_handle_clear_pending_reg(struct kvm *kvm,
 				   struct kvm_exit_mmio *mmio,
 				   phys_addr_t offset, int vcpu_id)
···
 				  vcpu_id, offset);
 	vgic_reg_access(mmio, reg, offset, mode);
 
-	vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
 	vgic_update_state(kvm);
 	return true;
 }
···
 			ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT);
 
 	if (mmio->is_write) {
-		vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
 		vgic_update_state(kvm);
 		return true;
 	}
···
 	vgic_reg_access(mmio, &val, offset,
 			ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
 	if (mmio->is_write) {
-		if (offset < 8) {
-			*reg = ~0U; /* Force PPIs/SGIs to 1 */
+		/* Ignore writes to read-only SGI and PPI bits */
+		if (offset < 8)
 			return false;
-		}
 
 		val = vgic_cfg_compress(val);
 		if (offset & 4) {
···
 void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 {
 	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+	u64 elrsr = vgic_get_elrsr(vcpu);
+	unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
 	int i;
 
-	for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) {
+	for_each_clear_bit(i, elrsr_ptr, vgic_cpu->nr_lr) {
 		struct vgic_lr lr = vgic_get_lr(vcpu, i);
 
 		/*
···
 		 * interrupt then move the active state to the
 		 * distributor tracking bit.
 		 */
-		if (lr.state & LR_STATE_ACTIVE) {
+		if (lr.state & LR_STATE_ACTIVE)
 			vgic_irq_set_active(vcpu, lr.irq);
-			lr.state &= ~LR_STATE_ACTIVE;
-		}
 
 		/*
 		 * Reestablish the pending state on the distributor and the
-		 * CPU interface.  It may have already been pending, but that
-		 * is fine, then we are only setting a few bits that were
-		 * already set.
+		 * CPU interface and mark the LR as free for other use.
 		 */
-		if (lr.state & LR_STATE_PENDING) {
-			vgic_dist_irq_set_pending(vcpu, lr.irq);
-			lr.state &= ~LR_STATE_PENDING;
-		}
-
-		vgic_set_lr(vcpu, i, lr);
-
-		/*
-		 * Mark the LR as free for other use.
-		 */
-		BUG_ON(lr.state & LR_STATE_MASK);
-		vgic_retire_lr(i, lr.irq, vcpu);
-		vgic_irq_clear_queued(vcpu, lr.irq);
+		vgic_retire_lr(i, vcpu);
 
 		/* Finally update the VGIC state. */
 		vgic_update_state(vcpu->kvm);
···
 	vgic_ops->set_lr(vcpu, lr, vlr);
 }
 
-static void vgic_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
-			       struct vgic_lr vlr)
-{
-	vgic_ops->sync_lr_elrsr(vcpu, lr, vlr);
-}
-
 static inline u64 vgic_get_elrsr(struct kvm_vcpu *vcpu)
 {
 	return vgic_ops->get_elrsr(vcpu);
···
 	vgic_ops->enable(vcpu);
 }
 
-static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu)
+static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu)
 {
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
 	struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
+
+	vgic_irq_clear_queued(vcpu, vlr.irq);
 
 	/*
 	 * We must transfer the pending state back to the distributor before
 	 * retiring the LR, otherwise we may loose edge-triggered interrupts.
 	 */
 	if (vlr.state & LR_STATE_PENDING) {
-		vgic_dist_irq_set_pending(vcpu, irq);
+		vgic_dist_irq_set_pending(vcpu, vlr.irq);
 		vlr.hwirq = 0;
 	}
 
 	vlr.state = 0;
 	vgic_set_lr(vcpu, lr_nr, vlr);
-	clear_bit(lr_nr, vgic_cpu->lr_used);
-	vgic_cpu->vgic_irq_lr_map[irq] = LR_EMPTY;
-	vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
 }
 
 /*
···
  */
 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu)
 {
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+	u64 elrsr = vgic_get_elrsr(vcpu);
+	unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
 	int lr;
 
-	for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) {
+	for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
 		struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
 
-		if (!vgic_irq_is_enabled(vcpu, vlr.irq)) {
-			vgic_retire_lr(lr, vlr.irq, vcpu);
-			if (vgic_irq_is_queued(vcpu, vlr.irq))
-				vgic_irq_clear_queued(vcpu, vlr.irq);
-		}
+		if (!vgic_irq_is_enabled(vcpu, vlr.irq))
+			vgic_retire_lr(lr, vcpu);
 	}
 }
···
 	}
 
 	vgic_set_lr(vcpu, lr_nr, vlr);
-	vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
 }
 
 /*
···
  */
 bool vgic_queue_irq(struct kvm_vcpu *vcpu, u8 sgi_source_id, int irq)
 {
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
 	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+	u64 elrsr = vgic_get_elrsr(vcpu);
+	unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
 	struct vgic_lr vlr;
 	int lr;
···
 
 	kvm_debug("Queue IRQ%d\n", irq);
 
-	lr = vgic_cpu->vgic_irq_lr_map[irq];
-
 	/* Do we have an active interrupt for the same CPUID? */
-	if (lr != LR_EMPTY) {
+	for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
 		vlr = vgic_get_lr(vcpu, lr);
-		if (vlr.source == sgi_source_id) {
+		if (vlr.irq == irq && vlr.source == sgi_source_id) {
 			kvm_debug("LR%d piggyback for IRQ%d\n", lr, vlr.irq);
-			BUG_ON(!test_bit(lr, vgic_cpu->lr_used));
 			vgic_queue_irq_to_lr(vcpu, irq, lr, vlr);
 			return true;
 		}
 	}
 
 	/* Try to use another LR for this interrupt */
-	lr = find_first_zero_bit((unsigned long *)vgic_cpu->lr_used,
-			       vgic->nr_lr);
+	lr = find_first_bit(elrsr_ptr, vgic->nr_lr);
 	if (lr >= vgic->nr_lr)
 		return false;
 
 	kvm_debug("LR%d allocated for IRQ%d %x\n", lr, irq, sgi_source_id);
-	vgic_cpu->vgic_irq_lr_map[irq] = lr;
-	set_bit(lr, vgic_cpu->lr_used);
 
 	vlr.irq = irq;
 	vlr.source = sgi_source_id;
···
 	}
 }
 
+static int process_queued_irq(struct kvm_vcpu *vcpu,
+				   int lr, struct vgic_lr vlr)
+{
+	int pending = 0;
+
+	/*
+	 * If the IRQ was EOIed (called from vgic_process_maintenance) or it
+	 * went from active to non-active (called from vgic_sync_hwirq) it was
+	 * also ACKed and we therefore assume we can clear the soft pending
+	 * state (should it have been set) for this interrupt.
+	 *
+	 * Note: if the IRQ soft pending state was set after the IRQ was
+	 * acked, it actually shouldn't be cleared, but we have no way of
+	 * knowing that unless we start trapping ACKs when the soft-pending
+	 * state is set.
+	 */
+	vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
+
+	/*
+	 * Tell the gic to start sampling this interrupt again.
+	 */
+	vgic_irq_clear_queued(vcpu, vlr.irq);
+
+	/* Any additional pending interrupt? */
+	if (vgic_irq_is_edge(vcpu, vlr.irq)) {
+		BUG_ON(!(vlr.state & LR_HW));
+		pending = vgic_dist_irq_is_pending(vcpu, vlr.irq);
+	} else {
+		if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
+			vgic_cpu_irq_set(vcpu, vlr.irq);
+			pending = 1;
+		} else {
+			vgic_dist_irq_clear_pending(vcpu, vlr.irq);
+			vgic_cpu_irq_clear(vcpu, vlr.irq);
+		}
+	}
+
+	/*
+	 * Despite being EOIed, the LR may not have
+	 * been marked as empty.
+	 */
+	vlr.state = 0;
+	vlr.hwirq = 0;
+	vgic_set_lr(vcpu, lr, vlr);
+
+	return pending;
+}
+
 static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
 {
 	u32 status = vgic_get_interrupt_status(vcpu);
 	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
-	bool level_pending = false;
 	struct kvm *kvm = vcpu->kvm;
+	int level_pending = 0;
 
 	kvm_debug("STATUS = %08x\n", status);
···
 
 	for_each_set_bit(lr, eisr_ptr, vgic->nr_lr) {
 		struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
+
 		WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
-
-		spin_lock(&dist->lock);
-		vgic_irq_clear_queued(vcpu, vlr.irq);
 		WARN_ON(vlr.state & LR_STATE_MASK);
-		vlr.state = 0;
-		vgic_set_lr(vcpu, lr, vlr);
 
-		/*
-		 * If the IRQ was EOIed it was also ACKed and we we
-		 * therefore assume we can clear the soft pending
-		 * state (should it had been set) for this interrupt.
-		 *
-		 * Note: if the IRQ soft pending state was set after
-		 * the IRQ was acked, it actually shouldn't be
-		 * cleared, but we have no way of knowing that unless
-		 * we start trapping ACKs when the soft-pending state
-		 * is set.
-		 */
-		vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
 
 		/*
 		 * kvm_notify_acked_irq calls kvm_set_irq()
-		 * to reset the IRQ level. Need to release the
-		 * lock for kvm_set_irq to grab it.
+		 * to reset the IRQ level, which grabs the dist->lock
+		 * so we call this before taking the dist->lock.
 		 */
-		spin_unlock(&dist->lock);
-
 		kvm_notify_acked_irq(kvm, 0,
 				     vlr.irq - VGIC_NR_PRIVATE_IRQS);
+
 		spin_lock(&dist->lock);
-
-		/* Any additional pending interrupt? */
-		if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
-			vgic_cpu_irq_set(vcpu, vlr.irq);
-			level_pending = true;
-		} else {
-			vgic_dist_irq_clear_pending(vcpu, vlr.irq);
-			vgic_cpu_irq_clear(vcpu, vlr.irq);
-		}
-
+		level_pending |= process_queued_irq(vcpu, lr, vlr);
 		spin_unlock(&dist->lock);
-
-		/*
-		 * Despite being EOIed, the LR may not have
-		 * been marked as empty.
-		 */
-		vgic_sync_lr_elrsr(vcpu, lr, vlr);
 	}
 }
···
 /*
  * Save the physical active state, and reset it to inactive.
  *
- * Return 1 if HW interrupt went from active to inactive, and 0 otherwise.
+ * Return true if there's a pending forwarded interrupt to queue.
  */
-static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, struct vgic_lr vlr)
+static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
 {
+	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
 	struct irq_phys_map *map;
+	bool phys_active;
+	bool level_pending;
 	int ret;
 
 	if (!(vlr.state & LR_HW))
-		return 0;
+		return false;
 
 	map = vgic_irq_map_search(vcpu, vlr.irq);
 	BUG_ON(!map);
 
 	ret = irq_get_irqchip_state(map->irq,
 				    IRQCHIP_STATE_ACTIVE,
-				    &map->active);
+				    &phys_active);
 
 	WARN_ON(ret);
 
-	if (map->active)
+	if (phys_active)
 		return 0;
 
-	return 1;
+	spin_lock(&dist->lock);
+	level_pending = process_queued_irq(vcpu, lr, vlr);
+	spin_unlock(&dist->lock);
+	return level_pending;
 }
 
 /* Sync back the VGIC state after a guest run */
 static void __kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu)
 {
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
 	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
 	u64 elrsr;
 	unsigned long *elrsr_ptr;
···
 	bool level_pending;
 
 	level_pending = vgic_process_maintenance(vcpu);
-	elrsr = vgic_get_elrsr(vcpu);
-	elrsr_ptr = u64_to_bitmask(&elrsr);
 
 	/* Deal with HW interrupts, and clear mappings for empty LRs */
 	for (lr = 0; lr < vgic->nr_lr; lr++) {
-		struct vgic_lr vlr;
+		struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
 
-		if (!test_bit(lr, vgic_cpu->lr_used))
-			continue;
-
-		vlr = vgic_get_lr(vcpu, lr);
-		if (vgic_sync_hwirq(vcpu, vlr)) {
-			/*
-			 * So this is a HW interrupt that the guest
-			 * EOI-ed. Clean the LR state and allow the
-			 * interrupt to be sampled again.
-			 */
-			vlr.state = 0;
-			vlr.hwirq = 0;
-			vgic_set_lr(vcpu, lr, vlr);
-			vgic_irq_clear_queued(vcpu, vlr.irq);
-			set_bit(lr, elrsr_ptr);
-		}
-
-		if (!test_bit(lr, elrsr_ptr))
-			continue;
-
-		clear_bit(lr, vgic_cpu->lr_used);
-
+		level_pending |= vgic_sync_hwirq(vcpu, lr, vlr);
 		BUG_ON(vlr.irq >= dist->nr_irqs);
-		vgic_cpu->vgic_irq_lr_map[vlr.irq] = LR_EMPTY;
 	}
 
 	/* Check if we still have something up our sleeve... */
+	elrsr = vgic_get_elrsr(vcpu);
+	elrsr_ptr = u64_to_bitmask(&elrsr);
 	pending = find_first_zero_bit(elrsr_ptr, vgic->nr_lr);
 	if (level_pending || pending < vgic->nr_lr)
 		set_bit(vcpu->vcpu_id, dist->irq_pending_on_cpu);
···
 	int edge_triggered, level_triggered;
 	int enabled;
 	bool ret = true, can_inject = true;
+
+	trace_vgic_update_irq_pending(cpuid, irq_num, level);
 
 	if (irq_num >= min(kvm->arch.vgic.nr_irqs, 1020))
 		return -EINVAL;
···
 }
 
 /**
- * kvm_vgic_get_phys_irq_active - Return the active state of a mapped IRQ
- *
- * Return the logical active state of a mapped interrupt. This doesn't
- * necessarily reflects the current HW state.
- */
-bool kvm_vgic_get_phys_irq_active(struct irq_phys_map *map)
-{
-	BUG_ON(!map);
-	return map->active;
-}
-
-/**
- * kvm_vgic_set_phys_irq_active - Set the active state of a mapped IRQ
- *
- * Set the logical active state of a mapped interrupt. This doesn't
- * immediately affects the HW state.
- */
-void kvm_vgic_set_phys_irq_active(struct irq_phys_map *map, bool active)
-{
-	BUG_ON(!map);
-	map->active = active;
-}
-
-/**
  * kvm_vgic_unmap_phys_irq - Remove a virtual to physical IRQ mapping
  * @vcpu: The VCPU pointer
  * @map: The pointer to a mapping obtained through kvm_vgic_map_phys_irq
···
 	kfree(vgic_cpu->pending_shared);
 	kfree(vgic_cpu->active_shared);
 	kfree(vgic_cpu->pend_act_shared);
-	kfree(vgic_cpu->vgic_irq_lr_map);
 	vgic_destroy_irq_phys_map(vcpu->kvm, &vgic_cpu->irq_phys_map_list);
 	vgic_cpu->pending_shared = NULL;
 	vgic_cpu->active_shared = NULL;
 	vgic_cpu->pend_act_shared = NULL;
-	vgic_cpu->vgic_irq_lr_map = NULL;
 }
 
 static int vgic_vcpu_init_maps(struct kvm_vcpu *vcpu, int nr_irqs)
···
 	vgic_cpu->pending_shared = kzalloc(sz, GFP_KERNEL);
 	vgic_cpu->active_shared = kzalloc(sz, GFP_KERNEL);
 	vgic_cpu->pend_act_shared = kzalloc(sz, GFP_KERNEL);
-	vgic_cpu->vgic_irq_lr_map = kmalloc(nr_irqs, GFP_KERNEL);
 
 	if (!vgic_cpu->pending_shared
 		|| !vgic_cpu->active_shared
-		|| !vgic_cpu->pend_act_shared
-		|| !vgic_cpu->vgic_irq_lr_map) {
+		|| !vgic_cpu->pend_act_shared) {
 		kvm_vgic_vcpu_destroy(vcpu);
 		return -ENOMEM;
 	}
-
-	memset(vgic_cpu->vgic_irq_lr_map, LR_EMPTY, nr_irqs);
 
 	/*
 	 * Store the number of LRs per vcpu, so we don't have to go
···
 		break;
 	}
 
-	for (i = 0; i < dist->nr_irqs; i++) {
-		if (i < VGIC_NR_PPIS)
+	/*
+	 * Enable and configure all SGIs to be edge-triggered and
+	 * configure all PPIs as level-triggered.
+	 */
+	for (i = 0; i < VGIC_NR_PRIVATE_IRQS; i++) {
+		if (i < VGIC_NR_SGIS) {
+			/* SGIs */
 			vgic_bitmap_set_irq_val(&dist->irq_enabled,
 						vcpu->vcpu_id, i, 1);
-		if (i < VGIC_NR_PRIVATE_IRQS)
 			vgic_bitmap_set_irq_val(&dist->irq_cfg,
 						vcpu->vcpu_id, i,
 						VGIC_CFG_EDGE);
+		} else if (i < VGIC_NR_PRIVATE_IRQS) {
+			/* PPIs */
+			vgic_bitmap_set_irq_val(&dist->irq_cfg,
+						vcpu->vcpu_id, i,
+						VGIC_CFG_LEVEL);
+		}
 	}
 
 	vgic_enable(vcpu);
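The recurring pattern in the vgic changes above is that the separate lr_used bitmap and vgic_irq_lr_map are dropped: list-register occupancy is now derived from the hardware ELRSR bitmap alone, where a set bit means the LR is empty. A minimal Python sketch of that bookkeeping (not kernel code; the 4-LR mask below is an illustrative assumption):

```python
# Sketch of ELRSR-based list-register tracking: a set bit marks an
# empty LR, so occupied LRs are the *clear* bits and a free LR is the
# first *set* bit, mirroring for_each_clear_bit()/find_first_bit().

NR_LR = 4

def for_each_clear_bit(mask: int, nbits: int):
    """Indices of 0-bits, like the kernel's for_each_clear_bit()."""
    return [i for i in range(nbits) if not (mask >> i) & 1]

def find_first_bit(mask: int, nbits: int) -> int:
    """First set bit, or nbits if none (kernel find_first_bit semantics)."""
    for i in range(nbits):
        if (mask >> i) & 1:
            return i
    return nbits

elrsr = 0b1010  # LRs 1 and 3 empty; LRs 0 and 2 hold interrupts

print(for_each_clear_bit(elrsr, NR_LR))  # occupied LRs: [0, 2]
print(find_first_bit(elrsr, NR_LR))      # first free LR: 1
```

This is why vgic_queue_irq above scans the clear bits for a piggyback candidate and then takes find_first_bit() of the same mask as the next free slot: one hardware-maintained bitmap answers both questions.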
+4
virt/kvm/async_pf.c
···
 
 	trace_kvm_async_pf_completed(addr, gva);
 
+	/*
+	 * This memory barrier pairs with prepare_to_wait's set_current_state()
+	 */
+	smp_mb();
 	if (waitqueue_active(&vcpu->wq))
 		wake_up_interruptible(&vcpu->wq);
 
+97 -93
virt/kvm/eventfd.c
··· 23 23 24 24 #include <linux/kvm_host.h> 25 25 #include <linux/kvm.h> 26 + #include <linux/kvm_irqfd.h> 26 27 #include <linux/workqueue.h> 27 28 #include <linux/syscalls.h> 28 29 #include <linux/wait.h> ··· 35 34 #include <linux/srcu.h> 36 35 #include <linux/slab.h> 37 36 #include <linux/seqlock.h> 37 + #include <linux/irqbypass.h> 38 38 #include <trace/events/kvm.h> 39 39 40 40 #include <kvm/iodev.h> 41 41 42 42 #ifdef CONFIG_HAVE_KVM_IRQFD 43 - /* 44 - * -------------------------------------------------------------------- 45 - * irqfd: Allows an fd to be used to inject an interrupt to the guest 46 - * 47 - * Credit goes to Avi Kivity for the original idea. 48 - * -------------------------------------------------------------------- 49 - */ 50 - 51 - /* 52 - * Resampling irqfds are a special variety of irqfds used to emulate 53 - * level triggered interrupts. The interrupt is asserted on eventfd 54 - * trigger. On acknowledgement through the irq ack notifier, the 55 - * interrupt is de-asserted and userspace is notified through the 56 - * resamplefd. All resamplers on the same gsi are de-asserted 57 - * together, so we don't need to track the state of each individual 58 - * user. We can also therefore share the same irq source ID. 59 - */ 60 - struct _irqfd_resampler { 61 - struct kvm *kvm; 62 - /* 63 - * List of resampling struct _irqfd objects sharing this gsi. 64 - * RCU list modified under kvm->irqfds.resampler_lock 65 - */ 66 - struct list_head list; 67 - struct kvm_irq_ack_notifier notifier; 68 - /* 69 - * Entry in list of kvm->irqfd.resampler_list. Use for sharing 70 - * resamplers among irqfds on the same gsi. 
71 - * Accessed and modified under kvm->irqfds.resampler_lock 72 - */ 73 - struct list_head link; 74 - }; 75 - 76 - struct _irqfd { 77 - /* Used for MSI fast-path */ 78 - struct kvm *kvm; 79 - wait_queue_t wait; 80 - /* Update side is protected by irqfds.lock */ 81 - struct kvm_kernel_irq_routing_entry irq_entry; 82 - seqcount_t irq_entry_sc; 83 - /* Used for level IRQ fast-path */ 84 - int gsi; 85 - struct work_struct inject; 86 - /* The resampler used by this irqfd (resampler-only) */ 87 - struct _irqfd_resampler *resampler; 88 - /* Eventfd notified on resample (resampler-only) */ 89 - struct eventfd_ctx *resamplefd; 90 - /* Entry in list of irqfds for a resampler (resampler-only) */ 91 - struct list_head resampler_link; 92 - /* Used for setup/shutdown */ 93 - struct eventfd_ctx *eventfd; 94 - struct list_head list; 95 - poll_table pt; 96 - struct work_struct shutdown; 97 - }; 98 43 99 44 static struct workqueue_struct *irqfd_cleanup_wq; 100 45 101 46 static void 102 47 irqfd_inject(struct work_struct *work) 103 48 { 104 - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject); 49 + struct kvm_kernel_irqfd *irqfd = 50 + container_of(work, struct kvm_kernel_irqfd, inject); 105 51 struct kvm *kvm = irqfd->kvm; 106 52 107 53 if (!irqfd->resampler) { ··· 69 121 static void 70 122 irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian) 71 123 { 72 - struct _irqfd_resampler *resampler; 124 + struct kvm_kernel_irqfd_resampler *resampler; 73 125 struct kvm *kvm; 74 - struct _irqfd *irqfd; 126 + struct kvm_kernel_irqfd *irqfd; 75 127 int idx; 76 128 77 - resampler = container_of(kian, struct _irqfd_resampler, notifier); 129 + resampler = container_of(kian, 130 + struct kvm_kernel_irqfd_resampler, notifier); 78 131 kvm = resampler->kvm; 79 132 80 133 kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID, ··· 90 141 } 91 142 92 143 static void 93 - irqfd_resampler_shutdown(struct _irqfd *irqfd) 144 + irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd) 94 145 { 95 
- struct _irqfd_resampler *resampler = irqfd->resampler; 146 + struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler; 96 147 struct kvm *kvm = resampler->kvm; 97 148 98 149 mutex_lock(&kvm->irqfds.resampler_lock); ··· 117 168 static void 118 169 irqfd_shutdown(struct work_struct *work) 119 170 { 120 - struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown); 171 + struct kvm_kernel_irqfd *irqfd = 172 + container_of(work, struct kvm_kernel_irqfd, shutdown); 121 173 u64 cnt; 122 174 123 175 /* ··· 141 191 /* 142 192 * It is now safe to release the object's resources 143 193 */ 194 + #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS 195 + irq_bypass_unregister_consumer(&irqfd->consumer); 196 + #endif 144 197 eventfd_ctx_put(irqfd->eventfd); 145 198 kfree(irqfd); 146 199 } ··· 151 198 152 199 /* assumes kvm->irqfds.lock is held */ 153 200 static bool 154 - irqfd_is_active(struct _irqfd *irqfd) 201 + irqfd_is_active(struct kvm_kernel_irqfd *irqfd) 155 202 { 156 203 return list_empty(&irqfd->list) ? 
false : true; 157 204 } ··· 162 209 * assumes kvm->irqfds.lock is held 163 210 */ 164 211 static void 165 - irqfd_deactivate(struct _irqfd *irqfd) 212 + irqfd_deactivate(struct kvm_kernel_irqfd *irqfd) 166 213 { 167 214 BUG_ON(!irqfd_is_active(irqfd)); 168 215 ··· 171 218 queue_work(irqfd_cleanup_wq, &irqfd->shutdown); 172 219 } 173 220 221 + int __attribute__((weak)) kvm_arch_set_irq_inatomic( 222 + struct kvm_kernel_irq_routing_entry *irq, 223 + struct kvm *kvm, int irq_source_id, 224 + int level, 225 + bool line_status) 226 + { 227 + return -EWOULDBLOCK; 228 + } 229 + 174 230 /* 175 231 * Called with wqh->lock held and interrupts disabled 176 232 */ 177 233 static int 178 234 irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) 179 235 { 180 - struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait); 236 + struct kvm_kernel_irqfd *irqfd = 237 + container_of(wait, struct kvm_kernel_irqfd, wait); 181 238 unsigned long flags = (unsigned long)key; 182 239 struct kvm_kernel_irq_routing_entry irq; 183 240 struct kvm *kvm = irqfd->kvm; ··· 201 238 irq = irqfd->irq_entry; 202 239 } while (read_seqcount_retry(&irqfd->irq_entry_sc, seq)); 203 240 /* An event has been signaled, inject an interrupt */ 204 - if (irq.type == KVM_IRQ_ROUTING_MSI) 205 - kvm_set_msi(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, 206 - false); 207 - else 241 + if (kvm_arch_set_irq_inatomic(&irq, kvm, 242 + KVM_USERSPACE_IRQ_SOURCE_ID, 1, 243 + false) == -EWOULDBLOCK) 208 244 schedule_work(&irqfd->inject); 209 245 srcu_read_unlock(&kvm->irq_srcu, idx); 210 246 } ··· 236 274 irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh, 237 275 poll_table *pt) 238 276 { 239 - struct _irqfd *irqfd = container_of(pt, struct _irqfd, pt); 277 + struct kvm_kernel_irqfd *irqfd = 278 + container_of(pt, struct kvm_kernel_irqfd, pt); 240 279 add_wait_queue(wqh, &irqfd->wait); 241 280 } 242 281 243 282 /* Must be called under irqfds.lock */ 244 - static void irqfd_update(struct kvm 
-static void irqfd_update(struct kvm *kvm, struct _irqfd *irqfd)
+static void irqfd_update(struct kvm *kvm, struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
-	int i, n_entries;
+	int n_entries;
 
 	n_entries = kvm_irq_map_gsi(kvm, entries, irqfd->gsi);
 
 	write_seqcount_begin(&irqfd->irq_entry_sc);
 
-	irqfd->irq_entry.type = 0;
-
 	e = entries;
-	for (i = 0; i < n_entries; ++i, ++e) {
-		/* Only fast-path MSI. */
-		if (e->type == KVM_IRQ_ROUTING_MSI)
-			irqfd->irq_entry = *e;
-	}
+	if (n_entries == 1)
+		irqfd->irq_entry = *e;
+	else
+		irqfd->irq_entry.type = 0;
 
 	write_seqcount_end(&irqfd->irq_entry_sc);
 }
 
+#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
+void __attribute__((weak)) kvm_arch_irq_bypass_stop(
+				struct irq_bypass_consumer *cons)
+{
+}
+
+void __attribute__((weak)) kvm_arch_irq_bypass_start(
+				struct irq_bypass_consumer *cons)
+{
+}
+
+int __attribute__((weak)) kvm_arch_update_irqfd_routing(
+				struct kvm *kvm, unsigned int host_irq,
+				uint32_t guest_irq, bool set)
+{
+	return 0;
+}
+#endif
+
 static int
 kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	struct _irqfd *irqfd, *tmp;
+	struct kvm_kernel_irqfd *irqfd, *tmp;
 	struct fd f;
 	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
 	int ret;
···
 	irqfd->eventfd = eventfd;
 
 	if (args->flags & KVM_IRQFD_FLAG_RESAMPLE) {
-		struct _irqfd_resampler *resampler;
+		struct kvm_kernel_irqfd_resampler *resampler;
 
 		resamplefd = eventfd_ctx_fdget(args->resamplefd);
 		if (IS_ERR(resamplefd)) {
···
 	 * we might race against the POLLHUP
 	 */
 	fdput(f);
+#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
+	irqfd->consumer.token = (void *)irqfd->eventfd;
+	irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer;
+	irqfd->consumer.del_producer = kvm_arch_irq_bypass_del_producer;
+	irqfd->consumer.stop = kvm_arch_irq_bypass_stop;
+	irqfd->consumer.start = kvm_arch_irq_bypass_start;
+	ret = irq_bypass_register_consumer(&irqfd->consumer);
+	if (ret)
+		pr_info("irq bypass consumer (token %p) registration fails: %d\n",
+				irqfd->consumer.token, ret);
+#endif
 
 	return 0;
 
···
 }
 EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
 
-void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
+void kvm_notify_acked_gsi(struct kvm *kvm, int gsi)
 {
 	struct kvm_irq_ack_notifier *kian;
+
+	hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
+				 link)
+		if (kian->gsi == gsi)
+			kian->irq_acked(kian);
+}
+
+void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
+{
 	int gsi, idx;
 
 	trace_kvm_ack_irq(irqchip, pin);
···
 	idx = srcu_read_lock(&kvm->irq_srcu);
 	gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
 	if (gsi != -1)
-		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
-					 link)
-			if (kian->gsi == gsi)
-				kian->irq_acked(kian);
+		kvm_notify_acked_gsi(kvm, gsi);
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 }
···
 static int
 kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	struct _irqfd *irqfd, *tmp;
+	struct kvm_kernel_irqfd *irqfd, *tmp;
 	struct eventfd_ctx *eventfd;
 
 	eventfd = eventfd_ctx_fdget(args->fd);
···
 void
 kvm_irqfd_release(struct kvm *kvm)
 {
-	struct _irqfd *irqfd, *tmp;
+	struct kvm_kernel_irqfd *irqfd, *tmp;
 
 	spin_lock_irq(&kvm->irqfds.lock);
···
  */
 void
 kvm_irq_routing_update(struct kvm *kvm)
 {
-	struct _irqfd *irqfd;
+	struct kvm_kernel_irqfd *irqfd;
 
 	spin_lock_irq(&kvm->irqfds.lock);
 
-	list_for_each_entry(irqfd, &kvm->irqfds.items, list)
+	list_for_each_entry(irqfd, &kvm->irqfds.items, list) {
 		irqfd_update(kvm, irqfd);
+
+#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
+		if (irqfd->producer) {
+			int ret = kvm_arch_update_irqfd_routing(
+					irqfd->kvm, irqfd->producer->irq,
+					irqfd->gsi, 1);
+			WARN_ON(ret);
+		}
+#endif
+	}
 
 	spin_unlock_irq(&kvm->irqfds.lock);
 }
···
 		return -EINVAL;
 
 	/* ioeventfd with no length can't be combined with DATAMATCH */
-	if (!args->len &&
-	    args->flags & (KVM_IOEVENTFD_FLAG_PIO |
-			   KVM_IOEVENTFD_FLAG_DATAMATCH))
+	if (!args->len && (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH))
 		return -EINVAL;
 
 	ret = kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
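The `irqfd_update()` hunk above replaces the MSI-only scan with a simpler rule: cache the routing entry whenever the GSI resolves to exactly one entry, of any type; otherwise clear the cached type so readers fall back to the slow path. A minimal userspace sketch of that rule, with hypothetical names (`routing_entry`, `irqfd_model`, `irqfd_update_model` are illustrative, not the kernel's types), omitting the seqcount protection the real code uses:

```c
/* Hypothetical userspace model of the new irqfd_update() fast-path rule. */

struct routing_entry {
	int type;	/* 0 plays the role of "no cached fast-path entry" */
	int data;
};

struct irqfd_model {
	struct routing_entry irq_entry;	/* the cached entry */
};

/* entries[]/n_entries stand in for kvm_irq_map_gsi()'s output. */
static void irqfd_update_model(struct irqfd_model *irqfd,
			       const struct routing_entry *entries,
			       int n_entries)
{
	if (n_entries == 1)
		irqfd->irq_entry = entries[0];	/* single entry: cache it */
	else
		irqfd->irq_entry.type = 0;	/* zero or many: slow path */
}
```

The point of the change: with one-to-one GSI mappings enforced by the routing code, any single entry is safe to cache, not just MSI.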
+5 -13
virt/kvm/irqchip.c
···
 #include <trace/events/kvm.h>
 #include "irq.h"
 
-struct kvm_irq_routing_table {
-	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
-	u32 nr_rt_entries;
-	/*
-	 * Array indexed by gsi. Each entry contains list of irq chips
-	 * the gsi is connected to.
-	 */
-	struct hlist_head map[0];
-};
-
 int kvm_irq_map_gsi(struct kvm *kvm,
 		    struct kvm_kernel_irq_routing_entry *entries, int gsi)
 {
···
 
 	/*
 	 * Do not allow GSI to be mapped to the same irqchip more than once.
-	 * Allow only one to one mapping between GSI and MSI.
+	 * Allow only one to one mapping between GSI and non-irqchip routing.
 	 */
 	hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
-		if (ei->type == KVM_IRQ_ROUTING_MSI ||
-		    ue->type == KVM_IRQ_ROUTING_MSI ||
+		if (ei->type != KVM_IRQ_ROUTING_IRQCHIP ||
+		    ue->type != KVM_IRQ_ROUTING_IRQCHIP ||
 		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
 			return r;
···
 	rcu_assign_pointer(kvm->irq_routing, new);
 	kvm_irq_routing_update(kvm);
 	mutex_unlock(&kvm->irq_lock);
+
+	kvm_arch_irq_routing_update(kvm);
 
 	synchronize_srcu_expedited(&kvm->irq_srcu);
 
+9 -2
virt/kvm/kvm_main.c
···
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
 
+	vcpu->pre_pcpu = -1;
+	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
+
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
···
 		} while (single_task_running() && ktime_before(cur, stop));
 	}
 
+	kvm_arch_vcpu_blocking(vcpu);
+
 	for (;;) {
 		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
 
···
 	finish_wait(&vcpu->wq, &wait);
 	cur = ktime_get();
 
+	kvm_arch_vcpu_unblocking(vcpu);
 out:
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
 
···
 	case KVM_CAP_IRQFD:
 	case KVM_CAP_IRQFD_RESAMPLE:
 #endif
+	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
 	case KVM_CAP_CHECK_EXTENSION_VM:
 		return 1;
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
···
 	if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
 		return -ENOSPC;
 
-	new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
+	new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count + 1) *
 			  sizeof(struct kvm_io_range)), GFP_KERNEL);
 	if (!new_bus)
 		return -ENOMEM;
···
 	if (r)
 		return r;
 
-	new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count - 1) *
+	new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count - 1) *
 			  sizeof(struct kvm_io_range)), GFP_KERNEL);
 	if (!new_bus)
 		return -ENOMEM;
+2
virt/lib/Kconfig
+config IRQ_BYPASS_MANAGER
+	tristate
+1
virt/lib/Makefile
+obj-$(CONFIG_IRQ_BYPASS_MANAGER) += irqbypass.o
+257
virt/lib/irqbypass.c
+/*
+ * IRQ offload/bypass manager
+ *
+ * Copyright (C) 2015 Red Hat, Inc.
+ * Copyright (c) 2015 Linaro Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Various virtualization hardware acceleration techniques allow bypassing or
+ * offloading interrupts received from devices around the host kernel.  Posted
+ * Interrupts on Intel VT-d systems can allow interrupts to be received
+ * directly by a virtual machine.  ARM IRQ Forwarding allows forwarded physical
+ * interrupts to be directly deactivated by the guest.  This manager allows
+ * interrupt producers and consumers to find each other to enable this sort of
+ * bypass.
+ */
+
+#include <linux/irqbypass.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("IRQ bypass manager utility module");
+
+static LIST_HEAD(producers);
+static LIST_HEAD(consumers);
+static DEFINE_MUTEX(lock);
+
+/* @lock must be held when calling connect */
+static int __connect(struct irq_bypass_producer *prod,
+		     struct irq_bypass_consumer *cons)
+{
+	int ret = 0;
+
+	if (prod->stop)
+		prod->stop(prod);
+	if (cons->stop)
+		cons->stop(cons);
+
+	if (prod->add_consumer)
+		ret = prod->add_consumer(prod, cons);
+
+	if (!ret) {
+		ret = cons->add_producer(cons, prod);
+		if (ret && prod->del_consumer)
+			prod->del_consumer(prod, cons);
+	}
+
+	if (cons->start)
+		cons->start(cons);
+	if (prod->start)
+		prod->start(prod);
+
+	return ret;
+}
+
+/* @lock must be held when calling disconnect */
+static void __disconnect(struct irq_bypass_producer *prod,
+			 struct irq_bypass_consumer *cons)
+{
+	if (prod->stop)
+		prod->stop(prod);
+	if (cons->stop)
+		cons->stop(cons);
+
+	cons->del_producer(cons, prod);
+
+	if (prod->del_consumer)
+		prod->del_consumer(prod, cons);
+
+	if (cons->start)
+		cons->start(cons);
+	if (prod->start)
+		prod->start(prod);
+}
+
+/**
+ * irq_bypass_register_producer - register IRQ bypass producer
+ * @producer: pointer to producer structure
+ *
+ * Add the provided IRQ producer to the list of producers and connect
+ * with any matching token found on the IRQ consumers list.
+ */
+int irq_bypass_register_producer(struct irq_bypass_producer *producer)
+{
+	struct irq_bypass_producer *tmp;
+	struct irq_bypass_consumer *consumer;
+
+	might_sleep();
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&lock);
+
+	list_for_each_entry(tmp, &producers, node) {
+		if (tmp->token == producer->token) {
+			mutex_unlock(&lock);
+			module_put(THIS_MODULE);
+			return -EBUSY;
+		}
+	}
+
+	list_for_each_entry(consumer, &consumers, node) {
+		if (consumer->token == producer->token) {
+			int ret = __connect(producer, consumer);
+			if (ret) {
+				mutex_unlock(&lock);
+				module_put(THIS_MODULE);
+				return ret;
+			}
+			break;
+		}
+	}
+
+	list_add(&producer->node, &producers);
+
+	mutex_unlock(&lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_bypass_register_producer);
+
+/**
+ * irq_bypass_unregister_producer - unregister IRQ bypass producer
+ * @producer: pointer to producer structure
+ *
+ * Remove a previously registered IRQ producer from the list of producers
+ * and disconnect it from any connected IRQ consumer.
+ */
+void irq_bypass_unregister_producer(struct irq_bypass_producer *producer)
+{
+	struct irq_bypass_producer *tmp;
+	struct irq_bypass_consumer *consumer;
+
+	might_sleep();
+
+	if (!try_module_get(THIS_MODULE))
+		return; /* nothing in the list anyway */
+
+	mutex_lock(&lock);
+
+	list_for_each_entry(tmp, &producers, node) {
+		if (tmp->token != producer->token)
+			continue;
+
+		list_for_each_entry(consumer, &consumers, node) {
+			if (consumer->token == producer->token) {
+				__disconnect(producer, consumer);
+				break;
+			}
+		}
+
+		list_del(&producer->node);
+		module_put(THIS_MODULE);
+		break;
+	}
+
+	mutex_unlock(&lock);
+
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL_GPL(irq_bypass_unregister_producer);
+
+/**
+ * irq_bypass_register_consumer - register IRQ bypass consumer
+ * @consumer: pointer to consumer structure
+ *
+ * Add the provided IRQ consumer to the list of consumers and connect
+ * with any matching token found on the IRQ producer list.
+ */
+int irq_bypass_register_consumer(struct irq_bypass_consumer *consumer)
+{
+	struct irq_bypass_consumer *tmp;
+	struct irq_bypass_producer *producer;
+
+	if (!consumer->add_producer || !consumer->del_producer)
+		return -EINVAL;
+
+	might_sleep();
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&lock);
+
+	list_for_each_entry(tmp, &consumers, node) {
+		if (tmp->token == consumer->token) {
+			mutex_unlock(&lock);
+			module_put(THIS_MODULE);
+			return -EBUSY;
+		}
+	}
+
+	list_for_each_entry(producer, &producers, node) {
+		if (producer->token == consumer->token) {
+			int ret = __connect(producer, consumer);
+			if (ret) {
+				mutex_unlock(&lock);
+				module_put(THIS_MODULE);
+				return ret;
+			}
+			break;
+		}
+	}
+
+	list_add(&consumer->node, &consumers);
+
+	mutex_unlock(&lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_bypass_register_consumer);
+
+/**
+ * irq_bypass_unregister_consumer - unregister IRQ bypass consumer
+ * @consumer: pointer to consumer structure
+ *
+ * Remove a previously registered IRQ consumer from the list of consumers
+ * and disconnect it from any connected IRQ producer.
+ */
+void irq_bypass_unregister_consumer(struct irq_bypass_consumer *consumer)
+{
+	struct irq_bypass_consumer *tmp;
+	struct irq_bypass_producer *producer;
+
+	might_sleep();
+
+	if (!try_module_get(THIS_MODULE))
+		return; /* nothing in the list anyway */
+
+	mutex_lock(&lock);
+
+	list_for_each_entry(tmp, &consumers, node) {
+		if (tmp->token != consumer->token)
+			continue;
+
+		list_for_each_entry(producer, &producers, node) {
+			if (producer->token == consumer->token) {
+				__disconnect(producer, consumer);
+				break;
+			}
+		}
+
+		list_del(&consumer->node);
+		module_put(THIS_MODULE);
+		break;
+	}
+
+	mutex_unlock(&lock);
+
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL_GPL(irq_bypass_unregister_consumer);
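The heart of the new bypass manager is token matching: a producer (e.g. a VFIO interrupt) and a consumer (e.g. a KVM irqfd) pair up when they register with equal opaque tokens. A minimal userspace sketch of that matching, with hypothetical names (`bypass_endpoint`, `bypass_register` are illustrative only; the real API uses separate producer/consumer lists, a mutex, and stop/start callbacks around the connect):

```c
#include <stddef.h>

/* Hypothetical userspace model of the bypass manager's token matching. */

struct bypass_endpoint {
	void *token;			/* opaque identity, e.g. an eventfd ctx */
	struct bypass_endpoint *peer;	/* connected counterpart, or NULL */
};

/* Scan the opposite-side list for a matching token and connect both
 * ends, mirroring the list walk in irq_bypass_register_producer(). */
static int bypass_register(struct bypass_endpoint *self,
			   struct bypass_endpoint *others[], size_t n)
{
	size_t i;

	self->peer = NULL;
	for (i = 0; i < n; i++) {
		if (others[i]->token == self->token) {
			self->peer = others[i];
			others[i]->peer = self;
			return 1;	/* connected */
		}
	}
	return 0;	/* registered, no match yet */
}
```

In KVM's use above, the token is the irqfd's eventfd context pointer, which both VFIO and KVM can derive independently from the same file descriptor.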