
KVM: s390: Rework guest entry logic

In __vcpu_run() and do_vsie_run(), we enter an RCU extended quiescent
state (EQS) by calling guest_enter_irqoff(), which lasts until
__vcpu_run() calls guest_exit_irqoff(). However, between the two we
enable interrupts and may handle interrupts during the EQS. As the IRQ
entry code will not wake RCU in this case, we may run the core IRQ code
and IRQ handler without RCU watching, leading to various potential
problems.

It is necessary to unmask (host) interrupts around entering the guest,
as entering the guest via SIE will not automatically unmask these. When
a host interrupt is taken from a guest, it is taken via its regular
host IRQ handler rather than being treated as a direct exit from SIE.
Due to this, we cannot simply mask interrupts around guest entry, and
must handle interrupts during this window, waking RCU as required.

Additionally, between guest_enter_irqoff() and guest_exit_irqoff(), we
use local_irq_enable() and local_irq_disable() to unmask interrupts,
violating the ordering requirements for RCU/lockdep/tracing around
entry/exit sequences. Further, since this occurs in an instrumentable
function, it's possible that instrumented code runs during this window,
with potential usage of RCU, etc.

To fix the RCU wakeup problem, an s390 implementation of
arch_in_rcu_eqs() is added which checks for PF_VCPU in current->flags.
PF_VCPU is set/cleared by guest_timing_{enter,exit}_irqoff(), which
surround the actual guest entry.

To fix the remaining issues, the lower-level guest entry logic is moved
into a shared noinstr helper function using the
guest_state_{enter,exit}_irqoff() helpers. These perform all the
lockdep/RCU/tracing manipulation necessary, but as sie64a() does not
enable/disable interrupts, we must do this explicitly with the
non-instrumented arch_local_irq_{enable,disable}() helpers:

	guest_state_enter_irqoff();

	arch_local_irq_enable();
	sie64a(...);
	arch_local_irq_disable();

	guest_state_exit_irqoff();

[ajd@linux.ibm.com: rebase, fix commit message]

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250708092742.104309-3-ajd@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20250708092742.104309-3-ajd@linux.ibm.com>

Authored by Mark Rutland, committed by Janosch Frank (57d88f02 ee4a2e08)

4 files changed, 59 insertions(+), 22 deletions(-)

arch/s390/include/asm/entry-common.h (+10):

 #define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
 
+static __always_inline bool arch_in_rcu_eqs(void)
+{
+	if (IS_ENABLED(CONFIG_KVM))
+		return current->flags & PF_VCPU;
+
+	return false;
+}
+
+#define arch_in_rcu_eqs arch_in_rcu_eqs
+
 #endif

arch/s390/include/asm/kvm_host.h (+3):

 bool kvm_s390_pv_is_protected(struct kvm *kvm);
 bool kvm_s390_pv_cpu_is_protected(struct kvm_vcpu *vcpu);
 
+extern int kvm_s390_enter_exit_sie(struct kvm_s390_sie_block *scb,
+				   u64 *gprs, unsigned long gasce);
+
 extern int kvm_s390_gisc_register(struct kvm *kvm, u32 gisc);
 extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);

arch/s390/kvm/kvm-s390.c (+39 -12):

 	return vcpu_post_run_handle_fault(vcpu);
 }
 
+int noinstr kvm_s390_enter_exit_sie(struct kvm_s390_sie_block *scb,
+				    u64 *gprs, unsigned long gasce)
+{
+	int ret;
+
+	guest_state_enter_irqoff();
+
+	/*
+	 * The guest_state_{enter,exit}_irqoff() functions inform lockdep and
+	 * tracing that entry to the guest will enable host IRQs, and exit from
+	 * the guest will disable host IRQs.
+	 *
+	 * We must not use lockdep/tracing/RCU in this critical section, so we
+	 * use the low-level arch_local_irq_*() helpers to enable/disable IRQs.
+	 */
+	arch_local_irq_enable();
+	ret = sie64a(scb, gprs, gasce);
+	arch_local_irq_disable();
+
+	guest_state_exit_irqoff();
+
+	return ret;
+}
+
 #define PSW_INT_MASK (PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_MCHECK)
 static int __vcpu_run(struct kvm_vcpu *vcpu)
 {

 		kvm_vcpu_srcu_read_unlock(vcpu);
 		/*
 		 * As PF_VCPU will be used in fault handler, between
-		 * guest_enter and guest_exit should be no uaccess.
+		 * guest_timing_enter_irqoff and guest_timing_exit_irqoff
+		 * should be no uaccess.
 		 */
-		local_irq_disable();
-		guest_enter_irqoff();
-		__disable_cpu_timer_accounting(vcpu);
-		local_irq_enable();
 		if (kvm_s390_pv_cpu_is_protected(vcpu)) {
 			memcpy(sie_page->pv_grregs,
 			       vcpu->run->s.regs.gprs,
 			       sizeof(sie_page->pv_grregs));
 		}
-		exit_reason = sie64a(vcpu->arch.sie_block,
-				     vcpu->run->s.regs.gprs,
-				     vcpu->arch.gmap->asce);
+
+		local_irq_disable();
+		guest_timing_enter_irqoff();
+		__disable_cpu_timer_accounting(vcpu);
+
+		exit_reason = kvm_s390_enter_exit_sie(vcpu->arch.sie_block,
+						      vcpu->run->s.regs.gprs,
+						      vcpu->arch.gmap->asce);
+
+		__enable_cpu_timer_accounting(vcpu);
+		guest_timing_exit_irqoff();
+		local_irq_enable();
+
 		if (kvm_s390_pv_cpu_is_protected(vcpu)) {
 			memcpy(vcpu->run->s.regs.gprs,
 			       sie_page->pv_grregs,

 				vcpu->arch.sie_block->gpsw.mask &= ~PSW_INT_MASK;
 			}
 		}
-		local_irq_disable();
-		__enable_cpu_timer_accounting(vcpu);
-		guest_exit_irqoff();
-		local_irq_enable();
 		kvm_vcpu_srcu_read_lock(vcpu);
 
 		rc = vcpu_post_run(vcpu, exit_reason);

arch/s390/kvm/vsie.c (+7 -10):

 	    vcpu->arch.sie_block->fpf & FPF_BPBC)
 		set_thread_flag(TIF_ISOLATE_BP_GUEST);
 
-	local_irq_disable();
-	guest_enter_irqoff();
-	local_irq_enable();
-
 	/*
 	 * Simulate a SIE entry of the VCPU (see sie64a), so VCPU blocking
 	 * and VCPU requests also hinder the vSIE from running and lead

 	vcpu->arch.sie_block->prog0c |= PROG_IN_SIE;
 	current->thread.gmap_int_code = 0;
 	barrier();
-	if (!kvm_s390_vcpu_sie_inhibited(vcpu))
-		rc = sie64a(scb_s, vcpu->run->s.regs.gprs, vsie_page->gmap->asce);
+	if (!kvm_s390_vcpu_sie_inhibited(vcpu)) {
+		local_irq_disable();
+		guest_timing_enter_irqoff();
+		rc = kvm_s390_enter_exit_sie(scb_s, vcpu->run->s.regs.gprs, vsie_page->gmap->asce);
+		guest_timing_exit_irqoff();
+		local_irq_enable();
+	}
 	barrier();
 	vcpu->arch.sie_block->prog0c &= ~PROG_IN_SIE;
-
-	local_irq_disable();
-	guest_exit_irqoff();
-	local_irq_enable();
 
 	/* restore guest state for bp isolation override */
 	if (!guest_bp_isolation)