
KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2

FEAT_NV2 introduces an interesting problem for NV, as VNCR_EL2.BADDR
is a virtual address in the EL2&0 (or EL2, but we thankfully ignore
this) translation regime.

As we need to replicate such mapping in the real EL2, it means that
we need to remember that there is such a translation, and that any
TLBI affecting EL2 can possibly affect this translation.

It also means that any invalidation driven by an MMU notifier must
be able to shoot down any such mapping.

All in all, we need a data structure that represents this mapping,
and that is extremely close to a TLB. Given that we can only use
one of those per vcpu at any given time, we only allocate one.

No effort is made to keep that structure small. If we need to
start caching multiple of them, we may want to revisit that design
point. But for now, it is kept simple so that we can reason about it.

Oh, and add a braindump of how things are supposed to work, because
I will definitely page this out at some point. Yes, pun intended.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>

5 files changed, 85 insertions(+)

arch/arm64/include/asm/kvm_host.h (+5):

···
 	bool reset;
 };
 
+struct vncr_tlb;
+
 struct kvm_vcpu_arch {
 	struct kvm_cpu_context ctxt;
···
 
 	/* Per-vcpu CCSIDR override or NULL */
 	u32 *ccsidr;
+
+	/* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
+	struct vncr_tlb *vncr_tlb;
 };
 
 /*
arch/arm64/include/asm/kvm_nested.h (+3):

···
 int __kvm_translate_va(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 		       struct s1_walk_result *wr, u64 va);
 
+/* VNCR management */
+int kvm_vcpu_allocate_vncr_tlb(struct kvm_vcpu *vcpu);
+
 #endif /* __ARM64_KVM_NESTED_H */
arch/arm64/kvm/arm.c (+4):

···
 		return ret;
 
 	if (vcpu_has_nv(vcpu)) {
+		ret = kvm_vcpu_allocate_vncr_tlb(vcpu);
+		if (ret)
+			return ret;
+
 		ret = kvm_vgic_vcpu_nv_init(vcpu);
 		if (ret)
 			return ret;
arch/arm64/kvm/nested.c (+72):

···
 
 #include "sys_regs.h"
 
+struct vncr_tlb {
+	/* The guest's VNCR_EL2 */
+	u64			gva;
+	struct s1_walk_info	wi;
+	struct s1_walk_result	wr;
+
+	u64			hpa;
+
+	/* -1 when not mapped on a CPU */
+	int			cpu;
+
+	/*
+	 * true if the TLB is valid. Can only be changed with the
+	 * mmu_lock held.
+	 */
+	bool			valid;
+};
+
 /*
  * Ratio of live shadow S2 MMU per vcpu. This is a trade-off between
  * memory usage and potential number of different sets of S2 PTs in
···
 	kvm->arch.nested_mmus = NULL;
 	kvm->arch.nested_mmus_size = 0;
 	kvm_uninit_stage2_mmu(kvm);
+}
+
+/*
+ * Dealing with VNCR_EL2 exposed by the *guest* is a complicated matter:
+ *
+ * - We introduce an internal representation of a vcpu-private TLB,
+ *   representing the mapping between the guest VA contained in VNCR_EL2,
+ *   the IPA the guest's EL2 PTs point to, and the actual PA this lives at.
+ *
+ * - On translation fault from a nested VNCR access, we create such a TLB.
+ *   If there is no mapping to describe, the guest inherits the fault.
+ *   Crucially, no actual mapping is done at this stage.
+ *
+ * - On vcpu_load() in a non-HYP context with HCR_EL2.NV==1, if the above
+ *   TLB exists, we map it in the fixmap for this CPU, and run with it. We
+ *   have to respect the permissions dictated by the guest, but not the
+ *   memory type (FWB is a must).
+ *
+ * - Note that we usually don't do a vcpu_load() on the back of a fault
+ *   (unless we are preempted), so the resolution of a translation fault
+ *   must go via a request that will map the VNCR page in the fixmap.
+ *   vcpu_load() might as well use the same mechanism.
+ *
+ * - On vcpu_put() in a non-HYP context with HCR_EL2.NV==1, if the TLB was
+ *   mapped, we unmap it. Yes it is that simple. The TLB still exists
+ *   though, and may be reused at a later load.
+ *
+ * - On permission fault, we simply forward the fault to the guest's EL2.
+ *   Get out of my way.
+ *
+ * - On any TLBI for the EL2&0 translation regime, we must find any TLB that
+ *   intersects with the TLBI request, invalidate it, and unmap the page
+ *   from the fixmap. Because we need to look at all the vcpu-private TLBs,
+ *   this requires some wide-ranging locking to ensure that nothing races
+ *   against it. This may require some refcounting to avoid the search when
+ *   no such TLB is present.
+ *
+ * - On MMU notifiers, we must invalidate our TLB in a similar way, but
+ *   looking at the IPA instead. The funny part is that there may not be a
+ *   stage-2 mapping for this page if L1 hasn't accessed it using LD/ST
+ *   instructions.
+ */
+
+int kvm_vcpu_allocate_vncr_tlb(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_has_feat(vcpu->kvm, ID_AA64MMFR4_EL1, NV_frac, NV2_ONLY))
+		return 0;
+
+	vcpu->arch.vncr_tlb = kzalloc(sizeof(*vcpu->arch.vncr_tlb),
+				      GFP_KERNEL_ACCOUNT);
+	if (!vcpu->arch.vncr_tlb)
+		return -ENOMEM;
+
+	return 0;
 }
 
 /*
arch/arm64/kvm/reset.c (+1):

···
 	kvm_unshare_hyp(sve_state, sve_state + vcpu_sve_state_size(vcpu));
 	kfree(sve_state);
 	free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
+	kfree(vcpu->arch.vncr_tlb);
 	kfree(vcpu->arch.ccsidr);
 }