Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest

When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
allocated with role.level = 5 and the guest pagetable's root gfn.

And root_sp->spt[0] is also allocated with the same gfn and the same
role, except role.level = 4. Luckily, they are different shadow
pages, but only root_sp->spt[0] is the real translation of the guest
pagetable.

Here comes a problem:

If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
and uses the same gfn as the root page for nested NPT before and after
the switch, the host (hCR4_LA57=1) might reuse the same root_sp for
the guest even though the guest has toggled gCR4_LA57. The guest will
then see unexpected pages mapped, and L2 may exploit the bug to hurt
L1. It is lucky that the problem can't hurt L0.

And three special cases need to be handled:

In some respects the root_sp should be treated like a role.direct=1
sp: its contents are not backed by gptes, so root_sp->gfns is
meaningless. (For a normal high-level sp in shadow paging, sp->gfns is
often unused and kept zero, but it can be relevant and meaningful when
it is used, because those entries are backed by concrete gptes.)

For such a root_sp, the page is just a portal to contribute
root_sp->spt[0]; root_sp->gfns should not be used, and
root_sp->spt[0] should not be dropped if gpte[0] of the guest root
pagetable is changed.

Such a root_sp should not be accounted either.

So add role.passthrough to distinguish such shadow pages in the hash
when gCR4_LA57 is toggled, and fix the above special cases by checking
it in kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Authored by Lai Jiangshan, committed by Paolo Bonzini
84e5ffd0 767d8d8d

+24 -2
Documentation/virt/kvm/x86/mmu.rst (+4)

···
     Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
     bits before Haswell; shadow EPT page tables also cannot use A/D bits
     if the L1 hypervisor does not enable them.
+  role.passthrough:
+    The page is not backed by a guest page table, but its first entry
+    points to one.  This is set if NPT uses 5-level page tables (host
+    CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=0).
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
arch/x86/include/asm/kvm_host.h (+3 -2)

···
  * minimize the size of kvm_memory_slot.arch.gfn_track, i.e. allows allocating
  * 2 bytes per gfn instead of 4 bytes per gfn.
  *
- * Indirect upper-level shadow pages are tracked for write-protection via
+ * Upper-level shadow pages having gptes are tracked for write-protection via
  * gfn_track.  As above, gfn_track is a 16 bit counter, so KVM must not create
  * more than 2^16-1 upper-level shadow pages at a single gfn, otherwise
  * gfn_track will overflow and explosions will ensue.
···
 	unsigned smap_andnot_wp:1;
 	unsigned ad_disabled:1;
 	unsigned guest_mode:1;
-	unsigned :6;
+	unsigned passthrough:1;
+	unsigned :5;
 
 	/*
 	 * This is left at the top of the word so that
arch/x86/kvm/mmu/mmu.c (+16)

···
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
+	if (sp->role.passthrough)
+		return sp->gfn;
+
 	if (!sp->role.direct)
 		return sp->gfns[index];
···
 
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
+	if (sp->role.passthrough) {
+		WARN_ON_ONCE(gfn != sp->gfn);
+		return;
+	}
+
 	if (!sp->role.direct) {
 		sp->gfns[index] = gfn;
 		return;
···
 	if (sp->role.direct)
 		return false;
 
+	if (sp->role.passthrough)
+		return false;
+
 	return true;
 }
···
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (level <= vcpu->arch.mmu->cpu_role.base.level)
+		role.passthrough = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
···
 
 	root_role = cpu_role.base;
 	root_role.level = kvm_mmu_get_tdp_level(vcpu);
+	if (root_role.level == PT64_ROOT_5LEVEL &&
+	    cpu_role.base.level == PT64_ROOT_4LEVEL)
+		root_role.passthrough = 1;
 
 	shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
 	kvm_mmu_new_pgd(vcpu, nested_cr3);
arch/x86/kvm/mmu/paging_tmpl.h (+1)

···
 		.level = 0xf,
 		.access = 0x7,
 		.quadrant = 0x3,
+		.passthrough = 0x1,
 	};
 
 	/*