
KVM: nVMX: Documentation

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

[marcelo: move to Documentation/virtual/kvm]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Authored by Nadav Har'El, committed by Avi Kivity
823e3965 2844d849

Documentation/virtual/kvm/nested-vmx.txt | +251
Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Known limitations
-----------------

The current code supports running Linux guests under KVM guests.
Only 64-bit guest hypervisors are supported.

Additional patches for running Windows under guest KVM, for running Linux
under guest VMware Server, and for nested EPT support are currently being
tested in the lab, and will be sent as follow-on patchsets.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled by giving qemu one of the following options:

     -cpu host              (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested in the
internals of this structure; it is struct vmcs12 from arch/x86/kvm/vmx.c.

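As an illustration of this access model, here is a minimal sketch of reading
and writing a field of the current VMCS with the VMREAD/VMWRITE instructions.
It is modelled on, but is not, the vmcs_readl()/vmcs_writel() helpers in
arch/x86/kvm/vmx.c; the field encodings come from the Intel SDM, and checking
of RFLAGS for instruction errors is omitted for brevity:

	/*
	 * Sketch only: must run in ring 0 on x86-64 with VMX enabled and a
	 * current VMCS loaded (VMPTRLD).  Field encodings are defined by the
	 * spec, e.g. GUEST_RIP is 0x681e.
	 */
	static inline unsigned long vmcs_read(unsigned long field)
	{
		unsigned long value;

		asm volatile("vmread %1, %0" : "=rm"(value) : "r"(field) : "cc");
		return value;
	}

	static inline void vmcs_write(unsigned long field, unsigned long value)
	{
		asm volatile("vmwrite %1, %0" : : "r"(field), "rm"(value) : "cc");
	}

When L1 executes these instructions, they trap to L0, which emulates them
against the vmcs12 structure described below; L1 never needs to know how that
structure is laid out.
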
The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2 - how this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, live migration across KVM versions can break.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

typedef u64 natural_width;
struct __packed vmcs12 {
	/* According to the Intel spec, a VMCS region must start with
	 * these two user-visible fields */
	u32 revision_id;
	u32 abort;

	u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
	u32 padding[7]; /* room for future expansion */

	u64 io_bitmap_a;
	u64 io_bitmap_b;
	u64 msr_bitmap;
	u64 vm_exit_msr_store_addr;
	u64 vm_exit_msr_load_addr;
	u64 vm_entry_msr_load_addr;
	u64 tsc_offset;
	u64 virtual_apic_page_addr;
	u64 apic_access_addr;
	u64 ept_pointer;
	u64 guest_physical_address;
	u64 vmcs_link_pointer;
	u64 guest_ia32_debugctl;
	u64 guest_ia32_pat;
	u64 guest_ia32_efer;
	u64 guest_pdptr0;
	u64 guest_pdptr1;
	u64 guest_pdptr2;
	u64 guest_pdptr3;
	u64 host_ia32_pat;
	u64 host_ia32_efer;
	u64 padding64[8]; /* room for future expansion */
	natural_width cr0_guest_host_mask;
	natural_width cr4_guest_host_mask;
	natural_width cr0_read_shadow;
	natural_width cr4_read_shadow;
	natural_width cr3_target_value0;
	natural_width cr3_target_value1;
	natural_width cr3_target_value2;
	natural_width cr3_target_value3;
	natural_width exit_qualification;
	natural_width guest_linear_address;
	natural_width guest_cr0;
	natural_width guest_cr3;
	natural_width guest_cr4;
	natural_width guest_es_base;
	natural_width guest_cs_base;
	natural_width guest_ss_base;
	natural_width guest_ds_base;
	natural_width guest_fs_base;
	natural_width guest_gs_base;
	natural_width guest_ldtr_base;
	natural_width guest_tr_base;
	natural_width guest_gdtr_base;
	natural_width guest_idtr_base;
	natural_width guest_dr7;
	natural_width guest_rsp;
	natural_width guest_rip;
	natural_width guest_rflags;
	natural_width guest_pending_dbg_exceptions;
	natural_width guest_sysenter_esp;
	natural_width guest_sysenter_eip;
	natural_width host_cr0;
	natural_width host_cr3;
	natural_width host_cr4;
	natural_width host_fs_base;
	natural_width host_gs_base;
	natural_width host_tr_base;
	natural_width host_gdtr_base;
	natural_width host_idtr_base;
	natural_width host_ia32_sysenter_esp;
	natural_width host_ia32_sysenter_eip;
	natural_width host_rsp;
	natural_width host_rip;
	natural_width paddingl[8]; /* room for future expansion */
	u32 pin_based_vm_exec_control;
	u32 cpu_based_vm_exec_control;
	u32 exception_bitmap;
	u32 page_fault_error_code_mask;
	u32 page_fault_error_code_match;
	u32 cr3_target_count;
	u32 vm_exit_controls;
	u32 vm_exit_msr_store_count;
	u32 vm_exit_msr_load_count;
	u32 vm_entry_controls;
	u32 vm_entry_msr_load_count;
	u32 vm_entry_intr_info_field;
	u32 vm_entry_exception_error_code;
	u32 vm_entry_instruction_len;
	u32 tpr_threshold;
	u32 secondary_vm_exec_control;
	u32 vm_instruction_error;
	u32 vm_exit_reason;
	u32 vm_exit_intr_info;
	u32 vm_exit_intr_error_code;
	u32 idt_vectoring_info_field;
	u32 idt_vectoring_error_code;
	u32 vm_exit_instruction_len;
	u32 vmx_instruction_info;
	u32 guest_es_limit;
	u32 guest_cs_limit;
	u32 guest_ss_limit;
	u32 guest_ds_limit;
	u32 guest_fs_limit;
	u32 guest_gs_limit;
	u32 guest_ldtr_limit;
	u32 guest_tr_limit;
	u32 guest_gdtr_limit;
	u32 guest_idtr_limit;
	u32 guest_es_ar_bytes;
	u32 guest_cs_ar_bytes;
	u32 guest_ss_ar_bytes;
	u32 guest_ds_ar_bytes;
	u32 guest_fs_ar_bytes;
	u32 guest_gs_ar_bytes;
	u32 guest_ldtr_ar_bytes;
	u32 guest_tr_ar_bytes;
	u32 guest_interruptibility_info;
	u32 guest_activity_state;
	u32 guest_sysenter_cs;
	u32 host_ia32_sysenter_cs;
	u32 padding32[8]; /* room for future expansion */
	u16 virtual_processor_id;
	u16 guest_es_selector;
	u16 guest_cs_selector;
	u16 guest_ss_selector;
	u16 guest_ds_selector;
	u16 guest_fs_selector;
	u16 guest_gs_selector;
	u16 guest_ldtr_selector;
	u16 guest_tr_selector;
	u16 host_es_selector;
	u16 host_cs_selector;
	u16 host_ss_selector;
	u16 host_ds_selector;
	u16 host_fs_selector;
	u16 host_gs_selector;
	u16 host_tr_selector;
};

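The revision_id field is what makes the versioning rule above enforceable:
when L1 executes VMPTRLD, L0 can compare the revision_id of the given region
against the VMCS12_REVISION it was built with, and refuse a vmcs12 whose
layout it does not understand (for example, one produced by a different KVM
version on the source host of a live migration). A kernel-context sketch of
such a check follows; the helper name is illustrative, not the actual kernel
function, and VMCS12_REVISION itself is defined in arch/x86/kvm/vmx.c:

	#define VMCS12_REVISION 0x11e57ed0	/* as defined in arch/x86/kvm/vmx.c */

	static bool vmcs12_layout_supported(struct vmcs12 *vmcs12)
	{
		/* A mismatched revision_id means a different struct vmcs12
		 * layout; reject it rather than misinterpret its fields. */
		return vmcs12->revision_id == VMCS12_REVISION;
	}
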

Authors
-------

These patches were written by:
     Abel Gordon, abelg <at> il.ibm.com
     Nadav Har'El, nyh <at> il.ibm.com
     Orit Wasserman, oritw <at> il.ibm.com
     Ben-Ami Yassour, benami <at> il.ibm.com
     Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
     Anthony Liguori, aliguori <at> us.ibm.com
     Mike Day, mdday <at> us.ibm.com
     Michael Factor, factor <at> il.ibm.com
     Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
     Avi Kivity, avi <at> redhat.com
     Gleb Natapov, gleb <at> redhat.com
     Marcelo Tosatti, mtosatti <at> redhat.com
     Kevin Tian, kevin.tian <at> intel.com
     and others.