Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
Clone this repository
For self-hosted knots, clone URLs may differ based on your setup.
Download tar.gz
Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
out, as the kernel will switch CR3 during __text_poke(), e.g. in response
to a static key toggling. If switch_mm_irqs_off() chooses a new ASID for
the mm associate with KVM, KVM will do VM-Enter => VM-Exit with a stale
vmcs.HOST_CR3.
Add a comment to explain why KVM must wait until VM-Enter is imminent to
refresh vmcs.HOST_CR3.
The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
loaded for the KVM-associated mm while KVM has a "running" vCPU:
static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
{
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
...
WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
(vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
"KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
}
------------[ cut here ]------------
KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
CPU: 4 PID: 20717 Comm: stable Tainted: G W 5.17.0-rc3+ #747
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:load_new_mm_cr3+0x82/0xe0
RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
FS: 00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
Call Trace:
<TASK>
switch_mm_irqs_off+0x1cb/0x460
__text_poke+0x308/0x3e0
text_poke_bp_batch+0x168/0x220
text_poke_finish+0x1b/0x30
arch_jump_label_transform_apply+0x18/0x30
static_key_slow_inc_cpuslocked+0x7c/0x90
static_key_slow_inc+0x16/0x20
kvm_lapic_set_base+0x116/0x190
kvm_set_apic_base+0xa5/0xe0
kvm_set_msr_common+0x2f4/0xf60
vmx_set_msr+0x355/0xe70 [kvm_intel]
kvm_set_msr_ignored_check+0x91/0x230
kvm_emulate_wrmsr+0x36/0x120
vmx_handle_exit+0x609/0x6c0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
kvm_vcpu_ioctl+0x279/0x690
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
---[ end trace 0000000000000000 ]---
This reverts commit 15ad9762d69fd8e40a4a51828c1d6b0c1b8fbea0.
Fixes: 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Reported-by: Wanpeng Li <kernellwp@gmail.com>
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <20220224191917.3508476-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Undo a nested VMX fix as a step toward reverting the commit it fixed,
15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
as the underlying premise that "host CR3 in the vcpu thread can only be
changed when scheduling" is wrong.
This reverts commit a9f2705ec84449e3b8d70c804766f8e97e23080d.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220224191917.3508476-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If nested tsc scaling is disabled, MSR_AMD64_TSC_RATIO should
never have non default value.
Due to way nested tsc scaling support was implmented in qemu,
it would set this msr to 0 when nested tsc scaling was disabled.
Ignore that value for now, as it causes no harm.
Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220223115649.319134-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In current async pagefault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the Guest. This function test token value of struct
kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a
READY event is finished by Guest. If value is zero meaning that a READY
event is done, so the KVM can deliver another.
But the kvm_arch_setup_async_pf() may produce a valid token with zero
value, which is confused with previous mention and may lead the loss of
this READY event.
This bug may cause task blocked forever in Guest:
INFO: task stress:7532 blocked for more than 1254 seconds.
Not tainted 5.10.0 #16
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress state:D stack: 0 pid: 7532 ppid: 1409
flags:0x00000080
Call Trace:
__schedule+0x1e7/0x650
schedule+0x46/0xb0
kvm_async_pf_task_wait_schedule+0xad/0xe0
? exit_to_user_mode_prepare+0x60/0x70
__kvm_handle_async_pf+0x4f/0xb0
? asm_exc_page_fault+0x8/0x30
exc_page_fault+0x6f/0x110
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x402d00
RSP: 002b:00007ffd31912500 EFLAGS: 00010206
RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000
Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
By request of Nick Piggin:
> Patch 3 requires a KVM_CAP_PPC number allocated. QEMU maintainers are
> happy with it (link in changelog) just waiting on KVM upstreaming. Do
> you have objections to the series going to ppc/kvm tree first, or
> another option is you could take patch 3 alone first (it's relatively
> independent of the other 2) and ppc/kvm gets it from you?
Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if
only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable
pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add KVM_CAP_PPC_AIL_MODE_3 to advertise the capability to set the AIL
resource mode to 3 with the H_SET_MODE hypercall. This capability
differs between processor types and KVM types (PR, HV, Nested HV), and
affects guest-visible behaviour.
QEMU will implement a cap-ail-mode-3 to control this behaviour[1], and
use the KVM CAP if available to determine KVM support[2].
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On non-x86_64 builds, helpers gtod_is_based_on_tsc() and
kvm_guest_supported_xfd() are defined but never used. Because these are
static inline but are in a .c file, some compilers do warn for them with
-Wunused-function, which becomes an error if -Werror is present.
Add #ifdef so they are only defined in x86_64 builds.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220218034100.115702-1-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't mask off the high nybble when configuring a
RAW perf event.
Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-2-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_arch currently contains the guest supported features in both
guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field.
Currently both fields are set to the same value in
kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that.
Since it's not good to keep duplicated data, remove guest_supported_xcr0.
To keep the code more readable, introduce kvm_guest_supported_xcr()
and kvm_guest_supported_xfd() to replace the previous usages of
guest_supported_xcr0.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220217053028.96432-3-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't drop the high nybble when setting up the
config field of a perf_event_attr structure for a call to
perf_event_create_kernel_counter().
Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-1-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel
swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().
When xsave feature is available, the fpu swap is done by:
- xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used
to store the current state of the fpu registers to a buffer.
- xrstor(s) instruction, with (fpu_kernel_cfg.max_features &
XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.
For xsave(s) the mask is used to limit what parts of the fpu regs will
be copied to the buffer. Likewise on xrstor(s), the mask is used to
limit what parts of the fpu regs will be changed.
The mask for xsave(s), the guest's fpstate->xfeatures, is defined on
kvm_arch_vcpu_create(), which (in summary) sets it to all features
supported by the cpu which are enabled on kernel config.
This means that xsave(s) will save to guest buffer all the fpu regs
contents the cpu has enabled when the guest is paused, even if they
are not used.
This would not be an issue, if xrstor(s) would also do that.
xrstor(s)'s mask for host/guest swap is basically every valid feature
contained in kernel config, except XFEATURE_MASK_PKRU.
Accordingto kernel src, it is instead switched in switch_to() and
flush_thread().
Then, the following happens with a host supporting PKRU starts a
guest that does not support it:
1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
4 - guest runs, then switch back to host,
5 - xsave(s) fpu regs to guest fpstate (buffer now have XFEATURE_MASK_PKRU)
6 - xrstor(s) host fpstate to fpu regs.
7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with
XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)
On 5, even though the guest does not support PKRU, it does have the flag
set on guest fpstate, which is transferred to userspace via vcpu ioctl
KVM_GET_XSAVE.
This becomes a problem when the user decides on migrating the above guest
to another machine that does not support PKRU: the new host restores
guest's fpu regs to as they were before (xrstor(s)), but since the new
host don't support PKRU, a general-protection exception ocurs in xrstor(s)
and that crashes the guest.
This can be solved by making the guest's fpstate->user_xfeatures hold
a copy of guest_supported_xcr0. This way, on 7 the only flags copied to
userspace will be the ones compatible to guest requirements, and thus
there will be no issue during migration.
As a bonus, it will also fail if userspace tries to set fpu features
(with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
configuration. Such features will never be returned by KVM_GET_XSAVE
or KVM_GET_XSAVE2.
Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures,
there is not need to set it in kvm_check_cpuid(). So, change
fpstate_realloc() so it does not touch fpstate->user_xfeatures if a
non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid()
calls it.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220217053028.96432-2-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.
To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC. If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.
Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
in fact it can be handled in exactly the same way; the only difference
lies in who has set IRR, whether svm_deliver_interrupt or the processor.
Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
IPI vmexits as well.
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>