Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
Clone this repository
For self-hosted knots, clone URLs may differ based on your setup.
Download tar.gz
AMD processors based on Zen 2 and later microarchitectures do not
support PMCx087 (instruction pipe stalls) which is used as the backing
event for "stalled-cycles-frontend" and "stalled-cycles-backend".
Use PMCx0A9 (cycles where micro-op queue is empty) instead to count
frontend stalls and remove the entry for backend stalls since there
is no direct replacement.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Fixes: 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h")
Link: https://lore.kernel.org/r/03d7fc8fa2a28f9be732116009025bdec1b3ec97.1711352180.git.sandipan.das@amd.com
Currently, the LBR code assumes that LBR Freeze is supported on all processors
when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX]
bit 1 is set. This is incorrect as the availability of the feature is
additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set,
which may not be set for all Zen 4 processors.
Define a new feature bit for LBR and PMC freeze and set the freeze enable bit
(FLBRI) in DebugCtl (MSR 0x1d9) conditionally.
It should still be possible to use LBR without freeze for profile-guided
optimization of user programs by using an user-only branch filter during
profiling. When the user-only filter is enabled, branches are no longer
recorded after the transition to CPL 0 upon PMI arrival. When branch
entries are read in the PMI handler, the branch stack does not change.
E.g.
$ perf record -j any,u -e ex_ret_brn_tkn ./workload
Since the feature bit is visible under flags in /proc/cpuinfo, it can be
used to determine the feasibility of use-cases which require LBR Freeze
to be supported by the hardware such as profile-guided optimization of
kernels.
Fixes: ca5b7c0d9621 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
Add a new word for scattered features because all free bits among the
existing Linux-defined auxiliary flags have been exhausted.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/8380d2a0da469a1f0ad75b8954a79fb689599ff6.1711091584.git.sandipan.das@amd.com
Pull EFI fixes from Ard Biesheuvel:
- Fix logic that is supposed to prevent placement of the kernel image
below LOAD_PHYSICAL_ADDR
- Use the firmware stack in the EFI stub when running in mixed mode
- Clear BSS only once when using mixed mode
- Check efi.get_variable() function pointer for NULL before trying to
call it
* tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: fix panic in kdump kernel
x86/efistub: Don't clear BSS twice in mixed mode
x86/efistub: Call mixed mode boot services on the firmware's stack
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
Pull x86 fixes from Thomas Gleixner:
- Ensure that the encryption mask at boot is properly propagated on
5-level page tables, otherwise the PGD entry is incorrectly set to
non-encrypted, which causes system crashes during boot.
- Undo the deferred 5-level page table setup as it cannot work with
memory encryption enabled.
- Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
to the default value but the cached variable is not, so subsequent
comparisons might yield the wrong result and as a consequence the
result prevents updating the MSR.
- Register the local APIC address only once in the MPPARSE enumeration
to prevent triggering the related WARN_ONs() in the APIC and topology
code.
- Handle the case where no APIC is found gracefully by registering a
fake APIC in the topology code. That makes all related topology
functions work correctly and does not affect the actual APIC driver
code at all.
- Don't evaluate logical IDs during early boot as the local APIC IDs
are not yet enumerated and the invoked function returns an error
code. Nothing requires the logical IDs before the final CPUID
enumeration takes place, which happens after the enumeration.
- Cure the fallout of the per CPU rework on UP which misplaced the
copying of boot_cpu_data to per CPU data so that the final update to
boot_cpu_data got lost which caused inconsistent state and boot
crashes.
- Use copy_from_kernel_nofault() in the kprobes setup as there is no
guarantee that the address can be safely accessed.
- Reorder struct members in struct saved_context to work around another
kmemleak false positive
- Remove the buggy code which tries to update the E820 kexec table for
setup_data as that is never passed to the kexec kernel.
- Update the resource control documentation to use the proper units.
- Fix a Kconfig warning observed with tinyconfig
* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/64: Move 5-level paging global variable assignments back
x86/boot/64: Apply encryption mask to 5-level pagetable update
x86/cpu: Add model number for another Intel Arrow Lake mobile processor
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Documentation/x86: Document that resctrl bandwidth control units are MiB
x86/mpparse: Register APIC address only once
x86/topology: Handle the !APIC case gracefully
x86/topology: Don't evaluate logical IDs during early boot
x86/cpu: Ensure that CPU info updates are propagated on UP
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
x86/pm: Work around false positive kmemleak report in msr_build_context()
x86/kexec: Do not update E820 kexec table for setup_data
x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
Check if get_next_variable() is actually valid pointer before
calling it. In kdump kernel this method is set to NULL that causes
panic during the kexec-ed kernel boot.
Tested with QEMU and OVMF firmware.
Fixes: bad267f9e18f ("efi: verify that variable services are supported")
Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Pull scheduler doc clarification from Thomas Gleixner:
"A single update for the documentation of the base_slice_ns tunable to
clarify that any value which is less than the tick slice has no effect
because the scheduler tick is not guaranteed to happen within the set
time slice"
* tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.
While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.
Fixes: 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
Clearing BSS should only be done once, at the very beginning.
efi_pe_entry() is the entrypoint from the firmware, which may not clear
BSS and so it is done explicitly. However, efi_pe_entry() is also used
as an entrypoint by the mixed mode startup code, in which case BSS will
already have been cleared, and doing it again at this point will corrupt
global variables holding the firmware's GDT/IDT and segment selectors.
So make the memset() conditional on whether the EFI stub is running in
native mode.
Fixes: b3810c5a2cc4a666 ("x86/efistub: Clear decompressor BSS in native EFI entrypoint")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Pull dma-mapping fixes from Christoph Hellwig:
"This has a set of swiotlb alignment fixes for sometimes very long
standing bugs from Will. We've been discussion them for a while and
they should be solid now"
* tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
swiotlb: Fix alignment checks when both allocation and DMA masks are present
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Fix double-allocation of slots due to broken alignment handling
The tunable base_slice_ns is dependent on CONFIG_HZ (i.e. TICK_NSEC)
for any significant performance improvement. The reason being the
scheduler tick is not frequent enough to force preemption when
base_slice expires in case of:
base_slice_ns < TICK_NSEC
The below data is of stress-ng:
Number of CPU: 1
Stressor threads: 4
Time: 30sec
On CONFIG_HZ=1000
| base_slice | avg-run (msec) | context-switches |
| ---------- | -------------- | ---------------- |
| 3ms | 2.914 | 10342 |
| 6ms | 4.857 | 6196 |
| 9ms | 6.754 | 4482 |
| 12ms | 7.872 | 3802 |
| 22ms | 11.294 | 2710 |
| 32ms | 13.425 | 2284 |
On CONFIG_HZ=100
| base_slice | avg-run (msec) | context-switches |
| ---------- | -------------- | ---------------- |
| 3ms | 9.144 | 3337 |
| 6ms | 9.113 | 3301 |
| 9ms | 8.991 | 3315 |
| 12ms | 12.935 | 2328 |
| 22ms | 16.031 | 1915 |
| 32ms | 18.608 | 1622 |
base_slice: the value of base_slice in ms
avg-run (msec): average time of the stressor threads got on cpu before
it got preempted
context-switches: number of context switches for the stress-ng process
Signed-off-by: Mukesh Kumar Chaurasiya <mchauras@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240320173815.927637-2-mchauras@linux.ibm.com
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.
Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.
Fixes: 533568e06b15 ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
Normally, the EFI stub calls into the EFI boot services using the stack
that was live when the stub was entered. According to the UEFI spec,
this stack needs to be at least 128k in size - this might seem large but
all asynchronous processing and event handling in EFI runs from the same
stack and so quite a lot of space may be used in practice.
In mixed mode, the situation is a bit different: the bootloader calls
the 32-bit EFI stub entry point, which calls the decompressor's 32-bit
entry point, where the boot stack is set up, using a fixed allocation
of 16k. This stack is still in use when the EFI stub is started in
64-bit mode, and so all calls back into the EFI firmware will be using
the decompressor's limited boot stack.
Due to the placement of the boot stack right after the boot heap, any
stack overruns have gone unnoticed. However, commit
5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code")
moved the definition of the boot heap into C code, and now the boot
stack is placed right at the base of BSS, where any overruns will
corrupt the end of the .data section.
While it would be possible to work around this by increasing the size of
the boot stack, doing so would affect all x86 systems, and mixed mode
systems are a tiny (and shrinking) fraction of the x86 installed base.
So instead, record the firmware stack pointer value when entering from
the 32-bit firmware, and switch to this stack every time a EFI boot
service call is made.
Cc: <stable@kernel.org> # v6.1+
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Pull timer fixes from Thomas Gleixner:
"Two regression fixes for the timer and timer migration code:
- Prevent endless timer requeuing which is caused by two CPUs racing
out of idle. This happens when the last CPU goes idle and therefore
has to ensure to expire the pending global timers and some other
CPU come out of idle at the same time and the other CPU wins the
race and expires the global queue. This causes the last CPU to
chase ghost timers forever and reprogramming it's clockevent device
endlessly.
Cure this by re-evaluating the wakeup time unconditionally.
- The split into local (pinned) and global timers in the timer wheel
caused a regression for NOHZ full as it broke the idle tracking of
global timers. On NOHZ full this prevents an self IPI being sent
which in turn causes the timer to be not programmed and not being
expired on time.
Restore the idle tracking for the global timer base so that the
self IPI condition for NOHZ full is working correctly again"
* tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix removed self-IPI on global timer's enqueue in nohz_full
timers/migration: Fix endless timer requeue after idle interrupts