Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
code
Clone this repository
https://tangled.org/tjh.dev/kernel
git@gordian.tjh.dev:tjh.dev/kernel
For self-hosted knots, clone URLs may differ based on your setup.
Pull x86 fixes from Thomas Gleixner:
- Make RESERVE_BRK() work again with older binutils. The recent
'simplification' broke that.
- Make early #VE handling increment RIP when successful.
- Make the #VE code consistent vs. the RIP adjustments and add
comments.
- Handle load_unaligned_zeropad() across page boundaries correctly in
#VE when the second page is shared.
* tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
x86/tdx: Clarify RIP adjustments in #VE handler
x86/tdx: Fix early #VE handling
x86/mm: Fix RESERVE_BRK() for older binutils
Pull build tooling updates from Thomas Gleixner:
- Remove obsolete CONFIG_X86_SMAP reference from objtool
- Fix overlapping text section failures in faddr2line for real
- Remove OBJECT_FILES_NON_STANDARD usage from x86 ftrace and replace it
with finegrained annotations so objtool can validate that code
correctly.
* tag 'objtool-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usage
faddr2line: Fix overlapping text section failures, the sequel
objtool: Fix obsolete reference to CONFIG_X86_SMAP
load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.
In TDX guests, the second page can be shared page and a VMM may configure
it to trigger #VE.
The kernel assumes that #VE on a shared page is an MMIO access and tries to
decode instruction to handle it. In case of load_unaligned_zeropad() it
may result in confusion as it is not MMIO access.
Fix it by detecting split page MMIO accesses and failing them.
load_unaligned_zeropad() will recover using exception fixups.
The issue was discovered by analysis and reproduced artificially. It was
not triggered during testing.
[ dhansen: fix up changelogs and comments for grammar and clarity,
plus incorporate Kirill's off-by-one fix]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix plugging a race between sched_setscheduler()
and balance_push().
sched_setscheduler() spliced the balance callbacks accross a lock
break which makes it possible for an interleaving schedule() to
observe an empty list"
* tag 'sched-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix balance_push() vs __sched_setscheduler()
The file-wide OBJECT_FILES_NON_STANDARD annotation is used with
CONFIG_FRAME_POINTER to tell objtool to skip the entire file when frame
pointers are enabled. However that annotation is now deprecated because
it doesn't work with IBT, where objtool runs on vmlinux.o instead of
individual translation units.
Instead, use more fine-grained function-specific annotations:
- The 'save_mcount_regs' macro does funny things with the frame pointer.
Use STACK_FRAME_NON_STANDARD_FP to tell objtool to ignore the
functions using it.
- The return_to_handler() "function" isn't actually a callable function.
Instead of being called, it's returned to. The real return address
isn't on the stack, so unwinding is already doomed no matter which
unwinder is used. So just remove the STT_FUNC annotation, telling
objtool to ignore it. That also removes the implicit
ANNOTATE_NOENDBR, which now needs to be made explicit.
Fixes the following warning:
vmlinux.o: warning: objtool: __fentry__+0x16: return with modified stack frame
Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/b7a7a42fe306aca37826043dac89e113a1acdbac.1654268610.git.jpoimboe@kernel.org
After successful #VE handling, tdx_handle_virt_exception() has to move
RIP to the next instruction. The handler needs to know the length of the
instruction.
If the #VE happened due to instruction execution, the GET_VEINFO TDX
module call provides info on the instruction in R10, including its length.
For #VE due to EPT violation, the info in R10 is not populand and the
kernel must decode the instruction manually to find out its length.
Restructure the code to make it explicit that the instruction length
depends on the type of #VE. Make individual #VE handlers return
the instruction length on success or -errno on failure.
[ dhansen: fix up changelog and comments ]
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com
Pull lockdep fix from Thomas Gleixner:
"A RT fix for lockdep.
lockdep invokes prandom_u32() to create cookies. This worked until
prandom_u32() was switched to the real random generator, which takes a
spinlock for extraction, which does not work on RT when invoked from
atomic contexts.
lockdep has no requirement for real random numbers and it turns out
sched_clock() is good enough to create the cookie. That works
everywhere and is faster"
* tag 'locking-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Use sched_clock() for random numbers
The purpose of balance_push() is to act as a filter on task selection
in the case of CPU hotplug, specifically when taking the CPU out.
It does this by (ab)using the balance callback infrastructure, with
the express purpose of keeping all the unlikely/odd cases in a single
place.
In order to serve its purpose, the balance_push_callback needs to be
(exclusively) on the callback list at all times (noting that the
callback always places itself back on the list the moment it runs,
also noting that when the CPU goes down, regular balancing concerns
are moot, so ignoring them is fine).
And here-in lies the problem, __sched_setscheduler()'s use of
splice_balance_callbacks() takes the callbacks off the list across a
lock-break, making it possible for, an interleaving, __schedule() to
see an empty list and not get filtered.
Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()")
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net
If a function lives in a section other than .text, but .text also exists
in the object, faddr2line may wrongly assume .text. This can result in
comically wrong output. For example:
$ scripts/faddr2line vmlinux.o enter_from_user_mode+0x1c
enter_from_user_mode+0x1c/0x30:
find_next_bit at /home/jpoimboe/git/linux/./include/linux/find.h:40
(inlined by) perf_clear_dirty_counters at /home/jpoimboe/git/linux/arch/x86/events/core.c:2504
Fix it by passing the section name to addr2line, unless the object file
is vmlinux, in which case the symbol table uses absolute addresses.
Fixes: 1d1a0e7c5100 ("scripts/faddr2line: Fix overlapping text section failures")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/7d25bc1408bd3a750ac26e60d2f2815a5f4a8363.1654130536.git.jpoimboe@kernel.org
tdx_early_handle_ve() does not increment RIP after successfully
handling the exception. That leads to infinite loop of exceptions.
Move RIP when exceptions are successfully handled.
[ dhansen: make problem statement more clear ]
Fixes: 32e72854fa5f ("x86/tdx: Port I/O: Add early boot support")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com
Pull irq fixes from Thomas Gleixner:
"A set of interrupt subsystem updates:
Core:
- Ensure runtime power management for chained interrupts
Drivers:
- A collection of OF node refcount fixes
- Unbreak MIPS uniprocessor builds
- Fix xilinx interrupt controller Kconfig dependencies
- Add a missing compatible string to the Uniphier driver"
* tag 'irq-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/loongson-liointc: Use architecture register to get coreid
irqchip/uniphier-aidet: Add compatible string for NX1 SoC
dt-bindings: interrupt-controller/uniphier-aidet: Add bindings for NX1 SoC
irqchip/realtek-rtl: Fix refcount leak in map_interrupts
irqchip/gic-v3: Fix refcount leak in gic_populate_ppi_partitions
irqchip/gic-v3: Fix error handling in gic_populate_ppi_partitions
irqchip/apple-aic: Fix refcount leak in aic_of_ic_init
irqchip/apple-aic: Fix refcount leak in build_fiq_affinity
irqchip/gic/realview: Fix refcount leak in realview_gic_of_init
irqchip/xilinx: Remove microblaze+zynq dependency
genirq: PM: Use runtime PM for chained interrupts
Since the rewrote of prandom_u32(), in the commit mentioned below, the
function uses sleeping locks which extracing random numbers and filling
the batch.
This breaks lockdep on PREEMPT_RT because lock_pin_lock() disables
interrupts while calling __lock_pin_lock(). This can't be moved earlier
because the main user of the function (rq_pin_lock()) invokes that
function after disabling interrupts in order to acquire the lock.
The cookie does not require random numbers as its goal is to provide a
random value in order to notice unexpected "unlock + lock" sites.
Use sched_clock() to provide random numbers.
Fixes: a0103f4d86f88 ("random32: use real rng for non-deterministic randomness")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YoNn3pTkm5+QzE5k@linutronix.de