Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
code
Clone this repository
https://tangled.org/tjh.dev/kernel
git@gordian.tjh.dev:tjh.dev/kernel
For self-hosted knots, clone URLs may differ based on your setup.
Pull sysfs fix from Al Viro:
"Get rid of lockdep false positives around sysfs/overlayfs
syzbot has uncovered a class of lockdep false positives for setups
with sysfs being one of the backing layers in overlayfs. The root
cause is that of->mutex allocated when opening a sysfs file read-only
(which overlayfs might do) is confused with of->mutex of a file opened
writable (held in write to sysfs file, which overlayfs won't do).
Assigning them separate lockdep classes fixes that bunch and it's
obviously safe"
* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kernfs: annotate different lockdep class for of->mutex of writable files
Pull misc x86 fixes from Ingo Molnar:
- Follow up fixes for the BHI mitigations code
- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
expected
- Work around an APIC emulation bug when the kernel is built with Clang
and run as a SEV guest
- Follow up x86 topology fixes
* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
x86/cpu/amd: Make the NODEID_MSR union actually work
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
x86/bugs: Fix BHI handling of RRSBA
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
x86/bugs: Fix BHI documentation
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
x86/bugs: Fix return type of spectre_bhi_state()
x86/apic: Force native_apic_mem_read() to use the MOV instruction
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.
To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.
Reported-by: syzbot+9a5b0ced8b1bfb238b56@syzkaller.appspotmail.com
Fixes: 0fedefd4c4e3 ("kernfs: sysfs: support custom llseek method for sysfs entries")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull timer fixes from Ingo Molnar:
- Address a (valid) W=1 build warning
- Fix timer self-tests
- Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
global variable
- Address a !CONFIG_BUG build warning
* tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests: kselftest: Fix build failure with NOLIBC
selftests: timers: Fix abs() warning in posix_timers test
selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
selftests: timers: Fix posix_timers ksft_print_msg() warning
selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
bug: Fix no-return-statement warning with !CONFIG_BUG
timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
selftests/timers/posix_timers: Reimplement check_timer_distribution()
irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
The topology rework missed that early_init_amd() tries to re-enable the
Topology Extensions when the BIOS disabled them.
The new parser is invoked before early_init_amd() so the re-enable attempt
happens too late.
Move it into the AMD specific topology parser code where it belongs.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx
Pull perf event fix from Ingo Molnar:
"Fix the x86 PMU multi-counter code returning invalid data in certain
circumstances"
* tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix out of range data
As Mark explains ksft_min_kernel_version() can't be compiled with nolibc,
it doesn't implement uname().
Fixes: 6d029c25b71f ("selftests/timers/posix_timers: Reimplement check_timer_distribution()")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240412123536.GA32444@redhat.com
Closes: https://lore.kernel.org/all/f0523b3a-ea08-4615-b0fb-5b504a2d39df@sirena.org.uk/
A system with NODEID_MSR was reported to crash during early boot without
any output.
The reason is that the union which is used for accessing the bitfields in
the MSR is written wrongly and the resulting executable code accesses the
wrong part of the MSR data.
As a consequence a later division by that value results in 0 and that
result is used for another division as divisor, which obviously does not
work well.
The magic world of C, unions and bitfields:
union {
u64 bita : 3,
bitb : 3;
u64 all;
} x;
x.all = foo();
a = x.bita;
b = x.bitb;
results in the effective executable code of:
a = b = x.bita;
because bita and bitb are treated as union members and therefore both end
up at bit offset 0.
Wrapping the bitfield into an anonymous struct:
union {
struct {
u64 bita : 3,
bitb : 3;
};
u64 all;
} x;
works like expected.
Rework the NODEID_MSR union in exactly that way to cure the problem.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de
Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/
Pull x86 fixes from Ingo Molnar:
- Fix MCE timer reinit locking
- Fix/improve CoCo guest random entropy pool init
- Fix SEV-SNP late disable bugs
- Fix false positive objtool build warning
- Fix header dependency bug
- Fix resctrl CPU offlining bug
* tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()
x86/CPU/AMD: Track SNP host status with cc_platform_*()
x86/cc: Add cc_platform_set/_clear() helpers
x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
x86/coco: Require seeding RNG with RDRAND on CoCo systems
x86/numa/32: Include missing <asm/pgtable_areas.h>
x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline
Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"
* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y
On x86 each struct cpu_hw_events maintains a table for counter assignment but
it missed to update one for the deleted event in x86_pmu_del(). This
can make perf_clear_dirty_counters() reset used counter if it's called
before event scheduling or enabling. Then it would return out of range
data which doesn't make sense.
The following code can reproduce the problem.
$ cat repro.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.disabled = 1,
};
void *worker(void *arg)
{
int cpu = (long)arg;
int fd1 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
int fd2 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
void *p;
do {
ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd1, 0);
ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);
ioctl(fd2, PERF_EVENT_IOC_DISABLE, 0);
munmap(p, 4096);
ioctl(fd1, PERF_EVENT_IOC_DISABLE, 0);
} while (1);
return NULL;
}
int main(void)
{
int i;
int n = sysconf(_SC_NPROCESSORS_ONLN);
pthread_t *th = calloc(n, sizeof(*th));
for (i = 0; i < n; i++)
pthread_create(&th[i], NULL, worker, (void *)(long)i);
for (i = 0; i < n; i++)
pthread_join(th[i], NULL);
free(th);
return 0;
}
And you can see the out of range data using perf stat like this.
Probably it'd be easier to see on a large machine.
$ gcc -o repro repro.c -pthread
$ ./repro &
$ sudo perf stat -A -I 1000 2>&1 | awk '{ if (length($3) > 15) print }'
1.001028462 CPU6 196,719,295,683,763 cycles # 194290.996 GHz (71.54%)
1.001028462 CPU3 396,077,485,787,730 branch-misses # 15804359784.80% of all branches (71.07%)
1.001028462 CPU17 197,608,350,727,877 branch-misses # 14594186554.56% of all branches (71.22%)
2.020064073 CPU4 198,372,472,612,140 cycles # 194681.113 GHz (70.95%)
2.020064073 CPU6 199,419,277,896,696 cycles # 195720.007 GHz (70.57%)
2.020064073 CPU20 198,147,174,025,639 cycles # 194474.654 GHz (71.03%)
2.020064073 CPU20 198,421,240,580,145 stalled-cycles-frontend # 100.14% frontend cycles idle (70.93%)
3.037443155 CPU4 197,382,689,923,416 cycles # 194043.065 GHz (71.30%)
3.037443155 CPU20 196,324,797,879,414 cycles # 193003.773 GHz (71.69%)
3.037443155 CPU5 197,679,956,608,205 stalled-cycles-backend # 1315606428.66% backend cycles idle (71.19%)
3.037443155 CPU5 198,571,860,474,851 instructions # 13215422.58 insn per cycle
It should move the contents in the cpuc->assign as well.
Fixes: 5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240306061003.1894224-1-namhyung@kernel.org
Building with clang results in the following warning:
posix_timers.c:69:6: warning: absolute value function 'abs' given an
argument of type 'long long' but has parameter of type 'int' which may
cause truncation of value [-Wabsolute-value]
if (abs(diff - DELAY * USECS_PER_SEC) > USECS_PER_SEC / 2) {
^
So switch to using llabs() instead.
Fixes: 0bc4b0cf1570 ("selftests: add basic posix timers selftests")
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240410232637.4135564-3-jstultz@google.com
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the
package. The parser uses this value to initialize the SMT domain level.
That's wrong because cpu_nthreads does not describe the number of threads
per physical core. So this needs to set the CORE domain level and let the
later parsers set the SMT shift if available.
Preset the SMT domain level with the assumption of one thread per core,
which is correct ifrt here are no other CPUID leafs to parse, and propagate
cpu_nthreads and the core level APIC bitwidth into the CORE domain.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de
Pull timer fixes from Ingo Molnar:
"Fix various timer bugs:
- Fix a timer migration bug that may result in missed events
- Fix timer migration group hierarchy event updates
- Fix a PowerPC64 build warning
- Fix a handful of DocBook annotation bugs"
* tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Return early on deactivation
timers/migration: Fix ignored event due to missing CPU update
vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h
timers: Fix text inconsistencies and spelling
tick/sched: Fix struct tick_sched doc warnings
tick/sched: Fix various kernel-doc warnings
timers: Fix kernel-doc format and add Return values
time/timekeeping: Fix kernel-doc warnings and typos
time/timecounter: Fix inline documentation