Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
Clone this repository
For self-hosted knots, clone URLs may differ based on your setup.
Download tar.gz
Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.
This matches the behavior described in the documentation:
-1 no request. use system default or follow request of others.
0 no search.
1 search siblings (hyperthreads in a core).
"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.
Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
Using 'hw_pressure' for local variable name is confusing in regard to the
per-CPU 'hw_pressure' variable that uses the same name:
include/linux/arch_topology.h:DECLARE_PER_CPU(unsigned long, hw_pressure);
... which puts it into a global scope for all code that includes
<linux/topology.h>, shadowing the local variable.
Rename it to avoid compiler confusion & Sparse warnings.
[ mingo: Expanded the changelog. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240425073709.379016-1-vincent.guittot@linaro.org
Closes: https://lore.kernel.org/oe-kbuild-all/202404250740.VhQQoD7N-lkp@intel.com/
Fixes: d4dbc991714e ("sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()")
Tested-by: Konrad Dybcio <konrad.dybcio@linaro.org> # QC SM8550 QRD
Pull x86 interrupt handling updates from Thomas Gleixner:
"Add support for posted interrupts on bare metal.
Posted interrupts is a virtualization feature which allows to inject
interrupts directly into a guest without host interaction. The VT-d
interrupt remapping hardware sets the bit which corresponds to the
interrupt vector in a vector bitmap which is either used to inject the
interrupt directly into the guest via a virtualized APIC or in case
that the guest is scheduled out provides a host side notification
interrupt which informs the host that an interrupt has been marked
pending in the bitmap.
This can be utilized on bare metal for scenarios where multiple
devices, e.g. NVME storage, raise interrupts with a high frequency. In
the default mode these interrupts are handles independently and
therefore require a full roundtrip of interrupt entry/exit.
Utilizing posted interrupts this roundtrip overhead can be avoided by
coalescing these interrupt entries to a single entry for the posted
interrupt notification. The notification interrupt then demultiplexes
the pending bits in a memory based bitmap and invokes the
corresponding device specific handlers.
Depending on the usage scenario and device utilization throughput
improvements between 10% and 130% have been measured.
As this is only relevant for high end servers with multiple device
queues per CPU attached and counterproductive for situations where
interrupts are arriving at distinct times, the functionality is opt-in
via a kernel command line parameter"
* tag 'x86-irq-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Use existing helper for pending vector check
iommu/vt-d: Enable posted mode for device MSIs
iommu/vt-d: Make posted MSI an opt-in command line option
x86/irq: Extend checks for pending vectors to posted interrupts
x86/irq: Factor out common code for checking pending interrupts
x86/irq: Install posted MSI notification handler
x86/irq: Factor out handler invocation from common_interrupt()
x86/irq: Set up per host CPU posted interrupt descriptors
x86/irq: Reserve a per CPU IDT vector for posted MSIs
x86/irq: Add a Kconfig option for posted MSI
x86/irq: Remove bitfields in posted interrupt descriptor
x86/irq: Unionize PID.PIR for 64bit access w/o casting
KVM: VMX: Move posted interrupt descriptor out of VMX code
Pull interrupt subsystem updates from Thomas Gleixner:
"Core code:
- Interrupt storm detection for the lockup watchdog:
Lockups which are caused by interrupt storms are not easy to debug
because there is no information about the events which make the
lockup detector trigger.
To make this more user friendly, provide an extenstion to interrupt
statistics which allows to take snapshots and an interface to
retrieve the delta to the snapshot. Use this new mechanism in the
watchdog code to do a two stage lockup analysis by taking the
snapshot and printing the deltas for the topmost active interrupts
on the second trigger.
Note: This contains both the interrupt and the watchdog changes as
the latter depend on the former obviously.
- Avoid summation loops in the /proc/interrupts output and use the
global counter when possible
- Skip suspended interrupts on CPU hotplug operations to ensure that
they are not delivered before the system resumes the device drivers
when coming out of suspend.
- On CPU hot-unplug interrupts which are affine to the outgoing CPU
are migrated to a different CPU in the affinity mask. This can fail
when the CPUs have no vectors left. Instead of giving up try to
migrate it to any online CPU and thereby breaking the affinity
setting in order to prevent a stale device interrupt which targets
an offline CPU
- The usual small cleanups
Driver code:
- Support for the RISCV AIA MSI controller
- Make the interrupt allocation for the Loongson PCH controller more
flexible to prevent vector exhaustion
- The usual set of cleanups and fixes all over the place"
* tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc
cpuidle: Avoid explicit cpumask allocation on stack
irqchip/sifive-plic: Avoid explicit cpumask allocation on stack
irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack
irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack
irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack
irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack
cpumask: Introduce cpumask_first_and_and()
irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown
genirq: Reuse irq_is_nmi()
genirq/cpuhotplug: Retry with cpu_online_mask when migration fails
genirq/cpuhotplug: Skip suspended interrupts when restoring affinity
arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251
arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251
ARM: dts: stm32: List exti parent interrupts on stm32mp131
ARM: dts: stm32: List exti parent interrupts on stm32mp151
arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32
irqchip/stm32-exti: Mark events reserved with RIF configuration check
irqchip/stm32-exti: Skip secure events
irqchip/stm32-exti: Convert driver to standard PM
...
lapic_vector_set_in_irr() is already available, use it for checking
pending vectors at the local APIC. No functional change.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20240506175612.1141095-1-jacob.jun.pan@linux.intel.com
Pull x86 timers update from Thomas Gleixner:
"A single update for the TSC synchronixation sanity checks:
The sad state of TSC being notoriously non-sychronized for several
decades caused the kernel to grow quite rigorous sanity checks to
detect whether the TSC is valid to be used for timekeeping.
The TSC ADJUST MSR provides the offset between the initial TSC value
after hardware reset and later modifications. This allows to detect
cases where firmware tampers with the TSC and also allows to correct
the firmware induced damage by resetting the offset in a controlled
way.
The universal correct rule is that the TSC ADJUST value has to be
consistent within all CPUs of a socket.
The kernel further assumes that the TSC offset should be consistent
between sockets. That's not really correct as systems with a huge
number of sockets are not architecurally guaranteed to reset the per
socket TSC base synchronously.
In case that the per socket offset is not consistent the kernel resets
it to the offset of the boot CPU and then does a synchronization check
which corrects for the inter socket delays.
That works most of the time, but it is suboptimal as the firmware has
eventually better information about the per socket offset and on sane
systems that offset should just work in the validation checks"
* tag 'x86-timers-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Trust initial offset in architectural TSC-adjust MSRs
This BUG_ON() is useless, because the same effect will be obtained
by letting the code run its course and vm being dereferenced,
triggering an exception.
So just remove this check.
Signed-off-by: Guanrui Huang <guanrui.huang@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240418061053.96803-3-guanrui.huang@linux.alibaba.com
With posted MSI feature enabled on the CPU side, iommu interrupt
remapping table entries (IRTEs) for device MSI/x can be allocated,
activated, and programed in posted mode. This means that IRTEs are
linked with their respective PIDs of the target CPU.
Handlers for the posted MSI notification vector will de-multiplex
device MSI handlers. CPU notifications are coalesced if interrupts
arrive at a high frequency.
Posted interrupts are only used for device MSI and not for legacy devices
(IO/APIC, HPET).
Introduce a new irq_chip for posted MSIs, which has a dummy irq_ack()
callback as EOI is performed in the notification handler once.
When posted MSI is enabled, MSI domain/chip hierarchy will look like
this example:
domain: IR-PCI-MSIX-0000:50:00.0-12
hwirq: 0x29
chip: IR-PCI-MSIX-0000:50:00.0
flags: 0x430
IRQCHIP_SKIP_SET_WAKE
IRQCHIP_ONESHOT_SAFE
parent:
domain: INTEL-IR-10-13
hwirq: 0x2d0000
chip: INTEL-IR-POST
flags: 0x0
parent:
domain: VECTOR
hwirq: 0x77
chip: APIC
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-13-jacob.jun.pan@linux.intel.com
Pull timers and timekeeping updates from Thomas Gleixner:
"Core code:
- Make timekeeping and VDSO time readouts resilent against math
overflow:
In guest context the kernel is prone to math overflow when the host
defers the timer interrupt due to overload, malfunction or malice.
This can be mitigated by checking the clocksource delta for the
maximum deferrement which is readily available. If that value is
exceeded then the code uses a slowpath function which can handle
the multiplication overflow.
This functionality is enabled unconditionally in the kernel, but
made conditional in the VDSO code. The latter is conditional
because it allows architectures to optimize the check so it is not
causing performance regressions.
On X86 this is achieved by reworking the existing check for
negative TSC deltas as a negative delta obviously exceeds the
maximum deferrement when it is evaluated as an unsigned value. That
avoids two conditionals in the hotpath and allows to hide both the
negative delta and the large delta handling in the same slow path.
- Add an initial minimal ktime_t abstraction for Rust
- The usual boring cleanups and enhancements
Drivers:
- Boring updates to device trees and trivial enhancements in various
drivers"
* tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
clocksource/drivers/arm_arch_timer: Mark hisi_161010101_oem_info const
clocksource/drivers/timer-ti-dm: Remove an unused field in struct dmtimer
clocksource/drivers/renesas-ostm: Avoid reprobe after successful early probe
clocksource/drivers/renesas-ostm: Allow OSTM driver to reprobe for RZ/V2H(P) SoC
dt-bindings: timer: renesas: ostm: Document Renesas RZ/V2H(P) SoC
rust: time: doc: Add missing C header links
clocksource: Make the int help prompt unit readable in ncurses
hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active()
timerqueue: Remove never used function timerqueue_node_expires()
rust: time: Add Ktime
vdso: Fix powerpc build U64_MAX undeclared error
clockevents: Convert s[n]printf() to sysfs_emit()
clocksource: Convert s[n]printf() to sysfs_emit()
clocksource: Make watchdog and suspend-timing multiplication overflow safe
timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflow
timekeeping: Make delta calculation overflow safe
timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safety
timekeeping: Fold in timekeeping_delta_to_ns()
timekeeping: Consolidate timekeeping helpers
timekeeping: Refactor timekeeping helpers
...
When the BIOS configures the architectural TSC-adjust MSRs on secondary
sockets to correct a constant inter-chassis offset, after Linux brings the
cores online, the TSC sync check later resets the core-local MSR to 0,
triggering HPET fallback and leading to performance loss.
Fix this by unconditionally using the initial adjust values read from the
MSRs. Trusting the initial offsets in this architectural mechanism is a
better approach than special-casing workarounds for specific platforms.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steffen Persvold <sp@numascale.com>
Reviewed-by: James Cleverdon <james.cleverdon.external@eviden.com>
Reviewed-by: Dimitri Sivanich <sivanich@hpe.com>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Link: https://lore.kernel.org/r/20240419085146.175665-1-daniel@quora.org
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.
Use cpumask_first_and_and() and cpumask_weight_and() to avoid the need
for a temporary cpumask on the stack.
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240416085454.3547175-8-dawei.li@shingroup.cn
Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y.
Also introduce a helper function for testing if posted MSI is supported on
the platform.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-12-jacob.jun.pan@linux.intel.com
Pull x86 APIC update from Dave Hansen:
"Coccinelle complained about some 64-bit divisions, but the divisor was
really just a 32-bit value being stored as 'unsigned long'.
Fixing the types fixes the warning"
* tag 'x86_apic_for_6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Improve data types to fix Coccinelle warnings
Pull clockevent/source updates from Daniel Lezcano:
- Add the R9A09G057 compatible bindings in the DT documentation and
add specific code to deal with the probe routine being called twice
(Geert Uytterhoeven)
- Remove unused field in the struct dmtimer in the TI driver
(Christophe JAILLET)
- Constify the hisi_161010101_oem_info variable in the ARM arch timer
(Stephen Boyd)
Link: https://lore.kernel.org/lkml/7ca1c46a-93e6-4f67-bee3-623cb56764fa@linaro.org