Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
Clone this repository
For self-hosted knots, clone URLs may differ based on your setup.
Download tar.gz
When walking through an inode extents, the ext4_ext_binsearch_idx() function
assumes that the extent header has been previously validated. However, there
are no checks that verify that the number of entries (eh->eh_entries) is
non-zero when depth is > 0. And this will lead to problems because the
EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:
[ 135.245946] ------------[ cut here ]------------
[ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
[ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
[ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
[ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[ 135.256475] Code:
[ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[ 135.277952] Call Trace:
[ 135.278635] <TASK>
[ 135.279247] ? preempt_count_add+0x6d/0xa0
[ 135.280358] ? percpu_counter_add_batch+0x55/0xb0
[ 135.281612] ? _raw_read_unlock+0x18/0x30
[ 135.282704] ext4_map_blocks+0x294/0x5a0
[ 135.283745] ? xa_load+0x6f/0xa0
[ 135.284562] ext4_mpage_readpages+0x3d6/0x770
[ 135.285646] read_pages+0x67/0x1d0
[ 135.286492] ? folio_add_lru+0x51/0x80
[ 135.287441] page_cache_ra_unbounded+0x124/0x170
[ 135.288510] filemap_get_pages+0x23d/0x5a0
[ 135.289457] ? path_openat+0xa72/0xdd0
[ 135.290332] filemap_read+0xbf/0x300
[ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40
[ 135.292192] new_sync_read+0x103/0x170
[ 135.293014] vfs_read+0x15d/0x180
[ 135.293745] ksys_read+0xa1/0xe0
[ 135.294461] do_syscall_64+0x3c/0x80
[ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0
This patch simply adds an extra check in __ext4_ext_check(), verifying that
eh_entries is not 0 when eh_depth is > 0.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Using rbtree for sorting groups by average fragment size is relatively
expensive (needs rbtree update on every block freeing or allocation) and
leads to wide spreading of allocations because selection of block group
is very sentitive both to changes in free space and amount of blocks
allocated. Furthermore selecting group with the best matching average
fragment size is not necessary anyway, even more so because the
variability of fragment sizes within a group is likely large so average
is not telling much. We just need a group with large enough average
fragment size so that we have high probability of finding large enough
free extent and we don't want average fragment size to be too big so
that we are likely to find free extent only somewhat larger than what we
need.
So instead of maintaing rbtree of groups sorted by fragment size keep
bins (lists) or groups where average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires less updates on block allocation
/ freeing, generally avoids chaotic spreading of allocations into block
groups, and still is able to quickly (even faster that the rbtree)
provide a block group which is likely to have a suitably sized free
space extent.
This patch reduces number of block groups used when untarring archive
with medium sized files (size somewhat above 64k which is default
mballoc limit for avoiding locality group preallocation) to about half
and thus improves write speeds for eMMC flash significantly.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Curently we don't use any preallocation when a file is already closed
when allocating blocks (from writeback code when converting delayed
allocation). However for small files, using locality group preallocation
is actually desirable as that is not specific to a particular file.
Rather it is a method to pack small files together to reduce
fragmentation and for that the fact the file is closed is actually even
stronger hint the file would benefit from packing. So change the logic
to allow locality group preallocation in this case.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently the Orlov inode allocator searches for free inodes for a
directory only in flex block groups with at most inodes_per_group/16
more directory inodes than average per flex block group. However with
growing size of flex block group this becomes unnecessarily strict.
Scale allowed difference from average directory count per flex block
group with flex block group size as we do with other metrics.
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
mb_set_largest_free_order() updates lists containing groups with largest
chunk of free space of given order. The way it updates it leads to
always moving the group to the tail of the list. Thus allocations
looking for free space of given order effectively end up cycling through
all groups (and due to initialization in last to first order). This
spreads allocations among block groups which reduces performance for
rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update lists if the order of the
largest free chunk in the group changed.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
One of the side-effects of mb_optimize_scan was that the optimized
functions to select next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to
corresponding inodes as well as we don't try to expand currently
allocated extent in the same group. This results in reaim regression
with workfile.disk workload of upto 8% with many clients on my test
machine:
baseline mb_optimize_scan
Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%)
Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%*
Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%*
Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%*
Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%*
Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%)
Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%)
Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%*
Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%*
Also this is causing huge regression (time increased by a factor of 5 or
so) when untarring archive with lots of small files on some eMMC storage
cards.
Fix the problem by making sure we try goal group first.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull powerpc fixes from Michael Ellerman:
- Fix handling of PCI domains in /proc on 32-bit systems using the
recently added support for numbering buses from zero for each domain.
- A fix and a revert for some changes to use READ/WRITE_ONCE() which
caused problems with KASAN enabled due to sanitisation calls being
introduced in low-level paths that can't cope with it.
- Fix build errors on 32-bit caused by the syscall table being
misaligned sometimes.
- Two fixes to get IBM Cell native machines booting again, which had
bit-rotted while my QS22 was temporarily out of action.
- Fix the papr_scm driver to not assume the order of events returned by
the hypervisor is stable, and a related compile fix.
Thanks to Aneesh Kumar K.V, Christophe Leroy, Jordan Niethe, Kajol Jain,
Masahiro Yamada, Nathan Chancellor, Pali Rohár, Vaibhav Jain, and Zhouyi
Zhou.
* tag 'powerpc-6.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/papr_scm: Ensure rc is always initialized in papr_scm_pmu_register()
Revert "powerpc/irq: Don't open code irq_soft_mask helpers"
powerpc: Fix hard_irq_disable() with sanitizer
powerpc/rtas: Fix RTAS MSR[HV] handling for Cell
Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"
powerpc: align syscall table for ppc32
powerpc/pci: Enable PCI domains in /proc when PCI bus numbers are not unique
powerpc/papr_scm: Fix nvdimm event mappings
Pull kvm fixes from Paolo Bonzini:
"s390:
- PCI interpretation compile fixes
RISC-V:
- fix unused variable warnings in vcpu_timer.c
- move extern sbi_ext declarations to a header
x86:
- check validity of argument to KVM_SET_MP_STATE
- use guest's global_ctrl to completely disable guest PEBS
- fix a memory leak on memory allocation failure
- mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES
- fix build failure with Clang integrated assembler
- fix MSR interception
- always flush TLBs when enabling dirty logging"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: check validity of argument to KVM_SET_MP_STATE
perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
KVM: x86: fix memoryleak in kvm_arch_vcpu_create()
KVM: x86: Mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES
KVM: s390: pci: Hook to access KVM lowlevel from VFIO
riscv: kvm: move extern sbi_ext declarations to a header
riscv: kvm: vcpu_timer: fix unused variable warnings
KVM: selftests: Fix ambiguous mov in KVM_ASM_SAFE()
KVM: selftests: Fix KVM_EXCEPTION_MAGIC build with Clang
KVM: VMX: Heed the 'msr' argument in msr_write_intercepted()
kvm: x86: mmu: Always flush TLBs when enabling dirty logging
kvm: x86: mmu: Drop the need_remote_flush() function
Clang warns:
arch/powerpc/platforms/pseries/papr_scm.c:492:6: warning: variable 'rc' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
if (!p->stat_buffer_len)
^~~~~~~~~~~~~~~~~~~
arch/powerpc/platforms/pseries/papr_scm.c:523:64: note: uninitialized use occurs here
dev_info(&p->pdev->dev, "nvdimm pmu didn't register rc=%d\n", rc);
^~
include/linux/dev_printk.h:150:67: note: expanded from macro 'dev_info'
dev_printk_index_wrap(_dev_info, KERN_INFO, dev, dev_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~~
include/linux/dev_printk.h:110:23: note: expanded from macro 'dev_printk_index_wrap'
_p_func(dev, fmt, ##__VA_ARGS__); \
^~~~~~~~~~~
arch/powerpc/platforms/pseries/papr_scm.c:492:2: note: remove the 'if' if its condition is always false
if (!p->stat_buffer_len)
^~~~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/platforms/pseries/papr_scm.c:484:8: note: initialize the variable 'rc' to silence this warning
int rc, nodeid;
^
= 0
1 warning generated.
The call to papr_scm_pmu_check_events() was eliminated but a return code
was not added to the if statement. Add the same return code from
papr_scm_pmu_check_events() for this condition so there is no more
warning.
Fixes: 9b1ac04698a4 ("powerpc/papr_scm: Fix nvdimm event mappings")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://github.com/ClangBuiltLinux/linux/issues/1701
Link: https://lore.kernel.org/r/20220830151256.1473169-1-nathan@kernel.org
-Wformat was recently re-enabled for builds with clang, then quickly
re-disabled, due to concerns stemming from the frequency of default
argument promotion related warning instances.
commit 258fafcd0683 ("Makefile.extrawarn: re-enable -Wformat for clang")
commit 21f9c8a13bb2 ("Revert "Makefile.extrawarn: re-enable -Wformat for clang"")
ISO WG14 has ratified N2562 to address default argument promotion
explicitly for printf, as part of the upcoming ISO C2X standard.
The behavior of clang was changed in clang-16 to not warn for the cited
cases in all language modes.
Add a version check, so that users of clang-16 now get the full effect
of -Wformat. For older clang versions, re-enable flags under the
-Wformat group that way users still get some useful checks related to
format strings, without noisy default argument promotion warnings. I
intentionally omitted -Wformat-y2k and -Wformat-security from being
re-enabled, which are also part of -Wformat in clang-16.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Link: https://github.com/llvm/llvm-project/issues/57102
Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2562.pdf
Suggested-by: Justin Stitt <jstitt007@gmail.com>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Youngmin Nam <youngmin.nam@samsung.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PCI interpretation compile fixes
This reverts commit ef5b570d3700fbb8628a58da0487486ceeb713cd.
Zhouyi reported that commit is causing crashes when running rcutorture
with KASAN enabled:
BUG: using smp_processor_id() in preemptible [00000000] code: rcu_torture_rea/100
caller is rcu_preempt_deferred_qs_irqrestore+0x74/0xed0
CPU: 4 PID: 100 Comm: rcu_torture_rea Tainted: G W 5.19.0-rc5-next-20220708-dirty #253
Call Trace:
dump_stack_lvl+0xbc/0x108 (unreliable)
check_preemption_disabled+0x154/0x160
rcu_preempt_deferred_qs_irqrestore+0x74/0xed0
__rcu_read_unlock+0x290/0x3b0
rcu_torture_read_unlock+0x30/0xb0
rcutorture_one_extend+0x198/0x810
rcu_torture_one_read+0x58c/0xc90
rcu_torture_reader+0x12c/0x360
kthread+0x1e8/0x220
ret_from_kernel_thread+0x5c/0x64
KASAN will generate instrumentation instructions around the
WRITE_ONCE(local_paca->irq_soft_mask, mask):
0xc000000000295cb0 <+0>: addis r2,r12,774
0xc000000000295cb4 <+4>: addi r2,r2,16464
0xc000000000295cb8 <+8>: mflr r0
0xc000000000295cbc <+12>: bl 0xc00000000008bb4c <mcount>
0xc000000000295cc0 <+16>: mflr r0
0xc000000000295cc4 <+20>: std r31,-8(r1)
0xc000000000295cc8 <+24>: addi r3,r13,2354
0xc000000000295ccc <+28>: mr r31,r13
0xc000000000295cd0 <+32>: std r0,16(r1)
0xc000000000295cd4 <+36>: stdu r1,-48(r1)
0xc000000000295cd8 <+40>: bl 0xc000000000609b98 <__asan_store1+8>
0xc000000000295cdc <+44>: nop
0xc000000000295ce0 <+48>: li r9,1
0xc000000000295ce4 <+52>: stb r9,2354(r31)
0xc000000000295ce8 <+56>: addi r1,r1,48
0xc000000000295cec <+60>: ld r0,16(r1)
0xc000000000295cf0 <+64>: ld r31,-8(r1)
0xc000000000295cf4 <+68>: mtlr r0
If there is a context switch before "stb r9,2354(r31)", r31 may
not equal to r13, in such case, irq soft mask will not work.
The usual solution of marking the code ineligible for instrumentation
forces the code out-of-line, which we would prefer to avoid. Christophe
proposed a partial revert, but Nick raised some concerns with that. So
for now do a full revert.
Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
[mpe: Construct change log based on Zhouyi's original report]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220831131052.42250-1-mpe@ellerman.id.au
Pull gpio fixes from Bartosz Golaszewski:
"A a set of fixes from the GPIO subsystem.
Most are small driver fixes except the realtek-otto driver patch which
is pretty big but addresses a significant flaw that can cause the CPU
to stay infinitely busy on uncleared ISR on some platforms.
Summary:
- MAINTAINERS update
- fix resource leaks in gpio-mockup and gpio-pxa
- add missing locking in gpio-pca953x
- use 32-bit I/O in gpio-realtek-otto
- make irq_chip structures immutable in four more drivers"
* tag 'gpio-fixes-for-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: ws16c48: Make irq_chip immutable
gpio: 104-idio-16: Make irq_chip immutable
gpio: 104-idi-48: Make irq_chip immutable
gpio: 104-dio-48e: Make irq_chip immutable
gpio: realtek-otto: switch to 32-bit I/O
gpio: pca953x: Add mutex_lock for regcache sync in PM
gpio: mockup: remove gpio debugfs when remove device
gpio: pxa: use devres for the clock struct
MAINTAINERS: rectify entry for XILINX GPIO DRIVER