commits
Pull tracing fixes from Steven Rostedt:
- Swapping the ring buffer for snapshotting (for things like irqsoff)
can crash if the ring buffer is being resized. Disable swapping when
this happens. The missed swap will be reported to the tracer
- Report error if the histogram fails to be created due to an error in
adding a histogram variable, in event_hist_trigger_parse()
- Remove unused declaration of tracing_map_set_field_descr()
* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
ring-buffer: Do not swap cpu_buffer during resize process
tracing: Remove unused extern declaration tracing_map_set_field_descr()
Pull Kbuild fixes from Masahiro Yamada:
- Fix stale help text in gconfig
- Support *.S files in compile_commands.json
- Flatten KBUILD_CFLAGS
- Fix external module builds with Rust so that temporary files are
created in the modules directories instead of the kernel tree
* tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: rust: avoid creating temporary files
kbuild: flatten KBUILD_CFLAGS
gen_compile_commands: add assembly files to compilation database
kconfig: gconfig: correct program name in help text
kconfig: gconfig: drop the Show Debug Info help text
Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.
Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Avoid pKVM finalization if KVM initialization fails
- Add missing BTI instructions in the hypervisor, fixing an early
boot failure on BTI systems
- Handle MMU notifiers correctly for non hugepage-aligned memslots
- Work around a bug in the architecture where hypervisor timer
controls have UNKNOWN behavior under nested virt
- Disable preemption in kvm_arch_hardware_enable(), fixing a kernel
BUG in cpu hotplug resulting from per-CPU accessor sanity checking
- Make WFI emulation on GICv4 systems robust w.r.t. preemption,
consistently requesting a doorbell interrupt on vcpu_put()
- Uphold RES0 sysreg behavior when emulating older PMU versions
- Avoid macro expansion when initializing PMU register names,
ensuring the tracepoints pretty-print the sysreg
s390:
- Two fixes for asynchronous destroy
x86 fixes will come early next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: pv: fix index value of replaced ASCE
KVM: s390: pv: simplify shutdown and fix race
KVM: arm64: Fix the name of sys_reg_desc related to PMU
KVM: arm64: Correctly handle RES0 bits PMEVTYPER<n>_EL0.evtCount
KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
KVM: arm64: Add missing BTI instructions
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
KVM: arm64: Disable preemption in kvm_arch_hardware_enable()
KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm
KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits
`rustc` outputs by default the temporary files (i.e. the ones saved
by `-Csave-temps`, such as `*.rcgu*` files) in the current working
directory when `-o` and `--out-dir` are not given (even if
`--emit=x=path` is given, i.e. it does not use those for temporaries).
Since out-of-tree modules are compiled from the `linux` tree,
`rustc` then tries to create them there, which may not be accessible.
Thus pass `--out-dir` explicitly, even if it is just for the temporary
files.
Similarly, do so for Rust host programs too.
Reported-by: Raphael Nestler <raphael.nestler@gmail.com>
Closes: https://github.com/Rust-for-Linux/linux/issues/1015
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Raphael Nestler <raphael.nestler@gmail.com> # non-hostprogs
Tested-by: Andrea Righi <andrea.righi@canonical.com> # non-hostprogs
Fixes: 295d8398c67e ("kbuild: specify output names separately for each emission type from rustc")
Cc: stable@vger.kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Tested-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When ring_buffer_swap_cpu was called during resize process,
the cpu buffer was swapped in the middle, resulting in incorrect state.
Continuing to run in the wrong state will result in oops.
This issue can be easily reproduced using the following two scripts:
/tmp # cat test1.sh
//#! /bin/sh
for i in `seq 0 100000`
do
echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
sleep 0.5
echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
sleep 0.5
done
/tmp # cat test2.sh
//#! /bin/sh
for i in `seq 0 100000`
do
echo irqsoff > /sys/kernel/debug/tracing/current_tracer
sleep 1
echo nop > /sys/kernel/debug/tracing/current_tracer
sleep 1
done
/tmp # ./test1.sh &
/tmp # ./test2.sh &
A typical oops log is as follows, sometimes with other different oops logs.
[ 231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
[ 231.713375] Modules linked in:
[ 231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15
[ 231.716750] Hardware name: linux,dummy-virt (DT)
[ 231.718152] Workqueue: events update_pages_handler
[ 231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 231.721171] pc : rb_update_pages+0x378/0x3f8
[ 231.722212] lr : rb_update_pages+0x25c/0x3f8
[ 231.723248] sp : ffff800082b9bd50
[ 231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[ 231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
[ 231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
[ 231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
[ 231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
[ 231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[ 231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
[ 231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
[ 231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
[ 231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
[ 231.744196] Call trace:
[ 231.744892] rb_update_pages+0x378/0x3f8
[ 231.745893] update_pages_handler+0x1c/0x38
[ 231.746893] process_one_work+0x1f0/0x468
[ 231.747852] worker_thread+0x54/0x410
[ 231.748737] kthread+0x124/0x138
[ 231.749549] ret_from_fork+0x10/0x20
[ 231.750434] ---[ end trace 0000000000000000 ]---
[ 233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 233.721696] Mem abort info:
[ 233.721935] ESR = 0x0000000096000004
[ 233.722283] EC = 0x25: DABT (current EL), IL = 32 bits
[ 233.722596] SET = 0, FnV = 0
[ 233.722805] EA = 0, S1PTW = 0
[ 233.723026] FSC = 0x04: level 0 translation fault
[ 233.723458] Data abort info:
[ 233.723734] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 233.724176] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 233.724589] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
[ 233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 233.726720] Modules linked in:
[ 233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15
[ 233.727777] Hardware name: linux,dummy-virt (DT)
[ 233.728225] Workqueue: events update_pages_handler
[ 233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 233.729054] pc : rb_update_pages+0x1a8/0x3f8
[ 233.729334] lr : rb_update_pages+0x154/0x3f8
[ 233.729592] sp : ffff800082b9bd50
[ 233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[ 233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
[ 233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
[ 233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
[ 233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
[ 233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
[ 233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
[ 233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
[ 233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
[ 233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
[ 233.734418] Call trace:
[ 233.734593] rb_update_pages+0x1a8/0x3f8
[ 233.734853] update_pages_handler+0x1c/0x38
[ 233.735148] process_one_work+0x1f0/0x468
[ 233.735525] worker_thread+0x54/0x410
[ 233.735852] kthread+0x124/0x138
[ 233.736064] ret_from_fork+0x10/0x20
[ 233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
[ 233.736959] ---[ end trace 0000000000000000 ]---
After analysis, the seq of the error is as follows [1-5]:
int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
int cpu_id)
{
for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
//1. get cpu_buffer, aka cpu_buffer(A)
...
...
schedule_work_on(cpu,
&cpu_buffer->update_pages_work);
//2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to
// update_pages_handler, do the update process, set 'update_done' in
// complete(&cpu_buffer->update_done) and to wakeup resize process.
//---->
//3. Just at this moment, ring_buffer_swap_cpu is triggered,
//cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer.
//ring_buffer_swap_cpu is called as the 'Call trace' below.
Call trace:
dump_backtrace+0x0/0x2f8
show_stack+0x18/0x28
dump_stack+0x12c/0x188
ring_buffer_swap_cpu+0x2f8/0x328
update_max_tr_single+0x180/0x210
check_critical_timing+0x2b4/0x2c8
tracer_hardirqs_on+0x1c0/0x200
trace_hardirqs_on+0xec/0x378
el0_svc_common+0x64/0x260
do_el0_svc+0x90/0xf8
el0_svc+0x20/0x30
el0_sync_handler+0xb0/0xb8
el0_sync+0x180/0x1c0
//<----
/* wait for all the updates to complete */
for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
//4. get cpu_buffer, cpu_buffer(B) is used in the following process,
//the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong.
//for example, cpu_buffer(A)->update_done will leave be set 1, and will
//not 'wait_for_completion' at the next resize round.
if (!cpu_buffer->nr_pages_to_update)
continue;
if (cpu_online(cpu))
wait_for_completion(&cpu_buffer->update_done);
cpu_buffer->nr_pages_to_update = 0;
}
...
}
//5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong,
//Continuing to run in the wrong state, then oops occurs.
Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn
Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull ext4 fixes from Ted Ts'o:
"Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's
checkpoint code"
* tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix rbtree traversal bug in ext4_mb_use_preallocated
ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()
ext4: correct inline offset when handling xattrs in inode body
jbd2: remove __journal_try_to_free_buffer()
jbd2: fix a race when checking checkpoint buffer busy
jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
jbd2: remove journal_clean_one_cp_list()
jbd2: remove t_checkpoint_io_list
jbd2: recheck chechpointing non-dirty buffer
Two fixes for asynchronous destroy
Make it slightly easier to see which compiler options are added and
removed (and not worry about column limit too!).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Since commit 08d43a5fa063 ("tracing: Add lock-free tracing_map"),
this is never used, so can be removed.
Link: https://lore.kernel.org/linux-trace-kernel/20230722032123.24664-1-yuehaibing@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull smb client fix from Steve French:
"Add minor debugging improvement.
The change improves ability to read a network trace to debug problems
on encrypted connections which are very common (e.g. using wireshark
or tcpdump).
That works today with tools like 'smbinfo keys /mnt/file' but requires
passing in a filename on the mount (see e.g. [1]), but it often makes
more sense to just pass in the mount point path (ie a directory not a
filename).
So this fix was needed to debug some types of problems (an obvious
example is on an encrypted connection failing operations on an empty
share or with no files in the root of the directory) - so you can
simply pass in the 'smbinfo keys <mntpoint>' and get the information
that wireshark needs"
Link: https://wiki.samba.org/index.php/Wireshark_Decryption [1]
* tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko
cifs: allow dumping keys for directories too
During allocations, while looking for preallocations(PA) in the per
inode rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted
and that can cause direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.
To makes sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only pa that could possibly
overlap the original goal start is the one that satisfies the below
conditions:
1. It must have it's logical start immediately to the left of
(ie less than) original logical start.
2. It must not be deleted
To find this pa we use the following traversal method:
1. Descend into the rbtree normally to find the immediate neighboring
PA. Here we keep descending irrespective of if the PA is deleted or if
it overlaps with our request etc. The goal is to find an immediately
adjacent PA.
2. If the found PA is on right of original goal, use rb_prev() to find
the left adjacent PA.
3. Check if this PA is deleted and keep moving left with rb_prev() until
a non deleted PA is found.
4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.
This approach also takes care of having deleted PAs in the tree.
(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)
[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/
Cc: stable@kernel.org # 6.4
Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
KVM/arm64 fixes for 6.5, part #1
- Avoid pKVM finalization if KVM initialization fails
- Add missing BTI instructions in the hypervisor, fixing an early boot
failure on BTI systems
- Handle MMU notifiers correctly for non hugepage-aligned memslots
- Work around a bug in the architecture where hypervisor timer controls
have UNKNOWN behavior under nested virt.
- Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG
in cpu hotplug resulting from per-CPU accessor sanity checking.
- Make WFI emulation on GICv4 systems robust w.r.t. preemption,
consistently requesting a doorbell interrupt on vcpu_put()
- Uphold RES0 sysreg behavior when emulating older PMU versions
- Avoid macro expansion when initializing PMU register names, ensuring
the tracepoints pretty-print the sysreg.
The index field of the struct page corresponding to a guest ASCE should
be 0. When replacing the ASCE in s390_replace_asce(), the index of the
new ASCE should also be set to 0.
Having the wrong index might lead to the wrong addresses being passed
around when notifying pte invalidations, and eventually to validity
intercepts (VM crash) if the prefix gets unmapped and the notifier gets
called with the wrong address.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Fixes: faa2f72cb356 ("KVM: s390: pv: leak the topmost page table when destroy fails")
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20230705111937.33472-3-imbrenda@linux.ibm.com>
Like C source files, tooling can find it useful to have the assembly
source file compilation recorded.
The .S extension appears to used across all architectures.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull powerpc fixes from Michael Ellerman:
- Reinstate support for little endian ELFv1 binaries, which it turns
out still exist in the wild.
- Revert a change which used asm goto for WARN_ON/__WARN_FLAGS, as it
lead to dead code generation and seemed to trigger compiler bugs in
some edge cases.
- Fix a deadlock in the pseries VAS code, between live migration and
the driver's mmap handler.
- Disable KCOV instrumentation in the powerpc KASAN code.
Thanks to Andrew Donnellan, Benjamin Gray, Christophe Leroy, Haren
Myneni, Russell Currey, and Uwe Kleine-König.
* tag 'powerpc-6.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
Revert "powerpc/64s: Remove support for ELFv1 little endian userspace"
powerpc/kasan: Disable KCOV in KASAN code
powerpc/512x: lpbfifo: Convert to platform remove callback returning void
powerpc/crypto: Add gitignore for generated P10 AES/GCM .S files
Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"
powerpc/pseries/vas: Hold mmap_mutex after mmap lock during window close
From 2.43 to 2.44
Signed-off-by: Steve French <stfrench@microsoft.com>
In ext4_mb_choose_next_group_best_avail(), we want the start order to be
1 less than goal length and the min_order to be, at max, 1 more than the
original length. This commit fixes an off by one issue that arose due to
the fact that 1 << fls(n) > (n).
After all the processing:
order = 1 order below goal len
min_order = maximum of the three:-
- order - trim_order
- 1 order below B2C(s_stripe)
- 1 order above original len
Cc: stable@kernel.org
Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For those PMU system registers defined in sys_reg_descs[], use macro
PMU_SYS_REG() / PMU_PMEVCNTR_EL0 / PMU_PMEVTYPER_EL0 to define them, and
later two macros call macro PMU_SYS_REG() actually.
Currently the input parameter of PMU_SYS_REG() is another macro which is
calculation formula of the value of system registers, so for example, if
we want to "SYS_PMINTENSET_EL1" as the name of sys register, actually
the name we get is as following:
(((3) << 19) | ((0) << 16) | ((9) << 12) | ((14) << 8) | ((1) << 5))
The name of system register is used in some tracepoints such as
trace_kvm_sys_access(), if not set correctly, we need to analyze the
inaccurate name to get the exact name (which also is inconsistent with
other system registers), and also the inaccurate name occupies more space.
To fix the issue, use the name as a input parameter of PMU_SYS_REG like
MTE_REG or EL2_REG.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1689305920-170523-1-git-send-email-chenxiang66@hisilicon.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Simplify the shutdown of non-protected VMs. There is no need to do
complex manipulations of the counter if it was zero.
This also fixes a very rare race which caused pages to be torn down
from the address space with a non-zero counter even on older machines
that don't support the UVC instruction, causing a crash.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: fb491d5500a7 ("KVM: s390: pv: asynchronous destroy for reboot")
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20230705111937.33472-2-imbrenda@linux.ibm.com>
Change "gkc" to "gconfig" in 3 places since it is called "gconfig" and
not "gkc". Add a period at the end of one sentence.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull xtensa fixes from Max Filippov:
- fix interaction between unaligned exception handler and load/store
exception handler
- fix parsing ISS network interface specification string
- add comment about etherdev freeing to ISS network driver
* tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa: fix unaligned and load/store configuration interaction
xtensa: ISS: fix call to split_if_spec
xtensa: ISS: add comment about etherdev freeing
Pull s390 fixes from Heiko Carstens:
- Fix per vma lock fault handling: add missing !(fault & VM_FAULT_ERROR)
check to fault handler to prevent error handling for return values
that don't indicate an error
- Use kfree_sensitive() instead of kfree() in paes crypto code to clear
memory that may contain keys before freeing it
- Fix reply buffer size calculation for CCA replies in zcrypt device
driver
* tag 's390-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: fix reply buffer calculations for CCA replies
s390/crypto: use kfree_sensitive() instead of kfree()
s390/mm: fix per vma lock fault handling
This reverts commit 606787fed7268feb256957872586370b56af697a.
ELFv1 with LE has never been a thing, and people who try to make ELFv1 LE
binaries are maniacs who need to be stopped, but unfortunately there are
ELFv1 LE binaries out there in the wild.
One such binary is the ppc64el (as Debian calls it) helper for
arch-test[0], a tool for detecting architectures that can be executed on a
given machine by means of attempting to execute helper binaries compiled
for each architecture and seeing which binaries succeed and fail. The
helpers are small snippets of assembly, and the ppc64el assembly doesn't
include the right directives to generate an ELFv2 binary.
This results in arch-test incorrectly determining that a ppc64el kernel
can't execute a ppc64el userspace, which in turn means that a number of
developer tools such as debootstrap will break (assuming arch-test is
installed).
[0] https://github.com/kilobyte/arch-test
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230719071821.320594-1-ajd@linux.ibm.com
Dumping the enc/dec keys is a session wide operation.
And it should not matter if the ioctl was run on
a regular file or a directory.
Currently, we obtain the tcon pointer from the
cifs file handle. But since there's no dir open call
in cifs, this is not populated for dirs.
This change allows dumping of session keys using ioctl
even for directories. To do this, we'll now get the
tcon pointer from the superblock, and not from the file
handle.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When run on a file system where the inline_data feature has been
enabled, xfstests generic/269, generic/270, and generic/476 cause ext4
to emit error messages indicating that inline directory entries are
corrupted. This occurs because the inline offset used to locate
inline directory entries in the inode body is not updated when an
xattr in that shared region is deleted and the region is shifted in
memory to recover the space it occupied. If the deleted xattr precedes
the system.data attribute, which points to the inline directory entries,
that attribute will be moved further up in the region. The inline
offset continues to point to whatever is located in system.data's former
location, with unfortunate effects when used to access directory entries
or (presumably) inline data in the inode body.
Cc: stable@kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The PMU event ID varies from 10 to 16 bits, depending on the PMU
version. If the PMU only supports 10 bits of event ID, bits [15:10] of
the evtCount field behave as RES0.
While the actual PMU emulation code gets this right (i.e. RES0 bits are
masked out when programming the perf event), the sysreg emulation writes
an unmasked value to the in-memory cpu context. The net effect is that
guest reads and writes of PMEVTYPER<n>_EL0 will see non-RES0 behavior in
the reserved bits of the field.
As it so happens, kvm_pmu_set_counter_event_type() already writes a
masked value to the in-memory context that gets overwritten by
access_pmu_evtyper(). Fix the issue by removing the unnecessary (and
incorrect) register write in access_pmu_evtyper().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230713221649.3889210-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The Show Debug Info option was removed eons ago. Now finish the job
by removing the help text for it also.
Fixes: 7b5d87215b38 ("gconfig: remove show_debug option")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull perf fix from Borislav Petkov:
- Fix a lockdep warning when the event given is the first one, no event
group exists yet but the code still goes and iterates over event
siblings
* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
Unaligned exception handler is needed in configurations with hardware
support for unaligned access when the load/store exception handler is
enabled because such configurations would still raise an exception on
unaligned access through the instruction bus.
Fixes: f29cf77609cc ("xtensa: add load/store exception handler")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Pull block fixes from Jens Axboe:
- Fix for loop regressions (Mauricio)
- Fix a potential stall with batched wakeups in sbitmap (David)
- Fix for stall with recursive plug flushes (Ross)
- Skip accounting of empty requests for blk-iocost (Chengming)
- Remove a dead field in struct blk_mq_hw_ctx (Chengming)
* tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux:
loop: do not enforce max_loop hard limit by (new) default
loop: deprecate autoloading callback loop_probe()
sbitmap: fix batching wakeup
blk-iocost: skip empty flush bio in iocost
blk-mq: delete dead struct blk_mq_hw_ctx->queued field
blk-mq: Fix stall due to recursive flush plug
The length information for available buffer space for CCA
replies is covered with two fields in the T6 header prepended
on each CCA reply: fromcardlen1 and fromcardlen2. The sum of
these both values must not exceed the AP bus limit for this
card (24KB for CEX8, 12KB CEX7 and older) minus the always
present headers.
The current code adjusted the fromcardlen2 value in case
of exceeding the AP bus limit when there was a non-zero
value given from userspace. Some tests now showed that this
was the wrong assumption. Instead the userspace value given for
this field should always be trusted and if the sum of the
two fields exceeds the AP bus limit for this card the first
field fromcardlen1 should be adjusted instead.
So now the calculation is done with this new insight in mind.
Also some additional checks for overflow have been introduced
and some comments to provide some documentation for future
maintainers of this complicated calculation code.
Furthermore the 128 bytes of fix overhead which is used
in the current code is not correct. Investigations showed
that for a reply always the same two header structs are
prepended before a possible payload. So this is also fixed
with this patch.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
As per the generic KASAN code in mm/kasan, disable KCOV with
KCOV_INSTRUMENT := n in the makefile.
This fixes a ppc64 boot hang when KCOV and KASAN are enabled.
kasan_early_init() gets called before a PACA is initialised, but the
KCOV hook expects a valid PACA.
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230710044143.146840-1-bgray@linux.ibm.com
__journal_try_to_free_buffer() has only one caller and it's logic is
much simple now, so just remove it and open code in
jbd2_journal_try_to_free_buffers().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Xiang reports that VMs occasionally fail to boot on GICv4.1 systems when
running a preemptible kernel, as it is possible that a vCPU is blocked
without requesting a doorbell interrupt.
The issue is that any preemption that occurs between vgic_v4_put() and
schedule() on the block path will mark the vPE as nonresident and *not*
request a doorbell irq. This occurs because when the vcpu thread is
resumed on its way to block, vcpu_load() will make the vPE resident
again. Once the vcpu actually blocks, we don't request a doorbell
anymore, and the vcpu won't be woken up on interrupt delivery.
Fix it by tracking that we're entering WFI, and key the doorbell
request on that flag. This allows us not to make the vPE resident
when going through a preempt/schedule cycle, meaning we don't lose
any state.
Cc: stable@vger.kernel.org
Fixes: 8e01d9a396e6 ("KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put")
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Suggested-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20230713070657.3873244-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Pull objtool fixes from Borislav Petkov:
- Mark copy_iovec_from_user() __noclone in order to prevent gcc from
doing an inter-procedural optimization and confuse objtool
- Initialize struct elf fully to avoid build failures
* tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iov_iter: Mark copy_iovec_from_user() noclone
objtool: initialize all of struct elf
On SPR, the load latency event needs an auxiliary event in the same
group to work properly. There's a check in intel_pmu_hw_config()
for this to iterate sibling events and find a mem-loads-aux event.
The for_each_sibling_event() has a lockdep assert to make sure if it
disabled hardirq or hold leader->ctx->mutex. This works well if the
given event has a separate leader event since perf_try_init_event()
grabs the leader->ctx->mutex to protect the sibling list. But it can
cause a problem when the event itself is a leader since the event is
not initialized yet and there's no ctx for the event.
Actually I got a lockdep warning when I run the below command on SPR,
but I guess it could be a NULL pointer dereference.
$ perf record -d -e cpu/mem-loads/uP true
The code path to the warning is:
sys_perf_event_open()
perf_event_alloc()
perf_init_event()
perf_try_init_event()
x86_pmu_event_init()
hsw_hw_config()
intel_pmu_hw_config()
for_each_sibling_event()
lockdep_assert_event_ctx()
We don't need for_each_sibling_event() when it's a standalone event.
Let's return the error code directly.
Fixes: f3c0eba28704 ("perf: Add a few assertions")
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230704181516.3293665-1-namhyung@kernel.org
split_if_spec expects a NULL-pointer as an end marker for the argument
list, but tuntap_probe never supplied that terminating NULL. As a result
incorrectly formatted interface specification string may cause a crash
because of the random memory access. Fix that by adding NULL terminator
to the split_if_spec argument list.
Cc: stable@vger.kernel.org
Fixes: 7282bee78798 ("[PATCH] xtensa: Architecture support for Tensilica Xtensa Part 8")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Pull io_uring fixes from Jens Axboe:
- Fix for io-wq not always honoring REQ_F_NOWAIT, if it was set and
punted directly (eg via DRAIN) (me)
- Capability check fix (Ondrej)
- Regression fix for the mmap changes that went into 6.4, which
apparently broke IA64 (Helge)
* tag 'io_uring-6.5-2023-07-21' of git://git.kernel.dk/linux:
ia64: mmap: Consider pgoff when searching for free mapping
io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()
io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq
io_uring: don't audit the capability check in io_uring_create()
Problem:
The max_loop parameter is used for 2 different purposes:
1) initial number of loop devices to pre-create on init
2) maximum number of loop devices to add on access/open()
Historically, its default value (zero) caused 1) to create non-zero
number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on
2) to add devices with autoloading.
However, the default value changed in commit 85c50197716c ("loop: Fix
the max_loop commandline argument treatment when it is set to 0") to
CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices.
That does improve 1), but unfortunately it breaks 2), as the default
behavior changed from no-limit to hard-limit.
Example:
For example, this userspace code broke for N >= CONFIG, if the user
relied on the default value 0 for max_loop:
mknod("/dev/loopN");
open("/dev/loopN"); // now fails with ENXIO
Though affected users may "fix" it with (loop.)max_loop=0, this means to
require a kernel parameter change on stable kernel update (that commit
Fixes: an old commit in stable).
Solution:
The original semantics for the default value in 2) can be applied if the
parameter is not set (ie, default behavior).
This still keeps the intended function in 1) and 2) if set, and that
commit's intended improvement in 1) if max_loop=0.
Before 85c50197716c:
- default: 1) CONFIG devices 2) no limit
- max_loop=0: 1) CONFIG devices 2) no limit
- max_loop=X: 1) X devices 2) X limit
After 85c50197716c:
- default: 1) CONFIG devices 2) CONFIG limit (*)
- max_loop=0: 1) 0 devices (*) 2) no limit
- max_loop=X: 1) X devices 2) X limit
This commit:
- default: 1) CONFIG devices 2) no limit (*)
- max_loop=0: 1) 0 devices 2) no limit
- max_loop=X: 1) X devices 2) X limit
Future:
The issue/regression from that commit only affects code under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is
contained under it.
Once that deprecated functionality/code is removed, the purpose 2) of
max_loop (hard limit) is no longer in use, so the module parameter
description can be changed then.
Tests:
Linux 6.4-rc7
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
- default (original)
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
open: /dev/loop8: No such device or address
- default (patched)
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
#
- max_loop=0 (original & patched):
# ls -1 /dev/loop*
/dev/loop-control
# ./test-loop
#
- max_loop=8 (original & patched):
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
open: /dev/loop8: No such device or address
- max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set)
# ls -1 /dev/loop*
/dev/loop-control
# ./test-loop
open: /dev/loop8: No such device or address
Fixes: 85c50197716c ("loop: Fix the max_loop commandline argument treatment when it is set to 0")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
key might contain private part of the key, so better use
kfree_sensitive() to free it.
Signed-off-by: Wang Ming <machel@vivo.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Link: https://lore.kernel.org/r/20230717094533.18418-1-machel@vivo.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new() which already returns void. Eventually after all drivers
are converted, .remove_new() is renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230711143145.1192651-1-u.kleine-koenig@pengutronix.de
Before removing checkpoint buffer from the t_checkpoint_list, we have to
check both BH_Dirty and BH_Lock bits together to distinguish buffers
have not been or were being written back. But __cp_buffer_busy() checks
them separately, it first check lock state and then check dirty, the
window between these two checks could be raced by writing back
procedure, which locks buffer and clears buffer dirty before I/O
completes. So it cannot guarantee checkpointing buffers been written
back to disk if some error happens later. Finally, it may clean
checkpoint transactions and lead to inconsistent filesystem.
jbd2_journal_forget() and __journal_try_to_free_buffer() also have the
same problem (journal_unmap_buffer() escape from this issue since it's
running under the buffer lock), so fix them through introducing a new
helper to try holding the buffer lock and remove really clean buffer.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some bti instructions were missing from
commit b53d4a272349 ("KVM: arm64: Use BTI for nvhe")
1) kvm_host_psci_cpu_entry
kvm_host_psci_cpu_entry is called from __kvm_hyp_init_cpu through "br"
instruction as __kvm_hyp_init_cpu resides in idmap section while
kvm_host_psci_cpu_entry is in hyp .text so the offset is larger than
128MB range covered by "b".
Which means that this function should start with "bti j" instruction.
LLVM which is the only compiler supporting BTI for Linux, adds "bti j"
for jump tables or by when taking the address of the block [1].
Same behaviour is observed with GCC.
As kvm_host_psci_cpu_entry is a C function, this must be done in
assembly.
Another solution is to use X16/X17 with "br", as according to ARM
ARM DDI0487I.a RLJHCL/IGMGRS, PACIASP has an implicit branch
target identification instruction that is compatible with
PSTATE.BTYPE 0b01 which includes "br X16/X17"
And the kvm_host_psci_cpu_entry has PACIASP as it is an external
function.
Although, using explicit "bti" makes it more clear than relying on
which register is used.
A third solution is to clear SCTLR_EL2.BT, which would make PACIASP
compatible PSTATE.BTYPE 0b11 ("br" to other registers).
However this deviates from the kernel behaviour (in bti_enable()).
2) Spectre vector table
"br" instructions are generated at runtime for the vector table
(__bp_harden_hyp_vecs).
These branches would land on vectors in __kvm_hyp_vector at offset 8.
As all the macros are defined with valid_vect/invalid_vect, it is
sufficient to add "bti j" at the correct offset.
[1] https://reviews.llvm.org/D52867
Fixes: b53d4a272349 ("KVM: arm64: Use BTI for nvhe")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20230706152240.685684-1-smostafa@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We just sorted the entries and fields last release, so just out of a
perverse sense of curiosity, I decided to see if we can keep things
ordered for even just one release.
The answer is "No. No we cannot".
I suggest that all kernel developers will need weekly training sessions,
involving a lot of Big Bird and Sesame Street. And at the yearly
maintainer summit, we will all sing the alphabet song together.
I doubt I will keep doing this. At some point "perverse sense of
curiosity" turns into just a cold dark place filled with sadness and
despair.
Repeats: 80e62bc8487b ("MAINTAINERS: re-sort all entries and fields")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Borislav Petkov:
- Remove a cgroup from under a polling process properly
- Fix the idle sibling selection
* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: use kernfs polling functions for PSI trigger polling
sched/fair: Use recent_used_cpu to test p->cpus_ptr
Extend commit 50f9a76ef127 ("iov_iter: Mark
copy_compat_iovec_from_user() noinline") to also cover
copy_iovec_from_user(). Different compiler versions cause the same
problem on different functions.
lib/iov_iter.o: warning: objtool: .altinstr_replacement+0x1f: redundant UACCESS disable
lib/iov_iter.o: warning: objtool: iovec_from_user+0x84: call to copy_iovec_from_user.part.0() with UACCESS enabled
lib/iov_iter.o: warning: objtool: __import_iovec+0x143: call to copy_iovec_from_user.part.0() with UACCESS enabled
Fixes: 50f9a76ef127 ("iov_iter: Mark copy_compat_iovec_from_user() noinline")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20230616124354.GD4253@hirez.programming.kicks-ass.net
iss_net_configure explicitly frees etherdev in all error return paths
except one where register_netdevice fails. In that remaining error
return path the etherdev is freed by the iss_net_pdev_release callback
triggered by the platform_device_unregister call. Add a comment stating
that.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Pull devicetree fixes from Rob Herring:
- Fix moortec,mr75203 schema usage of 'multipleOf' keyword
- Fix regression in systems depending on "of-display" device name
- Build fix for s390 with CONFIG_PCI=n and OF_EARLY_FLATTREE=y
- Drop two obsolete serial .txt bindings
* tag 'devicetree-fixes-for-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: serial: Remove obsolete nxp,lpc1850-uart.txt
dt-bindings: serial: Remove obsolete cavium-uart.txt
dt-bindings: hwmon: moortec,mr75203: fix multipleOf for coefficients
of: Preserve "of-display" device name for compatibility
of: make OF_EARLY_FLATTREE depend on HAS_IOMEM
IA64 is the only architecture which does not consider the pgoff value when
searching for a possible free memory region with vm_unmapped_area().
Adding this seems to have no negative side effect on IA64, so add it now
to make IA64 consistent with all other architectures.
Cc: stable@vger.kernel.org # 6.4
Signed-off-by: Helge Deller <deller@gmx.de>
Tested-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-ia64@vger.kernel.org
Link: https://lore.kernel.org/r/20230721152432.196382-3-deller@gmx.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'probe' callback in __register_blkdev() is only used under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard.
The loop_probe() function is only used for that callback, so guard it
too, accordingly.
See commit fbdee71bb5d8 ("block: deprecate autoloading based on dev_t").
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With per-vma locks, handle_mm_fault() may return non-fatal error
flags. In this case the code should reset the fault flags before
returning.
Fixes: e06f47a16573 ("s390/mm: try VMA lock-based page fault handling first")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
aesp10-ppc.S and ghashp10-ppc.S are autogenerated and not tracked by
git, so they should be ignored. This is doing the same as the P8 files
in drivers/crypto/vmx/.gitignore but for the P10 files in
arch/powerpc/crypto.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230713042206.85669-1-ruscur@russell.cc
Following process,
jbd2_journal_commit_transaction
// there are several dirty buffer heads in transaction->t_checkpoint_list
P1 wb_workfn
jbd2_log_do_checkpoint
if (buffer_locked(bh)) // false
__block_write_full_page
trylock_buffer(bh)
test_clear_buffer_dirty(bh)
if (!buffer_dirty(bh))
__jbd2_journal_remove_checkpoint(jh)
if (buffer_write_io_error(bh)) // false
>> bh IO error occurs <<
jbd2_cleanup_journal_tail
__jbd2_update_log_tail
jbd2_write_superblock
// The bh won't be replayed in next mount.
, which could corrupt the ext4 image, fetch a reproducer in [Link].
Since writeback process clears buffer dirty after locking buffer head,
we can fix it by try locking buffer and check dirtiness while buffer is
locked, the buffer head can be removed if it is neither dirty nor locked.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest
memory. This is even the case with hugepages, although it is a rather
suboptimal configuration as PTE level mappings are used at stage-2.
The arm64 page aging handlers have an assumption that the specified
range is exactly one page/block of memory, which in the aforementioned
case is not necessarily true. All together this leads to the WARN() in
kvm_age_gfn() firing.
However, the WARN is only part of the issue as the table walkers visit
at most a single leaf PTE. For hugepage-backed memory in a memslot that
isn't hugepage-aligned, page aging entirely misses accesses to the
hugepage beyond the first page in the memslot.
Add a new walker dedicated to handling page aging MMU notifiers capable
of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new
walker and drop the WARN that caught the issue in the first place. The
implementation of this walker was inspired by the test_clear_young()
implementation by Yu Zhao [*], but repurposed to address a bug in the
existing aging implementation.
Cc: stable@vger.kernel.org # v5.15
Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")
Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/
Co-developed-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Pull dma-mapping fixes from Christoph Hellwig:
- swiotlb area sizing fixes (Petr Tesarik)
* tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: reduce the number of areas to match actual memory pool size
swiotlb: always set the number of areas before allocating the pool
Pull pin control fixes from Linus Walleij:
"I'm mostly on vacation but what would vacation be without a few
critical fixes so people can use their gaming laptops when hiding away
from the sun (or rain)?
- Fix a really annoying interrupt storm in the AMD driver affecting
Asus TUF gaming notebooks
- Fix device tree parsing in the Renesas driver"
* tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: amd: Unify debounce handling into amd_pinconf_set()
pinctrl: amd: Drop pull up select configuration
pinctrl: amd: Use amd_pinconf_set() for all config options
pinctrl: amd: Only use special debounce behavior for GPIO 0
pinctrl: renesas: rzg2l: Handle non-unique subnode names
pinctrl: renesas: rzv2m: Handle non-unique subnode names
Pull tracing fixes from Steven Rostedt:
- Swapping the ring buffer for snapshotting (for things like irqsoff)
can crash if the ring buffer is being resized. Disable swapping when
this happens. The missed swap will be reported to the tracer
- Report error if the histogram fails to be created due to an error in
adding a histogram variable, in event_hist_trigger_parse()
- Remove unused declaration of tracing_map_set_field_descr()
* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
ring-buffer: Do not swap cpu_buffer during resize process
tracing: Remove unused extern declaration tracing_map_set_field_descr()
Pull Kbuild fixes from Masahiro Yamada:
- Fix stale help text in gconfig
- Support *.S files in compile_commands.json
- Flatten KBUILD_CFLAGS
- Fix external module builds with Rust so that temporary files are
created in the modules directories instead of the kernel tree
* tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: rust: avoid creating temporary files
kbuild: flatten KBUILD_CFLAGS
gen_compile_commands: add assembly files to compilation database
kconfig: gconfig: correct program name in help text
kconfig: gconfig: drop the Show Debug Info help text
Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.
Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Avoid pKVM finalization if KVM initialization fails
- Add missing BTI instructions in the hypervisor, fixing an early
boot failure on BTI systems
- Handle MMU notifiers correctly for non hugepage-aligned memslots
- Work around a bug in the architecture where hypervisor timer
controls have UNKNOWN behavior under nested virt
- Disable preemption in kvm_arch_hardware_enable(), fixing a kernel
BUG in cpu hotplug resulting from per-CPU accessor sanity checking
- Make WFI emulation on GICv4 systems robust w.r.t. preemption,
consistently requesting a doorbell interrupt on vcpu_put()
- Uphold RES0 sysreg behavior when emulating older PMU versions
- Avoid macro expansion when initializing PMU register names,
ensuring the tracepoints pretty-print the sysreg
s390:
- Two fixes for asynchronous destroy
x86 fixes will come early next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: pv: fix index value of replaced ASCE
KVM: s390: pv: simplify shutdown and fix race
KVM: arm64: Fix the name of sys_reg_desc related to PMU
KVM: arm64: Correctly handle RES0 bits PMEVTYPER<n>_EL0.evtCount
KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
KVM: arm64: Add missing BTI instructions
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
KVM: arm64: Disable preemption in kvm_arch_hardware_enable()
KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm
KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits
`rustc` outputs by default the temporary files (i.e. the ones saved
by `-Csave-temps`, such as `*.rcgu*` files) in the current working
directory when `-o` and `--out-dir` are not given (even if
`--emit=x=path` is given, i.e. it does not use those for temporaries).
Since out-of-tree modules are compiled from the `linux` tree,
`rustc` then tries to create them there, which may not be accessible.
Thus pass `--out-dir` explicitly, even if it is just for the temporary
files.
Similarly, do so for Rust host programs too.
Reported-by: Raphael Nestler <raphael.nestler@gmail.com>
Closes: https://github.com/Rust-for-Linux/linux/issues/1015
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Raphael Nestler <raphael.nestler@gmail.com> # non-hostprogs
Tested-by: Andrea Righi <andrea.righi@canonical.com> # non-hostprogs
Fixes: 295d8398c67e ("kbuild: specify output names separately for each emission type from rustc")
Cc: stable@vger.kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Tested-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When ring_buffer_swap_cpu was called during resize process,
the cpu buffer was swapped in the middle, resulting in incorrect state.
Continuing to run in the wrong state will result in oops.
This issue can be easily reproduced using the following two scripts:
/tmp # cat test1.sh
//#! /bin/sh
for i in `seq 0 100000`
do
echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
sleep 0.5
echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
sleep 0.5
done
/tmp # cat test2.sh
//#! /bin/sh
for i in `seq 0 100000`
do
echo irqsoff > /sys/kernel/debug/tracing/current_tracer
sleep 1
echo nop > /sys/kernel/debug/tracing/current_tracer
sleep 1
done
/tmp # ./test1.sh &
/tmp # ./test2.sh &
A typical oops log is as follows, sometimes with other different oops logs.
[ 231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
[ 231.713375] Modules linked in:
[ 231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15
[ 231.716750] Hardware name: linux,dummy-virt (DT)
[ 231.718152] Workqueue: events update_pages_handler
[ 231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 231.721171] pc : rb_update_pages+0x378/0x3f8
[ 231.722212] lr : rb_update_pages+0x25c/0x3f8
[ 231.723248] sp : ffff800082b9bd50
[ 231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[ 231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
[ 231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
[ 231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
[ 231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
[ 231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[ 231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
[ 231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
[ 231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
[ 231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
[ 231.744196] Call trace:
[ 231.744892] rb_update_pages+0x378/0x3f8
[ 231.745893] update_pages_handler+0x1c/0x38
[ 231.746893] process_one_work+0x1f0/0x468
[ 231.747852] worker_thread+0x54/0x410
[ 231.748737] kthread+0x124/0x138
[ 231.749549] ret_from_fork+0x10/0x20
[ 231.750434] ---[ end trace 0000000000000000 ]---
[ 233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 233.721696] Mem abort info:
[ 233.721935] ESR = 0x0000000096000004
[ 233.722283] EC = 0x25: DABT (current EL), IL = 32 bits
[ 233.722596] SET = 0, FnV = 0
[ 233.722805] EA = 0, S1PTW = 0
[ 233.723026] FSC = 0x04: level 0 translation fault
[ 233.723458] Data abort info:
[ 233.723734] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 233.724176] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 233.724589] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
[ 233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 233.726720] Modules linked in:
[ 233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15
[ 233.727777] Hardware name: linux,dummy-virt (DT)
[ 233.728225] Workqueue: events update_pages_handler
[ 233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 233.729054] pc : rb_update_pages+0x1a8/0x3f8
[ 233.729334] lr : rb_update_pages+0x154/0x3f8
[ 233.729592] sp : ffff800082b9bd50
[ 233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[ 233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
[ 233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
[ 233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
[ 233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
[ 233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
[ 233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
[ 233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
[ 233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
[ 233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
[ 233.734418] Call trace:
[ 233.734593] rb_update_pages+0x1a8/0x3f8
[ 233.734853] update_pages_handler+0x1c/0x38
[ 233.735148] process_one_work+0x1f0/0x468
[ 233.735525] worker_thread+0x54/0x410
[ 233.735852] kthread+0x124/0x138
[ 233.736064] ret_from_fork+0x10/0x20
[ 233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
[ 233.736959] ---[ end trace 0000000000000000 ]---
After analysis, the seq of the error is as follows [1-5]:
int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
int cpu_id)
{
for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
//1. get cpu_buffer, aka cpu_buffer(A)
...
...
schedule_work_on(cpu,
&cpu_buffer->update_pages_work);
//2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to
// update_pages_handler, do the update process, set 'update_done' in
// complete(&cpu_buffer->update_done) and to wakeup resize process.
//---->
//3. Just at this moment, ring_buffer_swap_cpu is triggered,
//cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer.
//ring_buffer_swap_cpu is called as the 'Call trace' below.
Call trace:
dump_backtrace+0x0/0x2f8
show_stack+0x18/0x28
dump_stack+0x12c/0x188
ring_buffer_swap_cpu+0x2f8/0x328
update_max_tr_single+0x180/0x210
check_critical_timing+0x2b4/0x2c8
tracer_hardirqs_on+0x1c0/0x200
trace_hardirqs_on+0xec/0x378
el0_svc_common+0x64/0x260
do_el0_svc+0x90/0xf8
el0_svc+0x20/0x30
el0_sync_handler+0xb0/0xb8
el0_sync+0x180/0x1c0
//<----
/* wait for all the updates to complete */
for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
//4. get cpu_buffer, cpu_buffer(B) is used in the following process,
//the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong.
//for example, cpu_buffer(A)->update_done will leave be set 1, and will
//not 'wait_for_completion' at the next resize round.
if (!cpu_buffer->nr_pages_to_update)
continue;
if (cpu_online(cpu))
wait_for_completion(&cpu_buffer->update_done);
cpu_buffer->nr_pages_to_update = 0;
}
...
}
//5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong,
//Continuing to run in the wrong state, then oops occurs.
Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn
Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull ext4 fixes from Ted Ts'o:
"Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's
checkpoint code"
* tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix rbtree traversal bug in ext4_mb_use_preallocated
ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()
ext4: correct inline offset when handling xattrs in inode body
jbd2: remove __journal_try_to_free_buffer()
jbd2: fix a race when checking checkpoint buffer busy
jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
jbd2: remove journal_clean_one_cp_list()
jbd2: remove t_checkpoint_io_list
jbd2: recheck chechpointing non-dirty buffer
Since commit 08d43a5fa063 ("tracing: Add lock-free tracing_map"),
this is never used, so can be removed.
Link: https://lore.kernel.org/linux-trace-kernel/20230722032123.24664-1-yuehaibing@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull smb client fix from Steve French:
"Add minor debugging improvement.
The change improves ability to read a network trace to debug problems
on encrypted connections which are very common (e.g. using wireshark
or tcpdump).
That works today with tools like 'smbinfo keys /mnt/file' but requires
passing in a filename on the mount (see e.g. [1]), but it often makes
more sense to just pass in the mount point path (ie a directory not a
filename).
So this fix was needed to debug some types of problems (an obvious
example is on an encrypted connection failing operations on an empty
share or with no files in the root of the directory) - so you can
simply pass in the 'smbinfo keys <mntpoint>' and get the information
that wireshark needs"
Link: https://wiki.samba.org/index.php/Wireshark_Decryption [1]
* tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko
cifs: allow dumping keys for directories too
During allocations, while looking for preallocations(PA) in the per
inode rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted
and that can cause direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.
To makes sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only pa that could possibly
overlap the original goal start is the one that satisfies the below
conditions:
1. It must have it's logical start immediately to the left of
(ie less than) original logical start.
2. It must not be deleted
To find this pa we use the following traversal method:
1. Descend into the rbtree normally to find the immediate neighboring
PA. Here we keep descending irrespective of if the PA is deleted or if
it overlaps with our request etc. The goal is to find an immediately
adjacent PA.
2. If the found PA is on right of original goal, use rb_prev() to find
the left adjacent PA.
3. Check if this PA is deleted and keep moving left with rb_prev() until
a non deleted PA is found.
4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.
This approach also takes care of having deleted PAs in the tree.
(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)
[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/
Cc: stable@kernel.org # 6.4
Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
KVM/arm64 fixes for 6.5, part #1
- Avoid pKVM finalization if KVM initialization fails
- Add missing BTI instructions in the hypervisor, fixing an early boot
failure on BTI systems
- Handle MMU notifiers correctly for non hugepage-aligned memslots
- Work around a bug in the architecture where hypervisor timer controls
have UNKNOWN behavior under nested virt.
- Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG
in cpu hotplug resulting from per-CPU accessor sanity checking.
- Make WFI emulation on GICv4 systems robust w.r.t. preemption,
consistently requesting a doorbell interrupt on vcpu_put()
- Uphold RES0 sysreg behavior when emulating older PMU versions
- Avoid macro expansion when initializing PMU register names, ensuring
the tracepoints pretty-print the sysreg.
The index field of the struct page corresponding to a guest ASCE should
be 0. When replacing the ASCE in s390_replace_asce(), the index of the
new ASCE should also be set to 0.
Having the wrong index might lead to the wrong addresses being passed
around when notifying pte invalidations, and eventually to validity
intercepts (VM crash) if the prefix gets unmapped and the notifier gets
called with the wrong address.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Fixes: faa2f72cb356 ("KVM: s390: pv: leak the topmost page table when destroy fails")
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20230705111937.33472-3-imbrenda@linux.ibm.com>
Like C source files, tooling can find it useful to have the assembly
source file compilation recorded.
The .S extension appears to used across all architectures.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull powerpc fixes from Michael Ellerman:
- Reinstate support for little endian ELFv1 binaries, which it turns
out still exist in the wild.
- Revert a change which used asm goto for WARN_ON/__WARN_FLAGS, as it
lead to dead code generation and seemed to trigger compiler bugs in
some edge cases.
- Fix a deadlock in the pseries VAS code, between live migration and
the driver's mmap handler.
- Disable KCOV instrumentation in the powerpc KASAN code.
Thanks to Andrew Donnellan, Benjamin Gray, Christophe Leroy, Haren
Myneni, Russell Currey, and Uwe Kleine-König.
* tag 'powerpc-6.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
Revert "powerpc/64s: Remove support for ELFv1 little endian userspace"
powerpc/kasan: Disable KCOV in KASAN code
powerpc/512x: lpbfifo: Convert to platform remove callback returning void
powerpc/crypto: Add gitignore for generated P10 AES/GCM .S files
Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"
powerpc/pseries/vas: Hold mmap_mutex after mmap lock during window close
In ext4_mb_choose_next_group_best_avail(), we want the start order to be
1 less than goal length and the min_order to be, at max, 1 more than the
original length. This commit fixes an off by one issue that arose due to
the fact that 1 << fls(n) > (n).
After all the processing:
order = 1 order below goal len
min_order = maximum of the three:-
- order - trim_order
- 1 order below B2C(s_stripe)
- 1 order above original len
Cc: stable@kernel.org
Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For those PMU system registers defined in sys_reg_descs[], use macro
PMU_SYS_REG() / PMU_PMEVCNTR_EL0 / PMU_PMEVTYPER_EL0 to define them, and
later two macros call macro PMU_SYS_REG() actually.
Currently the input parameter of PMU_SYS_REG() is another macro which is
calculation formula of the value of system registers, so for example, if
we want to "SYS_PMINTENSET_EL1" as the name of sys register, actually
the name we get is as following:
(((3) << 19) | ((0) << 16) | ((9) << 12) | ((14) << 8) | ((1) << 5))
The name of system register is used in some tracepoints such as
trace_kvm_sys_access(), if not set correctly, we need to analyze the
inaccurate name to get the exact name (which also is inconsistent with
other system registers), and also the inaccurate name occupies more space.
To fix the issue, use the name as a input parameter of PMU_SYS_REG like
MTE_REG or EL2_REG.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1689305920-170523-1-git-send-email-chenxiang66@hisilicon.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Simplify the shutdown of non-protected VMs. There is no need to do
complex manipulations of the counter if it was zero.
This also fixes a very rare race which caused pages to be torn down
from the address space with a non-zero counter even on older machines
that don't support the UVC instruction, causing a crash.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: fb491d5500a7 ("KVM: s390: pv: asynchronous destroy for reboot")
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20230705111937.33472-2-imbrenda@linux.ibm.com>
Pull xtensa fixes from Max Filippov:
- fix interaction between unaligned exception handler and load/store
exception handler
- fix parsing ISS network interface specification string
- add comment about etherdev freeing to ISS network driver
* tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa: fix unaligned and load/store configuration interaction
xtensa: ISS: fix call to split_if_spec
xtensa: ISS: add comment about etherdev freeing
Pull s390 fixes from Heiko Carstens:
- Fix per vma lock fault handling: add missing !(fault & VM_FAULT_ERROR)
check to fault handler to prevent error handling for return values
that don't indicate an error
- Use kfree_sensitive() instead of kfree() in paes crypto code to clear
memory that may contain keys before freeing it
- Fix reply buffer size calculation for CCA replies in zcrypt device
driver
* tag 's390-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: fix reply buffer calculations for CCA replies
s390/crypto: use kfree_sensitive() instead of kfree()
s390/mm: fix per vma lock fault handling
This reverts commit 606787fed7268feb256957872586370b56af697a.
ELFv1 with LE has never been a thing, and people who try to make ELFv1 LE
binaries are maniacs who need to be stopped, but unfortunately there are
ELFv1 LE binaries out there in the wild.
One such binary is the ppc64el (as Debian calls it) helper for
arch-test[0], a tool for detecting architectures that can be executed on a
given machine by means of attempting to execute helper binaries compiled
for each architecture and seeing which binaries succeed and fail. The
helpers are small snippets of assembly, and the ppc64el assembly doesn't
include the right directives to generate an ELFv2 binary.
This results in arch-test incorrectly determining that a ppc64el kernel
can't execute a ppc64el userspace, which in turn means that a number of
developer tools such as debootstrap will break (assuming arch-test is
installed).
[0] https://github.com/kilobyte/arch-test
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230719071821.320594-1-ajd@linux.ibm.com
Dumping the enc/dec keys is a session wide operation.
And it should not matter if the ioctl was run on
a regular file or a directory.
Currently, we obtain the tcon pointer from the
cifs file handle. But since there's no dir open call
in cifs, this is not populated for dirs.
This change allows dumping of session keys using ioctl
even for directories. To do this, we'll now get the
tcon pointer from the superblock, and not from the file
handle.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When run on a file system where the inline_data feature has been
enabled, xfstests generic/269, generic/270, and generic/476 cause ext4
to emit error messages indicating that inline directory entries are
corrupted. This occurs because the inline offset used to locate
inline directory entries in the inode body is not updated when an
xattr in that shared region is deleted and the region is shifted in
memory to recover the space it occupied. If the deleted xattr precedes
the system.data attribute, which points to the inline directory entries,
that attribute will be moved further up in the region. The inline
offset continues to point to whatever is located in system.data's former
location, with unfortunate effects when used to access directory entries
or (presumably) inline data in the inode body.
Cc: stable@kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The PMU event ID varies from 10 to 16 bits, depending on the PMU
version. If the PMU only supports 10 bits of event ID, bits [15:10] of
the evtCount field behave as RES0.
While the actual PMU emulation code gets this right (i.e. RES0 bits are
masked out when programming the perf event), the sysreg emulation writes
an unmasked value to the in-memory cpu context. The net effect is that
guest reads and writes of PMEVTYPER<n>_EL0 will see non-RES0 behavior in
the reserved bits of the field.
As it so happens, kvm_pmu_set_counter_event_type() already writes a
masked value to the in-memory context that gets overwritten by
access_pmu_evtyper(). Fix the issue by removing the unnecessary (and
incorrect) register write in access_pmu_evtyper().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230713221649.3889210-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Pull perf fix from Borislav Petkov:
- Fix a lockdep warning when the event given is the first one, no event
group exists yet but the code still goes and iterates over event
siblings
* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
Unaligned exception handler is needed in configurations with hardware
support for unaligned access when the load/store exception handler is
enabled because such configurations would still raise an exception on
unaligned access through the instruction bus.
Fixes: f29cf77609cc ("xtensa: add load/store exception handler")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Pull block fixes from Jens Axboe:
- Fix for loop regressions (Mauricio)
- Fix a potential stall with batched wakeups in sbitmap (David)
- Fix for stall with recursive plug flushes (Ross)
- Skip accounting of empty requests for blk-iocost (Chengming)
- Remove a dead field in struct blk_mq_hw_ctx (Chengming)
* tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux:
loop: do not enforce max_loop hard limit by (new) default
loop: deprecate autoloading callback loop_probe()
sbitmap: fix batching wakeup
blk-iocost: skip empty flush bio in iocost
blk-mq: delete dead struct blk_mq_hw_ctx->queued field
blk-mq: Fix stall due to recursive flush plug
The length information for available buffer space for CCA
replies is covered with two fields in the T6 header prepended
on each CCA reply: fromcardlen1 and fromcardlen2. The sum of
these both values must not exceed the AP bus limit for this
card (24KB for CEX8, 12KB CEX7 and older) minus the always
present headers.
The current code adjusted the fromcardlen2 value in case
of exceeding the AP bus limit when there was a non-zero
value given from userspace. Some tests now showed that this
was the wrong assumption. Instead the userspace value given for
this field should always be trusted and if the sum of the
two fields exceeds the AP bus limit for this card the first
field fromcardlen1 should be adjusted instead.
So now the calculation is done with this new insight in mind.
Also some additional checks for overflow have been introduced
and some comments to provide some documentation for future
maintainers of this complicated calculation code.
Furthermore the 128 bytes of fix overhead which is used
in the current code is not correct. Investigations showed
that for a reply always the same two header structs are
prepended before a possible payload. So this is also fixed
with this patch.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
As per the generic KASAN code in mm/kasan, disable KCOV with
KCOV_INSTRUMENT := n in the makefile.
This fixes a ppc64 boot hang when KCOV and KASAN are enabled.
kasan_early_init() gets called before a PACA is initialised, but the
KCOV hook expects a valid PACA.
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230710044143.146840-1-bgray@linux.ibm.com
__journal_try_to_free_buffer() has only one caller and it's logic is
much simple now, so just remove it and open code in
jbd2_journal_try_to_free_buffers().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Xiang reports that VMs occasionally fail to boot on GICv4.1 systems when
running a preemptible kernel, as it is possible that a vCPU is blocked
without requesting a doorbell interrupt.
The issue is that any preemption that occurs between vgic_v4_put() and
schedule() on the block path will mark the vPE as nonresident and *not*
request a doorbell irq. This occurs because when the vcpu thread is
resumed on its way to block, vcpu_load() will make the vPE resident
again. Once the vcpu actually blocks, we don't request a doorbell
anymore, and the vcpu won't be woken up on interrupt delivery.
Fix it by tracking that we're entering WFI, and key the doorbell
request on that flag. This allows us not to make the vPE resident
when going through a preempt/schedule cycle, meaning we don't lose
any state.
Cc: stable@vger.kernel.org
Fixes: 8e01d9a396e6 ("KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put")
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Suggested-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20230713070657.3873244-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Pull objtool fixes from Borislav Petkov:
- Mark copy_iovec_from_user() __noclone in order to prevent gcc from
doing an inter-procedural optimization and confuse objtool
- Initialize struct elf fully to avoid build failures
* tag 'objtool_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iov_iter: Mark copy_iovec_from_user() noclone
objtool: initialize all of struct elf
On SPR, the load latency event needs an auxiliary event in the same
group to work properly. There's a check in intel_pmu_hw_config()
for this to iterate sibling events and find a mem-loads-aux event.
The for_each_sibling_event() has a lockdep assert to make sure if it
disabled hardirq or hold leader->ctx->mutex. This works well if the
given event has a separate leader event since perf_try_init_event()
grabs the leader->ctx->mutex to protect the sibling list. But it can
cause a problem when the event itself is a leader since the event is
not initialized yet and there's no ctx for the event.
Actually I got a lockdep warning when I run the below command on SPR,
but I guess it could be a NULL pointer dereference.
$ perf record -d -e cpu/mem-loads/uP true
The code path to the warning is:
sys_perf_event_open()
perf_event_alloc()
perf_init_event()
perf_try_init_event()
x86_pmu_event_init()
hsw_hw_config()
intel_pmu_hw_config()
for_each_sibling_event()
lockdep_assert_event_ctx()
We don't need for_each_sibling_event() when it's a standalone event.
Let's return the error code directly.
Fixes: f3c0eba28704 ("perf: Add a few assertions")
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230704181516.3293665-1-namhyung@kernel.org
split_if_spec expects a NULL-pointer as an end marker for the argument
list, but tuntap_probe never supplied that terminating NULL. As a result
incorrectly formatted interface specification string may cause a crash
because of the random memory access. Fix that by adding NULL terminator
to the split_if_spec argument list.
Cc: stable@vger.kernel.org
Fixes: 7282bee78798 ("[PATCH] xtensa: Architecture support for Tensilica Xtensa Part 8")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Pull io_uring fixes from Jens Axboe:
- Fix for io-wq not always honoring REQ_F_NOWAIT, if it was set and
punted directly (eg via DRAIN) (me)
- Capability check fix (Ondrej)
- Regression fix for the mmap changes that went into 6.4, which
apparently broke IA64 (Helge)
* tag 'io_uring-6.5-2023-07-21' of git://git.kernel.dk/linux:
ia64: mmap: Consider pgoff when searching for free mapping
io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()
io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq
io_uring: don't audit the capability check in io_uring_create()
Problem:
The max_loop parameter is used for 2 different purposes:
1) initial number of loop devices to pre-create on init
2) maximum number of loop devices to add on access/open()
Historically, its default value (zero) caused 1) to create non-zero
number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on
2) to add devices with autoloading.
However, the default value changed in commit 85c50197716c ("loop: Fix
the max_loop commandline argument treatment when it is set to 0") to
CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices.
That does improve 1), but unfortunately it breaks 2), as the default
behavior changed from no-limit to hard-limit.
Example:
For example, this userspace code broke for N >= CONFIG, if the user
relied on the default value 0 for max_loop:
mknod("/dev/loopN");
open("/dev/loopN"); // now fails with ENXIO
Though affected users may "fix" it with (loop.)max_loop=0, this means to
require a kernel parameter change on stable kernel update (that commit
Fixes: an old commit in stable).
Solution:
The original semantics for the default value in 2) can be applied if the
parameter is not set (ie, default behavior).
This still keeps the intended function in 1) and 2) if set, and that
commit's intended improvement in 1) if max_loop=0.
Before 85c50197716c:
- default: 1) CONFIG devices 2) no limit
- max_loop=0: 1) CONFIG devices 2) no limit
- max_loop=X: 1) X devices 2) X limit
After 85c50197716c:
- default: 1) CONFIG devices 2) CONFIG limit (*)
- max_loop=0: 1) 0 devices (*) 2) no limit
- max_loop=X: 1) X devices 2) X limit
This commit:
- default: 1) CONFIG devices 2) no limit (*)
- max_loop=0: 1) 0 devices 2) no limit
- max_loop=X: 1) X devices 2) X limit
Future:
The issue/regression from that commit only affects code under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is
contained under it.
Once that deprecated functionality/code is removed, the purpose 2) of
max_loop (hard limit) is no longer in use, so the module parameter
description can be changed then.
Tests:
Linux 6.4-rc7
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
- default (original)
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
open: /dev/loop8: No such device or address
- default (patched)
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
#
- max_loop=0 (original & patched):
# ls -1 /dev/loop*
/dev/loop-control
# ./test-loop
#
- max_loop=8 (original & patched):
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
open: /dev/loop8: No such device or address
- max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set)
# ls -1 /dev/loop*
/dev/loop-control
# ./test-loop
open: /dev/loop8: No such device or address
Fixes: 85c50197716c ("loop: Fix the max_loop commandline argument treatment when it is set to 0")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
key might contain private part of the key, so better use
kfree_sensitive() to free it.
Signed-off-by: Wang Ming <machel@vivo.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Link: https://lore.kernel.org/r/20230717094533.18418-1-machel@vivo.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new() which already returns void. Eventually after all drivers
are converted, .remove_new() is renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230711143145.1192651-1-u.kleine-koenig@pengutronix.de
Before removing checkpoint buffer from the t_checkpoint_list, we have to
check both BH_Dirty and BH_Lock bits together to distinguish buffers
have not been or were being written back. But __cp_buffer_busy() checks
them separately, it first check lock state and then check dirty, the
window between these two checks could be raced by writing back
procedure, which locks buffer and clears buffer dirty before I/O
completes. So it cannot guarantee checkpointing buffers been written
back to disk if some error happens later. Finally, it may clean
checkpoint transactions and lead to inconsistent filesystem.
jbd2_journal_forget() and __journal_try_to_free_buffer() also have the
same problem (journal_unmap_buffer() escape from this issue since it's
running under the buffer lock), so fix them through introducing a new
helper to try holding the buffer lock and remove really clean buffer.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some bti instructions were missing from
commit b53d4a272349 ("KVM: arm64: Use BTI for nvhe")
1) kvm_host_psci_cpu_entry
kvm_host_psci_cpu_entry is called from __kvm_hyp_init_cpu through "br"
instruction as __kvm_hyp_init_cpu resides in idmap section while
kvm_host_psci_cpu_entry is in hyp .text so the offset is larger than
128MB range covered by "b".
Which means that this function should start with "bti j" instruction.
LLVM which is the only compiler supporting BTI for Linux, adds "bti j"
for jump tables or by when taking the address of the block [1].
Same behaviour is observed with GCC.
As kvm_host_psci_cpu_entry is a C function, this must be done in
assembly.
Another solution is to use X16/X17 with "br", as according to ARM
ARM DDI0487I.a RLJHCL/IGMGRS, PACIASP has an implicit branch
target identification instruction that is compatible with
PSTATE.BTYPE 0b01 which includes "br X16/X17"
And the kvm_host_psci_cpu_entry has PACIASP as it is an external
function.
Although, using explicit "bti" makes it more clear than relying on
which register is used.
A third solution is to clear SCTLR_EL2.BT, which would make PACIASP
compatible PSTATE.BTYPE 0b11 ("br" to other registers).
However this deviates from the kernel behaviour (in bti_enable()).
2) Spectre vector table
"br" instructions are generated at runtime for the vector table
(__bp_harden_hyp_vecs).
These branches would land on vectors in __kvm_hyp_vector at offset 8.
As all the macros are defined with valid_vect/invalid_vect, it is
sufficient to add "bti j" at the correct offset.
[1] https://reviews.llvm.org/D52867
Fixes: b53d4a272349 ("KVM: arm64: Use BTI for nvhe")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20230706152240.685684-1-smostafa@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We just sorted the entries and fields last release, so just out of a
perverse sense of curiosity, I decided to see if we can keep things
ordered for even just one release.
The answer is "No. No we cannot".
I suggest that all kernel developers will need weekly training sessions,
involving a lot of Big Bird and Sesame Street. And at the yearly
maintainer summit, we will all sing the alphabet song together.
I doubt I will keep doing this. At some point "perverse sense of
curiosity" turns into just a cold dark place filled with sadness and
despair.
Repeats: 80e62bc8487b ("MAINTAINERS: re-sort all entries and fields")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Borislav Petkov:
- Remove a cgroup from under a polling process properly
- Fix the idle sibling selection
* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: use kernfs polling functions for PSI trigger polling
sched/fair: Use recent_used_cpu to test p->cpus_ptr
Extend commit 50f9a76ef127 ("iov_iter: Mark
copy_compat_iovec_from_user() noinline") to also cover
copy_iovec_from_user(). Different compiler versions cause the same
problem on different functions.
lib/iov_iter.o: warning: objtool: .altinstr_replacement+0x1f: redundant UACCESS disable
lib/iov_iter.o: warning: objtool: iovec_from_user+0x84: call to copy_iovec_from_user.part.0() with UACCESS enabled
lib/iov_iter.o: warning: objtool: __import_iovec+0x143: call to copy_iovec_from_user.part.0() with UACCESS enabled
Fixes: 50f9a76ef127 ("iov_iter: Mark copy_compat_iovec_from_user() noinline")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20230616124354.GD4253@hirez.programming.kicks-ass.net
iss_net_configure explicitly frees etherdev in all error return paths
except one where register_netdevice fails. In that remaining error
return path the etherdev is freed by the iss_net_pdev_release callback
triggered by the platform_device_unregister call. Add a comment stating
that.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Pull devicetree fixes from Rob Herring:
- Fix moortec,mr75203 schema usage of 'multipleOf' keyword
- Fix regression in systems depending on "of-display" device name
- Build fix for s390 with CONFIG_PCI=n and OF_EARLY_FLATTREE=y
- Drop two obsolete serial .txt bindings
* tag 'devicetree-fixes-for-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: serial: Remove obsolete nxp,lpc1850-uart.txt
dt-bindings: serial: Remove obsolete cavium-uart.txt
dt-bindings: hwmon: moortec,mr75203: fix multipleOf for coefficients
of: Preserve "of-display" device name for compatibility
of: make OF_EARLY_FLATTREE depend on HAS_IOMEM
IA64 is the only architecture which does not consider the pgoff value when
searching for a possible free memory region with vm_unmapped_area().
Adding this seems to have no negative side effect on IA64, so add it now
to make IA64 consistent with all other architectures.
Cc: stable@vger.kernel.org # 6.4
Signed-off-by: Helge Deller <deller@gmx.de>
Tested-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-ia64@vger.kernel.org
Link: https://lore.kernel.org/r/20230721152432.196382-3-deller@gmx.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'probe' callback in __register_blkdev() is only used under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard.
The loop_probe() function is only used for that callback, so guard it
too, accordingly.
See commit fbdee71bb5d8 ("block: deprecate autoloading based on dev_t").
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With per-vma locks, handle_mm_fault() may return non-fatal error
flags. In this case the code should reset the fault flags before
returning.
Fixes: e06f47a16573 ("s390/mm: try VMA lock-based page fault handling first")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
aesp10-ppc.S and ghashp10-ppc.S are autogenerated and not tracked by
git, so they should be ignored. This is doing the same as the P8 files
in drivers/crypto/vmx/.gitignore but for the P10 files in
arch/powerpc/crypto.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230713042206.85669-1-ruscur@russell.cc
Following process,
jbd2_journal_commit_transaction
// there are several dirty buffer heads in transaction->t_checkpoint_list
P1 wb_workfn
jbd2_log_do_checkpoint
if (buffer_locked(bh)) // false
__block_write_full_page
trylock_buffer(bh)
test_clear_buffer_dirty(bh)
if (!buffer_dirty(bh))
__jbd2_journal_remove_checkpoint(jh)
if (buffer_write_io_error(bh)) // false
>> bh IO error occurs <<
jbd2_cleanup_journal_tail
__jbd2_update_log_tail
jbd2_write_superblock
// The bh won't be replayed in next mount.
, which could corrupt the ext4 image, fetch a reproducer in [Link].
Since writeback process clears buffer dirty after locking buffer head,
we can fix it by try locking buffer and check dirtiness while buffer is
locked, the buffer head can be removed if it is neither dirty nor locked.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest
memory. This is even the case with hugepages, although it is a rather
suboptimal configuration as PTE level mappings are used at stage-2.
The arm64 page aging handlers have an assumption that the specified
range is exactly one page/block of memory, which in the aforementioned
case is not necessarily true. All together this leads to the WARN() in
kvm_age_gfn() firing.
However, the WARN is only part of the issue as the table walkers visit
at most a single leaf PTE. For hugepage-backed memory in a memslot that
isn't hugepage-aligned, page aging entirely misses accesses to the
hugepage beyond the first page in the memslot.
Add a new walker dedicated to handling page aging MMU notifiers capable
of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new
walker and drop the WARN that caught the issue in the first place. The
implementation of this walker was inspired by the test_clear_young()
implementation by Yu Zhao [*], but repurposed to address a bug in the
existing aging implementation.
Cc: stable@vger.kernel.org # v5.15
Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")
Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/
Co-developed-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Pull dma-mapping fixes from Christoph Hellwig:
- swiotlb area sizing fixes (Petr Tesarik)
* tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: reduce the number of areas to match actual memory pool size
swiotlb: always set the number of areas before allocating the pool
Pull pin control fixes from Linus Walleij:
"I'm mostly on vacation but what would vacation be without a few
critical fixes so people can use their gaming laptops when hiding away
from the sun (or rain)?
- Fix a really annoying interrupt storm in the AMD driver affecting
Asus TUF gaming notebooks
- Fix device tree parsing in the Renesas driver"
* tag 'pinctrl-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: amd: Unify debounce handling into amd_pinconf_set()
pinctrl: amd: Drop pull up select configuration
pinctrl: amd: Use amd_pinconf_set() for all config options
pinctrl: amd: Only use special debounce behavior for GPIO 0
pinctrl: renesas: rzg2l: Handle non-unique subnode names
pinctrl: renesas: rzv2m: Handle non-unique subnode names