commits

bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.

Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:

- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?

The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.

This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.

The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.

The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.

Sample reproducer, note you'll need to change the path to the bdi and
device:

SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192

mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs

btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync

echo 4 > /proc/sys/vm/drop_caches

echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb

while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done

Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2y ago

Yicong Yang

fff67c1b

i2c: hisi: Only handle the interrupt of the driver's transfer

2y ago

Linus Torvalds

fb0d9199

Merge tag 'ata-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

2y ago

Andrea Righi

b0554488

rust: fix bindgen build error with UBSAN_BOUNDS_STRICT

2y ago

Linus Torvalds

3e327154

vfs: get rid of old '->iterate' directory operation

2y ago

Tony Lindgren

dfe2aeb2

serial: 8250: Fix oops for port->pm on uart_change_pm()

2y ago

Linus Torvalds

bf98bae3

Merge tag 'x86_urgent_for_v6.5_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
"Extraordinary embargoed times call for extraordinary measures. That's
why this week's x86/urgent branch is larger than usual, containing all
the known fallout fixes after the SRSO mitigation got merged.

I know, it is a bit late in the game but everyone who has reported a
bug stemming from the SRSO pile, has tested that branch and has
confirmed that it fixes their bug.

Also, I've run it on every possible hardware I have and it is looking
good. It is running on this very machine while I'm typing, for 2 days
now without an issue. Famous last words...

- Use LEA ...%rsp instead of ADD %rsp in the Zen1/2 SRSO return
sequence as latter clobbers flags which interferes with fastop
emulation in KVM, leading to guests freezing during boot

- A fix for the DIV(0) quotient data leak on Zen1 to clear the
divider buffers at the right time

- Disable the SRSO mitigation on unaffected configurations as it got
enabled there unnecessarily

- Change .text section name to fix CONFIG_LTO_CLANG builds

- Improve the optprobe indirect jmp check so that certain
configurations can still be able to use optprobes at all

- A serious and good scrubbing of the untraining routines by PeterZ:
- Add proper speculation stopping traps so that objtool is happy
- Adjust objtool to handle the new thunks
- Make the thunk pointer assignable to the different untraining
sequences at runtime, thus avoiding the alternative at the
return thunk. It simplifies the code a bit too.
- Add a entry_untrain_ret() main entry point which selects the
respective untraining sequence
- Rename things so that they're more clear
- Fix stack validation with FRAME_POINTER=y builds

- Fix static call patching to handle when a JMP to the return thunk
is the last insn on the very last module memory page

- Add more documentation about what each untraining routine does and
why"

* tag 'x86_urgent_for_v6.5_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/srso: Correct the mitigation status when SMT is disabled
x86/static_call: Fix __static_call_fixup()
objtool/x86: Fixup frame-pointer vs rethunk
x86/srso: Explain the untraining sequences a bit more
x86/cpu/kvm: Provide UNTRAIN_RET_VM
x86/cpu: Cleanup the untrain mess
x86/cpu: Rename srso_(.*)_alias to srso_alias_\1
x86/cpu: Rename original retbleed methods
x86/cpu: Clean up SRSO return thunk mess
x86/alternative: Make custom return thunk unconditional
objtool/x86: Fix SRSO mess
x86/cpu: Fix up srso_safe_ret() and __x86_return_thunk()
x86/cpu: Fix __x86_return_thunk symbol type
x86/retpoline,kprobes: Skip optprobe check for indirect jumps with retpolines and IBT
x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG
x86/srso: Disable the mitigation on unaffected configurations
x86/CPU/AMD: Fix the DIV(0) initial fix attempt
x86/retpoline: Don't clobber RFLAGS during srso_safe_ret()

2y ago

Fabio Estevam

2908042a

media: imx: imx7-media-csi: Fix applying format constraints

2y ago

Sweet Tea Dorminy

c984ff14

blk-crypto: dynamically allocate fallback profile

2y ago

Yue Haibing

082f6131

fbdev: kyro: Remove unused declarations

2y ago

Anand Jain

b471965f

btrfs: fix replace/scrub failure with metadata_uuid

2y ago

Parker Newman

27ec43c7

i2c: tegra: Fix i2c-tegra DMA config option processing

2y ago

Linus Torvalds

f6a69168

Merge tag '6.5-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6

2y ago

Damien Le Moal

0a858905

ata,scsi: do not issue START STOP UNIT on resume

During system resume, ata_port_pm_resume() triggers ata EH to
1) Resume the controller
2) Reset and rescan the ports
3) Revalidate devices
This EH execution is started asynchronously from ata_port_pm_resume(),
which means that when sd_resume() is executed, none or only part of the
above processing may have been executed. However, sd_resume() issues a
START STOP UNIT to wake up the drive from sleep mode. This command is
translated to ATA with ata_scsi_start_stop_xlat() and issued to the
device. However, depending on the state of execution of the EH process
and revalidation triggerred by ata_port_pm_resume(), two things may
happen:
1) The START STOP UNIT fails if it is received before the controller has
been reenabled at the beginning of the EH execution. This is visible
with error messages like:

ata10.00: device reported invalid CHS sector 0
sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
sd 9:0:0:0: PM: failed to resume async: error -5

2) The START STOP UNIT command is received while the EH process is
on-going, which mean that it is stopped and must wait for its
completion, at which point the command is rather useless as the drive
is already fully spun up already. This case results also in a
significant delay in sd_resume() which is observable by users as
the entire system resume completion is delayed.

Given that ATA devices will be woken up by libata activity on resume,
sd_resume() has no need to issue a START STOP UNIT command, which solves
the above mentioned problems. Do not issue this command by introducing
the new scsi_device flag no_start_on_resume and setting this flag to 1
in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
UNIT command only if this flag is not set.

Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Tanner Watkins <dalzot@gmail.com>
Tested-by: Paul Ausbeck <paula@soe.ucsc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

2y ago

Alice Ryhl

1d24eb2d

rust: delete `ForeignOwnable::borrow_mut`

2y ago

Linus Torvalds

0a2c2baa

proc: fix missing conversion to 'iterate_shared'

I'm looking at the directory handling due to the discussion about f_pos
locking (see commit 797964253d35: "file: reinstate f_pos locking
optimization for regular files"), and wanting to clean that up.

And one source of ugliness is how we were supposed to move filesystems
over to the '->iterate_shared()' function that only takes the inode lock
for reading many many years ago, but several filesystems still use the
bad old '->iterate()' that takes the inode lock for exclusive access.

See commit 6192269444eb ("introduce a parallel variant of ->iterate()")
that also added some documentation stating

Old method is only used if the new one is absent; eventually it will
be removed. Switch while you still can; the old one won't stay.

and that was back in April 2016. Here we are, many years later, and the
old version is still clearly sadly alive and well.

Now, some of those old style iterators are probably just because the
filesystem may end up having per-inode mutable data that it uses for
iterating a directory, but at least one case is just a mistake.

Al switched over most filesystems to use '->iterate_shared()' back when
it was introduced. In particular, the /proc filesystem was converted as
one of the first ones in commit f50752eaa0b0 ("switch all procfs
directories ->iterate_shared()").

But then later one new user of '->iterate()' was then re-introduced by
commit 6d9c939dbe4d ("procfs: add smack subdir to attrs").

And that's clearly not what we wanted, since that new case just uses the
same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions
that other /proc pident directories use, and they are most definitely
safe to use with the inode lock held shared.

So just fix it.

This still leaves a fair number of oddball filesystems using the
old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2,
overlayfs, and vboxsf), but at least we don't have any remaining in the
core filesystems.

I'm going to add a wrapper function that just drops the read-lock and
takes it as a write lock, so that we can clean up the core vfs layer and
make all the ugly 'this filesystem needs exclusive inode locking' be
just filesystem-internal warts.

I just didn't want to make that conversion when we still had a core user
left.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>

2y ago

Tony Lindgren

bbb4abb1

serial: 8250: Reinit port_id when adding back serial8250_isa_devs

2y ago

Linus Torvalds

4e7ffde6

Merge tag 'powerpc-6.5-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

2y ago

Borislav Petkov (AMD)

6405b72e

x86/srso: Correct the mitigation status when SMT is disabled

2y ago

Laurent Pinchart

6d00f4ec

media: uvcvideo: Fix menu count handling for userspace XU mappings

2y ago

Ming Lei

c164c7bc

blk-cgroup: hold queue_lock when removing blkg->q_node

2y ago

Uwe Kleine-König

bc2e67ac

fbdev: ssd1307fb: Print the PWM's label instead of its number

2y ago

Filipe Manana

9b378f6a

btrfs: fix infinite directory reads

The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.

The following C program and test script reproduce the problem:

$ cat /mnt/readdir_prog.c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
DIR *dir = opendir(".");
struct dirent *dd;

while ((dd = readdir(dir))) {
printf("%s\n", dd->d_name);
rename(dd->d_name, "TEMPFILE");
rename("TEMPFILE", dd->d_name);
}
closedir(dir);
}

$ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c

$ cat test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV &> /dev/null
#mkfs.xfs -f $DEV &> /dev/null
#mkfs.ext4 -F $DEV &> /dev/null

mount $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= 2000; i++)); do
echo -n > $MNT/testdir/file_$i
done

cd $MNT/testdir
/mnt/readdir_prog

cd /mnt

umount $MNT

This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.

So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.

Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2y ago

Thierry Reding

b3497ef4

i2c: tegra: Fix failure during probe deferral cleanup

2y ago

Linus Torvalds

251a94f1

Merge tag 'powerpc-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

2y ago

Paulo Alcantara

11260c3d

smb: client: fix dfs link mount against w2k8

2y ago

Linus Torvalds

5d0c230f

Linux 6.5-rc4 v6.5-rc4

2y ago

Boqun Feng

b3d8aa84

rust: allocator: Prevent mis-aligned allocation

2y ago

Aleksa Sarai

a0fc452a

open: make RESOLVE_CACHED correctly test for O_TMPFILE

2y ago

Tony Lindgren

6be1a8d5

serial: core: Fix kmemleak issue for serial core device remove

2y ago

Linus Torvalds

d4ddefee

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

2y ago

Nathan Lynch

4f317597

powerpc/rtas_flash: allow user copy to flash block cache objects

2y ago

Peter Zijlstra

54097309

x86/static_call: Fix __static_call_fixup()

2y ago

Chen-Yu Tsai

8329d0c7

media: mtk-jpeg: Set platform driver data earlier

2y ago

Linux 6.5-rc7 v6.5-rc7

706a7415

Linus Torvalds

Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

b320441c

Linus Torvalds

Merge tag 'rust-fixes-6.5-rc7' of https://github.com/Rust-for-Linux/linux

ec27a636

Linus Torvalds

serial: core: Fix serial core port id, including multiport devices

04c7f60c

Tony Lindgren

Merge tag 'i2c-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

9e6c269d

Linus Torvalds

rust: macros: vtable: fix `HAS_*` redefinition (`gen_const_name`)

3fa7187e

Qingsong Chen

serial: 8250: drop lockdep annotation from serial8250_clear_IER()

3d9e6f55

Jiri Slaby (SUSE)

Merge tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

12e6cced

Linus Torvalds

i2c: bcm-iproc: Fix bcm_iproc_i2c_isr deadlock issue

4caf4cb1

Chengfeng Ye

Linux 6.5-rc5 v6.5-rc5

52a93d39

Linus Torvalds

tty: n_gsm: fix the UAF caused by race condition in gsm_cleanup_mux

3c4f8333

Yi Yang

Merge tag 'fbdev-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

b5cab28b

Linus Torvalds

btrfs: fix incorrect splitting in btrfs_drop_extent_map_range

In production we were seeing a variety of WARN_ON()'s in the extent_map
code, specifically in btrfs_drop_extent_map_range() when we have to call
add_extent_mapping() for our second split.

Consider the following extent map layout

PINNED
[0 16K) [32K, 48K)

and then we call btrfs_drop_extent_map_range for [0, 36K), with
skip_pinned == true. The initial loop will have

start = 0
end = 36K
len = 36K

we will find the [0, 16k) extent, but since we are pinned we will skip
it, which has this code

start = em_end;
if (end != (u64)-1)
len = start + len - em_end;

em_end here is 16K, so now the values are

start = 16K
len = 16K + 36K - 16K = 36K

len should instead be 20K. This is a problem when we find the next
extent at [32K, 48K), we need to split this extent to leave [36K, 48k),
however the code for the split looks like this

split->start = start + len;
split->len = em_end - (start + len);

In this case we have

em_end = 48K
split->start = 16K + 36K // this should be 16K + 20K
split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K

and now we have an invalid extent_map in the tree that potentially
overlaps other entries in the extent map. Even in the non-overlapping
case we will have split->start set improperly, which will cause problems
with any block related calculations.

We don't actually need len in this loop, we can simply use end as our
end point, and only adjust start up when we find a pinned extent we need
to skip.

Adjust the logic to do this, which keeps us from inserting an invalid
extent map.

We only skip_pinned in the relocation case, so this is relatively rare,
except in the case where you are running relocation a lot, which can
happen with auto relocation on.

Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>