commits

This patch avoids threads live-locking for hours when a large number of
threads are competing over the last few free extents as the blocks are
added to and removed from preallocation pools. From our bug
reporter:

A reliable way for triggering this has multiple writers
continuously write() to files when the filesystem is full, while
small amounts of space are freed (e.g. by truncating a large file
-1MiB at a time). In the local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPC) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way since the Linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE RPCs which result in ENOSPC -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.

When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.

The real-world scenario of this is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safe keeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiB have been
written/read.

The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
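
The live-lock described above comes from retrying without bound. A minimal user-space sketch of a bounded retry loop follows; it is not the kernel code. The names `fake_allocator`, `try_allocate`, `allocate_with_retry_limit` and the cap `MAX_DISCARD_RETRIES` are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>

/* Hypothetical cap; the value is chosen for illustration only. */
#define MAX_DISCARD_RETRIES 3

/* Hypothetical allocator state used to simulate lost races. */
struct fake_allocator {
    int failures_left; /* attempts that will still lose the race */
    int attempts;      /* total attempts made so far */
};

/* One allocation attempt; returns true when a free extent was claimed. */
static bool try_allocate(struct fake_allocator *a)
{
    a->attempts++;
    if (a->failures_left > 0) {
        a->failures_left--;
        return false;
    }
    return true;
}

/*
 * Try once, then retry at most MAX_DISCARD_RETRIES times. When the
 * budget is exhausted, give up so the caller can report an
 * out-of-space error instead of spinning indefinitely.
 * Returns 0 on success, -1 on giving up.
 */
static int allocate_with_retry_limit(struct fake_allocator *a)
{
    for (int retries = 0; retries <= MAX_DISCARD_RETRIES; retries++) {
        if (try_allocate(a))
            return 0;
    }
    return -1;
}
```

With a cap in place, a writer that keeps losing the race returns an error after a fixed number of attempts, which bounds the time any one thread can spend in the loop.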

3y ago

Linus Torvalds

e5fa173f

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

3y ago

Daniel Vetter

414208e4

Merge tag 'amd-drm-fixes-6.0-2022-09-30-1' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

3y ago

Florian Westphal

30c19366

mm: fix BUG splat with kvmalloc + GFP_ATOMIC

3y ago

Hangyu Hua

37238699

media: dvb_vb2: fix possible out of bound access

3y ago

Patrice Chotard

f5c5936d

usb: dwc3: st: Fix node's child name

3y ago

Jarkko Sakkinen

133e049a

x86/sgx: Do not fail on incomplete sanitization on premature stop of ksgxd

3y ago

Linus Torvalds

373eff57

Merge tag 'bitmap-6.0-rc3' of github.com:/norov/linux

3y ago

Luca Ceresoli

0ebafe2e

.mailmap: update Luca Ceresoli's e-mail address

3y ago

Linus Torvalds

f0cc7c00

Merge tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

3y ago

Dan Williams

b3bbcc5d

Merge branch 'for-6.0/dax' into libnvdimm-fixes

3y ago

Luís Henriques

29a5b8a1

ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0

When walking through an inode's extents, the ext4_ext_binsearch_idx() function
assumes that the extent header has been previously validated. However, there
are no checks that verify that the number of entries (eh->eh_entries) is
non-zero when depth is > 0. This leads to problems because
EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:

[ 135.245946] ------------[ cut here ]------------
[ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
[ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
[ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
[ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[ 135.256475] Code:
[ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[ 135.277952] Call Trace:
[ 135.278635] <TASK>
[ 135.279247] ? preempt_count_add+0x6d/0xa0
[ 135.280358] ? percpu_counter_add_batch+0x55/0xb0
[ 135.281612] ? _raw_read_unlock+0x18/0x30
[ 135.282704] ext4_map_blocks+0x294/0x5a0
[ 135.283745] ? xa_load+0x6f/0xa0
[ 135.284562] ext4_mpage_readpages+0x3d6/0x770
[ 135.285646] read_pages+0x67/0x1d0
[ 135.286492] ? folio_add_lru+0x51/0x80
[ 135.287441] page_cache_ra_unbounded+0x124/0x170
[ 135.288510] filemap_get_pages+0x23d/0x5a0
[ 135.289457] ? path_openat+0xa72/0xdd0
[ 135.290332] filemap_read+0xbf/0x300
[ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40
[ 135.292192] new_sync_read+0x103/0x170
[ 135.293014] vfs_read+0x15d/0x180
[ 135.293745] ksys_read+0xa1/0xe0
[ 135.294461] do_syscall_64+0x3c/0x80
[ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying that
eh_entries is not 0 when eh_depth is > 0.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
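
The extra header check described in this commit can be illustrated with a small user-space sketch. It is a simplified model, not the kernel's __ext4_ext_check(): the struct `sketch_extent_header`, the function `sketch_ext_check` and the error strings are assumptions made for illustration, and fields are treated as host-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an extent header (host-endian in this sketch). */
struct sketch_extent_header {
    uint16_t eh_entries; /* number of valid entries */
    uint16_t eh_max;     /* capacity of the node in entries */
    uint16_t eh_depth;   /* 0 for a leaf, > 0 for an index node */
};

/* Returns NULL when the header passes, otherwise a description of the problem. */
static const char *sketch_ext_check(const struct sketch_extent_header *eh)
{
    if (eh->eh_max == 0)
        return "invalid eh_max";
    if (eh->eh_entries > eh->eh_max)
        return "invalid eh_entries";
    /* An index node (depth > 0) with no entries has nothing to search. */
    if (eh->eh_entries == 0 && eh->eh_depth > 0)
        return "eh_entries is 0 but eh_depth is > 0";
    return NULL;
}
```

Rejecting such a header at validation time means later code that searches the index entries never sees an empty index node.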

3y ago

Linus Torvalds

c816f2e9

Merge tag 'perf-tools-fixes-for-v6.0-2022-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

- Fail the 'perf test record' entry on error, fixing a regression where
just setup stuff like allocating memory and not the actual things
being tested failed.

- Fixup disabling of -Wdeprecated-declarations for the python scripting
engine, the previous attempt had a brown paper bag thinko.

- Fix branch stack sampling test to include sanity check for branch
filter on PowerPC.

- Update is_ignored_symbol function to match the kernel ignored list,
fixing running the 'perf test' entry that compares resolving symbols
from kallsyms to resolving from vmlinux.

- Augment the data source type with ARM's neoverse_spe list, the
previous code was limited in its search resolving the data source.

- Fix some clang 15 variable set but unused cases.

- Get a perf cgroup more portably in BPF as the
__builtin_preserve_enum_value builtin is not available in older
versions of clang. In those cases we can forgo BPF's CO-RE (Compile
Once, Run Everywhere).

- More fixes for Intel's hybrid CPU model.

* tag 'perf-tools-fixes-for-v6.0-2022-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf build: Fixup disabling of -Wdeprecated-declarations for the python scripting engine
perf tests mmap-basic: Remove unused variable to address clang 15 warning
perf parse-events: Ignore clang 15 warning about variable set but unused in bison produced code
perf tests record: Fail the test if the 'errs' counter is not zero
perf test: Fix test case 87 ("perf record tests") for hybrid systems
perf arm-spe: augment the data source type with neoverse_spe list
perf tests vmlinux-kallsyms: Update is_ignored_symbol function to match the kernel ignored list
perf tests powerpc: Fix branch stack sampling test to include sanity check for branch filter
perf parse-events: Remove "not supported" hybrid cache events
perf print-events: Fix "perf list" can not display the PMU prefix for some hybrid cache events
perf tools: Get a perf cgroup more portably in BPF

3y ago

Peng Fan

daaa2fbe

clk: imx93: drop of_match_ptr

3y ago

Dave Airlie

6643b383

Merge tag 'drm-intel-fixes-2022-09-29' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

3y ago

Hawking Zhang

0fd85e89

drm/amdgpu/gfx11: switch to amdgpu_gfx_rlc_init_microcode

3y ago

Lukas Bulwahn

b674dedd

MAINTAINERS: drop entry to removed file in ARM/RISCPC ARCHITECTURE

3y ago

Hans Verkuil

f0da34f3

media: v4l2-ioctl.c: fix incorrect error path

3y ago

Heikki Krogerus

415ba26c

usb: typec: ucsi: Remove incorrect warning

3y ago

Tony Luck

5515d21c

x86/cpu: Add CPU model numbers for Meteor Lake

3y ago

Linus Torvalds

8379c0b3

Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

3y ago

Sander Vanheule

5d7fef08

lib/cpumask_kunit: add tests file to MAINTAINERS

3y ago

Peter Xu

3d2f78f0

mm/mprotect: only reference swap pfn page if type match

Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to
fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]:

kernel BUG at include/linux/swapops.h:117!
CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L 6.0.0-dbg-DEV #2
RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
FS: 00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
change_pte_range+0x36e/0x880
change_p4d_range+0x2e8/0x670
change_protection_range+0x14e/0x2c0
mprotect_fixup+0x1ee/0x330
do_mprotect_pkey+0x34c/0x440
__x64_sys_mprotect+0x1d/0x30

It triggers because pfn_swap_entry_to_page() could be called upon e.g. a
genuine swap entry.

Fix it by only calling it when it's a write migration entry where the page*
is used.

[1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS+k1w@mail.gmail.com/

Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Yu Zhao <yuzhao@google.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
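
The idea of the fix, checking the entry type before converting an entry to a page, can be modelled in user space. This is only a sketch under assumed names: `sketch_entry_type`, `sketch_entry`, `sketch_entry_to_page` and `sketch_page_for_write_migration` are hypothetical and do not correspond to kernel APIs.

```c
#include <stddef.h>

/* Hypothetical entry kinds for this model. */
enum sketch_entry_type {
    SKETCH_GENUINE_SWAP,
    SKETCH_READ_MIGRATION,
    SKETCH_WRITE_MIGRATION,
};

struct sketch_page {
    int id;
};

struct sketch_entry {
    enum sketch_entry_type type;
    struct sketch_page *page; /* only meaningful for migration entries */
};

/* Counts page lookups so that an unwanted lookup is visible. */
static int sketch_lookups;

/* Stand-in for the entry-to-page conversion. */
static struct sketch_page *sketch_entry_to_page(const struct sketch_entry *e)
{
    sketch_lookups++;
    return e->page;
}

/*
 * Look up the page only for a write migration entry, the one case in
 * which the page is used. Every other entry type returns NULL without
 * performing the lookup.
 */
static struct sketch_page *
sketch_page_for_write_migration(const struct sketch_entry *e)
{
    if (e->type != SKETCH_WRITE_MIGRATION)
        return NULL;
    return sketch_entry_to_page(e);
}
```

In this model a genuine swap entry never reaches the conversion helper, which mirrors the ordering the commit message describes.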

3y ago

Linus Torvalds

105a36f3

Merge tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

3y ago

Dan Carpenter

b7af938f

i2c: mux: harden i2c_mux_alloc() against integer overflows

3y ago

Li Jinlin

17d9c15c

fsdax: Fix infinite loop in dax_iomap_rw()

3y ago

Dan Williams

67feaba4

devdax: Fix soft-reservation memory description

3y ago

Jan Kara

83e80a6e

ext4: use buckets for cr 1 block scan instead of rbtree

3y ago

Linus Torvalds

920541bb

Merge tag 'for-linus-6.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm

3y ago

Arnaldo Carvalho de Melo

8e8bf60a

perf build: Fixup disabling of -Wdeprecated-declarations for the python scripting engine

3y ago

Florian Fainelli

1b24a132

clk: iproc: Do not rely on node name for correct PLL setup

3y ago

Linux 6.0 v6.0

4fe89d07

Linus Torvalds

Merge tag 'i2c-for-6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

a962b54e

Linus Torvalds

Merge tag 'perf-urgent-2022-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

febae48a

Linus Torvalds

i2c: davinci: fix PM disable depth imbalance in davinci_i2c_probe

e2062df7

Zhang Qilong

Merge tag 'x86_urgent_for_v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

534b0abc

Linus Torvalds

perf/core: Fix reentry problem in perf_output_read_group()

6b959ba2

Yang Jihong

dt-bindings: i2c: st,stm32-i2c: Document wakeup-source property

367d4c88

Marek Vasut

Merge tag 'usb-6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

b357fd1c

Linus Torvalds

x86/cacheinfo: Add a cpu_llc_shared_mask() UP variant

df5b035b

Borislav Petkov

perf/x86/core: Completely disable guest PEBS via guest's global_ctrl

f2aeea57

Like Xu

dt-bindings: i2c: st,stm32-i2c: Document interrupt-names property

f938a529

Marek Vasut

Merge tag 'media/v6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

89f2ddce

Linus Torvalds

uas: ignore UAS for Thinkplus chips

0fb9703a

Hongling Zeng

x86/alternative: Fix race in try_get_desc()

efd608fa

Nadav Amit

perf/x86/intel: Fix unchecked MSR access error for Alder Lake N

24919fde

Kan Liang

Linux 6.0-rc7 v6.0-rc7

f76349cf

Linus Torvalds

Merge tag 'mm-hotfixes-stable-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2a4b6e13

Linus Torvalds

media: rkvdec: Disable H.264 error detection

3a99c447

Nicolas Dufresne

usb-storage: Add Hiksemi USB3-FW to IGNORE_UAS

e00b488e

Hongling Zeng

ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems

e400ad8b

Dave Hansen

Linux 6.0-rc3 v6.0-rc3

b90cb105

Linus Torvalds

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

5e049663

Linus Torvalds

Merge tag 'drm-fixes-2022-10-01' of git://anongit.freedesktop.org/drm/drm

ffb4d94b

Linus Torvalds

damon/sysfs: fix possible memleak on damon_sysfs_add_target

1c8e2349

Levi Yun

media: mediatek: vcodec: Drop platform_get_resource(IORESOURCE_IRQ)

a2d2e593

Nícolas F. R. A. Prado

uas: add no-uas quirk for Hiksemi usb_disk

a625a4b8

Hongling Zeng

x86/sgx: Handle VA page allocation failure for EAUG on PF.

81fa6fd1

Haitao Huang

Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

b467192e

Linus Torvalds

Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

4207d595

Linus Torvalds

ext4: limit the number of retries after discarding preallocations blocks

80fa46d6

Theodore Ts'o

ext4: use buckets for cr 1 block scan instead of rbtree

Using rbtree for sorting groups by average fragment size is relatively
expensive (needs rbtree update on every block freeing or allocation) and
leads to wide spreading of allocations because selection of block group
is very sensitive both to changes in free space and to the number of blocks
allocated. Furthermore selecting group with the best matching average
fragment size is not necessary anyway, even more so because the
variability of fragment sizes within a group is likely large so average
is not telling much. We just need a group with large enough average
fragment size so that we have high probability of finding large enough
free extent and we don't want average fragment size to be too big so
that we are likely to find free extent only somewhat larger than what we
need.

So instead of maintaining an rbtree of groups sorted by fragment size, keep
bins (lists) of groups where the average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires fewer updates on block allocation
/ freeing, generally avoids chaotic spreading of allocations into block
groups, and is still able to quickly (even faster than the rbtree)
provide a block group which is likely to have a suitably sized free
space extent.

This patch reduces the number of block groups used when untarring an archive
with medium-sized files (size somewhat above 64k, which is the default
mballoc limit for avoiding locality group preallocation) to about half
and thus improves write speeds for eMMC flash significantly.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
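
The binning by average fragment size described above can be sketched as a function mapping a size to the index i with 2^i <= size < 2^(i+1). This is an illustrative user-space sketch, not the mballoc implementation: the function name `sketch_bucket_for_avg`, the bucket count `SKETCH_NUM_BUCKETS` and the handling of a zero average are assumptions.

```c
#include <stdint.h>

/* Hypothetical number of bins for this sketch. */
#define SKETCH_NUM_BUCKETS 16

/*
 * Return the bin index i such that 2^i <= avg < 2^(i+1), clamped to
 * the last bin. An average of 0 is placed in bin 0 in this sketch.
 */
static unsigned int sketch_bucket_for_avg(uint32_t avg_fragment_size)
{
    unsigned int i = 0;

    /* Count how many times the value can be halved before reaching 1. */
    while (avg_fragment_size > 1) {
        avg_fragment_size >>= 1;
        i++;
    }
    if (i >= SKETCH_NUM_BUCKETS)
        i = SKETCH_NUM_BUCKETS - 1;
    return i;
}
```

A group changes bins only when its average fragment size crosses a power of two, so most allocations and frees leave its bin unchanged, which is where the reduction in update cost comes from.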

83e80a6e

Jan Kara
