commits

This patch avoids threads live-locking for hours when a large number
of threads are competing over the last few free extents as blocks
are added to and removed from preallocation pools. From our bug
reporter:

A reliable way to trigger this is to have multiple writers
continuously write() to files when the filesystem is full, while
small amounts of space are freed (e.g. by truncating a large file
by 1MiB at a time). In the local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPC) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way since the Linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE RPCs which result in ENOSPC -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.

When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.

The real-world scenario of this is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safe keeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiB have been
written/read.

The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)
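
As an illustration only, the idea of capping retries so that competing
writers eventually give up with ENOSPC can be sketched in user-space C.
The limit of three, the context structure, and every sketch_* name
below are assumptions made for this example; they are not taken from
the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_RETRIES 3	/* assumed limit, for illustration only */

/* Hypothetical stand-in for allocator state. */
struct sketch_ctx {
	int attempts;	/* how many allocation attempts were made */
	int succeed_on;	/* attempt number that succeeds; 0 = never */
};

/* Pretend allocation: succeeds only from attempt 'succeed_on' onwards. */
static bool sketch_try_alloc(struct sketch_ctx *ctx)
{
	ctx->attempts++;
	return ctx->succeed_on != 0 && ctx->attempts >= ctx->succeed_on;
}

/* Pretend discard: always claims that some preallocated blocks were
 * released, which is the case that invites endless retrying. */
static bool sketch_discard_prealloc(struct sketch_ctx *ctx)
{
	(void)ctx;
	return true;
}

/* Retry after a discard, but only a bounded number of times. */
static int sketch_alloc_with_retry(struct sketch_ctx *ctx)
{
	int retries = 0;

	assert(ctx != NULL);
	for (;;) {
		if (sketch_try_alloc(ctx))
			return 0;
		if (!sketch_discard_prealloc(ctx))
			return -ENOSPC;	/* nothing was freed, give up */
		if (++retries >= SKETCH_MAX_RETRIES)
			return -ENOSPC;	/* cap reached, give up */
	}
}
```

Without the cap, a caller whose discard step always reports progress
while another thread keeps claiming the freed blocks would loop
indefinitely; with it, the caller returns an error after a fixed
number of attempts.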

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

3y ago

Linus Torvalds

f0cc7c00

Merge tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

3y ago

Dan Williams

b3bbcc5d

Merge branch 'for-6.0/dax' into libnvdimm-fixes

3y ago

Luís Henriques

29a5b8a1

ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0

When walking through an inode's extents, the ext4_ext_binsearch_idx() function
assumes that the extent header has been previously validated. However, there
are no checks that verify that the number of entries (eh->eh_entries) is
non-zero when depth is > 0. This will lead to problems because
EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:

[ 135.245946] ------------[ cut here ]------------
[ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
[ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
[ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
[ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[ 135.256475] Code:
[ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[ 135.277952] Call Trace:
[ 135.278635] <TASK>
[ 135.279247] ? preempt_count_add+0x6d/0xa0
[ 135.280358] ? percpu_counter_add_batch+0x55/0xb0
[ 135.281612] ? _raw_read_unlock+0x18/0x30
[ 135.282704] ext4_map_blocks+0x294/0x5a0
[ 135.283745] ? xa_load+0x6f/0xa0
[ 135.284562] ext4_mpage_readpages+0x3d6/0x770
[ 135.285646] read_pages+0x67/0x1d0
[ 135.286492] ? folio_add_lru+0x51/0x80
[ 135.287441] page_cache_ra_unbounded+0x124/0x170
[ 135.288510] filemap_get_pages+0x23d/0x5a0
[ 135.289457] ? path_openat+0xa72/0xdd0
[ 135.290332] filemap_read+0xbf/0x300
[ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40
[ 135.292192] new_sync_read+0x103/0x170
[ 135.293014] vfs_read+0x15d/0x180
[ 135.293745] ksys_read+0xa1/0xe0
[ 135.294461] do_syscall_64+0x3c/0x80
[ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying that
eh_entries is not 0 when eh_depth is > 0.
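
A minimal user-space sketch of such a check follows. The structure is a
simplified stand-in that models only the two fields named above, the
function name and error string are hypothetical, and endianness
conversion is omitted; it is not the kernel's __ext4_ext_check().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an extent header (assumption: only the two
 * fields discussed in the commit message are modelled). */
struct sketch_extent_header {
	uint16_t eh_entries;	/* number of valid entries in this node */
	uint16_t eh_depth;	/* 0 for a leaf, > 0 for an index node */
};

/* Returns NULL when the header passes, or a description of the problem.
 * Hypothetical helper for illustration. */
static const char *sketch_check_header(const struct sketch_extent_header *eh)
{
	assert(eh != NULL);

	/* An index node with no entries has nothing to descend into. */
	if (eh->eh_entries == 0 && eh->eh_depth > 0)
		return "header has no entries but non-zero depth";

	return NULL;
}
```

An empty leaf (eh_entries == 0 and eh_depth == 0) still passes in this
sketch, since only the combination of zero entries and non-zero depth
is rejected.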

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

3y ago

Linus Torvalds

105a36f3

Merge tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

3y ago

Dan Carpenter

b7af938f

i2c: mux: harden i2c_mux_alloc() against integer overflows

3y ago

Li Jinlin

17d9c15c

fsdax: Fix infinite loop in dax_iomap_rw()

3y ago

Dan Williams

67feaba4

devdax: Fix soft-reservation memory description

3y ago

Jan Kara

83e80a6e

ext4: use buckets for cr 1 block scan instead of rbtree

3y ago

Linus Torvalds

23b99237

Merge tag 's390-6.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

3y ago

Nick Desaulniers

32ef9e50

Makefile.debug: re-enable debug info for .S files

3y ago

Asmaa Mnebhi

37f071ec

i2c: mlxbf: Fix frequency calculation

3y ago

Andy Shevchenko

d34213eb

nvdimm/namespace: drop nested variable in create_namespace_pmem()

3y ago

Linus Torvalds

521a547c

Linux 6.0-rc6 v6.0-rc6

3y ago

Jan Kara

a9f2a293

ext4: use locality group preallocation for small closed files

3y ago

Linus Torvalds

42f9508b

Merge tag 'pm-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

3y ago

Tony Krowiak

1918f2b2

s390/vfio-ap: bypass unnecessary processing of AP resources

3y ago

Nick Desaulniers

61f2b7c7

Makefile.debug: set -g unconditional on CONFIG_DEBUG_INFO_SPLIT

3y ago

Asmaa Mnebhi

de24aceb

i2c: mlxbf: prevent stack overflow in mlxbf_i2c_smbus_start_transaction()

3y ago

Shivaprasad G Bhat

69053101

ndtest: Cleanup all of blk namespace specific code

3y ago

Linus Torvalds

7c18b453

Merge tag 'parisc-for-6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

3y ago

Jan Kara

613c5a85

ext4: make directory inode spreading reflect flexbg size

3y ago

Linus Torvalds

1a61b828

Merge tag 'char-misc-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

3y ago

Rafael J. Wysocki

9614369a

Merge tag 'opp-fixes-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm

3y ago

Alexander Gordeev

8d96bba7

s390/smp: enforce lowcore protection on CPU restart

3y ago

Masahiro Yamada

2154aca2

certs: make system keyring depend on built-in x509 parser

3y ago

Asmaa Mnebhi

2a5be6d1

i2c: mlxbf: incorrect base address passed during io write

3y ago

Jane Chu

149d1714

pmem: fix a name collision

3y ago

Linus Torvalds

38eddeed

Merge tag 'io_uring-6.0-2022-09-18' of git://git.kernel.dk/linux

3y ago

Helge Deller

805ce861

parisc: Allow CONFIG_64BIT with ARCH=parisc

3y ago

Jan Kara

1940265e

ext4: avoid unnecessary spreading of allocations among groups

3y ago

Linus Torvalds

7e2cd21e

Merge tag 'tty-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

3y ago

William Breathitt Gray

2bc54aaa

counter: 104-quad-8: Fix skipped IRQ lines during events configuration

3y ago

Rob Herring

c7e31e36

dt-bindings: opp: Add missing (unevaluated|additional)Properties on child nodes

3y ago

Alexander Gordeev

12dd19c1

s390/boot: fix absolute zero lowcore corruption on boot

3y ago

Zeng Heng

03764b30

Kconfig: remove unused function 'menu_get_root_menu'

3y ago

Wolfram Sang

9d55e7b0

Documentation: i2c: fix references to other documents

3y ago

Linus Torvalds

88084a3d

Linux 5.19-rc5 v5.19-rc5

3y ago

Linus Torvalds

a335366b

Merge tag 'gpio-fixes-for-v6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

3y ago

Stefan Metzmacher

9bd3f728

io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC

3y ago

Rolf Eike Beer

e359b70c

parisc: remove obsolete manual allocation aligning in iosapic

3y ago

Jan Kara

4fca50d4

ext4: make mballoc try target group first even with mb_optimize_scan

3y ago

Linus Torvalds

1772094f

Merge tag 'cgroup-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

3y ago

Olof Johansson

64379204

serial: sifive: enable clocks for UART when probed

3y ago

Greg Kroah-Hartman

ab4bbde8

Merge tag 'fpga-for-6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/fpga/linux-fpga into char-misc-linus

3y ago

Christophe JAILLET

d36cb843

OPP: Fix an un-initialized variable usage

3y ago

Gerald Schaefer

7c8d42fd

s390/hugetlb: fix prepare_hugepage_range() check for 2 GB hugepages

3y ago

yangxingwu

237fe727

scripts/clang-tools: remove unused module

3y ago

Wolfram Sang

2c2c72ec

MAINTAINERS: remove Nehal Shah from AMD MP2 I2C DRIVER

3y ago

Linus Torvalds

b8d5109f

lockref: remove unused 'lockref_get_or_lock()' function

Looking at the conditional lock acquire functions in the kernel due to
the new sparse support (see commit 4a557a5d1a61 "sparse: introduce
conditional lock acquire function attribute"), it became obvious that
the lockref code has a couple of them, but they don't match the usual
naming convention for the other ones, and their return value logic is
also reversed.

In the other very similar places, the naming pattern is '*_and_lock()'
(eg 'atomic_dec_and_lock()' and 'refcount_dec_and_lock()'), and the
function returns true when the lock is taken.

The lockref code is superficially very similar to the refcount code,
only with the special "atomic wrt the embedded lock" semantics. But
instead of the '*_and_lock()' naming it uses '*_or_lock()'.

And instead of returning true in case it took the lock, it returns true
if it *didn't* take the lock.

Now, arguably the lockref code is quite logical: it really is an "either
decrement _or_ lock" kind of situation - and the return value is about
whether the operation succeeded without any special care needed.
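
The two return-value conventions can be contrasted with a deliberately
simplified, single-threaded model. Nothing here is atomic, the "lock" is
just a flag, and all sketch_* names are hypothetical; it only mirrors
which outcome each style reports as true.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a count plus a flag standing in for the embedded lock. */
struct sketch_ref {
	int count;
	bool locked;
};

/* '*_and_lock' style: always decrements; returns true only when the
 * count reached zero, in which case the lock has been taken. */
static bool sketch_dec_and_lock(struct sketch_ref *r)
{
	assert(r->count > 0);
	if (--r->count > 0)
		return false;
	r->locked = true;
	return true;
}

/* '*_or_lock' style: either decrements without the lock and returns
 * true, or leaves the count alone, takes the lock and returns false. */
static bool sketch_put_or_lock(struct sketch_ref *r)
{
	if (r->count > 1) {
		r->count--;
		return true;
	}
	r->locked = true;
	return false;
}
```

In the first style a true result means "you now hold the lock"; in the
second it means "done, no lock was needed", which is the reversal the
message describes.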

So despite the similarities, the differences do make some sense, and
maybe it's not worth trying to unify the different conditional locking
primitives in this area.

But while looking at this all, it did become obvious that the
'lockref_get_or_lock()' function hasn't actually had any users for
almost a decade.

The only user it ever had was the shortlived 'd_rcu_to_refcount()'
function, and it got removed and replaced with 'lockref_get_not_dead()'
back in 2013 in commits 0d98439ea3c6 ("vfs: use lockred 'dead' flag to
mark unrecoverably dead dentries") and e5c832d55588 ("vfs: fix dentry
RCU to refcounting possibly sleeping dput()")

In fact, that single use was removed less than a week after the whole
function was introduced in commit b3abd80250c1 ("lockref: add
'lockref_get_or_lock()' helper") so this function has been around for a
decade, but only had a user for six days.

Let's just put this mis-designed and unused function out of its misery.

We can think about the naming and semantic oddities of the remaining
'lockref_put_or_lock()' later, but at least that function has users.

And while the naming is different and the return value doesn't match,
that function matches the whole '{atomic,refcount}_dec_and_test()'
pattern much better (ie the magic happens when the count goes down to
zero, not when it is incremented from zero).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3y ago

Linus Torvalds

6879c2d3

Merge tag 'pinctrl-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

3y ago

Sergio Paracuellos

09eed5a1

gpio: mt7621: Make the irqchip immutable

3y ago

Pavel Begunkov

e3366e02

io_uring/net: fix zc fixed buf lifetime

3y ago

Ben Hutchings

95363747

tools/include/uapi: Fix <asm/errno.h> for parisc and xtensa

3y ago

Linus Torvalds

7e18e42e

Linux 6.0-rc4 v6.0-rc4

3y ago

Linus Torvalds

aae8dda5

Merge tag 'wq-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

3y ago

Ming Lei

df02452f

cgroup: cgroup_get_from_id() must check the looked-up kn is a directory

3y ago

Linux 6.0-rc7 v6.0-rc7

f76349cf

Linus Torvalds

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

5e049663

Linus Torvalds

Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

4207d595

Linus Torvalds

ext4: limit the number of retries after discarding preallocations blocks

80fa46d6

Theodore Ts'o

ext4: use buckets for cr 1 block scan instead of rbtree

Using rbtree for sorting groups by average fragment size is relatively
expensive (needs rbtree update on every block freeing or allocation) and
leads to wide spreading of allocations because selection of block group
is very sensitive both to changes in free space and amount of blocks
allocated. Furthermore, selecting the group with the best matching average
fragment size is not necessary anyway, even more so because the
variability of fragment sizes within a group is likely large so the average
is not telling much. We just need a group with large enough average
fragment size so that we have high probability of finding large enough
free extent and we don't want average fragment size to be too big so
that we are likely to find free extent only somewhat larger than what we
need.

So instead of maintaining an rbtree of groups sorted by fragment size, keep
bins (lists) of groups where average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires fewer updates on block allocation
/ freeing, generally avoids chaotic spreading of allocations into block
groups, and still is able to quickly (even faster than the rbtree)
provide a block group which is likely to have a suitably sized free
space extent.
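
To make the binning concrete, here is a small user-space sketch that
maps a group's average fragment size to a bin index i such that the
average lies in [2^i, 2^(i+1)). The function name, the bin count of 16
and the clamping of large averages into the last bin are assumptions
made for this example; they do not describe the actual mballoc code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_BINS 16	/* assumed number of bins, illustration only */

/* Returns the bin index for a group, or -1 if the group has no free
 * fragments and therefore belongs in no bin. */
static int sketch_bin_index(uint32_t free_blocks, uint32_t fragments)
{
	uint32_t avg;
	int order = 0;

	if (free_blocks == 0 || fragments == 0)
		return -1;

	avg = free_blocks / fragments;	/* average fragment size in blocks */
	if (avg == 0)
		return -1;		/* inconsistent input */

	/* order = floor(log2(avg)), so 2^order <= avg < 2^(order + 1) */
	while (avg > 1) {
		avg >>= 1;
		order++;
	}

	/* Very large averages share the last bin in this sketch. */
	if (order >= SKETCH_NR_BINS)
		order = SKETCH_NR_BINS - 1;

	return order;
}
```

A group only has to move between lists when its average crosses a power
of two, which is why this layout needs fewer updates than re-sorting a
tree on every allocation or free.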

This patch reduces the number of block groups used when untarring an archive
with medium sized files (size somewhat above 64k which is the default
mballoc limit for avoiding locality group preallocation) to about half
and thus improves write speeds for eMMC flash significantly.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>