commits

Moves the location of the superblock logging zones. The new locations of
the logging zones are now determined based on fixed block addresses
instead of on fixed zone numbers.

The old placement method based on fixed zone numbers causes problems when
one needs to inspect a file system image without access to the drive zone
information. In such a case, the superblock locations cannot be reliably
determined as the zone size is unknown. By locating the superblock logging
zones using fixed addresses, we can scan a dumped file system image without
the zone information since a superblock copy will always be present at or
after the fixed known locations.

Introduce the following three pairs of zones containing fixed offset
locations, regardless of the device zone size.

- primary superblock: offset 0B (and the following zone)
- first copy: offset 512G (and the following zone)
- second copy: offset 4T (4096G, and the following zone)

If a logging zone is outside of the disk capacity, we do not record the
superblock copy.
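
As a rough illustration of the placement rule above, the sketch below maps
each fixed byte offset to the first zone of its logging pair and checks
whether the pair fits on the device. The names (sb_log_offset,
sb_log_first_zone, sb_log_zones_present) are hypothetical and are not taken
from the patch. The power-of-two zone size and the "both zones must fit"
reading of the capacity rule are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Fixed byte offsets named in the commit message: 0, 512G and 4T. */
static const uint64_t sb_log_offset[3] = { 0, 512 * GiB, 4096 * GiB };

/*
 * First zone of the logging pair for a mirror (0 = primary).
 * Assumption: zone_size is a nonzero power of two.
 */
static uint64_t sb_log_first_zone(unsigned int mirror, uint64_t zone_size)
{
	assert(mirror < 3);
	assert(zone_size != 0 && (zone_size & (zone_size - 1)) == 0);
	return sb_log_offset[mirror] / zone_size;
}

/*
 * Assumption: a copy is recorded only when both zones of its pair
 * exist on a device that has nr_zones zones.
 */
static bool sb_log_zones_present(unsigned int mirror, uint64_t zone_size,
				 uint64_t nr_zones)
{
	return sb_log_first_zone(mirror, zone_size) + 1 < nr_zones;
}
```

With 256MB zones, for example, the first copy would start at zone 2048
(512G / 256M), whatever the total number of zones on the device.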

The first copy position is much larger than for a non-zoned filesystem,
which is at 64M. This is to avoid overlapping with the log zones for
the primary superblock. This higher location is arbitrary but allows
supporting devices with very large zone sizes, with some spare space in
between.

Such a large zone size is unrealistic and very unlikely to ever be seen in
real devices. Currently, SMR disks have a zone size of 256MB, and we are
expecting ZNS drives to be in the 1-4GB range, so this limit gives us
room to breathe. For now, we only allow zone sizes up to 8GB. The
maximum zone size that would still fit in the space is 256G.
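
One way to read the 256G figure: the primary pair occupies the first two
zones, so it ends at twice the zone size, and that end must not pass the
first copy at 512G. The helper below is a hypothetical sketch of that
arithmetic only. Its name and the macro names are not from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define FIRST_COPY_OFFSET (512 * GiB)

/*
 * The primary logging pair covers [0, 2 * zone_size). It stays clear of
 * the first copy as long as 2 * zone_size <= 512G, i.e. zone_size <= 256G.
 */
static bool primary_pair_clear_of_first_copy(uint64_t zone_size)
{
	return zone_size <= FIRST_COPY_OFFSET / 2;
}
```

Under this reading, the 8GB limit mentioned above leaves a wide margin.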

The fixed location addresses are somewhat arbitrary, with the intent of
maintaining superblock reliability for smaller and larger devices, with
the preference for the latter. For this reason, there are two superblocks
under the first 1T. This should cover use cases for physical devices and
for emulated/device-mapper devices.

The superblock logging zones are reserved for superblock logging and
never used for data or metadata blocks. Note that we only reserve the
two zones per primary/copy actually used for superblock logging. We do
not reserve the ranges of zones possibly containing superblocks with the
largest supported zone size (0-16G, 512G-528G, 4096G-4112G).

The zones containing the fixed location offsets used to store
superblocks on a non-zoned volume are also reserved to avoid confusion.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4y ago

Linus Torvalds

f5ce0466

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

4y ago

Arnd Bergmann

b9a9786a

Merge tag 'omap-for-v5.12/fixes-rc6-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes

4y ago

Linus Torvalds

8db5efb8

Merge tag 'pinctrl-v5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

4y ago

Linus Torvalds

06f838e0

Merge tag 'x86_urgent_for_v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Arnd Bergmann

6d48b791

lockdep: Address clang -Wformat warning printing for %hd

4y ago

Omar Sandoval

c1d6abda

btrfs: fix check_data_csum() error message for direct I/O

4y ago

Linus Torvalds

c98ff1d0

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

4y ago

Fredrik Strupe

d2f7eca6

ARM: 9071/1: uprobes: Don't hook on thumb instructions

4y ago

Arnd Bergmann

aa68a778

Merge tag 'qcom-drivers-fixes-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes

4y ago

Tony Lindgren

fc85dc42

ARM: OMAP2+: Fix uninitialized sr_inst

4y ago

Linus Torvalds

e77a830c

Merge branch 'akpm' (patches from Andrew)

4y ago

Andy Shevchenko

482715ff

pinctrl: core: Show pin numbers for the controllers with base = 0

4y ago

Linus Torvalds

52e44129

Merge branch 'for-5.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

4y ago

Thomas Tai

632a1c20

x86/traps: Correct exc_general_protection() and math_error() return paths

4y ago

Tetsuo Handa

3a85969e

lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message

4y ago

Filipe Manana

0bb78830

btrfs: fix sleep while in non-sleep context during qgroup removal

While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
producing the following trace:

[821.843637] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
[821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, name: podman
[821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: G W 5.11.6 #15
[821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 12/08/2020
[821.843647] Call Trace:
[821.843650] dump_stack+0xa1/0xfb
[821.843656] ___might_sleep+0x144/0x160
[821.843659] mutex_lock+0x17/0x40
[821.843662] kernfs_remove_by_name_ns+0x1f/0x80
[821.843666] sysfs_remove_group+0x7d/0xe0
[821.843668] sysfs_remove_groups+0x28/0x40
[821.843670] kobject_del+0x2a/0x80
[821.843672] btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
[821.843685] __del_qgroup_rb+0x12/0x150 [btrfs]
[821.843696] btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
[821.843707] btrfs_ioctl+0x3129/0x36a0 [btrfs]
[821.843717] ? __mod_lruvec_page_state+0x5e/0xb0
[821.843719] ? page_add_new_anon_rmap+0xbc/0x150
[821.843723] ? kfree+0x1b4/0x300
[821.843725] ? mntput_no_expire+0x55/0x330
[821.843728] __x64_sys_ioctl+0x5a/0xa0
[821.843731] do_syscall_64+0x33/0x70
[821.843733] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[821.843736] RIP: 0033:0x4cd3fb
[821.843741] RSP: 002b:000000c000906b20 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[821.843744] RAX: ffffffffffffffda RBX: 000000c000050000 RCX: 00000000004cd3fb
[821.843745] RDX: 000000c000906b98 RSI: 000000004010942a RDI: 000000000000000f
[821.843747] RBP: 000000c000907cd0 R08: 000000c000622901 R09: 0000000000000000
[821.843748] R10: 000000c000d992c0 R11: 0000000000000206 R12: 000000000000012d
[821.843749] R13: 000000000000012c R14: 0000000000000200 R15: 0000000000000049

Fix this by removing the qgroup sysfs entry while not holding the spinlock,
since the spinlock is only meant for protection of the qgroup rbtree.
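
The shape of such a fix can be modelled in plain C. This is a toy userspace
sketch, not the btrfs code: toy_lock stands in for the spinlock,
toy_sysfs_del for a teardown step that may sleep, and every name here is
hypothetical. The counter records any call to the sleeping step made while
the lock is held.

```c
#include <stdbool.h>

struct toy_lock { bool held; };                 /* stand-in for a spinlock */
struct toy_qgroup { bool in_tree; bool in_sysfs; };

static int sleep_in_atomic_bugs;                /* counts rule violations */

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Stand-in for a teardown step that may sleep (it takes a mutex). */
static void toy_sysfs_del(const struct toy_lock *l, struct toy_qgroup *qg)
{
	if (l->held)
		sleep_in_atomic_bugs++;
	qg->in_sysfs = false;
}

static void toy_remove_qgroup(struct toy_lock *l, struct toy_qgroup *qg)
{
	toy_lock_acquire(l);
	qg->in_tree = false;    /* only the tree update needs the lock */
	toy_lock_release(l);

	toy_sysfs_del(l, qg);   /* may sleep, so it runs with the lock dropped */
}
```

Calling toy_sysfs_del() between the acquire and the release would bump the
counter, which mirrors the warning in the trace above.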

Reported-by: Stuart Shelton <srcshelton@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7A5485BB-0628-419D-A4D3-27B1AF47E25A@gmail.com/
Fixes: 49e5fb46211de0 ("btrfs: qgroup: export qgroups in sysfs")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4y ago

Linus Torvalds

aba5970c

Merge tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm

4y ago

Jolly Shah

176ddd89

scsi: libsas: Reset num_scatter if libata marks qc as NODATA

4y ago

Russell King

30e3b4f2

ARM: footbridge: fix PCI interrupt mapping

4y ago

Arnd Bergmann

974be36e

Merge tag 'sunxi-fixes-for-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into arm/fixes

4y ago

Shawn Guo

0c9fdcdb

soc: qcom: geni: shield geni_icc_get() for ACPI boot

4y ago

Tony Lindgren

a1ebdb37

ARM: dts: Fix swapped mmc order for omap3

4y ago

Linus Torvalds

95838bd9

Merge tag 'block-5.12-2021-04-23' of git://git.kernel.dk/linux-block

4y ago

Vasily Averin

1974c45d

tools/cgroup/slabinfo.py: updated to work on current kernel

4y ago

Linus Walleij

33cc5270

Merge tag 'intel-pinctrl-v5.12-4' of gitolite.kernel.org:pub/scm/linux/kernel/git/pinctrl/intel into fixes

4y ago

Linus Torvalds

efc2da92

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

4y ago

Roman Gushchin

0760fa3d

percpu: make pcpu_nr_empty_pop_pages per chunk type

4y ago

William Roche

3a62583c

RAS/CEC: Correct ce_add_elem()'s returned values

4y ago

Kan Liang

2dc0572f

perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT

4y ago

Filipe Manana

8d488a8c

btrfs: fix subvolume/snapshot deletion not triggered on mount

4y ago

Linus Torvalds

194cf482

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

4y ago

Dave Airlie

796b556c

Merge tag 'vmwgfx-fixes-2021-04-14' of gitlab.freedesktop.org:zack/vmwgfx into drm-fixes

4y ago

Mike Christie

0dcf8feb

scsi: iscsi: Fix iSCSI cls conn state

4y ago

Vladimir Murzin

45c2f70c

ARM: 9069/1: NOMMU: Fix conversion for_each_membock() to for_each_mem_range()

4y ago

Linus Torvalds

e49d033b

Linux 5.12-rc6 v5.12-rc6

4y ago

Linux 5.12 v5.12

9f4ad9e4

Linus Torvalds

Merge tag 'perf-tools-fixes-for-v5.12-2021-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

d2d09fbe

Linus Torvalds

Merge tag 'perf_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

24dfc390

Linus Torvalds

perf map: Fix error return code in maps__clone()

c6f87141

Zhen Lei

Merge tag 'locking_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

0146da0d

Linus Torvalds

perf/x86/kvm: Fix Broadwell Xeon stepping in isolation_ucodes[]

4b2f1e59

Jim Mattson

perf ftrace: Fix access to pid in array when setting a pid filter

671b60cb

Thomas Richter

Merge tag 'sched_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

682b26bd

Linus Torvalds

locking/qrwlock: Fix ordering in queued_write_lock_slowpath()

While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock. The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange completes successfully,
and that operation isn’t ordered. This exposes the window between the
acquire and the cmpxchg to an A-B-A problem which allows reads
following the lock acquisition to observe values speculatively before
the write lock is truly acquired.

We've seen a problem in epoll where the reader does an xchg while
holding the read lock, but the writer can see a value change out from
under it.

Writer                                | Reader
--------------------------------------------------------------------------------
ep_scan_ready_list()                  |
|- write_lock_irq()                   |
    |- queued_write_lock_slowpath()   |
      |- atomic_cond_read_acquire()   |
                                      | read_lock_irqsave(&ep->lock, flags);
   --> (observes value before unlock) |  chain_epi_lockless()
   |                                  |    epi->next = xchg(&ep->ovflist, epi);
   |                                  | read_unlock_irqrestore(&ep->lock, flags);
   |                                  |
   |     atomic_cmpxchg_relaxed()     |
   |-- READ_ONCE(ep->ovflist);        |

A core can order the read of the ovflist ahead of the
atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire
semantics addresses this issue, at which point the atomic_cond_read can
be switched to use relaxed semantics.
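
The ordering described above can be sketched with C11 atomics. This is a
userspace analogy under stated assumptions, not the kernel implementation:
the TOY_* values and the function name are hypothetical, and the C11 calls
stand in for the kernel's atomic helpers. The spin uses a relaxed load, and
the acquire ordering is attached to the successful compare-and-exchange.

```c
#include <stdatomic.h>

/* Hypothetical lock-word values, used by this sketch only. */
#define TOY_WRITER_WAITING 0x100u
#define TOY_WRITER_LOCKED  0x0ffu

static void toy_write_lock_slowpath(atomic_uint *cnts)
{
	unsigned int seen;

	do {
		/* Spin with a relaxed load until only the waiting bit is set. */
		do {
			seen = atomic_load_explicit(cnts, memory_order_relaxed);
		} while (seen != TOY_WRITER_WAITING);
		/*
		 * Acquire ordering sits on the successful exchange, so reads
		 * after the lock is taken cannot be ordered ahead of it.
		 */
	} while (!atomic_compare_exchange_strong_explicit(
			 cnts, &seen, TOY_WRITER_LOCKED,
			 memory_order_acquire, memory_order_relaxed));
}
```

If a reader slips in between the load and the exchange, the exchange fails
and the loop goes back to spinning, so nothing is read under a lock that
was never taken.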

Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock")
Signed-off-by: Ali Saidi <alisaidi@amazon.com>
[peterz: use try_cmpxchg()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Steve Capper <steve.capper@arm.com>