commits

IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but hasn't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:

BUG: kernel NULL pointer dereference, address: 00000000000000d8
PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__lock_acquire+0x19d/0x18c0
Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe
RSP: 0018:ffffc90001933828 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8
RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8
FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
lock_acquire+0x31a/0x440
? close_fd_get_file+0x39/0x160
? __lock_acquire+0x647/0x18c0
_raw_spin_lock+0x2c/0x40
? close_fd_get_file+0x39/0x160
close_fd_get_file+0x39/0x160
io_issue_sqe+0x1334/0x14e0
? lock_acquire+0x31a/0x440
? __io_free_req+0xcf/0x2e0
? __io_free_req+0x175/0x2e0
? find_held_lock+0x28/0xb0
? io_wq_submit_work+0x7f/0x240
io_wq_submit_work+0x7f/0x240
io_wq_cancel_cb+0x161/0x580
? io_wqe_wake_worker+0x114/0x360
? io_uring_get_socket+0x40/0x40
io_async_find_and_cancel+0x3b/0x140
io_issue_sqe+0xbe1/0x14e0
? __lock_acquire+0x647/0x18c0
? __io_queue_sqe+0x10b/0x5f0
__io_queue_sqe+0x10b/0x5f0
? io_req_prep+0xdb/0x1150
? mark_held_locks+0x6d/0xb0
? mark_held_locks+0x6d/0xb0
? io_queue_sqe+0x235/0x4b0
io_queue_sqe+0x235/0x4b0
io_submit_sqes+0xd7e/0x12a0
? _raw_spin_unlock_irq+0x24/0x30
? io_sq_thread+0x3ae/0x940
io_sq_thread+0x207/0x940
? do_wait_intr_irq+0xc0/0xc0
? __ia32_sys_io_uring_enter+0x650/0x650
kthread+0x134/0x180
? kthread_create_worker_on_cpu+0x90/0x90
ret_from_fork+0x1f/0x30

Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.

For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.

Cc: stable@vger.kernel.org
Fixes: b5dba59e0cf7 ("io_uring: add support for IORING_OP_CLOSE")
Reported-by: "Abaci <abaci@linux.alibaba.com>"
Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5y ago

Jinyang He

19170492

sh: Remove unused HAVE_COPY_THREAD_TLS macro

5y ago

Linus Torvalds

832bceef

Merge tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

5y ago

Saravana Kannan

e020ff61

driver core: Fix device link device name collision

5y ago

Greg Kroah-Hartman

b11f623c

Merge tag 'misc-habanalabs-fixes-2021-01-21' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into char-misc-linus

5y ago

Thomas Gleixner

78502582

powerpc/mm/highmem: use __set_pte_at() for kmap_local()

5y ago

Jens Axboe

b4f66425

Merge tag 'nvme-5.11-2021-01-14' of git://git.infradead.org/nvme into block-5.11

5y ago

Xiao Ni

dc5d17a3

md: Set prev_flush_start and flush_bio in an atomic way

5y ago

Christoph Hellwig

9275c206

nvme-pci: refactor nvme_unmap_data

5y ago

Pavel Begunkov

0b5cd6c3

io_uring: fix skipping disabling sqo on exec

5y ago

Christoph Hellwig

b7aaf16d

sh: remove CONFIG_IDE from most defconfig

5y ago

Linus Torvalds

4da81fa2

Merge tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

5y ago

Greg Kroah-Hartman

a1bfb0cc

Merge tag 'iio-fixes-for-5.11a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus

5y ago

Rafael J. Wysocki

3d1cf435

driver core: Extend device_is_dependent()

5y ago

Alexander Shishkin

cb5c681a

intel_th: pci: Add Alder Lake-P support

5y ago

Oded Gabbay

2dc4a6d7

habanalabs: disable FW events on device removal

5y ago

Thomas Gleixner

8c0d5d78

mips/mm/highmem: use set_pte() for kmap_local()

5y ago

Coly Li

5342fd42

bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET

5y ago

Sagi Grimberg

5ab25a32

nvme: don't intialize hwmon for discovery controllers

5y ago

Chaitanya Kulkarni

bffcd507

nvmet: set right status on error in id-ns handler

5y ago

Pavel Begunkov

4325cb49

io_uring: fix uring_flush in exit_files() warning

5y ago

Qinglang Miao

a1153636

sh: mm: Convert to DEFINE_SHOW_ATTRIBUTE

5y ago

Linus Torvalds

8f3bfd21

Merge tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

5y ago

Linus Torvalds

17749851

tty: fix up hung_up_tty_write() conversion

5y ago

Linus Torvalds

7c53f6b6

Linux 5.11-rc3 v5.11-rc3

5y ago

Stephen Boyd

b8653aff

iio: sx9310: Fix semtech,avg-pos-strength setting when > 16

5y ago

Christoph Hellwig

f2d6c270

kernfs: wire up ->splice_read and ->splice_write

5y ago

Wang Hui

927633a6

stm class: Fix module init return on allocation failure

5y ago

Oded Gabbay

f8abaf37

habanalabs: fix backward compatibility of idle check

5y ago

Thomas Gleixner

a1dce7fd

mm/highmem: prepare for overriding set_pte_at()

5y ago

Coly Li

b16671e8

bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket

When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
was introduced into the incompat feature set. It used bucket_size_hi
(which was added at the tail of struct cache_sb_disk) to extend current
16bit bucket size to 32bit with existing bucket_size in struct
cache_sb_disk.

This is not a good idea, there are two obvious problems,
- Bucket size is always value power of 2, if store log2(bucket size) in
existing bucket_size of struct cache_sb_disk, it is unnecessary to add
bucket_size_hi.
- Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
struct cache_sb_disk, bucket_size_hi was added after d[] which makes
csum_set calculate an unexpected super block checksum.

To fix the above problems, this patch introduces a new incompat feature
bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, when this bit is set, it
means bucket_size in struct cache_sb_disk stores the order of power-of-2
bucket size value. When user specifies a bucket size larger than 32768
sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE will be set to
incompat feature set, and bucket_size stores log2(bucket size) more
than store the real bucket size value.

The obsoleted BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore,
it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and still only
recognized by kernel driver for legacy compatible purpose. The previous
bucket_size_hi is renmaed to obso_bucket_size_hi in struct cache_sb_disk
and not used in bcache-tools anymore.

For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature,
bcache-tools and kernel driver still recognize the feature string and
display it as "obso_large_bucket".

With this change, the unnecessary extra space extend of bcache on-disk
super block can be avoided, and csum_set() may generate expected check
sum as well.

Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>