Linux kernel
============
There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``. The formatted documentation can also be read online at:
https://www.kernel.org/doc/html/latest/
There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.
Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
Clone this repository
For self-hosted knots, clone URLs may differ based on your setup.
Download tar.gz
Using mutex lock in IO hot path causes the kernel BUG sleeping while
atomic. Shinichiro[1], first encountered this issue while running blktest
nvme/052 shown below:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
#0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
#1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x90
__might_resched.cold+0x1f7/0x23d
? __pfx___might_resched+0x10/0x10
? vsnprintf+0xdeb/0x18f0
__mutex_lock+0xf4/0x1220
? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
? __pfx_vsnprintf+0x10/0x10
? __pfx___mutex_lock+0x10/0x10
? snprintf+0xa5/0xe0
? xas_load+0x1ce/0x3f0
? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
nvmet_req_find_ns+0x24e/0x300 [nvmet]
nvmet_req_init+0x694/0xd40 [nvmet]
? blk_mq_start_request+0x11c/0x750
? nvme_setup_cmd+0x369/0x990 [nvme_core]
nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
? __pfx___lock_acquire+0x10/0x10
? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
__blk_mq_issue_directly+0xe2/0x1d0
? __pfx___blk_mq_issue_directly+0x10/0x10
? blk_mq_request_issue_directly+0xc2/0x140
blk_mq_plug_issue_direct+0x13f/0x630
? lock_acquire+0x2d/0xc0
? blk_mq_flush_plug_list+0xa75/0x1950
blk_mq_flush_plug_list+0xa9d/0x1950
? __pfx_blk_mq_flush_plug_list+0x10/0x10
? __pfx_mpage_readahead+0x10/0x10
__blk_flush_plug+0x278/0x4d0
? __pfx___blk_flush_plug+0x10/0x10
? lock_release+0x460/0x7a0
blk_finish_plug+0x4e/0x90
read_pages+0x51b/0xbc0
? __pfx_read_pages+0x10/0x10
? lock_release+0x460/0x7a0
page_cache_ra_unbounded+0x326/0x5c0
force_page_cache_ra+0x1ea/0x2f0
filemap_get_pages+0x59e/0x17b0
? __pfx_filemap_get_pages+0x10/0x10
? lock_is_held_type+0xd5/0x130
? __pfx___might_resched+0x10/0x10
? find_held_lock+0x2d/0x110
filemap_read+0x317/0xb70
? up_write+0x1ba/0x510
? __pfx_filemap_read+0x10/0x10
? inode_security+0x54/0xf0
? selinux_file_permission+0x36d/0x420
blkdev_read_iter+0x143/0x3b0
vfs_read+0x6ac/0xa20
? __pfx_vfs_read+0x10/0x10
? __pfx_vm_mmap_pgoff+0x10/0x10
? __pfx___seccomp_filter+0x10/0x10
ksys_read+0xf7/0x1d0
? __pfx_ksys_read+0x10/0x10
do_syscall_64+0x93/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
</TASK>
Apparently, the above issue is caused due to using mutex lock while
we're in IO hot path. It's a regression caused with commit 505363957fad
("nvmet: fix nvme status code when namespace is disabled"). The mutex
->su_mutex is used to find whether a disabled nsid exists in the config
group or not. This is to differentiate between a nsid that is disabled
vs non-existent.
To mitigate the above issue, we've worked upon a fix[2] where we now
insert nsid in subsys Xarray as soon as it's created under config group
and later when that nsid is enabled, we add an Xarray mark on it and set
ns->enabled to true. The Xarray mark is useful while we need to loop
through all enabled namepsaces under a subsystem using xa_for_each_marked()
API. If later a nsid is disabled then we clear Xarray mark from it and also
set ns->enabled to false. It's only when nsid is deleted from the config
group we delete it from the Xarray.
So with this change, now we could easily differentiate a nsid is disabled
(i.e. Xarray entry for ns exists but ns->enabled is set to false) vs non-
existent (i.e.Xarray entry for ns doesn't exist).
Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 505363957fad ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Ensure we propagate npwg to the target as well instead
of assuming its the same logical blocks per physical block.
This ensures devices with large IUs information properly
propagated on the target.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet_root_discovery_nqn_store treats the subsysnqn string like a fixed
size buffer, even though it is dynamically allocated to the size of the
string.
Create a new string with kstrndup instead of using the old buffer.
Reported-by: syzbot+ff4aab278fa7e27e0f9e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ff4aab278fa7e27e0f9e
Fixes: 95409e277d83 ("nvmet: implement unique discovery NQN")
Signed-off-by: Leo Stone <leocstone@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
We initially introduced a quick fix limiting the queue depth to 1 as
experimentation showed that it fixed data corruption on 64GB steamdecks.
Further experimentation revealed corruption only happens when the last
PRP data element aligns to the end of the page boundary. The device
appears to treat this as a PRP chain to a new list instead of the data
element that it actually is. This implementation is in violation of the
spec. Encountering this errata with the Linux driver requires the host
request a 128k transfer and coincidently be handed the last small pool
dma buffer within a page.
The QD1 quirk effectly works around this because the last data PRP
always was at a 248 byte offset from the page start, so it never
appeared at the end of the page, but comes at the expense of throttling
IO and wasting the remainder of the PRP page beyond 256 bytes. Also to
note, the MDTS on these devices is small enough that the "large" prp
pool can hold enough PRP elements to never reach the end, so that pool
is not a problem either.
Introduce a new quirk to ensure the small pool is always aligned such
that the last PRP element can't appear a the end of the page. This comes
at the expense of wasting 256 bytes per small pool page allocated.
Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
Cc: Paweł Anikiel <panikiel@google.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Call bdev_offset_from_zone_start() instead of open-coding it.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: LongPing Wei <weilongping@oppo.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241107020439.1644577-1-weilongping@oppo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg_unpin_online() walks up the blkcg hierarchy putting the online pin. To
walk up, it uses blkcg_parent(blkcg) but it was calling that after
blkcg_destroy_blkgs(blkcg) which could free the blkcg, leading to the
following UAF:
==================================================================
BUG: KASAN: slab-use-after-free in blkcg_unpin_online+0x15a/0x270
Read of size 8 at addr ffff8881057678c0 by task kworker/9:1/117
CPU: 9 UID: 0 PID: 117 Comm: kworker/9:1 Not tainted 6.13.0-rc1-work-00182-gb8f52214c61a-dirty #48
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022
Workqueue: cgwb_release cgwb_release_workfn
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x80
print_report+0x151/0x710
kasan_report+0xc0/0x100
blkcg_unpin_online+0x15a/0x270
cgwb_release_workfn+0x194/0x480
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
...
Freed by task 1944:
kasan_save_track+0x2b/0x70
kasan_save_free_info+0x3c/0x50
__kasan_slab_free+0x33/0x50
kfree+0x10c/0x330
css_free_rwork_fn+0xe6/0xb30
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
Note that the UAF is not easy to trigger as the free path is indirected
behind a couple RCU grace periods and a work item execution. I could only
trigger it with artifical msleep() injected in blkcg_unpin_online().
Fix it by reading the parent pointer before destroying the blkcg's blkg's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abagail ren <renzezhongucas@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 4308a434e5e0 ("blkcg: don't offline parent blkcg first")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch updates Coly Li's email addres to colyli@kernel.org.
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20241208115350.85103-1-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zone write plugging for handling writes to zones of a zoned block
device always execute a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that the alignment to a zone write
pointer of write BIOs can be checked on submission and that we can
always correctly emulate zone append operations using regular write
BIOs.
However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.
Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonnable requirement as all
well-behaved applications, FSes and device mapper already use report
zones to recover from write errors whenever possible by comparing the
current position of a zone write pointer with what their assumption
about the position is.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated
resources (zone write plug list head, disk error list, and disk
zone_wplugs_work work struct). This also removes the functions
disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
plug whenever a write opration targetting the zone of the zone write
plug fails. This flag indicates that the zone write pointer offset is
not reliable and that it must be updated when the next report zone,
reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this
new flag, thus implementing recovery of a correct write pointer
offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb()
callback so that disk_zone_wplug_sync_wp_offset() can be called for
any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
This implements recovery of a correct write pointer offset for zone
write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call
disk_zone_wplug_sync_wp_offset() for all sequential write required
zones when a zoned block device is revalidated, thus always resolving
any inconsistency between the write pointer offset of zone write
plugs and the actual write pointer position of sequential zones.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
which attempt to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the target write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
other change needed as the two functions are functionnally equivalent.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are currently any issuer of REQ_OP_ZONE_RESET and
REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot
handle this flag correctly due to the potential request allocation
failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio()
has handled the zone write plug write pointer updates for the targeted
zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this
flag is set and ignore it.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.
Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensure that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.
Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.
blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Registering and unregistering cpuhp callback requires global cpu hotplug lock,
which is used everywhere. Meantime q->sysfs_lock is used in block layer
almost everywhere.
It is easy to trigger lockdep warning[1] by connecting the two locks.
Fix the warning by moving blk-mq's cpuhp callback registering out of
q->sysfs_lock. Add one dedicated global lock for covering registering &
unregistering hctx's cpuhp, and it is safe to do so because hctx is
guaranteed to be live if our request_queue is live.
[1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Reported-by: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the
callback should be registered after this 'hctx' is added to xarray table.
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull NVMe fixess from Keith:
"nvme fixes for Linux 6.13
- Target fix using incorrect zero buffer (Nilay)
- Device specifc deallocate quirk fixes (Christoph, Keith)
- Fabrics fix for handling max command target bugs (Maurizio)
- Cocci fix usage for kzalloc (Yu-Chen)
- DMA size fix for host memory buffer feature (Christoph)
- Fabrics queue cleanup fixes (Chunguang)"
* tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme:
nvme-tcp: simplify nvme_tcp_teardown_io_queues()
nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()
nvme-rdma: unquiesce admin_q before destroy it
nvme-tcp: fix the memleak while create new ctrl failed
nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary
nvmet: replace kmalloc + memset with kzalloc for data allocation
nvme-fabrics: handle zero MAXCMD without closing the connection
nvme-pci: remove two deallocate zeroes quirks
nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported
nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()
Commit 4ce6e2db00de ("virtio-blk: Ensure no requests in virtqueues before
deleting vqs.") replaces queue quiesce with queue freeze in virtio-blk's
PM callbacks. And the motivation is to drain inflight IOs before suspending.
block layer's queue freeze looks very handy, but it is also easy to cause
deadlock, such as, any attempt to call into bio_queue_enter() may run into
deadlock if the queue is frozen in current context. There are all kinds
of ->suspend() called in suspend context, so keeping queue frozen in the
whole suspend context isn't one good idea. And Marek reported lockdep
warning[1] caused by virtio-blk's freeze queue in virtblk_freeze().
[1] https://lore.kernel.org/linux-block/ca16370e-d646-4eee-b9cc-87277c89c43c@samsung.com/
Given the motivation is to drain in-flight IOs, it can be done by calling
freeze & unfreeze, meantime restore to previous behavior by keeping queue
quiesced during suspend.
Cc: Yi Sun <yi.sun@unisoc.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: virtualization@lists.linux.dev
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20241112125821.1475793-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As nvme_tcp_teardown_io_queues() is the only one caller of
nvme_tcp_destroy_admin_queue(), so we can merge it into
nvme_tcp_teardown_io_queues() to simplify the code.
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>