Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

ublk: document zero copy feature

Add words to explain how zero copy feature works, and why it has to be
trusted for handling IO read command.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

authored by Ming Lei and committed by Jens Axboe
17970209 ebf695f1

+25 -8
Documentation/block/ublk.rst
···
 ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
 the server buffer (pages) read to the IO request pages.
 
-Future development
-==================
-
 Zero copy
 ---------
 
-Zero copy is a generic requirement for nbd, fuse or similar drivers. A
-problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
-can't be remapped any more in kernel with existing mm interfaces. This can
-occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
-big requests (IO size >= 256 KB) may benefit a lot from zero copy.
+ublk zero copy relies on io_uring's fixed kernel buffer, which provides
+two APIs: `io_buffer_register_bvec()` and `io_buffer_unregister_bvec`.
 
+ublk adds IO command of `UBLK_IO_REGISTER_IO_BUF` to call
+`io_buffer_register_bvec()` for ublk server to register client request
+buffer into io_uring buffer table, then ublk server can submit io_uring
+IOs with the registered buffer index. IO command of `UBLK_IO_UNREGISTER_IO_BUF`
+calls `io_buffer_unregister_bvec()` to unregister the buffer, which is
+guaranteed to be live between calling `io_buffer_register_bvec()` and
+`io_buffer_unregister_bvec()`. Any io_uring operation which supports this
+kind of kernel buffer will grab one reference of the buffer until the
+operation is completed.
+
+ublk server implementing zero copy or user copy has to be CAP_SYS_ADMIN and
+be trusted, because it is ublk server's responsibility to make sure IO buffer
+filled with data for handling read command, and ublk server has to return
+correct result to ublk driver when handling READ command, and the result
+has to match with how many bytes filled to the IO buffer. Otherwise,
+uninitialized kernel IO buffer will be exposed to client application.
+
+ublk server needs to align the parameter of `struct ublk_param_dma_align`
+with backend for zero copy to work correctly.
+
+For reaching best IO performance, ublk server should align its segment
+parameter of `struct ublk_param_segment` with backend for avoiding
+unnecessary IO split, which usually hurts io_uring performance.
 
 References
 ==========
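The buffer lifetime rule in the added text (live between register and unregister, with each in-flight operation holding one reference until it completes) can be sketched as a toy userspace model. This is not kernel or liburing code: every name below (`toy_buf`, `toy_register`, `toy_op_start`, `toy_op_complete`, `toy_unregister`) is hypothetical. It also assumes the request buffer is handed back only once it has been unregistered and the last operation has dropped its reference, which is one reading of the paragraph rather than something the patch states.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; all names are hypothetical, not the ublk or io_uring API. */
struct toy_buf {
    bool registered;    /* true between "register" and "unregister" */
    unsigned inflight;  /* operations currently holding a reference */
    bool released;      /* request pages handed back (assumed semantics) */
};

/* Release only when nothing can reach the buffer any more. */
static void toy_release_if_idle(struct toy_buf *b)
{
    if (!b->registered && b->inflight == 0)
        b->released = true;
}

/* Stands in for UBLK_IO_REGISTER_IO_BUF: the buffer becomes usable. */
static void toy_register(struct toy_buf *b)
{
    b->registered = true;
    b->inflight = 0;
    b->released = false;
}

/* An operation starting on the registered buffer takes one reference. */
static bool toy_op_start(struct toy_buf *b)
{
    if (!b->registered)
        return false;   /* the index no longer resolves to a buffer */
    b->inflight++;
    return true;
}

/* The operation completing drops its reference. */
static void toy_op_complete(struct toy_buf *b)
{
    assert(b->inflight > 0);
    b->inflight--;
    toy_release_if_idle(b);
}

/* Stands in for UBLK_IO_UNREGISTER_IO_BUF. */
static void toy_unregister(struct toy_buf *b)
{
    b->registered = false;
    toy_release_if_idle(b);
}
```

In this model, unregistering while an operation is still in flight does not release the buffer; the release happens when that operation completes, and no new operation can start on the buffer after unregistration.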