Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

- Fixes for integrity handling

- NVMe pull request via Keith:
- Secure concatenation for TCP transport (Hannes)
- Multipath sysfs visibility (Nilay)
- Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
- Correct use of 64-bit BARs for pci-epf target (Niklas)
- Socket fix for selinux when used in containers (Peijie)

- MD pull request via Yu:
- Fix recovery being allowed to preempt resync (Li Nan)
- Fix the md-bitmap IO limit (Su Yue)
- Fix raid10 discard with REQ_NOWAIT (Xiao Ni)
- Fix a raid1 memory leak (Zheng Qixing)
- Fix an mddev use-after-free (Yu Kuai)
- Fix raid1/raid10 IO flags (Yu Kuai)
- Some refactoring and cleanup (Yu Kuai)

- Series cleaning up and fixing bugs in the bad block handling code

- Improve support for write failure simulation in null_blk

- Various lock ordering fixes

- Fixes for locking for debugfs attributes

- Various ublk related fixes and improvements

- Cleanups for blk-rq-qos wait handling

- blk-throttle fixes

- Fixes for loop dio and sync handling

- Fixes and cleanups for the auto-PI code

- Block side support for hardware encryption keys in blk-crypto

- Various cleanups and fixes

* tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits)
nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
nvme-tcp: fix selinux denied when calling sock_sendmsg
nvmet: pci-epf: Always configure BAR0 as 64-bit
nvmet: Remove duplicate uuid_copy
nvme: zns: Simplify nvme_zone_parse_entry()
nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
nvmet-fc: Remove unused functions
nvme-pci: remove stale comment
nvme-fc: Utilise min3() to simplify queue count calculation
nvme-multipath: Add visibility for queue-depth io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for round-robin io-policy
nvmet: add tls_concat and tls_key debugfs entries
nvmet-tcp: support secure channel concatenation
nvmet: Add 'sq' argument to alloc_ctrl_args
nvme-fabrics: reset admin connection for secure concatenation
nvme-tcp: request secure channel concatenation
nvme-keyring: add nvme_tls_psk_refresh()
nvme: add nvme_auth_derive_tls_psk()
nvme: add nvme_auth_generate_digest()
...

+4069 -1571
+42 -1
Documentation/ABI/stable/sysfs-block
··· 109 109 Description: 110 110 Indicates whether a storage device is capable of storing 111 111 integrity metadata. Set if the device is T10 PI-capable. 112 + This flag is set to 1 if the storage media is formatted 113 + with T10 Protection Information. If the storage media is 114 + not formatted with T10 Protection Information, this flag 115 + is set to 0. 112 116 113 117 114 118 What: /sys/block/<disk>/integrity/format ··· 121 117 Description: 122 118 Metadata format for integrity capable block device. 123 119 E.g. T10-DIF-TYPE1-CRC. 120 + This field describes the type of T10 Protection Information 121 + that the block device can send and receive. 122 + If the device can store application integrity metadata but 123 + no T10 Protection Information profile is used, this field 124 + contains "nop". 125 + If the device does not support integrity metadata, this 126 + field contains "none". 124 127 125 128 126 129 What: /sys/block/<disk>/integrity/protection_interval_bytes ··· 153 142 Contact: Martin K. Petersen <martin.petersen@oracle.com> 154 143 Description: 155 144 Number of bytes of integrity tag space available per 156 - 512 bytes of data. 145 + protection_interval_bytes, which is typically 146 + the device's logical block size. 147 + This field describes the size of the application tag 148 + if the storage device is formatted with T10 Protection 149 + Information and permits use of the application tag. 150 + The tag_size is reported in bytes and indicates the 151 + space available for adding an opaque tag to each block 152 + (protection_interval_bytes). 153 + If the device does not support T10 Protection Information 154 + (even if the device provides application integrity 155 + metadata space), this field is set to 0. 157 156 158 157 159 158 What: /sys/block/<disk>/integrity/write_generate ··· 250 229 encryption, refer to Documentation/block/inline-encryption.rst. 
251 230 252 231 232 + What: /sys/block/<disk>/queue/crypto/hw_wrapped_keys 233 + Date: February 2025 234 + Contact: linux-block@vger.kernel.org 235 + Description: 236 + [RO] The presence of this file indicates that the device 237 + supports hardware-wrapped inline encryption keys, i.e. key blobs 238 + that can only be unwrapped and used by dedicated hardware. For 239 + more information about hardware-wrapped inline encryption keys, 240 + see Documentation/block/inline-encryption.rst. 241 + 242 + 253 243 What: /sys/block/<disk>/queue/crypto/max_dun_bits 254 244 Date: February 2022 255 245 Contact: linux-block@vger.kernel.org ··· 297 265 Description: 298 266 [RO] This file shows the number of keyslots the device has for 299 267 use with inline encryption. 268 + 269 + 270 + What: /sys/block/<disk>/queue/crypto/raw_keys 271 + Date: February 2025 272 + Contact: linux-block@vger.kernel.org 273 + Description: 274 + [RO] The presence of this file indicates that the device 275 + supports raw inline encryption keys, i.e. keys that are managed 276 + in raw, plaintext form in software. 300 277 301 278 302 279 What: /sys/block/<disk>/queue/dax
+251 -4
Documentation/block/inline-encryption.rst
··· 77 77 ============ 78 78 79 79 We introduce ``struct blk_crypto_key`` to represent an inline encryption key and 80 - how it will be used. This includes the actual bytes of the key; the size of the 81 - key; the algorithm and data unit size the key will be used with; and the number 82 - of bytes needed to represent the maximum data unit number the key will be used 83 - with. 80 + how it will be used. This includes the type of the key (raw or 81 + hardware-wrapped); the actual bytes of the key; the size of the key; the 82 + algorithm and data unit size the key will be used with; and the number of bytes 83 + needed to represent the maximum data unit number the key will be used with. 84 84 85 85 We introduce ``struct bio_crypt_ctx`` to represent an encryption context. It 86 86 contains a data unit number and a pointer to a blk_crypto_key. We add pointers ··· 301 301 When the crypto API fallback is enabled, this means that all bios with and 302 302 encryption context will use the fallback, and IO will complete as usual. When 303 303 the fallback is disabled, a bio with an encryption context will be failed. 304 + 305 + .. _hardware_wrapped_keys: 306 + 307 + Hardware-wrapped keys 308 + ===================== 309 + 310 + Motivation and threat model 311 + --------------------------- 312 + 313 + Linux storage encryption (dm-crypt, fscrypt, eCryptfs, etc.) traditionally 314 + relies on the raw encryption key(s) being present in kernel memory so that the 315 + encryption can be performed. This traditionally isn't seen as a problem because 316 + the key(s) won't be present during an offline attack, which is the main type of 317 + attack that storage encryption is intended to protect from. 
318 + 319 + However, there is an increasing desire to also protect users' data from other 320 + types of attacks (to the extent possible), including: 321 + 322 + - Cold boot attacks, where an attacker with physical access to a system suddenly 323 + powers it off, then immediately dumps the system memory to extract recently 324 + in-use encryption keys, then uses these keys to decrypt user data on-disk. 325 + 326 + - Online attacks where the attacker is able to read kernel memory without fully 327 + compromising the system, followed by an offline attack where any extracted 328 + keys can be used to decrypt user data on-disk. An example of such an online 329 + attack would be if the attacker is able to run some code on the system that 330 + exploits a Meltdown-like vulnerability but is unable to escalate privileges. 331 + 332 + - Online attacks where the attacker fully compromises the system, but their data 333 + exfiltration is significantly time-limited and/or bandwidth-limited, so in 334 + order to completely exfiltrate the data they need to extract the encryption 335 + keys to use in a later offline attack. 336 + 337 + Hardware-wrapped keys are a feature of inline encryption hardware that is 338 + designed to protect users' data from the above attacks (to the extent possible), 339 + without introducing limitations such as a maximum number of keys. 340 + 341 + Note that it is impossible to **fully** protect users' data from these attacks. 342 + Even in the attacks where the attacker "just" gets read access to kernel memory, 343 + they can still extract any user data that is present in memory, including 344 + plaintext pagecache pages of encrypted files. The focus here is just on 345 + protecting the encryption keys, as those instantly give access to **all** user 346 + data in any following offline attack, rather than just some of it (where which 347 + data is included in that "some" might not be controlled by the attacker). 
348 + 349 + Solution overview 350 + ----------------- 351 + 352 + Inline encryption hardware typically has "keyslots" into which software can 353 + program keys for the hardware to use; the contents of keyslots typically can't 354 + be read back by software. As such, the above security goals could be achieved 355 + if the kernel simply erased its copy of the key(s) after programming them into 356 + keyslot(s) and thereafter only referred to them via keyslot number. 357 + 358 + However, that naive approach runs into a couple problems: 359 + 360 + - It limits the number of unlocked keys to the number of keyslots, which 361 + typically is a small number. In cases where there is only one encryption key 362 + system-wide (e.g., a full-disk encryption key), that can be tolerable. 363 + However, in general there can be many logged-in users with many different 364 + keys, and/or many running applications with application-specific encrypted 365 + storage areas. This is especially true if file-based encryption (e.g. 366 + fscrypt) is being used. 367 + 368 + - Inline crypto engines typically lose the contents of their keyslots if the 369 + storage controller (usually UFS or eMMC) is reset. Resetting the storage 370 + controller is a standard error recovery procedure that is executed if certain 371 + types of storage errors occur, and such errors can occur at any time. 372 + Therefore, when inline crypto is being used, the operating system must always 373 + be ready to reprogram the keyslots without user intervention. 374 + 375 + Thus, it is important for the kernel to still have a way to "remind" the 376 + hardware about a key, without actually having the raw key itself. 377 + 378 + Somewhat less importantly, it is also desirable that the raw keys are never 379 + visible to software at all, even while being initially unlocked. 
This would 380 + ensure that a read-only compromise of system memory will never allow a key to be 381 + extracted to be used off-system, even if it occurs when a key is being unlocked. 382 + 383 + To solve all these problems, some vendors of inline encryption hardware have 384 + made their hardware support *hardware-wrapped keys*. Hardware-wrapped keys 385 + are encrypted keys that can only be unwrapped (decrypted) and used by hardware 386 + -- either by the inline encryption hardware itself, or by a dedicated hardware 387 + block that can directly provision keys to the inline encryption hardware. 388 + 389 + (We refer to them as "hardware-wrapped keys" rather than simply "wrapped keys" 390 + to add some clarity in cases where there could be other types of wrapped keys, 391 + such as in file-based encryption. Key wrapping is a commonly used technique.) 392 + 393 + The key which wraps (encrypts) hardware-wrapped keys is a hardware-internal key 394 + that is never exposed to software; it is either a persistent key (a "long-term 395 + wrapping key") or a per-boot key (an "ephemeral wrapping key"). The long-term 396 + wrapped form of the key is what is initially unlocked, but it is erased from 397 + memory as soon as it is converted into an ephemerally-wrapped key. In-use 398 + hardware-wrapped keys are always ephemerally-wrapped, not long-term wrapped. 399 + 400 + As inline encryption hardware can only be used to encrypt/decrypt data on-disk, 401 + the hardware also includes a level of indirection; it doesn't use the unwrapped 402 + key directly for inline encryption, but rather derives both an inline encryption 403 + key and a "software secret" from it. Software can use the "software secret" for 404 + tasks that can't use the inline encryption hardware, such as filenames 405 + encryption. The software secret is not protected from memory compromise. 
406 + 407 + Key hierarchy 408 + ------------- 409 + 410 + Here is the key hierarchy for a hardware-wrapped key:: 411 + 412 + Hardware-wrapped key 413 + | 414 + | 415 + <Hardware KDF> 416 + | 417 + ----------------------------- 418 + | | 419 + Inline encryption key Software secret 420 + 421 + The components are: 422 + 423 + - *Hardware-wrapped key*: a key for the hardware's KDF (Key Derivation 424 + Function), in ephemerally-wrapped form. The key wrapping algorithm is a 425 + hardware implementation detail that doesn't impact kernel operation, but a 426 + strong authenticated encryption algorithm such as AES-256-GCM is recommended. 427 + 428 + - *Hardware KDF*: a KDF (Key Derivation Function) which the hardware uses to 429 + derive subkeys after unwrapping the wrapped key. The hardware's choice of KDF 430 + doesn't impact kernel operation, but it does need to be known for testing 431 + purposes, and it's also assumed to have at least a 256-bit security strength. 432 + All known hardware uses the SP800-108 KDF in Counter Mode with AES-256-CMAC, 433 + with a particular choice of labels and contexts; new hardware should use this 434 + already-vetted KDF. 435 + 436 + - *Inline encryption key*: a derived key which the hardware directly provisions 437 + to a keyslot of the inline encryption hardware, without exposing it to 438 + software. In all known hardware, this will always be an AES-256-XTS key. 439 + However, in principle other encryption algorithms could be supported too. 440 + Hardware must derive distinct subkeys for each supported encryption algorithm. 441 + 442 + - *Software secret*: a derived key which the hardware returns to software so 443 + that software can use it for cryptographic tasks that can't use inline 444 + encryption. This value is cryptographically isolated from the inline 445 + encryption key, i.e. knowing one doesn't reveal the other. (The KDF ensures 446 + this.) 
Currently, the software secret is always 32 bytes and thus is suitable 447 + for cryptographic applications that require up to a 256-bit security strength. 448 + Some use cases (e.g. full-disk encryption) won't require the software secret. 449 + 450 + Example: in the case of fscrypt, the fscrypt master key (the key that protects a 451 + particular set of encrypted directories) is made hardware-wrapped. The inline 452 + encryption key is used as the file contents encryption key, while the software 453 + secret (rather than the master key directly) is used to key fscrypt's KDF 454 + (HKDF-SHA512) to derive other subkeys such as filenames encryption keys. 455 + 456 + Note that currently this design assumes a single inline encryption key per 457 + hardware-wrapped key, without any further key derivation. Thus, in the case of 458 + fscrypt, currently hardware-wrapped keys are only compatible with the "inline 459 + encryption optimized" settings, which use one file contents encryption key per 460 + encryption policy rather than one per file. This design could be extended to 461 + make the hardware derive per-file keys using per-file nonces passed down the 462 + storage stack, and in fact some hardware already supports this; future work is 463 + planned to remove this limitation by adding the corresponding kernel support. 464 + 465 + Kernel support 466 + -------------- 467 + 468 + The inline encryption support of the kernel's block layer ("blk-crypto") has 469 + been extended to support hardware-wrapped keys as an alternative to raw keys, 470 + when hardware support is available. This works in the following way: 471 + 472 + - A ``key_types_supported`` field is added to the crypto capabilities in 473 + ``struct blk_crypto_profile``. This allows device drivers to declare that 474 + they support raw keys, hardware-wrapped keys, or both. 
475 + 476 + - ``struct blk_crypto_key`` can now contain a hardware-wrapped key as an 477 + alternative to a raw key; a ``key_type`` field is added to 478 + ``struct blk_crypto_config`` to distinguish between the different key types. 479 + This allows users of blk-crypto to en/decrypt data using a hardware-wrapped 480 + key in a way very similar to using a raw key. 481 + 482 + - A new method ``blk_crypto_ll_ops::derive_sw_secret`` is added. Device drivers 483 + that support hardware-wrapped keys must implement this method. Users of 484 + blk-crypto can call ``blk_crypto_derive_sw_secret()`` to access this method. 485 + 486 + - The programming and eviction of hardware-wrapped keys happens via 487 + ``blk_crypto_ll_ops::keyslot_program`` and 488 + ``blk_crypto_ll_ops::keyslot_evict``, just like it does for raw keys. If a 489 + driver supports hardware-wrapped keys, then it must handle hardware-wrapped 490 + keys being passed to these methods. 491 + 492 + blk-crypto-fallback doesn't support hardware-wrapped keys. Therefore, 493 + hardware-wrapped keys can only be used with actual inline encryption hardware. 494 + 495 + All the above deals with hardware-wrapped keys in ephemerally-wrapped form only. 496 + To get such keys in the first place, new block device ioctls have been added to 497 + provide a generic interface to creating and preparing such keys: 498 + 499 + - ``BLKCRYPTOIMPORTKEY`` converts a raw key to long-term wrapped form. It takes 500 + in a pointer to a ``struct blk_crypto_import_key_arg``. The caller must set 501 + ``raw_key_ptr`` and ``raw_key_size`` to the pointer and size (in bytes) of the 502 + raw key to import. On success, ``BLKCRYPTOIMPORTKEY`` returns 0 and writes 503 + the resulting long-term wrapped key blob to the buffer pointed to by 504 + ``lt_key_ptr``, which is of maximum size ``lt_key_size``. It also updates 505 + ``lt_key_size`` to be the actual size of the key. On failure, it returns -1 506 + and sets errno. 
An errno of ``EOPNOTSUPP`` indicates that the block device 507 + does not support hardware-wrapped keys. An errno of ``EOVERFLOW`` indicates 508 + that the output buffer did not have enough space for the key blob. 509 + 510 + - ``BLKCRYPTOGENERATEKEY`` is like ``BLKCRYPTOIMPORTKEY``, but it has the 511 + hardware generate the key instead of importing one. It takes in a pointer to 512 + a ``struct blk_crypto_generate_key_arg``. 513 + 514 + - ``BLKCRYPTOPREPAREKEY`` converts a key from long-term wrapped form to 515 + ephemerally-wrapped form. It takes in a pointer to a ``struct 516 + blk_crypto_prepare_key_arg``. The caller must set ``lt_key_ptr`` and 517 + ``lt_key_size`` to the pointer and size (in bytes) of the long-term wrapped 518 + key blob to convert. On success, ``BLKCRYPTOPREPAREKEY`` returns 0 and writes 519 + the resulting ephemerally-wrapped key blob to the buffer pointed to by 520 + ``eph_key_ptr``, which is of maximum size ``eph_key_size``. It also updates 521 + ``eph_key_size`` to be the actual size of the key. On failure, it returns -1 522 + and sets errno. Errno values of ``EOPNOTSUPP`` and ``EOVERFLOW`` mean the 523 + same as they do for ``BLKCRYPTOIMPORTKEY``. An errno of ``EBADMSG`` indicates 524 + that the long-term wrapped key is invalid. 525 + 526 + Userspace needs to use either ``BLKCRYPTOIMPORTKEY`` or ``BLKCRYPTOGENERATEKEY`` 527 + once to create a key, and then ``BLKCRYPTOPREPAREKEY`` each time the key is 528 + unlocked and added to the kernel. Note that these ioctls have no relevance for 529 + raw keys; they are only for hardware-wrapped keys. 530 + 531 + Testability 532 + ----------- 533 + 534 + Both the hardware KDF and the inline encryption itself are well-defined 535 + algorithms that don't depend on any secrets other than the unwrapped key. 
536 + Therefore, if the unwrapped key is known to software, these algorithms can be 537 + reproduced in software in order to verify the ciphertext that is written to disk 538 + by the inline encryption hardware. 539 + 540 + However, the unwrapped key will only be known to software for testing if the 541 + "import" functionality is used. Proper testing is not possible in the 542 + "generate" case where the hardware generates the key itself. The correct 543 + operation of the "generate" mode thus relies on the security and correctness of 544 + the hardware RNG and its use to generate the key, as well as the testing of the 545 + "import" mode as that should cover all parts other than the key generation. 546 + 547 + For an example of a test that verifies the ciphertext written to disk in the 548 + "import" mode, see the fscrypt hardware-wrapped key tests in xfstests, or 549 + `Android's vts_kernel_encryption_test 550 + <https://android.googlesource.com/platform/test/vts-testcase/kernel/+/refs/heads/main/encryption/>`_.
+2
Documentation/userspace-api/ioctl/ioctl-number.rst
··· 85 85 0x10 20-2F arch/s390/include/uapi/asm/hypfs.h 86 86 0x12 all linux/fs.h BLK* ioctls 87 87 linux/blkpg.h 88 + linux/blkzoned.h 89 + linux/blk-crypto.h 88 90 0x15 all linux/fs.h FS_IOC_* ioctls 89 91 0x1b all InfiniBand Subsystem 90 92 <http://infiniband.sourceforge.net/>
+2 -1
block/Makefile
··· 26 26 bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o 27 27 obj-$(CONFIG_IOSCHED_BFQ) += bfq.o 28 28 29 - obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o 29 + obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o \ 30 + bio-integrity-auto.o 30 31 obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o 31 32 obj-$(CONFIG_BLK_WBT) += blk-wbt.o 32 33 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
+123 -206
block/badblocks.c
··· 528 528 } 529 529 530 530 /* 531 - * Return 'true' if the range indicated by 'bad' can be backward merged 532 - * with the bad range (from the bad table) index by 'behind'. 533 - */ 534 - static bool can_merge_behind(struct badblocks *bb, 535 - struct badblocks_context *bad, int behind) 536 - { 537 - sector_t sectors = bad->len; 538 - sector_t s = bad->start; 539 - u64 *p = bb->page; 540 - 541 - if ((s < BB_OFFSET(p[behind])) && 542 - ((s + sectors) >= BB_OFFSET(p[behind])) && 543 - ((BB_END(p[behind]) - s) <= BB_MAX_LEN) && 544 - BB_ACK(p[behind]) == bad->ack) 545 - return true; 546 - return false; 547 - } 548 - 549 - /* 550 - * Do backward merge for range indicated by 'bad' and the bad range 551 - * (from the bad table) indexed by 'behind'. The return value is merged 552 - * sectors from bad->len. 553 - */ 554 - static int behind_merge(struct badblocks *bb, struct badblocks_context *bad, 555 - int behind) 556 - { 557 - sector_t sectors = bad->len; 558 - sector_t s = bad->start; 559 - u64 *p = bb->page; 560 - int merged = 0; 561 - 562 - WARN_ON(s >= BB_OFFSET(p[behind])); 563 - WARN_ON((s + sectors) < BB_OFFSET(p[behind])); 564 - 565 - if (s < BB_OFFSET(p[behind])) { 566 - merged = BB_OFFSET(p[behind]) - s; 567 - p[behind] = BB_MAKE(s, BB_LEN(p[behind]) + merged, bad->ack); 568 - 569 - WARN_ON((BB_LEN(p[behind]) + merged) >= BB_MAX_LEN); 570 - } 571 - 572 - return merged; 573 - } 574 - 575 - /* 576 531 * Return 'true' if the range indicated by 'bad' can be forward 577 532 * merged with the bad range (from the bad table) indexed by 'prev'. 
578 533 */ ··· 700 745 *extra = 2; 701 746 } 702 747 703 - if ((bb->count + (*extra)) >= MAX_BADBLOCKS) 748 + if ((bb->count + (*extra)) > MAX_BADBLOCKS) 704 749 return false; 705 750 706 751 return true; ··· 810 855 bb->unacked_exist = 0; 811 856 } 812 857 813 - /* Do exact work to set bad block range into the bad block table */ 814 - static int _badblocks_set(struct badblocks *bb, sector_t s, int sectors, 815 - int acknowledged) 858 + /* 859 + * Return 'true' if the range indicated by 'bad' is exactly backward 860 + * overlapped with the bad range (from bad table) indexed by 'behind'. 861 + */ 862 + static bool try_adjacent_combine(struct badblocks *bb, int prev) 816 863 { 817 - int retried = 0, space_desired = 0; 818 - int orig_len, len = 0, added = 0; 864 + u64 *p = bb->page; 865 + 866 + if (prev >= 0 && (prev + 1) < bb->count && 867 + BB_END(p[prev]) == BB_OFFSET(p[prev + 1]) && 868 + (BB_LEN(p[prev]) + BB_LEN(p[prev + 1])) <= BB_MAX_LEN && 869 + BB_ACK(p[prev]) == BB_ACK(p[prev + 1])) { 870 + p[prev] = BB_MAKE(BB_OFFSET(p[prev]), 871 + BB_LEN(p[prev]) + BB_LEN(p[prev + 1]), 872 + BB_ACK(p[prev])); 873 + 874 + if ((prev + 2) < bb->count) 875 + memmove(p + prev + 1, p + prev + 2, 876 + (bb->count - (prev + 2)) * 8); 877 + bb->count--; 878 + return true; 879 + } 880 + return false; 881 + } 882 + 883 + /* Do exact work to set bad block range into the bad block table */ 884 + static bool _badblocks_set(struct badblocks *bb, sector_t s, sector_t sectors, 885 + int acknowledged) 886 + { 887 + int len = 0, added = 0; 819 888 struct badblocks_context bad; 820 889 int prev = -1, hint = -1; 821 - sector_t orig_start; 822 890 unsigned long flags; 823 - int rv = 0; 824 891 u64 *p; 825 892 826 893 if (bb->shift < 0) 827 894 /* badblocks are disabled */ 828 - return 1; 895 + return false; 829 896 830 897 if (sectors == 0) 831 898 /* Invalid sectors number */ 832 - return 1; 899 + return false; 833 900 834 901 if (bb->shift) { 835 902 /* round the start down, and the end up 
*/ 836 903 sector_t next = s + sectors; 837 904 838 - rounddown(s, bb->shift); 839 - roundup(next, bb->shift); 905 + rounddown(s, 1 << bb->shift); 906 + roundup(next, 1 << bb->shift); 840 907 sectors = next - s; 841 908 } 842 909 843 910 write_seqlock_irqsave(&bb->lock, flags); 844 911 845 - orig_start = s; 846 - orig_len = sectors; 847 912 bad.ack = acknowledged; 848 913 p = bb->page; 849 914 ··· 871 896 bad.start = s; 872 897 bad.len = sectors; 873 898 len = 0; 899 + 900 + if (badblocks_full(bb)) 901 + goto out; 874 902 875 903 if (badblocks_empty(bb)) { 876 904 len = insert_at(bb, 0, &bad); ··· 886 908 887 909 /* start before all badblocks */ 888 910 if (prev < 0) { 889 - if (!badblocks_full(bb)) { 890 - /* insert on the first */ 891 - if (bad.len > (BB_OFFSET(p[0]) - bad.start)) 892 - bad.len = BB_OFFSET(p[0]) - bad.start; 893 - len = insert_at(bb, 0, &bad); 894 - bb->count++; 895 - added++; 896 - hint = 0; 897 - goto update_sectors; 898 - } 899 - 900 - /* No sapce, try to merge */ 901 - if (overlap_behind(bb, &bad, 0)) { 902 - if (can_merge_behind(bb, &bad, 0)) { 903 - len = behind_merge(bb, &bad, 0); 904 - added++; 905 - } else { 906 - len = BB_OFFSET(p[0]) - s; 907 - space_desired = 1; 908 - } 909 - hint = 0; 910 - goto update_sectors; 911 - } 912 - 913 - /* no table space and give up */ 914 - goto out; 911 + /* insert on the first */ 912 + if (bad.len > (BB_OFFSET(p[0]) - bad.start)) 913 + bad.len = BB_OFFSET(p[0]) - bad.start; 914 + len = insert_at(bb, 0, &bad); 915 + bb->count++; 916 + added++; 917 + hint = ++prev; 918 + goto update_sectors; 915 919 } 916 920 917 921 /* in case p[prev-1] can be merged with p[prev] */ ··· 905 945 goto update_sectors; 906 946 } 907 947 908 - if (overlap_front(bb, prev, &bad)) { 909 - if (can_merge_front(bb, prev, &bad)) { 910 - len = front_merge(bb, prev, &bad); 911 - added++; 912 - } else { 913 - int extra = 0; 914 - 915 - if (!can_front_overwrite(bb, prev, &bad, &extra)) { 916 - len = min_t(sector_t, 917 - BB_END(p[prev]) 
- s, sectors); 918 - hint = prev; 919 - goto update_sectors; 920 - } 921 - 922 - len = front_overwrite(bb, prev, &bad, extra); 923 - added++; 924 - bb->count += extra; 925 - 926 - if (can_combine_front(bb, prev, &bad)) { 927 - front_combine(bb, prev); 928 - bb->count--; 929 - } 930 - } 931 - hint = prev; 932 - goto update_sectors; 933 - } 934 - 935 948 if (can_merge_front(bb, prev, &bad)) { 936 949 len = front_merge(bb, prev, &bad); 937 950 added++; ··· 912 979 goto update_sectors; 913 980 } 914 981 915 - /* if no space in table, still try to merge in the covered range */ 916 - if (badblocks_full(bb)) { 917 - /* skip the cannot-merge range */ 918 - if (((prev + 1) < bb->count) && 919 - overlap_behind(bb, &bad, prev + 1) && 920 - ((s + sectors) >= BB_END(p[prev + 1]))) { 921 - len = BB_END(p[prev + 1]) - s; 922 - hint = prev + 1; 982 + if (overlap_front(bb, prev, &bad)) { 983 + int extra = 0; 984 + 985 + if (!can_front_overwrite(bb, prev, &bad, &extra)) { 986 + if (extra > 0) 987 + goto out; 988 + 989 + len = min_t(sector_t, 990 + BB_END(p[prev]) - s, sectors); 991 + hint = prev; 923 992 goto update_sectors; 924 993 } 925 994 926 - /* no retry any more */ 927 - len = sectors; 928 - space_desired = 1; 929 - hint = -1; 995 + len = front_overwrite(bb, prev, &bad, extra); 996 + added++; 997 + bb->count += extra; 998 + 999 + if (can_combine_front(bb, prev, &bad)) { 1000 + front_combine(bb, prev); 1001 + bb->count--; 1002 + } 1003 + 1004 + hint = prev; 930 1005 goto update_sectors; 931 1006 } 932 1007 ··· 947 1006 len = insert_at(bb, prev + 1, &bad); 948 1007 bb->count++; 949 1008 added++; 950 - hint = prev + 1; 1009 + hint = ++prev; 951 1010 952 1011 update_sectors: 953 1012 s += len; ··· 956 1015 if (sectors > 0) 957 1016 goto re_insert; 958 1017 959 - WARN_ON(sectors < 0); 960 - 961 1018 /* 962 1019 * Check whether the following already set range can be 963 1020 * merged. (prev < 0) condition is not handled here, 964 1021 * because it's already complicated enough. 
965 1022 */ 966 - if (prev >= 0 && 967 - (prev + 1) < bb->count && 968 - BB_END(p[prev]) == BB_OFFSET(p[prev + 1]) && 969 - (BB_LEN(p[prev]) + BB_LEN(p[prev + 1])) <= BB_MAX_LEN && 970 - BB_ACK(p[prev]) == BB_ACK(p[prev + 1])) { 971 - p[prev] = BB_MAKE(BB_OFFSET(p[prev]), 972 - BB_LEN(p[prev]) + BB_LEN(p[prev + 1]), 973 - BB_ACK(p[prev])); 974 - 975 - if ((prev + 2) < bb->count) 976 - memmove(p + prev + 1, p + prev + 2, 977 - (bb->count - (prev + 2)) * 8); 978 - bb->count--; 979 - } 980 - 981 - if (space_desired && !badblocks_full(bb)) { 982 - s = orig_start; 983 - sectors = orig_len; 984 - space_desired = 0; 985 - if (retried++ < 3) 986 - goto re_insert; 987 - } 1023 + try_adjacent_combine(bb, prev); 988 1024 989 1025 out: 990 1026 if (added) { ··· 975 1057 976 1058 write_sequnlock_irqrestore(&bb->lock, flags); 977 1059 978 - if (!added) 979 - rv = 1; 980 - 981 - return rv; 1060 + return sectors == 0; 982 1061 } 983 1062 984 1063 /* ··· 1046 1131 } 1047 1132 1048 1133 /* Do the exact work to clear bad block range from the bad block table */ 1049 - static int _badblocks_clear(struct badblocks *bb, sector_t s, int sectors) 1134 + static bool _badblocks_clear(struct badblocks *bb, sector_t s, sector_t sectors) 1050 1135 { 1051 1136 struct badblocks_context bad; 1052 1137 int prev = -1, hint = -1; 1053 1138 int len = 0, cleared = 0; 1054 - int rv = 0; 1055 1139 u64 *p; 1056 1140 1057 1141 if (bb->shift < 0) 1058 1142 /* badblocks are disabled */ 1059 - return 1; 1143 + return false; 1060 1144 1061 1145 if (sectors == 0) 1062 1146 /* Invalid sectors number */ 1063 - return 1; 1147 + return false; 1064 1148 1065 1149 if (bb->shift) { 1066 1150 sector_t target; ··· 1071 1157 * isn't than to think a block is not bad when it is. 
1072 1158 */
1073 1159 target = s + sectors;
1074 - roundup(s, bb->shift);
1075 - rounddown(target, bb->shift);
1160 + roundup(s, 1 << bb->shift);
1161 + rounddown(target, 1 << bb->shift);
1076 1162 sectors = target - s;
1077 1163 }
1078 1164
···
1128 1214 if ((BB_OFFSET(p[prev]) < bad.start) &&
1129 1215 (BB_END(p[prev]) > (bad.start + bad.len))) {
1130 1216 /* Splitting */
1131 - if ((bb->count + 1) < MAX_BADBLOCKS) {
1217 + if ((bb->count + 1) <= MAX_BADBLOCKS) {
1132 1218 len = front_splitting_clear(bb, prev, &bad);
1133 1219 bb->count += 1;
1134 1220 cleared++;
···
1169 1255 if (sectors > 0)
1170 1256 goto re_clear;
1171 1257
1172 - WARN_ON(sectors < 0);
1173 -
1174 1258 if (cleared) {
1175 1259 badblocks_update_acked(bb);
1176 1260 set_changed(bb);
···
1177 1265 write_sequnlock_irq(&bb->lock);
1178 1266
1179 1267 if (!cleared)
1180 - rv = 1;
1268 + return false;
1181 1269
1182 - return rv;
1270 + return true;
1183 1271 }
1184 1272
1185 1273 /* Do the exact work to check bad blocks range from the bad block table */
1186 - static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
1187 - sector_t *first_bad, int *bad_sectors)
1274 + static int _badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors,
1275 + sector_t *first_bad, sector_t *bad_sectors)
1188 1276 {
1189 - int unacked_badblocks, acked_badblocks;
1190 1277 int prev = -1, hint = -1, set = 0;
1191 1278 struct badblocks_context bad;
1192 - unsigned int seq;
1279 + int unacked_badblocks = 0;
1280 + int acked_badblocks = 0;
1281 + u64 *p = bb->page;
1193 1282 int len, rv;
1194 - u64 *p;
1195 -
1196 - WARN_ON(bb->shift < 0 || sectors == 0);
1197 -
1198 - if (bb->shift > 0) {
1199 - sector_t target;
1200 -
1201 - /* round the start down, and the end up */
1202 - target = s + sectors;
1203 - rounddown(s, bb->shift);
1204 - roundup(target, bb->shift);
1205 - sectors = target - s;
1206 - }
1207 -
1208 - retry:
1209 - seq = read_seqbegin(&bb->lock);
1210 -
1211 - p = bb->page;
1212 - unacked_badblocks = 0;
1213 - acked_badblocks = 0;
1214 1283
1215 1284 re_check:
1216 1285 bad.start = s;
···
1242 1349 len = sectors;
1243 1350
1244 1351 update_sectors:
1352 + /* This situation should never happen */
1353 + WARN_ON(sectors < len);
1354 +
1245 1355 s += len;
1246 1356 sectors -= len;
1247 1357
1248 1358 if (sectors > 0)
1249 1359 goto re_check;
1250 -
1251 - WARN_ON(sectors < 0);
1252 1360
1253 1361 if (unacked_badblocks > 0)
1254 1362 rv = -1;
···
1257 1363 rv = 1;
1258 1364 else
1259 1365 rv = 0;
1260 -
1261 - if (read_seqretry(&bb->lock, seq))
1262 - goto retry;
1263 1366
1264 1367 return rv;
1265 1368 }
···
1295 1404 * -1: there are bad blocks which have not yet been acknowledged in metadata.
1296 1405 * plus the start/length of the first bad section we overlap.
1297 1406 */
1298 - int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
1299 - sector_t *first_bad, int *bad_sectors)
1407 + int badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors,
1408 + sector_t *first_bad, sector_t *bad_sectors)
1300 1409 {
1301 - return _badblocks_check(bb, s, sectors, first_bad, bad_sectors);
1410 + unsigned int seq;
1411 + int rv;
1412 +
1413 + WARN_ON(bb->shift < 0 || sectors == 0);
1414 +
1415 + if (bb->shift > 0) {
1416 + /* round the start down, and the end up */
1417 + sector_t target = s + sectors;
1418 +
1419 + rounddown(s, 1 << bb->shift);
1420 + roundup(target, 1 << bb->shift);
1421 + sectors = target - s;
1422 + }
1423 +
1424 + retry:
1425 + seq = read_seqbegin(&bb->lock);
1426 + rv = _badblocks_check(bb, s, sectors, first_bad, bad_sectors);
1427 + if (read_seqretry(&bb->lock, seq))
1428 + goto retry;
1429 +
1430 + return rv;
1302 1431 }
1303 1432 EXPORT_SYMBOL_GPL(badblocks_check);
1304 1433
···
1334 1423 * decide how best to handle it.
1335 1424 *
1336 1425 * Return:
1337 - * 0: success
1338 - * 1: failed to set badblocks (out of space)
1426 + * true: success
1427 + * false: failed to set badblocks (out of space). Partial setting will be
1428 + * treated as failure.
1339 1429 */
1340 - int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
1341 - int acknowledged)
1430 + bool badblocks_set(struct badblocks *bb, sector_t s, sector_t sectors,
1431 + int acknowledged)
1342 1432 {
1343 1433 return _badblocks_set(bb, s, sectors, acknowledged);
1344 1434 }
···
1356 1444 * drop the remove request.
1357 1445 *
1358 1446 * Return:
1359 - * 0: success
1360 - * 1: failed to clear badblocks
1447 + * true: success
1448 + * false: failed to clear badblocks
1361 1449 */
1362 - int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
1450 + bool badblocks_clear(struct badblocks *bb, sector_t s, sector_t sectors)
1363 1451 {
1364 1452 return _badblocks_clear(bb, s, sectors);
1365 1453 }
···
1391 1479 p[i] = BB_MAKE(start, len, 1);
1392 1480 }
1393 1481 }
1482 +
1483 + for (i = 0; i < bb->count ; i++)
1484 + while (try_adjacent_combine(bb, i))
1485 + ;
1486 +
1394 1487 bb->unacked_exist = 0;
1395 1488 }
1396 1489 write_sequnlock_irq(&bb->lock);
···
1481 1564 return -EINVAL;
1482 1565 }
1483 1566
1484 - if (badblocks_set(bb, sector, length, !unack))
1567 + if (!badblocks_set(bb, sector, length, !unack))
1485 1568 return -ENOSPC;
1486 - else
1487 - return len;
1569 +
1570 + return len;
1488 1571 }
1489 1572 EXPORT_SYMBOL_GPL(badblocks_store);
1490 1573
+191
block/bio-integrity-auto.c
···
1 + // SPDX-License-Identifier: GPL-2.0
2 + /*
3 + * Copyright (C) 2007, 2008, 2009 Oracle Corporation
4 + * Written by: Martin K. Petersen <martin.petersen@oracle.com>
5 + *
6 + * Automatically generate and verify integrity data on PI capable devices if the
7 + * bio submitter didn't provide PI itself. This ensures that kernel verifies
8 + * data integrity even if the file system (or other user of the block device) is
9 + * not aware of PI.
10 + */
11 + #include <linux/blk-integrity.h>
12 + #include <linux/workqueue.h>
13 + #include "blk.h"
14 +
15 + struct bio_integrity_data {
16 + struct bio *bio;
17 + struct bvec_iter saved_bio_iter;
18 + struct work_struct work;
19 + struct bio_integrity_payload bip;
20 + struct bio_vec bvec;
21 + };
22 +
23 + static struct kmem_cache *bid_slab;
24 + static mempool_t bid_pool;
25 + static struct workqueue_struct *kintegrityd_wq;
26 +
27 + static void bio_integrity_finish(struct bio_integrity_data *bid)
28 + {
29 + bid->bio->bi_integrity = NULL;
30 + bid->bio->bi_opf &= ~REQ_INTEGRITY;
31 + kfree(bvec_virt(bid->bip.bip_vec));
32 + mempool_free(bid, &bid_pool);
33 + }
34 +
35 + static void bio_integrity_verify_fn(struct work_struct *work)
36 + {
37 + struct bio_integrity_data *bid =
38 + container_of(work, struct bio_integrity_data, work);
39 + struct bio *bio = bid->bio;
40 +
41 + blk_integrity_verify_iter(bio, &bid->saved_bio_iter);
42 + bio_integrity_finish(bid);
43 + bio_endio(bio);
44 + }
45 +
46 + /**
47 + * __bio_integrity_endio - Integrity I/O completion function
48 + * @bio: Protected bio
49 + *
50 + * Normally I/O completion is done in interrupt context. However, verifying I/O
51 + * integrity is a time-consuming task which must be run in process context.
52 + *
53 + * This function postpones completion accordingly.
54 + */
55 + bool __bio_integrity_endio(struct bio *bio)
56 + {
57 + struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
58 + struct bio_integrity_payload *bip = bio_integrity(bio);
59 + struct bio_integrity_data *bid =
60 + container_of(bip, struct bio_integrity_data, bip);
61 +
62 + if (bio_op(bio) == REQ_OP_READ && !bio->bi_status && bi->csum_type) {
63 + INIT_WORK(&bid->work, bio_integrity_verify_fn);
64 + queue_work(kintegrityd_wq, &bid->work);
65 + return false;
66 + }
67 +
68 + bio_integrity_finish(bid);
69 + return true;
70 + }
71 +
72 + /**
73 + * bio_integrity_prep - Prepare bio for integrity I/O
74 + * @bio: bio to prepare
75 + *
76 + * Checks if the bio already has an integrity payload attached. If it does, the
77 + * payload has been generated by another kernel subsystem, and we just pass it
78 + * through.
79 + * Otherwise allocates integrity payload and for writes the integrity metadata
80 + * will be generated. For reads, the completion handler will verify the
81 + * metadata.
82 + */
83 + bool bio_integrity_prep(struct bio *bio)
84 + {
85 + struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
86 + struct bio_integrity_data *bid;
87 + gfp_t gfp = GFP_NOIO;
88 + unsigned int len;
89 + void *buf;
90 +
91 + if (!bi)
92 + return true;
93 +
94 + if (!bio_sectors(bio))
95 + return true;
96 +
97 + /* Already protected? */
98 + if (bio_integrity(bio))
99 + return true;
100 +
101 + switch (bio_op(bio)) {
102 + case REQ_OP_READ:
103 + if (bi->flags & BLK_INTEGRITY_NOVERIFY)
104 + return true;
105 + break;
106 + case REQ_OP_WRITE:
107 + if (bi->flags & BLK_INTEGRITY_NOGENERATE)
108 + return true;
109 +
110 + /*
111 + * Zero the memory allocated to not leak uninitialized kernel
112 + * memory to disk for non-integrity metadata where nothing else
113 + * initializes the memory.
114 + */
115 + if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE)
116 + gfp |= __GFP_ZERO;
117 + break;
118 + default:
119 + return true;
120 + }
121 +
122 + if (WARN_ON_ONCE(bio_has_crypt_ctx(bio)))
123 + return true;
124 +
125 + /* Allocate kernel buffer for protection data */
126 + len = bio_integrity_bytes(bi, bio_sectors(bio));
127 + buf = kmalloc(len, gfp);
128 + if (!buf)
129 + goto err_end_io;
130 + bid = mempool_alloc(&bid_pool, GFP_NOIO);
131 + if (!bid)
132 + goto err_free_buf;
133 + bio_integrity_init(bio, &bid->bip, &bid->bvec, 1);
134 +
135 + bid->bio = bio;
136 +
137 + bid->bip.bip_flags |= BIP_BLOCK_INTEGRITY;
138 + bip_set_seed(&bid->bip, bio->bi_iter.bi_sector);
139 +
140 + if (bi->csum_type == BLK_INTEGRITY_CSUM_IP)
141 + bid->bip.bip_flags |= BIP_IP_CHECKSUM;
142 + if (bi->csum_type)
143 + bid->bip.bip_flags |= BIP_CHECK_GUARD;
144 + if (bi->flags & BLK_INTEGRITY_REF_TAG)
145 + bid->bip.bip_flags |= BIP_CHECK_REFTAG;
146 +
147 + if (bio_integrity_add_page(bio, virt_to_page(buf), len,
148 + offset_in_page(buf)) < len)
149 + goto err_end_io;
150 +
151 + /* Auto-generate integrity metadata if this is a write */
152 + if (bio_data_dir(bio) == WRITE)
153 + blk_integrity_generate(bio);
154 + else
155 + bid->saved_bio_iter = bio->bi_iter;
156 + return true;
157 +
158 + err_free_buf:
159 + kfree(buf);
160 + err_end_io:
161 + bio->bi_status = BLK_STS_RESOURCE;
162 + bio_endio(bio);
163 + return false;
164 + }
165 + EXPORT_SYMBOL(bio_integrity_prep);
166 +
167 + void blk_flush_integrity(void)
168 + {
169 + flush_workqueue(kintegrityd_wq);
170 + }
171 +
172 + static int __init blk_integrity_auto_init(void)
173 + {
174 + bid_slab = kmem_cache_create("bio_integrity_data",
175 + sizeof(struct bio_integrity_data), 0,
176 + SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
177 +
178 + if (mempool_init_slab_pool(&bid_pool, BIO_POOL_SIZE, bid_slab))
179 + panic("bio: can't create integrity pool\n");
180 +
181 + /*
182 + * kintegrityd won't block much but may burn a lot of CPU cycles.
183 + * Make it highpri CPU intensive wq with max concurrency of 1.
184 + */
185 + kintegrityd_wq = alloc_workqueue("kintegrityd", WQ_MEM_RECLAIM |
186 + WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1);
187 + if (!kintegrityd_wq)
188 + panic("Failed to create kintegrityd\n");
189 + return 0;
190 + }
191 + subsys_initcall(blk_integrity_auto_init);
+22 -244
block/bio-integrity.c
···
7 7 */
8 8
9 9 #include <linux/blk-integrity.h>
10 - #include <linux/mempool.h>
11 - #include <linux/export.h>
12 - #include <linux/bio.h>
13 - #include <linux/workqueue.h>
14 - #include <linux/slab.h>
15 10 #include "blk.h"
16 11
17 - static struct kmem_cache *bip_slab;
18 - static struct workqueue_struct *kintegrityd_wq;
19 -
20 - void blk_flush_integrity(void)
21 - {
22 - flush_workqueue(kintegrityd_wq);
23 - }
12 + struct bio_integrity_alloc {
13 + struct bio_integrity_payload bip;
14 + struct bio_vec bvecs[];
15 + };
24 16
25 17 /**
26 18 * bio_integrity_free - Free bio integrity payload
···
22 30 */
23 31 void bio_integrity_free(struct bio *bio)
24 32 {
25 - struct bio_integrity_payload *bip = bio_integrity(bio);
26 - struct bio_set *bs = bio->bi_pool;
27 -
28 - if (bs && mempool_initialized(&bs->bio_integrity_pool)) {
29 - if (bip->bip_vec)
30 - bvec_free(&bs->bvec_integrity_pool, bip->bip_vec,
31 - bip->bip_max_vcnt);
32 - mempool_free(bip, &bs->bio_integrity_pool);
33 - } else {
34 - kfree(bip);
35 - }
33 + kfree(bio_integrity(bio));
36 34 bio->bi_integrity = NULL;
37 35 bio->bi_opf &= ~REQ_INTEGRITY;
36 + }
37 +
38 + void bio_integrity_init(struct bio *bio, struct bio_integrity_payload *bip,
39 + struct bio_vec *bvecs, unsigned int nr_vecs)
40 + {
41 + memset(bip, 0, sizeof(*bip));
42 + bip->bip_max_vcnt = nr_vecs;
43 + if (nr_vecs)
44 + bip->bip_vec = bvecs;
45 +
46 + bio->bi_integrity = bip;
47 + bio->bi_opf |= REQ_INTEGRITY;
38 48 }
39 49
40 50 /**
···
53 59 gfp_t gfp_mask,
54 60 unsigned int nr_vecs)
55 61 {
56 - struct bio_integrity_payload *bip;
57 - struct bio_set *bs = bio->bi_pool;
58 - unsigned inline_vecs;
62 + struct bio_integrity_alloc *bia;
59 63
60 64 if (WARN_ON_ONCE(bio_has_crypt_ctx(bio)))
61 65 return ERR_PTR(-EOPNOTSUPP);
62 66
63 - if (!bs || !mempool_initialized(&bs->bio_integrity_pool)) {
64 - bip = kmalloc(struct_size(bip, bip_inline_vecs, nr_vecs), gfp_mask);
65 - inline_vecs = nr_vecs;
66 - } else {
67 - bip = mempool_alloc(&bs->bio_integrity_pool, gfp_mask);
68 - inline_vecs = BIO_INLINE_VECS;
69 - }
70 -
71 - if (unlikely(!bip))
67 + bia = kmalloc(struct_size(bia, bvecs, nr_vecs), gfp_mask);
68 + if (unlikely(!bia))
72 69 return ERR_PTR(-ENOMEM);
73 -
74 - memset(bip, 0, sizeof(*bip));
75 -
76 - /* always report as many vecs as asked explicitly, not inline vecs */
77 - bip->bip_max_vcnt = nr_vecs;
78 - if (nr_vecs > inline_vecs) {
79 - bip->bip_vec = bvec_alloc(&bs->bvec_integrity_pool,
80 - &bip->bip_max_vcnt, gfp_mask);
81 - if (!bip->bip_vec)
82 - goto err;
83 - } else if (nr_vecs) {
84 - bip->bip_vec = bip->bip_inline_vecs;
85 - }
86 -
87 - bip->bip_bio = bio;
88 - bio->bi_integrity = bip;
89 - bio->bi_opf |= REQ_INTEGRITY;
90 -
91 - return bip;
92 - err:
93 - if (bs && mempool_initialized(&bs->bio_integrity_pool))
94 - mempool_free(bip, &bs->bio_integrity_pool);
95 - else
96 - kfree(bip);
97 - return ERR_PTR(-ENOMEM);
70 + bio_integrity_init(bio, &bia->bip, bia->bvecs, nr_vecs);
71 + return &bia->bip;
98 72 }
99 73 EXPORT_SYMBOL(bio_integrity_alloc);
100 74
···
376 414 }
377 415
378 416 /**
379 - * bio_integrity_prep - Prepare bio for integrity I/O
380 - * @bio: bio to prepare
381 - *
382 - * Description: Checks if the bio already has an integrity payload attached.
383 - * If it does, the payload has been generated by another kernel subsystem,
384 - * and we just pass it through. Otherwise allocates integrity payload.
385 - * The bio must have data direction, target device and start sector set priot
386 - * to calling. In the WRITE case, integrity metadata will be generated using
387 - * the block device's integrity function. In the READ case, the buffer
388 - * will be prepared for DMA and a suitable end_io handler set up.
389 - */
390 - bool bio_integrity_prep(struct bio *bio)
391 - {
392 - struct bio_integrity_payload *bip;
393 - struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
394 - unsigned int len;
395 - void *buf;
396 - gfp_t gfp = GFP_NOIO;
397 -
398 - if (!bi)
399 - return true;
400 -
401 - if (!bio_sectors(bio))
402 - return true;
403 -
404 - /* Already protected? */
405 - if (bio_integrity(bio))
406 - return true;
407 -
408 - switch (bio_op(bio)) {
409 - case REQ_OP_READ:
410 - if (bi->flags & BLK_INTEGRITY_NOVERIFY)
411 - return true;
412 - break;
413 - case REQ_OP_WRITE:
414 - if (bi->flags & BLK_INTEGRITY_NOGENERATE)
415 - return true;
416 -
417 - /*
418 - * Zero the memory allocated to not leak uninitialized kernel
419 - * memory to disk for non-integrity metadata where nothing else
420 - * initializes the memory.
421 - */
422 - if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE)
423 - gfp |= __GFP_ZERO;
424 - break;
425 - default:
426 - return true;
427 - }
428 -
429 - /* Allocate kernel buffer for protection data */
430 - len = bio_integrity_bytes(bi, bio_sectors(bio));
431 - buf = kmalloc(len, gfp);
432 - if (unlikely(buf == NULL)) {
433 - goto err_end_io;
434 - }
435 -
436 - bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
437 - if (IS_ERR(bip)) {
438 - kfree(buf);
439 - goto err_end_io;
440 - }
441 -
442 - bip->bip_flags |= BIP_BLOCK_INTEGRITY;
443 - bip_set_seed(bip, bio->bi_iter.bi_sector);
444 -
445 - if (bi->csum_type == BLK_INTEGRITY_CSUM_IP)
446 - bip->bip_flags |= BIP_IP_CHECKSUM;
447 -
448 - /* describe what tags to check in payload */
449 - if (bi->csum_type)
450 - bip->bip_flags |= BIP_CHECK_GUARD;
451 - if (bi->flags & BLK_INTEGRITY_REF_TAG)
452 - bip->bip_flags |= BIP_CHECK_REFTAG;
453 - if (bio_integrity_add_page(bio, virt_to_page(buf), len,
454 - offset_in_page(buf)) < len) {
455 - printk(KERN_ERR "could not attach integrity payload\n");
456 - goto err_end_io;
457 - }
458 -
459 - /* Auto-generate integrity metadata if this is a write */
460 - if (bio_data_dir(bio) == WRITE)
461 - blk_integrity_generate(bio);
462 - else
463 - bip->bio_iter = bio->bi_iter;
464 - return true;
465 -
466 - err_end_io:
467 - bio->bi_status = BLK_STS_RESOURCE;
468 - bio_endio(bio);
469 - return false;
470 - }
471 - EXPORT_SYMBOL(bio_integrity_prep);
472 -
473 - /**
474 - * bio_integrity_verify_fn - Integrity I/O completion worker
475 - * @work: Work struct stored in bio to be verified
476 - *
477 - * Description: This workqueue function is called to complete a READ
478 - * request. The function verifies the transferred integrity metadata
479 - * and then calls the original bio end_io function.
480 - */
481 - static void bio_integrity_verify_fn(struct work_struct *work)
482 - {
483 - struct bio_integrity_payload *bip =
484 - container_of(work, struct bio_integrity_payload, bip_work);
485 - struct bio *bio = bip->bip_bio;
486 -
487 - blk_integrity_verify(bio);
488 -
489 - kfree(bvec_virt(bip->bip_vec));
490 - bio_integrity_free(bio);
491 - bio_endio(bio);
492 - }
493 -
494 - /**
495 - * __bio_integrity_endio - Integrity I/O completion function
496 - * @bio: Protected bio
497 - *
498 - * Description: Completion for integrity I/O
499 - *
500 - * Normally I/O completion is done in interrupt context. However,
501 - * verifying I/O integrity is a time-consuming task which must be run
502 - * in process context. This function postpones completion
503 - * accordingly.
504 - */
505 - bool __bio_integrity_endio(struct bio *bio)
506 - {
507 - struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
508 - struct bio_integrity_payload *bip = bio_integrity(bio);
509 -
510 - if (bio_op(bio) == REQ_OP_READ && !bio->bi_status && bi->csum_type) {
511 - INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
512 - queue_work(kintegrityd_wq, &bip->bip_work);
513 - return false;
514 - }
515 -
516 - kfree(bvec_virt(bip->bip_vec));
517 - bio_integrity_free(bio);
518 - return true;
519 - }
520 -
521 - /**
522 417 * bio_integrity_advance - Advance integrity vector
523 418 * @bio: bio whose integrity vector to update
524 419 * @bytes_done: number of data bytes that have been completed
···
435 616 bip->app_tag = bip_src->app_tag;
436 617
437 618 return 0;
438 - }
439 -
440 - int bioset_integrity_create(struct bio_set *bs, int pool_size)
441 - {
442 - if (mempool_initialized(&bs->bio_integrity_pool))
443 - return 0;
444 -
445 - if (mempool_init_slab_pool(&bs->bio_integrity_pool,
446 - pool_size, bip_slab))
447 - return -1;
448 -
449 - if (biovec_init_pool(&bs->bvec_integrity_pool, pool_size)) {
450 - mempool_exit(&bs->bio_integrity_pool);
451 - return -1;
452 - }
453 -
454 - return 0;
455 - }
456 - EXPORT_SYMBOL(bioset_integrity_create);
457 -
458 - void bioset_integrity_free(struct bio_set *bs)
459 - {
460 - mempool_exit(&bs->bio_integrity_pool);
461 - mempool_exit(&bs->bvec_integrity_pool);
462 - }
463 -
464 - void __init bio_integrity_init(void)
465 - {
466 - /*
467 - * kintegrityd won't block much but may burn a lot of CPU cycles.
468 - * Make it highpri CPU intensive wq with max concurrency of 1.
469 - */
470 - kintegrityd_wq = alloc_workqueue("kintegrityd", WQ_MEM_RECLAIM |
471 - WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1);
472 - if (!kintegrityd_wq)
473 - panic("Failed to create kintegrityd\n");
474 -
475 - bip_slab = kmem_cache_create("bio_integrity_payload",
476 - sizeof(struct bio_integrity_payload) +
477 - sizeof(struct bio_vec) * BIO_INLINE_VECS,
478 - 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
479 619 }
+7 -10
block/bio.c
···
1026 1026 void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
1027 1027 size_t off)
1028 1028 {
1029 + unsigned long nr = off / PAGE_SIZE;
1030 +
1029 1031 WARN_ON_ONCE(len > UINT_MAX);
1030 - WARN_ON_ONCE(off > UINT_MAX);
1031 - __bio_add_page(bio, &folio->page, len, off);
1032 + __bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE);
1032 1033 }
1033 1034 EXPORT_SYMBOL_GPL(bio_add_folio_nofail);
1034 1035
···
1050 1049 bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
1051 1050 size_t off)
1052 1051 {
1053 - if (len > UINT_MAX || off > UINT_MAX)
1052 + unsigned long nr = off / PAGE_SIZE;
1053 +
1054 + if (len > UINT_MAX)
1054 1055 return false;
1055 - return bio_add_page(bio, &folio->page, len, off) > 0;
1056 + return bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE) > 0;
1056 1057 }
1057 1058 EXPORT_SYMBOL(bio_add_folio);
1058 1059
···
1660 1657 mempool_exit(&bs->bio_pool);
1661 1658 mempool_exit(&bs->bvec_pool);
1662 1659
1663 - bioset_integrity_free(bs);
1664 1660 if (bs->bio_slab)
1665 1661 bio_put_slab(bs);
1666 1662 bs->bio_slab = NULL;
···
1739 1737
1740 1738 BUILD_BUG_ON(BIO_FLAG_LAST > 8 * sizeof_field(struct bio, bi_flags));
1741 1739
1742 - bio_integrity_init();
1743 -
1744 1740 for (i = 0; i < ARRAY_SIZE(bvec_slabs); i++) {
1745 1741 struct biovec_slab *bvs = bvec_slabs + i;
1746 1742
···
1753 1753 if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0,
1754 1754 BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE))
1755 1755 panic("bio: can't allocate bios\n");
1756 -
1757 - if (bioset_integrity_create(&fs_bio_set, BIO_POOL_SIZE))
1758 - panic("bio: can't create integrity pool\n");
1759 1756
1760 1757 return 0;
1761 1758 }
+67 -14
block/blk-cgroup.c
···
816 816 ctx->bdev = bdev;
817 817 return 0;
818 818 }
819 + /*
820 + * Similar to blkg_conf_open_bdev, but additionally freezes the queue,
821 + * acquires q->elevator_lock, and ensures the correct locking order
822 + * between q->elevator_lock and q->rq_qos_mutex.
823 + *
824 + * This function returns negative error on failure. On success it returns
825 + * memflags which must be saved and later passed to blkg_conf_exit_frozen
826 + * for restoring the memalloc scope.
827 + */
828 + unsigned long __must_check blkg_conf_open_bdev_frozen(struct blkg_conf_ctx *ctx)
829 + {
830 + int ret;
831 + unsigned long memflags;
832 +
833 + if (ctx->bdev)
834 + return -EINVAL;
835 +
836 + ret = blkg_conf_open_bdev(ctx);
837 + if (ret < 0)
838 + return ret;
839 + /*
840 + * At this point, we haven't started protecting anything related to QoS,
841 + * so we release q->rq_qos_mutex here, which was first acquired in blkg_
842 + * conf_open_bdev. Later, we re-acquire q->rq_qos_mutex after freezing
843 + * the queue and acquiring q->elevator_lock to maintain the correct
844 + * locking order.
845 + */
846 + mutex_unlock(&ctx->bdev->bd_queue->rq_qos_mutex);
847 +
848 + memflags = blk_mq_freeze_queue(ctx->bdev->bd_queue);
849 + mutex_lock(&ctx->bdev->bd_queue->elevator_lock);
850 + mutex_lock(&ctx->bdev->bd_queue->rq_qos_mutex);
851 +
852 + return memflags;
853 + }
819 854
820 855 /**
821 856 * blkg_conf_prep - parse and prepare for per-blkg config update
···
1006 971 }
1007 972 }
1008 973 EXPORT_SYMBOL_GPL(blkg_conf_exit);
974 +
975 + /*
976 + * Similar to blkg_conf_exit, but also unfreezes the queue and releases
977 + * q->elevator_lock. Should be used when blkg_conf_open_bdev_frozen
978 + * is used to open the bdev.
979 + */
980 + void blkg_conf_exit_frozen(struct blkg_conf_ctx *ctx, unsigned long memflags)
981 + {
982 + if (ctx->bdev) {
983 + struct request_queue *q = ctx->bdev->bd_queue;
984 +
985 + blkg_conf_exit(ctx);
986 + mutex_unlock(&q->elevator_lock);
987 + blk_mq_unfreeze_queue(q, memflags);
988 + }
989 + }
1009 990
1010 991 static void blkg_iostat_add(struct blkg_iostat *dst, struct blkg_iostat *src)
1011 992 {
···
1779 1728 struct blkcg *blkcg;
1780 1729 int i, ret;
1781 1730
1782 - mutex_lock(&blkcg_pol_register_mutex);
1783 - mutex_lock(&blkcg_pol_mutex);
1784 -
1785 - /* find an empty slot */
1786 - ret = -ENOSPC;
1787 - for (i = 0; i < BLKCG_MAX_POLS; i++)
1788 - if (!blkcg_policy[i])
1789 - break;
1790 - if (i >= BLKCG_MAX_POLS) {
1791 - pr_warn("blkcg_policy_register: BLKCG_MAX_POLS too small\n");
1792 - goto err_unlock;
1793 - }
1794 -
1795 1731 /*
1796 1732 * Make sure cpd/pd_alloc_fn and cpd/pd_free_fn in pairs, and policy
1797 1733 * without pd_alloc_fn/pd_free_fn can't be activated.
1798 1734 */
1799 1735 if ((!pol->cpd_alloc_fn ^ !pol->cpd_free_fn) ||
1800 1736 (!pol->pd_alloc_fn ^ !pol->pd_free_fn))
1737 + return -EINVAL;
1738 +
1739 + mutex_lock(&blkcg_pol_register_mutex);
1740 + mutex_lock(&blkcg_pol_mutex);
1741 +
1742 + /* find an empty slot */
1743 + for (i = 0; i < BLKCG_MAX_POLS; i++)
1744 + if (!blkcg_policy[i])
1745 + break;
1746 + if (i >= BLKCG_MAX_POLS) {
1747 + pr_warn("blkcg_policy_register: BLKCG_MAX_POLS too small\n");
1748 + ret = -ENOSPC;
1801 1749 goto err_unlock;
1750 + }
1802 1751
1803 1752 /* register @pol */
1804 1753 pol->plid = i;
···
1810 1759 struct blkcg_policy_data *cpd;
1811 1760
1812 1761 cpd = pol->cpd_alloc_fn(GFP_KERNEL);
1813 - if (!cpd)
1762 + if (!cpd) {
1763 + ret = -ENOMEM;
1814 1764 goto err_free_cpds;
1765 + }
1815 1766
1816 1767 blkcg->cpd[pol->plid] = cpd;
1817 1768 cpd->blkcg = blkcg;
+2
block/blk-cgroup.h
···
219 219
220 220 void blkg_conf_init(struct blkg_conf_ctx *ctx, char *input);
221 221 int blkg_conf_open_bdev(struct blkg_conf_ctx *ctx);
222 + unsigned long blkg_conf_open_bdev_frozen(struct blkg_conf_ctx *ctx);
222 223 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
223 224 struct blkg_conf_ctx *ctx);
224 225 void blkg_conf_exit(struct blkg_conf_ctx *ctx);
226 + void blkg_conf_exit_frozen(struct blkg_conf_ctx *ctx, unsigned long memflags);
225 227
226 228 /**
227 229 * bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg
+7
block/blk-core.c
···
429 429
430 430 refcount_set(&q->refs, 1);
431 431 mutex_init(&q->debugfs_mutex);
432 + mutex_init(&q->elevator_lock);
432 433 mutex_init(&q->sysfs_lock);
433 434 mutex_init(&q->limits_lock);
434 435 mutex_init(&q->rq_qos_mutex);
···
455 454 &q->io_lock_cls_key, 0);
456 455 lockdep_init_map(&q->q_lockdep_map, "&q->q_usage_counter(queue)",
457 456 &q->q_lock_cls_key, 0);
458 +
459 + /* Teach lockdep about lock ordering (reclaim WRT queue freeze lock). */
460 + fs_reclaim_acquire(GFP_KERNEL);
461 + rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
462 + rwsem_release(&q->io_lockdep_map, _RET_IP_);
463 + fs_reclaim_release(GFP_KERNEL);
458 463
459 464 q->nr_requests = BLKDEV_DEFAULT_RQ;
460 465
+4 -3
block/blk-crypto-fallback.c
···
87 87 * This is the key we set when evicting a keyslot. This *should* be the all 0's
88 88 * key, but AES-XTS rejects that key, so we use some random bytes instead.
89 89 */
90 - static u8 blank_key[BLK_CRYPTO_MAX_KEY_SIZE];
90 + static u8 blank_key[BLK_CRYPTO_MAX_RAW_KEY_SIZE];
91 91
92 92 static void blk_crypto_fallback_evict_keyslot(unsigned int slot)
93 93 {
···
119 119 blk_crypto_fallback_evict_keyslot(slot);
120 120
121 121 slotp->crypto_mode = crypto_mode;
122 - err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], key->raw,
122 + err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], key->bytes,
123 123 key->size);
124 124 if (err) {
125 125 blk_crypto_fallback_evict_keyslot(slot);
···
539 539 if (blk_crypto_fallback_inited)
540 540 return 0;
541 541
542 - get_random_bytes(blank_key, BLK_CRYPTO_MAX_KEY_SIZE);
542 + get_random_bytes(blank_key, sizeof(blank_key));
543 543
544 544 err = bioset_init(&crypto_bio_split, 64, 0, 0);
545 545 if (err)
···
561 561
562 562 blk_crypto_fallback_profile->ll_ops = blk_crypto_fallback_ll_ops;
563 563 blk_crypto_fallback_profile->max_dun_bytes_supported = BLK_CRYPTO_MAX_IV_SIZE;
564 + blk_crypto_fallback_profile->key_types_supported = BLK_CRYPTO_KEY_TYPE_RAW;
564 565
565 566 /* All blk-crypto modes have a crypto API fallback. */
566 567 for (i = 0; i < BLK_ENCRYPTION_MODE_MAX; i++)
+10
block/blk-crypto-internal.h
···
14 14 const char *name; /* name of this mode, shown in sysfs */
15 15 const char *cipher_str; /* crypto API name (for fallback case) */
16 16 unsigned int keysize; /* key size in bytes */
17 + unsigned int security_strength; /* security strength in bytes */
17 18 unsigned int ivsize; /* iv size in bytes */
18 19 };
19 20
···
83 82 bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile,
84 83 const struct blk_crypto_config *cfg);
85 84
85 + int blk_crypto_ioctl(struct block_device *bdev, unsigned int cmd,
86 + void __user *argp);
87 +
86 88 #else /* CONFIG_BLK_INLINE_ENCRYPTION */
87 89
88 90 static inline int blk_crypto_sysfs_register(struct gendisk *disk)
···
131 127 static inline bool blk_crypto_rq_has_keyslot(struct request *rq)
132 128 {
133 129 return false;
130 + }
131 +
132 + static inline int blk_crypto_ioctl(struct block_device *bdev, unsigned int cmd,
133 + void __user *argp)
134 + {
135 + return -ENOTTY;
134 136 }
135 137
136 138 #endif /* CONFIG_BLK_INLINE_ENCRYPTION */
+101
block/blk-crypto-profile.c
···
352 352 return false;
353 353 if (profile->max_dun_bytes_supported < cfg->dun_bytes)
354 354 return false;
355 + if (!(profile->key_types_supported & cfg->key_type))
356 + return false;
355 357 return true;
356 358 }
357 359
···
465 463 EXPORT_SYMBOL_GPL(blk_crypto_register);
466 464
467 465 /**
466 + * blk_crypto_derive_sw_secret() - Derive software secret from wrapped key
467 + * @bdev: a block device that supports hardware-wrapped keys
468 + * @eph_key: a hardware-wrapped key in ephemerally-wrapped form
469 + * @eph_key_size: size of @eph_key in bytes
470 + * @sw_secret: (output) the software secret
471 + *
472 + * Given a hardware-wrapped key in ephemerally-wrapped form (the same form that
473 + * it is used for I/O), ask the hardware to derive the secret which software can
474 + * use for cryptographic tasks other than inline encryption. This secret is
475 + * guaranteed to be cryptographically isolated from the inline encryption key,
476 + * i.e. derived with a different KDF context.
477 + *
478 + * Return: 0 on success, -EOPNOTSUPP if the block device doesn't support
479 + * hardware-wrapped keys, -EBADMSG if the key isn't a valid
480 + * ephemerally-wrapped key, or another -errno code.
481 + */
482 + int blk_crypto_derive_sw_secret(struct block_device *bdev,
483 + const u8 *eph_key, size_t eph_key_size,
484 + u8 sw_secret[BLK_CRYPTO_SW_SECRET_SIZE])
485 + {
486 + struct blk_crypto_profile *profile =
487 + bdev_get_queue(bdev)->crypto_profile;
488 + int err;
489 +
490 + if (!profile)
491 + return -EOPNOTSUPP;
492 + if (!(profile->key_types_supported & BLK_CRYPTO_KEY_TYPE_HW_WRAPPED))
493 + return -EOPNOTSUPP;
494 + if (!profile->ll_ops.derive_sw_secret)
495 + return -EOPNOTSUPP;
496 + blk_crypto_hw_enter(profile);
497 + err = profile->ll_ops.derive_sw_secret(profile, eph_key, eph_key_size,
498 + sw_secret);
499 + blk_crypto_hw_exit(profile);
500 + return err;
501 + }
502 +
503 + int blk_crypto_import_key(struct blk_crypto_profile *profile,
504 + const u8 *raw_key, size_t raw_key_size,
505 + u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE])
506 + {
507 + int ret;
508 +
509 + if (!profile)
510 + return -EOPNOTSUPP;
511 + if (!(profile->key_types_supported & BLK_CRYPTO_KEY_TYPE_HW_WRAPPED))
512 + return -EOPNOTSUPP;
513 + if (!profile->ll_ops.import_key)
514 + return -EOPNOTSUPP;
515 + blk_crypto_hw_enter(profile);
516 + ret = profile->ll_ops.import_key(profile, raw_key, raw_key_size,
517 + lt_key);
518 + blk_crypto_hw_exit(profile);
519 + return ret;
520 + }
521 +
522 + int blk_crypto_generate_key(struct blk_crypto_profile *profile,
523 + u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE])
524 + {
525 + int ret;
526 +
527 + if (!profile)
528 + return -EOPNOTSUPP;
529 + if (!(profile->key_types_supported & BLK_CRYPTO_KEY_TYPE_HW_WRAPPED))
530 + return -EOPNOTSUPP;
531 + if (!profile->ll_ops.generate_key)
532 + return -EOPNOTSUPP;
533 + blk_crypto_hw_enter(profile);
534 + ret = profile->ll_ops.generate_key(profile, lt_key);
535 + blk_crypto_hw_exit(profile);
536 + return ret;
537 + }
538 +
539 + int blk_crypto_prepare_key(struct blk_crypto_profile *profile,
540 + const u8 *lt_key, size_t lt_key_size,
541 + u8 eph_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE])
542 + {
543 + int ret;
544 +
545 + if (!profile)
546 + return -EOPNOTSUPP;
547 + if (!(profile->key_types_supported & BLK_CRYPTO_KEY_TYPE_HW_WRAPPED))
548 + return -EOPNOTSUPP;
549 + if (!profile->ll_ops.prepare_key)
550 + return -EOPNOTSUPP;
551 + blk_crypto_hw_enter(profile);
552 + ret = profile->ll_ops.prepare_key(profile, lt_key, lt_key_size,
553 + eph_key);
554 + blk_crypto_hw_exit(profile);
555 + return ret;
556 + }
557 +
558 + /**
468 559 * blk_crypto_intersect_capabilities() - restrict supported crypto capabilities
469 560 * by child device
470 561 * @parent: the crypto profile for the parent device
···
580 485 child->max_dun_bytes_supported);
581 486 for (i = 0; i < ARRAY_SIZE(child->modes_supported); i++)
582 487 parent->modes_supported[i] &= child->modes_supported[i];
488 + parent->key_types_supported &= child->key_types_supported;
583 489 } else {
584 490 parent->max_dun_bytes_supported = 0;
585 491 memset(parent->modes_supported, 0,
586 492 sizeof(parent->modes_supported));
493 + parent->key_types_supported = 0;
587 494 }
588 495 }
589 496 EXPORT_SYMBOL_GPL(blk_crypto_intersect_capabilities);
···
616 519
617 520 if (reference->max_dun_bytes_supported >
618 521 target->max_dun_bytes_supported)
522 + return false;
523 +
524 + if (reference->key_types_supported & ~target->key_types_supported)
619 525 return false;
620 526
621 527 return true;
···
655 555 sizeof(dst->modes_supported));
656 556
657 557 dst->max_dun_bytes_supported = src->max_dun_bytes_supported;
558 + dst->key_types_supported = src->key_types_supported;
658 559 }
659 560 EXPORT_SYMBOL_GPL(blk_crypto_update_capabilities);
block/blk-crypto-sysfs.c (+35)

```diff
···
 	return container_of(attr, struct blk_crypto_attr, attr);
 }
 
+static ssize_t hw_wrapped_keys_show(struct blk_crypto_profile *profile,
+				    struct blk_crypto_attr *attr, char *page)
+{
+	/* Always show supported, since the file doesn't exist otherwise. */
+	return sysfs_emit(page, "supported\n");
+}
+
 static ssize_t max_dun_bits_show(struct blk_crypto_profile *profile,
 				 struct blk_crypto_attr *attr, char *page)
 {
···
 	return sysfs_emit(page, "%u\n", profile->num_slots);
 }
 
+static ssize_t raw_keys_show(struct blk_crypto_profile *profile,
+			     struct blk_crypto_attr *attr, char *page)
+{
+	/* Always show supported, since the file doesn't exist otherwise. */
+	return sysfs_emit(page, "supported\n");
+}
+
 #define BLK_CRYPTO_RO_ATTR(_name) \
 	static struct blk_crypto_attr _name##_attr = __ATTR_RO(_name)
 
+BLK_CRYPTO_RO_ATTR(hw_wrapped_keys);
 BLK_CRYPTO_RO_ATTR(max_dun_bits);
 BLK_CRYPTO_RO_ATTR(num_keyslots);
+BLK_CRYPTO_RO_ATTR(raw_keys);
+
+static umode_t blk_crypto_is_visible(struct kobject *kobj,
+				     struct attribute *attr, int n)
+{
+	struct blk_crypto_profile *profile = kobj_to_crypto_profile(kobj);
+	struct blk_crypto_attr *a = attr_to_crypto_attr(attr);
+
+	if (a == &hw_wrapped_keys_attr &&
+	    !(profile->key_types_supported & BLK_CRYPTO_KEY_TYPE_HW_WRAPPED))
+		return 0;
+	if (a == &raw_keys_attr &&
+	    !(profile->key_types_supported & BLK_CRYPTO_KEY_TYPE_RAW))
+		return 0;
+
+	return 0444;
+}
 
 static struct attribute *blk_crypto_attrs[] = {
+	&hw_wrapped_keys_attr.attr,
 	&max_dun_bits_attr.attr,
 	&num_keyslots_attr.attr,
+	&raw_keys_attr.attr,
 	NULL,
 };
 
 static const struct attribute_group blk_crypto_attr_group = {
 	.attrs = blk_crypto_attrs,
+	.is_visible = blk_crypto_is_visible,
 };
 
 /*
```
block/blk-crypto.c (+190 -14)

```diff
···
 		.name = "AES-256-XTS",
 		.cipher_str = "xts(aes)",
 		.keysize = 64,
+		.security_strength = 32,
 		.ivsize = 16,
 	},
 	[BLK_ENCRYPTION_MODE_AES_128_CBC_ESSIV] = {
 		.name = "AES-128-CBC-ESSIV",
 		.cipher_str = "essiv(cbc(aes),sha256)",
 		.keysize = 16,
+		.security_strength = 16,
 		.ivsize = 16,
 	},
 	[BLK_ENCRYPTION_MODE_ADIANTUM] = {
 		.name = "Adiantum",
 		.cipher_str = "adiantum(xchacha12,aes)",
 		.keysize = 32,
+		.security_strength = 32,
 		.ivsize = 32,
 	},
 	[BLK_ENCRYPTION_MODE_SM4_XTS] = {
 		.name = "SM4-XTS",
 		.cipher_str = "xts(sm4)",
 		.keysize = 32,
+		.security_strength = 16,
 		.ivsize = 16,
 	},
 };
···
 	/* This is assumed in various places. */
 	BUILD_BUG_ON(BLK_ENCRYPTION_MODE_INVALID != 0);
 
-	/* Sanity check that no algorithm exceeds the defined limits. */
+	/*
+	 * Validate the crypto mode properties.  This ideally would be done
+	 * with static assertions, but boot-time checks are the next best thing.
+	 */
 	for (i = 0; i < BLK_ENCRYPTION_MODE_MAX; i++) {
-		BUG_ON(blk_crypto_modes[i].keysize > BLK_CRYPTO_MAX_KEY_SIZE);
+		BUG_ON(blk_crypto_modes[i].keysize >
+		       BLK_CRYPTO_MAX_RAW_KEY_SIZE);
+		BUG_ON(blk_crypto_modes[i].security_strength >
+		       blk_crypto_modes[i].keysize);
 		BUG_ON(blk_crypto_modes[i].ivsize > BLK_CRYPTO_MAX_IV_SIZE);
 	}
···
 /**
  * blk_crypto_init_key() - Prepare a key for use with blk-crypto
  * @blk_key: Pointer to the blk_crypto_key to initialize.
- * @raw_key: Pointer to the raw key.  Must be the correct length for the chosen
- *	     @crypto_mode; see blk_crypto_modes[].
+ * @key_bytes: the bytes of the key
+ * @key_size: size of the key in bytes
+ * @key_type: type of the key -- either raw or hardware-wrapped
  * @crypto_mode: identifier for the encryption algorithm to use
  * @dun_bytes: number of bytes that will be used to specify the DUN when this
  *	       key is used
  * @data_unit_size: the data unit size to use for en/decryption
  *
  * Return: 0 on success, -errno on failure.  The caller is responsible for
- * zeroizing both blk_key and raw_key when done with them.
+ * zeroizing both blk_key and key_bytes when done with them.
  */
-int blk_crypto_init_key(struct blk_crypto_key *blk_key, const u8 *raw_key,
+int blk_crypto_init_key(struct blk_crypto_key *blk_key,
+			const u8 *key_bytes, size_t key_size,
+			enum blk_crypto_key_type key_type,
 			enum blk_crypto_mode_num crypto_mode,
 			unsigned int dun_bytes,
 			unsigned int data_unit_size)
···
 		return -EINVAL;
 
 	mode = &blk_crypto_modes[crypto_mode];
-	if (mode->keysize == 0)
+	switch (key_type) {
+	case BLK_CRYPTO_KEY_TYPE_RAW:
+		if (key_size != mode->keysize)
+			return -EINVAL;
+		break;
+	case BLK_CRYPTO_KEY_TYPE_HW_WRAPPED:
+		if (key_size < mode->security_strength ||
+		    key_size > BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE)
+			return -EINVAL;
+		break;
+	default:
 		return -EINVAL;
+	}
 
 	if (dun_bytes == 0 || dun_bytes > mode->ivsize)
 		return -EINVAL;
···
 	blk_key->crypto_cfg.crypto_mode = crypto_mode;
 	blk_key->crypto_cfg.dun_bytes = dun_bytes;
 	blk_key->crypto_cfg.data_unit_size = data_unit_size;
+	blk_key->crypto_cfg.key_type = key_type;
 	blk_key->data_unit_size_bits = ilog2(data_unit_size);
-	blk_key->size = mode->keysize;
-	memcpy(blk_key->raw, raw_key, mode->keysize);
+	blk_key->size = key_size;
+	memcpy(blk_key->bytes, key_bytes, key_size);
 
 	return 0;
 }
···
 bool blk_crypto_config_supported(struct block_device *bdev,
 				 const struct blk_crypto_config *cfg)
 {
-	return IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) ||
-	       blk_crypto_config_supported_natively(bdev, cfg);
+	if (IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) &&
+	    cfg->key_type == BLK_CRYPTO_KEY_TYPE_RAW)
+		return true;
+	return blk_crypto_config_supported_natively(bdev, cfg);
 }
 
 /**
···
  * an skcipher, and *should not* be called from the data path, since that might
  * cause a deadlock
  *
- * Return: 0 on success; -ENOPKG if the hardware doesn't support the key and
- *	   blk-crypto-fallback is either disabled or the needed algorithm
- *	   is disabled in the crypto API; or another -errno code.
+ * Return: 0 on success; -EOPNOTSUPP if the key is wrapped but the hardware does
+ *	   not support wrapped keys; -ENOPKG if the key is a raw key but the
+ *	   hardware does not support raw keys and blk-crypto-fallback is either
+ *	   disabled or the needed algorithm is disabled in the crypto API; or
+ *	   another -errno code if something else went wrong.
  */
 int blk_crypto_start_using_key(struct block_device *bdev,
 			       const struct blk_crypto_key *key)
 {
 	if (blk_crypto_config_supported_natively(bdev, &key->crypto_cfg))
 		return 0;
+	if (key->crypto_cfg.key_type != BLK_CRYPTO_KEY_TYPE_RAW) {
+		pr_warn_ratelimited("%pg: no support for wrapped keys\n", bdev);
+		return -EOPNOTSUPP;
+	}
 	return blk_crypto_fallback_start_using_mode(key->crypto_cfg.crypto_mode);
 }
···
 		pr_warn_ratelimited("%pg: error %d evicting key\n", bdev, err);
 }
 EXPORT_SYMBOL_GPL(blk_crypto_evict_key);
+
+static int blk_crypto_ioctl_import_key(struct blk_crypto_profile *profile,
+				       void __user *argp)
+{
+	struct blk_crypto_import_key_arg arg;
+	u8 raw_key[BLK_CRYPTO_MAX_RAW_KEY_SIZE];
+	u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE];
+	int ret;
+
+	if (copy_from_user(&arg, argp, sizeof(arg)))
+		return -EFAULT;
+
+	if (memchr_inv(arg.reserved, 0, sizeof(arg.reserved)))
+		return -EINVAL;
+
+	if (arg.raw_key_size < 16 || arg.raw_key_size > sizeof(raw_key))
+		return -EINVAL;
+
+	if (copy_from_user(raw_key, u64_to_user_ptr(arg.raw_key_ptr),
+			   arg.raw_key_size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	ret = blk_crypto_import_key(profile, raw_key, arg.raw_key_size, lt_key);
+	if (ret < 0)
+		goto out;
+	if (ret > arg.lt_key_size) {
+		ret = -EOVERFLOW;
+		goto out;
+	}
+	arg.lt_key_size = ret;
+	if (copy_to_user(u64_to_user_ptr(arg.lt_key_ptr), lt_key,
+			 arg.lt_key_size) ||
+	    copy_to_user(argp, &arg, sizeof(arg))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	memzero_explicit(raw_key, sizeof(raw_key));
+	memzero_explicit(lt_key, sizeof(lt_key));
+	return ret;
+}
+
+static int blk_crypto_ioctl_generate_key(struct blk_crypto_profile *profile,
+					 void __user *argp)
+{
+	struct blk_crypto_generate_key_arg arg;
+	u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE];
+	int ret;
+
+	if (copy_from_user(&arg, argp, sizeof(arg)))
+		return -EFAULT;
+
+	if (memchr_inv(arg.reserved, 0, sizeof(arg.reserved)))
+		return -EINVAL;
+
+	ret = blk_crypto_generate_key(profile, lt_key);
+	if (ret < 0)
+		goto out;
+	if (ret > arg.lt_key_size) {
+		ret = -EOVERFLOW;
+		goto out;
+	}
+	arg.lt_key_size = ret;
+	if (copy_to_user(u64_to_user_ptr(arg.lt_key_ptr), lt_key,
+			 arg.lt_key_size) ||
+	    copy_to_user(argp, &arg, sizeof(arg))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	memzero_explicit(lt_key, sizeof(lt_key));
+	return ret;
+}
+
+static int blk_crypto_ioctl_prepare_key(struct blk_crypto_profile *profile,
+					void __user *argp)
+{
+	struct blk_crypto_prepare_key_arg arg;
+	u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE];
+	u8 eph_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE];
+	int ret;
+
+	if (copy_from_user(&arg, argp, sizeof(arg)))
+		return -EFAULT;
+
+	if (memchr_inv(arg.reserved, 0, sizeof(arg.reserved)))
+		return -EINVAL;
+
+	if (arg.lt_key_size > sizeof(lt_key))
+		return -EINVAL;
+
+	if (copy_from_user(lt_key, u64_to_user_ptr(arg.lt_key_ptr),
+			   arg.lt_key_size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	ret = blk_crypto_prepare_key(profile, lt_key, arg.lt_key_size, eph_key);
+	if (ret < 0)
+		goto out;
+	if (ret > arg.eph_key_size) {
+		ret = -EOVERFLOW;
+		goto out;
+	}
+	arg.eph_key_size = ret;
+	if (copy_to_user(u64_to_user_ptr(arg.eph_key_ptr), eph_key,
+			 arg.eph_key_size) ||
+	    copy_to_user(argp, &arg, sizeof(arg))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	memzero_explicit(lt_key, sizeof(lt_key));
+	memzero_explicit(eph_key, sizeof(eph_key));
+	return ret;
+}
+
+int blk_crypto_ioctl(struct block_device *bdev, unsigned int cmd,
+		     void __user *argp)
+{
+	struct blk_crypto_profile *profile =
+		bdev_get_queue(bdev)->crypto_profile;
+
+	if (!profile)
+		return -EOPNOTSUPP;
+
+	switch (cmd) {
+	case BLKCRYPTOIMPORTKEY:
+		return blk_crypto_ioctl_import_key(profile, argp);
+	case BLKCRYPTOGENERATEKEY:
+		return blk_crypto_ioctl_generate_key(profile, argp);
+	case BLKCRYPTOPREPAREKEY:
+		return blk_crypto_ioctl_prepare_key(profile, argp);
+	default:
+		return -ENOTTY;
+	}
+}
```
block/blk-flush.c (+5 -5)

```diff
···
 			struct blk_flush_queue *fq, blk_opf_t flags);
 
 static inline struct blk_flush_queue *
-blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
+blk_get_flush_queue(struct blk_mq_ctx *ctx)
 {
-	return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
+	return blk_mq_map_queue(REQ_OP_FLUSH, ctx)->fq;
 }
 
 static unsigned int blk_flush_cur_seq(struct request *rq)
···
 	struct list_head *running;
 	struct request *rq, *n;
 	unsigned long flags = 0;
-	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
+	struct blk_flush_queue *fq = blk_get_flush_queue(flush_rq->mq_ctx);
 
 	/* release the tag's ownership to the req cloned from */
 	spin_lock_irqsave(&fq->mq_flush_lock, flags);
···
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	unsigned long flags;
-	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
+	struct blk_flush_queue *fq = blk_get_flush_queue(ctx);
 
 	if (q->elevator) {
 		WARN_ON(rq->tag < 0);
···
 bool blk_insert_flush(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
+	struct blk_flush_queue *fq = blk_get_flush_queue(rq->mq_ctx);
 	bool supports_fua = q->limits.features & BLK_FEAT_FUA;
 	unsigned int policy = 0;
```
block/blk-iocost.c (+8 -12)

```diff
···
 	 * All waiters are on iocg->waitq and the wait states are
 	 * synchronized using waitq.lock.
 	 */
-	init_waitqueue_func_entry(&wait.wait, iocg_wake_fn);
-	wait.wait.private = current;
+	init_wait_func(&wait.wait, iocg_wake_fn);
 	wait.bio = bio;
 	wait.abs_cost = abs_cost;
 	wait.committed = false;	/* will be set true by waker */
···
 	u32 qos[NR_QOS_PARAMS];
 	bool enable, user;
 	char *body, *p;
-	unsigned int memflags;
+	unsigned long memflags;
 	int ret;
 
 	blkg_conf_init(&ctx, input);
 
-	ret = blkg_conf_open_bdev(&ctx);
-	if (ret)
+	memflags = blkg_conf_open_bdev_frozen(&ctx);
+	if (IS_ERR_VALUE(memflags)) {
+		ret = memflags;
 		goto err;
+	}
 
 	body = ctx.body;
 	disk = ctx.bdev->bd_disk;
···
 		ioc = q_to_ioc(disk->queue);
 	}
 
-	memflags = blk_mq_freeze_queue(disk->queue);
 	blk_mq_quiesce_queue(disk->queue);
 
 	spin_lock_irq(&ioc->lock);
···
 		wbt_enable_default(disk);
 
 	blk_mq_unquiesce_queue(disk->queue);
-	blk_mq_unfreeze_queue(disk->queue, memflags);
 
-	blkg_conf_exit(&ctx);
+	blkg_conf_exit_frozen(&ctx, memflags);
 	return nbytes;
 einval:
 	spin_unlock_irq(&ioc->lock);
-
 	blk_mq_unquiesce_queue(disk->queue);
-	blk_mq_unfreeze_queue(disk->queue, memflags);
-
 	ret = -EINVAL;
 err:
-	blkg_conf_exit(&ctx);
+	blkg_conf_exit_frozen(&ctx, memflags);
 	return ret;
 }
```
block/blk-merge.c (+2 -2)

```diff
···
 * Map a request to scatterlist, return number of sg entries setup. Caller
 * must make sure sg can hold rq->nr_phys_segments entries.
 */
-int __blk_rq_map_sg(struct request_queue *q, struct request *rq,
-		struct scatterlist *sglist, struct scatterlist **last_sg)
+int __blk_rq_map_sg(struct request *rq, struct scatterlist *sglist,
+		struct scatterlist **last_sg)
 {
 	struct req_iterator iter = {
 		.bio	= rq->bio,
```
block/blk-mq-debugfs.c (+21 -20)

```diff
···
 {
 	struct blk_mq_hw_ctx *hctx = data;
 	struct show_busy_params params = { .m = m, .hctx = hctx };
+	int res;
 
+	res = mutex_lock_interruptible(&hctx->queue->elevator_lock);
+	if (res)
+		return res;
 	blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy_rq,
 				&params);
+	mutex_unlock(&hctx->queue->elevator_lock);
 
 	return 0;
 }
···
 	struct request_queue *q = hctx->queue;
 	int res;
 
-	res = mutex_lock_interruptible(&q->sysfs_lock);
+	res = mutex_lock_interruptible(&q->elevator_lock);
 	if (res)
-		goto out;
+		return res;
 	if (hctx->tags)
 		blk_mq_debugfs_tags_show(m, hctx->tags);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 
-out:
-	return res;
+	return 0;
 }
 
 static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
···
 	struct request_queue *q = hctx->queue;
 	int res;
 
-	res = mutex_lock_interruptible(&q->sysfs_lock);
+	res = mutex_lock_interruptible(&q->elevator_lock);
 	if (res)
-		goto out;
+		return res;
 	if (hctx->tags)
 		sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 
-out:
-	return res;
+	return 0;
 }
 
 static int hctx_sched_tags_show(void *data, struct seq_file *m)
···
 	struct request_queue *q = hctx->queue;
 	int res;
 
-	res = mutex_lock_interruptible(&q->sysfs_lock);
+	res = mutex_lock_interruptible(&q->elevator_lock);
 	if (res)
-		goto out;
+		return res;
 	if (hctx->sched_tags)
 		blk_mq_debugfs_tags_show(m, hctx->sched_tags);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 
-out:
-	return res;
+	return 0;
 }
 
 static int
 hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
···
 	struct request_queue *q = hctx->queue;
 	int res;
 
-	res = mutex_lock_interruptible(&q->sysfs_lock);
+	res = mutex_lock_interruptible(&q->elevator_lock);
 	if (res)
-		goto out;
+		return res;
 	if (hctx->sched_tags)
 		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 
-out:
-	return res;
+	return 0;
 }
 
 static int hctx_active_show(void *data, struct seq_file *m)
```
block/blk-mq-sched.c (+1 -1)

```diff
···
 	}
 
 	ctx = blk_mq_get_ctx(q);
-	hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
+	hctx = blk_mq_map_queue(bio->bi_opf, ctx);
 	type = hctx->type;
 	if (list_empty_careful(&ctx->rq_lists[type]))
 		goto out_put;
```
block/blk-mq-sysfs.c (+2 -2)

```diff
···
 	if (!entry->show)
 		return -EIO;
 
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->elevator_lock);
 	res = entry->show(hctx, page);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 	return res;
 }
```
block/blk-mq-tag.c (+1 -2)

```diff
···
 		sbitmap_finish_wait(bt, ws, &wait);
 
 		data->ctx = blk_mq_get_ctx(data->q);
-		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
-					      data->ctx);
+		data->hctx = blk_mq_map_queue(data->cmd_flags, data->ctx);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
 			bt = &tags->breserved_tags;
```
block/blk-mq.c (+13 -9)

```diff
···
 
 retry:
 	data->ctx = blk_mq_get_ctx(q);
-	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
+	data->hctx = blk_mq_map_queue(data->cmd_flags, data->ctx);
 
 	if (q->elevator) {
 		/*
···
 		rq->special_vec = rq_src->special_vec;
 	}
 	rq->nr_phys_segments = rq_src->nr_phys_segments;
+	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
 
 	if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
 		goto free_and_out;
···
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
+	mutex_lock(&q->elevator_lock);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
···
 		hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
+
+	mutex_unlock(&q->elevator_lock);
 }
 
 /*
···
 	unsigned long i, j;
 
 	/* protect against switching io scheduler */
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->elevator_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int old_node;
 		int node = blk_mq_get_hctx_node(set, i);
···
 
 	xa_for_each_start(&q->hctx_table, j, hctx, j)
 		blk_mq_exit_hctx(q, set, hctx, j);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 
 	/* unregister cpuhp callbacks for exited hctxs */
 	blk_mq_remove_hw_queues_cpuhp(q);
···
 	if (!qe)
 		return false;
 
-	/* q->elevator needs protection from ->sysfs_lock */
-	mutex_lock(&q->sysfs_lock);
+	/* Accessing q->elevator needs protection from ->elevator_lock. */
+	mutex_lock(&q->elevator_lock);
 
-	/* the check has to be done with holding sysfs_lock */
 	if (!q->elevator) {
 		kfree(qe);
 		goto unlock;
···
 	list_add(&qe->node, head);
 	elevator_disable(q);
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 
 	return true;
 }
···
 	list_del(&qe->node);
 	kfree(qe);
 
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->elevator_lock);
 	elevator_switch(q, t);
 	/* drop the reference acquired in blk_mq_elv_switch_none */
 	elevator_put(t);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->elevator_lock);
 }
 
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
```
block/blk-mq.h (+1 -3)

```diff
···
 
 /*
  * blk_mq_map_queue() - map (cmd_flags,type) to hardware queue
- * @q: request queue
  * @opf: operation type (REQ_OP_*) and flags (e.g. REQ_POLLED).
  * @ctx: software queue cpu ctx
  */
-static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
-						     blk_opf_t opf,
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue(blk_opf_t opf,
 						     struct blk_mq_ctx *ctx)
 {
 	return ctx->hctxs[blk_mq_get_hctx_type(opf)];
```
block/blk-rq-qos.c (+54 -28)

```diff
···
 
 struct rq_qos_wait_data {
 	struct wait_queue_entry wq;
-	struct task_struct *task;
 	struct rq_wait *rqw;
 	acquire_inflight_cb_t *cb;
 	void *private_data;
···
 		return -1;
 
 	data->got_token = true;
-	wake_up_process(data->task);
+	/*
+	 * autoremove_wake_function() removes the wait entry only when it
+	 * actually changed the task state.  We want the wait always removed.
+	 * Remove explicitly and use default_wake_function().
+	 */
+	default_wake_function(curr, mode, wake_flags, key);
+	/*
+	 * Note that the order of operations is important as finish_wait()
+	 * tests whether @curr is removed without grabbing the lock.  This
+	 * should be the last thing to do to make sure we will not have a
+	 * UAF access to @data.  And the semantics of memory barrier in it
+	 * also make sure the waiter will see the latest @data->got_token
+	 * once list_empty_careful() in finish_wait() returns true.
+	 */
 	list_del_init_careful(&curr->entry);
 	return 1;
 }
···
 		 cleanup_cb_t *cleanup_cb)
 {
 	struct rq_qos_wait_data data = {
-		.wq = {
-			.func	= rq_qos_wake_function,
-			.entry	= LIST_HEAD_INIT(data.wq.entry),
-		},
-		.task = current,
-		.rqw = rqw,
-		.cb = acquire_inflight_cb,
-		.private_data = private_data,
+		.rqw		= rqw,
+		.cb		= acquire_inflight_cb,
+		.private_data	= private_data,
+		.got_token	= false,
 	};
-	bool has_sleeper;
+	bool first_waiter;
 
-	has_sleeper = wq_has_sleeper(&rqw->wait);
-	if (!has_sleeper && acquire_inflight_cb(rqw, private_data))
+	/*
+	 * If there are no waiters in the waiting queue, try to increase the
+	 * inflight counter if we can.  Otherwise, prepare for adding ourselves
+	 * to the waiting queue.
+	 */
+	if (!waitqueue_active(&rqw->wait) && acquire_inflight_cb(rqw, private_data))
 		return;
 
-	has_sleeper = !prepare_to_wait_exclusive(&rqw->wait, &data.wq,
+	init_wait_func(&data.wq, rq_qos_wake_function);
+	first_waiter = prepare_to_wait_exclusive(&rqw->wait, &data.wq,
 						 TASK_UNINTERRUPTIBLE);
+	/*
+	 * Make sure there is at least one inflight process; otherwise, waiters
+	 * will never be woken up.  Since there may be no inflight process
+	 * before adding ourselves to the waiting queue above, we need to try
+	 * to increase the inflight counter for ourselves.  And it is
+	 * sufficient to guarantee that at least the first waiter to enter the
+	 * waiting queue will re-check the waiting condition before going to
+	 * sleep, thus ensuring forward progress.
+	 */
+	if (!data.got_token && first_waiter && acquire_inflight_cb(rqw, private_data)) {
+		finish_wait(&rqw->wait, &data.wq);
+		/*
+		 * We raced with rq_qos_wake_function() getting a token,
+		 * which means we now have two.  Put our local token
+		 * and wake anyone else potentially waiting for one.
+		 *
+		 * Enough memory barrier in list_empty_careful() in
+		 * finish_wait() is paired with list_del_init_careful()
+		 * in rq_qos_wake_function() to make sure we will see
+		 * the latest @data->got_token.
+		 */
+		if (data.got_token)
+			cleanup_cb(rqw, private_data);
+		return;
+	}
+
+	/* we are now relying on the waker to increase our inflight counter. */
 	do {
-		/* The memory barrier in set_current_state saves us here. */
 		if (data.got_token)
 			break;
-		if (!has_sleeper && acquire_inflight_cb(rqw, private_data)) {
-			finish_wait(&rqw->wait, &data.wq);
-
-			/*
-			 * We raced with rq_qos_wake_function() getting a token,
-			 * which means we now have two.  Put our local token
-			 * and wake anyone else potentially waiting for one.
-			 */
-			if (data.got_token)
-				cleanup_cb(rqw, private_data);
-			return;
-		}
 		io_schedule();
-		has_sleeper = true;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 	} while (1);
 	finish_wait(&rqw->wait, &data.wq);
```
block/blk-settings.c (+28 -30)

```diff
···
 
 void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout)
 {
-	q->rq_timeout = timeout;
+	WRITE_ONCE(q->rq_timeout, timeout);
 }
 EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
 
···
 			pr_warn("invalid PI settings.\n");
 			return -EINVAL;
 		}
+		bi->flags |= BLK_INTEGRITY_NOGENERATE | BLK_INTEGRITY_NOVERIFY;
 		return 0;
+	}
+
+	if (lim->features & BLK_FEAT_BOUNCE_HIGH) {
+		pr_warn("no bounce buffer support for integrity metadata\n");
+		return -EINVAL;
 	}
 
 	if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) {
···
 	if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY))
 		return true;
 
-	if (!ti->tuple_size) {
-		/* inherit the settings from the first underlying device */
-		if (!(ti->flags & BLK_INTEGRITY_STACKED)) {
-			ti->flags = BLK_INTEGRITY_DEVICE_CAPABLE |
-				(bi->flags & BLK_INTEGRITY_REF_TAG);
-			ti->csum_type = bi->csum_type;
-			ti->tuple_size = bi->tuple_size;
-			ti->pi_offset = bi->pi_offset;
-			ti->interval_exp = bi->interval_exp;
-			ti->tag_size = bi->tag_size;
-			goto done;
-		}
-		if (!bi->tuple_size)
-			goto done;
+	if (ti->flags & BLK_INTEGRITY_STACKED) {
+		if (ti->tuple_size != bi->tuple_size)
+			goto incompatible;
+		if (ti->interval_exp != bi->interval_exp)
+			goto incompatible;
+		if (ti->tag_size != bi->tag_size)
+			goto incompatible;
+		if (ti->csum_type != bi->csum_type)
+			goto incompatible;
+		if ((ti->flags & BLK_INTEGRITY_REF_TAG) !=
+		    (bi->flags & BLK_INTEGRITY_REF_TAG))
+			goto incompatible;
+	} else {
+		ti->flags = BLK_INTEGRITY_STACKED;
+		ti->flags |= (bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE) |
+			     (bi->flags & BLK_INTEGRITY_REF_TAG);
+		ti->csum_type = bi->csum_type;
+		ti->tuple_size = bi->tuple_size;
+		ti->pi_offset = bi->pi_offset;
+		ti->interval_exp = bi->interval_exp;
+		ti->tag_size = bi->tag_size;
 	}
-
-	if (ti->tuple_size != bi->tuple_size)
-		goto incompatible;
-	if (ti->interval_exp != bi->interval_exp)
-		goto incompatible;
-	if (ti->tag_size != bi->tag_size)
-		goto incompatible;
-	if (ti->csum_type != bi->csum_type)
-		goto incompatible;
-	if ((ti->flags & BLK_INTEGRITY_REF_TAG) !=
-	    (bi->flags & BLK_INTEGRITY_REF_TAG))
-		goto incompatible;
-
-done:
-	ti->flags |= BLK_INTEGRITY_STACKED;
 	return true;
 
 incompatible:
```
block/blk-sysfs.c (+195 -107)

```diff
···
 struct queue_sysfs_entry {
 	struct attribute attr;
 	ssize_t (*show)(struct gendisk *disk, char *page);
+	ssize_t (*show_limit)(struct gendisk *disk, char *page);
+
 	ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
 	int (*store_limit)(struct gendisk *disk, const char *page,
 			size_t count, struct queue_limits *lim);
-	void (*load_module)(struct gendisk *disk, const char *page, size_t count);
 };
 
 static ssize_t
···
 
 static ssize_t queue_requests_show(struct gendisk *disk, char *page)
 {
-	return queue_var_show(disk->queue->nr_requests, page);
+	ssize_t ret;
+
+	mutex_lock(&disk->queue->elevator_lock);
+	ret = queue_var_show(disk->queue->nr_requests, page);
+	mutex_unlock(&disk->queue->elevator_lock);
+	return ret;
 }
 
 static ssize_t
···
 {
 	unsigned long nr;
 	int ret, err;
+	unsigned int memflags;
+	struct request_queue *q = disk->queue;
 
-	if (!queue_is_mq(disk->queue))
+	if (!queue_is_mq(q))
 		return -EINVAL;
 
 	ret = queue_var_store(&nr, page, count);
 	if (ret < 0)
 		return ret;
 
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	err = blk_mq_update_nr_requests(disk->queue, nr);
 	if (err)
-		return err;
-
+		ret = err;
+	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
 }
 
 static ssize_t queue_ra_show(struct gendisk *disk, char *page)
 {
-	return queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page);
+	ssize_t ret;
+
+	mutex_lock(&disk->queue->limits_lock);
+	ret = queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page);
+	mutex_unlock(&disk->queue->limits_lock);
+
+	return ret;
 }
 
 static ssize_t
···
 {
 	unsigned long ra_kb;
 	ssize_t ret;
+	unsigned int memflags;
+	struct request_queue *q = disk->queue;
 
 	ret = queue_var_store(&ra_kb, page, count);
 	if (ret < 0)
 		return ret;
+	/*
+	 * ->ra_pages is protected by ->limits_lock because it is usually
+	 * calculated from the queue limits by queue_limits_commit_update.
+	 */
+	mutex_lock(&q->limits_lock);
+	memflags = blk_mq_freeze_queue(q);
 	disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
+	mutex_unlock(&q->limits_lock);
+	blk_mq_unfreeze_queue(q, memflags);
+
 	return ret;
 }
···
 {
 	if (queue_is_mq(disk->queue))
 		return sysfs_emit(page, "%u\n", blk_mq_can_poll(disk->queue));
+
 	return sysfs_emit(page, "%u\n",
-			!!(disk->queue->limits.features & BLK_FEAT_POLL));
+			  !!(disk->queue->limits.features & BLK_FEAT_POLL));
 }
 
 static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
···
 		size_t count)
 {
 	unsigned long nm;
+	unsigned int memflags;
+	struct request_queue *q = disk->queue;
 	ssize_t ret = queue_var_store(&nm, page, count);
 
 	if (ret < 0)
 		return ret;
 
-	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, disk->queue);
-	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, disk->queue);
+	memflags = blk_mq_freeze_queue(q);
+	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
+	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);
 	if (nm == 2)
-		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, disk->queue);
+		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	else if (nm)
-		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, disk->queue);
+		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
+	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
 }
···
 #ifdef CONFIG_SMP
 	struct request_queue *q = disk->queue;
 	unsigned long val;
+	unsigned int memflags;
 
 	ret = queue_var_store(&val, page, count);
 	if (ret < 0)
 		return ret;
 
+	/*
+	 * Here we update two queue flags each using atomic bitops, although
+	 * updating two flags isn't atomic it should be harmless as those flags
+	 * are accessed individually using atomic test_bit operation. So we
+	 * don't grab any lock while updating these flags.
+	 */
+	memflags = blk_mq_freeze_queue(q);
 	if (val == 2) {
 		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
···
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
 	}
+	blk_mq_unfreeze_queue(q, memflags);
 #endif
 	return ret;
 }
···
 static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
 				size_t count)
 {
-	if (!(disk->queue->limits.features & BLK_FEAT_POLL))
-		return -EINVAL;
+	unsigned int memflags;
+	ssize_t ret = count;
+	struct request_queue *q = disk->queue;
+
+	memflags = blk_mq_freeze_queue(q);
+	if (!(q->limits.features & BLK_FEAT_POLL)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	pr_info_ratelimited("writes to the poll attribute are ignored.\n");
 	pr_info_ratelimited("please use driver specific parameters instead.\n");
-	return count;
+out:
+	blk_mq_unfreeze_queue(q, memflags);
+	return ret;
 }
 
 static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
 {
-	return sysfs_emit(page, "%u\n", jiffies_to_msecs(disk->queue->rq_timeout));
+	return sysfs_emit(page, "%u\n",
+			jiffies_to_msecs(READ_ONCE(disk->queue->rq_timeout)));
 }
 
 static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
 				  size_t count)
 {
-	unsigned int val;
+	unsigned int val, memflags;
 	int err;
+	struct request_queue *q = disk->queue;
 
 	err = kstrtou32(page, 10, &val);
 	if (err || val == 0)
 		return -EINVAL;
 
-	blk_queue_rq_timeout(disk->queue, msecs_to_jiffies(val));
+	memflags = blk_mq_freeze_queue(q);
+	blk_queue_rq_timeout(q, msecs_to_jiffies(val));
+	blk_mq_unfreeze_queue(q, memflags);
 
 	return count;
 }
···
 	.store = _prefix##_store,			\
 };
 
+#define QUEUE_LIM_RO_ENTRY(_prefix, _name)		\
+static struct queue_sysfs_entry _prefix##_entry = {	\
+	.attr = { .name = _name, .mode = 0444 },	\
+	.show_limit = _prefix##_show,			\
+}
+
 #define QUEUE_LIM_RW_ENTRY(_prefix, _name)		\
 static struct queue_sysfs_entry _prefix##_entry = {	\
 	.attr = { .name = _name, .mode = 0644 },	\
-	.show = _prefix##_show,				\
+	.show_limit = _prefix##_show,			\
 	.store_limit = _prefix##_store,			\
-}
-
-#define QUEUE_RW_LOAD_MODULE_ENTRY(_prefix, _name)	\
-static struct queue_sysfs_entry _prefix##_entry = {	\
-	.attr = { .name = _name, .mode = 0644 },	\
-	.show = _prefix##_show,				\
-	.load_module = _prefix##_load_module,		\
-	.store = _prefix##_store,			\
 }
 
 QUEUE_RW_ENTRY(queue_requests, "nr_requests");
 QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
 QUEUE_LIM_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
-QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
-QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
-QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
-QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
-QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
+QUEUE_LIM_RO_ENTRY(queue_max_segments, "max_segments");
+QUEUE_LIM_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
+QUEUE_LIM_RO_ENTRY(queue_max_segment_size,
```
"max_segment_size"); 435 + QUEUE_RW_ENTRY(elv_iosched, "scheduler"); 494 436 495 - QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size"); 496 - QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size"); 497 - QUEUE_RO_ENTRY(queue_chunk_sectors, "chunk_sectors"); 498 - QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size"); 499 - QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size"); 437 + QUEUE_LIM_RO_ENTRY(queue_logical_block_size, "logical_block_size"); 438 + QUEUE_LIM_RO_ENTRY(queue_physical_block_size, "physical_block_size"); 439 + QUEUE_LIM_RO_ENTRY(queue_chunk_sectors, "chunk_sectors"); 440 + QUEUE_LIM_RO_ENTRY(queue_io_min, "minimum_io_size"); 441 + QUEUE_LIM_RO_ENTRY(queue_io_opt, "optimal_io_size"); 500 442 501 - QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments"); 502 - QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity"); 503 - QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes"); 443 + QUEUE_LIM_RO_ENTRY(queue_max_discard_segments, "max_discard_segments"); 444 + QUEUE_LIM_RO_ENTRY(queue_discard_granularity, "discard_granularity"); 445 + QUEUE_LIM_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes"); 504 446 QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes"); 505 447 QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data"); 506 448 507 - QUEUE_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes"); 508 - QUEUE_RO_ENTRY(queue_atomic_write_boundary_sectors, 449 + QUEUE_LIM_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes"); 450 + QUEUE_LIM_RO_ENTRY(queue_atomic_write_boundary_sectors, 509 451 "atomic_write_boundary_bytes"); 510 - QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes"); 511 - QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes"); 452 + QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes"); 453 + QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, 
"atomic_write_unit_min_bytes"); 512 454 513 455 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes"); 514 - QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes"); 515 - QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes"); 516 - QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity"); 456 + QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes"); 457 + QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes"); 458 + QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity"); 517 459 518 - QUEUE_RO_ENTRY(queue_zoned, "zoned"); 460 + QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned"); 519 461 QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones"); 520 - QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones"); 521 - QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones"); 462 + QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones"); 463 + QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones"); 522 464 523 465 QUEUE_RW_ENTRY(queue_nomerges, "nomerges"); 524 466 QUEUE_LIM_RW_ENTRY(queue_iostats_passthrough, "iostats_passthrough"); ··· 524 470 QUEUE_RW_ENTRY(queue_poll, "io_poll"); 525 471 QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay"); 526 472 QUEUE_LIM_RW_ENTRY(queue_wc, "write_cache"); 527 - QUEUE_RO_ENTRY(queue_fua, "fua"); 528 - QUEUE_RO_ENTRY(queue_dax, "dax"); 473 + QUEUE_LIM_RO_ENTRY(queue_fua, "fua"); 474 + QUEUE_LIM_RO_ENTRY(queue_dax, "dax"); 529 475 QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout"); 530 - QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask"); 531 - QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment"); 476 + QUEUE_LIM_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask"); 477 + QUEUE_LIM_RO_ENTRY(queue_dma_alignment, "dma_alignment"); 532 478 533 479 /* legacy alias for logical_block_size: */ 534 480 static struct queue_sysfs_entry queue_hw_sector_size_entry = { 535 - .attr = {.name = "hw_sector_size", .mode = 0444 }, 536 - 
.show = queue_logical_block_size_show, 481 + .attr = {.name = "hw_sector_size", .mode = 0444 }, 482 + .show_limit = queue_logical_block_size_show, 537 483 }; 538 484 539 485 QUEUE_LIM_RW_ENTRY(queue_rotational, "rotational"); ··· 557 503 558 504 static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page) 559 505 { 560 - if (!wbt_rq_qos(disk->queue)) 561 - return -EINVAL; 506 + ssize_t ret; 507 + struct request_queue *q = disk->queue; 562 508 563 - if (wbt_disabled(disk->queue)) 564 - return sysfs_emit(page, "0\n"); 509 + mutex_lock(&q->elevator_lock); 510 + if (!wbt_rq_qos(q)) { 511 + ret = -EINVAL; 512 + goto out; 513 + } 565 514 566 - return sysfs_emit(page, "%llu\n", 567 - div_u64(wbt_get_min_lat(disk->queue), 1000)); 515 + if (wbt_disabled(q)) { 516 + ret = sysfs_emit(page, "0\n"); 517 + goto out; 518 + } 519 + 520 + ret = sysfs_emit(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000)); 521 + out: 522 + mutex_unlock(&q->elevator_lock); 523 + return ret; 568 524 } 569 525 570 526 static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page, ··· 584 520 struct rq_qos *rqos; 585 521 ssize_t ret; 586 522 s64 val; 523 + unsigned int memflags; 587 524 588 525 ret = queue_var_store64(&val, page); 589 526 if (ret < 0) ··· 592 527 if (val < -1) 593 528 return -EINVAL; 594 529 530 + memflags = blk_mq_freeze_queue(q); 531 + mutex_lock(&q->elevator_lock); 532 + 595 533 rqos = wbt_rq_qos(q); 596 534 if (!rqos) { 597 535 ret = wbt_init(disk); 598 536 if (ret) 599 - return ret; 537 + goto out; 600 538 } 601 539 540 + ret = count; 602 541 if (val == -1) 603 542 val = wbt_default_latency_nsec(q); 604 543 else if (val >= 0) 605 544 val *= 1000ULL; 606 545 607 546 if (wbt_get_min_lat(q) == val) 608 - return count; 547 + goto out; 609 548 610 549 /* 611 550 * Ensure that the queue is idled, in case the latency update ··· 621 552 wbt_set_min_lat(q, val); 622 553 623 554 blk_mq_unquiesce_queue(q); 555 + out: 556 + mutex_unlock(&q->elevator_lock); 557 + 
blk_mq_unfreeze_queue(q, memflags); 624 558 625 - return count; 559 + return ret; 626 560 } 627 561 628 562 QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec"); ··· 633 561 634 562 /* Common attributes for bio-based and request-based queues. */ 635 563 static struct attribute *queue_attrs[] = { 636 - &queue_ra_entry.attr, 564 + /* 565 + * Attributes which are protected with q->limits_lock. 566 + */ 637 567 &queue_max_hw_sectors_entry.attr, 638 568 &queue_max_sectors_entry.attr, 639 569 &queue_max_segments_entry.attr, ··· 651 577 &queue_discard_granularity_entry.attr, 652 578 &queue_max_discard_sectors_entry.attr, 653 579 &queue_max_hw_discard_sectors_entry.attr, 654 - &queue_discard_zeroes_data_entry.attr, 655 580 &queue_atomic_write_max_sectors_entry.attr, 656 581 &queue_atomic_write_boundary_sectors_entry.attr, 657 582 &queue_atomic_write_unit_min_entry.attr, 658 583 &queue_atomic_write_unit_max_entry.attr, 659 - &queue_write_same_max_entry.attr, 660 584 &queue_max_write_zeroes_sectors_entry.attr, 661 585 &queue_max_zone_append_sectors_entry.attr, 662 586 &queue_zone_write_granularity_entry.attr, 663 587 &queue_rotational_entry.attr, 664 588 &queue_zoned_entry.attr, 665 - &queue_nr_zones_entry.attr, 666 589 &queue_max_open_zones_entry.attr, 667 590 &queue_max_active_zones_entry.attr, 668 - &queue_nomerges_entry.attr, 669 591 &queue_iostats_passthrough_entry.attr, 670 592 &queue_iostats_entry.attr, 671 593 &queue_stable_writes_entry.attr, 672 594 &queue_add_random_entry.attr, 673 - &queue_poll_entry.attr, 674 595 &queue_wc_entry.attr, 675 596 &queue_fua_entry.attr, 676 597 &queue_dax_entry.attr, 677 - &queue_poll_delay_entry.attr, 678 598 &queue_virt_boundary_mask_entry.attr, 679 599 &queue_dma_alignment_entry.attr, 600 + &queue_ra_entry.attr, 601 + 602 + /* 603 + * Attributes which don't require locking. 
604 + */ 605 + &queue_discard_zeroes_data_entry.attr, 606 + &queue_write_same_max_entry.attr, 607 + &queue_nr_zones_entry.attr, 608 + &queue_nomerges_entry.attr, 609 + &queue_poll_entry.attr, 610 + &queue_poll_delay_entry.attr, 611 + 680 612 NULL, 681 613 }; 682 614 683 615 /* Request-based queue attributes that are not relevant for bio-based queues. */ 684 616 static struct attribute *blk_mq_queue_attrs[] = { 685 - &queue_requests_entry.attr, 617 + /* 618 + * Attributes which require some form of locking other than 619 + * q->sysfs_lock. 620 + */ 686 621 &elv_iosched_entry.attr, 687 - &queue_rq_affinity_entry.attr, 688 - &queue_io_timeout_entry.attr, 622 + &queue_requests_entry.attr, 689 623 #ifdef CONFIG_BLK_WBT 690 624 &queue_wb_lat_entry.attr, 691 625 #endif 626 + /* 627 + * Attributes which don't require locking. 628 + */ 629 + &queue_rq_affinity_entry.attr, 630 + &queue_io_timeout_entry.attr, 631 + 692 632 NULL, 693 633 }; 694 634 ··· 752 664 { 753 665 struct queue_sysfs_entry *entry = to_queue(attr); 754 666 struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 755 - ssize_t res; 756 667 757 - if (!entry->show) 668 + if (!entry->show && !entry->show_limit) 758 669 return -EIO; 759 - mutex_lock(&disk->queue->sysfs_lock); 760 - res = entry->show(disk, page); 761 - mutex_unlock(&disk->queue->sysfs_lock); 762 - return res; 670 + 671 + if (entry->show_limit) { 672 + ssize_t res; 673 + 674 + mutex_lock(&disk->queue->limits_lock); 675 + res = entry->show_limit(disk, page); 676 + mutex_unlock(&disk->queue->limits_lock); 677 + return res; 678 + } 679 + 680 + return entry->show(disk, page); 763 681 } 764 682 765 683 static ssize_t ··· 775 681 struct queue_sysfs_entry *entry = to_queue(attr); 776 682 struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 777 683 struct request_queue *q = disk->queue; 778 - unsigned int memflags; 779 - ssize_t res; 780 684 781 685 if (!entry->store_limit && !entry->store) 782 686 return -EIO; 783 687 784 
- /* 785 - * If the attribute needs to load a module, do it before freezing the 786 - * queue to ensure that the module file can be read when the request 787 - * queue is the one for the device storing the module file. 788 - */ 789 - if (entry->load_module) 790 - entry->load_module(disk, page, length); 791 - 792 688 if (entry->store_limit) { 689 + ssize_t res; 690 + 793 691 struct queue_limits lim = queue_limits_start_update(q); 794 692 795 693 res = entry->store_limit(disk, page, length, &lim); ··· 796 710 return length; 797 711 } 798 712 799 - mutex_lock(&q->sysfs_lock); 800 - memflags = blk_mq_freeze_queue(q); 801 - res = entry->store(disk, page, length); 802 - blk_mq_unfreeze_queue(q, memflags); 803 - mutex_unlock(&q->sysfs_lock); 804 - return res; 713 + return entry->store(disk, page, length); 805 714 } 806 715 807 716 static const struct sysfs_ops queue_sysfs_ops = { ··· 865 784 if (ret) 866 785 goto out_debugfs_remove; 867 786 868 - if (q->elevator) { 869 - ret = elv_register_queue(q, false); 870 - if (ret) 871 - goto out_unregister_ia_ranges; 872 - } 873 - 874 787 ret = blk_crypto_sysfs_register(disk); 875 788 if (ret) 876 - goto out_elv_unregister; 789 + goto out_unregister_ia_ranges; 790 + 791 + mutex_lock(&q->elevator_lock); 792 + if (q->elevator) { 793 + ret = elv_register_queue(q, false); 794 + if (ret) { 795 + mutex_unlock(&q->elevator_lock); 796 + goto out_crypto_sysfs_unregister; 797 + } 798 + } 799 + wbt_enable_default(disk); 800 + mutex_unlock(&q->elevator_lock); 877 801 878 802 blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); 879 - wbt_enable_default(disk); 880 803 881 804 /* Now everything is ready and send out KOBJ_ADD uevent */ 882 805 kobject_uevent(&disk->queue_kobj, KOBJ_ADD); ··· 902 817 903 818 return ret; 904 819 905 - out_elv_unregister: 906 - elv_unregister_queue(q); 820 + out_crypto_sysfs_unregister: 821 + blk_crypto_sysfs_unregister(disk); 907 822 out_unregister_ia_ranges: 908 823 disk_unregister_independent_access_ranges(disk); 909 824 
out_debugfs_remove: ··· 949 864 blk_mq_sysfs_unregister(disk); 950 865 blk_crypto_sysfs_unregister(disk); 951 866 952 - mutex_lock(&q->sysfs_lock); 867 + mutex_lock(&q->elevator_lock); 953 868 elv_unregister_queue(q); 869 + mutex_unlock(&q->elevator_lock); 870 + 871 + mutex_lock(&q->sysfs_lock); 954 872 disk_unregister_independent_access_ranges(disk); 955 873 mutex_unlock(&q->sysfs_lock); 956 874
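Several store paths in the blk-sysfs changes above pair a queue freeze with a narrower per-concern lock; `queue_requests_store()`, for example, freezes the queue first and then takes `elevator_lock`, unlocking in reverse order. A minimal userspace sketch of that shape, with Python locks standing in for the kernel primitives (all names here are illustrative stand-ins, not kernel APIs):

```python
import threading

BLKDEV_MIN_RQ = 4  # same clamp value the kernel path uses

class ToyQueue:
    """Toy model of a request queue with two of its locks."""
    def __init__(self):
        # stand-ins for blk_mq_freeze_queue() and q->elevator_lock
        self.freeze_lock = threading.Lock()
        self.elevator_lock = threading.Lock()
        self.nr_requests = 64

def store_nr_requests(q, nr):
    # Mirror the ordering in queue_requests_store(): freeze the
    # queue first, then take the narrower elevator lock, and unlock
    # in reverse order so every writer uses the same lock hierarchy.
    with q.freeze_lock:
        with q.elevator_lock:
            q.nr_requests = max(nr, BLKDEV_MIN_RQ)
    return q.nr_requests
```

Keeping one hierarchy per attribute family is what lets the series drop the single coarse `sysfs_lock` around most show/store handlers.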
+38 -44
block/blk-throttle.c
··· 478 478 { 479 479 tg->bytes_disp[rw] = 0; 480 480 tg->io_disp[rw] = 0; 481 - tg->carryover_bytes[rw] = 0; 482 - tg->carryover_ios[rw] = 0; 483 481 484 482 /* 485 483 * Previous slice has expired. We must have trimmed it after last ··· 496 498 } 497 499 498 500 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw, 499 - bool clear_carryover) 501 + bool clear) 500 502 { 501 - tg->bytes_disp[rw] = 0; 502 - tg->io_disp[rw] = 0; 503 + if (clear) { 504 + tg->bytes_disp[rw] = 0; 505 + tg->io_disp[rw] = 0; 506 + } 503 507 tg->slice_start[rw] = jiffies; 504 508 tg->slice_end[rw] = jiffies + tg->td->throtl_slice; 505 - if (clear_carryover) { 506 - tg->carryover_bytes[rw] = 0; 507 - tg->carryover_ios[rw] = 0; 508 - } 509 509 510 510 throtl_log(&tg->service_queue, 511 511 "[%c] new slice start=%lu end=%lu jiffies=%lu", ··· 595 599 * sooner, then we need to reduce slice_end. A high bogus slice_end 596 600 * is bad because it does not allow new slice to start. 597 601 */ 598 - 599 602 throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice); 600 603 601 604 time_elapsed = rounddown(jiffies - tg->slice_start[rw], 602 605 tg->td->throtl_slice); 603 - if (!time_elapsed) 606 + /* Don't trim slice until at least 2 slices are used */ 607 + if (time_elapsed < tg->td->throtl_slice * 2) 604 608 return; 605 609 610 + /* 611 + * The bio submission time may be a few jiffies more than the expected 612 + * waiting time, due to 'extra_bytes' can't be divided in 613 + * tg_within_bps_limit(), and also due to timer wakeup delay. In this 614 + * case, adjust slice_start will discard the extra wait time, causing 615 + * lower rate than expected. Therefore, other than the above rounddown, 616 + * one extra slice is preserved for deviation. 
617 + */ 618 + time_elapsed -= tg->td->throtl_slice; 606 619 bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw), 607 - time_elapsed) + 608 - tg->carryover_bytes[rw]; 609 - io_trim = calculate_io_allowed(tg_iops_limit(tg, rw), time_elapsed) + 610 - tg->carryover_ios[rw]; 620 + time_elapsed); 621 + io_trim = calculate_io_allowed(tg_iops_limit(tg, rw), time_elapsed); 611 622 if (bytes_trim <= 0 && io_trim <= 0) 612 623 return; 613 624 614 - tg->carryover_bytes[rw] = 0; 615 625 if ((long long)tg->bytes_disp[rw] >= bytes_trim) 616 626 tg->bytes_disp[rw] -= bytes_trim; 617 627 else 618 628 tg->bytes_disp[rw] = 0; 619 629 620 - tg->carryover_ios[rw] = 0; 621 630 if ((int)tg->io_disp[rw] >= io_trim) 622 631 tg->io_disp[rw] -= io_trim; 623 632 else ··· 637 636 jiffies); 638 637 } 639 638 640 - static void __tg_update_carryover(struct throtl_grp *tg, bool rw) 639 + static void __tg_update_carryover(struct throtl_grp *tg, bool rw, 640 + long long *bytes, int *ios) 641 641 { 642 642 unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw]; 643 643 u64 bps_limit = tg_bps_limit(tg, rw); ··· 651 649 * configuration. 
652 650 */ 653 651 if (bps_limit != U64_MAX) 654 - tg->carryover_bytes[rw] += 655 - calculate_bytes_allowed(bps_limit, jiffy_elapsed) - 652 + *bytes = calculate_bytes_allowed(bps_limit, jiffy_elapsed) - 656 653 tg->bytes_disp[rw]; 657 654 if (iops_limit != UINT_MAX) 658 - tg->carryover_ios[rw] += 659 - calculate_io_allowed(iops_limit, jiffy_elapsed) - 655 + *ios = calculate_io_allowed(iops_limit, jiffy_elapsed) - 660 656 tg->io_disp[rw]; 657 + tg->bytes_disp[rw] -= *bytes; 658 + tg->io_disp[rw] -= *ios; 661 659 } 662 660 663 661 static void tg_update_carryover(struct throtl_grp *tg) 664 662 { 663 + long long bytes[2] = {0}; 664 + int ios[2] = {0}; 665 + 665 666 if (tg->service_queue.nr_queued[READ]) 666 - __tg_update_carryover(tg, READ); 667 + __tg_update_carryover(tg, READ, &bytes[READ], &ios[READ]); 667 668 if (tg->service_queue.nr_queued[WRITE]) 668 - __tg_update_carryover(tg, WRITE); 669 + __tg_update_carryover(tg, WRITE, &bytes[WRITE], &ios[WRITE]); 669 670 670 671 /* see comments in struct throtl_grp for meaning of these fields. 
*/ 671 672 throtl_log(&tg->service_queue, "%s: %lld %lld %d %d\n", __func__, 672 - tg->carryover_bytes[READ], tg->carryover_bytes[WRITE], 673 - tg->carryover_ios[READ], tg->carryover_ios[WRITE]); 673 + bytes[READ], bytes[WRITE], ios[READ], ios[WRITE]); 674 674 } 675 675 676 676 static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio, ··· 690 686 691 687 /* Round up to the next throttle slice, wait time must be nonzero */ 692 688 jiffy_elapsed_rnd = roundup(jiffy_elapsed + 1, tg->td->throtl_slice); 693 - io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) + 694 - tg->carryover_ios[rw]; 689 + io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd); 695 690 if (io_allowed > 0 && tg->io_disp[rw] + 1 <= io_allowed) 696 691 return 0; 697 692 ··· 723 720 jiffy_elapsed_rnd = tg->td->throtl_slice; 724 721 725 722 jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice); 726 - bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd) + 727 - tg->carryover_bytes[rw]; 723 + bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd); 728 724 if (bytes_allowed > 0 && tg->bytes_disp[rw] + bio_size <= bytes_allowed) 729 725 return 0; 730 726 ··· 812 810 unsigned int bio_size = throtl_bio_data_size(bio); 813 811 814 812 /* Charge the bio to the group */ 815 - if (!bio_flagged(bio, BIO_BPS_THROTTLED)) { 813 + if (!bio_flagged(bio, BIO_BPS_THROTTLED)) 816 814 tg->bytes_disp[rw] += bio_size; 817 - tg->last_bytes_disp[rw] += bio_size; 818 - } 819 815 820 816 tg->io_disp[rw]++; 821 - tg->last_io_disp[rw]++; 822 817 } 823 818 824 819 /** ··· 1613 1614 return tg_may_dispatch(tg, bio, NULL); 1614 1615 } 1615 1616 1616 - static void tg_dispatch_in_debt(struct throtl_grp *tg, struct bio *bio, bool rw) 1617 - { 1618 - if (!bio_flagged(bio, BIO_BPS_THROTTLED)) 1619 - tg->carryover_bytes[rw] -= throtl_bio_data_size(bio); 1620 - tg->carryover_ios[rw]--; 1621 - } 1622 - 1623 1617 bool __blk_throtl_bio(struct bio 
*bio) 1624 1618 { 1625 1619 struct request_queue *q = bdev_get_queue(bio->bi_bdev); ··· 1649 1657 /* 1650 1658 * IOs which may cause priority inversions are 1651 1659 * dispatched directly, even if they're over limit. 1652 - * Debts are handled by carryover_bytes/ios while 1653 - * calculating wait time. 1660 + * 1661 + * Charge and dispatch directly, and our throttle 1662 + * control algorithm is adaptive, and extra IO bytes 1663 + * will be throttled for paying the debt 1654 1664 */ 1655 - tg_dispatch_in_debt(tg, bio, rw); 1665 + throtl_charge_bio(tg, bio); 1656 1666 } else { 1657 1667 /* if above limits, break to queue */ 1658 1668 break;
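The reworked trim arithmetic in blk-throttle above rounds the elapsed time down to whole slices, refuses to trim until at least two slices have passed, and then holds back one slice's worth of budget to absorb timer-wakeup and rounding deviation. That decision can be sketched with plain integers (hypothetical helper names; jiffies and HZ modeled directly, not the kernel's `calculate_bytes_allowed()`):

```python
def rounddown(x, y):
    # integer analogue of the kernel's rounddown() macro
    return x - (x % y)

def bytes_to_trim(bps_limit, elapsed_jiffies, throtl_slice, hz=100):
    """Model of the trim decision: only whole slices count, nothing
    is trimmed until two slices have elapsed, and one slice's budget
    is preserved as slack for deviation."""
    elapsed = rounddown(elapsed_jiffies, throtl_slice)
    if elapsed < throtl_slice * 2:
        return 0
    elapsed -= throtl_slice  # preserve one extra slice
    # analogue of calculate_bytes_allowed(bps_limit, elapsed)
    return bps_limit * elapsed // hz
```

With a 10-jiffy slice, 15 elapsed jiffies round down to one slice and trim nothing, while 25 elapsed jiffies round down to two slices and trim only one slice's worth of budget.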
+2 -5
block/blk-throttle.h
··· 102 102 unsigned int iops[2]; 103 103 104 104 /* Number of bytes dispatched in current slice */ 105 - uint64_t bytes_disp[2]; 105 + int64_t bytes_disp[2]; 106 106 /* Number of bio's dispatched in current slice */ 107 - unsigned int io_disp[2]; 108 - 109 - uint64_t last_bytes_disp[2]; 110 - unsigned int last_io_disp[2]; 107 + int io_disp[2]; 111 108 112 109 /* 113 110 * The following two fields are updated when new configuration is
+7 -10
block/blk-wbt.c
··· 136 136 RWB_MIN_WRITE_SAMPLES = 3, 137 137 138 138 /* 139 - * If we have this number of consecutive windows with not enough 140 - * information to scale up or down, scale up. 139 + * If we have this number of consecutive windows without enough 140 + * information to scale up or down, slowly return to center state 141 + * (step == 0). 141 142 */ 142 143 RWB_UNKNOWN_BUMP = 5, 143 144 }; ··· 447 446 break; 448 447 case LAT_UNKNOWN_WRITES: 449 448 /* 450 - * We started a the center step, but don't have a valid 451 - * read/write sample, but we do have writes going on. 452 - * Allow step to go negative, to increase write perf. 449 + * We don't have a valid read/write sample, but we do have 450 + * writes going on. Allow step to go negative, to increase 451 + * write performance. 453 452 */ 454 453 scale_up(rwb); 455 454 break; ··· 639 638 __wbt_done(rqos, flags); 640 639 } 641 640 642 - /* 643 - * May sleep, if we have exceeded the writeback limits. Caller can pass 644 - * in an irq held spinlock, if it holds one when calling this function. 645 - * If we do sleep, we'll release and re-grab it. 646 - */ 641 + /* May sleep, if we have exceeded the writeback limits. */ 647 642 static void wbt_wait(struct rq_qos *rqos, struct bio *bio) 648 643 { 649 644 struct rq_wb *rwb = RQWB(rqos);
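The revised `RWB_UNKNOWN_BUMP` comment above describes drifting the scaling step back toward the neutral state after enough sample-free windows, rather than always scaling up. That heuristic reduces to a small state update, sketched here as an illustrative model (not the kernel's `struct rq_wb` fields):

```python
RWB_UNKNOWN_BUMP = 5  # windows without samples before drifting

def next_scale_step(step, unknown_windows):
    """Count consecutive windows that produced no usable latency
    samples; once RWB_UNKNOWN_BUMP of them pass in a row, move the
    step one notch toward the neutral state (0) and restart the
    count."""
    if unknown_windows < RWB_UNKNOWN_BUMP:
        return step, unknown_windows + 1
    if step > 0:
        step -= 1
    elif step < 0:
        step += 1
    return step, 0
```

A neutral step stays neutral, and a scaled-up or scaled-down state converges back to center one window at a time.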
+1 -1
block/blk.h
··· 715 715 int bdev_permission(dev_t dev, blk_mode_t mode, void *holder); 716 716 717 717 void blk_integrity_generate(struct bio *bio); 718 - void blk_integrity_verify(struct bio *bio); 718 + void blk_integrity_verify_iter(struct bio *bio, struct bvec_iter *saved_iter); 719 719 void blk_integrity_prepare(struct request *rq); 720 720 void blk_integrity_complete(struct request *rq, unsigned int nr_bytes); 721 721
-2
block/bounce.c
··· 41 41 42 42 ret = bioset_init(&bounce_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS); 43 43 BUG_ON(ret); 44 - if (bioset_integrity_create(&bounce_bio_set, BIO_POOL_SIZE)) 45 - BUG_ON(1); 46 44 47 45 ret = bioset_init(&bounce_bio_split, BIO_POOL_SIZE, 0, 0); 48 46 BUG_ON(ret);
+1 -1
block/bsg-lib.c
··· 219 219 if (!buf->sg_list) 220 220 return -ENOMEM; 221 221 sg_init_table(buf->sg_list, req->nr_phys_segments); 222 - buf->sg_cnt = blk_rq_map_sg(req->q, req, buf->sg_list); 222 + buf->sg_cnt = blk_rq_map_sg(req, buf->sg_list); 223 223 buf->payload_len = blk_rq_bytes(req); 224 224 return 0; 225 225 }
+28 -15
block/elevator.c
··· 457 457 struct elevator_queue *e = q->elevator; 458 458 int error; 459 459 460 - lockdep_assert_held(&q->sysfs_lock); 460 + lockdep_assert_held(&q->elevator_lock); 461 461 462 462 error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched"); 463 463 if (!error) { ··· 481 481 { 482 482 struct elevator_queue *e = q->elevator; 483 483 484 - lockdep_assert_held(&q->sysfs_lock); 484 + lockdep_assert_held(&q->elevator_lock); 485 485 486 486 if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) { 487 487 kobject_uevent(&e->kobj, KOBJ_REMOVE); ··· 618 618 unsigned int memflags; 619 619 int ret; 620 620 621 - lockdep_assert_held(&q->sysfs_lock); 621 + lockdep_assert_held(&q->elevator_lock); 622 622 623 623 memflags = blk_mq_freeze_queue(q); 624 624 blk_mq_quiesce_queue(q); ··· 655 655 { 656 656 unsigned int memflags; 657 657 658 - lockdep_assert_held(&q->sysfs_lock); 658 + lockdep_assert_held(&q->elevator_lock); 659 659 660 660 memflags = blk_mq_freeze_queue(q); 661 661 blk_mq_quiesce_queue(q); ··· 700 700 return ret; 701 701 } 702 702 703 - void elv_iosched_load_module(struct gendisk *disk, const char *buf, 704 - size_t count) 703 + static void elv_iosched_load_module(char *elevator_name) 705 704 { 706 - char elevator_name[ELV_NAME_MAX]; 707 705 struct elevator_type *found; 708 - const char *name; 709 - 710 - strscpy(elevator_name, buf, sizeof(elevator_name)); 711 - name = strstrip(elevator_name); 712 706 713 707 spin_lock(&elv_list_lock); 714 - found = __elevator_find(name); 708 + found = __elevator_find(elevator_name); 715 709 spin_unlock(&elv_list_lock); 716 710 717 711 if (!found) 718 - request_module("%s-iosched", name); 712 + request_module("%s-iosched", elevator_name); 719 713 } 720 714 721 715 ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, 722 716 size_t count) 723 717 { 724 718 char elevator_name[ELV_NAME_MAX]; 719 + char *name; 725 720 int ret; 721 + unsigned int memflags; 722 + struct request_queue *q = disk->queue; 726 723 
724 + /* 725 + * If the attribute needs to load a module, do it before freezing the 726 + * queue to ensure that the module file can be read when the request 727 + * queue is the one for the device storing the module file. 728 + */ 727 729 strscpy(elevator_name, buf, sizeof(elevator_name)); 728 - ret = elevator_change(disk->queue, strstrip(elevator_name)); 730 + name = strstrip(elevator_name); 731 + 732 + elv_iosched_load_module(name); 733 + 734 + memflags = blk_mq_freeze_queue(q); 735 + mutex_lock(&q->elevator_lock); 736 + ret = elevator_change(q, name); 729 737 if (!ret) 730 - return count; 738 + ret = count; 739 + mutex_unlock(&q->elevator_lock); 740 + blk_mq_unfreeze_queue(q, memflags); 731 741 return ret; 732 742 } 733 743 ··· 748 738 struct elevator_type *cur = NULL, *e; 749 739 int len = 0; 750 740 741 + mutex_lock(&q->elevator_lock); 751 742 if (!q->elevator) { 752 743 len += sprintf(name+len, "[none] "); 753 744 } else { ··· 766 755 spin_unlock(&elv_list_lock); 767 756 768 757 len += sprintf(name+len, "\n"); 758 + mutex_unlock(&q->elevator_lock); 759 + 769 760 return len; 770 761 } 771 762
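The elevator.c change above moves module loading into `elv_iosched_store()` and performs it before freezing the queue, because reading the module file may issue I/O through the very device the queue serves. The ordering can be sketched with toy stand-ins for the kernel primitives (all names illustrative):

```python
import threading
from contextlib import contextmanager

def make_queue():
    return {"freeze_lock": threading.Lock(), "elevator": "none"}

@contextmanager
def frozen(q):
    # stand-in for blk_mq_freeze_queue()/blk_mq_unfreeze_queue()
    with q["freeze_lock"]:
        yield

def set_scheduler(q, name, load_module):
    # Resolving the scheduler module may read files from the device
    # this queue serves, so it must happen before the freeze; only
    # then is I/O stopped for the actual switch.
    name = name.strip()
    load_module(name)
    with frozen(q):
        q["elevator"] = name
    return q["elevator"]
```

Freezing first and then loading the module could deadlock: the module read would queue I/O on a frozen queue that is waiting for that same store to finish.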
-2
block/elevator.h
··· 148 148 * io scheduler sysfs switching 149 149 */ 150 150 ssize_t elv_iosched_show(struct gendisk *disk, char *page); 151 - void elv_iosched_load_module(struct gendisk *disk, const char *page, 152 - size_t count); 153 151 ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count); 154 152 155 153 extern bool elv_bio_merge_ok(struct request *, struct bio *);
+6 -3
block/genhd.c
··· 565 565 if (disk->major == BLOCK_EXT_MAJOR) 566 566 blk_free_ext_minor(disk->first_minor); 567 567 out_exit_elevator: 568 - if (disk->queue->elevator) 568 + if (disk->queue->elevator) { 569 + mutex_lock(&disk->queue->elevator_lock); 569 570 elevator_exit(disk->queue); 571 + mutex_unlock(&disk->queue->elevator_lock); 572 + } 570 573 return ret; 571 574 } 572 575 EXPORT_SYMBOL_GPL(add_disk_fwnode); ··· 745 742 746 743 blk_mq_quiesce_queue(q); 747 744 if (q->elevator) { 748 - mutex_lock(&q->sysfs_lock); 745 + mutex_lock(&q->elevator_lock); 749 746 elevator_exit(q); 750 - mutex_unlock(&q->sysfs_lock); 747 + mutex_unlock(&q->elevator_lock); 751 748 } 752 749 rq_qos_exit(q); 753 750 blk_mq_unquiesce_queue(q);
+5
block/ioctl.c
··· 15 15 #include <linux/io_uring/cmd.h> 16 16 #include <uapi/linux/blkdev.h> 17 17 #include "blk.h" 18 + #include "blk-crypto-internal.h" 18 19 19 20 static int blkpg_do_ioctl(struct block_device *bdev, 20 21 struct blkpg_partition __user *upart, int op) ··· 621 620 case BLKTRACESTOP: 622 621 case BLKTRACETEARDOWN: 623 622 return blk_trace_ioctl(bdev, cmd, argp); 623 + case BLKCRYPTOIMPORTKEY: 624 + case BLKCRYPTOGENERATEKEY: 625 + case BLKCRYPTOPREPAREKEY: 626 + return blk_crypto_ioctl(bdev, cmd, argp); 624 627 case IOC_PR_REGISTER: 625 628 return blkdev_pr_register(bdev, mode, argp); 626 629 case IOC_PR_RESERVE:
+1 -1
block/kyber-iosched.c
··· 568 568 unsigned int nr_segs) 569 569 { 570 570 struct blk_mq_ctx *ctx = blk_mq_get_ctx(q); 571 - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx); 571 + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(bio->bi_opf, ctx); 572 572 struct kyber_hctx_data *khd = hctx->sched_data; 573 573 struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]]; 574 574 unsigned int sched_domain = kyber_sched_domain(bio->bi_opf);
-2
block/partitions/sgi.c
··· 50 50 p = &label->partitions[0]; 51 51 magic = label->magic_mushroom; 52 52 if(be32_to_cpu(magic) != SGI_LABEL_MAGIC) { 53 - /*printk("Dev %s SGI disklabel: bad magic %08x\n", 54 - state->disk->disk_name, be32_to_cpu(magic));*/ 55 53 put_dev_sector(sect); 56 54 return 0; 57 55 }
-2
block/partitions/sun.c
··· 74 74 75 75 p = label->partitions; 76 76 if (be16_to_cpu(label->magic) != SUN_LABEL_MAGIC) { 77 - /* printk(KERN_INFO "Dev %s Sun disklabel: bad magic %04x\n", 78 - state->disk->disk_name, be16_to_cpu(label->magic)); */ 79 77 put_dev_sector(sect); 80 78 return 0; 81 79 }
+3 -3
block/t10-pi.c
··· 404 404 } 405 405 } 406 406 407 - void blk_integrity_verify(struct bio *bio) 407 + void blk_integrity_verify_iter(struct bio *bio, struct bvec_iter *saved_iter) 408 408 { 409 409 struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); 410 410 struct bio_integrity_payload *bip = bio_integrity(bio); ··· 418 418 */ 419 419 iter.disk_name = bio->bi_bdev->bd_disk->disk_name; 420 420 iter.interval = 1 << bi->interval_exp; 421 - iter.seed = bip->bio_iter.bi_sector; 421 + iter.seed = saved_iter->bi_sector; 422 422 iter.prot_buf = bvec_virt(bip->bip_vec); 423 - __bio_for_each_segment(bv, bio, bviter, bip->bio_iter) { 423 + __bio_for_each_segment(bv, bio, bviter, *saved_iter) { 424 424 void *kaddr = bvec_kmap_local(&bv); 425 425 blk_status_t ret = BLK_STS_OK; 426 426
+6
crypto/Kconfig
··· 141 141 select CRYPTO_ALGAPI 142 142 select CRYPTO_ACOMP2 143 143 144 + config CRYPTO_HKDF 145 + tristate 146 + select CRYPTO_SHA256 if !CONFIG_CRYPTO_MANAGER_DISABLE_TESTS 147 + select CRYPTO_SHA512 if !CONFIG_CRYPTO_MANAGER_DISABLE_TESTS 148 + select CRYPTO_HASH2 149 + 144 150 config CRYPTO_MANAGER 145 151 tristate "Cryptographic algorithm manager" 146 152 select CRYPTO_MANAGER2
+1
crypto/Makefile
··· 34 34 obj-$(CONFIG_CRYPTO_AKCIPHER2) += akcipher.o 35 35 obj-$(CONFIG_CRYPTO_SIG2) += sig.o 36 36 obj-$(CONFIG_CRYPTO_KPP2) += kpp.o 37 + obj-$(CONFIG_CRYPTO_HKDF) += hkdf.o 37 38 38 39 dh_generic-y := dh.o 39 40 dh_generic-y += dh_helper.o
+573
crypto/hkdf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Implementation of HKDF ("HMAC-based Extract-and-Expand Key Derivation 4 + * Function"), aka RFC 5869. See also the original paper (Krawczyk 2010): 5 + * "Cryptographic Extraction and Key Derivation: The HKDF Scheme". 6 + * 7 + * Copyright 2019 Google LLC 8 + */ 9 + 10 + #include <crypto/internal/hash.h> 11 + #include <crypto/sha2.h> 12 + #include <crypto/hkdf.h> 13 + #include <linux/module.h> 14 + 15 + /* 16 + * HKDF consists of two steps: 17 + * 18 + * 1. HKDF-Extract: extract a pseudorandom key from the input keying material 19 + * and optional salt. 20 + * 2. HKDF-Expand: expand the pseudorandom key into output keying material of 21 + * any length, parameterized by an application-specific info string. 22 + * 23 + */ 24 + 25 + /** 26 + * hkdf_extract - HKDF-Extract (RFC 5869 section 2.2) 27 + * @hmac_tfm: an HMAC transform using the hash function desired for HKDF. The 28 + * caller is responsible for setting the @prk afterwards. 29 + * @ikm: input keying material 30 + * @ikmlen: length of @ikm 31 + * @salt: input salt value 32 + * @saltlen: length of @salt 33 + * @prk: resulting pseudorandom key 34 + * 35 + * Extracts a pseudorandom key @prk from the input keying material 36 + * @ikm with length @ikmlen and salt @salt with length @saltlen. 37 + * The length of @prk is given by the digest size of @hmac_tfm. 38 + * For an 'unsalted' version of HKDF-Extract @salt must be set 39 + * to all zeroes and @saltlen must be set to the length of @prk. 40 + * 41 + * Returns 0 on success with the pseudorandom key stored in @prk, 42 + * or a negative errno value otherwise. 
43 + */ 44 + int hkdf_extract(struct crypto_shash *hmac_tfm, const u8 *ikm, 45 + unsigned int ikmlen, const u8 *salt, unsigned int saltlen, 46 + u8 *prk) 47 + { 48 + int err; 49 + 50 + err = crypto_shash_setkey(hmac_tfm, salt, saltlen); 51 + if (!err) 52 + err = crypto_shash_tfm_digest(hmac_tfm, ikm, ikmlen, prk); 53 + 54 + return err; 55 + } 56 + EXPORT_SYMBOL_GPL(hkdf_extract); 57 + 58 + /** 59 + * hkdf_expand - HKDF-Expand (RFC 5869 section 2.3) 60 + * @hmac_tfm: hash context keyed with pseudorandom key 61 + * @info: application-specific information 62 + * @infolen: length of @info 63 + * @okm: output keying material 64 + * @okmlen: length of @okm 65 + * 66 + * This expands the pseudorandom key, which was already keyed into @hmac_tfm, 67 + * into @okmlen bytes of output keying material parameterized by the 68 + * application-specific @info of length @infolen bytes. 69 + * This is thread-safe and may be called by multiple threads in parallel. 70 + * 71 + * Returns 0 on success with output keying material stored in @okm, 72 + * or a negative errno value otherwise. 
73 + */ 74 + int hkdf_expand(struct crypto_shash *hmac_tfm, 75 + const u8 *info, unsigned int infolen, 76 + u8 *okm, unsigned int okmlen) 77 + { 78 + SHASH_DESC_ON_STACK(desc, hmac_tfm); 79 + unsigned int i, hashlen = crypto_shash_digestsize(hmac_tfm); 80 + int err; 81 + const u8 *prev = NULL; 82 + u8 counter = 1; 83 + u8 tmp[HASH_MAX_DIGESTSIZE] = {}; 84 + 85 + if (WARN_ON(okmlen > 255 * hashlen)) 86 + return -EINVAL; 87 + 88 + desc->tfm = hmac_tfm; 89 + 90 + for (i = 0; i < okmlen; i += hashlen) { 91 + err = crypto_shash_init(desc); 92 + if (err) 93 + goto out; 94 + 95 + if (prev) { 96 + err = crypto_shash_update(desc, prev, hashlen); 97 + if (err) 98 + goto out; 99 + } 100 + 101 + if (infolen) { 102 + err = crypto_shash_update(desc, info, infolen); 103 + if (err) 104 + goto out; 105 + } 106 + 107 + BUILD_BUG_ON(sizeof(counter) != 1); 108 + if (okmlen - i < hashlen) { 109 + err = crypto_shash_finup(desc, &counter, 1, tmp); 110 + if (err) 111 + goto out; 112 + memcpy(&okm[i], tmp, okmlen - i); 113 + memzero_explicit(tmp, sizeof(tmp)); 114 + } else { 115 + err = crypto_shash_finup(desc, &counter, 1, &okm[i]); 116 + if (err) 117 + goto out; 118 + } 119 + counter++; 120 + prev = &okm[i]; 121 + } 122 + err = 0; 123 + out: 124 + if (unlikely(err)) 125 + memzero_explicit(okm, okmlen); /* so caller doesn't need to */ 126 + shash_desc_zero(desc); 127 + memzero_explicit(tmp, HASH_MAX_DIGESTSIZE); 128 + return err; 129 + } 130 + EXPORT_SYMBOL_GPL(hkdf_expand); 131 + 132 + struct hkdf_testvec { 133 + const char *test; 134 + const u8 *ikm; 135 + const u8 *salt; 136 + const u8 *info; 137 + const u8 *prk; 138 + const u8 *okm; 139 + u16 ikm_size; 140 + u16 salt_size; 141 + u16 info_size; 142 + u16 prk_size; 143 + u16 okm_size; 144 + }; 145 + 146 + /* 147 + * HKDF test vectors from RFC5869 148 + * 149 + * Additional HKDF test vectors from 150 + * https://github.com/brycx/Test-Vector-Generation/blob/master/HKDF/hkdf-hmac-sha2-test-vectors.md 151 + */ 152 + static const struct 
hkdf_testvec hkdf_sha256_tv[] = { 153 + { 154 + .test = "basic hdkf test", 155 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b" 156 + "\x0b\x0b\x0b\x0b\x0b\x0b", 157 + .ikm_size = 22, 158 + .salt = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c", 159 + .salt_size = 13, 160 + .info = "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9", 161 + .info_size = 10, 162 + .prk = "\x07\x77\x09\x36\x2c\x2e\x32\xdf\x0d\xdc\x3f\x0d\xc4\x7b\xba\x63" 163 + "\x90\xb6\xc7\x3b\xb5\x0f\x9c\x31\x22\xec\x84\x4a\xd7\xc2\xb3\xe5", 164 + .prk_size = 32, 165 + .okm = "\x3c\xb2\x5f\x25\xfa\xac\xd5\x7a\x90\x43\x4f\x64\xd0\x36\x2f\x2a" 166 + "\x2d\x2d\x0a\x90\xcf\x1a\x5a\x4c\x5d\xb0\x2d\x56\xec\xc4\xc5\xbf" 167 + "\x34\x00\x72\x08\xd5\xb8\x87\x18\x58\x65", 168 + .okm_size = 42, 169 + }, { 170 + .test = "hkdf test with long input", 171 + .ikm = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" 172 + "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" 173 + "\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f" 174 + "\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f" 175 + "\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f", 176 + .ikm_size = 80, 177 + .salt = "\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" 178 + "\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" 179 + "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" 180 + "\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" 181 + "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf", 182 + .salt_size = 80, 183 + .info = "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf" 184 + "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf" 185 + "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf" 186 + "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef" 187 + 
"\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff", 188 + .info_size = 80, 189 + .prk = "\x06\xa6\xb8\x8c\x58\x53\x36\x1a\x06\x10\x4c\x9c\xeb\x35\xb4\x5c" 190 + "\xef\x76\x00\x14\x90\x46\x71\x01\x4a\x19\x3f\x40\xc1\x5f\xc2\x44", 191 + .prk_size = 32, 192 + .okm = "\xb1\x1e\x39\x8d\xc8\x03\x27\xa1\xc8\xe7\xf7\x8c\x59\x6a\x49\x34" 193 + "\x4f\x01\x2e\xda\x2d\x4e\xfa\xd8\xa0\x50\xcc\x4c\x19\xaf\xa9\x7c" 194 + "\x59\x04\x5a\x99\xca\xc7\x82\x72\x71\xcb\x41\xc6\x5e\x59\x0e\x09" 195 + "\xda\x32\x75\x60\x0c\x2f\x09\xb8\x36\x77\x93\xa9\xac\xa3\xdb\x71" 196 + "\xcc\x30\xc5\x81\x79\xec\x3e\x87\xc1\x4c\x01\xd5\xc1\xf3\x43\x4f" 197 + "\x1d\x87", 198 + .okm_size = 82, 199 + }, { 200 + .test = "hkdf test with zero salt and info", 201 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b" 202 + "\x0b\x0b\x0b\x0b\x0b\x0b", 203 + .ikm_size = 22, 204 + .salt = NULL, 205 + .salt_size = 0, 206 + .info = NULL, 207 + .info_size = 0, 208 + .prk = "\x19\xef\x24\xa3\x2c\x71\x7b\x16\x7f\x33\xa9\x1d\x6f\x64\x8b\xdf" 209 + "\x96\x59\x67\x76\xaf\xdb\x63\x77\xac\x43\x4c\x1c\x29\x3c\xcb\x04", 210 + .prk_size = 32, 211 + .okm = "\x8d\xa4\xe7\x75\xa5\x63\xc1\x8f\x71\x5f\x80\x2a\x06\x3c\x5a\x31" 212 + "\xb8\xa1\x1f\x5c\x5e\xe1\x87\x9e\xc3\x45\x4e\x5f\x3c\x73\x8d\x2d" 213 + "\x9d\x20\x13\x95\xfa\xa4\xb6\x1a\x96\xc8", 214 + .okm_size = 42, 215 + }, { 216 + .test = "hkdf test with short input", 217 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b", 218 + .ikm_size = 11, 219 + .salt = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c", 220 + .salt_size = 13, 221 + .info = "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9", 222 + .info_size = 10, 223 + .prk = "\x82\x65\xf6\x9d\x7f\xf7\xe5\x01\x37\x93\x01\x5c\xa0\xef\x92\x0c" 224 + "\xb1\x68\x21\x99\xc8\xbc\x3a\x00\xda\x0c\xab\x47\xb7\xb0\x0f\xdf", 225 + .prk_size = 32, 226 + .okm = "\x58\xdc\xe1\x0d\x58\x01\xcd\xfd\xa8\x31\x72\x6b\xfe\xbc\xb7\x43" 227 + 
"\xd1\x4a\x7e\xe8\x3a\xa0\x57\xa9\x3d\x59\xb0\xa1\x31\x7f\xf0\x9d" 228 + "\x10\x5c\xce\xcf\x53\x56\x92\xb1\x4d\xd5", 229 + .okm_size = 42, 230 + }, { 231 + .test = "unsalted hkdf test with zero info", 232 + .ikm = "\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c" 233 + "\x0c\x0c\x0c\x0c\x0c\x0c", 234 + .ikm_size = 22, 235 + .salt = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" 236 + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 237 + .salt_size = 32, 238 + .info = NULL, 239 + .info_size = 0, 240 + .prk = "\xaa\x84\x1e\x1f\x35\x74\xf3\x2d\x13\xfb\xa8\x00\x5f\xcd\x9b\x8d" 241 + "\x77\x67\x82\xa5\xdf\xa1\x92\x38\x92\xfd\x8b\x63\x5d\x3a\x89\xdf", 242 + .prk_size = 32, 243 + .okm = "\x59\x68\x99\x17\x9a\xb1\xbc\x00\xa7\xc0\x37\x86\xff\x43\xee\x53" 244 + "\x50\x04\xbe\x2b\xb9\xbe\x68\xbc\x14\x06\x63\x6f\x54\xbd\x33\x8a" 245 + "\x66\xa2\x37\xba\x2a\xcb\xce\xe3\xc9\xa7", 246 + .okm_size = 42, 247 + } 248 + }; 249 + 250 + static const struct hkdf_testvec hkdf_sha384_tv[] = { 251 + { 252 + .test = "basic hkdf test", 253 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b" 254 + "\x0b\x0b\x0b\x0b\x0b\x0b", 255 + .ikm_size = 22, 256 + .salt = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c", 257 + .salt_size = 13, 258 + .info = "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9", 259 + .info_size = 10, 260 + .prk = "\x70\x4b\x39\x99\x07\x79\xce\x1d\xc5\x48\x05\x2c\x7d\xc3\x9f\x30" 261 + "\x35\x70\xdd\x13\xfb\x39\xf7\xac\xc5\x64\x68\x0b\xef\x80\xe8\xde" 262 + "\xc7\x0e\xe9\xa7\xe1\xf3\xe2\x93\xef\x68\xec\xeb\x07\x2a\x5a\xde", 263 + .prk_size = 48, 264 + .okm = "\x9b\x50\x97\xa8\x60\x38\xb8\x05\x30\x90\x76\xa4\x4b\x3a\x9f\x38" 265 + "\x06\x3e\x25\xb5\x16\xdc\xbf\x36\x9f\x39\x4c\xfa\xb4\x36\x85\xf7" 266 + "\x48\xb6\x45\x77\x63\xe4\xf0\x20\x4f\xc5", 267 + .okm_size = 42, 268 + }, { 269 + .test = "hkdf test with long input", 270 + .ikm = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" 
271 + "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" 272 + "\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f" 273 + "\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f" 274 + "\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f", 275 + .ikm_size = 80, 276 + .salt = "\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" 277 + "\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" 278 + "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" 279 + "\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" 280 + "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf", 281 + .salt_size = 80, 282 + .info = "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf" 283 + "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf" 284 + "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf" 285 + "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef" 286 + "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff", 287 + .info_size = 80, 288 + .prk = "\xb3\x19\xf6\x83\x1d\xff\x93\x14\xef\xb6\x43\xba\xa2\x92\x63\xb3" 289 + "\x0e\x4a\x8d\x77\x9f\xe3\x1e\x9c\x90\x1e\xfd\x7d\xe7\x37\xc8\x5b" 290 + "\x62\xe6\x76\xd4\xdc\x87\xb0\x89\x5c\x6a\x7d\xc9\x7b\x52\xce\xbb", 291 + .prk_size = 48, 292 + .okm = "\x48\x4c\xa0\x52\xb8\xcc\x72\x4f\xd1\xc4\xec\x64\xd5\x7b\x4e\x81" 293 + "\x8c\x7e\x25\xa8\xe0\xf4\x56\x9e\xd7\x2a\x6a\x05\xfe\x06\x49\xee" 294 + "\xbf\x69\xf8\xd5\xc8\x32\x85\x6b\xf4\xe4\xfb\xc1\x79\x67\xd5\x49" 295 + "\x75\x32\x4a\x94\x98\x7f\x7f\x41\x83\x58\x17\xd8\x99\x4f\xdb\xd6" 296 + "\xf4\xc0\x9c\x55\x00\xdc\xa2\x4a\x56\x22\x2f\xea\x53\xd8\x96\x7a" 297 + "\x8b\x2e", 298 + .okm_size = 82, 299 + }, { 300 + .test = "hkdf test with zero salt and info", 301 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b" 302 + "\x0b\x0b\x0b\x0b\x0b\x0b", 303 + .ikm_size = 22, 304 + .salt = NULL, 305 + 
.salt_size = 0, 306 + .info = NULL, 307 + .info_size = 0, 308 + .prk = "\x10\xe4\x0c\xf0\x72\xa4\xc5\x62\x6e\x43\xdd\x22\xc1\xcf\x72\x7d" 309 + "\x4b\xb1\x40\x97\x5c\x9a\xd0\xcb\xc8\xe4\x5b\x40\x06\x8f\x8f\x0b" 310 + "\xa5\x7c\xdb\x59\x8a\xf9\xdf\xa6\x96\x3a\x96\x89\x9a\xf0\x47\xe5", 311 + .prk_size = 48, 312 + .okm = "\xc8\xc9\x6e\x71\x0f\x89\xb0\xd7\x99\x0b\xca\x68\xbc\xde\xc8\xcf" 313 + "\x85\x40\x62\xe5\x4c\x73\xa7\xab\xc7\x43\xfa\xde\x9b\x24\x2d\xaa" 314 + "\xcc\x1c\xea\x56\x70\x41\x5b\x52\x84\x9c", 315 + .okm_size = 42, 316 + }, { 317 + .test = "hkdf test with short input", 318 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b", 319 + .ikm_size = 11, 320 + .salt = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c", 321 + .salt_size = 13, 322 + .info = "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9", 323 + .info_size = 10, 324 + .prk = "\x6d\x31\x69\x98\x28\x79\x80\x88\xb3\x59\xda\xd5\x0b\x8f\x01\xb0" 325 + "\x15\xf1\x7a\xa3\xbd\x4e\x27\xa6\xe9\xf8\x73\xb7\x15\x85\xca\x6a" 326 + "\x00\xd1\xf0\x82\x12\x8a\xdb\x3c\xf0\x53\x0b\x57\xc0\xf9\xac\x72", 327 + .prk_size = 48, 328 + .okm = "\xfb\x7e\x67\x43\xeb\x42\xcd\xe9\x6f\x1b\x70\x77\x89\x52\xab\x75" 329 + "\x48\xca\xfe\x53\x24\x9f\x7f\xfe\x14\x97\xa1\x63\x5b\x20\x1f\xf1" 330 + "\x85\xb9\x3e\x95\x19\x92\xd8\x58\xf1\x1a", 331 + .okm_size = 42, 332 + }, { 333 + .test = "unsalted hkdf test with zero info", 334 + .ikm = "\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c" 335 + "\x0c\x0c\x0c\x0c\x0c\x0c", 336 + .ikm_size = 22, 337 + .salt = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" 338 + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" 339 + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 340 + .salt_size = 48, 341 + .info = NULL, 342 + .info_size = 0, 343 + .prk = "\x9d\x2d\xa5\x06\x6f\x05\xd1\x6c\x59\xfe\xdf\x6c\x5f\x32\xc7\x5e" 344 + "\xda\x9a\x47\xa7\x9c\x93\x6a\xa4\x4c\xb7\x63\xa8\xe2\x2f\xfb\xfc" 345 + 
"\xd8\xfe\x55\x43\x58\x53\x47\x21\x90\x39\xd1\x68\x28\x36\x33\xf5", 346 + .prk_size = 48, 347 + .okm = "\x6a\xd7\xc7\x26\xc8\x40\x09\x54\x6a\x76\xe0\x54\x5d\xf2\x66\x78" 348 + "\x7e\x2b\x2c\xd6\xca\x43\x73\xa1\xf3\x14\x50\xa7\xbd\xf9\x48\x2b" 349 + "\xfa\xb8\x11\xf5\x54\x20\x0e\xad\x8f\x53", 350 + .okm_size = 42, 351 + } 352 + }; 353 + 354 + static const struct hkdf_testvec hkdf_sha512_tv[] = { 355 + { 356 + .test = "basic hkdf test", 357 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b" 358 + "\x0b\x0b\x0b\x0b\x0b\x0b", 359 + .ikm_size = 22, 360 + .salt = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c", 361 + .salt_size = 13, 362 + .info = "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9", 363 + .info_size = 10, 364 + .prk = "\x66\x57\x99\x82\x37\x37\xde\xd0\x4a\x88\xe4\x7e\x54\xa5\x89\x0b" 365 + "\xb2\xc3\xd2\x47\xc7\xa4\x25\x4a\x8e\x61\x35\x07\x23\x59\x0a\x26" 366 + "\xc3\x62\x38\x12\x7d\x86\x61\xb8\x8c\xf8\x0e\xf8\x02\xd5\x7e\x2f" 367 + "\x7c\xeb\xcf\x1e\x00\xe0\x83\x84\x8b\xe1\x99\x29\xc6\x1b\x42\x37", 368 + .prk_size = 64, 369 + .okm = "\x83\x23\x90\x08\x6c\xda\x71\xfb\x47\x62\x5b\xb5\xce\xb1\x68\xe4" 370 + "\xc8\xe2\x6a\x1a\x16\xed\x34\xd9\xfc\x7f\xe9\x2c\x14\x81\x57\x93" 371 + "\x38\xda\x36\x2c\xb8\xd9\xf9\x25\xd7\xcb", 372 + .okm_size = 42, 373 + }, { 374 + .test = "hkdf test with long input", 375 + .ikm = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" 376 + "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" 377 + "\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f" 378 + "\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f" 379 + "\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f", 380 + .ikm_size = 80, 381 + .salt = "\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" 382 + "\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" 383 + "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" 384 + 
"\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" 385 + "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf", 386 + .salt_size = 80, 387 + .info = "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf" 388 + "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf" 389 + "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf" 390 + "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef" 391 + "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff", 392 + .info_size = 80, 393 + .prk = "\x35\x67\x25\x42\x90\x7d\x4e\x14\x2c\x00\xe8\x44\x99\xe7\x4e\x1d" 394 + "\xe0\x8b\xe8\x65\x35\xf9\x24\xe0\x22\x80\x4a\xd7\x75\xdd\xe2\x7e" 395 + "\xc8\x6c\xd1\xe5\xb7\xd1\x78\xc7\x44\x89\xbd\xbe\xb3\x07\x12\xbe" 396 + "\xb8\x2d\x4f\x97\x41\x6c\x5a\x94\xea\x81\xeb\xdf\x3e\x62\x9e\x4a", 397 + .prk_size = 64, 398 + .okm = "\xce\x6c\x97\x19\x28\x05\xb3\x46\xe6\x16\x1e\x82\x1e\xd1\x65\x67" 399 + "\x3b\x84\xf4\x00\xa2\xb5\x14\xb2\xfe\x23\xd8\x4c\xd1\x89\xdd\xf1" 400 + "\xb6\x95\xb4\x8c\xbd\x1c\x83\x88\x44\x11\x37\xb3\xce\x28\xf1\x6a" 401 + "\xa6\x4b\xa3\x3b\xa4\x66\xb2\x4d\xf6\xcf\xcb\x02\x1e\xcf\xf2\x35" 402 + "\xf6\xa2\x05\x6c\xe3\xaf\x1d\xe4\x4d\x57\x20\x97\xa8\x50\x5d\x9e" 403 + "\x7a\x93", 404 + .okm_size = 82, 405 + }, { 406 + .test = "hkdf test with zero salt and info", 407 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b" 408 + "\x0b\x0b\x0b\x0b\x0b\x0b", 409 + .ikm_size = 22, 410 + .salt = NULL, 411 + .salt_size = 0, 412 + .info = NULL, 413 + .info_size = 0, 414 + .prk = "\xfd\x20\x0c\x49\x87\xac\x49\x13\x13\xbd\x4a\x2a\x13\x28\x71\x21" 415 + "\x24\x72\x39\xe1\x1c\x9e\xf8\x28\x02\x04\x4b\x66\xef\x35\x7e\x5b" 416 + "\x19\x44\x98\xd0\x68\x26\x11\x38\x23\x48\x57\x2a\x7b\x16\x11\xde" 417 + "\x54\x76\x40\x94\x28\x63\x20\x57\x8a\x86\x3f\x36\x56\x2b\x0d\xf6", 418 + .prk_size = 64, 419 + .okm = "\xf5\xfa\x02\xb1\x82\x98\xa7\x2a\x8c\x23\x89\x8a\x87\x03\x47\x2c" 420 + 
"\x6e\xb1\x79\xdc\x20\x4c\x03\x42\x5c\x97\x0e\x3b\x16\x4b\xf9\x0f" 421 + "\xff\x22\xd0\x48\x36\xd0\xe2\x34\x3b\xac", 422 + .okm_size = 42, 423 + }, { 424 + .test = "hkdf test with short input", 425 + .ikm = "\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b", 426 + .ikm_size = 11, 427 + .salt = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c", 428 + .salt_size = 13, 429 + .info = "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9", 430 + .info_size = 10, 431 + .prk = "\x67\x40\x9c\x9c\xac\x28\xb5\x2e\xe9\xfa\xd9\x1c\x2f\xda\x99\x9f" 432 + "\x7c\xa2\x2e\x34\x34\xf0\xae\x77\x28\x63\x83\x65\x68\xad\x6a\x7f" 433 + "\x10\xcf\x11\x3b\xfd\xdd\x56\x01\x29\xa5\x94\xa8\xf5\x23\x85\xc2" 434 + "\xd6\x61\xd7\x85\xd2\x9c\xe9\x3a\x11\x40\x0c\x92\x06\x83\x18\x1d", 435 + .prk_size = 64, 436 + .okm = "\x74\x13\xe8\x99\x7e\x02\x06\x10\xfb\xf6\x82\x3f\x2c\xe1\x4b\xff" 437 + "\x01\x87\x5d\xb1\xca\x55\xf6\x8c\xfc\xf3\x95\x4d\xc8\xaf\xf5\x35" 438 + "\x59\xbd\x5e\x30\x28\xb0\x80\xf7\xc0\x68", 439 + .okm_size = 42, 440 + }, { 441 + .test = "unsalted hkdf test with zero info", 442 + .ikm = "\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c" 443 + "\x0c\x0c\x0c\x0c\x0c\x0c", 444 + .ikm_size = 22, 445 + .salt = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" 446 + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" 447 + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" 448 + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 449 + .salt_size = 64, 450 + .info = NULL, 451 + .info_size = 0, 452 + .prk = "\x53\x46\xb3\x76\xbf\x3a\xa9\xf8\x4f\x8f\x6e\xd5\xb1\xc4\xf4\x89" 453 + "\x17\x2e\x24\x4d\xac\x30\x3d\x12\xf6\x8e\xcc\x76\x6e\xa6\x00\xaa" 454 + "\x88\x49\x5e\x7f\xb6\x05\x80\x31\x22\xfa\x13\x69\x24\xa8\x40\xb1" 455 + "\xf0\x71\x9d\x2d\x5f\x68\xe2\x9b\x24\x22\x99\xd7\x58\xed\x68\x0c", 456 + .prk_size = 64, 457 + .okm = "\x14\x07\xd4\x60\x13\xd9\x8b\xc6\xde\xce\xfc\xfe\xe5\x5f\x0f\x90" 458 + 
"\xb0\xc7\xf6\x3d\x68\xeb\x1a\x80\xea\xf0\x7e\x95\x3c\xfc\x0a\x3a" 459 + "\x52\x40\xa1\x55\xd6\xe4\xda\xa9\x65\xbb", 460 + .okm_size = 42, 461 + } 462 + }; 463 + 464 + static int hkdf_test(const char *shash, const struct hkdf_testvec *tv) 465 + { struct crypto_shash *tfm = NULL; 466 + u8 *prk = NULL, *okm = NULL; 467 + unsigned int prk_size; 468 + const char *driver; 469 + int err; 470 + 471 + tfm = crypto_alloc_shash(shash, 0, 0); 472 + if (IS_ERR(tfm)) { 473 + pr_err("%s(%s): failed to allocate transform: %ld\n", 474 + tv->test, shash, PTR_ERR(tfm)); 475 + return PTR_ERR(tfm); 476 + } 477 + driver = crypto_shash_driver_name(tfm); 478 + 479 + prk_size = crypto_shash_digestsize(tfm); 480 + prk = kzalloc(prk_size, GFP_KERNEL); 481 + if (!prk) { 482 + err = -ENOMEM; 483 + goto out_free; 484 + } 485 + 486 + if (tv->prk_size != prk_size) { 487 + pr_err("%s(%s): prk size mismatch (vec %u, digest %u\n", 488 + tv->test, driver, tv->prk_size, prk_size); 489 + err = -EINVAL; 490 + goto out_free; 491 + } 492 + 493 + err = hkdf_extract(tfm, tv->ikm, tv->ikm_size, 494 + tv->salt, tv->salt_size, prk); 495 + if (err) { 496 + pr_err("%s(%s): hkdf_extract failed with %d\n", 497 + tv->test, driver, err); 498 + goto out_free; 499 + } 500 + 501 + if (memcmp(prk, tv->prk, tv->prk_size)) { 502 + pr_err("%s(%s): hkdf_extract prk mismatch\n", 503 + tv->test, driver); 504 + print_hex_dump(KERN_ERR, "prk: ", DUMP_PREFIX_NONE, 505 + 16, 1, prk, tv->prk_size, false); 506 + err = -EINVAL; 507 + goto out_free; 508 + } 509 + 510 + okm = kzalloc(tv->okm_size, GFP_KERNEL); 511 + if (!okm) { 512 + err = -ENOMEM; 513 + goto out_free; 514 + } 515 + 516 + err = crypto_shash_setkey(tfm, tv->prk, tv->prk_size); 517 + if (err) { 518 + pr_err("%s(%s): failed to set prk, error %d\n", 519 + tv->test, driver, err); 520 + goto out_free; 521 + } 522 + 523 + err = hkdf_expand(tfm, tv->info, tv->info_size, 524 + okm, tv->okm_size); 525 + if (err) { 526 + pr_err("%s(%s): hkdf_expand() failed with %d\n", 527 + 
tv->test, driver, err); 528 + } else if (memcmp(okm, tv->okm, tv->okm_size)) { 529 + pr_err("%s(%s): hkdf_expand() okm mismatch\n", 530 + tv->test, driver); 531 + print_hex_dump(KERN_ERR, "okm: ", DUMP_PREFIX_NONE, 532 + 16, 1, okm, tv->okm_size, false); 533 + err = -EINVAL; 534 + } 535 + out_free: 536 + kfree(okm); 537 + kfree(prk); 538 + crypto_free_shash(tfm); 539 + return err; 540 + } 541 + 542 + static int __init crypto_hkdf_module_init(void) 543 + { 544 + int ret = 0, i; 545 + 546 + if (IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS)) 547 + return 0; 548 + 549 + for (i = 0; i < ARRAY_SIZE(hkdf_sha256_tv); i++) { 550 + ret = hkdf_test("hmac(sha256)", &hkdf_sha256_tv[i]); 551 + if (ret) 552 + return ret; 553 + } 554 + for (i = 0; i < ARRAY_SIZE(hkdf_sha384_tv); i++) { 555 + ret = hkdf_test("hmac(sha384)", &hkdf_sha384_tv[i]); 556 + if (ret) 557 + return ret; 558 + } 559 + for (i = 0; i < ARRAY_SIZE(hkdf_sha512_tv); i++) { 560 + ret = hkdf_test("hmac(sha512)", &hkdf_sha512_tv[i]); 561 + if (ret) 562 + return ret; 563 + } 564 + return 0; 565 + } 566 + 567 + static void __exit crypto_hkdf_module_exit(void) {} 568 + 569 + module_init(crypto_hkdf_module_init); 570 + module_exit(crypto_hkdf_module_exit); 571 + 572 + MODULE_LICENSE("GPL"); 573 + MODULE_DESCRIPTION("HMAC-based Key Derivation Function (HKDF)");
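The new crypto/hkdf.c above is the authoritative kernel implementation; as a userspace cross-check of the same RFC 5869 construction, the extract/expand pair can be sketched in a few lines of Python with `hmac`/`hashlib`. This is only a model (the function names and signatures here are illustrative, not the kernel API): extract is a single keyed HMAC, and expand runs HMAC in feedback mode with a one-byte counter starting at 1 and an empty T(0), exactly mirroring the `prev = NULL` / `counter = 1` loop in `hkdf_expand()` above.

```python
import hashlib
import hmac

def hkdf_extract(salt, ikm, hash_name="sha256"):
    """HKDF-Extract (RFC 5869 section 2.2): PRK = HMAC(salt, IKM).

    A salt of None selects the 'unsalted' mode the kernel kerneldoc
    describes: a salt of HashLen zero bytes.
    """
    hashlen = hashlib.new(hash_name).digest_size
    if salt is None:
        salt = b"\x00" * hashlen
    return hmac.new(salt, ikm, hash_name).digest()

def hkdf_expand(prk, info, okm_len, hash_name="sha256"):
    """HKDF-Expand (RFC 5869 section 2.3), feedback mode.

    T(i) = HMAC(PRK, T(i-1) || info || counter), counter = 1, 2, ...
    """
    hashlen = hashlib.new(hash_name).digest_size
    if okm_len > 255 * hashlen:  # same bound hkdf_expand() WARNs on
        raise ValueError("okm_len too large for HKDF")
    okm, prev, counter = b"", b"", 1
    while len(okm) < okm_len:
        prev = hmac.new(prk, prev + info + bytes([counter]),
                        hash_name).digest()
        okm += prev
        counter += 1
    return okm[:okm_len]
```

Feeding it the "zero salt and info" SHA-256 vector from the self-test table above (22 bytes of 0x0b, no salt, no info, 42-byte OKM) reproduces the quoted `prk` and `okm` values.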
+60 -46
drivers/block/loop.c
··· 45 45 Lo_deleting, 46 46 }; 47 47 48 - struct loop_func_table; 49 - 50 48 struct loop_device { 51 49 int lo_number; 52 50 loff_t lo_offset; ··· 52 54 int lo_flags; 53 55 char lo_file_name[LO_NAME_SIZE]; 54 56 55 - struct file * lo_backing_file; 57 + struct file *lo_backing_file; 58 + unsigned int lo_min_dio_size; 56 59 struct block_device *lo_device; 57 60 58 61 gfp_t old_gfp_mask; ··· 168 169 * of backing device, and the logical block size of loop is bigger than that of 169 170 * the backing device. 170 171 */ 171 - static bool lo_bdev_can_use_dio(struct loop_device *lo, 172 - struct block_device *backing_bdev) 173 - { 174 - unsigned int sb_bsize = bdev_logical_block_size(backing_bdev); 175 - 176 - if (queue_logical_block_size(lo->lo_queue) < sb_bsize) 177 - return false; 178 - if (lo->lo_offset & (sb_bsize - 1)) 179 - return false; 180 - return true; 181 - } 182 - 183 172 static bool lo_can_use_dio(struct loop_device *lo) 184 173 { 185 - struct inode *inode = lo->lo_backing_file->f_mapping->host; 186 - 187 174 if (!(lo->lo_backing_file->f_mode & FMODE_CAN_ODIRECT)) 188 175 return false; 189 - 190 - if (S_ISBLK(inode->i_mode)) 191 - return lo_bdev_can_use_dio(lo, I_BDEV(inode)); 192 - if (inode->i_sb->s_bdev) 193 - return lo_bdev_can_use_dio(lo, inode->i_sb->s_bdev); 176 + if (queue_logical_block_size(lo->lo_queue) < lo->lo_min_dio_size) 177 + return false; 178 + if (lo->lo_offset & (lo->lo_min_dio_size - 1)) 179 + return false; 194 180 return true; 195 181 } 196 182 ··· 189 205 */ 190 206 static inline void loop_update_dio(struct loop_device *lo) 191 207 { 192 - bool dio_in_use = lo->lo_flags & LO_FLAGS_DIRECT_IO; 193 - 194 208 lockdep_assert_held(&lo->lo_mutex); 195 209 WARN_ON_ONCE(lo->lo_state == Lo_bound && 196 210 lo->lo_queue->mq_freeze_depth == 0); 197 211 198 - if (lo->lo_backing_file->f_flags & O_DIRECT) 199 - lo->lo_flags |= LO_FLAGS_DIRECT_IO; 200 212 if ((lo->lo_flags & LO_FLAGS_DIRECT_IO) && !lo_can_use_dio(lo)) 201 213 lo->lo_flags &= 
~LO_FLAGS_DIRECT_IO; 202 - 203 - /* flush dirty pages before starting to issue direct I/O */ 204 - if ((lo->lo_flags & LO_FLAGS_DIRECT_IO) && !dio_in_use) 205 - vfs_fsync(lo->lo_backing_file, 0); 206 214 } 207 215 208 216 /** ··· 517 541 __func__, lo->lo_number, lo->lo_file_name, rc); 518 542 } 519 543 544 + static unsigned int loop_query_min_dio_size(struct loop_device *lo) 545 + { 546 + struct file *file = lo->lo_backing_file; 547 + struct block_device *sb_bdev = file->f_mapping->host->i_sb->s_bdev; 548 + struct kstat st; 549 + 550 + /* 551 + * Use the minimal dio alignment of the file system if provided. 552 + */ 553 + if (!vfs_getattr(&file->f_path, &st, STATX_DIOALIGN, 0) && 554 + (st.result_mask & STATX_DIOALIGN)) 555 + return st.dio_offset_align; 556 + 557 + /* 558 + * In a perfect world this wouldn't be needed, but as of Linux 6.13 only 559 + * a handful of file systems support the STATX_DIOALIGN flag. 560 + */ 561 + if (sb_bdev) 562 + return bdev_logical_block_size(sb_bdev); 563 + return SECTOR_SIZE; 564 + } 565 + 520 566 static inline int is_loop_device(struct file *file) 521 567 { 522 568 struct inode *i = file->f_mapping->host; ··· 569 571 if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode)) 570 572 return -EINVAL; 571 573 return 0; 574 + } 575 + 576 + static void loop_assign_backing_file(struct loop_device *lo, struct file *file) 577 + { 578 + lo->lo_backing_file = file; 579 + lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping); 580 + mapping_set_gfp_mask(file->f_mapping, 581 + lo->old_gfp_mask & ~(__GFP_IO | __GFP_FS)); 582 + if (lo->lo_backing_file->f_flags & O_DIRECT) 583 + lo->lo_flags |= LO_FLAGS_DIRECT_IO; 584 + lo->lo_min_dio_size = loop_query_min_dio_size(lo); 572 585 } 573 586 574 587 /* ··· 631 622 if (get_loop_size(lo, file) != get_loop_size(lo, old_file)) 632 623 goto out_err; 633 624 625 + /* 626 + * We might switch to direct I/O mode for the loop device, write back 627 + * all dirty data the page cache now that so that the individual 
I/O 628 + * operations don't have to do that. 629 + */ 630 + vfs_fsync(file, 0); 631 + 634 632 /* and ... switch */ 635 633 disk_force_media_change(lo->lo_disk); 636 634 memflags = blk_mq_freeze_queue(lo->lo_queue); 637 635 mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask); 638 - lo->lo_backing_file = file; 639 - lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping); 640 - mapping_set_gfp_mask(file->f_mapping, 641 - lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); 636 + loop_assign_backing_file(lo, file); 642 637 loop_update_dio(lo); 643 638 blk_mq_unfreeze_queue(lo->lo_queue, memflags); 644 639 partscan = lo->lo_flags & LO_FLAGS_PARTSCAN; ··· 984 971 return 0; 985 972 } 986 973 987 - static unsigned int loop_default_blocksize(struct loop_device *lo, 988 - struct block_device *backing_bdev) 974 + static unsigned int loop_default_blocksize(struct loop_device *lo) 989 975 { 990 - /* In case of direct I/O, match underlying block size */ 991 - if ((lo->lo_backing_file->f_flags & O_DIRECT) && backing_bdev) 992 - return bdev_logical_block_size(backing_bdev); 976 + /* In case of direct I/O, match underlying minimum I/O size */ 977 + if (lo->lo_flags & LO_FLAGS_DIRECT_IO) 978 + return lo->lo_min_dio_size; 993 979 return SECTOR_SIZE; 994 980 } 995 981 ··· 1006 994 backing_bdev = inode->i_sb->s_bdev; 1007 995 1008 996 if (!bsize) 1009 - bsize = loop_default_blocksize(lo, backing_bdev); 997 + bsize = loop_default_blocksize(lo); 1010 998 1011 999 loop_get_discard_config(lo, &granularity, &max_discard_sectors); 1012 1000 ··· 1031 1019 const struct loop_config *config) 1032 1020 { 1033 1021 struct file *file = fget(config->fd); 1034 - struct address_space *mapping; 1035 1022 struct queue_limits lim; 1036 1023 int error; 1037 1024 loff_t size; ··· 1066 1055 if (error) 1067 1056 goto out_unlock; 1068 1057 1069 - mapping = file->f_mapping; 1070 - 1071 1058 if ((config->info.lo_flags & ~LOOP_CONFIGURE_SETTABLE_FLAGS) != 0) { 1072 1059 error = -EINVAL; 1073 1060 goto out_unlock; 
··· 1097 1088 set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0); 1098 1089 1099 1090 lo->lo_device = bdev; 1100 - lo->lo_backing_file = file; 1101 - lo->old_gfp_mask = mapping_gfp_mask(mapping); 1102 - mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); 1091 + loop_assign_backing_file(lo, file); 1103 1092 1104 1093 lim = queue_limits_start_update(lo->lo_queue); 1105 1094 loop_update_limits(lo, &lim, config->block_size); ··· 1105 1098 error = queue_limits_commit_update(lo->lo_queue, &lim); 1106 1099 if (error) 1107 1100 goto out_unlock; 1101 + 1102 + /* 1103 + * We might switch to direct I/O mode for the loop device, write back 1104 + * all dirty data the page cache now that so that the individual I/O 1105 + * operations don't have to do that. 1106 + */ 1107 + vfs_fsync(file, 0); 1108 1108 1109 1109 loop_update_dio(lo); 1110 1110 loop_sysfs_init(lo);
+1 -1
drivers/block/mtip32xx/mtip32xx.c
··· 2056 2056 unsigned int nents; 2057 2057 2058 2058 /* Map the scatter list for DMA access */ 2059 - nents = blk_rq_map_sg(hctx->queue, rq, command->sg); 2059 + nents = blk_rq_map_sg(rq, command->sg); 2060 2060 nents = dma_map_sg(&dd->pdev->dev, command->sg, nents, dma_dir); 2061 2061 2062 2062 prefetch(&port->flags);
+115 -66
drivers/block/null_blk/main.c
···
 NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
 NULLB_DEVICE_ATTR(fua, bool, NULL);
 NULLB_DEVICE_ATTR(rotational, bool, NULL);
+NULLB_DEVICE_ATTR(badblocks_once, bool, NULL);
+NULLB_DEVICE_ATTR(badblocks_partial_io, bool, NULL);

 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
 {
···
 		goto out;
 	/* enable badblocks */
 	cmpxchg(&t_dev->badblocks.shift, -1, 0);
-	if (buf[0] == '+')
-		ret = badblocks_set(&t_dev->badblocks, start,
-				    end - start + 1, 1);
-	else
-		ret = badblocks_clear(&t_dev->badblocks, start,
-				      end - start + 1);
-	if (ret == 0)
+	if (buf[0] == '+') {
+		if (badblocks_set(&t_dev->badblocks, start,
+				  end - start + 1, 1))
+			ret = count;
+	} else if (badblocks_clear(&t_dev->badblocks, start,
+				   end - start + 1)) {
 		ret = count;
+	}
 out:
 	kfree(orig);
 	return ret;
···
 CONFIGFS_ATTR_WO(nullb_device_, zone_offline);

 static struct configfs_attribute *nullb_device_attrs[] = {
-	&nullb_device_attr_size,
-	&nullb_device_attr_completion_nsec,
-	&nullb_device_attr_submit_queues,
-	&nullb_device_attr_poll_queues,
-	&nullb_device_attr_home_node,
-	&nullb_device_attr_queue_mode,
+	&nullb_device_attr_badblocks,
+	&nullb_device_attr_badblocks_once,
+	&nullb_device_attr_badblocks_partial_io,
+	&nullb_device_attr_blocking,
 	&nullb_device_attr_blocksize,
-	&nullb_device_attr_max_sectors,
-	&nullb_device_attr_irqmode,
+	&nullb_device_attr_cache_size,
+	&nullb_device_attr_completion_nsec,
+	&nullb_device_attr_discard,
+	&nullb_device_attr_fua,
+	&nullb_device_attr_home_node,
 	&nullb_device_attr_hw_queue_depth,
 	&nullb_device_attr_index,
-	&nullb_device_attr_blocking,
-	&nullb_device_attr_use_per_node_hctx,
-	&nullb_device_attr_power,
-	&nullb_device_attr_memory_backed,
-	&nullb_device_attr_discard,
+	&nullb_device_attr_irqmode,
+	&nullb_device_attr_max_sectors,
 	&nullb_device_attr_mbps,
-	&nullb_device_attr_cache_size,
-	&nullb_device_attr_badblocks,
-	&nullb_device_attr_zoned,
-	&nullb_device_attr_zone_size,
-	&nullb_device_attr_zone_capacity,
-	&nullb_device_attr_zone_nr_conv,
-	&nullb_device_attr_zone_max_open,
-	&nullb_device_attr_zone_max_active,
-	&nullb_device_attr_zone_append_max_sectors,
-	&nullb_device_attr_zone_readonly,
-	&nullb_device_attr_zone_offline,
-	&nullb_device_attr_zone_full,
-	&nullb_device_attr_virt_boundary,
+	&nullb_device_attr_memory_backed,
 	&nullb_device_attr_no_sched,
-	&nullb_device_attr_shared_tags,
-	&nullb_device_attr_shared_tag_bitmap,
-	&nullb_device_attr_fua,
+	&nullb_device_attr_poll_queues,
+	&nullb_device_attr_power,
+	&nullb_device_attr_queue_mode,
 	&nullb_device_attr_rotational,
+	&nullb_device_attr_shared_tag_bitmap,
+	&nullb_device_attr_shared_tags,
+	&nullb_device_attr_size,
+	&nullb_device_attr_submit_queues,
+	&nullb_device_attr_use_per_node_hctx,
+	&nullb_device_attr_virt_boundary,
+	&nullb_device_attr_zone_append_max_sectors,
+	&nullb_device_attr_zone_capacity,
+	&nullb_device_attr_zone_full,
+	&nullb_device_attr_zone_max_active,
+	&nullb_device_attr_zone_max_open,
+	&nullb_device_attr_zone_nr_conv,
+	&nullb_device_attr_zone_offline,
+	&nullb_device_attr_zone_readonly,
+	&nullb_device_attr_zone_size,
+	&nullb_device_attr_zoned,
 	NULL,
 };
···

 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-	return snprintf(page, PAGE_SIZE,
-			"badblocks,blocking,blocksize,cache_size,fua,"
-			"completion_nsec,discard,home_node,hw_queue_depth,"
-			"irqmode,max_sectors,mbps,memory_backed,no_sched,"
-			"poll_queues,power,queue_mode,shared_tag_bitmap,"
-			"shared_tags,size,submit_queues,use_per_node_hctx,"
-			"virt_boundary,zoned,zone_capacity,zone_max_active,"
-			"zone_max_open,zone_nr_conv,zone_offline,zone_readonly,"
-			"zone_size,zone_append_max_sectors,zone_full,"
-			"rotational\n");
+
+	struct configfs_attribute **entry;
+	char delimiter = ',';
+	size_t left = PAGE_SIZE;
+	size_t written = 0;
+	int ret;
+
+	for (entry = &nullb_device_attrs[0]; *entry && left > 0; entry++) {
+		if (!*(entry + 1))
+			delimiter = '\n';
+		ret = snprintf(page + written, left, "%s%c", (*entry)->ca_name,
+			       delimiter);
+		if (ret >= left) {
+			WARN_ONCE(1, "Too many null_blk features to print\n");
+			memzero_explicit(page, PAGE_SIZE);
+			return -ENOBUFS;
+		}
+		left -= ret;
+		written += ret;
+	}
+
+	return written;
 }

 CONFIGFS_ATTR_RO(memb_group_, features);
···
 	return err;
 }

-static blk_status_t null_handle_rq(struct nullb_cmd *cmd)
+/*
+ * Transfer data for the given request. The transfer size is capped with the
+ * nr_sectors argument.
+ */
+static blk_status_t null_handle_data_transfer(struct nullb_cmd *cmd,
+					      sector_t nr_sectors)
 {
 	struct request *rq = blk_mq_rq_from_pdu(cmd);
 	struct nullb *nullb = cmd->nq->dev->nullb;
 	int err = 0;
 	unsigned int len;
 	sector_t sector = blk_rq_pos(rq);
+	unsigned int max_bytes = nr_sectors << SECTOR_SHIFT;
+	unsigned int transferred_bytes = 0;
 	struct req_iterator iter;
 	struct bio_vec bvec;

 	spin_lock_irq(&nullb->lock);
 	rq_for_each_segment(bvec, rq, iter) {
 		len = bvec.bv_len;
+		if (transferred_bytes + len > max_bytes)
+			len = max_bytes - transferred_bytes;
 		err = null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
 				    op_is_write(req_op(rq)), sector,
 				    rq->cmd_flags & REQ_FUA);
 		if (err)
 			break;
 		sector += len >> SECTOR_SHIFT;
+		transferred_bytes += len;
+		if (transferred_bytes >= max_bytes)
+			break;
 	}
 	spin_unlock_irq(&nullb->lock);
···
 	return sts;
 }

-static inline blk_status_t null_handle_badblocks(struct nullb_cmd *cmd,
-						 sector_t sector,
-						 sector_t nr_sectors)
+/*
+ * Check if the command should fail for the badblocks. If so, return
+ * BLK_STS_IOERR and return number of partial I/O sectors to be written or read,
+ * which may be less than the requested number of sectors.
+ *
+ * @cmd: The command to handle.
+ * @sector: The start sector for I/O.
+ * @nr_sectors: Specifies number of sectors to write or read, and returns the
+ *              number of sectors to be written or read.
+ */
+blk_status_t null_handle_badblocks(struct nullb_cmd *cmd, sector_t sector,
+				   unsigned int *nr_sectors)
 {
 	struct badblocks *bb = &cmd->nq->dev->badblocks;
-	sector_t first_bad;
-	int bad_sectors;
+	struct nullb_device *dev = cmd->nq->dev;
+	unsigned int block_sectors = dev->blocksize >> SECTOR_SHIFT;
+	sector_t first_bad, bad_sectors;
+	unsigned int partial_io_sectors = 0;

-	if (badblocks_check(bb, sector, nr_sectors, &first_bad, &bad_sectors))
-		return BLK_STS_IOERR;
+	if (!badblocks_check(bb, sector, *nr_sectors, &first_bad, &bad_sectors))
+		return BLK_STS_OK;

-	return BLK_STS_OK;
+	if (cmd->nq->dev->badblocks_once)
+		badblocks_clear(bb, first_bad, bad_sectors);
+
+	if (cmd->nq->dev->badblocks_partial_io) {
+		if (!IS_ALIGNED(first_bad, block_sectors))
+			first_bad = ALIGN_DOWN(first_bad, block_sectors);
+		if (sector < first_bad)
+			partial_io_sectors = first_bad - sector;
+	}
+	*nr_sectors = partial_io_sectors;
+
+	return BLK_STS_IOERR;
 }

-static inline blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd,
-						     enum req_op op,
-						     sector_t sector,
-						     sector_t nr_sectors)
+blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd, enum req_op op,
+				       sector_t sector, sector_t nr_sectors)
 {
 	struct nullb_device *dev = cmd->nq->dev;

 	if (op == REQ_OP_DISCARD)
 		return null_handle_discard(dev, sector, nr_sectors);

-	return null_handle_rq(cmd);
+	return null_handle_data_transfer(cmd, nr_sectors);
 }

 static void nullb_zero_read_cmd_buffer(struct nullb_cmd *cmd)
···
 			      sector_t sector, unsigned int nr_sectors)
 {
 	struct nullb_device *dev = cmd->nq->dev;
+	blk_status_t badblocks_ret = BLK_STS_OK;
 	blk_status_t ret;

-	if (dev->badblocks.shift != -1) {
-		ret = null_handle_badblocks(cmd, sector, nr_sectors);
+	if (dev->badblocks.shift != -1)
+		badblocks_ret = null_handle_badblocks(cmd, sector, &nr_sectors);
+
+	if (dev->memory_backed && nr_sectors) {
+		ret = null_handle_memory_backed(cmd, op, sector, nr_sectors);
 		if (ret != BLK_STS_OK)
 			return ret;
 	}

-	if (dev->memory_backed)
-		return null_handle_memory_backed(cmd, op, sector, nr_sectors);
-
-	return BLK_STS_OK;
+	return badblocks_ret;
 }

 static void null_handle_cmd(struct nullb_cmd *cmd, sector_t sector,
+6
drivers/block/null_blk/null_blk.h
···
 	unsigned long flags; /* device flags */
 	unsigned int curr_cache;
 	struct badblocks badblocks;
+	bool badblocks_once;
+	bool badblocks_partial_io;

 	unsigned int nr_zones;
 	unsigned int nr_zones_imp_open;
···
 			   sector_t nr_sectors);
 blk_status_t null_process_cmd(struct nullb_cmd *cmd, enum req_op op,
 			      sector_t sector, unsigned int nr_sectors);
+blk_status_t null_handle_badblocks(struct nullb_cmd *cmd, sector_t sector,
+				   unsigned int *nr_sectors);
+blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd, enum req_op op,
+				       sector_t sector, sector_t nr_sectors);

 #ifdef CONFIG_BLK_DEV_ZONED
 int null_init_zoned_dev(struct nullb_device *dev, struct queue_limits *lim);
+16 -4
drivers/block/null_blk/zoned.c
···
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
 	struct nullb_zone *zone = &dev->zones[zno];
+	blk_status_t badblocks_ret = BLK_STS_OK;
 	blk_status_t ret;

 	trace_nullb_zone_op(cmd, zno, zone->cond);
···
 		zone->cond = BLK_ZONE_COND_IMP_OPEN;
 	}

-	ret = null_process_cmd(cmd, REQ_OP_WRITE, sector, nr_sectors);
-	if (ret != BLK_STS_OK)
-		goto unlock_zone;
+	if (dev->badblocks.shift != -1) {
+		badblocks_ret = null_handle_badblocks(cmd, sector, &nr_sectors);
+		if (badblocks_ret != BLK_STS_OK && !nr_sectors) {
+			ret = badblocks_ret;
+			goto unlock_zone;
+		}
+	}
+
+	if (dev->memory_backed) {
+		ret = null_handle_memory_backed(cmd, REQ_OP_WRITE, sector,
+						nr_sectors);
+		if (ret != BLK_STS_OK)
+			goto unlock_zone;
+	}

 	zone->wp += nr_sectors;
 	if (zone->wp == zone->start + zone->capacity) {
···
 		zone->cond = BLK_ZONE_COND_FULL;
 	}

-	ret = BLK_STS_OK;
+	ret = badblocks_ret;

 unlock_zone:
 	null_unlock_zone(dev, zone);
+1 -1
drivers/block/rnbd/rnbd-clt.c
···
 	 * See queue limits.
 	 */
 	if ((req_op(rq) != REQ_OP_DISCARD) && (req_op(rq) != REQ_OP_WRITE_ZEROES))
-		sg_cnt = blk_rq_map_sg(dev->queue, rq, iu->sgt.sgl);
+		sg_cnt = blk_rq_map_sg(rq, iu->sgt.sgl);

 	if (sg_cnt == 0)
 		sg_mark_end(&iu->sgt.sgl[0]);
+1 -1
drivers/block/sunvdc.c
···
 	}

 	sg_init_table(sg, port->ring_cookies);
-	nsg = blk_rq_map_sg(req->q, req, sg);
+	nsg = blk_rq_map_sg(req, sg);

 	len = 0;
 	for (i = 0; i < nsg; i++)
+59 -56
drivers/block/ublk_drv.c
···
 /* All UBLK_PARAM_TYPE_* should be included here */
 #define UBLK_PARAM_TYPE_ALL                                \
 	(UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD | \
-	 UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED)
+	 UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED |    \
+	 UBLK_PARAM_TYPE_DMA_ALIGN)

 struct ublk_rq_data {
-	struct llist_node node;
-
 	struct kref ref;
 };
···
 	unsigned long flags;
 	struct task_struct *ubq_daemon;
 	char *io_cmd_buf;
-
-	struct llist_head io_cmds;

 	unsigned long io_addr; /* mapped vm address */
 	unsigned int max_io_sz;
···
 static DEFINE_MUTEX(ublk_ctl_mutex);

+
+#define UBLK_MAX_UBLKS UBLK_MINORS
+
 /*
- * Max ublk devices allowed to add
+ * Max unprivileged ublk devices allowed to add
  *
  * It can be extended to one per-user limit in future or even controlled
  * by cgroup.
  */
-#define UBLK_MAX_UBLKS UBLK_MINORS
-static unsigned int ublks_max = 64;
-static unsigned int ublks_added;	/* protected by ublk_ctl_mutex */
+static unsigned int unprivileged_ublks_max = 64;
+static unsigned int unprivileged_ublks_added;	/* protected by ublk_ctl_mutex */

 static struct miscdevice ublk_misc;
···
 		return ublk_dev_param_zoned_validate(ub);
 	else if (ublk_dev_is_zoned(ub))
 		return -EINVAL;
+
+	if (ub->params.types & UBLK_PARAM_TYPE_DMA_ALIGN) {
+		const struct ublk_param_dma_align *p = &ub->params.dma;
+
+		if (p->alignment >= PAGE_SIZE)
+			return -EINVAL;
+
+		if (!is_power_of_2(p->alignment + 1))
+			return -EINVAL;
+	}

 	return 0;
 }
···
 }

 /*
- * Since __ublk_rq_task_work always fails requests immediately during
+ * Since ublk_rq_task_work_cb always fails requests immediately during
  * exiting, __ublk_fail_req() is only called from abort context during
  * exiting. So lock is unnecessary.
  *
···
 	blk_mq_end_request(rq, BLK_STS_IOERR);
 }

-static inline void __ublk_rq_task_work(struct request *req,
-				       unsigned issue_flags)
+static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd,
+				 unsigned int issue_flags)
 {
-	struct ublk_queue *ubq = req->mq_hctx->driver_data;
-	int tag = req->tag;
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+	struct ublk_queue *ubq = pdu->ubq;
+	int tag = pdu->tag;
+	struct request *req = blk_mq_tag_to_rq(
+		ubq->dev->tag_set.tags[ubq->q_id], tag);
 	struct ublk_io *io = &ubq->ios[tag];
 	unsigned int mapped_bytes;
···
 	ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags);
 }

-static inline void ublk_forward_io_cmds(struct ublk_queue *ubq,
-					unsigned issue_flags)
-{
-	struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
-	struct ublk_rq_data *data, *tmp;
-
-	io_cmds = llist_reverse_order(io_cmds);
-	llist_for_each_entry_safe(data, tmp, io_cmds, node)
-		__ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags);
-}
-
-static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags)
-{
-	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
-	struct ublk_queue *ubq = pdu->ubq;
-
-	ublk_forward_io_cmds(ubq, issue_flags);
-}
-
 static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
 {
-	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct ublk_io *io = &ubq->ios[rq->tag];

-	if (llist_add(&data->node, &ubq->io_cmds)) {
-		struct ublk_io *io = &ubq->ios[rq->tag];
-
-		io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb);
-	}
+	io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb);
 }

 static enum blk_eh_timer_return ublk_timeout(struct request *rq)
···
 		struct request *rq;

 		/*
-		 * Either we fail the request or ublk_rq_task_work_fn
+		 * Either we fail the request or ublk_rq_task_work_cb
 		 * will do it
 		 */
 		rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
···
 		return -EIOCBQUEUED;

 out:
-	io_uring_cmd_done(cmd, ret, 0, issue_flags);
 	pr_devel("%s: complete: cmd op %d, tag %d ret %x io_flags %x\n",
 			__func__, cmd_op, tag, ret, io->flags);
-	return -EIOCBQUEUED;
+	return ret;
 }

 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
···
 static void ublk_ch_uring_cmd_cb(struct io_uring_cmd *cmd,
 				 unsigned int issue_flags)
 {
-	ublk_ch_uring_cmd_local(cmd, issue_flags);
+	int ret = ublk_ch_uring_cmd_local(cmd, issue_flags);
+
+	if (ret != -EIOCBQUEUED)
+		io_uring_cmd_done(cmd, ret, 0, issue_flags);
 }

 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
···
 	if (ret)
 		goto fail;

-	ublks_added++;
+	if (ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV)
+		unprivileged_ublks_added++;
 	return 0;
 fail:
 	put_device(dev);
···

 static void ublk_remove(struct ublk_device *ub)
 {
+	bool unprivileged;
+
 	ublk_stop_dev(ub);
 	cancel_work_sync(&ub->nosrv_work);
 	cdev_device_del(&ub->cdev, &ub->cdev_dev);
+	unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
 	ublk_put_device(ub);
-	ublks_added--;
+
+	if (unprivileged)
+		unprivileged_ublks_added--;
 }

 static struct ublk_device *ublk_get_device_from_id(int idx)
···

 	if (ub->params.basic.attrs & UBLK_ATTR_ROTATIONAL)
 		lim.features |= BLK_FEAT_ROTATIONAL;
+
+	if (ub->params.types & UBLK_PARAM_TYPE_DMA_ALIGN)
+		lim.dma_alignment = ub->params.dma.alignment;

 	if (wait_for_completion_interruptible(&ub->completion) != 0)
 		return -EINTR;
···
 		return ret;

 	ret = -EACCES;
-	if (ublks_added >= ublks_max)
+	if ((info.flags & UBLK_F_UNPRIVILEGED_DEV) &&
+	    unprivileged_ublks_added >= unprivileged_ublks_max)
 		goto out_unlock;

 	ret = -ENOMEM;
···
 	if (ub)
 		ublk_put_device(ub);
 out:
-	io_uring_cmd_done(cmd, ret, 0, issue_flags);
 	pr_devel("%s: cmd done ret %d cmd_op %x, dev id %d qid %d\n",
 			__func__, ret, cmd->cmd_op, header->dev_id, header->queue_id);
-	return -EIOCBQUEUED;
+	return ret;
 }

 static const struct file_operations ublk_ctl_fops = {
···
 module_init(ublk_init);
 module_exit(ublk_exit);

-static int ublk_set_max_ublks(const char *buf, const struct kernel_param *kp)
+static int ublk_set_max_unprivileged_ublks(const char *buf,
+					   const struct kernel_param *kp)
 {
 	return param_set_uint_minmax(buf, kp, 0, UBLK_MAX_UBLKS);
 }

-static int ublk_get_max_ublks(char *buf, const struct kernel_param *kp)
+static int ublk_get_max_unprivileged_ublks(char *buf,
+					   const struct kernel_param *kp)
 {
-	return sysfs_emit(buf, "%u\n", ublks_max);
+	return sysfs_emit(buf, "%u\n", unprivileged_ublks_max);
 }

-static const struct kernel_param_ops ublk_max_ublks_ops = {
-	.set = ublk_set_max_ublks,
-	.get = ublk_get_max_ublks,
+static const struct kernel_param_ops ublk_max_unprivileged_ublks_ops = {
+	.set = ublk_set_max_unprivileged_ublks,
+	.get = ublk_get_max_unprivileged_ublks,
 };

-module_param_cb(ublks_max, &ublk_max_ublks_ops, &ublks_max, 0644);
-MODULE_PARM_DESC(ublks_max, "max number of ublk devices allowed to add(default: 64)");
+module_param_cb(ublks_max, &ublk_max_unprivileged_ublks_ops,
+		&unprivileged_ublks_max, 0644);
+MODULE_PARM_DESC(ublks_max, "max number of unprivileged ublk devices allowed to add(default: 64)");

 MODULE_AUTHOR("Ming Lei <ming.lei@redhat.com>");
 MODULE_DESCRIPTION("Userspace block device");
+1 -1
drivers/block/virtio_blk.c
···
 	if (unlikely(err))
 		return -ENOMEM;

-	return blk_rq_map_sg(hctx->queue, req, vbr->sg_table.sgl);
+	return blk_rq_map_sg(req, vbr->sg_table.sgl);
 }

 static void virtblk_cleanup_cmd(struct request *req)
+1 -1
drivers/block/xen-blkfront.c
···
 	id = blkif_ring_get_request(rinfo, req, &final_ring_req);
 	ring_req = &rinfo->shadow[id].req;

-	num_sg = blk_rq_map_sg(req->q, req, rinfo->shadow[id].sg);
+	num_sg = blk_rq_map_sg(req, rinfo->shadow[id].sg);
 	num_grant = 0;
 	/* Calculate the number of grant used */
 	for_each_sg(rinfo->shadow[id].sg, sg, num_sg, i)
-12
drivers/md/dm-integrity.c
···
 		ti->error = "Cannot allocate bio set";
 		goto bad;
 	}
-	r = bioset_integrity_create(&ic->recheck_bios, RECHECK_POOL_SIZE);
-	if (r) {
-		ti->error = "Cannot allocate bio integrity set";
-		r = -ENOMEM;
-		goto bad;
-	}
 	r = bioset_init(&ic->recalc_bios, 1, 0, BIOSET_NEED_BVECS);
 	if (r) {
 		ti->error = "Cannot allocate bio set";
-		goto bad;
-	}
-	r = bioset_integrity_create(&ic->recalc_bios, 1);
-	if (r) {
-		ti->error = "Cannot allocate bio integrity set";
-		r = -ENOMEM;
 		goto bad;
 	}
 }
+1 -6
drivers/md/dm-table.c
···
 		__alignof__(struct dm_io)) + DM_IO_BIO_OFFSET;
 	if (bioset_init(&pools->io_bs, pool_size, io_front_pad, bioset_flags))
 		goto out_free_pools;
-	if (mempool_needs_integrity &&
-	    bioset_integrity_create(&pools->io_bs, pool_size))
-		goto out_free_pools;
 init_bs:
 	if (bioset_init(&pools->bs, pool_size, front_pad, 0))
-		goto out_free_pools;
-	if (mempool_needs_integrity &&
-	    bioset_integrity_create(&pools->bs, pool_size))
 		goto out_free_pools;

 	t->mempools = pools;
···
 	profile->max_dun_bytes_supported = UINT_MAX;
 	memset(profile->modes_supported, 0xFF,
 	       sizeof(profile->modes_supported));
+	profile->key_types_supported = ~0;

 	for (i = 0; i < t->num_targets; i++) {
 		struct dm_target *ti = dm_table_get_target(t, i);
+8 -6
drivers/md/md-bitmap.c
···
 #include <linux/buffer_head.h>
 #include <linux/seq_file.h>
 #include <trace/events/block.h>
+
 #include "md.h"
 #include "md-bitmap.h"
+#include "md-cluster.h"

 #define BITMAP_MAJOR_LO 3
 /* version 4 insists the bitmap is in little-endian order
···
 	struct block_device *bdev;
 	struct mddev *mddev = bitmap->mddev;
 	struct bitmap_storage *store = &bitmap->storage;
-	unsigned int bitmap_limit = (bitmap->storage.file_pages - pg_index) <<
-		PAGE_SHIFT;
+	unsigned long num_pages = bitmap->storage.file_pages;
+	unsigned int bitmap_limit = (num_pages - pg_index % num_pages) << PAGE_SHIFT;
 	loff_t sboff, offset = mddev->bitmap_info.offset;
 	sector_t ps = pg_index * PAGE_SIZE / SECTOR_SIZE;
 	unsigned int size = PAGE_SIZE;
···
 	bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
 	/* we compare length (page numbers), not page offset. */
-	if ((pg_index - store->sb_index) == store->file_pages - 1) {
+	if ((pg_index - store->sb_index) == num_pages - 1) {
 		unsigned int last_page_size = store->bytes & (PAGE_SIZE - 1);

 		if (last_page_size == 0)
···
 			bmname(bitmap), err);
 		goto out_no_sb;
 	}
-	bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev);
+	bitmap->cluster_slot = bitmap->mddev->cluster_ops->slot_number(bitmap->mddev);
 	goto re_read;
 }
···
 	sysfs_put(bitmap->sysfs_can_clear);

 	if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info &&
-	    bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev))
+	    bitmap->cluster_slot == bitmap->mddev->cluster_ops->slot_number(bitmap->mddev))
 		md_cluster_stop(bitmap->mddev);

 	/* Shouldn't be needed - but just in case.... */
···
 		mddev_create_serial_pool(mddev, rdev);

 	if (mddev_is_clustered(mddev))
-		md_cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);
+		mddev->cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);

 	/* Clear out old bitmap info first: Either there is none, or we
 	 * are resuming after someone else has possibly changed things,
+12 -6
drivers/md/md-cluster.c
···
 		struct dlm_lock_resource *bm_lockres;
 		char str[64];

-		if (i == md_cluster_ops->slot_number(mddev))
+		if (i == slot_number(mddev))
 			continue;

 		bitmap = mddev->bitmap_ops->get_from_slot(mddev, i);
···
  */
 static int cluster_check_sync_size(struct mddev *mddev)
 {
-	int current_slot = md_cluster_ops->slot_number(mddev);
+	int current_slot = slot_number(mddev);
 	int node_num = mddev->bitmap_info.nodes;
 	struct dlm_lock_resource *bm_lockres;
 	struct md_bitmap_stats stats;
···
 	return err;
 }

-static const struct md_cluster_operations cluster_ops = {
+static struct md_cluster_operations cluster_ops = {
+	.head = {
+		.type	= MD_CLUSTER,
+		.id	= ID_CLUSTER,
+		.name	= "cluster",
+		.owner	= THIS_MODULE,
+	},
+
 	.join   = join,
 	.leave  = leave,
 	.slot_number = slot_number,
···
 {
 	pr_warn("md-cluster: support raid1 and raid10 (limited support)\n");
 	pr_info("Registering Cluster MD functions\n");
-	register_md_cluster_operations(&cluster_ops, THIS_MODULE);
-	return 0;
+	return register_md_submodule(&cluster_ops.head);
 }

 static void cluster_exit(void)
 {
-	unregister_md_cluster_operations();
+	unregister_md_submodule(&cluster_ops.head);
 }

 module_init(cluster_init);
+6
drivers/md/md-cluster.h
···
 struct md_rdev;

 struct md_cluster_operations {
+	struct md_submodule_head head;
+
 	int (*join)(struct mddev *mddev, int nodes);
 	int (*leave)(struct mddev *mddev);
 	int (*slot_number)(struct mddev *mddev);
···
 	void (*unlock_all_bitmaps)(struct mddev *mddev);
 	void (*update_size)(struct mddev *mddev, sector_t old_dev_sectors);
 };
+
+extern int md_setup_cluster(struct mddev *mddev, int nodes);
+extern void md_cluster_stop(struct mddev *mddev);
+extern void md_reload_sb(struct mddev *mddev, int raid_disk);

 #endif /* _MD_CLUSTER_H */
+9 -6
drivers/md/md-linear.c
···
  */

 #include <linux/blkdev.h>
-#include <linux/raid/md_u.h>
 #include <linux/seq_file.h>
 #include <linux/module.h>
 #include <linux/slab.h>
···
 }

 static struct md_personality linear_personality = {
-	.name		= "linear",
-	.level		= LEVEL_LINEAR,
-	.owner		= THIS_MODULE,
+	.head = {
+		.type	= MD_PERSONALITY,
+		.id	= ID_LINEAR,
+		.name	= "linear",
+		.owner	= THIS_MODULE,
+	},
+
 	.make_request	= linear_make_request,
 	.run		= linear_run,
 	.free		= linear_free,
···

 static int __init linear_init(void)
 {
-	return register_md_personality(&linear_personality);
+	return register_md_submodule(&linear_personality.head);
 }

 static void linear_exit(void)
 {
-	unregister_md_personality(&linear_personality);
+	unregister_md_submodule(&linear_personality.head);
 }

 module_init(linear_init);
+174 -190
drivers/md/md.c
···
 	[ACTION_IDLE]		= "idle",
 };

-/* pers_list is a list of registered personalities protected by pers_lock. */
-static LIST_HEAD(pers_list);
-static DEFINE_SPINLOCK(pers_lock);
+static DEFINE_XARRAY(md_submodule);

 static const struct kobj_type md_ktype;
-
-const struct md_cluster_operations *md_cluster_ops;
-EXPORT_SYMBOL(md_cluster_ops);
-static struct module *md_cluster_mod;

 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
···
 	queue_work(md_misc_wq, &mddev->del_work);
 }

+static void mddev_put_locked(struct mddev *mddev)
+{
+	if (atomic_dec_and_test(&mddev->active))
+		__mddev_put(mddev);
+}
+
 void mddev_put(struct mddev *mddev)
 {
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
···
 }
 EXPORT_SYMBOL_GPL(md_find_rdev_rcu);

-static struct md_personality *find_pers(int level, char *clevel)
+static struct md_personality *get_pers(int level, char *clevel)
 {
-	struct md_personality *pers;
-	list_for_each_entry(pers, &pers_list, list) {
-		if (level != LEVEL_NONE && pers->level == level)
-			return pers;
-		if (strcmp(pers->name, clevel)==0)
-			return pers;
+	struct md_personality *ret = NULL;
+	struct md_submodule_head *head;
+	unsigned long i;
+
+	xa_lock(&md_submodule);
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type != MD_PERSONALITY)
+			continue;
+		if ((level != LEVEL_NONE && head->id == level) ||
+		    !strcmp(head->name, clevel)) {
+			if (try_module_get(head->owner))
+				ret = (void *)head;
+			break;
+		}
 	}
-	return NULL;
+	xa_unlock(&md_submodule);
+
+	if (!ret) {
+		if (level != LEVEL_NONE)
+			pr_warn("md: personality for level %d is not loaded!\n",
+				level);
+		else
+			pr_warn("md: personality for level %s is not loaded!\n",
+				clevel);
+	}
+
+	return ret;
+}
+
+static void put_pers(struct md_personality *pers)
+{
+	module_put(pers->head.owner);
 }

 /* return the offset of the super block in 512byte sectors */
···
 	if (!mddev->bitmap_info.file && !mddev->bitmap_info.offset)
 		return 0;
 	pr_warn("%s: bitmaps are not supported for %s\n",
-		mdname(mddev), mddev->pers->name);
+		mdname(mddev), mddev->pers->head.name);
 	return 1;
 }
 EXPORT_SYMBOL(md_check_no_bitmap);
···
 			count <<= sb->bblog_shift;
 			if (bb + 1 == 0)
 				break;
-			if (badblocks_set(&rdev->badblocks, sector, count, 1))
+			if (!badblocks_set(&rdev->badblocks, sector, count, 1))
 				return -EINVAL;
 		}
 	} else if (sb->bblog_offset != 0)
···
 		return 0; /* shouldn't register */

 	pr_debug("md: data integrity enabled on %s\n", mdname(mddev));
-	if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE) ||
-	    (mddev->level != 1 && mddev->level != 10 &&
-	     bioset_integrity_create(&mddev->io_clone_set, BIO_POOL_SIZE))) {
-		/*
-		 * No need to handle the failure of bioset_integrity_create,
-		 * because the function is called by md_run() -> pers->run(),
-		 * md_run calls bioset_exit -> bioset_integrity_free in case
-		 * of failure case.
-		 */
-		pr_err("md: failed to create integrity pool for %s\n",
-		       mdname(mddev));
-		return -EINVAL;
-	}
 	return 0;
 }
 EXPORT_SYMBOL(md_integrity_register);
···
 		force_change = 1;
 	if (test_and_clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags))
 		nospares = 1;
-	ret = md_cluster_ops->metadata_update_start(mddev);
+	ret = mddev->cluster_ops->metadata_update_start(mddev);
 	/* Has someone else has updated the sb */
 	if (!does_sb_need_changing(mddev)) {
 		if (ret == 0)
-			md_cluster_ops->metadata_update_cancel(mddev);
+			mddev->cluster_ops->metadata_update_cancel(mddev);
 		bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
 				 BIT(MD_SB_CHANGE_DEVS) |
 				 BIT(MD_SB_CHANGE_CLEAN));
···
 	/* if there was a failure, MD_SB_CHANGE_DEVS was set, and we re-write super */

 	if (mddev_is_clustered(mddev) && ret == 0)
-		md_cluster_ops->metadata_update_finish(mddev);
+		mddev->cluster_ops->metadata_update_finish(mddev);

 	if (mddev->in_sync != sync_req ||
 	    !bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
···
 	else {
 		err = 0;
 		if (mddev_is_clustered(mddev))
-			err = md_cluster_ops->remove_disk(mddev, rdev);
+			err = mddev->cluster_ops->remove_disk(mddev, rdev);

 		if (err == 0) {
 			md_kick_rdev_from_array(rdev);
···
 		 * by this node eventually
 		 */
 		if (!mddev_is_clustered(rdev->mddev) ||
-		    (err = md_cluster_ops->gather_bitmaps(rdev)) == 0) {
+		    (err = mddev->cluster_ops->gather_bitmaps(rdev)) == 0) {
 			clear_bit(Faulty, &rdev->flags);
 			err = add_bound_rdev(rdev);
 		}
···
 	spin_lock(&mddev->lock);
 	p = mddev->pers;
 	if (p)
-		ret = sprintf(page, "%s\n", p->name);
+		ret = sprintf(page, "%s\n", p->head.name);
 	else if (mddev->clevel[0])
 		ret = sprintf(page, "%s\n", mddev->clevel);
 	else if (mddev->level != LEVEL_NONE)
···
 	rv = -EINVAL;
 	if (!mddev->pers->quiesce) {
 		pr_warn("md: %s: %s does not support online personality change\n",
-			mdname(mddev), mddev->pers->name);
+			mdname(mddev), mddev->pers->head.name);
 		goto out_unlock;
 	}
···

 	if (request_module("md-%s", clevel) != 0)
 		request_module("md-level-%s", clevel);
-	spin_lock(&pers_lock);
-	pers = find_pers(level, clevel);
-	if (!pers || !try_module_get(pers->owner)) {
-		spin_unlock(&pers_lock);
-		pr_warn("md: personality %s not loaded\n", clevel);
+	pers = get_pers(level, clevel);
+	if (!pers) {
 		rv = -EINVAL;
 		goto out_unlock;
 	}
-	spin_unlock(&pers_lock);

 	if (pers == mddev->pers) {
 		/* Nothing to do! */
-		module_put(pers->owner);
+		put_pers(pers);
 		rv = len;
 		goto out_unlock;
 	}
 	if (!pers->takeover) {
-		module_put(pers->owner);
+		put_pers(pers);
 		pr_warn("md: %s: %s does not support personality takeover\n",
 			mdname(mddev), clevel);
 		rv = -EINVAL;
···
 		mddev->raid_disks -= mddev->delta_disks;
 		mddev->delta_disks = 0;
 		mddev->reshape_backwards = 0;
-		module_put(pers->owner);
+		put_pers(pers);
 		pr_warn("md: %s: %s would not accept array\n",
 			mdname(mddev), clevel);
 		rv = PTR_ERR(priv);
···
 	oldpriv = mddev->private;
 	mddev->pers = pers;
 	mddev->private = priv;
-	strscpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
+	strscpy(mddev->clevel, pers->head.name, sizeof(mddev->clevel));
 	mddev->level = mddev->new_level;
 	mddev->layout = mddev->new_layout;
 	mddev->chunk_sectors = mddev->new_chunk_sectors;
···
 		mddev->to_remove = &md_redundancy_group;
 	}

-	module_put(oldpers->owner);
+	put_pers(oldpers);

 	rdev_for_each(rdev, mddev) {
 		if (rdev->raid_disk < 0)
···

 static ssize_t serialize_policy_show(struct mddev *mddev, char *page)
 {
-	if (mddev->pers == NULL || (mddev->pers->level != 1))
+	if (mddev->pers == NULL || (mddev->pers->head.id != ID_RAID1))
 		return sprintf(page, "n/a\n");
 	else
 		return sprintf(page, "%d\n", mddev->serialize_policy);
···
 	err = mddev_suspend_and_lock(mddev);
 	if (err)
 		return err;
-	if (mddev->pers == NULL || (mddev->pers->level != 1)) {
+	if (mddev->pers == NULL || (mddev->pers->head.id != ID_RAID1)) {
 		pr_err("md: serialize_policy is only effective for raid1\n");
 		err = -EINVAL;
 		goto unlock;
···
 		goto exit_sync_set;
 	}

-	spin_lock(&pers_lock);
-	pers = find_pers(mddev->level, mddev->clevel);
-	if (!pers || !try_module_get(pers->owner)) {
-		spin_unlock(&pers_lock);
-		if (mddev->level != LEVEL_NONE)
-			pr_warn("md: personality for level %d is not loaded!\n",
-				mddev->level);
-		else
-			pr_warn("md: personality for level %s is not loaded!\n",
-				mddev->clevel);
+	pers = get_pers(mddev->level, mddev->clevel);
+	if (!pers) {
 		err = -EINVAL;
 		goto abort;
 	}
-	spin_unlock(&pers_lock);
-	if (mddev->level != pers->level) {
-		mddev->level = pers->level;
-		mddev->new_level = pers->level;
+	if (mddev->level != pers->head.id) {
+		mddev->level = pers->head.id;
+		mddev->new_level = pers->head.id;
 	}
-	strscpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
+	strscpy(mddev->clevel, pers->head.name, sizeof(mddev->clevel));

 	if (mddev->reshape_position
!= MaxSector && 6127 6111 pers->start_reshape == NULL) { 6128 6112 /* This personality cannot handle reshaping... */ 6129 - module_put(pers->owner); 6113 + put_pers(pers); 6130 6114 err = -EINVAL; 6131 6115 goto abort; 6132 6116 } ··· 6244 6246 if (mddev->private) 6245 6247 pers->free(mddev, mddev->private); 6246 6248 mddev->private = NULL; 6247 - module_put(pers->owner); 6249 + put_pers(pers); 6248 6250 mddev->bitmap_ops->destroy(mddev); 6249 6251 abort: 6250 6252 bioset_exit(&mddev->io_clone_set); ··· 6465 6467 mddev->private = NULL; 6466 6468 if (pers->sync_request && mddev->to_remove == NULL) 6467 6469 mddev->to_remove = &md_redundancy_group; 6468 - module_put(pers->owner); 6470 + put_pers(pers); 6469 6471 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 6470 6472 6471 6473 bioset_exit(&mddev->bio_set); ··· 6981 6983 set_bit(Candidate, &rdev->flags); 6982 6984 else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) { 6983 6985 /* --add initiated by this node */ 6984 - err = md_cluster_ops->add_new_disk(mddev, rdev); 6986 + err = mddev->cluster_ops->add_new_disk(mddev, rdev); 6985 6987 if (err) { 6986 6988 export_rdev(rdev, mddev); 6987 6989 return err; ··· 6998 7000 if (mddev_is_clustered(mddev)) { 6999 7001 if (info->state & (1 << MD_DISK_CANDIDATE)) { 7000 7002 if (!err) { 7001 - err = md_cluster_ops->new_disk_ack(mddev, 7002 - err == 0); 7003 + err = mddev->cluster_ops->new_disk_ack( 7004 + mddev, err == 0); 7003 7005 if (err) 7004 7006 md_kick_rdev_from_array(rdev); 7005 7007 } 7006 7008 } else { 7007 7009 if (err) 7008 - md_cluster_ops->add_new_disk_cancel(mddev); 7010 + mddev->cluster_ops->add_new_disk_cancel(mddev); 7009 7011 else 7010 7012 err = add_bound_rdev(rdev); 7011 7013 } ··· 7085 7087 goto busy; 7086 7088 7087 7089 kick_rdev: 7088 - if (mddev_is_clustered(mddev)) { 7089 - if (md_cluster_ops->remove_disk(mddev, rdev)) 7090 - goto busy; 7091 - } 7090 + if (mddev_is_clustered(mddev) && 7091 + mddev->cluster_ops->remove_disk(mddev, rdev)) 7092 + goto 
busy; 7092 7093 7093 7094 md_kick_rdev_from_array(rdev); 7094 7095 set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); ··· 7390 7393 rv = mddev->pers->resize(mddev, num_sectors); 7391 7394 if (!rv) { 7392 7395 if (mddev_is_clustered(mddev)) 7393 - md_cluster_ops->update_size(mddev, old_dev_sectors); 7396 + mddev->cluster_ops->update_size(mddev, old_dev_sectors); 7394 7397 else if (!mddev_is_dm(mddev)) 7395 7398 set_capacity_and_notify(mddev->gendisk, 7396 7399 mddev->array_sectors); ··· 7436 7439 mddev->reshape_backwards = 0; 7437 7440 } 7438 7441 return rv; 7442 + } 7443 + 7444 + static int get_cluster_ops(struct mddev *mddev) 7445 + { 7446 + xa_lock(&md_submodule); 7447 + mddev->cluster_ops = xa_load(&md_submodule, ID_CLUSTER); 7448 + if (mddev->cluster_ops && 7449 + !try_module_get(mddev->cluster_ops->head.owner)) 7450 + mddev->cluster_ops = NULL; 7451 + xa_unlock(&md_submodule); 7452 + 7453 + return mddev->cluster_ops == NULL ? -ENOENT : 0; 7454 + } 7455 + 7456 + static void put_cluster_ops(struct mddev *mddev) 7457 + { 7458 + if (!mddev->cluster_ops) 7459 + return; 7460 + 7461 + mddev->cluster_ops->leave(mddev); 7462 + module_put(mddev->cluster_ops->head.owner); 7463 + mddev->cluster_ops = NULL; 7439 7464 } 7440 7465 7441 7466 /* ··· 7568 7549 7569 7550 if (mddev->bitmap_info.nodes) { 7570 7551 /* hold PW on all the bitmap lock */ 7571 - if (md_cluster_ops->lock_all_bitmaps(mddev) <= 0) { 7552 + if (mddev->cluster_ops->lock_all_bitmaps(mddev) <= 0) { 7572 7553 pr_warn("md: can't change bitmap to none since the array is in use by more than one node\n"); 7573 7554 rv = -EPERM; 7574 - md_cluster_ops->unlock_all_bitmaps(mddev); 7555 + mddev->cluster_ops->unlock_all_bitmaps(mddev); 7575 7556 goto err; 7576 7557 } 7577 7558 7578 7559 mddev->bitmap_info.nodes = 0; 7579 - md_cluster_ops->leave(mddev); 7580 - module_put(md_cluster_mod); 7560 + put_cluster_ops(mddev); 7581 7561 mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY; 7582 7562 } 7583 7563 
mddev->bitmap_ops->destroy(mddev); ··· 7860 7842 7861 7843 case CLUSTERED_DISK_NACK: 7862 7844 if (mddev_is_clustered(mddev)) 7863 - md_cluster_ops->new_disk_ack(mddev, false); 7845 + mddev->cluster_ops->new_disk_ack(mddev, false); 7864 7846 else 7865 7847 err = -EINVAL; 7866 7848 goto unlock; ··· 8142 8124 return; 8143 8125 mddev->pers->error_handler(mddev, rdev); 8144 8126 8145 - if (mddev->pers->level == 0 || mddev->pers->level == LEVEL_LINEAR) 8127 + if (mddev->pers->head.id == ID_RAID0 || 8128 + mddev->pers->head.id == ID_LINEAR) 8146 8129 return; 8147 8130 8148 8131 if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags)) ··· 8181 8162 8182 8163 static void status_personalities(struct seq_file *seq) 8183 8164 { 8184 - struct md_personality *pers; 8165 + struct md_submodule_head *head; 8166 + unsigned long i; 8185 8167 8186 8168 seq_puts(seq, "Personalities : "); 8187 - spin_lock(&pers_lock); 8188 - list_for_each_entry(pers, &pers_list, list) 8189 - seq_printf(seq, "[%s] ", pers->name); 8190 8169 8191 - spin_unlock(&pers_lock); 8170 + xa_lock(&md_submodule); 8171 + xa_for_each(&md_submodule, i, head) 8172 + if (head->type == MD_PERSONALITY) 8173 + seq_printf(seq, "[%s] ", head->name); 8174 + xa_unlock(&md_submodule); 8175 + 8192 8176 seq_puts(seq, "\n"); 8193 8177 } 8194 8178 ··· 8414 8392 seq_printf(seq, " (read-only)"); 8415 8393 if (mddev->ro == MD_AUTO_READ) 8416 8394 seq_printf(seq, " (auto-read-only)"); 8417 - seq_printf(seq, " %s", mddev->pers->name); 8395 + seq_printf(seq, " %s", mddev->pers->head.name); 8418 8396 } else { 8419 8397 seq_printf(seq, "inactive"); 8420 8398 } ··· 8483 8461 if (mddev == list_last_entry(&all_mddevs, struct mddev, all_mddevs)) 8484 8462 status_unused(seq); 8485 8463 8486 - if (atomic_dec_and_test(&mddev->active)) 8487 - __mddev_put(mddev); 8488 - 8464 + mddev_put_locked(mddev); 8489 8465 return 0; 8490 8466 } 8491 8467 ··· 8534 8514 .proc_poll = mdstat_poll, 8535 8515 }; 8536 8516 8537 - int register_md_personality(struct 
md_personality *p) 8517 + int register_md_submodule(struct md_submodule_head *msh) 8538 8518 { 8539 - pr_debug("md: %s personality registered for level %d\n", 8540 - p->name, p->level); 8541 - spin_lock(&pers_lock); 8542 - list_add_tail(&p->list, &pers_list); 8543 - spin_unlock(&pers_lock); 8544 - return 0; 8519 + return xa_insert(&md_submodule, msh->id, msh, GFP_KERNEL); 8545 8520 } 8546 - EXPORT_SYMBOL(register_md_personality); 8521 + EXPORT_SYMBOL_GPL(register_md_submodule); 8547 8522 8548 - int unregister_md_personality(struct md_personality *p) 8523 + void unregister_md_submodule(struct md_submodule_head *msh) 8549 8524 { 8550 - pr_debug("md: %s personality unregistered\n", p->name); 8551 - spin_lock(&pers_lock); 8552 - list_del_init(&p->list); 8553 - spin_unlock(&pers_lock); 8554 - return 0; 8525 + xa_erase(&md_submodule, msh->id); 8555 8526 } 8556 - EXPORT_SYMBOL(unregister_md_personality); 8557 - 8558 - int register_md_cluster_operations(const struct md_cluster_operations *ops, 8559 - struct module *module) 8560 - { 8561 - int ret = 0; 8562 - spin_lock(&pers_lock); 8563 - if (md_cluster_ops != NULL) 8564 - ret = -EALREADY; 8565 - else { 8566 - md_cluster_ops = ops; 8567 - md_cluster_mod = module; 8568 - } 8569 - spin_unlock(&pers_lock); 8570 - return ret; 8571 - } 8572 - EXPORT_SYMBOL(register_md_cluster_operations); 8573 - 8574 - int unregister_md_cluster_operations(void) 8575 - { 8576 - spin_lock(&pers_lock); 8577 - md_cluster_ops = NULL; 8578 - spin_unlock(&pers_lock); 8579 - return 0; 8580 - } 8581 - EXPORT_SYMBOL(unregister_md_cluster_operations); 8527 + EXPORT_SYMBOL_GPL(unregister_md_submodule); 8582 8528 8583 8529 int md_setup_cluster(struct mddev *mddev, int nodes) 8584 8530 { 8585 - int ret; 8586 - if (!md_cluster_ops) 8587 - request_module("md-cluster"); 8588 - spin_lock(&pers_lock); 8589 - /* ensure module won't be unloaded */ 8590 - if (!md_cluster_ops || !try_module_get(md_cluster_mod)) { 8591 - pr_warn("can't find md-cluster module or get its 
reference.\n"); 8592 - spin_unlock(&pers_lock); 8593 - return -ENOENT; 8594 - } 8595 - spin_unlock(&pers_lock); 8531 + int ret = get_cluster_ops(mddev); 8596 8532 8597 - ret = md_cluster_ops->join(mddev, nodes); 8533 + if (ret) { 8534 + request_module("md-cluster"); 8535 + ret = get_cluster_ops(mddev); 8536 + } 8537 + 8538 + /* ensure module won't be unloaded */ 8539 + if (ret) { 8540 + pr_warn("can't find md-cluster module or get its reference.\n"); 8541 + return ret; 8542 + } 8543 + 8544 + ret = mddev->cluster_ops->join(mddev, nodes); 8598 8545 if (!ret) 8599 8546 mddev->safemode_delay = 0; 8600 8547 return ret; ··· 8569 8582 8570 8583 void md_cluster_stop(struct mddev *mddev) 8571 8584 { 8572 - if (!md_cluster_ops) 8573 - return; 8574 - md_cluster_ops->leave(mddev); 8575 - module_put(md_cluster_mod); 8585 + put_cluster_ops(mddev); 8576 8586 } 8577 8587 8578 8588 static int is_mddev_idle(struct mddev *mddev, int init) ··· 8962 8978 } 8963 8979 8964 8980 if (mddev_is_clustered(mddev)) { 8965 - ret = md_cluster_ops->resync_start(mddev); 8981 + ret = mddev->cluster_ops->resync_start(mddev); 8966 8982 if (ret) 8967 8983 goto skip; 8968 8984 ··· 8989 9005 * 8990 9006 */ 8991 9007 if (mddev_is_clustered(mddev)) 8992 - md_cluster_ops->resync_start_notify(mddev); 9008 + mddev->cluster_ops->resync_start_notify(mddev); 8993 9009 do { 8994 9010 int mddev2_minor = -1; 8995 9011 mddev->curr_resync = MD_RESYNC_DELAYED; ··· 9444 9460 return true; 9445 9461 } 9446 9462 9463 + /* Check if resync is in progress. */ 9464 + if (mddev->recovery_cp < MaxSector) { 9465 + set_bit(MD_RECOVERY_SYNC, &mddev->recovery); 9466 + clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9467 + return true; 9468 + } 9469 + 9447 9470 /* 9448 9471 * Remove any failed drives, then add spares if possible. Spares are 9449 9472 * also removed and re-added, to allow the personality to fail the ··· 9464 9473 9465 9474 /* Start new recovery. 
*/ 9466 9475 set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9467 - return true; 9468 - } 9469 - 9470 - /* Check if recovery is in progress. */ 9471 - if (mddev->recovery_cp < MaxSector) { 9472 - set_bit(MD_RECOVERY_SYNC, &mddev->recovery); 9473 - clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9474 9476 return true; 9475 9477 } 9476 9478 ··· 9773 9789 * call resync_finish here if MD_CLUSTER_RESYNC_LOCKED is set by 9774 9790 * clustered raid */ 9775 9791 if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags)) 9776 - md_cluster_ops->resync_finish(mddev); 9792 + mddev->cluster_ops->resync_finish(mddev); 9777 9793 clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); 9778 9794 clear_bit(MD_RECOVERY_DONE, &mddev->recovery); 9779 9795 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); ··· 9781 9797 clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); 9782 9798 clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); 9783 9799 /* 9784 - * We call md_cluster_ops->update_size here because sync_size could 9800 + * We call mddev->cluster_ops->update_size here because sync_size could 9785 9801 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared, 9786 9802 * so it is time to update size across cluster. 
9787 9803 */ 9788 9804 if (mddev_is_clustered(mddev) && is_reshaped 9789 9805 && !test_bit(MD_CLOSING, &mddev->flags)) 9790 - md_cluster_ops->update_size(mddev, old_dev_sectors); 9806 + mddev->cluster_ops->update_size(mddev, old_dev_sectors); 9791 9807 /* flag recovery needed just to double check */ 9792 9808 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 9793 9809 sysfs_notify_dirent_safe(mddev->sysfs_completed); ··· 9825 9841 9826 9842 /* Bad block management */ 9827 9843 9828 - /* Returns 1 on success, 0 on failure */ 9829 - int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 9830 - int is_new) 9844 + /* Returns true on success, false on failure */ 9845 + bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 9846 + int is_new) 9831 9847 { 9832 9848 struct mddev *mddev = rdev->mddev; 9833 - int rv; 9834 9849 9835 9850 /* 9836 9851 * Recording new badblocks for faulty rdev will force unnecessary ··· 9839 9856 * avoid it. 9840 9857 */ 9841 9858 if (test_bit(Faulty, &rdev->flags)) 9842 - return 1; 9859 + return true; 9843 9860 9844 9861 if (is_new) 9845 9862 s += rdev->new_data_offset; 9846 9863 else 9847 9864 s += rdev->data_offset; 9848 - rv = badblocks_set(&rdev->badblocks, s, sectors, 0); 9849 - if (rv == 0) { 9850 - /* Make sure they get written out promptly */ 9851 - if (test_bit(ExternalBbl, &rdev->flags)) 9852 - sysfs_notify_dirent_safe(rdev->sysfs_unack_badblocks); 9853 - sysfs_notify_dirent_safe(rdev->sysfs_state); 9854 - set_mask_bits(&mddev->sb_flags, 0, 9855 - BIT(MD_SB_CHANGE_CLEAN) | BIT(MD_SB_CHANGE_PENDING)); 9856 - md_wakeup_thread(rdev->mddev->thread); 9857 - return 1; 9858 - } else 9859 - return 0; 9865 + 9866 + if (!badblocks_set(&rdev->badblocks, s, sectors, 0)) 9867 + return false; 9868 + 9869 + /* Make sure they get written out promptly */ 9870 + if (test_bit(ExternalBbl, &rdev->flags)) 9871 + sysfs_notify_dirent_safe(rdev->sysfs_unack_badblocks); 9872 + sysfs_notify_dirent_safe(rdev->sysfs_state); 9873 + 
set_mask_bits(&mddev->sb_flags, 0, 9874 + BIT(MD_SB_CHANGE_CLEAN) | BIT(MD_SB_CHANGE_PENDING)); 9875 + md_wakeup_thread(rdev->mddev->thread); 9876 + return true; 9860 9877 } 9861 9878 EXPORT_SYMBOL_GPL(rdev_set_badblocks); 9862 9879 9863 - int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 9864 - int is_new) 9880 + void rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 9881 + int is_new) 9865 9882 { 9866 - int rv; 9867 9883 if (is_new) 9868 9884 s += rdev->new_data_offset; 9869 9885 else 9870 9886 s += rdev->data_offset; 9871 - rv = badblocks_clear(&rdev->badblocks, s, sectors); 9872 - if ((rv == 0) && test_bit(ExternalBbl, &rdev->flags)) 9887 + 9888 + if (!badblocks_clear(&rdev->badblocks, s, sectors)) 9889 + return; 9890 + 9891 + if (test_bit(ExternalBbl, &rdev->flags)) 9873 9892 sysfs_notify_dirent_safe(rdev->sysfs_badblocks); 9874 - return rv; 9875 9893 } 9876 9894 EXPORT_SYMBOL_GPL(rdev_clear_badblocks); 9877 9895 9878 9896 static int md_notify_reboot(struct notifier_block *this, 9879 9897 unsigned long code, void *x) 9880 9898 { 9881 - struct mddev *mddev, *n; 9899 + struct mddev *mddev; 9882 9900 int need_delay = 0; 9883 9901 9884 9902 spin_lock(&all_mddevs_lock); 9885 - list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) { 9903 + list_for_each_entry(mddev, &all_mddevs, all_mddevs) { 9886 9904 if (!mddev_get(mddev)) 9887 9905 continue; 9888 9906 spin_unlock(&all_mddevs_lock); ··· 9895 9911 mddev_unlock(mddev); 9896 9912 } 9897 9913 need_delay = 1; 9898 - mddev_put(mddev); 9899 9914 spin_lock(&all_mddevs_lock); 9915 + mddev_put_locked(mddev); 9900 9916 } 9901 9917 spin_unlock(&all_mddevs_lock); 9902 9918 ··· 10013 10029 if (rdev2->raid_disk == -1 && role != MD_DISK_ROLE_SPARE && 10014 10030 !(le32_to_cpu(sb->feature_map) & 10015 10031 MD_FEATURE_RESHAPE_ACTIVE) && 10016 - !md_cluster_ops->resync_status_get(mddev)) { 10032 + !mddev->cluster_ops->resync_status_get(mddev)) { 10017 10033 /* 10018 10034 * -1 to make 
raid1_add_disk() set conf->fullsync 10019 10035 * to 1. This could avoid skipping sync when the ··· 10229 10245 10230 10246 static __exit void md_exit(void) 10231 10247 { 10232 - struct mddev *mddev, *n; 10248 + struct mddev *mddev; 10233 10249 int delay = 1; 10234 10250 10235 10251 unregister_blkdev(MD_MAJOR,"md"); ··· 10250 10266 remove_proc_entry("mdstat", NULL); 10251 10267 10252 10268 spin_lock(&all_mddevs_lock); 10253 - list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) { 10269 + list_for_each_entry(mddev, &all_mddevs, all_mddevs) { 10254 10270 if (!mddev_get(mddev)) 10255 10271 continue; 10256 10272 spin_unlock(&all_mddevs_lock); ··· 10262 10278 * the mddev for destruction by a workqueue, and the 10263 10279 * destroy_workqueue() below will wait for that to complete. 10264 10280 */ 10265 - mddev_put(mddev); 10266 10281 spin_lock(&all_mddevs_lock); 10282 + mddev_put_locked(mddev); 10267 10283 } 10268 10284 spin_unlock(&all_mddevs_lock); 10269 10285
+41 -21
drivers/md/md.h
··· 18 18 #include <linux/timer.h> 19 19 #include <linux/wait.h> 20 20 #include <linux/workqueue.h> 21 + #include <linux/raid/md_u.h> 21 22 #include <trace/events/block.h> 22 - #include "md-cluster.h" 23 23 24 24 #define MaxSector (~(sector_t)0) 25 + 26 + enum md_submodule_type { 27 + MD_PERSONALITY = 0, 28 + MD_CLUSTER, 29 + MD_BITMAP, /* TODO */ 30 + }; 31 + 32 + enum md_submodule_id { 33 + ID_LINEAR = LEVEL_LINEAR, 34 + ID_RAID0 = 0, 35 + ID_RAID1 = 1, 36 + ID_RAID4 = 4, 37 + ID_RAID5 = 5, 38 + ID_RAID6 = 6, 39 + ID_RAID10 = 10, 40 + ID_CLUSTER, 41 + ID_BITMAP, /* TODO */ 42 + ID_LLBITMAP, /* TODO */ 43 + }; 44 + 45 + struct md_submodule_head { 46 + enum md_submodule_type type; 47 + enum md_submodule_id id; 48 + const char *name; 49 + struct module *owner; 50 + }; 25 51 26 52 /* 27 53 * These flags should really be called "NO_RETRY" rather than ··· 292 266 Nonrot, /* non-rotational device (SSD) */ 293 267 }; 294 268 295 - static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors, 296 - sector_t *first_bad, int *bad_sectors) 269 + static inline int is_badblock(struct md_rdev *rdev, sector_t s, sector_t sectors, 270 + sector_t *first_bad, sector_t *bad_sectors) 297 271 { 298 272 if (unlikely(rdev->badblocks.count)) { 299 273 int rv = badblocks_check(&rdev->badblocks, rdev->data_offset + s, ··· 310 284 int sectors) 311 285 { 312 286 sector_t first_bad; 313 - int bad_sectors; 287 + sector_t bad_sectors; 314 288 315 289 return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors); 316 290 } 317 291 318 - extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 319 - int is_new); 320 - extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 321 - int is_new); 292 + extern bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 293 + int is_new); 294 + extern void rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 295 + int is_new); 322 296 struct md_cluster_info; 297 + 
struct md_cluster_operations; 323 298 324 299 /** 325 300 * enum mddev_flags - md device flags. ··· 603 576 mempool_t *serial_info_pool; 604 577 void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); 605 578 struct md_cluster_info *cluster_info; 579 + struct md_cluster_operations *cluster_ops; 606 580 unsigned int good_device_nr; /* good device num within cluster raid */ 607 581 unsigned int noio_flag; /* for memalloc scope API */ 608 582 ··· 727 699 728 700 struct md_personality 729 701 { 730 - char *name; 731 - int level; 732 - struct list_head list; 733 - struct module *owner; 702 + struct md_submodule_head head; 703 + 734 704 bool __must_check (*make_request)(struct mddev *mddev, struct bio *bio); 735 705 /* 736 706 * start up works that do NOT require md_thread. tasks that ··· 869 843 if (p) put_page(p); 870 844 } 871 845 872 - extern int register_md_personality(struct md_personality *p); 873 - extern int unregister_md_personality(struct md_personality *p); 874 - extern int register_md_cluster_operations(const struct md_cluster_operations *ops, 875 - struct module *module); 876 - extern int unregister_md_cluster_operations(void); 877 - extern int md_setup_cluster(struct mddev *mddev, int nodes); 878 - extern void md_cluster_stop(struct mddev *mddev); 846 + int register_md_submodule(struct md_submodule_head *msh); 847 + void unregister_md_submodule(struct md_submodule_head *msh); 848 + 879 849 extern struct md_thread *md_register_thread( 880 850 void (*run)(struct md_thread *thread), 881 851 struct mddev *mddev, ··· 928 906 extern void md_frozen_sync_thread(struct mddev *mddev); 929 907 extern void md_unfrozen_sync_thread(struct mddev *mddev); 930 908 931 - extern void md_reload_sb(struct mddev *mddev, int raid_disk); 932 909 extern void md_update_sb(struct mddev *mddev, int force); 933 910 extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev); 934 911 extern void mddev_destroy_serial_pool(struct mddev *mddev, ··· 949 928 } 950 
929 } 951 930 952 - extern const struct md_cluster_operations *md_cluster_ops; 953 931 static inline int mddev_is_clustered(struct mddev *mddev) 954 932 { 955 933 return mddev->cluster_info && mddev->bitmap_info.nodes > 1;
+11 -7
drivers/md/raid0.c
··· 809 809 810 810 static struct md_personality raid0_personality= 811 811 { 812 - .name = "raid0", 813 - .level = 0, 814 - .owner = THIS_MODULE, 812 + .head = { 813 + .type = MD_PERSONALITY, 814 + .id = ID_RAID0, 815 + .name = "raid0", 816 + .owner = THIS_MODULE, 817 + }, 818 + 815 819 .make_request = raid0_make_request, 816 820 .run = raid0_run, 817 821 .free = raid0_free, ··· 826 822 .error_handler = raid0_error, 827 823 }; 828 824 829 - static int __init raid0_init (void) 825 + static int __init raid0_init(void) 830 826 { 831 - return register_md_personality (&raid0_personality); 827 + return register_md_submodule(&raid0_personality.head); 832 828 } 833 829 834 - static void raid0_exit (void) 830 + static void __exit raid0_exit(void) 835 831 { 836 - unregister_md_personality (&raid0_personality); 832 + unregister_md_submodule(&raid0_personality.head); 837 833 } 838 834 839 835 module_init(raid0_init);
+3 -3
drivers/md/raid1-10.c
··· 247 247 sector_t this_sector, int *len) 248 248 { 249 249 sector_t first_bad; 250 - int bad_sectors; 250 + sector_t bad_sectors; 251 251 252 252 /* no bad block overlap */ 253 253 if (!is_badblock(rdev, this_sector, *len, &first_bad, &bad_sectors)) ··· 287 287 return true; 288 288 289 289 if (mddev_is_clustered(mddev) && 290 - md_cluster_ops->area_resyncing(mddev, READ, this_sector, 291 - this_sector + len)) 290 + mddev->cluster_ops->area_resyncing(mddev, READ, this_sector, 291 + this_sector + len)) 292 292 return true; 293 293 294 294 return false;
+31 -25
drivers/md/raid1.c
··· 36 36 #include "md.h" 37 37 #include "raid1.h" 38 38 #include "md-bitmap.h" 39 + #include "md-cluster.h" 39 40 40 41 #define UNSUPPORTED_MDDEV_FLAGS \ 41 42 ((1L << MD_HAS_JOURNAL) | \ ··· 46 45 47 46 static void allow_barrier(struct r1conf *conf, sector_t sector_nr); 48 47 static void lower_barrier(struct r1conf *conf, sector_t sector_nr); 48 + static void raid1_free(struct mddev *mddev, void *priv); 49 49 50 50 #define RAID_1_10_NAME "raid1" 51 51 #include "raid1-10.c" ··· 1317 1315 struct r1conf *conf = mddev->private; 1318 1316 struct raid1_info *mirror; 1319 1317 struct bio *read_bio; 1320 - const enum req_op op = bio_op(bio); 1321 - const blk_opf_t do_sync = bio->bi_opf & REQ_SYNC; 1322 1318 int max_sectors; 1323 1319 int rdisk, error; 1324 1320 bool r1bio_existed = !!r1_bio; ··· 1404 1404 read_bio->bi_iter.bi_sector = r1_bio->sector + 1405 1405 mirror->rdev->data_offset; 1406 1406 read_bio->bi_end_io = raid1_end_read_request; 1407 - read_bio->bi_opf = op | do_sync; 1408 1407 if (test_bit(FailFast, &mirror->rdev->flags) && 1409 1408 test_bit(R1BIO_FailFast, &r1_bio->state)) 1410 1409 read_bio->bi_opf |= MD_FAILFAST; ··· 1466 1467 bool is_discard = (bio_op(bio) == REQ_OP_DISCARD); 1467 1468 1468 1469 if (mddev_is_clustered(mddev) && 1469 - md_cluster_ops->area_resyncing(mddev, WRITE, 1470 + mddev->cluster_ops->area_resyncing(mddev, WRITE, 1470 1471 bio->bi_iter.bi_sector, bio_end_sector(bio))) { 1471 1472 1472 1473 DEFINE_WAIT(w); ··· 1477 1478 for (;;) { 1478 1479 prepare_to_wait(&conf->wait_barrier, 1479 1480 &w, TASK_IDLE); 1480 - if (!md_cluster_ops->area_resyncing(mddev, WRITE, 1481 + if (!mddev->cluster_ops->area_resyncing(mddev, WRITE, 1481 1482 bio->bi_iter.bi_sector, 1482 1483 bio_end_sector(bio))) 1483 1484 break; ··· 1536 1537 atomic_inc(&rdev->nr_pending); 1537 1538 if (test_bit(WriteErrorSeen, &rdev->flags)) { 1538 1539 sector_t first_bad; 1539 - int bad_sectors; 1540 + sector_t bad_sectors; 1540 1541 int is_bad; 1541 1542 1542 1543 is_bad = 
is_badblock(rdev, r1_bio->sector, max_sectors, ··· 1652 1653 1653 1654 mbio->bi_iter.bi_sector = (r1_bio->sector + rdev->data_offset); 1654 1655 mbio->bi_end_io = raid1_end_write_request; 1655 - mbio->bi_opf = bio_op(bio) | 1656 - (bio->bi_opf & (REQ_SYNC | REQ_FUA | REQ_ATOMIC)); 1657 1656 if (test_bit(FailFast, &rdev->flags) && 1658 1657 !test_bit(WriteMostly, &rdev->flags) && 1659 1658 conf->raid_disks - mddev->degraded > 1) ··· 2483 2486 } 2484 2487 } 2485 2488 2486 - static int narrow_write_error(struct r1bio *r1_bio, int i) 2489 + static bool narrow_write_error(struct r1bio *r1_bio, int i) 2487 2490 { 2488 2491 struct mddev *mddev = r1_bio->mddev; 2489 2492 struct r1conf *conf = mddev->private; ··· 2504 2507 sector_t sector; 2505 2508 int sectors; 2506 2509 int sect_to_write = r1_bio->sectors; 2507 - int ok = 1; 2510 + bool ok = true; 2508 2511 2509 2512 if (rdev->badblocks.shift < 0) 2510 - return 0; 2513 + return false; 2511 2514 2512 2515 block_sectors = roundup(1 << rdev->badblocks.shift, 2513 2516 bdev_logical_block_size(rdev->bdev) >> 9); ··· 2883 2886 } else { 2884 2887 /* may need to read from here */ 2885 2888 sector_t first_bad = MaxSector; 2886 - int bad_sectors; 2889 + sector_t bad_sectors; 2887 2890 2888 2891 if (is_badblock(rdev, sector_nr, good_sectors, 2889 2892 &first_bad, &bad_sectors)) { ··· 3035 3038 conf->cluster_sync_low = mddev->curr_resync_completed; 3036 3039 conf->cluster_sync_high = conf->cluster_sync_low + CLUSTER_RESYNC_WINDOW_SECTORS; 3037 3040 /* Send resync message */ 3038 - md_cluster_ops->resync_info_update(mddev, 3039 - conf->cluster_sync_low, 3040 - conf->cluster_sync_high); 3041 + mddev->cluster_ops->resync_info_update(mddev, 3042 + conf->cluster_sync_low, 3043 + conf->cluster_sync_high); 3041 3044 } 3042 3045 3043 3046 /* For a user-requested sync, we read all readable devices and do a ··· 3253 3256 3254 3257 if (!mddev_is_dm(mddev)) { 3255 3258 ret = raid1_set_limits(mddev); 3256 - if (ret) 3259 + if (ret) { 3260 + if 
(!mddev->private) 3261 + raid1_free(mddev, conf); 3257 3262 return ret; 3263 + } 3258 3264 } 3259 3265 3260 3266 mddev->degraded = 0; ··· 3271 3271 */ 3272 3272 if (conf->raid_disks - mddev->degraded < 1) { 3273 3273 md_unregister_thread(mddev, &conf->thread); 3274 + if (!mddev->private) 3275 + raid1_free(mddev, conf); 3274 3276 return -EINVAL; 3275 3277 } 3276 3278 ··· 3493 3491 3494 3492 static struct md_personality raid1_personality = 3495 3493 { 3496 - .name = "raid1", 3497 - .level = 1, 3498 - .owner = THIS_MODULE, 3494 + .head = { 3495 + .type = MD_PERSONALITY, 3496 + .id = ID_RAID1, 3497 + .name = "raid1", 3498 + .owner = THIS_MODULE, 3499 + }, 3500 + 3499 3501 .make_request = raid1_make_request, 3500 3502 .run = raid1_run, 3501 3503 .free = raid1_free, ··· 3516 3510 .takeover = raid1_takeover, 3517 3511 }; 3518 3512 3519 - static int __init raid_init(void) 3513 + static int __init raid1_init(void) 3520 3514 { 3521 - return register_md_personality(&raid1_personality); 3515 + return register_md_submodule(&raid1_personality.head); 3522 3516 } 3523 3517 3524 - static void raid_exit(void) 3518 + static void __exit raid1_exit(void) 3525 3519 { 3526 - unregister_md_personality(&raid1_personality); 3520 + unregister_md_submodule(&raid1_personality.head); 3527 3521 } 3528 3522 3529 - module_init(raid_init); 3530 - module_exit(raid_exit); 3523 + module_init(raid1_init); 3524 + module_exit(raid1_exit); 3531 3525 MODULE_LICENSE("GPL"); 3532 3526 MODULE_DESCRIPTION("RAID1 (mirroring) personality for MD"); 3533 3527 MODULE_ALIAS("md-personality-3"); /* RAID1 */
+31 -35
drivers/md/raid10.c
··· 24 24 #include "raid10.h" 25 25 #include "raid0.h" 26 26 #include "md-bitmap.h" 27 + #include "md-cluster.h" 27 28 28 29 /* 29 30 * RAID10 provides a combination of RAID0 and RAID1 functionality. ··· 748 747 749 748 for (slot = 0; slot < conf->copies ; slot++) { 750 749 sector_t first_bad; 751 - int bad_sectors; 750 + sector_t bad_sectors; 752 751 sector_t dev_sector; 753 752 unsigned int pending; 754 753 bool nonrot; ··· 1147 1146 { 1148 1147 struct r10conf *conf = mddev->private; 1149 1148 struct bio *read_bio; 1150 - const enum req_op op = bio_op(bio); 1151 - const blk_opf_t do_sync = bio->bi_opf & REQ_SYNC; 1152 1149 int max_sectors; 1153 1150 struct md_rdev *rdev; 1154 1151 char b[BDEVNAME_SIZE]; ··· 1227 1228 read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr + 1228 1229 choose_data_offset(r10_bio, rdev); 1229 1230 read_bio->bi_end_io = raid10_end_read_request; 1230 - read_bio->bi_opf = op | do_sync; 1231 1231 if (test_bit(FailFast, &rdev->flags) && 1232 1232 test_bit(R10BIO_FailFast, &r10_bio->state)) 1233 1233 read_bio->bi_opf |= MD_FAILFAST; ··· 1245 1247 struct bio *bio, bool replacement, 1246 1248 int n_copy) 1247 1249 { 1248 - const enum req_op op = bio_op(bio); 1249 - const blk_opf_t do_sync = bio->bi_opf & REQ_SYNC; 1250 - const blk_opf_t do_fua = bio->bi_opf & REQ_FUA; 1251 - const blk_opf_t do_atomic = bio->bi_opf & REQ_ATOMIC; 1252 1250 unsigned long flags; 1253 1251 struct r10conf *conf = mddev->private; 1254 1252 struct md_rdev *rdev; ··· 1263 1269 mbio->bi_iter.bi_sector = (r10_bio->devs[n_copy].addr + 1264 1270 choose_data_offset(r10_bio, rdev)); 1265 1271 mbio->bi_end_io = raid10_end_write_request; 1266 - mbio->bi_opf = op | do_sync | do_fua | do_atomic; 1267 1272 if (!replacement && test_bit(FailFast, 1268 1273 &conf->mirrors[devnum].rdev->flags) 1269 1274 && enough(conf, devnum)) ··· 1348 1355 int error; 1349 1356 1350 1357 if ((mddev_is_clustered(mddev) && 1351 - md_cluster_ops->area_resyncing(mddev, WRITE, 1352 - 
bio->bi_iter.bi_sector, 1353 - bio_end_sector(bio)))) { 1358 + mddev->cluster_ops->area_resyncing(mddev, WRITE, 1359 + bio->bi_iter.bi_sector, 1360 + bio_end_sector(bio)))) { 1354 1361 DEFINE_WAIT(w); 1355 1362 /* Bail out if REQ_NOWAIT is set for the bio */ 1356 1363 if (bio->bi_opf & REQ_NOWAIT) { ··· 1360 1367 for (;;) { 1361 1368 prepare_to_wait(&conf->wait_barrier, 1362 1369 &w, TASK_IDLE); 1363 - if (!md_cluster_ops->area_resyncing(mddev, WRITE, 1370 + if (!mddev->cluster_ops->area_resyncing(mddev, WRITE, 1364 1371 bio->bi_iter.bi_sector, bio_end_sector(bio))) 1365 1372 break; 1366 1373 schedule(); ··· 1431 1438 if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) { 1432 1439 sector_t first_bad; 1433 1440 sector_t dev_sector = r10_bio->devs[i].addr; 1434 - int bad_sectors; 1441 + sector_t bad_sectors; 1435 1442 int is_bad; 1436 1443 1437 1444 is_bad = is_badblock(rdev, dev_sector, max_sectors, ··· 1624 1631 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) 1625 1632 return -EAGAIN; 1626 1633 1627 - if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT)) { 1634 + if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) { 1628 1635 bio_wouldblock_error(bio); 1629 1636 return 0; 1630 1637 } 1631 - wait_barrier(conf, false); 1632 1638 1633 1639 /* 1634 1640 * Check reshape again to avoid reshape happens after checking ··· 2778 2786 } 2779 2787 } 2780 2788 2781 - static int narrow_write_error(struct r10bio *r10_bio, int i) 2789 + static bool narrow_write_error(struct r10bio *r10_bio, int i) 2782 2790 { 2783 2791 struct bio *bio = r10_bio->master_bio; 2784 2792 struct mddev *mddev = r10_bio->mddev; ··· 2799 2807 sector_t sector; 2800 2808 int sectors; 2801 2809 int sect_to_write = r10_bio->sectors; 2802 - int ok = 1; 2810 + bool ok = true; 2803 2811 2804 2812 if (rdev->badblocks.shift < 0) 2805 - return 0; 2813 + return false; 2806 2814 2807 2815 block_sectors = roundup(1 << rdev->badblocks.shift, 2808 2816 bdev_logical_block_size(rdev->bdev) >> 9); ··· 3405 3413 sector_t 
from_addr, to_addr; 3406 3414 struct md_rdev *rdev = conf->mirrors[d].rdev; 3407 3415 sector_t sector, first_bad; 3408 - int bad_sectors; 3416 + sector_t bad_sectors; 3409 3417 if (!rdev || 3410 3418 !test_bit(In_sync, &rdev->flags)) 3411 3419 continue; ··· 3601 3609 for (i = 0; i < conf->copies; i++) { 3602 3610 int d = r10_bio->devs[i].devnum; 3603 3611 sector_t first_bad, sector; 3604 - int bad_sectors; 3612 + sector_t bad_sectors; 3605 3613 struct md_rdev *rdev; 3606 3614 3607 3615 if (r10_bio->devs[i].repl_bio) ··· 3708 3716 conf->cluster_sync_low = mddev->curr_resync_completed; 3709 3717 raid10_set_cluster_sync_high(conf); 3710 3718 /* Send resync message */ 3711 - md_cluster_ops->resync_info_update(mddev, 3719 + mddev->cluster_ops->resync_info_update(mddev, 3712 3720 conf->cluster_sync_low, 3713 3721 conf->cluster_sync_high); 3714 3722 } ··· 3741 3749 } 3742 3750 if (broadcast_msg) { 3743 3751 raid10_set_cluster_sync_high(conf); 3744 - md_cluster_ops->resync_info_update(mddev, 3752 + mddev->cluster_ops->resync_info_update(mddev, 3745 3753 conf->cluster_sync_low, 3746 3754 conf->cluster_sync_high); 3747 3755 } ··· 4533 4541 if (ret) 4534 4542 goto abort; 4535 4543 4536 - ret = md_cluster_ops->resize_bitmaps(mddev, newsize, oldsize); 4544 + ret = mddev->cluster_ops->resize_bitmaps(mddev, newsize, oldsize); 4537 4545 if (ret) { 4538 4546 mddev->bitmap_ops->resize(mddev, oldsize, 0, false); 4539 4547 goto abort; ··· 4824 4832 conf->cluster_sync_low = sb_reshape_pos; 4825 4833 } 4826 4834 4827 - md_cluster_ops->resync_info_update(mddev, conf->cluster_sync_low, 4835 + mddev->cluster_ops->resync_info_update(mddev, conf->cluster_sync_low, 4828 4836 conf->cluster_sync_high); 4829 4837 } 4830 4838 ··· 4969 4977 struct r10conf *conf = mddev->private; 4970 4978 sector_t lo, hi; 4971 4979 4972 - md_cluster_ops->resync_info_get(mddev, &lo, &hi); 4980 + mddev->cluster_ops->resync_info_get(mddev, &lo, &hi); 4973 4981 if (((mddev->reshape_position <= hi) && 
(mddev->reshape_position >= lo)) 4974 4982 || mddev->reshape_position == MaxSector) 4975 4983 conf->reshape_progress = mddev->reshape_position; ··· 5115 5123 5116 5124 static struct md_personality raid10_personality = 5117 5125 { 5118 - .name = "raid10", 5119 - .level = 10, 5120 - .owner = THIS_MODULE, 5126 + .head = { 5127 + .type = MD_PERSONALITY, 5128 + .id = ID_RAID10, 5129 + .name = "raid10", 5130 + .owner = THIS_MODULE, 5131 + }, 5132 + 5121 5133 .make_request = raid10_make_request, 5122 5134 .run = raid10_run, 5123 5135 .free = raid10_free, ··· 5141 5145 .update_reshape_pos = raid10_update_reshape_pos, 5142 5146 }; 5143 5147 5144 - static int __init raid_init(void) 5148 + static int __init raid10_init(void) 5145 5149 { 5146 - return register_md_personality(&raid10_personality); 5150 + return register_md_submodule(&raid10_personality.head); 5147 5151 } 5148 5152 5149 - static void raid_exit(void) 5153 + static void __exit raid10_exit(void) 5150 5154 { 5151 - unregister_md_personality(&raid10_personality); 5155 + unregister_md_submodule(&raid10_personality.head); 5152 5156 } 5153 5157 5154 - module_init(raid_init); 5155 - module_exit(raid_exit); 5158 + module_init(raid10_init); 5159 + module_exit(raid10_exit); 5156 5160 MODULE_LICENSE("GPL"); 5157 5161 MODULE_DESCRIPTION("RAID10 (striped mirror) personality for MD"); 5158 5162 MODULE_ALIAS("md-personality-9"); /* RAID10 */
+60 -31
drivers/md/raid5.c
··· 5858 5858 struct r5conf *conf, sector_t logical_sector) 5859 5859 { 5860 5860 sector_t reshape_progress, reshape_safe; 5861 + 5862 + if (likely(conf->reshape_progress == MaxSector)) 5863 + return LOC_NO_RESHAPE; 5861 5864 /* 5862 5865 * Spinlock is needed as reshape_progress may be 5863 5866 * 64bit on a 32bit platform, and so it might be ··· 5938 5935 const int rw = bio_data_dir(bi); 5939 5936 enum stripe_result ret; 5940 5937 struct stripe_head *sh; 5938 + enum reshape_loc loc; 5941 5939 sector_t new_sector; 5942 5940 int previous = 0, flags = 0; 5943 5941 int seq, dd_idx; 5944 5942 5945 5943 seq = read_seqcount_begin(&conf->gen_lock); 5946 - 5947 - if (unlikely(conf->reshape_progress != MaxSector)) { 5948 - enum reshape_loc loc = get_reshape_loc(mddev, conf, 5949 - logical_sector); 5950 - if (loc == LOC_INSIDE_RESHAPE) { 5951 - ret = STRIPE_SCHEDULE_AND_RETRY; 5952 - goto out; 5953 - } 5954 - if (loc == LOC_AHEAD_OF_RESHAPE) 5955 - previous = 1; 5944 + loc = get_reshape_loc(mddev, conf, logical_sector); 5945 + if (loc == LOC_INSIDE_RESHAPE) { 5946 + ret = STRIPE_SCHEDULE_AND_RETRY; 5947 + goto out; 5956 5948 } 5949 + if (loc == LOC_AHEAD_OF_RESHAPE) 5950 + previous = 1; 5957 5951 5958 5952 new_sector = raid5_compute_sector(conf, logical_sector, previous, 5959 5953 &dd_idx, NULL); ··· 6127 6127 6128 6128 /* Bail out if conflicts with reshape and REQ_NOWAIT is set */ 6129 6129 if ((bi->bi_opf & REQ_NOWAIT) && 6130 - (conf->reshape_progress != MaxSector) && 6131 6130 get_reshape_loc(mddev, conf, logical_sector) == LOC_INSIDE_RESHAPE) { 6132 6131 bio_wouldblock_error(bi); 6133 6132 if (rw == WRITE) ··· 8953 8954 8954 8955 static struct md_personality raid6_personality = 8955 8956 { 8956 - .name = "raid6", 8957 - .level = 6, 8958 - .owner = THIS_MODULE, 8957 + .head = { 8958 + .type = MD_PERSONALITY, 8959 + .id = ID_RAID6, 8960 + .name = "raid6", 8961 + .owner = THIS_MODULE, 8962 + }, 8963 + 8959 8964 .make_request = raid5_make_request, 8960 8965 .run = 
raid5_run, 8961 8966 .start = raid5_start, ··· 8983 8980 }; 8984 8981 static struct md_personality raid5_personality = 8985 8982 { 8986 - .name = "raid5", 8987 - .level = 5, 8988 - .owner = THIS_MODULE, 8983 + .head = { 8984 + .type = MD_PERSONALITY, 8985 + .id = ID_RAID5, 8986 + .name = "raid5", 8987 + .owner = THIS_MODULE, 8988 + }, 8989 + 8989 8990 .make_request = raid5_make_request, 8990 8991 .run = raid5_run, 8991 8992 .start = raid5_start, ··· 9014 9007 9015 9008 static struct md_personality raid4_personality = 9016 9009 { 9017 - .name = "raid4", 9018 - .level = 4, 9019 - .owner = THIS_MODULE, 9010 + .head = { 9011 + .type = MD_PERSONALITY, 9012 + .id = ID_RAID4, 9013 + .name = "raid4", 9014 + .owner = THIS_MODULE, 9015 + }, 9016 + 9020 9017 .make_request = raid5_make_request, 9021 9018 .run = raid5_run, 9022 9019 .start = raid5_start, ··· 9056 9045 "md/raid5:prepare", 9057 9046 raid456_cpu_up_prepare, 9058 9047 raid456_cpu_dead); 9059 - if (ret) { 9060 - destroy_workqueue(raid5_wq); 9061 - return ret; 9062 - } 9063 - register_md_personality(&raid6_personality); 9064 - register_md_personality(&raid5_personality); 9065 - register_md_personality(&raid4_personality); 9048 + if (ret) 9049 + goto err_destroy_wq; 9050 + 9051 + ret = register_md_submodule(&raid6_personality.head); 9052 + if (ret) 9053 + goto err_cpuhp_remove; 9054 + 9055 + ret = register_md_submodule(&raid5_personality.head); 9056 + if (ret) 9057 + goto err_unregister_raid6; 9058 + 9059 + ret = register_md_submodule(&raid4_personality.head); 9060 + if (ret) 9061 + goto err_unregister_raid5; 9062 + 9066 9063 return 0; 9064 + 9065 + err_unregister_raid5: 9066 + unregister_md_submodule(&raid5_personality.head); 9067 + err_unregister_raid6: 9068 + unregister_md_submodule(&raid6_personality.head); 9069 + err_cpuhp_remove: 9070 + cpuhp_remove_multi_state(CPUHP_MD_RAID5_PREPARE); 9071 + err_destroy_wq: 9072 + destroy_workqueue(raid5_wq); 9073 + return ret; 9067 9074 } 9068 9075 9069 - static void 
raid5_exit(void) 9076 + static void __exit raid5_exit(void) 9070 9077 { 9071 - unregister_md_personality(&raid6_personality); 9072 - unregister_md_personality(&raid5_personality); 9073 - unregister_md_personality(&raid4_personality); 9078 + unregister_md_submodule(&raid6_personality.head); 9079 + unregister_md_submodule(&raid5_personality.head); 9080 + unregister_md_submodule(&raid4_personality.head); 9074 9081 cpuhp_remove_multi_state(CPUHP_MD_RAID5_PREPARE); 9075 9082 destroy_workqueue(raid5_wq); 9076 9083 }
+1 -1
drivers/memstick/core/ms_block.c
··· 1904 1904 1905 1905 /* process the request */ 1906 1906 dbg_verbose("IO: processing new request"); 1907 - blk_rq_map_sg(msb->queue, req, sg); 1907 + blk_rq_map_sg(req, sg); 1908 1908 1909 1909 lba = blk_rq_pos(req); 1910 1910
+1 -3
drivers/memstick/core/mspro_block.c
··· 627 627 while (true) { 628 628 msb->current_page = 0; 629 629 msb->current_seg = 0; 630 - msb->seg_count = blk_rq_map_sg(msb->block_req->q, 631 - msb->block_req, 632 - msb->req_sg); 630 + msb->seg_count = blk_rq_map_sg(msb->block_req, msb->req_sg); 633 631 634 632 if (!msb->seg_count) { 635 633 unsigned int bytes = blk_rq_cur_bytes(msb->block_req);
+1 -1
drivers/mmc/core/queue.c
··· 523 523 { 524 524 struct request *req = mmc_queue_req_to_req(mqrq); 525 525 526 - return blk_rq_map_sg(mq->queue, req, mqrq->sg); 526 + return blk_rq_map_sg(req, mqrq->sg); 527 527 }
+5 -3
drivers/mmc/host/cqhci-crypto.c
··· 84 84 85 85 if (ccap_array[cap_idx].algorithm_id == CQHCI_CRYPTO_ALG_AES_XTS) { 86 86 /* In XTS mode, the blk_crypto_key's size is already doubled */ 87 - memcpy(cfg.crypto_key, key->raw, key->size/2); 87 + memcpy(cfg.crypto_key, key->bytes, key->size/2); 88 88 memcpy(cfg.crypto_key + CQHCI_CRYPTO_KEY_MAX_SIZE/2, 89 - key->raw + key->size/2, key->size/2); 89 + key->bytes + key->size/2, key->size/2); 90 90 } else { 91 - memcpy(cfg.crypto_key, key->raw, key->size); 91 + memcpy(cfg.crypto_key, key->bytes, key->size); 92 92 } 93 93 94 94 cqhci_crypto_program_key(cq_host, &cfg, slot); ··· 203 203 204 204 /* Unfortunately, CQHCI crypto only supports 32 DUN bits. */ 205 205 profile->max_dun_bytes_supported = 4; 206 + 207 + profile->key_types_supported = BLK_CRYPTO_KEY_TYPE_RAW; 206 208 207 209 /* 208 210 * Cache all the crypto capabilities and advertise the supported crypto
+2 -1
drivers/mmc/host/sdhci-msm.c
··· 1895 1895 1896 1896 profile->ll_ops = sdhci_msm_crypto_ops; 1897 1897 profile->max_dun_bytes_supported = 4; 1898 + profile->key_types_supported = BLK_CRYPTO_KEY_TYPE_RAW; 1898 1899 profile->dev = dev; 1899 1900 1900 1901 /* ··· 1969 1968 return qcom_ice_program_key(msm_host->ice, 1970 1969 QCOM_ICE_CRYPTO_ALG_AES_XTS, 1971 1970 QCOM_ICE_CRYPTO_KEY_SIZE_256, 1972 - key->raw, 1971 + key->bytes, 1973 1972 key->crypto_cfg.data_unit_size / 512, 1974 1973 slot); 1975 1974 }
+1 -1
drivers/mtd/ubi/block.c
··· 199 199 * and ubi_read_sg() will check that limit. 200 200 */ 201 201 ubi_sgl_init(&pdu->usgl); 202 - blk_rq_map_sg(req->q, req, pdu->usgl.sg); 202 + blk_rq_map_sg(req, pdu->usgl.sg); 203 203 204 204 while (bytes_left) { 205 205 /*
+1 -1
drivers/nvdimm/badrange.c
··· 167 167 dev_dbg(bb->dev, "Found a bad range (0x%llx, 0x%llx)\n", 168 168 (u64) s * 512, (u64) num * 512); 169 169 /* this isn't an error as the hardware will still throw an exception */ 170 - if (badblocks_set(bb, s, num, 1)) 170 + if (!badblocks_set(bb, s, num, 1)) 171 171 dev_info_once(bb->dev, "%s: failed for sector %llx\n", 172 172 __func__, (u64) s); 173 173 }
+1 -1
drivers/nvdimm/nd.h
··· 673 673 { 674 674 if (bb->count) { 675 675 sector_t first_bad; 676 - int num_bad; 676 + sector_t num_bad; 677 677 678 678 return !!badblocks_check(bb, sector, len / 512, &first_bad, 679 679 &num_bad);
+4 -3
drivers/nvdimm/pfn_devs.c
··· 367 367 struct nd_namespace_common *ndns = nd_pfn->ndns; 368 368 void *zero_page = page_address(ZERO_PAGE(0)); 369 369 struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; 370 - int num_bad, meta_num, rc, bb_present; 370 + int meta_num, rc, bb_present; 371 371 sector_t first_bad, meta_start; 372 372 struct nd_namespace_io *nsio; 373 + sector_t num_bad; 373 374 374 375 if (nd_pfn->mode != PFN_MODE_PMEM) 375 376 return 0; ··· 395 394 bb_present = badblocks_check(&nd_region->bb, meta_start, 396 395 meta_num, &first_bad, &num_bad); 397 396 if (bb_present) { 398 - dev_dbg(&nd_pfn->dev, "meta: %x badblocks at %llx\n", 397 + dev_dbg(&nd_pfn->dev, "meta: %llx badblocks at %llx\n", 399 398 num_bad, first_bad); 400 399 nsoff = ALIGN_DOWN((nd_region->ndr_start 401 400 + (first_bad << 9)) - nsio->res.start, ··· 414 413 } 415 414 if (rc) { 416 415 dev_err(&nd_pfn->dev, 417 - "error clearing %x badblocks at %llx\n", 416 + "error clearing %llx badblocks at %llx\n", 418 417 num_bad, first_bad); 419 418 return rc; 420 419 }
+1 -1
drivers/nvdimm/pmem.c
··· 249 249 unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; 250 250 struct badblocks *bb = &pmem->bb; 251 251 sector_t first_bad; 252 - int num_bad; 252 + sector_t num_bad; 253 253 254 254 if (kaddr) 255 255 *kaddr = pmem->virt_addr + offset;
+1
drivers/nvme/common/Kconfig
··· 12 12 select CRYPTO_SHA512 13 13 select CRYPTO_DH 14 14 select CRYPTO_DH_RFC7919_GROUPS 15 + select CRYPTO_HKDF
+337
drivers/nvme/common/auth.c
··· 11 11 #include <linux/unaligned.h> 12 12 #include <crypto/hash.h> 13 13 #include <crypto/dh.h> 14 + #include <crypto/hkdf.h> 14 15 #include <linux/nvme.h> 15 16 #include <linux/nvme-auth.h> 17 + 18 + #define HKDF_MAX_HASHLEN 64 16 19 17 20 static u32 nvme_dhchap_seqnum; 18 21 static DEFINE_MUTEX(nvme_dhchap_mutex); ··· 473 470 return 0; 474 471 } 475 472 EXPORT_SYMBOL_GPL(nvme_auth_generate_key); 473 + 474 + /** 475 + * nvme_auth_generate_psk - Generate a PSK for TLS 476 + * @hmac_id: Hash function identifier 477 + * @skey: Session key 478 + * @skey_len: Length of @skey 479 + * @c1: Value of challenge C1 480 + * @c2: Value of challenge C2 481 + * @hash_len: Hash length of the hash algorithm 482 + * @ret_psk: Pointer too the resulting generated PSK 483 + * @ret_len: length of @ret_psk 484 + * 485 + * Generate a PSK for TLS as specified in NVMe base specification, section 486 + * 8.13.5.9: Generated PSK for TLS 487 + * 488 + * The generated PSK for TLS shall be computed applying the HMAC function 489 + * using the hash function H( ) selected by the HashID parameter in the 490 + * DH-HMAC-CHAP_Challenge message with the session key KS as key to the 491 + * concatenation of the two challenges C1 and C2 (i.e., generated 492 + * PSK = HMAC(KS, C1 || C2)). 493 + * 494 + * Returns 0 on success with a valid generated PSK pointer in @ret_psk and 495 + * the length of @ret_psk in @ret_len, or a negative error number otherwise. 
496 + */ 497 + int nvme_auth_generate_psk(u8 hmac_id, u8 *skey, size_t skey_len, 498 + u8 *c1, u8 *c2, size_t hash_len, u8 **ret_psk, size_t *ret_len) 499 + { 500 + struct crypto_shash *tfm; 501 + SHASH_DESC_ON_STACK(shash, tfm); 502 + u8 *psk; 503 + const char *hmac_name; 504 + int ret, psk_len; 505 + 506 + if (!c1 || !c2) 507 + return -EINVAL; 508 + 509 + hmac_name = nvme_auth_hmac_name(hmac_id); 510 + if (!hmac_name) { 511 + pr_warn("%s: invalid hash algorithm %d\n", 512 + __func__, hmac_id); 513 + return -EINVAL; 514 + } 515 + 516 + tfm = crypto_alloc_shash(hmac_name, 0, 0); 517 + if (IS_ERR(tfm)) 518 + return PTR_ERR(tfm); 519 + 520 + psk_len = crypto_shash_digestsize(tfm); 521 + psk = kzalloc(psk_len, GFP_KERNEL); 522 + if (!psk) { 523 + ret = -ENOMEM; 524 + goto out_free_tfm; 525 + } 526 + 527 + shash->tfm = tfm; 528 + ret = crypto_shash_setkey(tfm, skey, skey_len); 529 + if (ret) 530 + goto out_free_psk; 531 + 532 + ret = crypto_shash_init(shash); 533 + if (ret) 534 + goto out_free_psk; 535 + 536 + ret = crypto_shash_update(shash, c1, hash_len); 537 + if (ret) 538 + goto out_free_psk; 539 + 540 + ret = crypto_shash_update(shash, c2, hash_len); 541 + if (ret) 542 + goto out_free_psk; 543 + 544 + ret = crypto_shash_final(shash, psk); 545 + if (!ret) { 546 + *ret_psk = psk; 547 + *ret_len = psk_len; 548 + } 549 + 550 + out_free_psk: 551 + if (ret) 552 + kfree_sensitive(psk); 553 + out_free_tfm: 554 + crypto_free_shash(tfm); 555 + 556 + return ret; 557 + } 558 + EXPORT_SYMBOL_GPL(nvme_auth_generate_psk); 559 + 560 + /** 561 + * nvme_auth_generate_digest - Generate TLS PSK digest 562 + * @hmac_id: Hash function identifier 563 + * @psk: Generated input PSK 564 + * @psk_len: Length of @psk 565 + * @subsysnqn: NQN of the subsystem 566 + * @hostnqn: NQN of the host 567 + * @ret_digest: Pointer to the returned digest 568 + * 569 + * Generate a TLS PSK digest as specified in TP8018 Section 3.6.1.3: 570 + * TLS PSK and PSK identity Derivation 571 + * 572 + * The PSK 
digest shall be computed by encoding in Base64 (refer to RFC 573 + * 4648) the result of the application of the HMAC function using the hash 574 + * function specified in item 4 above (ie the hash function of the cipher 575 + * suite associated with the PSK identity) with the PSK as HMAC key to the 576 + * concatenation of: 577 + * - the NQN of the host (i.e., NQNh) not including the null terminator; 578 + * - a space character; 579 + * - the NQN of the NVM subsystem (i.e., NQNc) not including the null 580 + * terminator; 581 + * - a space character; and 582 + * - the seventeen ASCII characters "NVMe-over-Fabrics" 583 + * (i.e., <PSK digest> = Base64(HMAC(PSK, NQNh || " " || NQNc || " " || 584 + * "NVMe-over-Fabrics"))). 585 + * The length of the PSK digest depends on the hash function used to compute 586 + * it as follows: 587 + * - If the SHA-256 hash function is used, the resulting PSK digest is 44 588 + * characters long; or 589 + * - If the SHA-384 hash function is used, the resulting PSK digest is 64 590 + * characters long. 591 + * 592 + * Returns 0 on success with a valid digest pointer in @ret_digest, or a 593 + * negative error number on failure. 
594 + */ 595 + int nvme_auth_generate_digest(u8 hmac_id, u8 *psk, size_t psk_len, 596 + char *subsysnqn, char *hostnqn, u8 **ret_digest) 597 + { 598 + struct crypto_shash *tfm; 599 + SHASH_DESC_ON_STACK(shash, tfm); 600 + u8 *digest, *enc; 601 + const char *hmac_name; 602 + size_t digest_len, hmac_len; 603 + int ret; 604 + 605 + if (WARN_ON(!subsysnqn || !hostnqn)) 606 + return -EINVAL; 607 + 608 + hmac_name = nvme_auth_hmac_name(hmac_id); 609 + if (!hmac_name) { 610 + pr_warn("%s: invalid hash algorithm %d\n", 611 + __func__, hmac_id); 612 + return -EINVAL; 613 + } 614 + 615 + switch (nvme_auth_hmac_hash_len(hmac_id)) { 616 + case 32: 617 + hmac_len = 44; 618 + break; 619 + case 48: 620 + hmac_len = 64; 621 + break; 622 + default: 623 + pr_warn("%s: invalid hash algorithm '%s'\n", 624 + __func__, hmac_name); 625 + return -EINVAL; 626 + } 627 + 628 + enc = kzalloc(hmac_len + 1, GFP_KERNEL); 629 + if (!enc) 630 + return -ENOMEM; 631 + 632 + tfm = crypto_alloc_shash(hmac_name, 0, 0); 633 + if (IS_ERR(tfm)) { 634 + ret = PTR_ERR(tfm); 635 + goto out_free_enc; 636 + } 637 + 638 + digest_len = crypto_shash_digestsize(tfm); 639 + digest = kzalloc(digest_len, GFP_KERNEL); 640 + if (!digest) { 641 + ret = -ENOMEM; 642 + goto out_free_tfm; 643 + } 644 + 645 + shash->tfm = tfm; 646 + ret = crypto_shash_setkey(tfm, psk, psk_len); 647 + if (ret) 648 + goto out_free_digest; 649 + 650 + ret = crypto_shash_init(shash); 651 + if (ret) 652 + goto out_free_digest; 653 + 654 + ret = crypto_shash_update(shash, hostnqn, strlen(hostnqn)); 655 + if (ret) 656 + goto out_free_digest; 657 + 658 + ret = crypto_shash_update(shash, " ", 1); 659 + if (ret) 660 + goto out_free_digest; 661 + 662 + ret = crypto_shash_update(shash, subsysnqn, strlen(subsysnqn)); 663 + if (ret) 664 + goto out_free_digest; 665 + 666 + ret = crypto_shash_update(shash, " NVMe-over-Fabrics", 18); 667 + if (ret) 668 + goto out_free_digest; 669 + 670 + ret = crypto_shash_final(shash, digest); 671 + if (ret) 672 + goto 
out_free_digest; 673 + 674 + ret = base64_encode(digest, digest_len, enc); 675 + if (ret < hmac_len) { 676 + ret = -ENOKEY; 677 + goto out_free_digest; 678 + } 679 + *ret_digest = enc; 680 + ret = 0; 681 + 682 + out_free_digest: 683 + kfree_sensitive(digest); 684 + out_free_tfm: 685 + crypto_free_shash(tfm); 686 + out_free_enc: 687 + if (ret) 688 + kfree_sensitive(enc); 689 + 690 + return ret; 691 + } 692 + EXPORT_SYMBOL_GPL(nvme_auth_generate_digest); 693 + 694 + /** 695 + * nvme_auth_derive_tls_psk - Derive TLS PSK 696 + * @hmac_id: Hash function identifier 697 + * @psk: generated input PSK 698 + * @psk_len: size of @psk 699 + * @psk_digest: TLS PSK digest 700 + * @ret_psk: Pointer to the resulting TLS PSK 701 + * 702 + * Derive a TLS PSK as specified in TP8018 Section 3.6.1.3: 703 + * TLS PSK and PSK identity Derivation 704 + * 705 + * The TLS PSK shall be derived as follows from an input PSK 706 + * (i.e., either a retained PSK or a generated PSK) and a PSK 707 + * identity using the HKDF-Extract and HKDF-Expand-Label operations 708 + * (refer to RFC 5869 and RFC 8446) where the hash function is the 709 + * one specified by the hash specifier of the PSK identity: 710 + * 1. PRK = HKDF-Extract(0, Input PSK); and 711 + * 2. TLS PSK = HKDF-Expand-Label(PRK, "nvme-tls-psk", PskIdentityContext, L), 712 + * where PskIdentityContext is the hash identifier indicated in 713 + * the PSK identity concatenated to a space character and to the 714 + * Base64 PSK digest (i.e., "<hash> <PSK digest>") and L is the 715 + * output size in bytes of the hash function (i.e., 32 for SHA-256 716 + * and 48 for SHA-384). 717 + * 718 + * Returns 0 on success with a valid psk pointer in @ret_psk or a negative 719 + * error number otherwise. 
720 + */ 721 + int nvme_auth_derive_tls_psk(int hmac_id, u8 *psk, size_t psk_len, 722 + u8 *psk_digest, u8 **ret_psk) 723 + { 724 + struct crypto_shash *hmac_tfm; 725 + const char *hmac_name; 726 + const char *psk_prefix = "tls13 nvme-tls-psk"; 727 + static const char default_salt[HKDF_MAX_HASHLEN]; 728 + size_t info_len, prk_len; 729 + char *info; 730 + unsigned char *prk, *tls_key; 731 + int ret; 732 + 733 + hmac_name = nvme_auth_hmac_name(hmac_id); 734 + if (!hmac_name) { 735 + pr_warn("%s: invalid hash algorithm %d\n", 736 + __func__, hmac_id); 737 + return -EINVAL; 738 + } 739 + if (hmac_id == NVME_AUTH_HASH_SHA512) { 740 + pr_warn("%s: unsupported hash algorithm %s\n", 741 + __func__, hmac_name); 742 + return -EINVAL; 743 + } 744 + 745 + hmac_tfm = crypto_alloc_shash(hmac_name, 0, 0); 746 + if (IS_ERR(hmac_tfm)) 747 + return PTR_ERR(hmac_tfm); 748 + 749 + prk_len = crypto_shash_digestsize(hmac_tfm); 750 + prk = kzalloc(prk_len, GFP_KERNEL); 751 + if (!prk) { 752 + ret = -ENOMEM; 753 + goto out_free_shash; 754 + } 755 + 756 + if (WARN_ON(prk_len > HKDF_MAX_HASHLEN)) { 757 + ret = -EINVAL; 758 + goto out_free_prk; 759 + } 760 + ret = hkdf_extract(hmac_tfm, psk, psk_len, 761 + default_salt, prk_len, prk); 762 + if (ret) 763 + goto out_free_prk; 764 + 765 + ret = crypto_shash_setkey(hmac_tfm, prk, prk_len); 766 + if (ret) 767 + goto out_free_prk; 768 + 769 + /* 770 + * 2 addtional bytes for the length field from HDKF-Expand-Label, 771 + * 2 addtional bytes for the HMAC ID, and one byte for the space 772 + * separator. 
773 + */ 774 + info_len = strlen(psk_digest) + strlen(psk_prefix) + 5; 775 + info = kzalloc(info_len + 1, GFP_KERNEL); 776 + if (!info) { 777 + ret = -ENOMEM; 778 + goto out_free_prk; 779 + } 780 + 781 + put_unaligned_be16(psk_len, info); 782 + memcpy(info + 2, psk_prefix, strlen(psk_prefix)); 783 + sprintf(info + 2 + strlen(psk_prefix), "%02d %s", hmac_id, psk_digest); 784 + 785 + tls_key = kzalloc(psk_len, GFP_KERNEL); 786 + if (!tls_key) { 787 + ret = -ENOMEM; 788 + goto out_free_info; 789 + } 790 + ret = hkdf_expand(hmac_tfm, info, info_len, tls_key, psk_len); 791 + if (ret) { 792 + kfree(tls_key); 793 + goto out_free_info; 794 + } 795 + *ret_psk = tls_key; 796 + 797 + out_free_info: 798 + kfree(info); 799 + out_free_prk: 800 + kfree(prk); 801 + out_free_shash: 802 + crypto_free_shash(hmac_tfm); 803 + 804 + return ret; 805 + } 806 + EXPORT_SYMBOL_GPL(nvme_auth_derive_tls_psk); 476 807 477 808 MODULE_DESCRIPTION("NVMe Authentication framework"); 478 809 MODULE_LICENSE("GPL v2");
+64 -1
drivers/nvme/common/keyring.c
··· 5 5 6 6 #include <linux/module.h> 7 7 #include <linux/seq_file.h> 8 - #include <linux/key.h> 9 8 #include <linux/key-type.h> 10 9 #include <keys/user-type.h> 11 10 #include <linux/nvme.h> ··· 122 123 123 124 return key_ref_to_ptr(keyref); 124 125 } 126 + 127 + /** 128 + * nvme_tls_psk_refresh - Refresh TLS PSK 129 + * @keyring: Keyring holding the TLS PSK 130 + * @hostnqn: Host NQN to use 131 + * @subnqn: Subsystem NQN to use 132 + * @hmac_id: Hash function identifier 133 + * @data: TLS PSK key material 134 + * @data_len: Length of @data 135 + * @digest: TLS PSK digest 136 + * 137 + * Refresh a generated version 1 TLS PSK with the identity generated 138 + * from @hmac_id, @hostnqn, @subnqn, and @digest in the keyring given 139 + * by @keyring. 140 + * 141 + * Returns the updated key success or an error pointer otherwise. 142 + */ 143 + struct key *nvme_tls_psk_refresh(struct key *keyring, 144 + const char *hostnqn, const char *subnqn, u8 hmac_id, 145 + u8 *data, size_t data_len, const char *digest) 146 + { 147 + key_perm_t keyperm = 148 + KEY_POS_SEARCH | KEY_POS_VIEW | KEY_POS_READ | 149 + KEY_POS_WRITE | KEY_POS_LINK | KEY_POS_SETATTR | 150 + KEY_USR_SEARCH | KEY_USR_VIEW | KEY_USR_READ; 151 + char *identity; 152 + key_ref_t keyref; 153 + key_serial_t keyring_id; 154 + struct key *key; 155 + 156 + if (!hostnqn || !subnqn || !data || !data_len) 157 + return ERR_PTR(-EINVAL); 158 + 159 + identity = kasprintf(GFP_KERNEL, "NVMe1G%02d %s %s %s", 160 + hmac_id, hostnqn, subnqn, digest); 161 + if (!identity) 162 + return ERR_PTR(-ENOMEM); 163 + 164 + if (!keyring) 165 + keyring = nvme_keyring; 166 + keyring_id = key_serial(keyring); 167 + pr_debug("keyring %x refresh tls psk '%s'\n", 168 + keyring_id, identity); 169 + keyref = key_create_or_update(make_key_ref(keyring, true), 170 + "psk", identity, data, data_len, 171 + keyperm, KEY_ALLOC_NOT_IN_QUOTA | 172 + KEY_ALLOC_BUILT_IN | 173 + KEY_ALLOC_BYPASS_RESTRICTION); 174 + if (IS_ERR(keyref)) { 175 + 
pr_debug("refresh tls psk '%s' failed, error %ld\n", 176 + identity, PTR_ERR(keyref)); 177 + kfree(identity); 178 + return ERR_PTR(-ENOKEY); 179 + } 180 + kfree(identity); 181 + /* 182 + * Set the default timeout to 1 hour 183 + * as suggested in TP8018. 184 + */ 185 + key = key_ref_to_ptr(keyref); 186 + key_set_timeout(key, 3600); 187 + return key; 188 + } 189 + EXPORT_SYMBOL_GPL(nvme_tls_psk_refresh); 125 190 126 191 /* 127 192 * NVMe PSK priority list
+1 -1
drivers/nvme/host/Kconfig
··· 109 109 bool "NVMe over Fabrics In-Band Authentication in host side" 110 110 depends on NVME_CORE 111 111 select NVME_AUTH 112 - select NVME_KEYRING if NVME_TCP_TLS 112 + select NVME_KEYRING 113 113 help 114 114 This provides support for NVMe over Fabrics In-Band Authentication in 115 115 host side.
+1 -1
drivers/nvme/host/apple.c
··· 525 525 if (!iod->sg) 526 526 return BLK_STS_RESOURCE; 527 527 sg_init_table(iod->sg, blk_rq_nr_phys_segments(req)); 528 - iod->nents = blk_rq_map_sg(req->q, req, iod->sg); 528 + iod->nents = blk_rq_map_sg(req, iod->sg); 529 529 if (!iod->nents) 530 530 goto out_free_sg; 531 531
+112 -3
drivers/nvme/host/auth.c
···
 #include "nvme.h"
 #include "fabrics.h"
 #include <linux/nvme-auth.h>
+#include <linux/nvme-keyring.h>
 
 #define CHAP_BUF_SIZE 4096
 static struct kmem_cache *nvme_chap_buf_cache;
···
 	data->auth_type = NVME_AUTH_COMMON_MESSAGES;
 	data->auth_id = NVME_AUTH_DHCHAP_MESSAGE_NEGOTIATE;
 	data->t_id = cpu_to_le16(chap->transaction);
-	data->sc_c = 0; /* No secure channel concatenation */
+	if (ctrl->opts->concat && chap->qid == 0) {
+		if (ctrl->opts->tls_key)
+			data->sc_c = NVME_AUTH_SECP_REPLACETLSPSK;
+		else
+			data->sc_c = NVME_AUTH_SECP_NEWTLSPSK;
+	} else
+		data->sc_c = NVME_AUTH_SECP_NOSC;
 	data->napd = 1;
 	data->auth_protocol[0].dhchap.authid = NVME_AUTH_DHCHAP_AUTH_ID;
 	data->auth_protocol[0].dhchap.halen = 3;
···
 	data->hl = chap->hash_len;
 	data->dhvlen = cpu_to_le16(chap->host_key_len);
 	memcpy(data->rval, chap->response, chap->hash_len);
-	if (ctrl->ctrl_key) {
+	if (ctrl->ctrl_key)
 		chap->bi_directional = true;
+	if (ctrl->ctrl_key || ctrl->opts->concat) {
 		get_random_bytes(chap->c2, chap->hash_len);
 		data->cvalid = 1;
 		memcpy(data->rval + chap->hash_len, chap->c2,
···
 	} else {
 		memset(chap->c2, 0, chap->hash_len);
 	}
-	chap->s2 = nvme_auth_get_seqnum();
+	if (ctrl->opts->concat)
+		chap->s2 = 0;
+	else
+		chap->s2 = nvme_auth_get_seqnum();
 	data->seqnum = cpu_to_le32(chap->s2);
 	if (chap->host_key_len) {
 		dev_dbg(ctrl->device, "%s: qid %d host public key %*ph\n",
···
 	crypto_free_kpp(chap->dh_tfm);
 }
 
+void nvme_auth_revoke_tls_key(struct nvme_ctrl *ctrl)
+{
+	dev_dbg(ctrl->device, "Wipe generated TLS PSK %08x\n",
+		key_serial(ctrl->opts->tls_key));
+	key_revoke(ctrl->opts->tls_key);
+	key_put(ctrl->opts->tls_key);
+	ctrl->opts->tls_key = NULL;
+}
+EXPORT_SYMBOL_GPL(nvme_auth_revoke_tls_key);
+
+static int nvme_auth_secure_concat(struct nvme_ctrl *ctrl,
+				   struct nvme_dhchap_queue_context *chap)
+{
+	u8 *psk, *digest, *tls_psk;
+	struct key *tls_key;
+	size_t psk_len;
+	int ret = 0;
+
+	if (!chap->sess_key) {
+		dev_warn(ctrl->device,
+			 "%s: qid %d no session key negotiated\n",
+			 __func__, chap->qid);
+		return -ENOKEY;
+	}
+
+	if (chap->qid) {
+		dev_warn(ctrl->device,
+			 "qid %d: secure concatenation not supported on I/O queues\n",
+			 chap->qid);
+		return -EINVAL;
+	}
+	ret = nvme_auth_generate_psk(chap->hash_id, chap->sess_key,
+				     chap->sess_key_len,
+				     chap->c1, chap->c2,
+				     chap->hash_len, &psk, &psk_len);
+	if (ret) {
+		dev_warn(ctrl->device,
+			 "%s: qid %d failed to generate PSK, error %d\n",
+			 __func__, chap->qid, ret);
+		return ret;
+	}
+	dev_dbg(ctrl->device,
+		"%s: generated psk %*ph\n", __func__, (int)psk_len, psk);
+
+	ret = nvme_auth_generate_digest(chap->hash_id, psk, psk_len,
+					ctrl->opts->subsysnqn,
+					ctrl->opts->host->nqn, &digest);
+	if (ret) {
+		dev_warn(ctrl->device,
+			 "%s: qid %d failed to generate digest, error %d\n",
+			 __func__, chap->qid, ret);
+		goto out_free_psk;
+	}
+	dev_dbg(ctrl->device, "%s: generated digest %s\n",
+		__func__, digest);
+	ret = nvme_auth_derive_tls_psk(chap->hash_id, psk, psk_len,
+				       digest, &tls_psk);
+	if (ret) {
+		dev_warn(ctrl->device,
+			 "%s: qid %d failed to derive TLS psk, error %d\n",
+			 __func__, chap->qid, ret);
+		goto out_free_digest;
+	}
+
+	tls_key = nvme_tls_psk_refresh(ctrl->opts->keyring,
+				       ctrl->opts->host->nqn,
+				       ctrl->opts->subsysnqn, chap->hash_id,
+				       tls_psk, psk_len, digest);
+	if (IS_ERR(tls_key)) {
+		ret = PTR_ERR(tls_key);
+		dev_warn(ctrl->device,
+			 "%s: qid %d failed to insert generated key, error %d\n",
+			 __func__, chap->qid, ret);
+		tls_key = NULL;
+	}
+	kfree_sensitive(tls_psk);
+	if (ctrl->opts->tls_key)
+		nvme_auth_revoke_tls_key(ctrl);
+	ctrl->opts->tls_key = tls_key;
+out_free_digest:
+	kfree_sensitive(digest);
+out_free_psk:
+	kfree_sensitive(psk);
+	return ret;
+}
+
 static void nvme_queue_auth_work(struct work_struct *work)
 {
 	struct nvme_dhchap_queue_context *chap =
···
 	}
 	if (!ret) {
 		chap->error = 0;
+		if (ctrl->opts->concat &&
+		    (ret = nvme_auth_secure_concat(ctrl, chap))) {
+			dev_warn(ctrl->device,
+				 "%s: qid %d failed to enable secure concatenation\n",
+				 __func__, chap->qid);
+			chap->error = ret;
+		}
 		return;
 	}
···
 			"qid 0: authentication failed\n");
 		return;
 	}
+	/*
+	 * Only run authentication on the admin queue for secure concatenation.
+	 */
+	if (ctrl->opts->concat)
+		return;
 
 	for (q = 1; q < ctrl->queue_count; q++) {
 		ret = nvme_auth_negotiate(ctrl, q);
+3
drivers/nvme/host/core.c
···
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
+
+	nvme_mpath_remove_sysfs_link(ns);
+
 	del_gendisk(ns->disk);
 
 	mutex_lock(&ns->ctrl->namespaces_lock);
+31 -3
drivers/nvme/host/fabrics.c
···
 	result = le32_to_cpu(res.u32);
 	ctrl->cntlid = result & 0xFFFF;
 	if (result & (NVME_CONNECT_AUTHREQ_ATR | NVME_CONNECT_AUTHREQ_ASCR)) {
-		/* Secure concatenation is not implemented */
-		if (result & NVME_CONNECT_AUTHREQ_ASCR) {
+		/* Check for secure concatenation */
+		if ((result & NVME_CONNECT_AUTHREQ_ASCR) &&
+		    !ctrl->opts->concat) {
 			dev_warn(ctrl->device,
 				 "qid 0: secure concatenation is not supported\n");
 			ret = -EOPNOTSUPP;
···
 		/* Secure concatenation is not implemented */
 		if (result & NVME_CONNECT_AUTHREQ_ASCR) {
 			dev_warn(ctrl->device,
-				 "qid 0: secure concatenation is not supported\n");
+				 "qid %d: secure concatenation is not supported\n", qid);
 			ret = -EOPNOTSUPP;
 			goto out_free_data;
 		}
···
 #endif
 #ifdef CONFIG_NVME_TCP_TLS
 	{ NVMF_OPT_TLS,			"tls" },
+	{ NVMF_OPT_CONCAT,		"concat" },
 #endif
 	{ NVMF_OPT_ERR,			NULL }
 };
···
 	opts->tls = false;
 	opts->tls_key = NULL;
 	opts->keyring = NULL;
+	opts->concat = false;
 
 	options = o = kstrdup(buf, GFP_KERNEL);
 	if (!options)
···
 			}
 			opts->tls = true;
 			break;
+		case NVMF_OPT_CONCAT:
+			if (!IS_ENABLED(CONFIG_NVME_TCP_TLS)) {
+				pr_err("TLS is not supported\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->concat = true;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
···
 		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
 			pr_warn("failfast tmo (%d) larger than controller loss tmo (%d)\n",
 				opts->fast_io_fail_tmo, ctrl_loss_tmo);
+	}
+	if (opts->concat) {
+		if (opts->tls) {
+			pr_err("Secure concatenation over TLS is not supported\n");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (opts->tls_key) {
+			pr_err("Cannot specify a TLS key for secure concatenation\n");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (!opts->dhchap_secret) {
+			pr_err("Need to enable DH-CHAP for secure concatenation\n");
+			ret = -EINVAL;
+			goto out;
+		}
 	}
 
 	opts->host = nvmf_host_add(hostnqn, &hostid);
+3
drivers/nvme/host/fabrics.h
···
 	NVMF_OPT_TLS		= 1 << 25,
 	NVMF_OPT_KEYRING	= 1 << 26,
 	NVMF_OPT_TLS_KEY	= 1 << 27,
+	NVMF_OPT_CONCAT		= 1 << 28,
 };
 
 /**
···
  * @keyring: Keyring to use for key lookups
  * @tls_key: TLS key for encrypted connections (TCP)
  * @tls: Start TLS encrypted connections (TCP)
+ * @concat: Enable secure channel concatenation (TCP)
  * @disable_sqflow: disable controller sq flow control
  * @hdr_digest: generate/verify header digest (TCP)
  * @data_digest: generate/verify data digest (TCP)
···
 	struct key *keyring;
 	struct key *tls_key;
 	bool tls;
+	bool concat;
 	bool disable_sqflow;
 	bool hdr_digest;
 	bool data_digest;
+3 -3
drivers/nvme/host/fc.c
···
 	if (ret)
 		return -ENOMEM;
 
-	op->nents = blk_rq_map_sg(rq->q, rq, freq->sg_table.sgl);
+	op->nents = blk_rq_map_sg(rq, freq->sg_table.sgl);
 	WARN_ON(op->nents > blk_rq_nr_phys_segments(rq));
 	freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl,
 				     op->nents, rq_dma_dir(rq));
···
 	unsigned int nr_io_queues;
 	int ret;
 
-	nr_io_queues = min(min(opts->nr_io_queues, num_online_cpus()),
+	nr_io_queues = min3(opts->nr_io_queues, num_online_cpus(),
 			ctrl->lport->ops->max_hw_queues);
 	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
 	if (ret) {
···
 	unsigned int nr_io_queues;
 	int ret;
 
-	nr_io_queues = min(min(opts->nr_io_queues, num_online_cpus()),
+	nr_io_queues = min3(opts->nr_io_queues, num_online_cpus(),
 			ctrl->lport->ops->max_hw_queues);
 	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
 	if (ret) {
+138
drivers/nvme/host/multipath.c
···
 		kblockd_schedule_work(&head->partition_scan_work);
 	}
 
+	nvme_mpath_add_sysfs_link(ns->head);
+
 	mutex_lock(&head->lock);
 	if (nvme_path_is_optimized(ns)) {
 		int node, srcu_idx;
···
 	if (nvme_state_is_live(ns->ana_state) &&
 	    nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
 		nvme_mpath_set_live(ns);
+	else {
+		/*
+		 * Add a sysfs link from the multipath head gendisk node to the
+		 * path device gendisk node.
+		 * If the path's ana state is live (i.e. the state is either
+		 * optimized or non-optimized) while we alloc the ns, then the
+		 * sysfs link would be created from nvme_mpath_set_live() and
+		 * we would not fall through to this code path. However, for a
+		 * path ana state other than live, we call nvme_mpath_set_live()
+		 * only after the ana state has transitioned to the live state.
+		 * But we still want to create the sysfs link from the head
+		 * node to a path device irrespective of the path's ana state.
+		 * If we reach here, the path's ana state is not live, but we
+		 * still create the sysfs link to this path from the head node
+		 * if the head node of the path has already come alive.
+		 */
+		if (test_bit(NVME_NSHEAD_DISK_LIVE, &ns->head->flags))
+			nvme_mpath_add_sysfs_link(ns->head);
+	}
 }
 
 static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
···
 }
 DEVICE_ATTR_RO(ana_state);
 
+static ssize_t queue_depth_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+	if (ns->head->subsys->iopolicy != NVME_IOPOLICY_QD)
+		return 0;
+
+	return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
+}
+DEVICE_ATTR_RO(queue_depth);
+
+static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
+			       char *buf)
+{
+	int node, srcu_idx;
+	nodemask_t numa_nodes;
+	struct nvme_ns *current_ns;
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_head *head = ns->head;
+
+	if (head->subsys->iopolicy != NVME_IOPOLICY_NUMA)
+		return 0;
+
+	nodes_clear(numa_nodes);
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	for_each_node(node) {
+		current_ns = srcu_dereference(head->current_path[node],
+					      &head->srcu);
+		if (ns == current_ns)
+			node_set(node, numa_nodes);
+	}
+	srcu_read_unlock(&head->srcu, srcu_idx);
+
+	return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
+}
+DEVICE_ATTR_RO(numa_nodes);
+
 static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
 				      struct nvme_ana_group_desc *desc, void *data)
 {
···
 
 	*dst = *desc;
 	return -ENXIO; /* just break out of the loop */
+}
+
+void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head)
+{
+	struct device *target;
+	int rc, srcu_idx;
+	struct nvme_ns *ns;
+	struct kobject *kobj;
+
+	/*
+	 * Ensure that the head disk node is already added, otherwise we may
+	 * get an invalid kobj for the head disk node.
+	 */
+	if (!test_bit(GD_ADDED, &head->disk->state))
+		return;
+
+	kobj = &disk_to_dev(head->disk)->kobj;
+
+	/*
+	 * Loop through each ns chained through the head->list and create the
+	 * sysfs link from the head node to the ns path node.
+	 */
+	srcu_idx = srcu_read_lock(&head->srcu);
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		/*
+		 * Avoid creating a link if it already exists for the given
+		 * path. When the path's ana state transitions from optimized
+		 * to non-optimized or vice versa, nvme_mpath_set_live() is
+		 * invoked, which in turn calls this function. Now if the
+		 * sysfs link already exists for the given path and we attempt
+		 * to re-create the link, the sysfs code would warn about it
+		 * loudly. So we evaluate the NVME_NS_SYSFS_ATTR_LINK flag
+		 * here to ensure that we're not creating a duplicate link.
+		 * The test_and_set_bit() is used because it protects against
+		 * multiple nvme paths being simultaneously added.
+		 */
+		if (test_and_set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+			continue;
+
+		/*
+		 * Ensure that the ns path disk node is already added,
+		 * otherwise we may get an invalid kobj name for the target.
+		 */
+		if (!test_bit(GD_ADDED, &ns->disk->state))
+			continue;
+
+		target = disk_to_dev(ns->disk);
+		/*
+		 * Create a sysfs link from the head gendisk kobject @kobj to
+		 * the ns path gendisk kobject @target->kobj.
+		 */
+		rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
+					     &target->kobj, dev_name(target));
+		if (unlikely(rc)) {
+			dev_err(disk_to_dev(ns->head->disk),
+				"failed to create link to %s\n",
+				dev_name(target));
+			clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+		}
+	}
+
+	srcu_read_unlock(&head->srcu, srcu_idx);
+}
+
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+	struct device *target;
+	struct kobject *kobj;
+
+	if (!test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+		return;
+
+	target = disk_to_dev(ns->disk);
+	kobj = &disk_to_dev(ns->head->disk)->kobj;
+	sysfs_remove_link_from_group(kobj, nvme_ns_mpath_attr_group.name,
+				     dev_name(target));
+	clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
 }
 
 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
+18 -4
drivers/nvme/host/nvme.h
···
 	struct nvme_ns_head *head;
 
 	unsigned long flags;
-#define NVME_NS_REMOVING	0
-#define NVME_NS_ANA_PENDING	2
-#define NVME_NS_FORCE_RO	3
-#define NVME_NS_READY		4
+#define NVME_NS_REMOVING	0
+#define NVME_NS_ANA_PENDING	2
+#define NVME_NS_FORCE_RO	3
+#define NVME_NS_READY		4
+#define NVME_NS_SYSFS_ATTR_LINK	5
 
 	struct cdev cdev;
 	struct device cdev_device;
···
 int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 
 extern const struct attribute_group *nvme_ns_attr_groups[];
+extern const struct attribute_group nvme_ns_mpath_attr_group;
 extern const struct pr_ops nvme_pr_ops;
 extern const struct block_device_operations nvme_ns_head_ops;
 extern const struct attribute_group nvme_dev_attrs_group;
···
 void nvme_failover_req(struct request *req);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head);
+void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head);
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns);
 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid);
 void nvme_mpath_remove_disk(struct nvme_ns_head *head);
 int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
···
 extern bool multipath;
 extern struct device_attribute dev_attr_ana_grpid;
 extern struct device_attribute dev_attr_ana_state;
+extern struct device_attribute dev_attr_queue_depth;
+extern struct device_attribute dev_attr_numa_nodes;
 extern struct device_attribute subsys_attr_iopolicy;
 
 static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
···
 {
 }
 static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+{
+}
+static inline void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head)
+{
+}
+static inline void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
 {
 }
 static inline bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
···
 int nvme_auth_negotiate(struct nvme_ctrl *ctrl, int qid);
 int nvme_auth_wait(struct nvme_ctrl *ctrl, int qid);
 void nvme_auth_free(struct nvme_ctrl *ctrl);
+void nvme_auth_revoke_tls_key(struct nvme_ctrl *ctrl);
 #else
 static inline int nvme_auth_init_ctrl(struct nvme_ctrl *ctrl)
 {
···
 	return -EPROTONOSUPPORT;
 }
 static inline void nvme_auth_free(struct nvme_ctrl *ctrl) {};
+static inline void nvme_auth_revoke_tls_key(struct nvme_ctrl *ctrl) {};
 #endif
 
 u32 nvme_command_effects(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+1 -4
drivers/nvme/host/pci.c
···
 	if (!iod->sgt.sgl)
 		return BLK_STS_RESOURCE;
 	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
-	iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
+	iod->sgt.orig_nents = blk_rq_map_sg(req, iod->sgt.sgl);
 	if (!iod->sgt.orig_nents)
 		goto out_free_sg;
···
 	return ret;
 }
 
-/*
- * NOTE: ns is NULL when called on the admin queue.
- */
 static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
+1 -2
drivers/nvme/host/rdma.c
···
 	if (ret)
 		return -ENOMEM;
 
-	req->data_sgl.nents = blk_rq_map_sg(rq->q, rq,
-			req->data_sgl.sg_table.sgl);
+	req->data_sgl.nents = blk_rq_map_sg(rq, req->data_sgl.sg_table.sgl);
 
 	*count = ib_dma_map_sg(ibdev, req->data_sgl.sg_table.sgl,
 			req->data_sgl.nents, rq_dma_dir(rq));
+22 -2
drivers/nvme/host/sysfs.c
···
 #ifdef CONFIG_NVME_MULTIPATH
 	&dev_attr_ana_grpid.attr,
 	&dev_attr_ana_state.attr,
+	&dev_attr_queue_depth.attr,
+	&dev_attr_numa_nodes.attr,
 #endif
 	&dev_attr_io_passthru_err_log_enabled.attr,
 	NULL,
···
 		if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
 			return 0;
 	}
+	if (a == &dev_attr_queue_depth.attr || a == &dev_attr_numa_nodes.attr) {
+		if (nvme_disk_is_ns_head(dev_to_disk(dev)))
+			return 0;
+	}
 #endif
 	return a->mode;
 }
···
 	.is_visible = nvme_ns_attrs_are_visible,
 };
 
+#ifdef CONFIG_NVME_MULTIPATH
+static struct attribute *nvme_ns_mpath_attrs[] = {
+	NULL,
+};
+
+const struct attribute_group nvme_ns_mpath_attr_group = {
+	.name = "multipath",
+	.attrs = nvme_ns_mpath_attrs,
+};
+#endif
+
 const struct attribute_group *nvme_ns_attr_groups[] = {
 	&nvme_ns_attr_group,
+#ifdef CONFIG_NVME_MULTIPATH
+	&nvme_ns_mpath_attr_group,
+#endif
 	NULL,
 };
···
 		return 0;
 
 	if (a == &dev_attr_tls_key.attr &&
-	    !ctrl->opts->tls)
+	    !ctrl->opts->tls && !ctrl->opts->concat)
 		return 0;
 	if (a == &dev_attr_tls_configured_key.attr &&
-	    !ctrl->opts->tls_key)
+	    (!ctrl->opts->tls_key || ctrl->opts->concat))
 		return 0;
 	if (a == &dev_attr_tls_keyring.attr &&
 	    !ctrl->opts->keyring)
+59 -8
drivers/nvme/host/tcp.c
···
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/err.h>
-#include <linux/key.h>
 #include <linux/nvme-tcp.h>
 #include <linux/nvme-keyring.h>
 #include <net/sock.h>
···
 	if (!IS_ENABLED(CONFIG_NVME_TCP_TLS))
 		return 0;
 
-	return ctrl->opts->tls;
+	return ctrl->opts->tls || ctrl->opts->concat;
 }
 
 static inline struct blk_mq_tags *nvme_tcp_tagset(struct nvme_tcp_queue *queue)
···
 	queue->cmnd_capsule_len = sizeof(struct nvme_command) +
 		NVME_TCP_ADMIN_CCSZ;
 
-	ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM,
+	ret = sock_create_kern(current->nsproxy->net_ns,
+			ctrl->addr.ss_family, SOCK_STREAM,
 			IPPROTO_TCP, &queue->sock);
 	if (ret) {
 		dev_err(nctrl->device,
···
 	if (nvme_tcp_tls_configured(ctrl)) {
 		if (ctrl->opts->tls_key)
 			pskid = key_serial(ctrl->opts->tls_key);
-		else {
+		else if (ctrl->opts->tls) {
 			pskid = nvme_tls_psk_default(ctrl->opts->keyring,
 						     ctrl->opts->host->nqn,
 						     ctrl->opts->subsysnqn);
···
 {
 	int i, ret;
 
-	if (nvme_tcp_tls_configured(ctrl) && !ctrl->tls_pskid) {
-		dev_err(ctrl->device, "no PSK negotiated\n");
-		return -ENOKEY;
+	if (nvme_tcp_tls_configured(ctrl)) {
+		if (ctrl->opts->concat) {
+			/*
+			 * The generated PSK is stored in the
+			 * fabric options
+			 */
+			if (!ctrl->opts->tls_key) {
+				dev_err(ctrl->device, "no PSK generated\n");
+				return -ENOKEY;
+			}
+			if (ctrl->tls_pskid &&
+			    ctrl->tls_pskid != key_serial(ctrl->opts->tls_key)) {
+				dev_err(ctrl->device, "Stale PSK id %08x\n",
+					ctrl->tls_pskid);
+				ctrl->tls_pskid = 0;
+			}
+		} else if (!ctrl->tls_pskid) {
+			dev_err(ctrl->device, "no PSK negotiated\n");
+			return -ENOKEY;
+		}
 	}
 
 	for (i = 1; i < ctrl->queue_count; i++) {
···
 		}
 	}
 
+/*
+ * The TLS key is set by secure concatenation after negotiation has been
+ * completed on the admin queue. We need to revoke the key when:
+ * - concatenation is enabled (otherwise it's a static key set by the user)
+ * and
+ * - the generated key is present in ctrl->tls_key (otherwise there's nothing
+ *   to revoke)
+ * and
+ * - a valid PSK key ID has been set in ctrl->tls_pskid (otherwise TLS
+ *   negotiation has not run).
+ *
+ * We cannot always revoke the key as nvme_tcp_alloc_admin_queue() is called
+ * twice during secure concatenation, once on a 'normal' connection to run the
+ * DH-HMAC-CHAP negotiation (which generates the key, so it _must not_ be set),
+ * and once after the negotiation (which uses the key, so it _must_ be set).
+ */
+static bool nvme_tcp_key_revoke_needed(struct nvme_ctrl *ctrl)
+{
+	return ctrl->opts->concat && ctrl->opts->tls_key && ctrl->tls_pskid;
+}
+
 static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
 {
 	struct nvmf_ctrl_options *opts = ctrl->opts;
···
 	ret = nvme_tcp_configure_admin_queue(ctrl, new);
 	if (ret)
 		return ret;
+
+	if (ctrl->opts && ctrl->opts->concat && !ctrl->tls_pskid) {
+		/* See comments for nvme_tcp_key_revoke_needed() */
+		dev_dbg(ctrl->device, "restart admin queue for secure concatenation\n");
+		nvme_stop_keep_alive(ctrl);
+		nvme_tcp_teardown_admin_queue(ctrl, false);
+		ret = nvme_tcp_configure_admin_queue(ctrl, false);
+		if (ret)
+			return ret;
+	}
 
 	if (ctrl->icdoff) {
 		ret = -EOPNOTSUPP;
···
 		struct nvme_tcp_ctrl, err_work);
 	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
 
+	if (nvme_tcp_key_revoke_needed(ctrl))
+		nvme_auth_revoke_tls_key(ctrl);
 	nvme_stop_keep_alive(ctrl);
 	flush_work(&ctrl->async_event_work);
 	nvme_tcp_teardown_io_queues(ctrl, false);
···
 		container_of(work, struct nvme_ctrl, reset_work);
 	int ret;
 
+	if (nvme_tcp_key_revoke_needed(ctrl))
+		nvme_auth_revoke_tls_key(ctrl);
 	nvme_stop_ctrl(ctrl);
 	nvme_tcp_teardown_ctrl(ctrl, false);
···
 		NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
 		NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
 		NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE | NVMF_OPT_TLS |
-		NVMF_OPT_KEYRING | NVMF_OPT_TLS_KEY,
+		NVMF_OPT_KEYRING | NVMF_OPT_TLS_KEY | NVMF_OPT_CONCAT,
 	.create_ctrl = nvme_tcp_create_ctrl,
 };
+4 -6
drivers/nvme/host/zns.c
···
 	return NULL;
 }
 
-static int nvme_zone_parse_entry(struct nvme_ctrl *ctrl,
-				 struct nvme_ns_head *head,
+static int nvme_zone_parse_entry(struct nvme_ns *ns,
 				 struct nvme_zone_descriptor *entry,
 				 unsigned int idx, report_zones_cb cb,
 				 void *data)
 {
+	struct nvme_ns_head *head = ns->head;
 	struct blk_zone zone = { };
 
 	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
-		dev_err(ctrl->device, "invalid zone type %#x\n",
-			entry->zt);
+		dev_err(ns->ctrl->device, "invalid zone type %#x\n", entry->zt);
 		return -EINVAL;
 	}
···
 		break;
 
 	for (i = 0; i < nz && zone_idx < nr_zones; i++) {
-		ret = nvme_zone_parse_entry(ns->ctrl, ns->head,
-					    &report->entries[i],
+		ret = nvme_zone_parse_entry(ns, &report->entries[i],
 					    zone_idx, cb, data);
 		if (ret)
 			goto out_free;
+71 -1
drivers/nvme/target/auth.c
···
 #include <linux/ctype.h>
 #include <linux/random.h>
 #include <linux/nvme-auth.h>
+#include <linux/nvme-keyring.h>
 #include <linux/unaligned.h>
 
 #include "nvmet.h"
···
 	return ret;
 }
 
-u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl)
+u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq)
 {
 	int ret = 0;
 	struct nvmet_host_link *p;
···
 	if (!host) {
 		pr_debug("host %s not found\n", ctrl->hostnqn);
 		ret = NVME_AUTH_DHCHAP_FAILURE_FAILED;
+		goto out_unlock;
+	}
+
+	if (nvmet_queue_tls_keyid(sq)) {
+		pr_debug("host %s tls enabled\n", ctrl->hostnqn);
 		goto out_unlock;
 	}
···
 void nvmet_auth_sq_free(struct nvmet_sq *sq)
 {
 	cancel_delayed_work(&sq->auth_expired_work);
+#ifdef CONFIG_NVME_TARGET_TCP_TLS
+	sq->tls_key = 0;
+#endif
 	kfree(sq->dhchap_c1);
 	sq->dhchap_c1 = NULL;
 	kfree(sq->dhchap_c2);
···
 		nvme_auth_free_key(ctrl->ctrl_key);
 		ctrl->ctrl_key = NULL;
 	}
+#ifdef CONFIG_NVME_TARGET_TCP_TLS
+	if (ctrl->tls_key) {
+		key_put(ctrl->tls_key);
+		ctrl->tls_key = NULL;
+	}
+#endif
 }
 
 bool nvmet_check_auth_status(struct nvmet_req *req)
···
 			req->sq->dhchap_skey);
 
 	return ret;
+}
+
+void nvmet_auth_insert_psk(struct nvmet_sq *sq)
+{
+	int hash_len = nvme_auth_hmac_hash_len(sq->ctrl->shash_id);
+	u8 *psk, *digest, *tls_psk;
+	size_t psk_len;
+	int ret;
+#ifdef CONFIG_NVME_TARGET_TCP_TLS
+	struct key *tls_key = NULL;
+#endif
+
+	ret = nvme_auth_generate_psk(sq->ctrl->shash_id,
+				     sq->dhchap_skey,
+				     sq->dhchap_skey_len,
+				     sq->dhchap_c1, sq->dhchap_c2,
+				     hash_len, &psk, &psk_len);
+	if (ret) {
+		pr_warn("%s: ctrl %d qid %d failed to generate PSK, error %d\n",
+			__func__, sq->ctrl->cntlid, sq->qid, ret);
+		return;
+	}
+	ret = nvme_auth_generate_digest(sq->ctrl->shash_id, psk, psk_len,
+					sq->ctrl->subsysnqn,
+					sq->ctrl->hostnqn, &digest);
+	if (ret) {
+		pr_warn("%s: ctrl %d qid %d failed to generate digest, error %d\n",
+			__func__, sq->ctrl->cntlid, sq->qid, ret);
+		goto out_free_psk;
+	}
+	ret = nvme_auth_derive_tls_psk(sq->ctrl->shash_id, psk, psk_len,
+				       digest, &tls_psk);
+	if (ret) {
+		pr_warn("%s: ctrl %d qid %d failed to derive TLS PSK, error %d\n",
+			__func__, sq->ctrl->cntlid, sq->qid, ret);
+		goto out_free_digest;
+	}
+#ifdef CONFIG_NVME_TARGET_TCP_TLS
+	tls_key = nvme_tls_psk_refresh(NULL, sq->ctrl->hostnqn, sq->ctrl->subsysnqn,
+				       sq->ctrl->shash_id, tls_psk, psk_len, digest);
+	if (IS_ERR(tls_key)) {
+		pr_warn("%s: ctrl %d qid %d failed to refresh key, error %ld\n",
+			__func__, sq->ctrl->cntlid, sq->qid, PTR_ERR(tls_key));
+		tls_key = NULL;
+		kfree_sensitive(tls_psk);
+	}
+	if (sq->ctrl->tls_key)
+		key_put(sq->ctrl->tls_key);
+	sq->ctrl->tls_key = tls_key;
+#endif
+
+out_free_digest:
+	kfree_sensitive(digest);
+out_free_psk:
+	kfree_sensitive(psk);
 }
+4 -5
drivers/nvme/target/core.c
···
 	}
 	ctrl->cntlid = ret;
 
-	uuid_copy(&ctrl->hostid, args->hostid);
-
 	/*
 	 * Discovery controllers may use some arbitrary high value
 	 * in order to cleanup stale discovery sessions
···
 	if (args->hostid)
 		uuid_copy(&ctrl->hostid, args->hostid);
 
-	dhchap_status = nvmet_setup_auth(ctrl);
+	dhchap_status = nvmet_setup_auth(ctrl, args->sq);
 	if (dhchap_status) {
 		pr_err("Failed to setup authentication, dhchap status %u\n",
 		       dhchap_status);
···
 
 	args->status = NVME_SC_SUCCESS;
 
-	pr_info("Created %s controller %d for subsystem %s for NQN %s%s%s.\n",
+	pr_info("Created %s controller %d for subsystem %s for NQN %s%s%s%s.\n",
 		nvmet_is_disc_subsys(ctrl->subsys) ? "discovery" : "nvm",
 		ctrl->cntlid, ctrl->subsys->subsysnqn, ctrl->hostnqn,
 		ctrl->pi_support ? " T10-PI is enabled" : "",
-		nvmet_has_auth(ctrl) ? " with DH-HMAC-CHAP" : "");
+		nvmet_has_auth(ctrl, args->sq) ? " with DH-HMAC-CHAP" : "",
+		nvmet_queue_tls_keyid(args->sq) ? ", TLS" : "");
 
 	return ctrl;
+27
drivers/nvme/target/debugfs.c
···
 }
 NVMET_DEBUGFS_ATTR(nvmet_ctrl_host_traddr);
 
+#ifdef CONFIG_NVME_TARGET_TCP_TLS
+static int nvmet_ctrl_tls_key_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+	key_serial_t keyid = nvmet_queue_tls_keyid(ctrl->sqs[0]);
+
+	seq_printf(m, "%08x\n", keyid);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_tls_key);
+
+static int nvmet_ctrl_tls_concat_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+
+	seq_printf(m, "%d\n", ctrl->concat);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_tls_concat);
+#endif
+
 int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
 {
 	char name[32];
···
 			    &nvmet_ctrl_state_fops);
 	debugfs_create_file("host_traddr", S_IRUSR, ctrl->debugfs_dir, ctrl,
 			    &nvmet_ctrl_host_traddr_fops);
+#ifdef CONFIG_NVME_TARGET_TCP_TLS
+	debugfs_create_file("tls_concat", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_tls_concat_fops);
+	debugfs_create_file("tls_key", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_tls_key_fops);
+#endif
 	return 0;
 }
+55 -7
drivers/nvme/target/fabrics-cmd-auth.c
···
 		 data->auth_protocol[0].dhchap.halen,
 		 data->auth_protocol[0].dhchap.dhlen);
 	req->sq->dhchap_tid = le16_to_cpu(data->t_id);
-	if (data->sc_c)
-		return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+	if (data->sc_c != NVME_AUTH_SECP_NOSC) {
+		if (!IS_ENABLED(CONFIG_NVME_TARGET_TCP_TLS))
+			return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+		/* Secure concatenation can only be enabled on the admin queue */
+		if (req->sq->qid)
+			return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+		switch (data->sc_c) {
+		case NVME_AUTH_SECP_NEWTLSPSK:
+			if (nvmet_queue_tls_keyid(req->sq))
+				return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+			break;
+		case NVME_AUTH_SECP_REPLACETLSPSK:
+			if (!nvmet_queue_tls_keyid(req->sq))
+				return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+			break;
+		default:
+			return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+		}
+		ctrl->concat = true;
+	}
 
 	if (data->napd != 1)
 		return NVME_AUTH_DHCHAP_FAILURE_HASH_UNUSABLE;
···
 			 nvme_auth_dhgroup_name(fallback_dhgid));
 		ctrl->dh_gid = fallback_dhgid;
 	}
+	if (ctrl->dh_gid == NVME_AUTH_DHGROUP_NULL && ctrl->concat) {
+		pr_debug("%s: ctrl %d qid %d: NULL DH group invalid "
+			 "for secure channel concatenation\n", __func__,
+			 ctrl->cntlid, req->sq->qid);
+		return NVME_AUTH_DHCHAP_FAILURE_CONCAT_MISMATCH;
+	}
 	pr_debug("%s: ctrl %d qid %d: selected DH group %s (%d)\n",
 		 __func__, ctrl->cntlid, req->sq->qid,
 		 nvme_auth_dhgroup_name(ctrl->dh_gid), ctrl->dh_gid);
···
 	if (memcmp(data->rval, response, data->hl)) {
 		pr_info("ctrl %d qid %d host response mismatch\n",
 			ctrl->cntlid, req->sq->qid);
+		pr_debug("ctrl %d qid %d rval %*ph\n",
+			 ctrl->cntlid, req->sq->qid, data->hl, data->rval);
+		pr_debug("ctrl %d qid %d response %*ph\n",
+			 ctrl->cntlid, req->sq->qid, data->hl, response);
 		kfree(response);
 		return NVME_AUTH_DHCHAP_FAILURE_FAILED;
 	}
 	kfree(response);
 	pr_debug("%s: ctrl %d qid %d host authenticated\n",
 		 __func__, ctrl->cntlid, req->sq->qid);
+	if (!data->cvalid && ctrl->concat) {
+		pr_debug("%s: ctrl %d qid %d invalid challenge\n",
+			 __func__, ctrl->cntlid, req->sq->qid);
+		return NVME_AUTH_DHCHAP_FAILURE_FAILED;
+	}
+	req->sq->dhchap_s2 = le32_to_cpu(data->seqnum);
 	if (data->cvalid) {
 		req->sq->dhchap_c2 = kmemdup(data->rval + data->hl, data->hl,
 					     GFP_KERNEL);
···
 		pr_debug("%s: ctrl %d qid %d challenge %*ph\n",
 			 __func__, ctrl->cntlid, req->sq->qid, data->hl,
 			 req->sq->dhchap_c2);
-	} else {
-		req->sq->authenticated = true;
-		req->sq->dhchap_c2 = NULL;
 	}
-	req->sq->dhchap_s2 = le32_to_cpu(data->seqnum);
+	/*
+	 * NVMe Base Spec 2.2 section 8.3.4.5.4: DH-HMAC-CHAP_Reply message
+	 * Sequence Number (SEQNUM): [ .. ]
+	 * The value 0h is used to indicate that bidirectional authentication
+	 * is not performed, but a challenge value C2 is carried in order to
+	 * generate a pre-shared key (PSK) for subsequent establishment of a
+	 * secure channel.
+	 */
+	if (req->sq->dhchap_s2 == 0) {
+		if (ctrl->concat)
+			nvmet_auth_insert_psk(req->sq);
+		req->sq->authenticated = true;
+		kfree(req->sq->dhchap_c2);
+		req->sq->dhchap_c2 = NULL;
+	} else if (!data->cvalid)
+		req->sq->authenticated = true;
 
 	return 0;
 }
···
 	pr_debug("%s: ctrl %d qid %d reset negotiation\n",
 		 __func__, ctrl->cntlid, req->sq->qid);
 	if (!req->sq->qid) {
-		dhchap_status = nvmet_setup_auth(ctrl);
+		dhchap_status = nvmet_setup_auth(ctrl, req->sq);
 		if (dhchap_status) {
 			pr_err("ctrl %d qid 0 failed to setup re-authentication\n",
 			       ctrl->cntlid);
···
 		}
 		goto done_kfree;
 	case NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2:
+		if (ctrl->concat)
+			nvmet_auth_insert_psk(req->sq);
 		req->sq->authenticated = true;
 		pr_debug("%s: ctrl %d qid %d ctrl authenticated\n",
 			 __func__, ctrl->cntlid, req->sq->qid);
+21 -4
drivers/nvme/target/fabrics-cmd.c
··· 234 234 return ret; 235 235 } 236 236 237 - static u32 nvmet_connect_result(struct nvmet_ctrl *ctrl) 237 + static u32 nvmet_connect_result(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq) 238 238 { 239 + bool needs_auth = nvmet_has_auth(ctrl, sq); 240 + key_serial_t keyid = nvmet_queue_tls_keyid(sq); 241 + 242 + /* Do not authenticate I/O queues for secure concatenation */ 243 + if (ctrl->concat && sq->qid) 244 + needs_auth = false; 245 + 246 + if (keyid) 247 + pr_debug("%s: ctrl %d qid %d should %sauthenticate, tls psk %08x\n", 248 + __func__, ctrl->cntlid, sq->qid, 249 + needs_auth ? "" : "not ", keyid); 250 + else 251 + pr_debug("%s: ctrl %d qid %d should %sauthenticate%s\n", 252 + __func__, ctrl->cntlid, sq->qid, 253 + needs_auth ? "" : "not ", 254 + ctrl->concat ? ", secure concatenation" : ""); 239 255 return (u32)ctrl->cntlid | 240 - (nvmet_has_auth(ctrl) ? NVME_CONNECT_AUTHREQ_ATR : 0); 256 + (needs_auth ? NVME_CONNECT_AUTHREQ_ATR : 0); 241 257 } 242 258 243 259 static void nvmet_execute_admin_connect(struct nvmet_req *req) ··· 263 247 struct nvmet_ctrl *ctrl = NULL; 264 248 struct nvmet_alloc_ctrl_args args = { 265 249 .port = req->port, 250 + .sq = req->sq, 266 251 .ops = req->ops, 267 252 .p2p_client = req->p2p_client, 268 253 .kato = le32_to_cpu(c->kato), ··· 316 299 goto out; 317 300 } 318 301 319 - args.result = cpu_to_le32(nvmet_connect_result(ctrl)); 302 + args.result = cpu_to_le32(nvmet_connect_result(ctrl, req->sq)); 320 303 out: 321 304 kfree(d); 322 305 complete: ··· 374 357 goto out_ctrl_put; 375 358 376 359 pr_debug("adding queue %d to ctrl %d.\n", qid, ctrl->cntlid); 377 - req->cqe->result.u32 = cpu_to_le32(nvmet_connect_result(ctrl)); 360 + req->cqe->result.u32 = cpu_to_le32(nvmet_connect_result(ctrl, req->sq)); 378 361 out: 379 362 kfree(d); 380 363 complete:
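The packing done by nvmet_connect_result() is a plain bitwise OR: the controller ID occupies the low 16 bits of the Connect response result dword, with the "authentication required" flag above it. A minimal sketch, with the flag's bit position assumed for illustration rather than taken from the NVMe headers:

```c
#include <stdint.h>

/* Assumed bit position for this sketch only; the real value comes from
 * NVME_CONNECT_AUTHREQ_ATR in the NVMe headers. */
#define AUTHREQ_ATR_FLAG (1u << 16)

/* Pack a Connect result dword: cntlid in bits 15:0, auth flag above. */
static uint32_t connect_result(uint16_t cntlid, int needs_auth)
{
	return (uint32_t)cntlid | (needs_auth ? AUTHREQ_ATR_FLAG : 0);
}

/* Recover the controller ID from a packed result. */
static uint16_t result_cntlid(uint32_t result)
{
	return (uint16_t)(result & 0xffff);
}
```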
-14
drivers/nvme/target/fc.c
··· 172 172 struct work_struct del_work; 173 173 }; 174 174 175 - 176 - static inline int 177 - nvmet_fc_iodnum(struct nvmet_fc_ls_iod *iodptr) 178 - { 179 - return (iodptr - iodptr->tgtport->iod); 180 - } 181 - 182 - static inline int 183 - nvmet_fc_fodnum(struct nvmet_fc_fcp_iod *fodptr) 184 - { 185 - return (fodptr - fodptr->queue->fod); 186 - } 187 - 188 - 189 175 /* 190 176 * Association and Connection IDs: 191 177 *
+1 -1
drivers/nvme/target/loop.c
··· 162 162 } 163 163 164 164 iod->req.sg = iod->sg_table.sgl; 165 - iod->req.sg_cnt = blk_rq_map_sg(req->q, req, iod->sg_table.sgl); 165 + iod->req.sg_cnt = blk_rq_map_sg(req, iod->sg_table.sgl); 166 166 iod->req.transfer_len = blk_rq_payload_bytes(req); 167 167 } 168 168
+34 -6
drivers/nvme/target/nvmet.h
··· 165 165 u8 *dhchap_skey; 166 166 int dhchap_skey_len; 167 167 #endif 168 + #ifdef CONFIG_NVME_TARGET_TCP_TLS 169 + struct key *tls_key; 170 + #endif 168 171 struct completion free_done; 169 172 struct completion confirm_done; 170 173 }; ··· 292 289 u64 err_counter; 293 290 struct nvme_error_slot slots[NVMET_ERROR_LOG_SLOTS]; 294 291 bool pi_support; 292 + bool concat; 295 293 #ifdef CONFIG_NVME_TARGET_AUTH 296 294 struct nvme_dhchap_key *host_key; 297 295 struct nvme_dhchap_key *ctrl_key; ··· 301 297 u8 dh_gid; 302 298 u8 *dh_key; 303 299 size_t dh_keysize; 300 + #endif 301 + #ifdef CONFIG_NVME_TARGET_TCP_TLS 302 + struct key *tls_key; 304 303 #endif 305 304 struct nvmet_pr_log_mgr pr_log_mgr; 306 305 }; ··· 590 583 591 584 struct nvmet_alloc_ctrl_args { 592 585 struct nvmet_port *port; 586 + struct nvmet_sq *sq; 593 587 char *subsysnqn; 594 588 char *hostnqn; 595 589 uuid_t *hostid; ··· 827 819 /* Convert a 32-bit number to a 16-bit 0's based number */ 828 820 static inline __le16 to0based(u32 a) 829 821 { 830 - return cpu_to_le16(max(1U, min(1U << 16, a)) - 1); 822 + return cpu_to_le16(clamp(a, 1U, 1U << 16) - 1); 831 823 } 832 824 833 825 static inline bool nvmet_ns_has_pi(struct nvmet_ns *ns) ··· 859 851 bio_put(bio); 860 852 } 861 853 854 + #ifdef CONFIG_NVME_TARGET_TCP_TLS 855 + static inline key_serial_t nvmet_queue_tls_keyid(struct nvmet_sq *sq) 856 + { 857 + return sq->tls_key ? 
key_serial(sq->tls_key) : 0; 858 + } 859 + static inline void nvmet_sq_put_tls_key(struct nvmet_sq *sq) 860 + { 861 + if (sq->tls_key) { 862 + key_put(sq->tls_key); 863 + sq->tls_key = NULL; 864 + } 865 + } 866 + #else 867 + static inline key_serial_t nvmet_queue_tls_keyid(struct nvmet_sq *sq) { return 0; } 868 + static inline void nvmet_sq_put_tls_key(struct nvmet_sq *sq) {} 869 + #endif 862 870 #ifdef CONFIG_NVME_TARGET_AUTH 863 871 u32 nvmet_auth_send_data_len(struct nvmet_req *req); 864 872 void nvmet_execute_auth_send(struct nvmet_req *req); ··· 883 859 int nvmet_auth_set_key(struct nvmet_host *host, const char *secret, 884 860 bool set_ctrl); 885 861 int nvmet_auth_set_host_hash(struct nvmet_host *host, const char *hash); 886 - u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl); 862 + u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq); 887 863 void nvmet_auth_sq_init(struct nvmet_sq *sq); 888 864 void nvmet_destroy_auth(struct nvmet_ctrl *ctrl); 889 865 void nvmet_auth_sq_free(struct nvmet_sq *sq); ··· 893 869 unsigned int hash_len); 894 870 int nvmet_auth_ctrl_hash(struct nvmet_req *req, u8 *response, 895 871 unsigned int hash_len); 896 - static inline bool nvmet_has_auth(struct nvmet_ctrl *ctrl) 872 + static inline bool nvmet_has_auth(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq) 897 873 { 898 - return ctrl->host_key != NULL; 874 + return ctrl->host_key != NULL && !nvmet_queue_tls_keyid(sq); 899 875 } 900 876 int nvmet_auth_ctrl_exponential(struct nvmet_req *req, 901 877 u8 *buf, int buf_size); 902 878 int nvmet_auth_ctrl_sesskey(struct nvmet_req *req, 903 879 u8 *buf, int buf_size); 880 + void nvmet_auth_insert_psk(struct nvmet_sq *sq); 904 881 #else 905 - static inline u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl) 882 + static inline u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl, 883 + struct nvmet_sq *sq) 906 884 { 907 885 return 0; 908 886 } ··· 917 891 { 918 892 return true; 919 893 } 920 - static inline bool nvmet_has_auth(struct nvmet_ctrl 
*ctrl) 894 + static inline bool nvmet_has_auth(struct nvmet_ctrl *ctrl, 895 + struct nvmet_sq *sq) 921 896 { 922 897 return false; 923 898 } 924 899 static inline const char *nvmet_dhchap_dhgroup_name(u8 dhgid) { return NULL; } 900 + static inline void nvmet_auth_insert_psk(struct nvmet_sq *sq) {}; 925 901 #endif 926 902 927 903 int nvmet_pr_init_ns(struct nvmet_ns *ns);
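The to0based() change in nvmet.h replaces the nested max(min()) with clamp(), which reads directly as "restrict to [1, 65536], then subtract one". A userspace sketch (clamp() here is a minimal stand-in for the kernel macro, and the cpu_to_le16() conversion is omitted):

```c
#include <stdint.h>

/* Minimal stand-in for the kernel's clamp() macro */
#define clamp(val, lo, hi) \
	((val) < (lo) ? (lo) : ((val) > (hi) ? (hi) : (val)))

/* Convert a 32-bit count to a 16-bit 0's-based value, clamping the
 * input to [1, 65536] first, as in the updated nvmet.h helper. */
static uint16_t to0based(uint32_t a)
{
	return (uint16_t)(clamp(a, 1U, 1U << 16) - 1);
}
```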
+9 -3
drivers/nvme/target/pci-epf.c
··· 1385 1385 if (!test_and_clear_bit(NVMET_PCI_EPF_Q_LIVE, &sq->flags)) 1386 1386 return NVME_SC_QID_INVALID | NVME_STATUS_DNR; 1387 1387 1388 - flush_workqueue(sq->iod_wq); 1389 1388 destroy_workqueue(sq->iod_wq); 1390 1389 sq->iod_wq = NULL; 1391 1390 ··· 2128 2129 return -ENODEV; 2129 2130 } 2130 2131 2131 - if (epc_features->bar[BAR_0].only_64bit) 2132 - epf->bar[BAR_0].flags |= PCI_BASE_ADDRESS_MEM_TYPE_64; 2132 + /* 2133 + * While NVMe PCIe Transport Specification 1.1, section 2.1.10, claims 2134 + * that the BAR0 type is Implementation Specific, in NVMe 1.1, the type 2135 + * is required to be 64-bit. Thus, for interoperability, always set the 2136 + * type to 64-bit. In the rare case that the PCI EPC does not support 2137 + * configuring BAR0 as 64-bit, the call to pci_epc_set_bar() will fail, 2138 + * and we will return failure back to the user. 2139 + */ 2140 + epf->bar[BAR_0].flags |= PCI_BASE_ADDRESS_MEM_TYPE_64; 2133 2141 2134 2142 /* 2135 2143 * Calculate the size of the register bar: NVMe registers first with
+29 -3
drivers/nvme/target/tcp.c
··· 8 8 #include <linux/init.h> 9 9 #include <linux/slab.h> 10 10 #include <linux/err.h> 11 - #include <linux/key.h> 12 11 #include <linux/nvme-tcp.h> 13 12 #include <linux/nvme-keyring.h> 14 13 #include <net/sock.h> ··· 1079 1080 1080 1081 if (unlikely(!nvmet_req_init(req, &queue->nvme_cq, 1081 1082 &queue->nvme_sq, &nvmet_tcp_ops))) { 1082 - pr_err("failed cmd %p id %d opcode %d, data_len: %d\n", 1083 + pr_err("failed cmd %p id %d opcode %d, data_len: %d, status: %04x\n", 1083 1084 req->cmd, req->cmd->common.command_id, 1084 1085 req->cmd->common.opcode, 1085 - le32_to_cpu(req->cmd->common.dptr.sgl.length)); 1086 + le32_to_cpu(req->cmd->common.dptr.sgl.length), 1087 + le16_to_cpu(req->cqe->status)); 1086 1088 1087 1089 nvmet_tcp_handle_req_failure(queue, queue->cmd, req); 1088 1090 return 0; ··· 1609 1609 /* stop accepting incoming data */ 1610 1610 queue->rcv_state = NVMET_TCP_RECV_ERR; 1611 1611 1612 + nvmet_sq_put_tls_key(&queue->nvme_sq); 1612 1613 nvmet_tcp_uninit_data_in_cmds(queue); 1613 1614 nvmet_sq_destroy(&queue->nvme_sq); 1614 1615 cancel_work_sync(&queue->io_work); ··· 1795 1794 return 0; 1796 1795 } 1797 1796 1797 + static int nvmet_tcp_tls_key_lookup(struct nvmet_tcp_queue *queue, 1798 + key_serial_t peerid) 1799 + { 1800 + struct key *tls_key = nvme_tls_key_lookup(peerid); 1801 + int status = 0; 1802 + 1803 + if (IS_ERR(tls_key)) { 1804 + pr_warn("%s: queue %d failed to lookup key %x\n", 1805 + __func__, queue->idx, peerid); 1806 + spin_lock_bh(&queue->state_lock); 1807 + queue->state = NVMET_TCP_Q_FAILED; 1808 + spin_unlock_bh(&queue->state_lock); 1809 + status = PTR_ERR(tls_key); 1810 + } else { 1811 + pr_debug("%s: queue %d using TLS PSK %x\n", 1812 + __func__, queue->idx, peerid); 1813 + queue->nvme_sq.tls_key = tls_key; 1814 + } 1815 + return status; 1816 + } 1817 + 1798 1818 static void nvmet_tcp_tls_handshake_done(void *data, int status, 1799 1819 key_serial_t peerid) 1800 1820 { ··· 1836 1814 spin_unlock_bh(&queue->state_lock); 1837 1815 
1838 1816 cancel_delayed_work_sync(&queue->tls_handshake_tmo_work); 1817 + 1818 + if (!status) 1819 + status = nvmet_tcp_tls_key_lookup(queue, peerid); 1820 + 1839 1821 if (status) 1840 1822 nvmet_tcp_schedule_release_queue(queue); 1841 1823 else
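The new nvmet_tcp_tls_key_lookup() leans on the kernel's ERR_PTR()/IS_ERR() convention: a failed lookup returns an error code encoded in the pointer itself, so a single return value carries either a valid key or an -errno. A minimal userspace stand-in for that convention, assuming the usual "top 4095 addresses are errors" encoding:

```c
#define MAX_ERRNO 4095

/* Encode an -errno value as a pointer in the top MAX_ERRNO addresses */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

/* Recover the -errno value from an error pointer */
static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* True if the pointer is actually an encoded error */
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```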
+1 -1
drivers/scsi/scsi_lib.c
··· 1149 1149 * Next, walk the list, and fill in the addresses and sizes of 1150 1150 * each segment. 1151 1151 */ 1152 - count = __blk_rq_map_sg(rq->q, rq, cmd->sdb.table.sgl, &last_sg); 1152 + count = __blk_rq_map_sg(rq, cmd->sdb.table.sgl, &last_sg); 1153 1153 1154 1154 if (blk_rq_bytes(rq) & rq->q->limits.dma_pad_mask) { 1155 1155 unsigned int pad_len =
-12
drivers/target/target_core_iblock.c
··· 167 167 break; 168 168 } 169 169 170 - if (dev->dev_attrib.pi_prot_type) { 171 - struct bio_set *bs = &ib_dev->ibd_bio_set; 172 - 173 - if (bioset_integrity_create(bs, IBLOCK_BIO_POOL_SIZE) < 0) { 174 - pr_err("Unable to allocate bioset for PI\n"); 175 - ret = -ENOMEM; 176 - goto out_blkdev_put; 177 - } 178 - pr_debug("IBLOCK setup BIP bs->bio_integrity_pool: %p\n", 179 - &bs->bio_integrity_pool); 180 - } 181 - 182 170 dev->dev_attrib.hw_pi_prot_type = dev->dev_attrib.pi_prot_type; 183 171 return 0; 184 172
+4 -3
drivers/ufs/core/ufshcd-crypto.c
··· 72 72 73 73 if (ccap_array[cap_idx].algorithm_id == UFS_CRYPTO_ALG_AES_XTS) { 74 74 /* In XTS mode, the blk_crypto_key's size is already doubled */ 75 - memcpy(cfg.crypto_key, key->raw, key->size/2); 75 + memcpy(cfg.crypto_key, key->bytes, key->size/2); 76 76 memcpy(cfg.crypto_key + UFS_CRYPTO_KEY_MAX_SIZE/2, 77 - key->raw + key->size/2, key->size/2); 77 + key->bytes + key->size/2, key->size/2); 78 78 } else { 79 - memcpy(cfg.crypto_key, key->raw, key->size); 79 + memcpy(cfg.crypto_key, key->bytes, key->size); 80 80 } 81 81 82 82 ufshcd_program_key(hba, &cfg, slot); ··· 185 185 hba->crypto_profile.ll_ops = ufshcd_crypto_ops; 186 186 /* UFS only supports 8 bytes for any DUN */ 187 187 hba->crypto_profile.max_dun_bytes_supported = 8; 188 + hba->crypto_profile.key_types_supported = BLK_CRYPTO_KEY_TYPE_RAW; 188 189 hba->crypto_profile.dev = hba->dev; 189 190 190 191 /*
+2 -1
drivers/ufs/host/ufs-exynos.c
··· 1320 1320 return; 1321 1321 } 1322 1322 profile->max_dun_bytes_supported = AES_BLOCK_SIZE; 1323 + profile->key_types_supported = BLK_CRYPTO_KEY_TYPE_RAW; 1323 1324 profile->dev = hba->dev; 1324 1325 profile->modes_supported[BLK_ENCRYPTION_MODE_AES_256_XTS] = 1325 1326 DATA_UNIT_SIZE; ··· 1367 1366 void *prdt, unsigned int num_segments) 1368 1367 { 1369 1368 struct fmp_sg_entry *fmp_prdt = prdt; 1370 - const u8 *enckey = crypt_ctx->bc_key->raw; 1369 + const u8 *enckey = crypt_ctx->bc_key->bytes; 1371 1370 const u8 *twkey = enckey + AES_KEYSIZE_256; 1372 1371 u64 dun_lo = crypt_ctx->bc_dun[0]; 1373 1372 u64 dun_hi = crypt_ctx->bc_dun[1];
+2 -1
drivers/ufs/host/ufs-qcom.c
··· 147 147 148 148 profile->ll_ops = ufs_qcom_crypto_ops; 149 149 profile->max_dun_bytes_supported = 8; 150 + profile->key_types_supported = BLK_CRYPTO_KEY_TYPE_RAW; 150 151 profile->dev = dev; 151 152 152 153 /* ··· 203 202 err = qcom_ice_program_key(host->ice, 204 203 QCOM_ICE_CRYPTO_ALG_AES_XTS, 205 204 QCOM_ICE_CRYPTO_KEY_SIZE_256, 206 - key->raw, 205 + key->bytes, 207 206 key->crypto_cfg.data_unit_size / 512, 208 207 slot); 209 208 ufshcd_release(hba);
+1
fs/crypto/Kconfig
··· 3 3 bool "FS Encryption (Per-file encryption)" 4 4 select CRYPTO 5 5 select CRYPTO_HASH 6 + select CRYPTO_HKDF 6 7 select CRYPTO_SKCIPHER 7 8 select CRYPTO_LIB_SHA256 8 9 select KEYS
+14 -69
fs/crypto/hkdf.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* 3 - * Implementation of HKDF ("HMAC-based Extract-and-Expand Key Derivation 4 - * Function"), aka RFC 5869. See also the original paper (Krawczyk 2010): 5 - * "Cryptographic Extraction and Key Derivation: The HKDF Scheme". 6 - * 7 3 * This is used to derive keys from the fscrypt master keys. 8 4 * 9 5 * Copyright 2019 Google LLC ··· 7 11 8 12 #include <crypto/hash.h> 9 13 #include <crypto/sha2.h> 14 + #include <crypto/hkdf.h> 10 15 11 16 #include "fscrypt_private.h" 12 17 ··· 41 44 * there's no way to persist a random salt per master key from kernel mode. 42 45 */ 43 46 44 - /* HKDF-Extract (RFC 5869 section 2.2), unsalted */ 45 - static int hkdf_extract(struct crypto_shash *hmac_tfm, const u8 *ikm, 46 - unsigned int ikmlen, u8 prk[HKDF_HASHLEN]) 47 - { 48 - static const u8 default_salt[HKDF_HASHLEN]; 49 - int err; 50 - 51 - err = crypto_shash_setkey(hmac_tfm, default_salt, HKDF_HASHLEN); 52 - if (err) 53 - return err; 54 - 55 - return crypto_shash_tfm_digest(hmac_tfm, ikm, ikmlen, prk); 56 - } 57 - 58 47 /* 59 48 * Compute HKDF-Extract using the given master key as the input keying material, 60 49 * and prepare an HMAC transform object keyed by the resulting pseudorandom key. 
··· 52 69 unsigned int master_key_size) 53 70 { 54 71 struct crypto_shash *hmac_tfm; 72 + static const u8 default_salt[HKDF_HASHLEN]; 55 73 u8 prk[HKDF_HASHLEN]; 56 74 int err; 57 75 ··· 68 84 goto err_free_tfm; 69 85 } 70 86 71 - err = hkdf_extract(hmac_tfm, master_key, master_key_size, prk); 87 + err = hkdf_extract(hmac_tfm, master_key, master_key_size, 88 + default_salt, HKDF_HASHLEN, prk); 72 89 if (err) 73 90 goto err_free_tfm; 74 91 ··· 103 118 u8 *okm, unsigned int okmlen) 104 119 { 105 120 SHASH_DESC_ON_STACK(desc, hkdf->hmac_tfm); 106 - u8 prefix[9]; 107 - unsigned int i; 121 + u8 *full_info; 108 122 int err; 109 - const u8 *prev = NULL; 110 - u8 counter = 1; 111 - u8 tmp[HKDF_HASHLEN]; 112 123 113 - if (WARN_ON_ONCE(okmlen > 255 * HKDF_HASHLEN)) 114 - return -EINVAL; 115 - 124 + full_info = kzalloc(infolen + 9, GFP_KERNEL); 125 + if (!full_info) 126 + return -ENOMEM; 116 127 desc->tfm = hkdf->hmac_tfm; 117 128 118 - memcpy(prefix, "fscrypt\0", 8); 119 - prefix[8] = context; 129 + memcpy(full_info, "fscrypt\0", 8); 130 + full_info[8] = context; 131 + memcpy(full_info + 9, info, infolen); 120 132 121 - for (i = 0; i < okmlen; i += HKDF_HASHLEN) { 122 - 123 - err = crypto_shash_init(desc); 124 - if (err) 125 - goto out; 126 - 127 - if (prev) { 128 - err = crypto_shash_update(desc, prev, HKDF_HASHLEN); 129 - if (err) 130 - goto out; 131 - } 132 - 133 - err = crypto_shash_update(desc, prefix, sizeof(prefix)); 134 - if (err) 135 - goto out; 136 - 137 - err = crypto_shash_update(desc, info, infolen); 138 - if (err) 139 - goto out; 140 - 141 - BUILD_BUG_ON(sizeof(counter) != 1); 142 - if (okmlen - i < HKDF_HASHLEN) { 143 - err = crypto_shash_finup(desc, &counter, 1, tmp); 144 - if (err) 145 - goto out; 146 - memcpy(&okm[i], tmp, okmlen - i); 147 - memzero_explicit(tmp, sizeof(tmp)); 148 - } else { 149 - err = crypto_shash_finup(desc, &counter, 1, &okm[i]); 150 - if (err) 151 - goto out; 152 - } 153 - counter++; 154 - prev = &okm[i]; 155 - } 156 - err = 0; 157 - 
out: 158 - if (unlikely(err)) 159 - memzero_explicit(okm, okmlen); /* so caller doesn't need to */ 160 - shash_desc_zero(desc); 133 + err = hkdf_expand(hkdf->hmac_tfm, full_info, infolen + 9, 134 + okm, okmlen); 135 + kfree_sensitive(full_info); 161 136 return err; 162 137 } 163 138
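The heart of the hkdf.c rewrite is that the per-block "fscrypt\0" + context prefix handling moves out of the open-coded expand loop and into a single prefixed info buffer handed to the shared hkdf_expand(). A sketch of that buffer layout (the second helper exists only so the layout can be checked byte by byte):

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Build the application-info buffer the rewritten fscrypt_hkdf_expand()
 * passes to hkdf_expand(): 8 bytes of "fscrypt\0" (including the
 * embedded NUL), one context byte, then the caller's info string. */
static size_t fscrypt_build_info(uint8_t *dst, uint8_t context,
				 const uint8_t *info, size_t infolen)
{
	memcpy(dst, "fscrypt\0", 8);
	dst[8] = context;
	memcpy(dst + 9, info, infolen);
	return infolen + 9;
}

/* Test helper: return the byte at idx of the built buffer, -1 if out
 * of range. */
static int fscrypt_info_byte(uint8_t context, const uint8_t *info,
			     size_t infolen, size_t idx)
{
	uint8_t buf[64];

	if (infolen + 9 > sizeof(buf) || idx >= infolen + 9)
		return -1;
	fscrypt_build_info(buf, context, info, infolen);
	return buf[idx];
}
```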
+3 -1
fs/crypto/inline_crypt.c
··· 130 130 crypto_cfg.crypto_mode = ci->ci_mode->blk_crypto_mode; 131 131 crypto_cfg.data_unit_size = 1U << ci->ci_data_unit_bits; 132 132 crypto_cfg.dun_bytes = fscrypt_get_dun_bytes(ci); 133 + crypto_cfg.key_type = BLK_CRYPTO_KEY_TYPE_RAW; 133 134 134 135 devs = fscrypt_get_devices(sb, &num_devs); 135 136 if (IS_ERR(devs)) ··· 167 166 if (!blk_key) 168 167 return -ENOMEM; 169 168 170 - err = blk_crypto_init_key(blk_key, raw_key, crypto_mode, 169 + err = blk_crypto_init_key(blk_key, raw_key, ci->ci_mode->keysize, 170 + BLK_CRYPTO_KEY_TYPE_RAW, crypto_mode, 171 171 fscrypt_get_dun_bytes(ci), 172 172 1U << ci->ci_data_unit_bits); 173 173 if (err) {
+20
include/crypto/hkdf.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * HKDF: HMAC-based Key Derivation Function (HKDF), RFC 5869 4 + * 5 + * Extracted from fs/crypto/hkdf.c, which has 6 + * Copyright 2019 Google LLC 7 + */ 8 + 9 + #ifndef _CRYPTO_HKDF_H 10 + #define _CRYPTO_HKDF_H 11 + 12 + #include <crypto/hash.h> 13 + 14 + int hkdf_extract(struct crypto_shash *hmac_tfm, const u8 *ikm, 15 + unsigned int ikmlen, const u8 *salt, unsigned int saltlen, 16 + u8 *prk); 17 + int hkdf_expand(struct crypto_shash *hmac_tfm, 18 + const u8 *info, unsigned int infolen, 19 + u8 *okm, unsigned int okmlen); 20 + #endif
+5 -5
include/linux/badblocks.h
··· 48 48 int ack; 49 49 }; 50 50 51 - int badblocks_check(struct badblocks *bb, sector_t s, int sectors, 52 - sector_t *first_bad, int *bad_sectors); 53 - int badblocks_set(struct badblocks *bb, sector_t s, int sectors, 54 - int acknowledged); 55 - int badblocks_clear(struct badblocks *bb, sector_t s, int sectors); 51 + int badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors, 52 + sector_t *first_bad, sector_t *bad_sectors); 53 + bool badblocks_set(struct badblocks *bb, sector_t s, sector_t sectors, 54 + int acknowledged); 55 + bool badblocks_clear(struct badblocks *bb, sector_t s, sector_t sectors); 56 56 void ack_all_badblocks(struct badblocks *bb); 57 57 ssize_t badblocks_show(struct badblocks *bb, char *page, int unack); 58 58 ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
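The badblocks.h change widens sector counts from int to sector_t (and turns set/clear into bool). The motivation is range: an int count tops out at INT_MAX sectors, about 1 TiB at 512-byte sectors, while sector_t is 64-bit on modern kernels. A trivial userspace illustration (the typedef is a stand-in for the kernel's):

```c
#include <limits.h>

/* Stand-in for the kernel's 64-bit sector_t typedef */
typedef unsigned long long sector_t;

/* Would this sector count survive a round trip through an int? */
static int range_fits_in_int(sector_t sectors)
{
	return sectors <= (sector_t)INT_MAX;
}
```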
+2 -23
include/linux/bio-integrity.h
··· 16 16 }; 17 17 18 18 struct bio_integrity_payload { 19 - struct bio *bip_bio; /* parent bio */ 20 - 21 19 struct bvec_iter bip_iter; 22 20 23 21 unsigned short bip_vcnt; /* # of integrity bio_vecs */ ··· 23 25 unsigned short bip_flags; /* control flags */ 24 26 u16 app_tag; /* application tag value */ 25 27 26 - struct bvec_iter bio_iter; /* for rewinding parent bio */ 27 - 28 - struct work_struct bip_work; /* I/O completion */ 29 - 30 28 struct bio_vec *bip_vec; 31 - struct bio_vec bip_inline_vecs[];/* embedded bvec array */ 32 29 }; 33 30 34 31 #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_IP_CHECKSUM | \ ··· 67 74 bip->bip_iter.bi_sector = seed; 68 75 } 69 76 77 + void bio_integrity_init(struct bio *bio, struct bio_integrity_payload *bip, 78 + struct bio_vec *bvecs, unsigned int nr_vecs); 70 79 struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio, gfp_t gfp, 71 80 unsigned int nr); 72 81 int bio_integrity_add_page(struct bio *bio, struct page *page, unsigned int len, ··· 80 85 void bio_integrity_advance(struct bio *bio, unsigned int bytes_done); 81 86 void bio_integrity_trim(struct bio *bio); 82 87 int bio_integrity_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp_mask); 83 - int bioset_integrity_create(struct bio_set *bs, int pool_size); 84 - void bioset_integrity_free(struct bio_set *bs); 85 - void bio_integrity_init(void); 86 88 87 89 #else /* CONFIG_BLK_DEV_INTEGRITY */ 88 90 89 91 static inline struct bio_integrity_payload *bio_integrity(struct bio *bio) 90 92 { 91 93 return NULL; 92 - } 93 - 94 - static inline int bioset_integrity_create(struct bio_set *bs, int pool_size) 95 - { 96 - return 0; 97 - } 98 - 99 - static inline void bioset_integrity_free(struct bio_set *bs) 100 - { 101 94 } 102 95 103 96 static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter) ··· 119 136 } 120 137 121 138 static inline void bio_integrity_trim(struct bio *bio) 122 - { 123 - } 124 - 125 - static inline void 
bio_integrity_init(void) 126 139 { 127 140 } 128 141
-4
include/linux/bio.h
··· 625 625 626 626 mempool_t bio_pool; 627 627 mempool_t bvec_pool; 628 - #if defined(CONFIG_BLK_DEV_INTEGRITY) 629 - mempool_t bio_integrity_pool; 630 - mempool_t bvec_integrity_pool; 631 - #endif 632 628 633 629 unsigned int back_pad; 634 630 /*
+73
include/linux/blk-crypto-profile.h
··· 57 57 int (*keyslot_evict)(struct blk_crypto_profile *profile, 58 58 const struct blk_crypto_key *key, 59 59 unsigned int slot); 60 + 61 + /** 62 + * @derive_sw_secret: Derive the software secret from a hardware-wrapped 63 + * key in ephemerally-wrapped form. 64 + * 65 + * This only needs to be implemented if BLK_CRYPTO_KEY_TYPE_HW_WRAPPED 66 + * is supported. 67 + * 68 + * Must return 0 on success, -EBADMSG if the key is invalid, or another 69 + * -errno code on other errors. 70 + */ 71 + int (*derive_sw_secret)(struct blk_crypto_profile *profile, 72 + const u8 *eph_key, size_t eph_key_size, 73 + u8 sw_secret[BLK_CRYPTO_SW_SECRET_SIZE]); 74 + 75 + /** 76 + * @import_key: Create a hardware-wrapped key by importing a raw key. 77 + * 78 + * This only needs to be implemented if BLK_CRYPTO_KEY_TYPE_HW_WRAPPED 79 + * is supported. 80 + * 81 + * On success, must write the new key in long-term wrapped form to 82 + * @lt_key and return its size in bytes. On failure, must return a 83 + * -errno value. 84 + */ 85 + int (*import_key)(struct blk_crypto_profile *profile, 86 + const u8 *raw_key, size_t raw_key_size, 87 + u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE]); 88 + 89 + /** 90 + * @generate_key: Generate a hardware-wrapped key. 91 + * 92 + * This only needs to be implemented if BLK_CRYPTO_KEY_TYPE_HW_WRAPPED 93 + * is supported. 94 + * 95 + * On success, must write the new key in long-term wrapped form to 96 + * @lt_key and return its size in bytes. On failure, must return a 97 + * -errno value. 98 + */ 99 + int (*generate_key)(struct blk_crypto_profile *profile, 100 + u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE]); 101 + 102 + /** 103 + * @prepare_key: Prepare a hardware-wrapped key to be used. 104 + * 105 + * Prepare a hardware-wrapped key to be used by converting it from 106 + * long-term wrapped form to ephemerally-wrapped form. This only needs 107 + * to be implemented if BLK_CRYPTO_KEY_TYPE_HW_WRAPPED is supported. 
108 + * 109 + * On success, must write the key in ephemerally-wrapped form to 110 + * @eph_key and return its size in bytes. On failure, must return 111 + * -EBADMSG if the key is invalid, or another -errno on other error. 112 + */ 113 + int (*prepare_key)(struct blk_crypto_profile *profile, 114 + const u8 *lt_key, size_t lt_key_size, 115 + u8 eph_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE]); 60 116 }; 61 117 62 118 /** ··· 139 83 * supported DUNs is 0 through (1 << (8 * max_dun_bytes_supported)) - 1. 140 84 */ 141 85 unsigned int max_dun_bytes_supported; 86 + 87 + /** 88 + * @key_types_supported: A bitmask of the supported key types: 89 + * BLK_CRYPTO_KEY_TYPE_RAW and/or BLK_CRYPTO_KEY_TYPE_HW_WRAPPED. 90 + */ 91 + unsigned int key_types_supported; 142 92 143 93 /** 144 94 * @modes_supported: Array of bitmasks that specifies whether each ··· 204 142 void blk_crypto_reprogram_all_keys(struct blk_crypto_profile *profile); 205 143 206 144 void blk_crypto_profile_destroy(struct blk_crypto_profile *profile); 145 + 146 + int blk_crypto_import_key(struct blk_crypto_profile *profile, 147 + const u8 *raw_key, size_t raw_key_size, 148 + u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE]); 149 + 150 + int blk_crypto_generate_key(struct blk_crypto_profile *profile, 151 + u8 lt_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE]); 152 + 153 + int blk_crypto_prepare_key(struct blk_crypto_profile *profile, 154 + const u8 *lt_key, size_t lt_key_size, 155 + u8 eph_key[BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE]); 207 156 208 157 void blk_crypto_intersect_capabilities(struct blk_crypto_profile *parent, 209 158 const struct blk_crypto_profile *child);
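All four new hooks are optional and only meaningful when the profile advertises BLK_CRYPTO_KEY_TYPE_HW_WRAPPED in key_types_supported. The sketch below (illustrative only, not kernel code) shows the gating pattern a caller can apply before invoking such a hook:

```c
#include <stddef.h>

/* Bitflag key types, mirroring the blk-crypto enum */
#define KEY_TYPE_RAW		0x1
#define KEY_TYPE_HW_WRAPPED	0x2

/* Pared-down profile: a capability mask plus one optional hook */
struct profile_sketch {
	unsigned int key_types_supported;
	int (*import_key)(const unsigned char *raw, size_t len);
};

/* Only call the hook if wrapped keys are supported and it is wired up;
 * otherwise fail with an -EOPNOTSUPP-style code. */
static int try_import(const struct profile_sketch *p,
		      const unsigned char *raw, size_t len)
{
	if (!(p->key_types_supported & KEY_TYPE_HW_WRAPPED) || !p->import_key)
		return -95;
	return p->import_key(raw, len);
}

/* Fake hook for the checks below: reports the wrapped key size */
static int fake_import(const unsigned char *raw, size_t len)
{
	(void)raw;
	return (int)len;
}

/* Test helper: build a profile with the given mask and optional hook */
static int try_import_with(unsigned int types, int have_hook)
{
	struct profile_sketch p = {
		.key_types_supported = types,
		.import_key = have_hook ? fake_import : NULL,
	};
	return try_import(&p, (const unsigned char *)"k", 1);
}
```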
+66 -7
include/linux/blk-crypto.h
··· 6 6 #ifndef __LINUX_BLK_CRYPTO_H 7 7 #define __LINUX_BLK_CRYPTO_H 8 8 9 + #include <linux/minmax.h> 9 10 #include <linux/types.h> 11 + #include <uapi/linux/blk-crypto.h> 10 12 11 13 enum blk_crypto_mode_num { 12 14 BLK_ENCRYPTION_MODE_INVALID, ··· 19 17 BLK_ENCRYPTION_MODE_MAX, 20 18 }; 21 19 22 - #define BLK_CRYPTO_MAX_KEY_SIZE 64 20 + /* 21 + * Supported types of keys. Must be bitflags due to their use in 22 + * blk_crypto_profile::key_types_supported. 23 + */ 24 + enum blk_crypto_key_type { 25 + /* 26 + * Raw keys (i.e. "software keys"). These keys are simply kept in raw, 27 + * plaintext form in kernel memory. 28 + */ 29 + BLK_CRYPTO_KEY_TYPE_RAW = 0x1, 30 + 31 + /* 32 + * Hardware-wrapped keys. These keys are only present in kernel memory 33 + * in ephemerally-wrapped form, and they can only be unwrapped by 34 + * dedicated hardware. For details, see the "Hardware-wrapped keys" 35 + * section of Documentation/block/inline-encryption.rst. 36 + */ 37 + BLK_CRYPTO_KEY_TYPE_HW_WRAPPED = 0x2, 38 + }; 39 + 40 + /* 41 + * Currently the maximum raw key size is 64 bytes, as that is the key size of 42 + * BLK_ENCRYPTION_MODE_AES_256_XTS which takes the longest key. 43 + * 44 + * The maximum hardware-wrapped key size depends on the hardware's key wrapping 45 + * algorithm, which is a hardware implementation detail, so it isn't precisely 46 + * specified. But currently 128 bytes is plenty in practice. Implementations 47 + * are recommended to wrap a 32-byte key for the hardware KDF with AES-256-GCM, 48 + * which should result in a size closer to 64 bytes than 128. 49 + * 50 + * Both of these values can trivially be increased if ever needed. 51 + */ 52 + #define BLK_CRYPTO_MAX_RAW_KEY_SIZE 64 53 + #define BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE 128 54 + 55 + #define BLK_CRYPTO_MAX_ANY_KEY_SIZE \ 56 + MAX(BLK_CRYPTO_MAX_RAW_KEY_SIZE, BLK_CRYPTO_MAX_HW_WRAPPED_KEY_SIZE) 57 + 58 + /* 59 + * Size of the "software secret" which can be derived from a hardware-wrapped 60 + * key. 
This is currently always 32 bytes. Note, the choice of 32 bytes 61 + * assumes that the software secret is only used directly for algorithms that 62 + * don't require more than a 256-bit key to get the desired security strength. 63 + * If it were to be used e.g. directly as an AES-256-XTS key, then this would 64 + * need to be increased (which is possible if hardware supports it, but care 65 + * would need to be taken to avoid breaking users who need exactly 32 bytes). 66 + */ 67 + #define BLK_CRYPTO_SW_SECRET_SIZE 32 68 + 23 69 /** 24 70 * struct blk_crypto_config - an inline encryption key's crypto configuration 25 71 * @crypto_mode: encryption algorithm this key is for ··· 76 26 * ciphertext. This is always a power of 2. It might be e.g. the 77 27 * filesystem block size or the disk sector size. 78 28 * @dun_bytes: the maximum number of bytes of DUN used when using this key 29 + * @key_type: the type of this key -- either raw or hardware-wrapped 79 30 */ 80 31 struct blk_crypto_config { 81 32 enum blk_crypto_mode_num crypto_mode; 82 33 unsigned int data_unit_size; 83 34 unsigned int dun_bytes; 35 + enum blk_crypto_key_type key_type; 84 36 }; 85 37 86 38 /** 87 39 * struct blk_crypto_key - an inline encryption key 88 - * @crypto_cfg: the crypto configuration (like crypto_mode, key size) for this 89 - * key 40 + * @crypto_cfg: the crypto mode, data unit size, key type, and other 41 + * characteristics of this key and how it will be used 90 42 * @data_unit_size_bits: log2 of data_unit_size 91 - * @size: size of this key in bytes (determined by @crypto_cfg.crypto_mode) 92 - * @raw: the raw bytes of this key. Only the first @size bytes are used. 43 + * @size: size of this key in bytes. The size of a raw key is fixed for a given 44 + * crypto mode, but the size of a hardware-wrapped key can vary. 45 + * @bytes: the bytes of this key. Only the first @size bytes are significant. 
93 46 * 94 47 * A blk_crypto_key is immutable once created, and many bios can reference it at 95 48 * the same time. It must not be freed until all bios using it have completed ··· 102 49 struct blk_crypto_config crypto_cfg; 103 50 unsigned int data_unit_size_bits; 104 51 unsigned int size; 105 - u8 raw[BLK_CRYPTO_MAX_KEY_SIZE]; 52 + u8 bytes[BLK_CRYPTO_MAX_ANY_KEY_SIZE]; 106 53 }; 107 54 108 55 #define BLK_CRYPTO_MAX_IV_SIZE 32 ··· 140 87 unsigned int bytes, 141 88 const u64 next_dun[BLK_CRYPTO_DUN_ARRAY_SIZE]); 142 89 143 - int blk_crypto_init_key(struct blk_crypto_key *blk_key, const u8 *raw_key, 90 + int blk_crypto_init_key(struct blk_crypto_key *blk_key, 91 + const u8 *key_bytes, size_t key_size, 92 + enum blk_crypto_key_type key_type, 144 93 enum blk_crypto_mode_num crypto_mode, 145 94 unsigned int dun_bytes, 146 95 unsigned int data_unit_size); ··· 157 102 const struct blk_crypto_config *cfg); 158 103 bool blk_crypto_config_supported(struct block_device *bdev, 159 104 const struct blk_crypto_config *cfg); 105 + 106 + int blk_crypto_derive_sw_secret(struct block_device *bdev, 107 + const u8 *eph_key, size_t eph_key_size, 108 + u8 sw_secret[BLK_CRYPTO_SW_SECRET_SIZE]); 160 109 161 110 #else /* CONFIG_BLK_INLINE_ENCRYPTION */ 162 111
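The size constants above can be restated as a compile-time sketch: blk_crypto_key now carries one buffer sized for either key type, with MAX() picking the larger bound (MAX() here stands in for the kernel's minmax.h macro):

```c
/* Stand-in for the kernel's MAX() from minmax.h */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

#define MAX_RAW_KEY_SIZE	64
#define MAX_HW_WRAPPED_KEY_SIZE	128
#define MAX_ANY_KEY_SIZE	MAX(MAX_RAW_KEY_SIZE, MAX_HW_WRAPPED_KEY_SIZE)

/* 32-byte software secret derived from a hardware-wrapped key */
#define SW_SECRET_SIZE		32
```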
+4 -5
include/linux/blk-mq.h
··· 1173 1173 return max_t(unsigned short, rq->nr_phys_segments, 1); 1174 1174 } 1175 1175 1176 - int __blk_rq_map_sg(struct request_queue *q, struct request *rq, 1177 - struct scatterlist *sglist, struct scatterlist **last_sg); 1178 - static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq, 1179 - struct scatterlist *sglist) 1176 + int __blk_rq_map_sg(struct request *rq, struct scatterlist *sglist, 1177 + struct scatterlist **last_sg); 1178 + static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist) 1180 1179 { 1181 1180 struct scatterlist *last_sg = NULL; 1182 1181 1183 - return __blk_rq_map_sg(q, rq, sglist, &last_sg); 1182 + return __blk_rq_map_sg(rq, sglist, &last_sg); 1184 1183 } 1185 1184 void blk_dump_rq_flags(struct request *, char *); 1186 1185
+15
include/linux/blkdev.h
··· 568 568 struct blk_flush_queue *fq; 569 569 struct list_head flush_list; 570 570 571 + /* 572 + * Protects against I/O scheduler switching, particularly when updating 573 + * q->elevator. Since the elevator update code path may also modify q-> 574 + * nr_requests and wbt latency, this lock also protects the sysfs attrs 575 + * nr_requests and wbt_lat_usec. Additionally the nr_hw_queues update 576 + * may modify hctx tags, reserved-tags and cpumask, so this lock also 577 + * helps protect the hctx sysfs/debugfs attrs. To ensure proper locking 578 + * order during an elevator or nr_hw_queue update, first freeze the 579 + * queue, then acquire ->elevator_lock. 580 + */ 581 + struct mutex elevator_lock; 582 + 571 583 struct mutex sysfs_lock; 584 + /* 585 + * Protects queue limits and also sysfs attribute read_ahead_kb. 586 + */ 572 587 struct mutex limits_lock; 573 588 574 589 /*
+7
include/linux/nvme-auth.h
··· 40 40 int nvme_auth_gen_shared_secret(struct crypto_kpp *dh_tfm, 41 41 u8 *ctrl_key, size_t ctrl_key_len, 42 42 u8 *sess_key, size_t sess_key_len); 43 + int nvme_auth_generate_psk(u8 hmac_id, u8 *skey, size_t skey_len, 44 + u8 *c1, u8 *c2, size_t hash_len, 45 + u8 **ret_psk, size_t *ret_len); 46 + int nvme_auth_generate_digest(u8 hmac_id, u8 *psk, size_t psk_len, 47 + char *subsysnqn, char *hostnqn, u8 **ret_digest); 48 + int nvme_auth_derive_tls_psk(int hmac_id, u8 *psk, size_t psk_len, 49 + u8 *psk_digest, u8 **ret_psk); 43 50 44 51 #endif /* _NVME_AUTH_H */
+11 -1
include/linux/nvme-keyring.h
··· 6 6 #ifndef _NVME_KEYRING_H 7 7 #define _NVME_KEYRING_H 8 8 9 + #include <linux/key.h> 10 + 9 11 #if IS_ENABLED(CONFIG_NVME_KEYRING) 10 12 13 + struct key *nvme_tls_psk_refresh(struct key *keyring, 14 + const char *hostnqn, const char *subnqn, u8 hmac_id, 15 + u8 *data, size_t data_len, const char *digest); 11 16 key_serial_t nvme_tls_psk_default(struct key *keyring, 12 17 const char *hostnqn, const char *subnqn); 13 18 14 19 key_serial_t nvme_keyring_id(void); 15 20 struct key *nvme_tls_key_lookup(key_serial_t key_id); 16 21 #else 17 - 22 + static inline struct key *nvme_tls_psk_refresh(struct key *keyring, 23 + const char *hostnqn, const char *subnqn, u8 hmac_id, 24 + u8 *data, size_t data_len, const char *digest) 25 + { 26 + return ERR_PTR(-ENOTSUPP); 27 + } 18 28 static inline key_serial_t nvme_tls_psk_default(struct key *keyring, 19 29 const char *hostnqn, const char *subnqn) 20 30 {
+7
include/linux/nvme.h
··· 1772 1772 NVME_AUTH_DHGROUP_INVALID = 0xff, 1773 1773 }; 1774 1774 1775 + enum { 1776 + NVME_AUTH_SECP_NOSC = 0x00, 1777 + NVME_AUTH_SECP_SC = 0x01, 1778 + NVME_AUTH_SECP_NEWTLSPSK = 0x02, 1779 + NVME_AUTH_SECP_REPLACETLSPSK = 0x03, 1780 + }; 1781 + 1775 1782 union nvmf_auth_protocol { 1776 1783 struct nvmf_auth_dhchap_protocol_descriptor dhchap; 1777 1784 };
+4 -2
include/linux/wait.h
··· 1210 1210 1211 1211 #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function) 1212 1212 1213 - #define init_wait(wait) \ 1213 + #define init_wait_func(wait, function) \ 1214 1214 do { \ 1215 1215 (wait)->private = current; \ 1216 - (wait)->func = autoremove_wake_function; \ 1216 + (wait)->func = function; \ 1217 1217 INIT_LIST_HEAD(&(wait)->entry); \ 1218 1218 (wait)->flags = 0; \ 1219 1219 } while (0) 1220 + 1221 + #define init_wait(wait) init_wait_func(wait, autoremove_wake_function) 1220 1222 1221 1223 typedef int (*task_call_f)(struct task_struct *p, void *arg); 1222 1224 extern int task_call_func(struct task_struct *p, task_call_f func, void *arg);
+44
include/uapi/linux/blk-crypto.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 + #ifndef _UAPI_LINUX_BLK_CRYPTO_H 3 + #define _UAPI_LINUX_BLK_CRYPTO_H 4 + 5 + #include <linux/ioctl.h> 6 + #include <linux/types.h> 7 + 8 + struct blk_crypto_import_key_arg { 9 + /* Raw key (input) */ 10 + __u64 raw_key_ptr; 11 + __u64 raw_key_size; 12 + /* Long-term wrapped key blob (output) */ 13 + __u64 lt_key_ptr; 14 + __u64 lt_key_size; 15 + __u64 reserved[4]; 16 + }; 17 + 18 + struct blk_crypto_generate_key_arg { 19 + /* Long-term wrapped key blob (output) */ 20 + __u64 lt_key_ptr; 21 + __u64 lt_key_size; 22 + __u64 reserved[4]; 23 + }; 24 + 25 + struct blk_crypto_prepare_key_arg { 26 + /* Long-term wrapped key blob (input) */ 27 + __u64 lt_key_ptr; 28 + __u64 lt_key_size; 29 + /* Ephemerally-wrapped key blob (output) */ 30 + __u64 eph_key_ptr; 31 + __u64 eph_key_size; 32 + __u64 reserved[4]; 33 + }; 34 + 35 + /* 36 + * These ioctls share the block device ioctl space; see uapi/linux/fs.h. 37 + * 140-141 are reserved for future blk-crypto ioctls; any more than that would 38 + * require an additional allocation from the block device ioctl space. 39 + */ 40 + #define BLKCRYPTOIMPORTKEY _IOWR(0x12, 137, struct blk_crypto_import_key_arg) 41 + #define BLKCRYPTOGENERATEKEY _IOWR(0x12, 138, struct blk_crypto_generate_key_arg) 42 + #define BLKCRYPTOPREPAREKEY _IOWR(0x12, 139, struct blk_crypto_prepare_key_arg) 43 + 44 + #endif /* _UAPI_LINUX_BLK_CRYPTO_H */
+2 -4
include/uapi/linux/fs.h
··· 212 212 #define BLKROTATIONAL _IO(0x12,126) 213 213 #define BLKZEROOUT _IO(0x12,127) 214 214 #define BLKGETDISKSEQ _IOR(0x12,128,__u64) 215 - /* 216 - * A jump here: 130-136 are reserved for zoned block devices 217 - * (see uapi/linux/blkzoned.h) 218 - */ 215 + /* 130-136 are used by zoned block device ioctls (uapi/linux/blkzoned.h) */ 216 + /* 137-141 are used by blk-crypto ioctls (uapi/linux/blk-crypto.h) */ 219 217 220 218 #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ 221 219 #define FIBMAP _IO(0x00,1) /* bmap access */
+7
include/uapi/linux/ublk_cmd.h
··· 405 405 __u8 reserved[20]; 406 406 }; 407 407 408 + struct ublk_param_dma_align { 409 + __u32 alignment; 410 + __u8 pad[4]; 411 + }; 412 + 408 413 struct ublk_params { 409 414 /* 410 415 * Total length of parameters, userspace has to set 'len' for both ··· 422 417 #define UBLK_PARAM_TYPE_DISCARD (1 << 1) 423 418 #define UBLK_PARAM_TYPE_DEVT (1 << 2) 424 419 #define UBLK_PARAM_TYPE_ZONED (1 << 3) 420 + #define UBLK_PARAM_TYPE_DMA_ALIGN (1 << 4) 425 421 __u32 types; /* types of parameter included */ 426 422 427 423 struct ublk_param_basic basic; 428 424 struct ublk_param_discard discard; 429 425 struct ublk_param_devt devt; 430 426 struct ublk_param_zoned zoned; 427 + struct ublk_param_dma_align dma; 431 428 }; 432 429 433 430 #endif