Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

crypto: x86/aes-xts - handle CTS encryption more efficiently

When encrypting a message whose length isn't a multiple of 16 bytes,
encrypt the last full block in the main loop. This works because only
decryption uses the last two tweaks in reverse order, not encryption.

This improves the performance of encrypting messages whose length isn't
a multiple of the AES block length, shrinks the size of
aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test
and a not-taken conditional jump) when encrypting a message whose length
*is* a multiple of the AES block length.

While it's not super useful to optimize for ciphertext stealing given
that it's rarely needed in practice, the other two benefits mentioned
above make this optimization worthwhile.
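To make the tweak-ordering asymmetry concrete, here is a minimal Python sketch of XTS ciphertext stealing (not the kernel's code). The tweak update is the real XTS one (multiplication by x in GF(2^128), little-endian, reduction byte 0x87), but `toy_crypt` is a stand-in for AES and all names are illustrative. It shows why encryption can consume tweaks strictly in order, with the last full block handled by the main loop, while decryption must use the final two tweaks in reverse order:

```python
BLOCK = 16

def next_tweak(t: bytes) -> bytes:
    """Advance the XTS tweak: multiply by x in GF(2^128), little-endian."""
    n = int.from_bytes(t, "little") << 1
    if n >> 128:  # carry out of bit 127: reduce by the XTS polynomial
        n = (n & ((1 << 128) - 1)) ^ 0x87
    return n.to_bytes(16, "little")

def toy_crypt(block: bytes, tweak: bytes, enc: bool) -> bytes:
    """Invertible stand-in for tweaked AES (NOT secure, illustration only)."""
    if enc:
        return bytes(((b ^ t) + 1) & 0xFF for b, t in zip(block, tweak))
    return bytes(((b - 1) & 0xFF) ^ t for b, t in zip(block, tweak))

def xts_cts_encrypt(pt: bytes, tweak: bytes) -> bytes:
    assert len(pt) > BLOCK and len(pt) % BLOCK != 0  # the CTS case
    full, tail = len(pt) // BLOCK, len(pt) % BLOCK
    out = b""
    # Encrypt every full block with in-order tweaks -- including the last
    # full block, which is what the patch moves into the main loop.
    for i in range(full):
        out += toy_crypt(pt[i*BLOCK:(i+1)*BLOCK], tweak, enc=True)
        tweak = next_tweak(tweak)
    cc = out[-BLOCK:]  # CTS intermediate ciphertext of the last full block
    # Steal the end of cc, encrypt (partial plaintext || stolen part) with
    # the next tweak, and emit the swapped final blocks.
    last = toy_crypt(pt[full*BLOCK:] + cc[tail:], tweak, enc=True)
    return out[:-BLOCK] + last + cc[:tail]

def xts_cts_decrypt(ct: bytes, tweak: bytes) -> bytes:
    assert len(ct) > BLOCK and len(ct) % BLOCK != 0
    full, tail = len(ct) // BLOCK, len(ct) % BLOCK
    out = b""
    # Decrypt all full blocks except the last with in-order tweaks.
    for i in range(full - 1):
        out += toy_crypt(ct[i*BLOCK:(i+1)*BLOCK], tweak, enc=False)
        tweak = next_tweak(tweak)
    # The last two tweaks are used in reverse order: the last full
    # ciphertext block decrypts with the *later* tweak, and the
    # reconstructed intermediate block with the earlier one.
    t_last, t_next = tweak, next_tweak(tweak)
    b = toy_crypt(ct[(full-1)*BLOCK:full*BLOCK], t_next, enc=False)
    cc = ct[full*BLOCK:] + b[tail:]
    return out + toy_crypt(cc, t_last, enc=False) + b[:tail]

if __name__ == "__main__":
    tweak = bytes(range(16))
    msg = bytes(range(35))  # 2 full blocks + 3-byte partial block
    ct = xts_cts_encrypt(msg, tweak)
    assert len(ct) == len(msg)
    assert xts_cts_decrypt(ct, tweak) == msg
```

Note how the encrypt path never needs a tweak out of sequence, which is exactly why the patch can fold the last full block into the main loop for encryption but must still exclude it (by subtracting 16 from LEN) for decryption.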

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Authored by Eric Biggers; committed by Herbert Xu
1d27e1f5 3525fe47

+30 -25
arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -537,12 +537,16 @@
 	add		$7*16, KEY
 .else
 	add		$(15+7)*16, KEY
-.endif
 
-	// Check whether the data length is a multiple of the AES block length.
+	// When decrypting a message whose length isn't a multiple of the AES
+	// block length, exclude the last full block from the main loop by
+	// subtracting 16 from LEN.  This is needed because ciphertext stealing
+	// decryption uses the last two tweaks in reverse order.  We'll handle
+	// the last full block and the partial block specially at the end.
 	test		$15, LEN
-	jnz		.Lneed_cts\@
+	jnz		.Lneed_cts_dec\@
 .Lxts_init\@:
+.endif
 
 	// Cache as many round keys as possible.
 	_load_round_keys
@@ -689,31 +693,31 @@
 	_vaes_4x	\enc, 1, 12
 	jmp		.Lencrypt_4x_done\@
 
-.Lneed_cts\@:
-	// The data length isn't a multiple of the AES block length, so
-	// ciphertext stealing (CTS) will be needed.  Subtract one block from
-	// LEN so that the main loop doesn't process the last full block.  The
-	// CTS step will process it specially along with the partial block.
+.if !\enc
+.Lneed_cts_dec\@:
 	sub		$16, LEN
 	jmp		.Lxts_init\@
+.endif
 
 .Lcts\@:
 	// Do ciphertext stealing (CTS) to en/decrypt the last full block and
-	// the partial block.  CTS needs two tweaks.  TWEAK0_XMM contains the
-	// next tweak; compute the one after that.  Decryption uses these two
-	// tweaks in reverse order, so also define aliases to handle that.
-	_next_tweak	TWEAK0_XMM, %xmm0, TWEAK1_XMM
-.if \enc
-	.set		CTS_TWEAK0, TWEAK0_XMM
-	.set		CTS_TWEAK1, TWEAK1_XMM
-.else
-	.set		CTS_TWEAK0, TWEAK1_XMM
-	.set		CTS_TWEAK1, TWEAK0_XMM
-.endif
+	// the partial block.  TWEAK0_XMM contains the next tweak.
 
-	// En/decrypt the last full block.
+.if \enc
+	// If encrypting, the main loop already encrypted the last full block to
+	// create the CTS intermediate ciphertext.  Prepare for the rest of CTS
+	// by rewinding the pointers and loading the intermediate ciphertext.
+	sub		$16, SRC
+	sub		$16, DST
+	vmovdqu		(DST), %xmm0
+.else
+	// If decrypting, the main loop didn't decrypt the last full block
+	// because CTS decryption uses the last two tweaks in reverse order.
+	// Do it now by advancing the tweak and decrypting the last full block.
+	_next_tweak	TWEAK0_XMM, %xmm0, TWEAK1_XMM
 	vmovdqu		(SRC), %xmm0
-	_aes_crypt	\enc, _XMM, CTS_TWEAK0, %xmm0
+	_aes_crypt	\enc, _XMM, TWEAK1_XMM, %xmm0
+.endif
 
 .if USE_AVX10
 	// Create a mask that has the first LEN bits set.
@@ -721,9 +725,10 @@
 	bzhi		LEN, %rax, %rax
 	kmovq		%rax, %k1
 
-	// Swap the first LEN bytes of the above result with the partial block.
-	// Note that to support in-place en/decryption, the load from the src
-	// partial block must happen before the store to the dst partial block.
+	// Swap the first LEN bytes of the en/decryption of the last full block
+	// with the partial block.  Note that to support in-place en/decryption,
+	// the load from the src partial block must happen before the store to
+	// the dst partial block.
 	vmovdqa		%xmm0, %xmm1
 	vmovdqu8	16(SRC), %xmm0{%k1}
 	vmovdqu8	%xmm1, 16(DST){%k1}
@@ -755,7 +760,7 @@
 	vpblendvb	%xmm3, %xmm0, %xmm1, %xmm0
 .endif
 	// En/decrypt again and store the last full block.
-	_aes_crypt	\enc, _XMM, CTS_TWEAK1, %xmm0
+	_aes_crypt	\enc, _XMM, TWEAK0_XMM, %xmm0
 	vmovdqu		%xmm0, (DST)
 	jmp		.Ldone\@
 .endm