Instead of exposing the x86-optimized SHA-1 code via x86-specific
crypto_shash algorithms, just implement the sha1_blocks() library
function. This is much simpler: it makes the SHA-1 library functions
x86-optimized, and it fixes the longstanding issue where
the x86-optimized SHA-1 code was disabled by default. SHA-1 still
remains available through crypto_shash, but individual architectures no
longer need to handle it.
To match sha1_blocks(), change the type of the nblocks parameter of the
assembly functions from int to size_t. The assembly functions actually
already treated it as size_t.
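As an illustrative sketch only (the state type below is a placeholder,
not the exact declaration in the x86 glue code), the resulting
prototype pattern is:

#include <linux/linkage.h>
#include <linux/types.h>

/*
 * Hypothetical declaration mirroring the change described above: the
 * block count is now size_t, matching sha1_blocks(), rather than int.
 */
asmlinkage void sha1_transform_ssse3(u32 state[5], const u8 *data,
				     size_t nblocks);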
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250712232329.818226-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Instead of exposing the x86-optimized SHA-512 code via x86-specific
crypto_shash algorithms, just implement the sha512_blocks() library
function. This is much simpler: it makes the SHA-512 (and SHA-384)
library functions x86-optimized, and it fixes the
longstanding issue where the x86-optimized SHA-512 code was disabled by
default. SHA-512 still remains available through crypto_shash, but
individual architectures no longer need to handle it.
To match sha512_blocks(), change the type of the nblocks parameter of
the assembly functions from int to size_t. The assembly functions
actually already treated it as size_t.
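A minimal sketch of the resulting dispatch shape (the assembly symbol,
state type, and generic fallback below are placeholders for the real
lib/crypto symbols, not the exact kernel code):

#include <asm/fpu/api.h>
#include <linux/linkage.h>
#include <linux/types.h>

struct sha512_state_sketch { u64 h[8]; };

/* Placeholder names for the x86 assembly and the generic C fallback. */
asmlinkage void sha512_transform_x86(struct sha512_state_sketch *state,
				     const u8 *data, size_t nblocks);
void sha512_blocks_generic_sketch(struct sha512_state_sketch *state,
				  const u8 *data, size_t nblocks);

static void sha512_blocks(struct sha512_state_sketch *state,
			  const u8 *data, size_t nblocks)
{
	if (irq_fpu_usable()) {
		/* Kernel-mode FPU section around the x86 asm. */
		kernel_fpu_begin();
		sha512_transform_x86(state, data, nblocks);
		kernel_fpu_end();
	} else {
		sha512_blocks_generic_sketch(state, data, nblocks);
	}
}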
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160320.2888-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Instead of providing crypto_shash algorithms for the arch-optimized
SHA-256 code, implement the SHA-256 library. This is much simpler: it
makes the SHA-256 library functions arch-optimized, and it fixes the
longstanding issue where the arch-optimized SHA-256 was
disabled by default. SHA-256 still remains available through
crypto_shash, but individual architectures no longer need to handle it.
To match sha256_blocks_arch(), change the type of the nblocks parameter
of the assembly functions from int to size_t. The assembly functions
actually already treated it as size_t.
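Callers of the library interface are unchanged; for example, the
one-shot helper from <crypto/sha2.h> now simply ends up running the
arch-optimized block function when it is available (minimal usage
sketch; the example function name is just illustrative):

#include <crypto/sha2.h>

static void sha256_example(const u8 *data, unsigned int len)
{
	u8 digest[SHA256_DIGEST_SIZE];

	/* Same library call as before; the blocks are now processed by
	 * the arch-optimized code when available. */
	sha256(data, len, digest);
}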
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Continue disentangling the crypto library functions from the generic
crypto infrastructure by moving the x86 BLAKE2s, ChaCha, and Poly1305
library functions into a new directory arch/x86/lib/crypto/ that does
not depend on CRYPTO. This mirrors the distinction between crypto/ and
lib/crypto/.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
arch/x86/crypto/Kconfig is sourced only when CONFIG_X86=y, so there is
no need for the symbols defined inside it to depend on X86.
In the case of CRYPTO_TWOFISH_586 and CRYPTO_TWOFISH_X86_64, the
dependency was actually on '(X86 || UML_X86)', which suggests that these
two symbols were intended to be available under user-mode Linux as well.
Yet these symbols, too, were defined only when CONFIG_X86=y, so that
was not the case. Just remove this redundant dependency.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 Poly1305 code never falls back to the generic code, so selecting
CRYPTO_LIB_POLY1305_GENERIC is unnecessary.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since crypto/poly1305.c now registers a poly1305-$(ARCH) shash algorithm
that uses the architecture's Poly1305 library functions, individual
architectures no longer need to do the same. Therefore, remove the
redundant shash algorithm from the arch-specific code and leave just the
library functions there.
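A hedged sketch of the shape of that generic glue (simplified; not the
exact crypto/poly1305.c code): the shash callbacks are thin wrappers
around the Poly1305 library functions, which is why per-arch shash
glue is no longer needed.

#include <crypto/internal/hash.h>
#include <crypto/poly1305.h>

static int poly1305_shash_update(struct shash_desc *desc,
				 const u8 *src, unsigned int srclen)
{
	struct poly1305_desc_ctx *ctx = shash_desc_ctx(desc);

	/* Defer entirely to the (possibly arch-optimized) library. */
	poly1305_update(ctx, src, srclen);
	return 0;
}

static int poly1305_shash_final(struct shash_desc *desc, u8 *dst)
{
	struct poly1305_desc_ctx *ctx = shash_desc_ctx(desc);

	poly1305_final(ctx, dst);
	return 0;
}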
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since crypto/chacha.c now registers chacha20-$(ARCH), xchacha20-$(ARCH),
and xchacha12-$(ARCH) skcipher algorithms that use the architecture's
ChaCha and HChaCha library functions, individual architectures no longer
need to do the same. Therefore, remove the redundant skcipher
algorithms and leave just the library functions.
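The library interface that remains is used directly, e.g. (a minimal
usage sketch; the exact state type and init helper have varied a
little across kernel versions, so treat the signatures here as
approximate):

#include <crypto/chacha.h>

static void chacha20_example(const u32 key[8], const u8 iv[16],
			     u8 *dst, const u8 *src, unsigned int bytes)
{
	u32 state[CHACHA_STATE_WORDS];

	/* Set up the ChaCha state, then encrypt/decrypt in one call. */
	chacha_init(state, key, iv);
	chacha_crypt(state, dst, src, bytes, 20);
}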
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The current minimum required version of binutils is 2.25, which
supports the AVX-512 instruction mnemonics.
Remove the check for assembler support of AVX-512 instructions and all
the relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task-context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
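A hedged sketch of the resulting pattern (all "my_" names are
placeholders, not a specific driver): instead of registering through
crypto/simd.c, the algorithm's handlers enter the FPU section
unconditionally.

#include <asm/fpu/api.h>
#include <crypto/internal/skcipher.h>

static void my_cipher_crypt_simd(struct skcipher_request *req)
{
	/* Placeholder for the algorithm's actual SIMD implementation. */
}

static int my_cipher_encrypt(struct skcipher_request *req)
{
	/*
	 * .encrypt/.decrypt run only in task or softirq context, and
	 * kernel-mode FPU now works in softirqs on x86, so no
	 * SIMD-usability check or crypto/simd.c wrapper is needed.
	 */
	kernel_fpu_begin();
	my_cipher_crypt_simd(req);
	kernel_fpu_end();
	return 0;
}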
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The ARCH_MAY_HAVE patch missed arm64, mips and s390. But it may
also lead to arch options being enabled but ineffective because
of modular/built-in conflicts.
As wireguard, the primary user of all these options, selects the arch
options anyway, make the same selections at the lib/crypto option
level and hide the arch options from the user.
Instead of selecting them centrally from lib/crypto, simply set
the default of each arch option as suggested by Eric Biggers.
Change the Crypto API generic algorithms to select the top-level
lib/crypto options instead of the generic ones, as otherwise there
is no way to enable the arch options (Eric Biggers). Introduce a
set of INTERNAL options to work around dependency cycles on the
CONFIG_CRYPTO symbol.
Fixes: 1047e21aec ("crypto: lib/Kconfig - Fix lib built-in failure when arch is modular")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Closes: https://lore.kernel.org/oe-kbuild-all/202502232152.JC84YDLp-lkp@intel.com/
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The HAVE_ARCH Kconfig options in lib/crypto try to solve the modular
versus built-in problem, but they still fail when the LIB option
(e.g., CRYPTO_LIB_CURVE25519) is selected externally.
Fix this by introducing a level of indirection with ARCH_MAY_HAVE
Kconfig options; these then go on to select the ARCH_HAVE options if
the arch Kconfig option matches that of the LIB option.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501230223.ikroNDr1-lkp@intel.com/
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it
up to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API as well,
through the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
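For example, a caller can now get the accelerated code with a plain
library call (minimal usage sketch; the function name here is just
illustrative):

#include <linux/crc-t10dif.h>

static u16 pi_guard_tag(const u8 *buf, size_t len)
{
	/* Uses the x86-accelerated implementation when available,
	 * without going through the crypto API. */
	return crc_t10dif(buf, len);
}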
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Move the x86 CRC32 assembly code into the lib directory and wire it up
to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API as well,
through the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Start using SSE4.1 instructions in the AES-NI AEGIS code, with the first
use case being preparing the length block in fewer instructions.
In practice this does not reduce the set of CPUs on which the code can
run, because all Intel and AMD CPUs with AES-NI also have SSE4.1.
Upgrade the existing SSE2 feature check to SSE4.1, though it seems this
check is not strictly necessary; the aesni-intel module has been getting
away with using SSE4.1 despite checking for AES-NI only.
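As a hedged illustration of the length-block idea in C intrinsics (not
the kernel's assembly): with SSE4.1, the 128-bit finalization block
holding the associated-data and message lengths in bits can be built
with MOVQ + PINSRQ, avoiding extra shuffles or a trip through memory.

#include <immintrin.h>
#include <stdint.h>

static __m128i aegis_length_block(uint64_t assoclen, uint64_t msglen)
{
	/* Low 64 bits: associated-data length in bits (MOVQ). */
	__m128i v = _mm_cvtsi64_si128((long long)(assoclen * 8));

	/* High 64 bits: message length in bits (PINSRQ, SSE4.1). */
	return _mm_insert_epi64(v, (long long)(msglen * 8), 1);
}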
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Update the kconfig help and module description to reflect that VAES
instructions are now used in some cases. Also fix XTR => XCTR.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10. There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors. This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code. The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved. Any remainder is handled
using a simple 1 vector at a time loop, with masking.
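A hedged illustration of the masking idea in C intrinsics (not the new
assembly itself): a partial block of up to 64 bytes is brought in with
a masked load, so the tail needs no scalar byte-by-byte handling.

#include <immintrin.h>
#include <stddef.h>

static __m512i load_partial_vector(const unsigned char *src, size_t len)
{
	/* Build a mask with the low 'len' bits set (len <= 64). */
	__mmask64 mask = (len >= 64) ? ~(__mmask64)0
				     : (((__mmask64)1 << len) - 1);

	/* Masked load: bytes beyond 'len' are zeroed, not read. */
	return _mm512_maskz_loadu_epi8(mask, src);
}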
Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
as well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues. So I ended up starting from scratch. Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.
To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.
The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 60% | 62% | 70% | 69% |
Intel Sapphire Rapids | 157% | 145% | 162% | 119% | 96% | 96% |
Intel Emerald Rapids | 156% | 144% | 161% | 115% | 95% | 100% |
AMD Zen 4 | 103% | 89% | 78% | 56% | 54% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 66% | 48% | 49% | 70% | 53% |
Intel Sapphire Rapids | 80% | 60% | 41% | 62% | 38% |
Intel Emerald Rapids | 79% | 60% | 41% | 62% | 38% |
AMD Zen 4 | 51% | 35% | 27% | 32% | 25% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 59% | 63% | 67% | 71% |
Intel Sapphire Rapids | 159% | 145% | 161% | 125% | 102% | 100% |
Intel Emerald Rapids | 158% | 144% | 161% | 124% | 100% | 103% |
AMD Zen 4 | 110% | 95% | 80% | 59% | 56% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 67% | 56% | 46% | 70% | 56% |
Intel Sapphire Rapids | 79% | 62% | 39% | 61% | 39% |
Intel Emerald Rapids | 80% | 62% | 40% | 58% | 40% |
AMD Zen 4 | 49% | 36% | 30% | 35% | 28% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine. I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.
I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14171 | 12956 | 12318 | 9588 | 7293 | 6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 | 9107 | 5891 | 6472 |
AVX512_Intel_Linux | 13954 | 12277 | 11530 | 8712 | 6627 | 5898 |
AVX512_Cloudflare | 12564 | 11050 | 10905 | 8152 | 5345 | 5202 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 4939 | 3688 | 1846 | 1821 | 738 |
AVX512_Intel_OpenSSL | 4629 | 4532 | 2734 | 2332 | 1131 |
AVX512_Intel_Linux | 4035 | 2966 | 1567 | 1330 | 639 |
AVX512_Cloudflare | 3344 | 2485 | 1141 | 1127 | 456 |
Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14276 | 13311 | 13007 | 11086 | 8268 | 8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 | 9587 | 5954 | 7060 |
AVX512_Intel_Linux | 14116 | 12795 | 11778 | 9269 | 7735 | 6455 |
AVX512_Cloudflare | 13301 | 12018 | 11919 | 9182 | 7189 | 6726 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 6454 | 5020 | 2635 | 2602 | 1079 |
AVX512_Intel_OpenSSL | 5184 | 5799 | 2957 | 2545 | 1228 |
AVX512_Intel_Linux | 4394 | 4247 | 2235 | 1635 | 922 |
AVX512_Cloudflare | 4289 | 3851 | 1435 | 1417 | 574 |
So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark. (This also holds true when doing
the same tests on AMD Zen 4.) It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate. The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear whether it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The minimum binutils version for building the kernel is currently
2.23, which doesn't support GFNI. So the aria-avx512 build fails if an
old binutils is used.
aria-avx512 requires GFNI, so it should not be allowed to be built
with an old binutils.
Add the AS_AVX512 and AS_GFNI Kconfig symbols and use them to disable
building aria-avx512 when the assembler lacks support for these
instructions.
Fixes: c970d42001 ("crypto: x86/aria - implement aria-avx512")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The aria-avx512 implementation uses AVX-512 and GFNI. It supports
64-way parallel processing, so the byteslicing code is changed to
support 64-way parallelism, and it exports some aria-avx2 functions
such as encrypt() and decrypt().
AVX and AVX2 have 16 registers, so state has to be stored to and
loaded from memory due to the lack of registers. AVX-512 has 32
registers, so no store/load is required in the s-box layer. This
reduces the store/load overhead in the s-box layer, and the code also
becomes much simpler.
Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100:
ARIA-AVX512(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1504 cycles (1024 bytes)
tcrypt: 1 operation in 4595 cycles (4096 bytes)
tcrypt: 1 operation in 1763 cycles (1024 bytes)
tcrypt: 1 operation in 5540 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1502 cycles (1024 bytes)
tcrypt: 1 operation in 4615 cycles (4096 bytes)
tcrypt: 1 operation in 1759 cycles (1024 bytes)
tcrypt: 1 operation in 5554 cycles (4096 bytes)
ARIA-AVX2 with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption
tcrypt: 1 operation in 2003 cycles (1024 bytes)
tcrypt: 1 operation in 5867 cycles (4096 bytes)
tcrypt: 1 operation in 2358 cycles (1024 bytes)
tcrypt: 1 operation in 7295 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption
tcrypt: 1 operation in 2004 cycles (1024 bytes)
tcrypt: 1 operation in 5956 cycles (4096 bytes)
tcrypt: 1 operation in 2409 cycles (1024 bytes)
tcrypt: 1 operation in 7564 cycles (4096 bytes)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The aria-avx2 implementation uses AVX2, AES-NI, and GFNI. It supports
32-way parallel processing, so the byteslicing code is changed to
support 32-way parallelism, and it exports some aria-avx functions
such as encrypt() and decrypt().
There are two main pieces of logic, the s-box layer and the diffusion
layer. These are the same as in the aria-avx implementation, but some
instructions are exchanged because they don't support 256-bit
registers. AES-NI also doesn't support 256-bit registers, so
aesenclast and aesdeclast are used twice, like below:
vextracti128 $1, ymm0, xmm6;
vaesenclast xmm7, xmm0, xmm0;
vaesenclast xmm7, xmm6, xmm6;
vinserti128 $1, xmm6, ymm0, ymm0;
Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100:
ARIA-AVX2 with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption
tcrypt: 1 operation in 2003 cycles (1024 bytes)
tcrypt: 1 operation in 5867 cycles (4096 bytes)
tcrypt: 1 operation in 2358 cycles (1024 bytes)
tcrypt: 1 operation in 7295 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption
tcrypt: 1 operation in 2004 cycles (1024 bytes)
tcrypt: 1 operation in 5956 cycles (4096 bytes)
tcrypt: 1 operation in 2409 cycles (1024 bytes)
tcrypt: 1 operation in 7564 cycles (4096 bytes)
ARIA-AVX with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption
tcrypt: 1 operation in 2761 cycles (1024 bytes)
tcrypt: 1 operation in 9390 cycles (4096 bytes)
tcrypt: 1 operation in 3401 cycles (1024 bytes)
tcrypt: 1 operation in 11876 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption
tcrypt: 1 operation in 2735 cycles (1024 bytes)
tcrypt: 1 operation in 9424 cycles (4096 bytes)
tcrypt: 1 operation in 3369 cycles (1024 bytes)
tcrypt: 1 operation in 11954 cycles (4096 bytes)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The implementation is based on the 32-bit implementation of ARIA.
Also, the aria-avx processing steps are similar to those of
camellia-avx:
1. Byteslicing (16-way)
2. Add-round-key
3. S-box
4. Diffusion layer
Except for the s-box, all steps are the same as in the aria-generic
implementation. The s-box step is very similar to the camellia and
sm4 implementations.
There are two implementations of the s-box step. One uses AES-NI and
an affine transformation, the same approach as Camellia, sm4, and
others. The other uses GFNI. The GFNI implementation is faster than
the AES-NI one, so it is used if the running CPU supports GFNI.
There are 4 s-boxes in ARIA, and 2 of them are the same as AES's
s-boxes.
To calculate the first s-box, it just uses aesenclast and then inverts
shift_rows. Nothing more is needed because the first s-box is the same
as the AES encryption s-box.
To calculate the second s-box (the inverse of s1), it just uses
aesdeclast and then inverts shift_rows. Nothing more is needed because
the second s-box is the same as the AES decryption s-box.
To calculate the third s-box, it uses aesenclast, then an affine
transformation that combines the AES inverse affine and ARIA S2.
To calculate the last s-box, it uses aesdeclast, then an affine
transformation that combines X2 and the AES forward affine.
The optimized third and last s-box logic and the GFNI s-box logic were
implemented by Jussi Kivilinna.
The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit one. The aria-avx diffusion layer implementation is based
on the aria-generic implementation because the 8-bit implementation is
not suited to parallelization, while the 32-bit one is.
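As a hedged illustration of the first s-box trick in C intrinsics
(this is not the 16-way kernel assembly, and the helper name is made
up): aesenclast with an all-zero round key computes
SubBytes(ShiftRows(x)), and a pshufb with the inverse-ShiftRows
permutation then leaves exactly the AES S-box applied to every byte.

#include <immintrin.h>

static __m128i aria_sbox1_sketch(__m128i x)
{
	/* Byte permutation that undoes AES ShiftRows. */
	const __m128i inv_shift_rows = _mm_setr_epi8(
		0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b,
		0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03);

	/* SubBytes + ShiftRows (zero round key makes AddRoundKey a no-op). */
	x = _mm_aesenclast_si128(x, _mm_setzero_si128());

	/* Invert ShiftRows, leaving SubBytes(x), i.e. ARIA's s1. */
	return _mm_shuffle_epi8(x, inv_shift_rows);
}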
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Shorten menu titles and make them consistent:
- acronym
- name
- architecture features in parentheses
- no suffixes like "<something> algorithm", "support",
  "hardware acceleration", or "optimized"
Simplify help text descriptions, update references, and ensure that
https references are still valid.
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Move CPU-specific crypto/Kconfig entries to arch/xxx/crypto/Kconfig
and create a submenu for them under the Crypto API menu.
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>