Proxmox-Port/linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-02 08:32:55 +00:00

Commit Graph

Author

SHA1

Message

Date

Eric Biggers

7d14fbc569

crypto: x86/aes - drop the avx10_256 AES-XTS and AES-CTR code

Intel made a late change to the AVX10 specification that removes support
for a 256-bit maximum vector length and enumeration of the maximum
vector length.  AVX10 will imply a maximum vector length of 512 bits.
I.e. there won't be any such thing as AVX10/256 or AVX10/512; there will
just be AVX10, and it will essentially just consolidate AVX512 features.

As a result of this new development, my strategy of providing both
*_avx10_256 and *_avx10_512 functions didn't turn out to be that useful.
The only remaining motivation for the 256-bit AVX512 / AVX10 functions
is to avoid downclocking on older Intel CPUs.  But in the case of
AES-XTS and AES-CTR, I already wrote *_avx2 code too (primarily to
support CPUs without AVX512), which performs almost as well as
*_avx10_256.  So we should just use that.

Therefore, remove the *_avx10_256 AES-XTS and AES-CTR functions and
algorithms, and rename the *_avx10_512 AES-XTS and AES-CTR functions and
algorithms to *_avx512.  Make Ice Lake and Tiger Lake use *_avx2 instead
of *_avx10_256, which they previously used.
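
The resulting selection policy can be sketched as follows.  This is only
an illustration in Python, not the kernel's glue code: the feature-flag
strings, the priority numbers, and the avoid_512bit parameter are all
hypothetical names chosen for the example.  It assumes that each
implementation is registered with a priority, and that on CPUs where
512-bit vectors are best avoided (such as Ice Lake and Tiger Lake) the
512-bit variant is demoted so that the VAES+AVX2 code is chosen instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impl:
    name: str             # illustrative implementation name
    required: frozenset   # CPU features the implementation needs
    priority: int         # higher wins; the values are made up


# Hypothetical table of the variants that remain after this change.
IMPLS = [
    Impl("aesni_avx", frozenset({"aes", "avx"}), 500),
    Impl("vaes_avx2", frozenset({"vaes", "avx2"}), 600),
    Impl("vaes_avx512", frozenset({"vaes", "avx512"}), 800),
]


def pick_impl(features, avoid_512bit=False):
    """Return the name of the best usable implementation, or None."""
    candidates = []
    for impl in IMPLS:
        if not impl.required <= set(features):
            continue  # the CPU lacks a required feature
        prio = impl.priority
        if avoid_512bit and impl.name.endswith("avx512"):
            prio = 1  # demote 512-bit code where it may cause downclocking
        candidates.append((prio, impl.name))
    return max(candidates)[1] if candidates else None
```

With every feature present, pick_impl() returns "vaes_avx512"; passing
avoid_512bit=True for the same feature set returns "vaes_avx2".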

I've left AES-GCM unchanged for now.  There is no VAES+AVX2 optimized
AES-GCM in the kernel yet, so the path forward for that is not as clear.
However, I did write a VAES+AVX2 optimized AES-GCM for BoringSSL.  So
one option is to port that to the kernel and then do the same cleanup.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

2025-04-07 13:22:27 +08:00

Eric Biggers

8c4fc9ce40

crypto: x86/aes-ctr - rewrite AESNI+AVX optimized CTR and add VAES support

Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
aes-ctr-avx-x86_64.S which follows a similar approach to
aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
AESNI+AVX.  Wire it up to the crypto API accordingly.

This greatly improves the performance of AES-CTR and AES-XCTR on
VAES-capable CPUs, with the best case being AMD Zen 5, where throughput
on long messages increases by over 230%.  Performance on
non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR
code (aesni_ctr_enc) is also kept as-is for now.  There are some slight
regressions (less than 10%) on some short message lengths on some CPUs;
these are difficult to avoid, given how the previous code was so heavily
unrolled by message length, and they are not particularly important.
Detailed performance results are given in the tables below.

Both CTR and XCTR support is retained.  The main loop remains
8-vector-wide, which differs from the 4-vector-wide main loops that are
used in the XTS and GCM code.  A wider loop is appropriate for CTR and
XCTR since they have fewer other instructions (such as vpclmulqdq) to
interleave with the AES instructions.
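
As background for why a wide loop is possible at all, the sketch below
generates the blocks that would be fed to AES in each mode.  It is a
Python illustration, not the assembly: it assumes the usual definitions,
which are not spelled out in this message, namely that CTR treats the IV
as a 128-bit big-endian counter and that XCTR XORs the IV with a
little-endian block index starting at 1.  The function names are
hypothetical, and no AES is performed.  The point is that every block
depends only on the IV and its index, so eight of them can be prepared
and encrypted independently.

```python
BLOCK_BYTES = 16  # AES block size


def ctr_counter_blocks(iv, n):
    """First n CTR input blocks: big-endian 128-bit counter, wrapping."""
    start = int.from_bytes(iv, "big")
    return [
        ((start + i) % (1 << 128)).to_bytes(BLOCK_BYTES, "big")
        for i in range(n)
    ]


def xctr_counter_blocks(iv, n, first_index=1):
    """First n XCTR input blocks: IV XOR little-endian block index."""
    iv_int = int.from_bytes(iv, "little")
    return [
        (iv_int ^ (first_index + i)).to_bytes(BLOCK_BYTES, "little")
        for i in range(n)
    ]


# One iteration of an 8-wide loop would consume eight such blocks.
eight_ctr = ctr_counter_blocks(bytes(BLOCK_BYTES), 8)
eight_xctr = xctr_counter_blocks(bytes(BLOCK_BYTES), 8)
```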

Similar to what was the case for AES-GCM, the new assembly code also has
a much smaller binary size, as it fixes the excessive unrolling by data
length and key length present in the old code.  Specifically, the new
assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
This is despite 4x as many implementations being included.

The tables below show the detailed performance results.  The tables show
percentage improvement in single-threaded throughput for repeated
encryption of the given message length; an increase from 6000 MB/s to
12000 MB/s would be listed as 100%.  They were collected by directly
measuring the Linux crypto API performance using a custom kernel module.
The tested CPUs were all server processors from Google Compute Engine,
except for Zen 5, which was a Ryzen 9 9950X desktop processor.
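
For readers converting the table entries back into ratios, the
percentage convention described above can be written as a small Python
helper.  The function names are hypothetical and the code only restates
the arithmetic; it is not the measurement module that was used.

```python
def improvement_pct(old_mbps, new_mbps):
    """Percentage improvement in throughput, as listed in the tables."""
    if old_mbps <= 0:
        raise ValueError("old throughput must be positive")
    return (new_mbps - old_mbps) / old_mbps * 100.0


def speedup_factor(pct):
    """Ratio of new to old throughput implied by a table entry."""
    return 1.0 + pct / 100.0


# The example from the text: 6000 MB/s -> 12000 MB/s is listed as 100%.
assert improvement_pct(6000, 12000) == 100.0
```

Under this convention an entry of 232% means the new code is about 3.32
times as fast as the old code, and an entry of -9% means it runs at
about 0.91 times the old throughput.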

Table 1: AES-256-CTR throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |

Table 2: AES-256-XCTR throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

2025-02-22 15:56:03 +08:00

2 Commits