Optimize the AVX-512 version of _compute_first_set_of_tweaks by using
vectorized shifts to compute the first vector of tweak blocks, and by
using byte-aligned shifts when multiplying by x^8.
AES-XTS performance on AMD Ryzen 9 9950X (Zen 5) improves by about 2%
for 4096-byte messages or 6% for 512-byte messages. AES-XTS performance
on Intel Sapphire Rapids improves by about 1% for 4096-byte messages or
3% for 512-byte messages. Code size decreases by 75 bytes which
outweighs the increase in rodata size of 16 bytes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Intel made a late change to the AVX10 specification that removes support
for a 256-bit maximum vector length and enumeration of the maximum
vector length. AVX10 will imply a maximum vector length of 512 bits.
I.e. there won't be any such thing as AVX10/256 or AVX10/512; there will
just be AVX10, and it will essentially just consolidate AVX512 features.
As a result of this new development, my strategy of providing both
*_avx10_256 and *_avx10_512 functions didn't turn out to be that useful.
The only remaining motivation for the 256-bit AVX512 / AVX10 functions
is to avoid downclocking on older Intel CPUs. But in the case of
AES-XTS and AES-CTR, I already wrote *_avx2 code too (primarily to
support CPUs without AVX512), which performs almost as well as
*_avx10_256. So we should just use that.
Therefore, remove the *_avx10_256 AES-XTS and AES-CTR functions and
algorithms, and rename the *_avx10_512 AES-XTS and AES-CTR functions and
algorithms to *_avx512. Make Ice Lake and Tiger Lake use *_avx2 instead
of *_avx10_256 which they previously used.
I've left AES-GCM unchanged for now. There is no VAES+AVX2 optimized
AES-GCM in the kernel yet, so the path forward for that is not as clear.
However, I did write a VAES+AVX2 optimized AES-GCM for BoringSSL. So
one option is to port that to the kernel and then do the same cleanup.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As with the other AES modes I've implemented, I've received interest in
my AES-XTS assembly code being reused in other projects. Therefore,
change the license to Apache-2.0 OR BSD-2-Clause like what I used for
AES-GCM. Apache-2.0 is the license of OpenSSL and BoringSSL.
Note that it is difficult to *directly* share code between the kernel,
OpenSSL, and BoringSSL for various reasons such as perlasm vs. plain
asm, Windows ABI support, different divisions of responsibility between
C and asm in each project, etc. So whether that will happen instead of
just doing ports is still TBD. But this dual license should at least
make it possible to port changes between the projects.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reduce latency by taking advantage of the property vaesenclast(key, a) ^
b == vaesenclast(key ^ b, a), like I did in the AES-GCM code.
Also replace a vpand and vpxor with a vpternlogd.
On AMD Zen 5 this improves performance by about 3%. Intel performance
remains about the same, with a 0.1% improvement being seen on Icelake.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer immediates of -128 to 128, since the former fits in a signed
byte, saving 3 bytes per instruction. Also prefer VEX-coded
instructions to EVEX where this is easy to do.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The AES-XTS assembly code currently treats the length as signed, since
this saves a few instructions in the loop compared to treating it as
unsigned. Therefore update the type to make this clear. (It is not
actually passed any values larger than PAGE_SIZE.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Improve some of the comments in aes-xts-avx-x86_64.S.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since aes-xts-avx-x86_64.S contains multiple functions, move the
register aliases for the parameters and local variables of the XTS
update function into the macro that generates that function. Then add
register aliases to aes_xts_encrypt_iv() to improve readability there.
This makes aes-xts-avx-x86_64.S consistent with the GCM assembly files.
No change in the generated code.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Use .irp instead of repeating code.
No change in the generated code.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
x86_64 has the "interesting" property that the instruction size is
generally a bit shorter for instructions that operate on the 32-bit (or
less) part of registers, or registers that are in the original set of 8.
This patch adjusts the AES-XTS code to take advantage of that property
by changing the LEN parameter from size_t to unsigned int (which is all
that's needed and is what the non-AVX implementation uses) and using the
%eax register for KEYLEN.
This decreases the size of aes-xts-avx-x86_64.o by 1.2%.
Note that changing the kmovq to kmovd was going to be needed anyway to
make the AVX10/256 code really work on CPUs that don't support 512-bit
vectors (since the AVX10 spec says that 64-bit opmask instructions will
only be supported on processors that support 512-bit vectors).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- For conditionally subtracting 16 from LEN when decrypting a message
whose length isn't a multiple of 16, use the cmovnz instruction.
- Fold the addition of 4*VL to LEN into the sub of VL or 16 from LEN.
- Remove an unnecessary test instruction.
This results in slightly shorter code, both source and binary.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Decrease the amount of code specific to the different AES variants by
"right-aligning" the sequence of round keys, and for AES-128 and AES-192
just skipping irrelevant rounds at the beginning.
This shrinks the size of aes-xts-avx-x86_64.o by 13.3%, and it improves
the efficiency of AES-128 and AES-192. The tradeoff is that for AES-256
some additional not-taken conditional jumps are now executed. But these
are predicted well and are cheap on x86.
Note that the ARMv8 CE based AES-XTS implementation uses a similar
strategy to handle the different AES variants.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When encrypting a message whose length isn't a multiple of 16 bytes,
encrypt the last full block in the main loop. This works because only
decryption uses the last two tweaks in reverse order, not encryption.
This improves the performance of decrypting messages whose length isn't
a multiple of the AES block length, shrinks the size of
aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test
and a not-taken conditional jump) when encrypting a message whose length
*is* a multiple of the AES block length.
While it's not super useful to optimize for ciphertext stealing given
that it's rarely needed in practice, the other two benefits mentioned
above make this optimization worthwhile.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Access the AES round keys using offsets -7*16 through 7*16, instead of
0*16 through 14*16. This allows VEX-encoded instructions to address all
round keys using 1-byte offsets, whereas before some needed 4-byte
offsets. This decreases the code size of aes-xts-avx-x86_64.o by 4.2%.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Make the non-AVX implementation of AES-XTS (xts-aes-aesni) use the new
glue code that was introduced for the AVX implementations of AES-XTS.
This reduces code size, and it improves the performance of xts-aes-aesni
due to the optimization for messages that don't span page boundaries.
This required moving the new glue functions higher up in the file and
allowing the IV encryption function to be specified by the caller.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL
extensions. This implementation uses zmm registers to operate on four
AES blocks at a time. The assembly code is instantiated using a macro
so that most of the source code is shared with other implementations.
To avoid downclocking on older Intel CPU models, an exclusion list is
used to prevent this 512-bit implementation from being used by default
on some CPU models. They will use xts-aes-vaes-avx10_256 instead. For
now, this exclusion list is simply coded into aesni-intel_glue.c. It
may make sense to eventually move it into a more central location.
xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on
some current CPUs. E.g., on AMD Zen 4, AES-256-XTS decryption
throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte
inputs. On Intel Sapphire Rapids, AES-256-XTS decryption throughput
increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs.
Future CPUs may provide stronger 512-bit support, in which case a larger
benefit should be seen.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL
extensions. This implementation avoids using zmm registers, instead
using ymm registers to operate on two AES blocks at a time. The
assembly code is instantiated using a macro so that most of the source
code is shared with other implementations.
This is the optimal implementation on CPUs that support VAES and AVX512
but where the zmm registers should not be used due to downclocking
effects, for example Intel's Ice Lake. It should also be the optimal
implementation on future CPUs that support AVX10/256 but not AVX10/512.
The performance is slightly better than that of xts-aes-vaes-avx2, which
uses the same 256-bit vector length, due to factors such as being able
to use ymm16-ymm31 to cache the AES round keys, and being able to use
the vpternlogd instruction to do XORs more efficiently. For example, on
Ice Lake, the throughput of decrypting 4096-byte messages with
AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with
xts-aes-vaes-avx2. While this is a small improvement, it is
straightforward to provide this implementation (xts-aes-vaes-avx10_256)
as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512
anyway, due to the way the _aes_xts_crypt macro is structured.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with
the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10.
This implementation uses ymm registers to operate on two AES blocks at a
time. The assembly code is instantiated using a macro so that most of
the source code is shared with other implementations.
This is the optimal implementation on AMD Zen 3. It should also be the
optimal implementation on Intel Alder Lake, which similarly supports
VAES but not AVX512. Comparing to xts-aes-aesni-avx on Zen 3,
xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput
with 4096-byte messages, or 23% higher with 512-byte messages.
A large improvement is also seen with CPUs that do support AVX512 (e.g.,
98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte
messages), though the following patches add AVX512 optimized
implementations to get a bit more performance on those CPUs.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that
have the AES-NI and AVX extensions but not VAES. It's similar to the
existing xts-aes-aesni in that uses xmm registers to operate on one AES
block at a time. It differs from xts-aes-aesni in the following ways:
- It uses the VEX-coded (non-destructive) instructions from AVX.
This improves performance slightly.
- It incorporates some additional optimizations such as interleaving the
tweak computation with AES en/decryption, handling single-page
messages more efficiently, and caching the first round key.
- It supports only 64-bit (x86_64).
- It's generated by an assembly macro that will also be used to generate
VAES-based implementations.
The performance improvement over xts-aes-aesni varies from small to
large, depending on the CPU and other factors such as the size of the
messages en/decrypted. For example, the following increases in
AES-256-XTS decryption throughput are seen on the following CPUs:
| 4096-byte messages | 512-byte messages |
----------------------+--------------------+-------------------+
Intel Skylake | 6% | 31% |
Intel Cascade Lake | 4% | 26% |
AMD Zen 1 | 61% | 73% |
AMD Zen 2 | 36% | 59% |
(The above CPUs don't support VAES, so they can't use VAES instead.)
While this isn't as large an improvement as what VAES provides, this
still seems worthwhile. This implementation is fairly easy to provide
based on the assembly macro that's needed for VAES anyway, and it will
be the best implementation on a large number of CPUs (very roughly, the
CPUs launched by Intel and AMD from 2011 to 2018).
This makes the existing xts-aes-aesni *mostly* obsolete. For now, leave
it in place to support 32-bit kernels and also CPUs like Intel Westmere
that support AES-NI but not AVX. (We could potentially remove it anyway
and just rely on the indirect acceleration via ecb-aes-aesni in those
cases, but that change will need to be considered separately.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an assembly file aes-xts-avx-x86_64.S which contains a macro that
expands into AES-XTS implementations for x86_64 CPUs that support at
least AES-NI and AVX, optionally also taking advantage of VAES,
VPCLMULQDQ, and AVX512 or AVX10.
This patch doesn't expand the macro at all. Later patches will do so,
adding each implementation individually so that the motivation and use
case for each individual implementation can be fully presented.
The file also provides a function aes_xts_encrypt_iv() which handles the
encryption of the IV (tweak), using AES-NI and AVX.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>