tolower() with AVX-512
2024-07-28 21:45
https://dotat.at/@/2024-07-28-tolower-avx512.html
A couple of years ago I wrote about tolower() in bulk at speed using SWAR tricks. A couple of days ago I was intrigued by Olivier Giniaux's article about unsafe read beyond of death, an optimization for handling small strings with SIMD instructions, for a fast hash function written in Rust.
I've long been annoyed that SIMD instructions can easily eat short strings whole, but it's irritatingly difficult to transfer short strings between memory and vector registers. Olivier's post caught my eye because it seemed like a fun way to avoid the problem, at least for loads. (Stores remain awkward!)
Actually, to be frank, Olivier nerdsniped me.
signs of hope
Reading more around the topic, I learned that some SIMD instruction sets do, in fact, have useful masked loads and stores that are suitable for string processing, that is, they have byte granularity. They are:
ARM SVE, which is available on recent big-ARM Neoverse cores, such as Amazon Graviton, but not Apple Silicon.
AVX-512-BW, the bytes and words extension, which is available on recent AMD Zen processors. AVX-512 is a complicated mess of extensions that might or might not be available; support on Intel is particularly random.
I have an AMD Zen 4 box, so I thought I would try a little AVX-512-BW.
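A quick way to find out whether your own x86 machine has the extension, before committing to these intrinsics, is the compilers' built-in runtime CPU detection. GCC and Clang both accept the "avx512bw" feature name; a minimal check looks like this:

#include <stdio.h>

int main(void) {
    // ask the compiler's runtime CPU detection for AVX-512-BW
    if (__builtin_cpu_supports("avx512bw"))
        printf("AVX-512-BW is available\n");
    else
        printf("no AVX-512-BW on this machine\n");
    return 0;
}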
tolower64()
Using the Intel intrinsics guide I wrote a basic tolower() function that can munch 64 bytes at once. Top tip: You can use * as a wildcard in the search box, so I made heavy use of mm512*epi8 to find byte-wise AVX-512 functions (epi8 is an obscure alias for byte).
First, we fill a few registers with 64 copies of some handy bytes.
We need the letters A and Z:
__m512i A = _mm512_set1_epi8('A');
__m512i Z = _mm512_set1_epi8('Z');
We need a number to add to uppercase letters to make them lowercase:
__m512i to_lower = _mm512_set1_epi8('a' - 'A');
We compare our input characters c with A and Z. The result of each comparison is a 64-bit mask which has bits set for the bytes where the comparison is true:
__mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
__mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);
If it's greater than or equal to A, and less than or equal to Z, then it is upper case. (AVX mask registers have names beginning with k.)
__mmask64 is_upper = _kand_mask64(ge_A, le_Z);
Finally, we do a masked add. We pass c twice: bytes from the first c are copied to the result when is_upper is false, and when is_upper is true the result is c + to_lower.
return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
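Putting those pieces together, the whole kernel is just a few lines. Here are the same snippets assembled into one function, so it compiles standalone (with -mavx512bw):

#include <immintrin.h>

// convert any ASCII uppercase letters in 64 bytes to lowercase
static inline __m512i tolower64(__m512i c) {
    __m512i A = _mm512_set1_epi8('A');
    __m512i Z = _mm512_set1_epi8('Z');
    __m512i to_lower = _mm512_set1_epi8('a' - 'A');
    __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
    __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);
    __mmask64 is_upper = _kand_mask64(ge_A, le_Z);
    return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
}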
masked load and store
The tolower64() kernel in the previous section needs to be wrapped up in more convenient functions, such as copying a string while converting it to lower case.
For long strings, the bulk of the work uses unaligned vector load and store instructions:
__m512i src_vec = _mm512_loadu_epi8(src_ptr);
__m512i dst_vec = tolower64(src_vec);
_mm512_storeu_epi8(dst_ptr, dst_vec);
Small strings and the stub end of long strings use masked unaligned loads and stores. This is the magic! This is the reason I wrote this blog post!
The mask has its lowest len bits set (its first len bits in little-endian order); for example, when len is 3 the mask is 0x7, which selects just the first three bytes. I wrote these two lines with perhaps more ceremony than required, but I thought it was helpful to indicate that the mask is not any old 64-bit integer: it has to be loaded into one of the SIMD unit's mask registers. (Note that the shift means this only works when len is between 1 and 64: a shift by 64 would be undefined behaviour.)
uint64_t len_bits = (~0ULL) >> (64 - len);
__mmask64 len_mask = _cvtu64_mask64(len_bits);
The load and store look fairly similar to the full-width versions, but with the mask stuff added. The z in maskz means zero the destination register when the mask is clear, as opposed to copying from another register (like in mask_add above).
__m512i src_vec = _mm512_maskz_loadu_epi8(len_mask, src_ptr);
__m512i dst_vec = tolower64(src_vec);
_mm512_mask_storeu_epi8(dst_ptr, len_mask, dst_vec);
That's the essence of it: you can see the complete version of copytolower64() in my git repository.
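For a sense of how the pieces fit together, here's a minimal sketch of my own, assembled from the snippets above and the tolower64() kernel, so the version in the repository may differ in its details. It assumes len is at least 1, because of the shift in the mask calculation:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void copytolower64(char *dst, const char *src, size_t len) {
    // bulk of a long string: whole 64-byte vectors
    while (len > 64) {
        __m512i src_vec = _mm512_loadu_epi8(src);
        _mm512_storeu_epi8(dst, tolower64(src_vec));
        src += 64; dst += 64; len -= 64;
    }
    // the remaining 1..64 bytes (or a whole short string), via masks
    uint64_t len_bits = (~0ULL) >> (64 - len);
    __mmask64 len_mask = _cvtu64_mask64(len_bits);
    __m512i src_vec = _mm512_maskz_loadu_epi8(len_mask, src);
    _mm512_mask_storeu_epi8(dst, len_mask, tolower64(src_vec));
}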
benchmarking
To see how well it works, I benchmarked several similar functions. Here's a chart of the results, compiled with Clang 16 on Debian 11, and run on an AMD Ryzen 9 7950X.
The pink tolower64 line is the code described in this blog post. It is consistently near the fastest of all the functions under test. (It drops a little at 65 bytes long, where it spills into a second vector.)

The green copybytes64 line is a version of memcpy using AVX-512 in a similar manner to tolower64. It is (maybe surprisingly) not much faster. I had to compile copybytes64 with Clang 11 because more recent versions are able to recognise what it does and rewrite it completely.

The orange copybytes1 line is a byte-by-byte version of memcpy, again compiled with Clang 11. It illustrates that Clang 11 had relatively poor autovectorizer heuristics and was pretty bad for the last less-than-256 bytes of a string.

The very slow red tolower line calls the standard tolower() from <ctype.h> to provide a baseline.

The purple tolower1 line is a simple byte-by-byte version of tolower() compiled with Clang 16. It shows that Clang 16 has a much better autovectorizer than Clang 11, but it is slower and much more complicated than my hand-written version.

The brown tolower8 line is the SWAR tolower() from my previous blog post. Clang valiantly tries to autovectorize it, but the result is not great because the function is too complicated. (It has the Clang-11-style 256-byte performance cliffs despite being compiled with Clang 16.)

The blue memcpy line calls glibc's memcpy. There's something curious going on here: it starts off fast but drops off to about half the speed of copybytes64. Dunno why!
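For reference, tolower1 is presumably little more than a loop like this (my guess at its shape, not the exact code from the repository); it's the sort of code the autovectorizer has to recognise and rewrite to get vector performance:

#include <stddef.h>

void tolower1(char *dst, const char *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (c >= 'A' && c <= 'Z')   // ASCII uppercase?
            c += 'a' - 'A';         // then make it lowercase
        dst[i] = c;
    }
}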
conclusion
So, AVX-512-BW is very nice indeed for working with strings, especially short strings. On Zen 4 it's very fast, and the intrinsic functions are reasonably easy to use.
The most notable thing is AVX-512-BW's smooth performance: there's very little sign of the performance troughs that the autovectorizer suffers from as it shifts to scalar code for the scrag ends of strings.
I don't have convenient access to an ARM box with SVE support, so I have not investigated it in detail. It'll be interesting to see how well SVE works for short strings.
I would like both of these instruction set extensions to be much more widely available. They should improve the performance of string handling tremendously.