https://dotat.at/@/2025-02-13-pcg32-simd.html
One of the neat things about the PCG random number generator by Melissa O'Neill is its use of instruction-level parallelism (ILP): the PCG state update can run in parallel with its output permutation.
However, PCG only has a limited amount of ILP, about 3 instructions. Its overall speed is limited by the rate at which a CPU can run a sequence where the output of one multiply-add feeds into the next multiply-add.
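To make the dependency chain concrete, here is a minimal sketch of one pcg32 step, following the structure of the reference PCG-XSH-RR 64/32 generator. The names `pcg32_t`, `pcg32_init` and `pcg32_next` are illustrative choices for this sketch rather than anything taken from the post or the benchmark code.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier of the reference pcg32 (PCG-XSH-RR 64/32). */
#define PCG32_MUL 6364136223846793005ULL

typedef struct { uint64_t state, inc; } pcg32_t;

/* Illustrative helper: the LCG increment must be odd. */
static pcg32_t pcg32_init(uint64_t state, uint64_t inc) {
    assert(inc & 1);
    pcg32_t rng = { state, inc };
    return rng;
}

static uint32_t pcg32_next(pcg32_t *rng) {
    uint64_t old = rng->state;

    /* Serial chain: each new state needs the previous state, so this
     * multiply-add cannot start until the one before it has finished. */
    rng->state = old * PCG32_MUL + rng->inc;

    /* Output permutation: reads only the old state, so a CPU can
     * overlap it with the multiply-add above. */
    uint32_t xorshifted = (uint32_t)(((old >> 18) ^ old) >> 27);
    uint32_t rot = (uint32_t)(old >> 59);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
```

The xorshift and rotate lines are the part that can run alongside the state update; the multiply-add is the part that has to wait for its predecessor.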
... Or is it?
With some linear algebra and some AVX512, I can generate random numbers from a single instance of pcg32 at 200 Gbit/s on a single core. This is the same sequence of random numbers generated in the same order as normal pcg32, but more than 4x faster.
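The linear algebra in question can be illustrated with a small scalar sketch. One LCG step is the affine map s -> a*s + c (mod 2^64), and composing affine maps gives another affine map, so the state j steps ahead is mul[j]*s + add[j] for constants that depend only on a and c. Each lane can then compute its own state directly from the shared starting state, with no lane waiting on another. This is only an assumption-laden sketch of that general idea in plain C, not the AVX512 code from the benchmark; `LANES`, `pcg32_jump_t`, `pcg32_fill` and the other names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PCG32_MUL 6364136223846793005ULL
#define LANES 8  /* hypothetical: eight 64-bit lanes fill one 512-bit register */

typedef struct { uint64_t state, inc; } pcg32_t;

/* pcg32 output permutation (XSH-RR) applied to one state value. */
static uint32_t pcg32_output(uint64_t s) {
    uint32_t xorshifted = (uint32_t)(((s >> 18) ^ s) >> 27);
    uint32_t rot = (uint32_t)(s >> 59);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

/* Ordinary one-at-a-time pcg32, kept for comparison. */
static uint32_t pcg32_next(pcg32_t *rng) {
    uint64_t old = rng->state;
    rng->state = old * PCG32_MUL + rng->inc;
    return pcg32_output(old);
}

/* Jump constants: j steps map s to mul[j] * s + add[j] (mod 2^64). */
typedef struct { uint64_t mul[LANES + 1], add[LANES + 1]; } pcg32_jump_t;

static void pcg32_jump_init(pcg32_jump_t *j, uint64_t inc) {
    assert(inc & 1);  /* the LCG increment must be odd */
    j->mul[0] = 1;    /* zero steps: the identity map */
    j->add[0] = 0;
    for (int i = 1; i <= LANES; i++) {
        /* One more step: a*(m*s + d) + c = (a*m)*s + (a*d + c). */
        j->mul[i] = j->mul[i - 1] * PCG32_MUL;
        j->add[i] = j->add[i - 1] * PCG32_MUL + inc;
    }
}

/* Produce LANES outputs; every lane depends only on the starting state. */
static void pcg32_fill(pcg32_t *rng, const pcg32_jump_t *j, uint32_t out[LANES]) {
    uint64_t s = rng->state;
    for (int i = 0; i < LANES; i++)
        out[i] = pcg32_output(j->mul[i] * s + j->add[i]);
    rng->state = j->mul[LANES] * s + j->add[LANES];  /* skip ahead LANES steps */
}

/* Returns 1 if `blocks` blocks agree with the one-at-a-time generator. */
static int pcg32_fill_matches_scalar(uint64_t seed, uint64_t inc, int blocks) {
    pcg32_t scalar = { seed, inc | 1 }, wide = scalar;
    pcg32_jump_t jump;
    uint32_t out[LANES];
    pcg32_jump_init(&jump, wide.inc);
    while (blocks-- > 0) {
        pcg32_fill(&wide, &jump, out);
        for (int i = 0; i < LANES; i++)
            if (out[i] != pcg32_next(&scalar)) return 0;
    }
    return scalar.state == wide.state;
}
```

Calling `pcg32_fill_matches_scalar(42, 54, 1000)` checks that the lane-wise version yields the same numbers in the same order as the plain generator. The only serial dependency left is one multiply-add per block of `LANES` outputs instead of one per output.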
You can look at the benchmark in my pcg-dxsm repository.