https://dotat.at/@/2025-06-08-floats.html
A couple of years ago I wrote about random floating point numbers. In that article I was mainly concerned about how neat the code is, and I didn't pay attention to its performance.
Recently, a comment from Oliver Hunt and a blog post from Alisa Sireneva prompted me to wonder if I made an unwarranted assumption. So I wrote a little benchmark, which you can find in pcg-dxsm.git.
(Note 2025-06-09: I've edited this post substantially after discovering some problems with the results.)
recap
Briefly, there are two basic ways to convert a random integer to a floating point number between 0.0 and 1.0:
* Use bit fiddling to construct an integer whose format matches a float between 1.0 and 2.0; this is the same span as the result but with a simpler exponent. Bitcast the integer to a float and subtract 1.0 to get the result.
* Shift the integer down to the same range as the mantissa, convert to float, then multiply by a scaling factor that reduces it to the desired range. This produces one more bit of randomness than the bithacking conversion.
(There are other less basic ways.)
code
The double precision code for the two kinds of conversion is below. (Single precision is very similar so I'll leave it out.)
It's mostly as I expected, but there are a couple of ARM instructions that surprised me.
bithack
The bithack function looks like:
double bithack52(uint64_t u) {
    // set the biased exponent to 1023 (i.e. 2^0) and fill the mantissa
    // with the top 52 random bits, giving a double in [1.0, 2.0)
    u = ((uint64_t)(1023) << 52) | (u >> 12);
    return (bitcast(double, u) - 1.0);
}
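The bitcast macro is not defined in this excerpt; it just reinterprets the integer's bits as a double. A minimal sketch of one way to write it, assuming the GCC/Clang statement expression extension (this is illustrative, not necessarily the definition used in pcg-dxsm.git):

#include <string.h>

// type-pun via memcpy, which avoids strict-aliasing problems and which
// compilers optimise down to a single register move
#define bitcast(type, value) __extension__ ({                \
        type bitcast_out;                                    \
        memcpy(&bitcast_out, &(value), sizeof(bitcast_out)); \
        bitcast_out;                                         \
    })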
The bithack52 function translates fairly directly to amd64 like this:
bithack52:
    shr     rdi, 12
    movabs  rax, 0x3ff0000000000000
    or      rax, rdi
    movq    xmm0, rax
    addsd   xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad   0xbff0000000000000
On arm64 the shift-and-or becomes one bfxil instruction (which is a kind of bitfield move), and the constant -1.0 is encoded more briefly. Very neat!
bithack52:
    mov     x8, #0x3ff0000000000000
    fmov    d0, #-1.00000000
    bfxil   x8, x0, #12, #52
    fmov    d1, x8
    fadd    d0, d1, d0
    ret
multiply
The shift-convert-multiply function looks like this:
double multiply53(uint64_t u) {
    // keep the top 53 random bits and scale them down to [0.0, 1.0)
    return ((double)(u >> 11) * 0x1.0p-53);
}
It translates directly to amd64 like this:
multiply53:
    shr      rdi, 11
    cvtsi2sd xmm0, rdi
    mulsd    xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad    0x3ca0000000000000
GCC and earlier versions of Clang produce the following arm64 code, which is similar though it requires more faff to get the constant into the right register.
multiply53:
    lsr     x8, x0, #11
    mov     x9, #0x3ca0000000000000
    ucvtf   d0, x8
    fmov    d1, x9
    fmul    d0, d0, d1
    ret
Recent versions of Clang produce this astonishingly brief two-instruction translation: apparently you can convert fixed-point to floating point in one instruction, which gives us the power-of-two scale factor for free!
multiply53:
    lsr     x8, x0, #11
    ucvtf   d0, x8, #53
    ret
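The single-precision versions are left out above, but the benchmark below covers them too (the 23 and 24 bit rows). For reference, a rough sketch of what the float equivalents might look like, reusing the same bitcast helper; these are my own guesses at the code, not copied from pcg-dxsm.git:

float bithack23(uint32_t u) {
    // biased exponent 127 (i.e. 2^0) plus the top 23 random bits
    // gives a float in [1.0f, 2.0f)
    u = ((uint32_t)(127) << 23) | (u >> 9);
    return (bitcast(float, u) - 1.0f);
}

float multiply24(uint32_t u) {
    // keep the top 24 random bits and scale them down to [0.0f, 1.0f)
    return ((float)(u >> 8) * 0x1.0p-24f);
}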
benchmark
My benchmark has 2 x 2 x 2 tests:

* bithacking vs multiplying
* 32 bit vs 64 bit
* sequential integers vs random integers
I ran the benchmark on my Apple M1 Pro and my AMD Ryzen 7950X.
These functions are very small and work entirely in registers so it has been tricky to measure them properly.
To prevent the compiler from inlining and optimizing the benchmark loop to nothing, the functions are compiled in a separate translation unit from the test harness. This is not enough to get plausible measurements because the CPU overlaps successive iterations of the loop, so we also use fence instructions.
On arm64, a single ISB (instruction synchronization barrier) in the loop is enough to get reasonable measurements.
I have not found an equivalent of ISB on amd64, so I'm using MFENCE. It isn't effective unless I pass the argument and return values via pointers (because it's a memory fence) and place MFENCE instructions just before reading the argument and just after writing the result.
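To make the amd64 fencing concrete, here is a rough sketch of how such a wrapper might be arranged, assuming GCC/Clang inline assembly; the names and structure are illustrative rather than the actual harness in pcg-dxsm.git:

#include <stdint.h>

double bithack52(uint64_t u);   // the conversion under test, defined elsewhere

// argument and result go through memory, and MFENCE brackets the work,
// so the CPU cannot overlap the memory traffic of successive iterations
void bench_bithack52(const uint64_t *arg, double *ret) {
    __asm__ volatile ("mfence" ::: "memory");   // just before reading the argument
    *ret = bithack52(*arg);
    __asm__ volatile ("mfence" ::: "memory");   // just after writing the result
}

On arm64 the two fences would be replaced by a single __asm__ volatile ("isb") in the benchmark loop.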
results
In the table below, the leftmost column is the number of random bits; "old" is arm64 code from an older Clang, "arm" is arm64 from a newer Clang, and "amd" is amd64 from GCC.
The first line is a baseline do-nothing function, showing the overheads of the benchmark loop, function call, load argument, store return, and fences.
The upper half measures sequential numbers, the lower half random numbers. The times are nanoseconds per operation.
bits     old     arm     amd
  00   21.44   21.41   21.42

  23   24.28   24.31   22.19
  24   25.24   24.31   22.94
  52   24.31   24.28   21.98
  53   25.32   24.35   22.25

  23   25.59   25.56   22.86
  24   26.55   25.55   23.03
  52   27.83   27.81   23.93
  53   28.57   27.84   25.01
The times vary a little from run to run but the difference in speed of the various loops is reasonably consistent.
The numbers on arm64 are reasonably plausible. The most notable thing is that the "old" multiply conversion is about 3 or 4 clock cycles slower than the bithacking conversion, but with a newer compiler that can eliminate the multiply, it's the same speed.
On amd64 the multiply conversion is about 1 or 2 clock cycles slower than the bithacking conversion.
conclusion
The folklore says that bithacking floats is faster than normal integer-to-float conversion, and my results generally agree with that, except on arm64 with a good compiler. It would be interesting to compare other CPUs to get a better idea of when the folklore is right or wrong -- or whether any CPUs perform the other way round!