Rack v1’s API provides a new simd:: namespace with (currently just) a float_4 type, operator overloads so you can write d = a * b + c, etc., and lots of functions that accept float_4 arguments.
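For example, here is a minimal usage sketch (in a plugin you’d typically just #include <rack.hpp>; the function name madd is mine, purely for illustration):

#include <rack.hpp>
using rack::simd::float_4;

// One expression operates on all 4 lanes at once
float_4 madd(float_4 a, float_4 b, float_4 c) {
    return a * b + c;
}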
But aren’t optimizers good enough to vectorize code already?
Well, yes. Suppose we want to add two vectors of 4 floats and store the result in the first vector:

void add4(float *x, const float *y);

The compiler does a great job! https://godbolt.org/z/uQdioo. I couldn’t write better assembly myself. (Note that I had to tell the compiler that x and y don’t overlap with restrict, but the compiler should figure this out if it inlines add4 and there is enough context to prove that x and y don’t overlap, e.g. if they are two distinct local arrays.)
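For reference, the definition behind that godbolt link is presumably close to this sketch (__restrict is the GCC/Clang spelling of restrict in C++):

void add4(float *__restrict x, const float *__restrict y) {
    // Plain scalar loop; the compiler vectorizes it into SSE
    for (int i = 0; i < 4; i++)
        x[i] += y[i];
}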
We could instead write manual SSE intrinsics: https://godbolt.org/z/STeTrI. We see that the result is exactly the same assembly (a sketch of the intrinsics version follows the next listing). For the record, here’s the version using Rack’s SIMD framework, which is prettier, especially when writing more complex algorithms:
void add4(float *x, const float *y) {
    float_4 xv = float_4::load(x);
    float_4 yv = float_4::load(y);
    xv += yv;
    xv.store(x);
}
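And here’s roughly what the manual SSE intrinsics version at the godbolt link above looks like (a sketch; the exact code at the link may differ):

#include <xmmintrin.h>

void add4(float *x, const float *y) {
    __m128 xv = _mm_loadu_ps(x);  // unaligned load of 4 floats
    __m128 yv = _mm_loadu_ps(y);
    xv = _mm_add_ps(xv, yv);      // 4 additions in one instruction
    _mm_storeu_ps(x, xv);
}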
So why should you use simd:: at all if scalar code results in the same performance? There are two reasons.
First, storing data in float_4 forces you to align elements to 16-byte (128-bit) boundaries.
If you write a convolution algorithm using float, the compiler can only fully optimize your code if the difference of the pointer values is divisible by 16. See how much scalar code is generated in https://godbolt.org/z/-ou3vG (also ignore my buffer overruns, I’m in a hurry). If you use float_4, you are forced to align all array elements to 16-byte boundaries, so I would expect such a convolution to be almost 4 times faster than an unaligned one.
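To make that concrete, here is a hypothetical sketch of a vectorized inner product, the core of a convolution (the names, and the assumption that len is a multiple of 4, are mine, not from the godbolt example):

float dot4(const float *x, const float *kernel, int len) {
    // Accumulate 4 partial sums in parallel; assumes len % 4 == 0
    float_4 sum = 0.f;
    for (int i = 0; i < len; i += 4)
        sum += float_4::load(x + i) * float_4::load(kernel + i);
    // Horizontal sum of the 4 lanes
    float partial[4];
    sum.store(partial);
    return partial[0] + partial[1] + partial[2] + partial[3];
}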
Second, using float_4 forces you to write vectorizable logic.
If you write

if (x == 0.f) {
    x = 1.f;
}

in your code, the compiler must branch to handle this case. If instead you use Rack’s SIMD framework to write

x = simd::ifelse(x == 0.f, 1.f, x);

there is no branching and no overhead from converting between scalar and vector values.
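Why is no branch needed? On SIMD vectors, a comparison like x == 0.f produces a per-lane bit mask (all ones where the condition holds), and selection is pure bitwise arithmetic. A rough sketch of how such an ifelse can be built from raw SSE intrinsics (not necessarily Rack’s exact implementation):

#include <xmmintrin.h>

__m128 ifelse4(__m128 mask, __m128 a, __m128 b) {
    // Take a's bits where the mask is set, b's bits elsewhere
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Usage: replace zero lanes of x with 1.f, branch-free
// x = ifelse4(_mm_cmpeq_ps(x, _mm_setzero_ps()), _mm_set1_ps(1.f), x);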
But maybe the compiler is still better at optimizing scalar code than I am at writing manual SIMD code.
The compiler actually optimizes your SIMD code as well! See https://godbolt.org/z/G3QLc7. The compiler successfully rewrote pow30 into its optimal form. Code written with Rack’s float_4 type gets the full benefit of the optimizer, just like scalar code, in addition to being aligned and branch-free.
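For illustration, pow30 at that link is presumably something like the following naive loop, which the optimizer can collapse into a short chain of multiplies via repeated squaring (reassociating floats may require -ffast-math):

float_4 pow30(float_4 x) {
    float_4 y = 1.f;
    for (int i = 0; i < 30; i++)
        y *= x;  // collapsed by the optimizer into far fewer multiplies
    return y;
}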
Summary
By using Rack’s SIMD float_4 vector type (or __m128) in C++, you’re not writing “manual SIMD code”. You’re using this datatype to tell the compiler to align data properly and to avoid branching in favor of branch-less bit-wise logic.