What's the point of writing "manual SIMD code"?

Rack v1’s API provides a new simd:: namespace with (currently just) a float_4 type, operator overloads so you can write d = a * b + c etc., and lots of functions which accept float_4 arguments.

But aren’t optimizers good enough to vectorize code already?

Well, yes. Suppose we want to add two vectors of 4 floats and store the result in the first vector.

void add4(float *x, const float *y);

The compiler does a great job! https://godbolt.org/z/uQdioo. I couldn’t write better assembly myself. (Note that I had to tell the compiler that x and y don’t overlap with restrict, but the compiler should figure this out if it inlines add4 and has enough context to prove that x and y don’t overlap, e.g. if they are two distinct local arrays.)

We could instead write manual SSE intrinsics. https://godbolt.org/z/STeTrI The result is exactly the same assembly. For the record, here’s the version using Rack’s SIMD framework, which is prettier, especially when writing more complex algorithms.

void add4(float *x, float *y) {
	float_4 xv = float_4::load(x);
	float_4 yv = float_4::load(y);
	xv += yv;
	xv.store(x);
}

So why should you use simd:: at all if scalar code results in the same performance? There are two reasons.

Storing data in float_4 forces you to align elements to 16-byte (128-bit) boundaries.

If you write a convolution algorithm using float, the compiler can only fully optimize your code if the difference of pointer values is divisible by 16. See how much scalar code is generated in https://godbolt.org/z/-ou3vG (also ignore my buffer overruns, I’m in a hurry).

If you use float_4, you are forced to align all array elements to 16 bytes, so I would expect such a convolution to be almost 4 times faster than an unaligned one.

Using float_4 forces you to write vectorizable logic.

If you write

if (x == 0.f) {
	x = 1.f;
}

in your code, the compiler must branch to handle this case. If instead you use Rack’s SIMD framework to write

x = simd::ifelse(x == 0.f, 1.f, x);

there is no branching and no overhead for converting to/from scalar/vector values.

But maybe the compiler is still better at optimizing scalar code than I am at writing manual SIMD code.

The compiler actually optimizes your SIMD code as well! See https://godbolt.org/z/G3QLc7. The compiler successfully rewrote pow30 to its optimal form. When you compile code written with Rack’s float_4 type, you get the full benefit of the optimizer, just as with scalar code, in addition to your data being aligned and your code having no branches.

Summary

By using Rack’s SIMD float_4 vector type (or __m128 directly) in C++, you’re not really writing “manual SIMD code”. You’re using the datatype to tell the compiler to align data properly and to replace branches with branchless bit-wise logic.


It’s an idea I don’t know much about - let the compiler vectorize your stuff. I like that idea. I’ve done a fair amount of the true “manual SIMD” (__m128d, etc…) and I get your point. Probably, though, it depends on what you are doing? Like my four parallel independent low-pass filters. Can you really convince the compiler to vectorize that for you?

Yes, the compiler will optimize that code if you use __m128 or Rack’s float_4. That’s my point in my last paragraph. It will even optimize scalar code using float to the “ideal” code if 1) your data is fully aligned and 2) your code has no branches.

I’ve added a Summary section which clarifies this main point.

Would it make sense to already enforce alignment in the engine::Port class?

Arrays returned by Port::getVoltages(0) are guaranteed to be aligned to 32-bytes (in case you use AVX).

OK. Great. Just saw it myself. So can we then use aligned moves to load float_4 variables?

Are you planning to extend the float_4 to a float_16? I am currently playing with a class which does that, and also makes sure that channels beyond *.getChannels() are zeroed out, as it is important for mixers with different numbers of channels on their various inputs.

This is done for you with float_4::load(). _mm_load_ps() isn’t any faster when data is actually aligned. Just added a note to the source.

No, not until all users have AVX512 :) Maybe in Rack v4 or so, I’ll add float_8 with AVX.
I’m not interested in making an abstraction class for “looping vectors of SIMD vectors”.
However, I should add a way to mask out certain elements of float_4, but I haven’t thought of an API for it yet.

OK. Thanks.

As for the float_16, I was not thinking of AVX, but simply hiding the loop over the components of float_4 x[4] inside the functions.

I think I have a working way for the masking, but I am wondering where we would best store the mask itself.

What I have so far is, in the constructor of the module:

	simd::float_4 mask[4];

Module() {
 ...
	// All-ones mask: comparing zero with zero sets every bit.
	__m128i tmp = _mm_cmpeq_epi32(_mm_setzero_si128(), _mm_setzero_si128());

	for (int i = 0; i < 4; i++) {
		mask[3 - i] = simd::float_4(_mm_castsi128_ps(tmp));
		tmp = _mm_srli_si128(tmp, 4);
	}
...
}

and then:

for (i = 0; i < channels1 / 4; i++)
	out[i] += in1[i];  // add only "real" channels
out[i] += simd::float_4(_mm_and_ps(in1[i].v, mask[channels1 - 4 * i].v));  // zero out spurious channels

Right, that’s what I meant by “looping vectors of SIMD vectors”.

I’ve added an int32_4 type, so masking is a bit easier. You now don’t need to use __m128i types directly.

Also, you can write that last line as simply

out[i] += in1[i] & mask[channels1-4*i];

I need to add some more wrappers around __m128i types now, so you can avoid loading mask[...] from a memory location.


I also tried some SIMD optimization and need a mask for ifelse of the connected input channels. I got this working, but it looks pretty lame to me. How to do it more elegantly?

simd::float_4 mask[4];
for (int c = 0; c < 4; c++) {
  mask[c] = simd::float_4::mask();
}
for (int c = inputs[INPUT].getChannels(); c < 16; c++) {
  mask[c / 4].s[c % 4] = 0.f;
}

I just added simd::rightByteShift, so the following should work better.

int b = c + 4 - channels;
if (b > 0) {
	int32_4 mask = rightByteShift(int32_4::mask(), 4 * b);
	out = out & float_4::cast(mask);
}

Edit: Removed. Need to think about the API more.

Hm, am I doing something wrong? It only compiles without error when I replace 4 * b with an integer constant.

C:/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/8.2.1/include/emmintrin.h:1187:10: error: the last argument must be an 8-bit immediate
   return (__m128i)__builtin_ia32_psrldqi128 (__A, __N * 8);

solved…