Condition phase wrapping and summing with SIMD

q8fuel · July 9, 2020, 1:24pm

Hi there,

I have two questions about SIMD:

(1) Is it possible to do an if/else for each vector item? Right now I do this:

for (int v = 0; v < 4; v++) {
	if (mPhase4[v] >= 1.0f) {
		mPhase4[v] -= 1.0f;
	}
}

but that’s not simd. Tried this:

rack::simd::ifelse(mPhase4 >= 1.0f, mPhase4 -= 1.0f, mPhase4); but obviously it doesn’t work.

(2) Is it possible to sum “all elements in a vector” to a single value? This is my actual code:

for (int v = 0; v < 4; v++) {
	output += sineOutput4[v];
}

but again: that’s not simd

Thanks

marc_boule · July 9, 2020, 1:49pm

For the first one, I think you might need to write it like this:

mPhase4  = rack::simd::ifelse(mPhase4 >= 1.0f, mPhase4 - 1.0f, mPhase4);
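To see why the result has to be assigned back, here is a minimal sketch of a per-lane select written with plain SSE intrinsics. The names select_ps and wrap_once are made up for illustration, and this is only an assumption about how a function like ifelse can be built, not a copy of Rack's source. The select returns a new vector and has no side effects on its arguments.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: per-lane "mask ? a : b".
// mask lanes are expected to be all ones or all zeros, as produced by a compare.
inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    // (mask & a) | (~mask & b)
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Hypothetical helper: subtract 1 from every lane that is >= 1, leave the rest alone.
inline __m128 wrap_once(__m128 phase) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 mask = _mm_cmpge_ps(phase, one);  // all ones where phase >= 1
    return select_ps(mask, _mm_sub_ps(phase, one), phase);
}
```

Usage would be phase = wrap_once(phase); which mirrors the assignment in the line above.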

For the 2nd one, watching also as I’m curious to know!

Squinky · July 9, 2020, 2:18pm

Yes and no. with sse2 instruction set it’s usually done with a combination of shuffle and add instructions. But it’s still pretty slow. Probably better than your non-vector example but not good.

There are various newer intel instructions/intrinsics that do this faster, like _mm_hadd_ps, but these AVX and AVX-512 instructions can not easily be used in VCV, as VCV stuff supports really old CPUs that don’t have any AVX instructions.

If you want to use these newer instructions in Rack you need to detect the CPU type yourself, and provide a fallback implementation so you don’t crash on CPUs that don’t have the instructions. That is not super easy.

btw, this operation is usually called “horizontal add”, so if you google “sse2 horizontal add” you will get hits to stackoverflow.com. Also searching for “vector dot product” will get some hits, as horizontal add is always an issue with dot product of small vectors.

Like the stack overflow will tell you, even on “old” cpus if you have large vectors you can keep adding them together as vector_4, until you are left with a single vector_4 of partial sums. Then a single horizontal add will make it a float. No help if your vectors only have 4 elements, big help if they are big.
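The partial-sums idea described above can be sketched roughly as follows. The function name sum_array is hypothetical, and the final reduction is done with a plain store and scalar adds to keep the example short. Only baseline SSE is assumed.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: sum n floats using one vector of four partial sums.
float sum_array(const float* data, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    // Vertical adds: lane k accumulates data[k], data[k + 4], data[k + 8], ...
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    // One horizontal reduction at the very end.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    // Leftover elements when n is not a multiple of 4.
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Note that the additions happen in a different order than in a scalar loop, so results can differ in the last bits for general floating-point input.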

marc_boule · July 9, 2020, 2:19pm

One way I think might work (not had a chance to test it) is:

sineOutput4.v = _mm_hadd_ps( sineOutput4.v , sineOutput4.v );
sineOutput4.v = _mm_hadd_ps( sineOutput4.v , sineOutput4.v );
output += sineOutput4[0];

But this modifies sineOutput4, so you would need another variable if you want to preserve it.

You might also need this include:

#include <pmmintrin.h>

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=hadd_ps&expand=2946,2946

marc_boule · July 9, 2020, 2:33pm

I just tried using _mm_hadd_ps and it seems to work in Rack, and it’s SSE3 not AVX, so I think it’s ok to use it.

Squinky · July 9, 2020, 2:35pm

Is it available with the standard compiler flag for -march=? If so, you can use it. I don’t think SSE3 is, but I could easily be wrong. Of course we know who will have the definitive answer.

marc_boule · July 9, 2020, 2:37pm

I don’t know about the build flags, but my ultimate test is to try it in a source file and if it compiles and links, then it must be valid in Rack since I never alter the flags in my makefiles.

I updated my code example above since it was missing the “.v” parts, so it should compile now.

Squinky · July 9, 2020, 2:42pm

Nice!

marc_boule · July 9, 2020, 2:54pm

Although the code looks big, it’s all acting on the same float_4, so it should be fast, but then again, that original for loop, when unrolled and optimized by the compiler, could very well be just as fast.

stoermelder · July 9, 2020, 3:41pm

I had the same question some time ago and your code works without any special requirements. The only thing that should be noted is that the vector sineOutput4 contains the sum in every cell afterwards.

carbon14 · July 9, 2020, 3:43pm

Those hadd instructions are usually regarded as being very slow. It’s probably not worth trying if you are restricted to SSE2/3.

Vortico · July 9, 2020, 7:43pm

@marc_boule’s method is fine, but another way to write (1) is

 mPhase4 -= rack::simd::ifelse(mPhase4 >= 1.f, 1.f, 0.f);

For (2), your solution is probably fine because it’s just ~3 add instructions. Another method could be to shuffle and add, and then shuffle and add again. But that would be 3 or 4 instructions, so it probably won’t be faster. I haven’t benchmarked anything related to your question.
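For reference, the shuffle-and-add variant mentioned above could look roughly like this with plain SSE intrinsics. The name hsum_shuffle is made up, nothing here has been benchmarked, and the input vector is left unmodified. With a Rack float_4 the argument would presumably be the .v member, as in the earlier example.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: horizontal sum of four lanes with two shuffles and two adds.
inline float hsum_shuffle(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);    // (v2, v3, v2, v3)
    __m128 pairs = _mm_add_ps(v, hi);   // lane 0 = v0 + v2, lane 1 = v1 + v3
    __m128 lane1 = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 1, 1, 1));
    // Add lane 1 into lane 0 and extract lane 0 as a float.
    return _mm_cvtss_f32(_mm_add_ss(pairs, lane1));
}
```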

synthi · July 9, 2020, 8:37pm

what’s the solution if the phase advancement is much bigger than 1.0 ? (where you need a while(phi >= 1.0) phi -= 1.0;)

Vortico · July 9, 2020, 8:47pm

Oh yeah, I forgot this is what I do in all my oscillators. Shorter code too.

phase -= simd::floor(phase);
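As a rough illustration of what this does per lane, here is a sketch using only SSE2 intrinsics. floor_sse2 and wrap_phase are hypothetical names, and Rack's own simd::floor may be implemented differently. This version assumes the values fit in a 32-bit integer.

```cpp
#include <emmintrin.h>

// Hypothetical helper: per-lane floor using SSE2 only (valid for |x| < 2^31).
inline __m128 floor_sse2(__m128 x) {
    // Truncate toward zero by converting to int and back.
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
    // Truncation rounds negative non-integers up, so subtract 1 in those lanes.
    __m128 tooBig = _mm_cmpgt_ps(t, x);
    return _mm_sub_ps(t, _mm_and_ps(tooBig, _mm_set1_ps(1.0f)));
}

// Wrap every lane into [0, 1), however large the phase step was.
inline __m128 wrap_phase(__m128 phase) {
    return _mm_sub_ps(phase, floor_sse2(phase));
}
```

Unlike the single conditional subtraction, this also handles negative phases, which matters for through-zero FM.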

Squinky · July 10, 2020, 12:21am

haha - i remember that trick from your oscillators

q8fuel · July 10, 2020, 6:44am

Thanks all for the replies. Great community!

(1): perfect, I’ve fixed it, thanks.

(2): are SSE3 supposed to be used in Rack @Vortico? Or only SSE2? Not sure about compatibility…

Also…

what a great trick

carbon14 · July 10, 2020, 6:55am

All x64 native processors are mandated to support SSE2. Not all of them support SSE3. If your code tries to run an SSE3 instruction on a processor which doesn’t support it, it will complain, probably quite badly.

You can either write your own code to detect if SSE3 is supported, and then choose between two different implementations of your function, or some compilers are capable of including the detection code for you if you supply alternative implementations.
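A minimal sketch of the first option, assuming GCC or Clang on x86: __builtin_cpu_supports and the target attribute are compiler extensions and will not build with MSVC. The function names are made up for illustration.

```cpp
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps
#include <xmmintrin.h>

// SSE3 path. The target attribute lets this one function use SSE3
// even when the rest of the file is compiled without -msse3.
__attribute__((target("sse3")))
static float hsum_sse3(__m128 v) {
    v = _mm_hadd_ps(v, v);  // (v0+v1, v2+v3, v0+v1, v2+v3)
    v = _mm_hadd_ps(v, v);  // the full sum in every lane
    return _mm_cvtss_f32(v);
}

// Fallback path: store the lanes and add them one by one.
static float hsum_fallback(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    float total = 0.0f;
    for (int i = 0; i < 4; i++) {
        total += lanes[i];
    }
    return total;
}

// Hypothetical dispatcher: query the CPU once, then pick an implementation.
float hsum_dispatch(__m128 v) {
    static const bool hasSse3 = __builtin_cpu_supports("sse3");
    return hasSse3 ? hsum_sse3(v) : hsum_fallback(v);
}
```

The check costs a branch on every call, so for something as small as a horizontal add it would make more sense to dispatch once at a coarser level, for example per process block.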

Vortico · July 10, 2020, 4:07pm

Rack is built with -march=nocona, so you can use MMX, SSE, SSE2 and SSE3. Rack v3 will likely add AVX.

marc_boule · July 13, 2020, 12:31pm

Interesting critique by Linus Torvalds regarding AVX-512 in particular:

https://www.phoronix.com/scan.php?page=news_item&px=Linus-Torvalds-On-AVX-512

After looking at the sheer number of instructions in AVX-512, it’s not surprising… man that set is huge:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512

Squinky · July 13, 2020, 2:53pm

Well, I think when he talks about “special case garbage” he’s talking about audio, which we all happen to care about. Now it’s true that 512 is a little big to be that useful in VCV, but it would be quite useful in VST.