Yes and no. With the SSE2 instruction set it’s usually done with a combination of shuffle and add instructions. But it’s still pretty slow. Probably better than your non-vector example, but not good.
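To make the shuffle-and-add idea concrete, here is a minimal sketch using plain SSE intrinsics. The function name horizontalAddSse2 is made up for illustration, and it works on a raw __m128 rather than Rack's float_4 wrapper. I haven't benchmarked it.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, available on every x64 CPU

// Hypothetical helper: sums the four lanes of v with two shuffles and two adds.
static inline float horizontalAddSse2(__m128 v) {
    // Swap neighbouring lanes: [a b c d] -> [b a d c]
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    // pairs = [a+b, a+b, c+d, c+d]
    __m128 pairs = _mm_add_ps(v, swapped);
    // Swap the two halves: [c+d, c+d, a+b, a+b]
    __m128 halves = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 0, 3, 2));
    // Every lane now holds a+b+c+d
    __m128 total = _mm_add_ps(pairs, halves);
    // Extract lane 0 as a scalar float
    return _mm_cvtss_f32(total);
}
```

Note that with this variant the vector ends up with the full sum in every lane, so you can also keep working with it as a vector instead of extracting the float.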
There are various newer Intel instructions/intrinsics that do this faster, like _mm_hadd_ps (which is SSE3, not AVX), but these newer instructions, and AVX and AVX-512 in particular, cannot easily be used in VCV, as VCV supports really old CPUs that don’t have any AVX instructions.
If you want to use these newer instructions in Rack you need to detect the CPU type yourself, and provide a fallback implementation so you don’t crash on CPUs that don’t have the instructions. That is not super easy.
btw, this operation is usually called “horizontal add”, so if you google “sse2 horizontal add” you will get hits to stackoverflow.com. Also searching for “vector dot product” will get some hits, as horizontal add is always an issue with dot product of small vectors.
As Stack Overflow will tell you, even on “old” CPUs, if you have large vectors you can keep adding them together as vector_4, until you are left with a single vector_4 of partial sums. Then a single horizontal add will turn it into a float. No help if your vectors only have 4 elements, big help if they are big.
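A rough sketch of that partial-sums trick, again with raw SSE intrinsics. The name sumLarge is hypothetical, and the code assumes the element count is a multiple of 4 (a real version would need to handle the leftover elements).

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: sums n floats. Assumes n is a multiple of 4.
static inline float sumLarge(const float* data, std::size_t n) {
    // Four running partial sums, one per lane
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        // Unaligned load, so data does not need 16-byte alignment
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    // One horizontal add at the very end: acc = [a b c d]
    __m128 high = _mm_movehl_ps(acc, acc);   // [c d c d]
    __m128 sum2 = _mm_add_ps(acc, high);     // [a+c, b+d, ., .]
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    // Lane 0 becomes (a+c) + (b+d)
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}
```

The loop only does vertical adds, so the cost of the horizontal step is paid once no matter how long the input is. Keep in mind the summation order differs from a plain scalar loop, so results can differ in the last bits for general float data.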
Is it available with the standard compiler flag for -march= ? If so you can use it. I don’t think SSE3 is, but could easily be wrong. Of course we know who will have the definitive answer.
I don’t know about the build flags, but my ultimate test is to try it in a source file and if it compiles and links, then it must be valid in Rack since I never alter the flags in my makefiles.
I updated my code example above since it was missing the “.v” parts, so it should compile now.
Although the code looks big, it’s all acting on the same float_4, so it should be fast, but then again, that original for loop, when unrolled and optimized by the compiler, could very well be just as fast.
I had the same question some time ago and your code works without any special requirements. The only thing that should be noted is that the vector sineOutput4 contains the sum in every cell afterwards.
For (2), your solution is probably fine because it’s just ~3 add instructions. Another method could be to shuffle and add, and then shuffle and add again. But that would be 3 or 4 instructions, so it probably won’t be faster. I haven’t benchmarked anything related to your question.
All x64 native processors are mandated to support SSE2.
Not all of them support SSE3. If your code tries to run an SSE3 instruction on a processor which doesn’t support it, the CPU will raise an illegal-instruction fault, and the program will most likely crash.
You can either write your own code to detect if SSE3 is supported, and then choose between two different implementations of your function, or some compilers are capable of including the detection code for you if you supply alternative implementations.
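For the do-it-yourself route, a minimal sketch could look like the following. It assumes GCC or Clang, since __builtin_cpu_supports and the target attribute are compiler extensions and not standard C++. The function names hsumFallback, hsumSse3 and hsum are made up for illustration.

```cpp
#include <xmmintrin.h>  // SSE
#include <pmmintrin.h>  // SSE3, declares _mm_hadd_ps

// Fallback that only needs baseline SSE (shuffle and add, twice)
static float hsumFallback(__m128 v) {
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 pairs = _mm_add_ps(v, swapped);
    __m128 halves = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_cvtss_f32(_mm_add_ps(pairs, halves));
}

// SSE3 version. The attribute (GCC/Clang extension) enables SSE3 for this
// one function without changing the build flags for the rest of the file.
__attribute__((target("sse3")))
static float hsumSse3(__m128 v) {
    __m128 t = _mm_hadd_ps(v, v);  // [a+b, c+d, a+b, c+d]
    t = _mm_hadd_ps(t, t);         // a+b+c+d in every lane
    return _mm_cvtss_f32(t);
}

// Dispatcher: checks the CPU once, then picks an implementation.
static float hsum(__m128 v) {
    // __builtin_cpu_supports is a GCC/Clang builtin for x86 targets
    static const bool hasSse3 = __builtin_cpu_supports("sse3") != 0;
    return hasSse3 ? hsumSse3(v) : hsumFallback(v);
}
```

The branch in hsum is not free, so for something as small as a horizontal add it may well cancel out whatever the SSE3 instruction gains. In practice you would probably dispatch at a coarser level, for example once per process() call or per block, not per vector.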
Well, I think when he talks about “special case garbage” he’s talking about audio, which we all happen to care about. Now it’s true that 512 is a little big to be that useful in VCV, but it would be quite useful in VST.