SIMD on monophonic module

Raph · October 16, 2021, 8:31am

Hello,

I’m working on mixer modules and would like to decrease the CPU load. Since SIMD is good at applying the same instructions to multiple data elements, I’m trying to take advantage of it by applying the same instructions to multiple tracks (from individual inputs, not polyphonic inputs). I have seen many examples that use SIMD to process the channels of polyphonic modules, but none that do what I’m attempting.

After using SIMD in one of my mixers, I noticed that the CPU load is a bit higher than with the non-SIMD version.

Does it make sense to use SIMD this way?

Is there something obvious that makes this code inefficient? (It works as expected, but the CPU load is worse than with the non-SIMD version.)

My approach was:

First, load the inputs and params into float_4 vectors. Is it a problem to have branches here, since at this point I’m not using the float_4 vectors to execute the same instructions on multiple data elements? (This is a simplified version of the code, to focus on the SIMD-related parts.)


    simd::float_4 s_v[2] = {0.0f};
    simd::float_4 s_gain[2] = {0.0f};
    simd::float_4 s_CV[2] = {1.0f};
    simd::float_4 s_pan[2] = {1.0f};
    simd::float_4 s_BusL[2] = {0.0f};
    simd::float_4 s_BusR[2] = {0.0f};
    int vectorCount = 1;

    for (int i = 0; i < 4; i++)
    {
        s_pan[0][i] = params[PAN_PARAM + i].value;
        if (inputs[AUDIO_INPUT + i].isConnected())
        {
            s_v[0][i] = inputs[AUDIO_INPUT + i].getVoltage();
            if (inputs[CV_INPUT + i].isConnected())
            {
                s_CV[0][i] = inputs[CV_INPUT + i].getVoltage();
            }
            s_gain[0][i] = params[TRACK_LEVEL_PARAM + i].value;
        }
        if (inputs[AUDIO_INPUT + i + 4].isConnected())
        {
            vectorCount = 2;
            // and the same with i + 4 for the second float_4 vector, e.g.:
            // s_pan[1][i] = params[PAN_PARAM + i + 4].value;
        }
    }

Then use these to apply the gains, depending on the gain param and the CV (no branches).

   for (int i = 0; i < vectorCount; i++)
   {
       s_CV[i] /= 10.0;
       s_v[i] *= simd::clamp(s_CV[i] , 0.f, 1.f) * simd::pow(s_gain[i], 2.0);
   }

Then use s_v to drive the VU meters, and finally sum the left and right tracks, each multiplied by a gain calculated from the pan values and by the master gain (branches avoided by using simd::ifelse).

    // process vu
    for (int i = 0; i < 4; i++)
    {
        if (processingFrame)
        {
            for (int vc = 0; vc < vectorCount; vc++)
            {
                s_v[vc].store(trackSignal_for_Vumeter1);
                vuTrack[i + vc * 4].process(args.sampleTime * trackVuDivider.getDivision(), trackSignal_for_Vumeter1[i] / 10.f);
            }
        }
    }

    // stereo bus routing and apply master gain

    float master_gain = params[MASTER_LEVEL_PARAM].value;
    for (int i = 0; i < vectorCount; i++)
    {
        s_BusL[i] = s_v[i] * simd::ifelse(s_pan[i] >= 1.0f, 1.0 - ((s_pan[i]) - 1.0), 1.0) * master_gain;
        s_BusR[i] = s_v[i] * simd::ifelse(s_pan[i] >= 1.0f, 1.0, s_pan[i]) * master_gain;
    }

    // summing

    float outL = 0.0;
    float outR = 0.0;

    for (int i = 0; i < vectorCount; i++)
    {
        s_BusL[i].v = _mm_hadd_ps( s_BusL[i].v , s_BusL[i].v );
        s_BusL[i].v = _mm_hadd_ps( s_BusL[i].v , s_BusL[i].v );
        outL += s_BusL[i][0];
        s_BusR[i].v = _mm_hadd_ps( s_BusR[i].v , s_BusR[i].v );
        s_BusR[i].v = _mm_hadd_ps( s_BusR[i].v , s_BusR[i].v );
        outR += s_BusR[i][0];
    }

I’m not familiar with SIMD, and I know it can be pretty difficult to use efficiently. Advice is welcome. Thank you.
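The summing step above reduces each bus vector separately with two _mm_hadd_ps calls. As a minimal sketch of an alternative, assuming an x86 target with SSE, the bus vectors can first be added lane by lane and then reduced once using shuffles. The helper names horizontalSum and sumBus are hypothetical and not part of Rack.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE1 intrinsics

// Hypothetical helper: returns v0 + v1 + v2 + v3 of one 4-lane vector.
inline float horizontalSum(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);   // (v2, v3, v2, v3)
    __m128 sum2 = _mm_add_ps(v, hi);   // lane 0 = v0+v2, lane 1 = v1+v3
    // Copy lane 1 into lane 0 of a temporary and add the two partial sums
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

// Hypothetical helper: sums `count` bus vectors with lane-wise adds,
// then performs a single horizontal reduction at the end.
inline float sumBus(const __m128* bus, int count) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < count; i++)
        acc = _mm_add_ps(acc, bus[i]);
    return horizontalSum(acc);
}
```

With float_4, the raw register is the .v member used in the code above, so a call would look like horizontalSum(s_BusL[0].v). Whether this is measurably faster than the hadd version depends on the CPU and compiler, so it is worth profiling both.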

carbon14 · October 16, 2021, 9:18am

Yes, using branches is a problem. The usual trick is to calculate both branches and then choose one or the other result using mask instructions.

I’ve used SIMD for mono devices in a couple of ways:

Where my calculations involved 4 similar calculations to get a single final result, as in my SN-101.
Where I had 4 or more similar outputs, e.g. my PO-xxx devices.

There’s often some overhead in getting values in and out of the 128-bit registers, so I feel lucky if I get a 3-fold improvement.
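To illustrate the overhead of moving values in and out of registers, here is a minimal sketch, assuming an x86 target with SSE. It reads four scalar track values with a single load and writes the results with a single store, rather than writing each lane separately. The helper name applyGain4 is hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE1 intrinsics

// Hypothetical helper: multiplies 4 track samples by 4 gains.
// `in`, `gain`, and `out` each point at 4 floats; no alignment is required.
inline void applyGain4(const float* in, const float* gain, float* out) {
    __m128 v = _mm_loadu_ps(in);           // one load for all 4 samples
    __m128 g = _mm_loadu_ps(gain);         // one load for all 4 gains
    _mm_storeu_ps(out, _mm_mul_ps(v, g));  // one multiply, one store
}
```

The scalar values still have to be gathered into the arrays first, which is the part that limits the speed-up in practice.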

Raph · October 16, 2021, 9:36am

Thank you, OK, I will try the mask instructions.

So, if I understand correctly, a mixer can benefit from SIMD? Or am I wrong, and doing 4 similar calculations to get 4 results (which are finally mixed into a single result) makes it less worthwhile?

carbon14 · October 16, 2021, 9:39am

simd::ifelse uses the mask instructions, so that should be fine. But raw if statements are no good.

Yes, a mixer *might* benefit from SIMD.
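The "calculate both branches, then choose with a mask" idea can be sketched with raw SSE intrinsics, assuming an x86 target. The example reuses the left-channel pan formula from the first post. The helper names blend and panGainLeft are hypothetical; this only illustrates the technique and is not a copy of how simd::ifelse is implemented.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE1 intrinsics

// Hypothetical helper: per lane, returns a where the mask bits are set, else b.
inline __m128 blend(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Left-channel pan gain, following the formula in the thread:
// pan >= 1 ? 1 - (pan - 1) : 1
inline __m128 panGainLeft(__m128 pan) {
    const __m128 one = _mm_set1_ps(1.f);
    __m128 mask = _mm_cmpge_ps(pan, one);  // all-ones bits where pan >= 1
    // This branch result is computed for every lane, needed or not
    __m128 whenRight = _mm_sub_ps(one, _mm_sub_ps(pan, one));
    return blend(mask, whenRight, one);
}
```

Both candidate results are computed for all four lanes, so there is no jump for the processor to predict.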

Vortico · October 16, 2021, 9:46am

Side note: simd::float_4 s_CV[2] = {1.0f}; does not do what you think. What you’ve written is equivalent to = {1.f, 0.f};
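The side note is about standard C++ array initialization: a braced list with fewer initializers than elements only sets the leading elements, and the rest are value-initialized. A minimal sketch follows, using a hypothetical Vec4 stand-in whose float constructor broadcasts to every lane (an assumption made for illustration, not the actual float_4 definition).

```cpp
#include <cassert>

// Hypothetical stand-in for a 4-lane vector type. The float constructor
// broadcasts the value to all lanes; the default constructor yields zeros.
struct Vec4 {
    float lane[4] = {0.f, 0.f, 0.f, 0.f};
    Vec4() = default;
    Vec4(float x) { for (float& l : lane) l = x; }
};

// Only element 0 receives the initializer; element 1 is value-initialized.
inline float secondElementAfterPartialInit() {
    Vec4 pan[2] = {1.0f};
    return pan[1].lane[0];
}

// One initializer per element sets both elements.
inline float secondElementAfterFullInit() {
    Vec4 pan[2] = {1.0f, 1.0f};
    return pan[1].lane[0];
}
```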

Raph · October 16, 2021, 9:54am

@carbon14 Oh yes, just like I did in the “stereo routing” loop. Sometimes the obvious solutions don’t come to mind.

Thank you very much

Raph · October 16, 2021, 10:01am

@Vortico OK, fixed,

Thank you

Raph · October 17, 2021, 9:13am

I have removed some conditions and replaced one with a mask: each element of the mask is set using simd::ifelse(bool cond, float a, float b), and the mask is then used to select between two float_4 values with simd::ifelse(float_4 mask, float_4 a, float_4 b).

I avoided the conditions on the CV by using getNormalVoltage(10.0, 0.0). This still has to be improved, but the CPU load seems much better.

    ...

    bool solo_active = false;
    for (int i = 0; i < 8; i++)
    {
        bool soloTrack_ = soloedTrack[i].process(params[SOLO_PARAM+i].value);
        soloTrack[i] = simd::ifelse(soloTrack_, !soloTrack[i], soloTrack[i]);
        solo_active = simd::ifelse(soloTrack[i], true, solo_active);
        lights[SOLO_LIGHT + i].value = soloTrack[i] ? 1.0 : 0.0;

        actTrack[i] = simd::ifelse(activeTrack[i].process(params[ACTIVE_PARAM + i].value), !actTrack[i], actTrack[i]);
        lights[ACTIVE_LIGHT + i].value = simd::ifelse(actTrack[i], 1.0 , 0.0);
    }
  
    simd::float_4 s_v[2] = {0.0f};
    simd::float_4 s_gain[2]= {0.0f};
    simd::float_4 s_CV[2]= {0.0f, 0.0f};
    const simd::float_4 s_mute_gain_initial[2]= {0.0f};
    const simd::float_4 s_mute_gain_active[2]= {1.0f, 1.0f};
    simd::float_4 s_pan[2]= {1.0f, 1.0f};
    simd::float_4 s_BusL[2] = {0.0f};
    simd::float_4 s_BusR[2] = {0.0f};
    simd::float_4 s_mute_gain_mask[2] = {0xffffffff , 0xffffffff };

    for (int i = 0; i < 4; i++)
    {
        s_mute_gain_mask[0][i] = simd::ifelse(actTrack[i] && (solo_active == false || soloTrack[i] == true), 0.0f, -1.0f );
        s_mute_gain_mask[1][i] = simd::ifelse(actTrack[i + 4] && (solo_active == false || soloTrack[i + 4] == true), 0.0f, -1.0f );

        float in = inputs[AUDIO_INPUT + i].getVoltage();
        float in_ = inputs[AUDIO_INPUT + i + 4].getVoltage();
        s_pan[0][i] = params[PAN_PARAM + i].value;
        s_v[0][i] = in;
        s_CV[0][i]= inputs[CV_INPUT + i].getNormalVoltage(10.0f, 0.0f);
        s_gain[0][i] = params[TRACK_LEVEL_PARAM  + i].value;
        s_pan[1][i] = params[PAN_PARAM + i + 4].value;
        s_v[1][i] = in_;
        s_CV[1][i]= inputs[CV_INPUT + i + 4].getNormalVoltage(10.0f, 0.0f);
        s_gain[1][i] = params[TRACK_LEVEL_PARAM  + i + 4].value;
    }
    
    // apply gain from cv and level parameter

    float master_gain = std::pow(params[MASTER_LEVEL_PARAM].value, 2.0f);

    for (int i = 0; i < vectorCount; i++)
    {
        s_CV[i] /= 10.0;
        s_v[i] *= simd::clamp(s_CV[i] , 0.f, 1.f) * simd::pow(s_gain[i], 2.0f);
    }

    // process vu

    for (int i = 0; i < 4; i++)
    {
        if (processingFrame)
        {
            s_v[0].store(trackSignal_for_Vumeter1);
            s_v[1].store(trackSignal_for_Vumeter2);
            vuTrack[i].process(args.sampleTime * trackVuDivider.getDivision(), trackSignal_for_Vumeter1[i] / 10.f);
            vuTrack[i + 4].process(args.sampleTime * trackVuDivider.getDivision(), trackSignal_for_Vumeter2[i] / 10.f);
        }
    }
    // mute, stereo bus routing and apply master gain
    for (int i = 0; i < vectorCount; i++)
    {
        s_v[i] *= simd::ifelse(s_mute_gain_mask[i] , s_mute_gain_initial[i], s_mute_gain_active[i]);
        s_BusL[i] = s_v[i] * simd::ifelse(s_pan[i] >= 1.0f, 1.0f - ((s_pan[i]) - 1.0f), 1.0f) * master_gain;
        s_BusR[i] = s_v[i] * simd::ifelse(s_pan[i] >= 1.0f, 1.0f, s_pan[i]) * master_gain;
    }
   
    // summing

    float outL = 0.0;
    float outR = 0.0;

    for (int i = 0; i < vectorCount; i++)
    {
        s_BusL[i].v = _mm_hadd_ps( s_BusL[i].v , s_BusL[i].v );
        s_BusL[i].v = _mm_hadd_ps( s_BusL[i].v , s_BusL[i].v );
        outL += s_BusL[i][0];
        s_BusR[i].v = _mm_hadd_ps( s_BusR[i].v , s_BusR[i].v );
        s_BusR[i].v = _mm_hadd_ps( s_BusR[i].v , s_BusR[i].v );
        outR += s_BusR[i][0];
    }

    ...

I’m not sure the way I set the mask (calling simd::ifelse for each element) is really efficient.
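One point to watch when building a mask by hand: a bitwise select only looks at the mask's bits, and the float value -1.0f has the bit pattern 0xBF800000, which is not an all-ones lane. A minimal sketch of one way to get a true bit mask, assuming an x86 target with SSE, is to turn the per-track flags into 1.f or 0.f and let a comparison produce the mask. The helper name muteLanes and the audible array are hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE1 intrinsics

// Hypothetical helper: silences the lanes whose track is not audible.
// `audible` holds 4 flags, e.g. active && (!soloActive || soloed).
inline __m128 muteLanes(__m128 signal, const bool* audible) {
    // bool converts to 1.f or 0.f without a branch
    __m128 flags = _mm_setr_ps(static_cast<float>(audible[0]),
                               static_cast<float>(audible[1]),
                               static_cast<float>(audible[2]),
                               static_cast<float>(audible[3]));
    // The comparison yields 0xFFFFFFFF in audible lanes and 0 elsewhere
    __m128 mask = _mm_cmpneq_ps(flags, _mm_setzero_ps());
    return _mm_and_ps(signal, mask); // muted lanes become 0.f
}
```

Since the mute and solo states only change when a button is pressed, the mask could also be rebuilt only on those changes and kept between frames.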

Using simd::ifelse for the condition around the VU meters (the part with the processingFrame boolean, which reflects the state of a dsp::ClockDivider) seems difficult. Is it acceptable to use an if statement in this part of the code?
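A minimal sketch of one way to keep the if while limiting its cost: test the flag once per frame, outside the per-track loop. A branch that follows a regular pattern, such as a divider firing once every N frames, is generally easy for the processor to predict. The Divider struct below is a hypothetical stand-in written for this example, not the real dsp::ClockDivider, and processFrame is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a clock divider: process() returns true
// once every `division` calls.
struct Divider {
    uint32_t clock = 0;
    uint32_t division = 1;
    bool process() {
        if (++clock >= division) { clock = 0; return true; }
        return false;
    }
};

// Hypothetical per-frame routine: updates 8 VU values only on divider ticks
// and returns how many updates ran. The flag is tested once, not per track.
inline int processFrame(Divider& divider, const float* trackLevels, float* vuValues) {
    int updates = 0;
    if (divider.process()) {
        for (int i = 0; i < 8; i++) {
            vuValues[i] = trackLevels[i] / 10.f;
            updates++;
        }
    }
    return updates;
}
```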

The CPU load seems good: this 8-track mixer with pans, CV inputs, VU meters, mute and solo on each track uses just a little more CPU than the Fundamental 4-track mixer. In my last test the CPU load was about 0.7%, compared to about 0.6% for the Fundamental mixer.

carbon14 · October 17, 2021, 12:07pm

The problem with if is that when the processor predicts the branch correctly, it’s all fine, but when it guesses wrong it can waste many clock cycles catching up.

I’m writing this on my phone, and I can’t read your code well enough on my screen to offer any good opinion.